
Laurie A. Schintler
Connie L. McNeely
Editors

Encyclopedia of Big Data

With 54 Figures and 29 Tables


Editors

Laurie A. Schintler
George Mason University
Fairfax, VA, USA

Connie L. McNeely
George Mason University
Fairfax, VA, USA

ISBN 978-3-319-32009-0
ISBN 978-3-319-32010-6 (eBook)


ISBN 978-3-319-32011-3 (print and electronic bundle)
https://doi.org/10.1007/978-3-319-32010-6

© Springer Nature Switzerland AG 2022


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software, or
by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with
regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

This encyclopedia was born in recognition of the fact that the big data
revolution is upon us. Referring generally to data characterized by the
“7 Vs” – volume, variety, velocity, variability, veracity, vulnerability, and
value – big data is increasingly ubiquitous, impacting nearly every aspect of
society in every corner of the globe. It has become an essential player and
resource in today’s expanding digitalized and information-driven world and is
embedded in a complex and dynamic ecosystem comprised of various indus-
tries, groups, algorithms, disciplines, platforms, applications, and enabling
technologies.
On the one hand, big data is a critical driver of productivity, innovation, and
well-being. In this context, various sources of big data – for example, satellites,
digital sensors, observatories, crowdsourcing mechanisms, mobile devices,
the World Wide Web, and the Internet of Things – coupled with advancing
analytical and computational capacities and capabilities, continue to contribute
to data-driven solutions and positive transformations across all sectors of the
economy. On the other hand, the uses and applications of big data come with
an array of challenges and problems, some of which are technical in nature and
others that involve ethical, social, and legal dimensions that affect and are
affected by societal constraints and considerations.
Understanding the opportunities and challenges brought on by the explo-
sion of information that marks society today requires consideration of an
encompassing array of questions and issues that arise with and because of
big data. For example, the massive size and high dimensionality of big datasets
present computational challenges and problems of validation linked not only
to selection biases and measurement errors but also to spurious correlations
and storage and scalability blockages.
Moreover, the bigger the data, the bigger the potential for its use and its
misuse, whether relative to innovations and progress or referencing data
asymmetries, ethical violations, discrimination, and biases. Accordingly, a
wide range of topics, along with policies and strategies, regarding the nature
and engagement of big data across levels of analysis, are needed to ensure that
the possible benefits of big data are maximized while the downsides are
minimized.
Against this backdrop, the Springer Nature Encyclopedia of Big Data offers
a complex and diverse picture of big data viewed through a multidimensional
technological and societal lens, considering related aspects and trends within
and across different domains, disciplines, and sectors. Moreover, the field of
big data itself is highly fluid, with new analytical and processing modalities,
concepts, and applications unfolding and evolving on an ongoing basis.
Reflecting the breadth, depth, and dynamics of the field – and of the big data
ecosystem itself – the Encyclopedia of Big Data is designed to provide a
comprehensive, foundational, and cutting-edge perspective on the topic. It is
intended to be a resource for various audiences – from the big data novice to
the data scientist, from the researcher to the practitioner, and from the analyst
to the generally interested lay public. The Encyclopedia has an international
focus, covering the many aspects, uses, and applications of big data that
transcend national boundaries. Accordingly, the Encyclopedia of Big Data
draws upon the expertise and experience of leading scholars and practitioners
from all over the world. Our aim is that it will serve as a valuable resource for
understanding and keeping abreast of the constantly evolving, complex, and
critical field of big data.

Fairfax, USA
January 2022

Laurie A. Schintler
Connie L. McNeely
Editors
List of Topics

Agile Data Big Data Workforce


AgInformatics Big Geo-data
Agriculture Big Humanities Project
Algorithm Big Variety Data
Algorithmic Complexity Bioinformatics
American Bar Association Biomedical Data
American Civil Liberties Union Biometrics
American Library Association Biosurveillance
Animals Blockchain
Anomaly Detection Blogs
Anonymity Border Control/Immigration
Anonymization Techniques Brand Monitoring
Anthropology Business
Antiquities Trade, Illicit Business Intelligence Analytics
Apple Business-to-Community (B2C)
Archaeology Cancer
Artificial Intelligence Cell Phone Data
Arts Census Bureau (U.S.)
Asian Americans Advancing Justice Centers for Disease Control and Prevention
Association Versus Causation (CDC)
Astronomy Charter of Fundamental Rights (EU)
Authoritarianism Chemistry
Authorship Analysis and Attribution Clickstream Analytics
Automated Modeling/Decision Making Climate Change, Hurricanes/Typhoons/Cyclones
Aviation Climate Change, Rising Temperatures
Behavioral Analytics Cloud Computing
Bibliometrics/Scientometrics Cloud Services
Big Data and Theory Collaborative Filtering
Big Data Concept Common Sense Media
Big Data Literacy Communication Quantity
Big Data Quality Communications
Big Data Research and Development Initiative Complex Event Processing (CEP)
(Federal, U.S.) Complex Networks
Big Data Theory Computational Social Sciences


Computer Science Data Visualization


Content Management System (CMS) Database Management Systems (DBMS)
Content Moderation Datafication
Contexts Data-Information-Knowledge-Action Model
Core Curriculum Issues (Big Data Research/ Data-Information-Knowledge-Wisdom (DIKW)
Analysis) Pyramid, Framework, Continuum
Corporate Social Responsibility Decision Theory
Corpus Linguistics Deep Learning
Correlation Versus Causation De-identification/Re-identification
COVID-19 Pandemic Demographic Data
Crowdsourcing Digital Advertising Alliance
Cultural Analytics Digital Divide
Curriculum, Higher Education, and Social Digital Ecosystem
Sciences Digital Knowledge Network Divide (DKND)
Curriculum, Higher Education, Humanities Digital Literacy
Cyber Espionage Digital Storytelling, Big Data Storytelling
Cyberinfrastructure (U.S.) Disaster Planning
Cybersecurity Discovery Analytics, Discovery Informatics
Dashboard Diversity
Data Aggregation Driver Behavior Analytics
Data Architecture and Design Drones
Data Brokers and Data Services Drug Enforcement Administration (DEA)
Data Center Earth Science
Data Cleansing E-Commerce
Data Discovery Economics
Data Exhaust Education and Training
Data Fusion Electronic Health Records (EHR)
Data Governance Ensemble Methods
Data Integration Entertainment
Data Integrity Environment
Data Lake Epidemiology
Data Management and Artificial Intelligence (AI) Ethical and Legal Issues
Data Mining Ethics
Data Monetization European Commission
Data Munging and Wrangling European Commission: Directorate-General for
Data Processing Justice (Data Protection Division)
Data Profiling European Union
Data Provenance European Union Data Protection Supervisor
Data Quality Management Evidence-Based Medicine
Data Repository Facebook
Data Science Facial Recognition Technologies
Data Scientist Financial Data and Trend Prediction
Data Sharing Financial Services
Data Storage Forestry
Data Streaming Fourth Amendment
Data Synthesis Fourth Industrial Revolution
Data Virtualization Fourth Paradigm

France Middle East


Gender and Sexuality Mobile Analytics
Genealogy Multiprocessing
Geography Multi-threading
Google National Association for the Advancement of
Google Analytics Colored People
Google Books Ngrams National Oceanic and Atmospheric
Google Flu Administration
Governance National Organization for Women
Granular Computing National Security Agency (NSA)
Graph-Theoretic Computations/Graph Databases Natural Hazards
Health Care Delivery Natural Language Processing (NLP)
Health Informatics Netflix
High Dimensional Data Network Advertising Initiative
HIPAA Network Analytics
Human Resources Network Data
Humanities (Digital Humanities) Neural Networks
Industrial and Commercial Bank of China NoSQL (Not Structured Query Language)
Informatics Nutrition
Information Commissioner, United Kingdom Online Advertising
Information Overload Online Identity
Information Quantity Ontologies
Information Society Open Data
Integrated Data System Open-Source Software
Intelligent Transportation Systems (ITS) Participatory Health and Big Data
Interactive Data Visualization Patient Records
International Development Patient-Centered (Personalized) Health
International Labor Organization PatientsLikeMe
International Nongovernmental Organizations Persistent Identifiers (PIDs) for Cultural Heritage
(INGOs) Pharmaceutical Industry
Internet Association, The Policy Analytics
Internet of Things (IoT) Political Science
Internet: Language Pollution, Air
Italy Pollution, Land
Journalism Pollution, Water
Keystroke Capture Precision Population Health
Knowledge Management Predictive Analytics
LexisNexis Prevention
Link Prediction in Networks Privacy
Link/Graph Mining Probabilistic Matching
LinkedIn Profiling
Machine Learning Psychology
Maritime Transport Recommender Systems
Mathematics Regression
Media Regulation
Medicaid Religion
Metadata Risk Analysis

R-Programming Supercomputing, Exascale Computing, High


Salesforce Performance Computing
Satellite Imagery/Remote Sensing Supply Chain and Big Data
Scientometrics Surface Web vs Deep Web vs Dark Web
Semantic/Content Analysis/Natural Language Sustainability
Processing Systems Science
Semiotics Tableau Software
Semi-structured Data Technological Singularity
Sensor Technologies Telemedicine
Sentic Computing Time Series Analytics
Sentiment Analysis Transnational Crime
“Small” Data Transparency
Smart Cities Transportation Visualization
Social Media Treatment
Social Media and Security United Nations Educational, Scientific and
Social Network Analysis Cultural Organization (UNESCO)
Social Sciences Upturn
Socio-spatial Analytics Visualization
South Korea Voice User Interaction
Space Research Paradigm Vulnerability
Spain Web Scraping
Spatial Data White House Big Data Initiative
Spatial Econometrics White House BRAIN Initiative
Spatial Scientometrics WikiLeaks
Spatiotemporal Analytics Wikipedia
Standardization World Bank
State Longitudinal Data System Zappos
Storage Zillow
Structured Query Language (SQL)
About the Editors

Laurie A. Schintler George Mason University, Fairfax, VA, USA

Laurie A. Schintler, Ph.D., is an associate professor in the Schar School of
Policy and Government at George Mason University, where she also serves as
director for data and technology research initiatives in the Center for Regional
Analysis. Dr. Schintler received her Ph.D. degree in regional and urban
planning from the University of Illinois, Urbana-Champaign. Her primary
areas of expertise and research lie at the intersection of big data, emerging
technologies, complexity theory, regional development, information science,
critical infrastructure, innovation, and policy analytics. A recent focal point of
her research is on the determinants and impacts, and related challenges and
opportunities of big data use in a regional and “smart city” context. She is also
active in developing data-driven analytical methods for characterizing and
modeling socio-spatial interaction and dynamics. Additionally, Dr. Schintler
conducts research on the complex interplay between technological divides –
including the big data divide – and related social disparities. Her research also
addresses ethical and social impacts and other issues associated with the use of
big data, artificial intelligence, blockchain – and emerging modes of human-
machine interaction – in relation to policy and program development.
Dr. Schintler is very professionally active, with numerous peer-reviewed
publications, reports, conference proceedings, co-edited volumes, and grants
and contracts.

Connie L. McNeely George Mason University, Fairfax, VA, USA

Connie L. McNeely, Ph.D., is a sociologist and professor in the Schar School
of Policy and Government at George Mason University, where she is also the
director of the Center for Science, Technology, and Innovation Policy. Her
teaching and research address various aspects of science, technology, and
innovation, big data, emerging technologies, public policy, and governance.
Dr. McNeely has directed major projects on big data and digitalization pro-
cesses, scientific networks, and broadening participation and inclusion in
science and technology fields. Along with studies focused on applications of
information technologies and informatics, she has conducted research
concerning data democratization and data interoperability, leveraging large,
complex datasets to inform policy development and implementation. Her
recent work has engaged related issues involving artificial intelligence and
ethics, human-machine relations, digital divides, and big data and discovery
analytics. She has ongoing projects examining institutional and cultural
dynamics in matters of big data engagement and ethical and social impacts,
with particular attention to questions of societal inequities and inequalities.
Dr. McNeely has numerous publications and is active in several professional
associations, serves as a reviewer and evaluator in a variety of programs and
venues, and sits on several advisory boards and committees. Dr. McNeely
earned her B.A. (A.B.) in sociology from the University of Pennsylvania and
M.A. (A.M.) and Ph.D. in sociology from Stanford University.
Contributors

Natalia Abuín Vences Complutense University of Madrid, Madrid, Spain


Gagan Agrawal School of Computer and Cyber Sciences, Augusta Univer-
sity, Augusta, GA, USA
Nitin Agarwal University of Arkansas Little Rock, Little Rock, AR, USA
Rajeev Agrawal Information Technology Laboratory, US Army Engineer
Research and Development Center, Vicksburg, MS, USA
Btihaj Ajana King’s College London, London, UK
Omar Alghushairy Department of Computer Science, University of Idaho,
Moscow, ID, USA
Samer Al-khateeb Creighton University, Omaha, NE, USA
Gordon Alley-Young Department of Communications and Performing Arts,
Kingsborough Community College, City University of New York, New York,
NY, USA
Abdullah Alowairdhi Department of Computer Science, University of
Idaho, Moscow, ID, USA
Rayan Alshamrani Department of Computer Science, University of Idaho,
Moscow, ID, USA
Raed Alsini Department of Computer Science, University of Idaho, Moscow,
ID, USA
Ashrf Althbiti Department of Computer Science, University of Idaho,
Moscow, ID, USA
Ines Amaral University of Minho, Braga, Minho, Portugal
Instituto Superior Miguel Torga, Coimbra, Portugal
Autonomous University of Lisbon, Lisbon, Portugal
Scott W. Ambler Disciplined Agile Consortium, Toronto, ON, Canada
R. Bruce Anderson Earth & Environment, Boston University, Boston, MA,
USA
Florida Southern College, Lakeland, FL, USA

Janelle Applequist The Zimmerman School of Advertising and Mass


Communications, University of South Florida, Tampa, FL, USA
Giuseppe Arbia Universita’ Cattolica Del Sacro Cuore, Catholic University
of the Sacred Heart, Rome, Italy
Claudia Arcidiacono Dipartimento di Agricoltura, Alimentazione e
Ambiente, University of Catania, Catania, Italy
Lázaro M. Bacallao-Pino University of Zaragoza, Zaragoza, Spain
National Autonomous University of Mexico, Mexico City, Mexico
Jonathan Z. Bakdash Human Research and Engineering Directorate, U.S.
Army Research Laboratory, Aberdeen Proving Ground, MD, USA
Paula K. Baldwin Department of Communication Studies, Western Oregon
University, Monmouth, OR, USA
Warren Bareiss Department of Fine Arts and Communication Studies, Uni-
versity of South Carolina Upstate, Spartanburg, SC, USA
Feras A. Batarseh College of Science, George Mason University, Fairfax,
VA, USA
Anamaria Berea Department of Computational and Data Sciences, George
Mason University, Fairfax, VA, USA
Center for Complexity in Business, University of Maryland, College Park,
MD, USA
Magdalena Bielenia-Grajewska Division of Maritime Economy, Depart-
ment of Maritime Transport and Seaborne Trade, University of Gdansk,
Gdansk, Poland
Intercultural Communication and Neurolinguistics Laboratory, Department of
Translation Studies, University of Gdansk, Gdansk, Poland
Colin L. Bird Department of Chemistry, University of Southampton,
Southampton, UK
Tobias Blanke Department of Digital Humanities, King’s College London,
London, UK
Camilla B. Bosanquet Schar School of Policy and Government, George
Mason University, Arlington, VA, USA
Mustapha Bouakkaz University Amar Telidji Laghouat, Laghouat, Algeria
Jan Lauren Boyles Greenlee School of Journalism and Communication,
Iowa State University, Ames, IA, USA
David Brown Southern New Hampsire University, University of Central
Florida College of Medicine, Huntington Beach, CA, USA
University of Wyoming, Laramie, WY, USA
Stephen W. Brown Alliant International University, San Diego, CA, USA

Emilie Bruzelius Arnhold Institute for Global Health, Icahn School of


Medicine at Mount Sinai, New York, NY, USA
Department of Epidemiology, Joseph L. Mailman School of Public Health,
Columbia University, New York, NY, USA
Kenneth Button Schar School of Policy and Government, George Mason
University, Arlington, VA, USA
Erik Cambria School of Computer Science and Engineering, Nanyang
Technological University, Singapore, Singapore
Steven J. Campbell University of South Carolina Lancaster, Lancaster, SC,
USA
Pilar Carrera Universidad Carlos III de Madrid, Madrid, Spain
Daniel N. Cassenti U.S. Army Research Laboratory, Adelphi, MD, USA
Guido Cervone Geography, and Meteorology and Atmospheric Science, The
Pennsylvania State University, University Park, PA, USA
Wendy Chen George Mason University, Arlington, VA, USA
Yixin Chen Department of Communication Studies, Sam Houston State
University, Huntsville, TX, USA
Tao Cheng SpaceTimeLab, University College London, London, UK
Yon Jung Choi Center for Science, Technology, and Innovation Policy,
George Mason University, Fairfax, VA, USA
Davide Ciucci Università degli Studi di Milano-Bicocca, Milan, Italy
Deborah Elizabeth Cohen Smithsonian Center for Learning and Digital
Access, Washington, DC, USA
Germán G. Creamer School of Business, Stevens Institute of Technology,
Hoboken, NJ, USA
Francis Dalisay Communication & Fine Arts, College of Liberal Arts &
Social Sciences, University of Guam, Mangilao, GU, USA
Andrea De Montis Department of Agricultural Sciences, University of
Sassari, Sassari, Italy
Trevor Diehl Media Innovation Lab (MiLab), Department of Communica-
tion, University of Vienna, Wien, Austria
Dimitra Dimitrakopoulou School of Journalism and Mass Communication,
Aristotle University of Thessaloniki, Thessaloniki, Greece
Derek Doran Department of Computer Science and Engineering, Wright
State University, Dayton, OH, USA
Patrick Doupe Arnhold Institute for Global Health, Icahn School of
Medicine at Mount Sinai, New York, NY, USA

Stuart Dunn Department of Digital Humanities, King’s College London,


London, UK
Ryan S. Eanes Department of Business Management, Washington College,
Chestertown, MD, USA
Catherine Easton School of Law, Lancaster University, Bailrigg, UK
R. Elizabeth Griffin Dominion Astrophysical Observatory, British Columbia,
Canada
Robert Faggian Centre for Regional and Rural Futures, Deakin University,
Burwood, VIC, Australia
James H. Faghmous Arnhold Institute for Global Health, Icahn School of
Medicine at Mount Sinai, New York, NY, USA
Arash Jalal Zadeh Fard Department of Computer Science, University of
Georgia, Athens, GA, USA
Vertica (Hewlett Packard Enterprise), Cambridge, MA, USA
Jennifer Ferreira Centre for Business in Society, Coventry University,
Coventry, UK
Katherine Fink Department of Media, Communications, and Visual Arts,
Pace University, Pleasantville, NY, USA
David Freet Eastern Kentucky University, Southern Illinois University,
Edwardsville, IL, USA
Lisa M. Frehill Energetics Technology Center, Indian Head, MD, USA
Jeremy G. Frey Department of Chemistry, University of Southampton,
Southampton, UK
Martin H. Frické University of Arizona, Tucson, AZ, USA
Kassandra Galvez Florida Southern College, Lakeland, FL, USA
Katherine R. Gamble U.S. Army Research Laboratory, Adelphi, MD, USA
Song Gao Department of Geography, University of California, Santa
Barbara, CA, USA
Department of Geography, University of Wisconsin-Madison, Madison, WI,
USA
Alberto Luis García Departamento de Ciencias de la Comunicación
Aplicada, Facultad de Ciencias de la información, Universidad Complutense
de Madrid, Madrid, Spain
Sandra Geisler Fraunhofer Institute for Applied Information Technology
FIT, Sankt Augustin, Germany
Matthew Geras Florida Southern College, Lakeland, FL, USA
Homero Gil de Zúñiga Media Innovation Lab (MiLab), Department of
Communication, University of Vienna, Wien, Austria

Erik Goepner George Mason University, Arlington, VA, USA


Yessenia Gomez School of Public Health Institute for Applied Environmen-
tal Health, University of Maryland, College Park, MD, USA
Steven J. Gray The Bartlett Centre for Advanced Spatial Analysis,
University College London, London, UK
Jong-On Hahm Department of Chemistry, Georgetown University,
Washington, DC, USA
Rihan Hai RWTH Aachen University, Aachen, Germany
Muhiuddin Haider School of Public Health Institute for Applied Environ-
mental Health, University of Maryland, College Park, MD, USA
Layla Hashemi Terrorism, Transnational Crime, and Corruption Center,
George Mason University, Fairfax, VA, USA
James Haworth SpaceTimeLab, University College London, London, UK
Martin Hilbert Department of Communication, University of California,
Davis, Davis, CA, USA
Kai Hoberg Kühne Logistics University, Hamburg, Germany
Mél Hogan Department of Communication, Media and Film, University of
Calgary, Calgary, AB, Canada
Hemayet Hossain Centre for Regional and Rural Futures, Deakin University,
Burwood, VIC, Australia
Gang Hua Visual Computing Group, Microsoft Research, Beijing, China
Fang Huang Tetherless World Constellation, Rensselaer Polytechnic Insti-
tute, Troy, NY, USA
Brigitte Huber Media Innovation Lab (MiLab), Department of Communi-
cation, University of Vienna, Wien, Austria
Carolynne Hultquist Geoinformatics and Earth Observation Laboratory,
Department of Geography and Institute for CyberScience, The Pennsylvania
State University, University Park, PA, USA
Suzi Iacono OIA, National Science Foundation, Alexandria, VA, USA
Ashiq Imran Department of Computer Science & Engineering, University of
Texas at Arlington, Arlington, TX, USA
Ece Inan Girne American University Canterbury, Canterbury, UK
Elmira Jamei College of Engineering and Science, Victoria University,
Melbourne, VIC, Australia
J. Jacob Jenkins California State University Channel Islands, Camarillo,
CA, USA
Madeleine Johnson Centre for Regional and Rural Futures, Deakin Univer-
sity, Burwood, VIC, Australia

Patrick Juola Department of Mathematics and Computer Science,


McAnulty College and Graduate School of Liberal Arts, Duquesne University,
Pittsburgh, PA, USA
Anirudh Kadadi Department of Computer Systems Technology, North
Carolina A&T State University, Greensboro, NC, USA
Hina Kazmi George Mason University, Fairfax, VA, USA
Corey Koch Florida Southern College, Lakeland, FL, USA
Erik W. Kuiler George Mason University, Arlington, VA, USA
Joanna Kulesza Department of International Law and International
Relations, University of Lodz, Lodz, Poland
Matthew J. Kushin Department of Communication, Shepherd University,
Shepherdstown, WV, USA
Kim Lacey Saginaw Valley State University, University Center, MI, USA
Sabrina Lai Department of Civil and Environmental Engineering and
Architecture, University of Cagliari, Cagliari, Italy
Paul Anthony Laux Lerner College of Business and Economics and J.P.
Morgan Chase Fellow, Institute for Financial Services Analytics, University of
Delaware, Newark, DE, USA
Simone Z. Leao City Futures Research Centre, Faculty of Built Environ-
ment, University of New South Wales, Sydney, NSW, Australia
Jooyeon Lee Hankuk University of Foreign Studies, Seoul, Korea (Republic
of)
Joshua Lee Schar School of Policy and Government, George Mason
University, Fairfax, VA, USA
Yulia A. Levites Strekalova College of Journalism and Communications,
University of Florida, Gainesville, FL, USA
Loet Leydesdorff Amsterdam School of Communication Research
(ASCoR), University of Amsterdam, Amsterdam, The Netherlands
Meng-Hao Li George Mason University, Fairfax, VA, USA
Siona Listokin Schar School of Policy and Government, George Mason
University, Fairfax, VA, USA
Kim Lorber Social Work Convening Group, Ramapo College of New
Jersey, Mahwah, NJ, USA
Travis Loux Department of Epidemiology and Biostatistics, College
for Public Health and Social Justice, Saint Louis University, St. Louis, MO,
USA
Xiaogang Ma Department of Computer Science, University of Idaho,
Moscow, ID, USA

Wolfgang Maass Saarland University, Saarbrücken, Germany


Marcienne Martin Laboratoire ORACLE [Observatoire Réunionnais des
Arts, des Civilisations et des Littératures dans leur Environnement] Université
de la Réunion Saint-Denis France, Montpellier, France
Lourdes S. Martinez School of Communication, San Diego State Univer-
sity, San Diego, CA, USA
Julian McAuley Computer Science Department, UCSD, San Diego, USA
Ernest L. McDuffie The Global McDuffie Group, Longwood, FL, USA
Ryan McGrady North Carolina State University, Raleigh, NC, USA
Heather McIntosh Mass Media, Minnesota State University, Mankato, MN,
USA
Connie L. McNeely George Mason University, Fairfax, VA, USA
Esther Mead Department of Information Science, University of Arkansas
Little Rock, Little Rock, AR, USA
John A. Miller Department of Computer Science, University of Georgia,
Athens, GA, USA
Staša Milojević Luddy School of Informatics, Computing, and Engineering,
Indiana University, Bloomington, IN, USA
Murad A. Mithani School of Business, Stevens Institute of Technology,
Hoboken, NJ, USA
Giuseppe Modica Dipartimento di Agraria, Università degli Studi
Mediterranea di Reggio Calabria, Reggio Calabria, Italy
David Cristian Morar Schar School of Policy and Government, George
Mason University, Fairfax, VA, USA
Marco Morini Dipartimento di Comunicazione e Ricerca Sociale,
Universita’ degli Studi “La Sapienza”, Roma, Italy
Diana Nastasia Department of Applied Communication Studies, Southern
Illinois University Edwardsville, Edwardsville, IL, USA
Sorin Nastasia Department of Applied Communication Studies, Southern
Illinois University Edwardsville, Edwardsville, IL, USA
Alison N. Novak Department of Public Relations and Advertising, Rowan
University, Glassboro, NJ, USA
Paul Nulty Centre for Research in Arts Social Science and Humanities,
University of Cambridge, Cambridge, United Kingdom
Christopher Nyamful Department of Computer Systems Technology, North
Carolina A&T State University, Greensboro, NC, USA
Daniel E. O’Leary Marshall School of Business, University of Southern
California, Los Angeles, CA, USA

Barbara Cook Overton Communication Studies, Louisiana State Univer-


sity, Baton Rouge, LA, USA
Communication Studies, Southeastern Louisiana University, Hammond, LA,
USA
Jeffrey Parsons Memorial University of Newfoundland, St. John’s, Canada
Christopher Pettit City Futures Research Centre, Faculty of Built Environ-
ment, University of New South Wales, Sydney, NSW, Australia
William Pewen Department of Health, Nursing and Nutrition, University of
the District of Columbia, Washington, DC, USA
Jürgen Pfeffer Bavarian School of Public Policy, Technical University of
Munich, Munich, Germany
Matthew Pittman School of Journalism & Communication, University of
Oregon, Eugene, OR, USA
Colin Porlezza IPMZ - Institute of Mass Communication and Media
Research, University of Zurich, Zürich, Switzerland
Anirudh Prabhu Tetherless World Constellation, Rensselaer Polytechnic
Institute, Troy, NY, USA
Sandeep Purao Bentley University, Waltham, USA
Christoph Quix Fraunhofer Institute for Applied Information Technology
FIT, Sankt Augustin, Germany
Hochschule Niederrhein University of Applied Sciences, Krefeld, Germany
Lakshmish Ramaswamy Department of Computer Science, University of
Georgia, Athens, GA, USA
Ramón Reichert Department for Theatre, Film and Media Studies, Vienna
University, Vienna, Austria
Sarah T. Roberts Department of Information Studies, University of
California, Los Angeles, Los Angeles, CA, USA
Scott N. Romaniuk University of South Wales, Pontypridd, UK
Alirio Rosales University of British Columbia, Vancouver, Canada
Christopher Round George Mason University, Fairfax, VA, USA
Booz Allen Hamilton, Inc., McLean, VA, USA
Seref Sagiroglu Department of Computer Engineering, Gazi University,
Ankara, Turkey
Sergei A. Samoilenko George Mason University, Fairfax, VA, USA
Zerrin Savaşan Department of International Relations, Sub-Department of
International Law, Selçuk University, Konya, Turkey
Deepak Saxena Indian Institute of Public Health Gandhinagar, Gujarat, India

Laurie A. Schintler George Mason University, Fairfax, VA, USA


Jon Schmid Georgia Institute of Technology, Atlanta, GA, USA
Hans C. Schmidt Pennsylvania State University – Brandywine, Philadelphia,
PA, USA
Jason Schmitt Communication and Media, Clarkson University, Potsdam,
NY, USA
Stephen T. Schroth Department of Early Childhood Education, Towson
University, Baltimore, MD, USA
Raquel Vinader Segura Complutense University of Madrid, Madrid,
Spain
Marc-David L. Seidel Sauder School of Business, University of British
Columbia, Vancouver, BC, Canada
Kimberly F. Sellers Department of Mathematics and Statistics, Georgetown
University, Washington, DC, USA
Padmanabhan Seshaiyer George Mason University, Fairfax, VA, USA
Alexander Sessums Florida Southern College, Lakeland, FL, USA
Mehdi Seyedmahmoudian School of Software and Electrical Engineering,
Swinburne University of Technology, Melbourne, VIC, Australia
Salma Sharaf School of Public Health Institute for Applied Environmental
Health, University of Maryland, College Park, MD, USA
Alan R. Shark Public Technology Institute, Washington, DC, USA
Schar School of Policy and Government, George Mason University, Fairfax,
VA, USA
Kim Sheehan School of Journalism & Communication, University of
Oregon, Eugene, OR, USA
Louise Shelley Terrorism, Transnational Crime, and Corruption Center,
George Mason University, Fairfax, VA, USA
Marina Shilina Moscow State University (Russia), Moscow, Russia
Stephen D. Simon P. Mean Consulting, Leawood, KS, USA
Aram Sinnreich School of Communication, American University,
Washington, DC, USA
Jörgen Skågeby Department of Media Studies, Stockholm University,
Stockholm, Sweden
Christine Skubisz Department of Communication Studies, Emerson College,
Boston, MA, USA
Department of Behavioral Health and Nutrition, University of Delaware,
Newark, DE, USA

Mick Smith North Carolina A&T State University, Greensboro, NC, USA
Clare Southerton Centre for Social Research in Health and Social Policy
Research Centre, UNSW, Sydney, Sydney, NSW, Australia
Ralf Spiller Macromedia University, Munich, Germany
Victor Sposito Centre for Regional and Rural Futures, Deakin University,
Burwood, VIC, Australia
Alex Stojcevski School of Software and Electrical Engineering, Swinburne
University of Technology, Melbourne, VIC, Australia
Veda C. Storey J Mack Robinson College of Business, Georgia State
University, Atlanta, GA, USA
Yulia A. Strekalova College of Journalism and Communications, University
of Florida, Gainesville, FL, USA
Daniele C. Struppa Donald Bren Presidential Chair in Mathematics,
Chapman University, Orange, CA, USA
Jennifer J. Summary-Smith Florida SouthWestern State College, Fort
Myers, FL, USA
Culver-Stockton College, Canton, MO, USA
Melanie Swan New School University, New York, NY, USA
Yuzuru Tanaka Graduate School of Information Science and Technology,
Hokkaido University, Sapporo, Hokkaido, Japan
Niccolò Tempini Department of Sociology, Philosophy and Anthropology
and Egenis, Centre for the Study of the Life Sciences, University of Exeter,
Exeter, UK
Doug Tewksbury Communication Studies Department, Niagara University,
Niagara, NY, USA
Subash Thota Synectics for Management Decisions, Inc., Arlington, VA, USA
Ulrich Tiedau Centre for Digital Humanities, University College London,
London, UK
Kristin M. Tolle University of Washington, eScience Institute, Redmond,
WA, USA
Catalina L. Toma Communication Science, University of Wisconsin-Mad-
ison, Madison, WI, USA
Rochelle E. Tractenberg Collaborative for Research on Outcomes and –
Metrics, Washington, DC, USA
Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics;
and Rehabilitation Medicine, Georgetown University, Washington, DC, USA

Chiara Valentini Department of Management, Aarhus University, School of


Business and Social Sciences, Aarhus, Denmark
Damien Van Puyvelde University of Glasgow, Glasgow, UK
Matthew S. VanDyke Department of Communication, Appalachian State
University, Boone, NC, USA
Andreas Veglis School of Journalism and Mass Communication, Aristotle
University of Thessaloniki, Thessaloniki, Greece
Natalia Abuín Vences Complutense University of Madrid, Madrid, Spain
Raquel Vinader Segura Complutense University of Madrid, Madrid, Spain
Rey Juan Carlos University, Fuenlabrada, Madrid, Spain
Jing Wang School of Communication and Information, Rutgers University,
New Brunswick, NJ, USA
Anne L. Washington George Mason University, Fairfax, VA, USA
Nigel Waters Department of Geography and Civil Engineering, University
of Calgary, Calgary, AB, Canada
Brian E. Weeks Communication Studies Department, University of
Michigan, Ann Arbor, MI, USA
Adele Weiner Audrey Cohen School For Human Services and Education,
Metropolitan College of New York, New York, NY, USA
Tao Wen Earth and Environmental Systems Institute, Pennsylvania State
University, University Park, PA, USA
Carson C. Woo University of British Columbia, Vancouver, Canada
Rhonda Wrzenski Indiana University Southeast, New Albany, IN, USA
Masahiro Yamamoto Department of Communication, University at Albany –
SUNY, Albany, NY, USA
Fan Yang Department of Communication Studies, University of Alabama at
Birmingham, Birmingham, AL, USA
Qinghua Yang Department of Communication Studies, Texas Christian
University, Fort Worth, TX, USA
Sandul Yasobant Center for Development Research (ZEF), University of
Bonn, Bonn, Germany
Xinyue Ye Landscape Architecture & Urban Planning, Texas A&M Univer-
sity, College Station, TX, USA
Dzmitry Yuran School of Arts and Communication, Florida Institute of
Technology, Melbourne, FL, USA

Ting Zhang Department of Accounting, Finance and Economics, Merrick


School of Business, University of Baltimore, Baltimore, MD, USA

Weiwu Zhang College of Media and Communication, Texas Tech Univer-


sity, Lubbock, TX, USA

Bo Zhao College of Earth, Ocean, and Atmospheric Sciences, Oregon State


University, Corvallis, OR, USA
Fen Zhao Alpha Edison, Los Angeles, CA, USA
A

Advanced Analytics

▶ Business Intelligence Analytics


Agile Data

Scott W. Ambler
Disciplined Agile Consortium, Toronto, ON, Canada

To succeed at big data you must be able to process large volumes of data, data that is very often unstructured. More importantly, you must be able to swiftly react to emerging opportunities and insights before your competitor does. A Disciplined Agile approach to big data is evolutionary and collaborative in nature, leveraging proven strategies from the traditional, lean, and agile canons. Collaborative strategies increase both the velocity and quality of work performed while reducing overhead. Evolutionary strategies – those that deliver incremental value through iterative application of architecture and design modeling, database refactoring, automated regression testing, continuous integration (CI) of data assets, continuous deployment (CD) of data assets, and configuration management – build a solid data foundation that will stand the test of time. In effect this is the application of proven, leading-edge software engineering practices to big data.

This chapter is organized into the following sections:

1. Why Disciplined Agile Big Data?
2. Be Agile: An Agile Mindset for Data Professionals
3. Do Agile: The Agile Database Technique Stack
4. Last Words

Why Disciplined Agile Big Data?

The Big Data environment is complex. You are dealing with overwhelming amounts of data coming in from a large number of disparate data sources; the data is often of questionable quality and integrity, and the data is often coming from sources that are outside your scope of influence. You need to respond to quickly changing stakeholder needs without increasing the technical debt within your organization. It is clear that at the one extreme traditional approaches to data management are insufficiently responsive, yet at the other extreme, mainstream agile strategies (in particular Scrum) come up short for addressing your long-term data management needs. You need a middle ground that combines techniques for just enough modeling and planning at the most responsible moments for doing so with engineering techniques that produce high-quality assets that are easily evolved yet will still stand the test of time. That middle ground is Disciplined Agile Big Data.
Disciplined Agile (DA) (Ambler and Lines 2012) is a hybrid framework that combines strategies from a range of sources including Scrum, Agile Modeling, Agile Data, Unified Process, Kanban, traditional, and many other sources. DA promotes a pragmatic and flexible strategy for tailoring and evolving processes that reflect the situation that you face. A Disciplined Agile approach to Big Data leverages agile strategies, architecture and design modeling, and modern software engineering techniques. These practices, described below, are referred to as the agile database technique stack. The aim is to quickly meet the dynamic needs of the marketplace without short-changing the long-term viability of your organization.

Be Agile: An Agile Mindset for Data Professionals

In many ways agility is more of an attitude than a skillset. The common characteristics of agile professionals are:

• Willing to work closely with others, working in pairs or small teams as appropriate
• Pragmatic in that they are willing to do what needs to be done to the extent that it needs to be done
• Open minded, willing to experiment and learn new techniques
• Responsible and therefore willing to seek the help of the right person(s) for the task at hand
• Eager to work iteratively and incrementally, creating artifacts that are sufficient to the task at hand

Do Agile: The Agile Database Techniques Stack

Of course it isn’t sufficient to “be agile” if you don’t know how to “do agile.” The following figure overviews the critical technical techniques required for agile database evolution. These agile database techniques have been proven in practice and enjoy both commercial and open source tooling support (Fig. 1).

Agile Data, Fig. 1 The agile database technique stack

We say they form a stack because in order to be viable, each technique requires the one immediately below it. For it to make sense to continuously deploy database changes you need to be able to develop small and valuable vertical slices, which in turn require clean architecture and design, and so on. Let’s explore each one in greater detail.

Continuous Database Deployment
Continuous deployment (CD) refers to the practice that when an integration build is successful (it compiles, passes all tests, and passes any automated analysis checks), your CD tool will automatically deploy to the next appropriate environment(s) (Sadalage 2003). This includes both changes to your business logic code as well as to your database. As you see in the following diagram, if the build runs successfully on a developer’s workstation their changes are propagated automatically into the team integration environment (which automatically invokes the integration build in that space). When the build is successful the changes are promoted into an integration testing environment, and so on (Fig. 2).

Agile Data, Fig. 2 Continuous database deployment

The aim of continuous database deployment is to reduce the time, cost, and risk of releasing database changes. Continuous database deployment only works if you are able to organize the functionality you are delivering into small, yet still valuable, vertical slices.
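To make this concrete, here is a minimal illustrative sketch (not part of the original entry) of the database half of such a deployment step, written in Python with SQLite: when the integration build is green, any change scripts not yet recorded in the target environment are applied in order. The file layout, table name, and environment names are assumptions made for the example.

import sqlite3
from pathlib import Path

def deploy_database_changes(build_passed: bool, env_db: str, migrations_dir: str) -> None:
    """Apply pending SQL change scripts to one environment's database,
    but only when the integration build has succeeded."""
    if not build_passed:
        print("Build failed; nothing is promoted to", env_db)
        return
    conn = sqlite3.connect(env_db)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (script TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT script FROM schema_version")}
    # Change scripts are applied in name order, e.g. 001_create_customer.sql, 002_add_index.sql
    for script in sorted(Path(migrations_dir).glob("*.sql")):
        if script.name in applied:
            continue  # this script was already deployed to this environment
        conn.executescript(script.read_text())
        conn.execute("INSERT INTO schema_version (script) VALUES (?)", (script.name,))
        print("Applied", script.name, "to", env_db)
    conn.commit()
    conn.close()

# Example: promote the pending changes to the team integration environment after a green build.
# deploy_database_changes(build_passed=True, env_db="integration.db", migrations_dir="migrations")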
Vertical Slicing
A vertical slice is a top to bottom, fully implemented and tested piece of functionality that provides some form of business value to an end user. It should be possible to easily deploy a vertical slice into production upon request. A vertical slice can be very small, such as a single value on a report, the implementation of a business rule or calculation, or a new reporting view. For an agile team, all of this implementation work should be accomplished during a single iteration/sprint, typically a one- or two-week period. For teams following a lean delivery lifecycle, this timeframe typically shrinks to days and even hours in some cases.
For a Big Data solution, a vertical slice is fully implemented from the appropriate data sources all the way through to a data warehouse (DW), data mart (DM), or business intelligence (BI) solution. For the data elements required by the vertical slice, you need to fully implement the following:

• Extraction from the data source(s)
• Staging of the raw source data (if you stage data)
• Transformation/cleansing of the source data
• Loading the data into the DW
• Loading into your data marts (DMs)
• Updating the appropriate BI views/reports where needed

A key concept is that you only do the work for the vertical slice that you’re currently working on. This is what enables you to get the work done in a matter of days (and even hours once you get good at it) instead of weeks or months. It should be clear that vertical slicing is only viable when you are able to take an agile approach to modeling.
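As an illustration of how small such a slice can be, the following self-contained Python/SQLite sketch (an example with assumed names, not code from the entry) implements a single new report value – total revenue per customer – end to end: extract from a hypothetical source table, transform and cleanse, load a warehouse table, and refresh the BI view.

import sqlite3

def deliver_revenue_slice(conn: sqlite3.Connection) -> None:
    """One vertical slice: extract order data, transform it into a per-customer
    revenue figure, load it into the warehouse, and expose it to BI as a view."""
    cur = conn.cursor()
    # Extract: pull only the columns this slice needs from the (hypothetical) source table.
    rows = cur.execute("SELECT customer_id, quantity, unit_price FROM src_orders").fetchall()
    # Transform/cleanse: drop incomplete rows, compute revenue per customer.
    revenue = {}
    for customer_id, quantity, unit_price in rows:
        if customer_id is None or quantity is None or unit_price is None:
            continue
        revenue[customer_id] = revenue.get(customer_id, 0.0) + quantity * unit_price
    # Load into the DW: a single fact column is enough for this slice.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS dw_customer_revenue "
        "(customer_id INTEGER PRIMARY KEY, total_revenue REAL)"
    )
    cur.executemany(
        "INSERT OR REPLACE INTO dw_customer_revenue (customer_id, total_revenue) VALUES (?, ?)",
        revenue.items(),
    )
    # Update the BI view/report that end users actually see.
    cur.execute("DROP VIEW IF EXISTS bi_top_customers")
    cur.execute(
        "CREATE VIEW bi_top_customers AS "
        "SELECT customer_id, total_revenue FROM dw_customer_revenue ORDER BY total_revenue DESC"
    )
    conn.commit()

# Example usage with an in-memory database and a tiny source table:
# conn = sqlite3.connect(":memory:")
# conn.execute("CREATE TABLE src_orders (customer_id INTEGER, quantity INTEGER, unit_price REAL)")
# conn.execute("INSERT INTO src_orders VALUES (1, 2, 9.99)")
# deliver_revenue_slice(conn)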
Agile Data Modeling
Many traditional data professionals believe that they need to perform detailed, up-front requirements, architecture, and design modeling before they can begin construction work. Not only has this been shown to be an ineffective strategy in general, when it comes to the dynamically evolving world of Big Data environments, it also proves to be disastrous. A Disciplined Agile approach strives to keep the benefits of modeling and planning, which are to think things through, yet avoid the disadvantages associated with detailed documentation and making important decisions long before you need to. DA does this by applying light-weight Agile Modeling (Ambler 2002) strategies such as:

1. Initial requirements envisioning. This includes both usage modeling, likely via user stories and epics, and conceptual modeling. These models are high-level at first; their details will be fleshed out later as construction progresses.
2. Initial architecture envisioning. Your architecture strategy is typically captured in a free-form architecture diagram, network diagram, or UML deployment diagram. Your model(s) should capture potential data sources; how data will flow from the data sources to the target data warehouse(s) or data marts; and how that work flows through combinations of data extraction, data transformation, and data loading capabilities.
3. Look-ahead modeling. Sometimes referred to as “backlog refinement” or “backlog grooming,” the goal of look-ahead modeling is to explore work that is a few weeks in the future. This is particularly needed in complex domains where there may be a few weeks of detailed data analysis required to work through the semantics of your source data. For teams taking a sprint/iteration-based approach, this may mean that during the current iteration someone(s) on the team explores requirements to be implemented one or two iterations in the future.
4. Model storming. This is a just-in-time (JIT) modeling strategy where you explore something in greater detail, perhaps working through the details of what a report should look like or how the logic of a business calculation should work.
5. Test-driven development (TDD). With TDD, your tests both validate your work and specify it. Specification can be done at the requirements level with acceptance tests and at the design level with developer tests. More on this later.

Clean Architecture and Design
High-quality IT assets are easier to understand, to work with, and to evolve. In many ways, clean architecture and design are fundamental enablers of agility in general. Here are a few important considerations for you:

1. Choose a data warehouse architecture paradigm. Although there is something to be said about both the Inmon and Kimball strategies, I generally prefer DataVault 2 (Lindstedt and Olschimke 2015). DataVault 2 (DV2) has its roots in the Inmon approach, bringing learnings in from Kimball and more importantly practical experiences dealing with DW/BI and Big Data in a range of situations.
2. Focus on loose coupling and high cohesion. When a system is loosely coupled, it should be easy to evolve its components without significant effects on other components. Components that are highly cohesive do one thing and one thing only; in data parlance they are “highly normalized.”
3. Adopt common conventions. Guidelines around data naming conventions, architectural guidelines, coding conventions, user experience (UX) conventions, and others promote greater consistency in the work produced.
4. Train and coach your people. Unfortunately few IT professionals these days get explicit training in architecture and design strategies, resulting in poor quality work that increases your organization’s overall technical debt.

Database Refactoring
A refactoring is a simple change to your design that improves its quality without changing its semantics in a practical manner. A database refactoring is a simple change to a database schema that improves the quality of its design OR improves the quality of the data that it contains (Ambler and Sadalage 2006). Database refactoring enables you to safely and easily evolve database schemas, including production database schemas, over time by breaking large changes into a collection of smaller, less-risky changes. Refactoring enables you to keep existing clean designs of high quality and to safely address problems in poor quality implementations.
Let’s work through an example. The following diagram depicts three stages in the life of the Split Column database refactoring. The first stage shows the original database schema where we see that the Customer table has a Name column where the full name of a person is stored. We have decided that we want to improve the quality of this table by splitting the column into three – in this case FirstName, MiddleName, and LastName.
The second stage, the transition period, shows how Customer contains both the original version of the schema (the Name column), the new/desired version of the schema, and scaffolding code to keep the two versions in sync. The transition period is required so as to give the people responsible for any systems that access customer name time to update their code to instead work with the new columns. This approach is based on the Java Development Kit (JDK) deprecation strategy. The scaffolding code, in this case a trigger that keeps the four columns consistent with one another, is required so that the database maintains integrity over the transition period. There may be hundreds of systems accessing this information – at first they will all be accessing the original schema but over time they will be updated to access the new version of the schema – and because these systems cannot all be reworked at once the database must be responsible for its own integrity. Once the transition period ends and the existing systems that access the Customer table have been updated accordingly, the original schema and the scaffolding code can be removed safely (Fig. 3).

Agile Data, Fig. 3 Example database refactoring
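Purely for illustration (the entry itself describes the refactoring with a diagram), the sketch below shows in Python with SQLite what the transition period for this Split Column refactoring might look like. The table layout, trigger name, and name-splitting rule are hypothetical, and a production refactoring would also keep writes made through the legacy Name column synchronized with the new columns.

import sqlite3

conn = sqlite3.connect(":memory:")
# Original schema: the full name lives in a single column.
conn.execute("CREATE TABLE Customer (id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("INSERT INTO Customer (id, Name) VALUES (1, 'Sally Ann Jones')")

# Transition period: add the new columns alongside the original one.
for column in ("FirstName", "MiddleName", "LastName"):
    conn.execute(f"ALTER TABLE Customer ADD COLUMN {column} TEXT")

# Backfill the new columns from the existing data (a naive split, for the example).
for cid, name in conn.execute("SELECT id, Name FROM Customer").fetchall():
    parts = name.split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1]) or None
    conn.execute(
        "UPDATE Customer SET FirstName = ?, MiddleName = ?, LastName = ? WHERE id = ?",
        (first, middle, last, cid),
    )

# Scaffolding: a trigger keeps the legacy Name column consistent whenever a
# system that has already been updated writes to the new columns.
conn.execute("""
CREATE TRIGGER Customer_SyncName
AFTER UPDATE OF FirstName, MiddleName, LastName ON Customer
BEGIN
    UPDATE Customer
       SET Name = NEW.FirstName || ' ' || COALESCE(NEW.MiddleName || ' ', '') || NEW.LastName
     WHERE id = NEW.id;
END
""")

# When the transition period ends, the trigger and the Name column are removed.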
Automated Database Testing
Quality is paramount for agility. Disciplined Agile teams will develop, in an evolutionary manner of course, an automated regression test suite that validates their work. They will run this test suite many times a day so as to detect any problems as early as possible. Automated regression testing like this enables teams to safely make changes, such as refactorings, because if they inject a problem they will be able to quickly find and then fix it.
When it comes to testing a database, the following diagram summarizes the kind of tests that you should consider implementing (Ambler 2013). Of course there is more to testing Big Data implementations than this; you will also want to develop automated tests/checks for the entire chain from data sources through your data processing architecture into your DW/BI solution (Fig. 4).

Agile Data, Fig. 4 What to test in a database

In fact, very disciplined teams will take a test-driven development (TDD) approach where they write tests before they do the work to implement the functionality that the tests validate (Guernsey 2013). As a result, the tests do double duty – they both validate and specify. You can do this at the requirements level by writing user acceptance tests, a strategy referred to as behavior-driven development (BDD) or acceptance test-driven development (ATDD), and at the design level via developer tests. By rethinking the order in which you work, in this case by testing first not last, you can streamline your approach while you increase its quality.
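A minimal sketch of what a slice of such an automated regression suite might look like, using Python's unittest with an in-memory SQLite database standing in for a test instance of the warehouse; the table, columns, and business rule checked here are illustrative assumptions rather than tests prescribed by the entry.

import sqlite3
import unittest

class CustomerSchemaTests(unittest.TestCase):
    def setUp(self):
        # A throwaway copy of the schema; a real suite would target a test instance.
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript("""
            CREATE TABLE Customer (
                id INTEGER PRIMARY KEY,
                FirstName TEXT NOT NULL,
                LastName  TEXT NOT NULL
            );
            INSERT INTO Customer (id, FirstName, LastName) VALUES (1, 'Sally', 'Jones');
        """)

    def tearDown(self):
        self.conn.close()

    def test_expected_columns_exist(self):
        # Structural test: the refactored columns must be present.
        columns = {row[1] for row in self.conn.execute("PRAGMA table_info(Customer)")}
        self.assertTrue({"FirstName", "LastName"} <= columns)

    def test_no_customer_without_a_last_name(self):
        # Data quality test: a business rule expressed as a regression check.
        missing = self.conn.execute(
            "SELECT COUNT(*) FROM Customer WHERE LastName IS NULL OR LastName = ''"
        ).fetchone()[0]
        self.assertEqual(missing, 0)

if __name__ == "__main__":
    unittest.main()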
Continuous Database Integration
Continuous integration (CI) is a technique where you automatically build and test your system every time someone checks in a code change (Sadalage 2003). Disciplined agile developers will typically update a few lines of code, or make a small change to a configuration file, or make a small change to a PDM and then check their work into their configuration management tool. The CI tool monitors this, and when it detects a check-in, it automatically kicks off the build and regression test suite in the background. This provides very quick feedback to team members, enabling them to detect issues early.
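Although CI is normally handled by a dedicated server rather than a hand-rolled script, the following minimal sketch shows the essence of the loop such a tool runs: watch the repository for a new check-in and, when one appears, kick off the build and regression suite. It assumes a local git repository and a unittest-based test suite; the polling interval is arbitrary.

import subprocess
import time

def current_revision() -> str:
    # Ask git for the latest commit on the integration branch.
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()

def watch_and_build(poll_seconds: int = 60) -> None:
    last_seen = None
    while True:
        revision = current_revision()
        if revision != last_seen:
            last_seen = revision
            # A new check-in was detected: kick off the build and regression suite.
            result = subprocess.run(["python", "-m", "unittest", "discover"])
            status = "passed" if result.returncode == 0 else "FAILED"
            print(f"Build for {revision[:8]} {status}")
        time.sleep(poll_seconds)

# watch_and_build()  # run inside the repository you want to monitor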


Configuration Management
Configuration management is at the bottom of the stack, providing a foundation for all other agile database techniques. In this case there is nothing special about the assets that you are creating – ETL code, configuration files, data models, test data, stored procedures, and so on – in that if they are worth creating then they are also worth putting under CM control.

Last Words

I would like to end with two simple messages: First, you can do this. Everything described in this chapter is pragmatic, supported by tooling, and has been proven in practice in numerous contexts. Second, you need to do this. The modern, dynamic business environment requires you to work in a reactive manner that does not short-change your organization’s future. The Disciplined Agile approach described in this chapter describes how to do exactly that.

Further Reading

Ambler, S. W. (2002). Agile modeling: Effective practices for extreme programming and the unified process. New York: Wiley.
Ambler, S. W. (2013). Database testing: How to regression test a relational database. Retrieved from http://www.agiledata.org/essays/databaseTesting.html.
Ambler, S. W., & Lines, M. (2012). Disciplined agile delivery: A practitioner’s guide to agile software delivery in the enterprise. New York: IBM Press.
Ambler, S. W., & Sadalage, P. J. (2006). Refactoring databases: Evolutionary database design. Boston: Addison-Wesley.
Guernsey, M., III. (2013). Test-driven database development: Unlocking agility. Upper Saddle River: Addison-Wesley Professional.
Lindstedt, D., & Olschimke, M. (2015). Building a scalable data warehouse with Data Vault 2.0. Waltham: Morgan Kaufmann.
Sadalage, P. J. (2003). Recipes for continuous database integration: Evolutionary database development. Upper Saddle River: Addison-Wesley Professional.


AgInformatics

Andrea De Montis1, Giuseppe Modica2 and Claudia Arcidiacono3
1Department of Agricultural Sciences, University of Sassari, Sassari, Italy
2Dipartimento di Agraria, Università degli Studi Mediterranea di Reggio Calabria, Reggio Calabria, Italy
3Dipartimento di Agricoltura, Alimentazione e Ambiente, University of Catania, Catania, Italy

Synonyms

E-agriculture; Precision agriculture; Precision farming

Definition

The term stems from the blending of the two words agriculture and informatics and refers to the application of informatics to the analysis, design and development of agricultural activities. It overarches expressions such as Precision Agriculture (PA), Precision Livestock Farming (PLF), and Agricultural landscape analysis and planning. The adoption of AgInformatics can accelerate agricultural development by providing farmers and decision makers with more accessible, complete, timely, and accurate information. However, it is still hindered by a number of important yet unresolved issues including big data handling, multiple data sources and limited standardization, data protection, and lack of optimization models. Development of knowledge-based systems in the farming sector would require key components, supported by Internet of things (IoT), data acquisition systems, ubiquitous computing and networking, machine-to-machine (M2M) communications, effective management of geospatial and temporal data, and ICT-supported cooperation among stakeholders.
8 AgInformatics

Generalities

This relatively new expression derives from a combination of the two terms agriculture and informatics, hence alluding to the application of informatics to the analysis, design, and development of agricultural activities. It broadly involves the study and practice of creating, collecting, storing and retrieving, manipulating, classifying, and sharing information concerning both natural and engineered agricultural systems. The domains of application are mainly agri-food and environmental sciences and technologies, while sectors include biosystems engineering, farm management, crop production, and environmental monitoring. In this respect, it encompasses the management of the information coming from applications and advances of information and communication technologies (ICTs) in agriculture (e.g., global navigation satellite system, GNSS; remote sensing, RS; wireless sensor networks, WSN; and radio-frequency identification, RFID) and performed through specific agriculture information systems, models, and methodologies (e.g., farm management information systems, FMIS; GIScience analyses; Data Mining; decision support systems, DSS).

AgInformatics is an umbrella concept that includes and overlaps issues covered in precision agriculture (PA), precision livestock farming (PLF), and agricultural landscape analysis and planning, as follows.

Precision Agriculture (PA)

PA was coined in 1929 and later defined as "a management strategy that uses information technologies to bring data from multiple sources to bear on decisions associated with crop production" (Li and Chung 2015). The concept evolved since the late 1980s due to new fertilization equipment, dynamic sensing, crop yield monitoring technologies, and GNSS technology for automated machinery guidance.

Therefore, PA technology has provided farmers with the tools (e.g., built-in sensors in farming machinery, GIS tools for yield monitoring and mapping, WSNs, satellite and low-altitude RS by means of unmanned aerial systems (UAS), and recently robots) and information (e.g., weather, environment, soil, crop, and production data) needed to optimize and customize the timing, amount, and placement of inputs including seeds, fertilizers, pesticides, and irrigation, activities that were later applied also inside closed environments, buildings, and facilities, such as for protected cultivation.

To accomplish the operational functions of a complex farm, FMISs for PA are designed to manage information about processes, resources (materials, information, and services), procedures and standards, and characteristics of the final products (Sørensen et al. 2010). Nowadays dedicated FMISs operate on networked online frameworks and are able to process a huge amount of data. The execution of their functions implies the adoption of various management systems, databases, software architectures, and decision models. Relevant examples of information management between different actors are supply chain information systems (SCIS), including those specifically designed for traceability and supply chain planning.

Recently, PA has evolved to predictive and prescriptive agriculture. Predictive agriculture regards the activity of combining and using a large amount of data to improve knowledge and predict trends, whereas prescriptive agriculture involves the use of detailed, site-specific recommendations for a farm field. Today PA embraces new terms such as precision citrus farming, precision horticulture, precision viticulture, precision livestock farming, and precision aquaculture (Li and Chung 2015).

Precision Livestock Farming (PLF)

The increase in activities related to livestock farming triggered the definition of the new term precision livestock farming (PLF), namely, the real-time monitoring technologies aimed at managing the smallest manageable production unit's temporal variability, known as "the per animal approach" (Berckmans 2004).
PLF consists in the real-time gathering of data related to livestock animals and their close environment, applying knowledge-based computer models, and extracting useful information for automatic monitoring and control purposes. It implies monitoring animal health, welfare, behavior, and performance and the early detection of illness or a specific physiological status, and it unfolds in several activities including real-time analysis of sounds, images, and accelerometer data, live weight assessment, condition scoring, and online milk analysis. In PLF, continuous measurements and a reliable prediction of variation in animal data or animal response to environmental changes are integrated in the definition of models and algorithms that allow for taking control actions (e.g., climate control, feeding strategies, and therapeutic decisions).

Agricultural Landscape Analysis and Planning

Agricultural landscape analysis and planning is increasingly based on the development of interoperable spatial data infrastructures (SDIs) that integrate heterogeneous multi-temporal spatial datasets and time-series information.

Nearly all agricultural data has some form of spatial component, and GISs allow users to visualize information that might otherwise be difficult to interpret (Pierce and Clay 2007).

Land use/land cover (LU/LC) change detection methods are widespread in several research fields and represent an important issue dealing with the modification analysis of agricultural uses. In this framework, RS imagery plays a key role and involves several steps dealing with the classification of continuous radiometric information remotely surveyed into tangible information, often exposed as thematic maps in GIS environments, and that can be utilized in conjunction with other data sets. Among classification techniques, object-based image analysis (OBIA) is one of the most powerful and has gained popularity since the early 2000s in extracting meaningful objects from high-resolution RS imagery.

Proprietary data sources are integrated with social data created by citizens, i.e., volunteered geographic information (VGI). VGI includes crowdsourced geotagged information from social networks (often provided by means of smart applications) and geospatial information on the Web (GeoWeb). Spatial decision support systems (SDSSs) are computer-based systems that help decision makers in the solution of complex problems, such as in agriculture, land use allocation, and management. SDSSs implement diverse forms of multi-criteria decision analysis (MCDA). GIS-based MCDA can be considered a class of SDSS. Implementing GIS-MCDA within the World Wide Web environment can help to bridge the gap between the public and experts and favor public participation.

Conclusion

Technologies have the potential to change modes of producing agri-food and livestock. ICTs can accelerate agricultural development by providing more accessible, complete, timely, or accurate information at the appropriate moment to decision makers. Concurrently, management concepts such as PA and PLF may play an important role in driving and accelerating adoption of ICT technologies. However, the application of PA solutions has been slow due to a number of important yet unresolved issues, including big data handling, limited standardization, data protection, and lack of optimization models, and it depends as well on infrastructural conditions such as the availability of broadband internet in rural areas. The adoption of FMISs in agriculture is hindered by barriers connected to poor interfacing, interoperability and standardized formats, and dissimilar technological equipment adoption. Development of knowledge-based systems in the farming sector would require key components, supported by IoT, data acquisition systems, ubiquitous computing and networking, M2M communications, effective management of geospatial and temporal data, traceability systems along the supply chain, and ICT-supported cooperation among stakeholders.
Designs and prototypes using cloud computing and the future Internet generic enablers for inclusion in FMIS have recently been proposed and lay the groundwork for future applications. A modification, which is underway, from proprietary tools to Internet-based open systems supported by cloud hosting services will enable a more effective cooperation between actors of the supply chain. One of the limiting factors in the adoption of SCIS is a lack of interoperability, which would require implementation of virtual supply chains based on the virtualization of physical objects such as containers, products, and trucks. Recent and promising developments in spatial decision-making deal with the interaction and the proactive involvement of the final users, implementing so-called collaborative or participative Web-based GIS-MCDA systems. Developments in computer science and IT also affect the evolution of RS in agriculture, leading to the need for new methods and solutions to the challenges of big data in a cloud computing environment.

Cross-References

▶ Agriculture
▶ Cloud
▶ Data Processing
▶ Satellite Imagery/Remote Sensing
▶ Sensor Technologies
▶ Socio-spatial Analytics
▶ Spatial Data

Further Reading

Berckmans, D. (2004). Automatic on-line monitoring of animals by precision livestock farming. In Proceedings of the ISAH conference on animal production in Europe: The way forward in a changing world. Saint-Malo, pp. 27–31.
Li, M., & Chung, S. (2015). Special issue on precision agriculture. Computers and Electronics in Agriculture, 112, 1.
Pierce, F. J., & Clay, D. (Eds.). (2007). GIS applications in agriculture. Boca Raton: CRC Press Taylor and Francis Group.
Sørensen, C. G., Fountas, S., Nash, E., Pesonen, L., Bochtis, D., Pedersen, S. M., Basso, B., & Blackmore, S. B. (2010). Conceptual model of a future farm management information system. Computers and Electronics in Agriculture, 72(1), 37–47.

Agriculture

Madeleine Johnson, Hemayet Hossain, Victor Sposito and Robert Faggian
Centre for Regional and Rural Futures, Deakin University, Burwood, VIC, Australia

Synonyms

AgInformatics; Digital agriculture; Smart agriculture

Big Data and (Smart) Agriculture

Big data and digital technology are driving the latest transformation of agriculture – to what is becoming increasingly referred to as "smart agriculture" or sometimes "digital agriculture." This term encompasses farming systems that employ digital sensors and information to support decision-making. Smart agriculture is an umbrella concept that includes precision agriculture (see ▶ "AgInformatics" – De Montis et al. 2017) – in many countries (e.g., Australia), precision agriculture commonly refers to cropping practices that use GPS guidance systems to assist with seed, fertilizer, and chemical applications. It therefore tends to be associated specifically with cropping farming systems and deals primarily with infield variability. Smart agriculture, however, refers to all farming systems and deals with decision-making informed by location, contextual data, and situational awareness. The sensors employed in smart agriculture can range from simple feedback systems, such as a thermostat that acts to regulate a machine's temperature, to complex machine learning algorithms that inform pest and disease management strategies.

The term big data, in an agricultural context, is related but distinct – it refers to computerized analytical systems that utilize large databases of information to identify statistical relationships that then inform decision support tools. This often includes big data from nonagricultural sources, such as weather or climate data or market data.
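As a purely illustrative sketch of the kind of analysis described here, the short Python example below (using the pandas library) pools hypothetical varietal-trial records and summarizes them so that a decision support tool could suggest the highest-yielding variety for a given growing context. The data, column names, and groupings are assumptions made for illustration and are not part of the entry.

    # Hypothetical sketch: summarize pooled varietal-trial records so a
    # decision support tool can suggest the best-yielding variety per context.
    import pandas as pd

    # Illustrative records; a real data set would span many paddocks and seasons.
    trials = pd.DataFrame({
        "region":     ["North", "North", "North", "South", "South", "South"],
        "soil":       ["clay",  "clay",  "loam",  "loam",  "clay",  "loam"],
        "variety":    ["A",     "B",     "A",     "B",     "A",     "B"],
        "yield_t_ha": [3.1,     3.8,     4.0,     4.4,     2.9,     4.1],
    })

    # Mean yield for each variety within each region/soil combination.
    summary = (trials
               .groupby(["region", "soil", "variety"])["yield_t_ha"]
               .mean()
               .reset_index())

    # For every region/soil context, keep the variety with the highest mean yield.
    best = summary.loc[summary.groupby(["region", "soil"])["yield_t_ha"].idxmax()]
    print(best)

The output of such a summary is exactly the kind of statistical relationship that could sit behind a recommendation screen in a farm decision support tool; the worked example in the following paragraphs describes the same idea in prose.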
An example of how these concepts interact in practice: a large dataset may be established that contains the yield results of many varietal trials across a broad geographical area and over a long period of time (including detailed information pertaining to the location of each trial, such as soil type, climatic data, fertilizer, and chemical application rates, among others). This data could be analyzed to specifically determine the best variety for a particular geographic location and thus form the basis for a decision support system. These two steps constitute the data and the analytic components of "big data" in an agricultural context. The data could then inform other activities, such as the application (location and rate) of chemicals, fertilizers, and seed through digital-capable and GPS-guided farm machinery (precision agriculture).

Applications of Big Data in Smart Agriculture

Big data, and in particular big data analytics, are often described as disruptive technologies that are having a profound effect on economies. The amount of data being collected is increasing exponentially, and the cost of computing and digital sensors is decreasing exponentially. As such, the range of consumer goods (including farm machinery and equipment) that incorporates Internet or network connectivity as a standard feature is growing. The result is a rapidly expanding "Internet of things" (IoT) and large volumes of new data. For example, John Deere tractors are fitted with sensors that collect and transmit soil and crop data, which farmers can subscribe to access via proprietary software portals (Bronson and Knezevic 2016). The challenge in agriculture is reaching a point where available data and databases qualify as "big." Yield measurements from a single paddock within one growing season are of little value because such limited data does not inform actionable decision-making. But, when the same data is collected across many paddocks and many seasons, it can be analyzed for trends that inform on-farm decision-making and thus becomes much more valuable. This is true across the full agricultural value chain.

Nonetheless, smart agriculture, IoT, and big data are impacting on the full agricultural value chain. Here we list some examples according to farming system type (as outlined by AFI 2016):

1. Cropping systems: Variable rate application technology (precision agriculture), unmanned aerial vehicles or drones for crop assessment, remote sensing via satellite imagery.
2. Extensive livestock: Walkover weighing scales and auto-drafting equipment, livestock tracking systems, remote and proximal sensor systems for pasture management, virtual fencing.
3. Dairy: As for extensive livestock, plus individual animal ID systems and animal activity meters that both underpin integrated dairy and herd management systems.
4. Horticulture: Input monitoring and management systems (irrigation and fertigation), robotic harvesting systems, automated postharvest systems (grading, packing, chilling).

Overall, while the technology is still relatively new, agriculture is already seeing substantial productivity gains from its use. Further transformative impacts will be felt when real-time information, business process decisions, and off-farm issues (e.g., postharvest track and trace of products), such as planning, problem-solving, risk management, and marketing, are underpinned by big data.

Challenges and Implications

In an agricultural context, there are several challenges.

First, convincing farmers that the data (and its collection) are not merely a novelty but something that will drive significant productivity improvement in the future may be difficult.
In many cases, the hardware and infrastructure required to collect and use agricultural data are expensive (and prohibitively so in developing countries) or unavailable in rural areas (especially fast and reliable Internet access), and the benefits may not be realized for many years. Similarly, technical literacy could be a barrier in some cases. These issues are, however, common to many on-farm practice change exercises that drive improvements in efficiency or productivity and can generally be overcome.

Second, farmers may perceive that there are privacy and security issues associated with making data about their farm available to unknown third parties (Wolfert et al. 2017). Large proprietary systems from the private sector are available to capture and store significant amounts of data that is then made available to farmers via subscription. But linking big data systems to commercial benefit raises the possibility of biased recommendations. Similarly, farmers may be reluctant to provide detailed farm data to public or open-source decision support systems because they often do not trust government agencies. These systems also lack ongoing development and support for end users.

Finally, a sometimes-overlooked issue is that of data quality. In the race for quantity, it is easy to forget quality and the fact that not all digital sensors are created equal. Data is generated at varying resolutions, with varying levels of error and uncertainty, from machinery in various states of repair. The capacity of analytical techniques to keep pace with the amount of data, to filter out poor quality data, and to generate information that is suitable at a range of resolutions are all key issues for big data analytics. For analyses to underpin accurate agricultural forecasting or predictive services that improve productivity, advancements in intelligent processing and analytics are required.

Ultimately, it is doubtful that farmer knowledge can ever be fully replaced by big data and analytic services. The full utility of big data for agriculture will be realized when the human components of food and fiber production chains are better integrated with the digital components to ensure that the outputs are relevant for planning (forecasting and predicting), communication, and management of (agri)business processes.

Cross-References

▶ AgInformatics
▶ Data Processing
▶ Socio-spatial Analytics
▶ Spatial Data

Further Reading

Australian Farm Institute. (2016). The implications of digital farming and big data for Australian agriculture. Surry Hills: NSW Australian Farm Institute. ISBN 978-1-921808-38-8.
Bronson, K., & Knezevic, I. (2016). Big data in food and agriculture. Big Data & Society, 3, 1–5. https://doi.org/10.1177/2053951716648174.
De Montis, A., Modica, G., & Arcidiacono, C. (2017). AgInformatics. Encyclopedia of Big Data. https://doi.org/10.1007/978-3-319-32001-4_218-1.
Wolfert, S., Ge, L., Verdouw, C., & Bogaardt, M. (2017). Big data in smart farming – A review. Agricultural Systems, 153, 69–80. https://doi.org/10.1016/j.agsy.2017.01.023.

AI

▶ Artificial Intelligence

Algorithm

Laurie A. Schintler, George Mason University, Fairfax, VA, USA
Joshua Lee, Schar School of Policy and Government, George Mason University, Fairfax, VA, USA

Overview

We are now living in an "algorithm society." Indeed, algorithms have become ubiquitous, running behind the scenes everywhere for various purposes, from recommending movies to optimizing autonomous vehicle routing to detecting fraudulent financial transactions. Nevertheless, algorithms are far from new.
The idea of an algorithm, referring generally to a set of rules to follow for solving a problem or achieving a goal, goes back thousands of years. However, the use of algorithms has exploded in recent years for a few interrelated reasons:

1. Advancements in computational and information processing technologies have made it easier to develop, codify, implement, and execute algorithms.
2. Open-source digital platforms and crowdsourcing projects enable algorithmic code to be shared and disseminated to a large audience.
3. The complexities and nuances of big data create unique computational and analytical challenges, which demand algorithms.

Algorithms used for big data management, analysis, modeling, and governance comprise a complex ecosystem, as illustrated in Fig. 1. Specifically, algorithms are used for capturing, indexing, and processing massive, fast-moving data; extracting relevant and meaningful information and content from big data streams; detecting anomalies, predicting, classifying, and learning patterns of association; and protecting privacy and cybersecurity. Despite the benefits of algorithms in the big data era, their use and application in society come with various ethical, legal, and social downsides and dangers, which must be addressed and managed.

Algorithm, Fig. 1 Ecosystem of algorithms for big data management, analysis, and modeling

Machine Learning Algorithms

Machine learning leverages algorithms for obtaining insights (e.g., uncovering unknown patterns), creating models for prediction and classification, and controlling automated systems. In this regard, there are many different classes of algorithms. In supervised machine learning, algorithms are applied to a training data set containing attributes and outcomes (or labels) to develop a model that can predict or classify with a minimal level of model error.
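A minimal sketch of the supervised case just described is given below, assuming the scikit-learn toolkit and one of its standard demonstration data sets; the particular model and data are illustrative choices rather than anything prescribed by the entry. The algorithm is fitted to labeled training examples and its error is then estimated on held-out data.

    # Illustrative sketch of supervised learning: train on labeled examples,
    # then estimate model error on data the algorithm has not seen.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)          # attributes and labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = RandomForestClassifier(random_state=0)
    model.fit(X_train, y_train)                # learn from the training set

    predictions = model.predict(X_test)
    print("held-out accuracy:", accuracy_score(y_test, predictions))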

In contrast, unsupervised learning algorithms are given a training set without any correct output (or labels) in advance. The algorithm's role is to figure out how to partition the data into different classes or groupings based on the similarity (or dissimilarity) of the attributes or observations. Association rule mining algorithms reveal patterns of association between features based on their co-occurrence. Semi-supervised algorithms are used in instances where not all observations have an output or label. Such algorithms exploit the available observations to create a partially trained model, which is then used to infer the output or labels for the incomplete observations. Finally, reinforcement learning algorithms, which are often used for controlling and maximizing automated agents' performance (e.g., autonomous vehicles), produce their own training data based on information collected from their interaction with the environment. Agents then adjust their behavior to maximize a reward or minimize risk.

Artificial Neural Networks (ANNs) are biologically inspired learning systems that simulate how the human brain processes information. Such models contain flexible weights along pathways connected to "neurons" and an activation function that shapes the nature of the output. In ANNs, algorithms are used to optimize the learning process, i.e., to minimize a cost function. Deep neural learning is an emerging paradigm, where the algorithms themselves adapt and learn the optimal parameter settings, i.e., they "learn to learn." Deep learning contains many more layers and parameters than conventional ANNs. Each layer of nodes trains on features from the output of the prior layers. This idea, known as feature hierarchy, enables deep learning to effectively and efficiently model complex phenomena containing nonlinearities and multiple interacting features and dynamics.

Algorithms for Big Data Management

Conventional data management tools, techniques, and technologies were not designed for big data. Various kinds of algorithms are therefore used to address the unique demands of big data, particularly those relating to the volume, velocity, variety, veracity, and vulnerability of the data. Dynamic algorithms help to manage fast-moving big data. Specifically, such algorithms design data structures that reflect the evolving nature of a problem so that data queries and updates can be done quickly and efficiently without starting from scratch. As big data tends to be very large, it often exceeds our capacity to store, organize, and process it. Algorithms can be used to reduce the size and dimensionality of the data before it goes into storage and to optimize storage capacity itself. Big data tends to be fraught with errors, noise, incompleteness, bias, and redundancies, which can compromise the accuracy and efficiency of machine learning algorithms. Data cleansing algorithms identify imperfections and anomalies, transform the data accordingly, and validate the transformed data. Other algorithms are used for data integration, data aggregation, data transmission, data discretization, and other pre-processing tasks. A cross-cutting set of challenges relates to data security and, more specifically, the privacy, integrity, confidentiality, and accessibility of the data. Encryption algorithms, which encode data and information, are used to address such concerns.

Societal Implications of Algorithms

While algorithms are beneficial to big data management, modeling, and analysis, as highlighted, their use comes part and parcel with an array of downsides and dangers. One issue is algorithmic bias and discrimination. Indeed, algorithms have been shown to produce unfair outcomes and decisions, favoring (or disfavoring) certain groups or communities over others. The use of facial recognition algorithms for predicting criminality is a case in point. In particular, such systems are notoriously biased in terms of race, gender, and age.
Algorithmic bias stems in part from the data used for training, testing, and validating machine learning models, especially if it is skewed or incomplete (e.g., due to sampling bias) or reflects societal gaps and disparities in the first place. The algorithms themselves can also amplify and contribute to biases. Compounding matters is that algorithms are often opaque, particularly in deep learning models, which have complex architectures that cannot be easily uncovered, explained, or understood. Standards, policies, and ethical and legal frameworks are imperative for mitigating the negative implications of algorithms. Moreover, transparency is critical for ensuring that people understand the inner workings of the algorithms that are used to make decisions that affect their lives and well-being. Considering new and advancing capabilities in Explainable Artificial Intelligence (XAI), algorithms themselves could soon play an active role in this regard, adding new dimensions and dynamics to the "algorithmic society."

Cross-References

▶ Algorithmic Complexity
▶ Artificial Intelligence
▶ Data Governance
▶ Deep Learning
▶ Machine Learning

Further Reading

Li, K. C., Jiang, H., Yang, L. T., & Cuzzocrea, A. (Eds.). (2015). Big data: Algorithms, analytics, and applications. Boca Raton: CRC Press.
Mnich, M. (2018). Big data algorithms beyond machine learning. KI – Künstliche Intelligenz, 32(1), 9–17.
Olhede, S. C., & Wolfe, P. J. (2018). The growing ubiquity of algorithms in society: Implications, impacts and innovations. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2128), 20170364.
Prabhu, C. S. R., Chivukula, A. S., Mogadala, A., Ghosh, R., & Livingston, L. J. (2019). Big data analytics. In Big data analytics: Systems, algorithms, applications (pp. 1–23). Singapore: Springer.
Schuilenburg, M., & Peeters, R. (Eds.). (2020). The algorithmic society: Technology, power, and knowledge. London: Routledge.
Siddiqa, A., Hashem, I. A. T., Yaqoob, I., Marjani, M., Shamshirband, S., Gani, A., & Nasaruddin, F. (2016). A survey of big data management: Taxonomy and state-of-the-art. Journal of Network and Computer Applications, 71, 151–166.
Yu, P. K. (2020). The algorithmic divide and equality in the age of artificial intelligence. Florida Law Review, 72, 19–44.

Algorithmic Analysis

▶ Algorithmic Complexity

Algorithmic Complexity

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Synonyms

Algorithmic analysis; Big O notation

Introduction

Algorithmic complexity theory is the theoretical analysis of the amount of resources consumed by a process in executing a particular algorithm or solving a particular problem. As such, it is a measure of the inherent difficulty of various problems and also of the efficiency of proposed solutions. The resources measured can be almost anything, such as the amount of computer memory required, the number of gates required to embed the solution in hardware, and the number of parallel processors required, but it most often refers to the amount of time required for a computer program to successfully execute and, in particular, to differences in the amount of resources that cannot be overcome simply by using better equipment.

An Example comparable computers. If Algorithm 1 were run


on a computer 10 times as fast, then it would
Consider the problem of determining whether complete in (effectively) time equal to N2/10,
each element in an N-element array is unique or, faster than Algorithm 2.
in other words, whether or not the array contains By contrast, Algorithm 3 is inherently more
any pairs. A naïve but simple solution would be to efficient than either of the other algorithms, suffi-
compare every element with every other element; ciently faster to beat any amount of money thrown
if no two elements are equal, every element is at the issue:
unique. The following pseudocode illustrates
this algorithm: Algorithm 3:
sort the array such that a[i] > ¼ a[i + 1] for
Algorithm 1: every element i (Statement A3)
for every element a[i] in the array for every element a[i] in the (sorted) array
for every element a[j] in the array if a[i] ¼ a[i + 1] report false and quit
if i 6¼ j and a[i] ¼ a[j] report false and (Statement A4)
quit (Statement A1) if all element-pairs have been compared,
if all element-pairs have been compared, report true and quit
report true and quit
The act of sorting will bring all like-valued
Because there are N2 element-pairs to com- elements together; if there are pairs in the original
pare, Statement A1 will be executed up to N2 data, they will be in adjacent elements after
times. The program as a whole will thus require sorting, and a single loop looking for adjacent
at least N2 statement execution times to elements with the same value will find any pairs
complete. (if they exist) in N passes or fewer through the
A slightly more efficient algorithm designer loops. The total time to execute Algorithm 3 is
would notice that if element a[x] has been com- thus roughly equal to N (the number of times
pared to element a[y], there is no need to compare statement A4 is executed) plus the amount of
element a[y] to element a[x] later. One can there- time it takes to sort an array of N elements.
fore restrict comparisons between element a[x] Sorting is a well-studied problem; many differ-
and elements later in the array, as in the following ent algorithms have been proposed, and it is
pseudocode: accepted that it takes approximately N times
log2(N) steps to sort such an array. The total time
Algorithm 2: of Algorithm 3 is thus N + N log2(N), which is less
for every element a[i] in the array than 2(N log2(N)), which in turn is less than N2 for
for every element a[j] (j > i) in the array large values of N. Algorithm 3, therefore, is more
if a[i] ¼ a[j] report false and quit efficient than Algorithm 1 or 2, and the efficiency
(Statement A2) gap gets larger as N (the amount of data) gets
if all element-pairs have been compared, bigger.
report true and quit

In this case, the first element will be compared Mathematical Expression


against N-1 other elements, the second against N-
2, and so forth. Statement A2 will thus be exe- Complexity is usually expressed in terms of com-
cuted (1 + 2 + 3 + 4 + . . . + (N-1)) times, for a total plexity classes using the so-called algorithmic
of N(N-1)/2 times. Since N2 > N(N-1)/2, Algo- notation (also known as “big O” or “big Oh”
rithm 2 could be considered marginally more effi- notation.) In general, algorithmic notation
cient. However, note that this comparison describes the limit behavior of a function in
assumes that Algorithms 1 and 2 are running on terms of equivalence classes. For polynomial
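The pseudocode above can also be expressed as runnable code. The following Python sketch mirrors the three algorithms (the function names are ours, added for illustration): the first two perform on the order of N² comparisons, while the third sorts the data first and then makes a single linear pass.

    # Runnable versions of the three uniqueness-checking algorithms above.

    def all_unique_v1(a):      # Algorithm 1: compares every pair twice, ~N^2 steps
        n = len(a)
        for i in range(n):
            for j in range(n):
                if i != j and a[i] == a[j]:
                    return False
        return True

    def all_unique_v2(a):      # Algorithm 2: compares each pair once, ~N(N-1)/2 steps
        n = len(a)
        for i in range(n):
            for j in range(i + 1, n):
                if a[i] == a[j]:
                    return False
        return True

    def all_unique_v3(a):      # Algorithm 3: sort (~N log N), then one linear scan
        b = sorted(a)
        for i in range(len(b) - 1):
            if b[i] == b[i + 1]:
                return False
        return True

    print(all_unique_v1([3, 1, 4, 1, 5]))  # False: 1 appears twice
    print(all_unique_v3([3, 1, 4, 2, 5]))  # True

In practice, Python's built-in set type gives an even shorter check (len(set(a)) == len(a)), but the three versions above are kept deliberately close to the pseudocode so that the complexity argument can be followed line by line.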

Mathematical Expression

Complexity is usually expressed in terms of complexity classes using the so-called algorithmic notation (also known as "big O" or "big Oh" notation). In general, algorithmic notation describes the limit behavior of a function in terms of equivalence classes. For polynomial functions, such as aN³ + bN² + cN¹ + d, the value of the function is dominated (for large N) by N³. If N ≫ a, then the exact value of a does not matter very much, and even less do the values of b, c, and d. Similarly, for large N, any (constant) multiplier of N² is larger than any constant times N log₂N, which in turn is larger than any constant multiplier of N.

More formally, for any two functions f(N) and g(N),

f(N) = O(g(N))    (1)

if and only if there are positive constants K and n₀ such that

|f(N)| ≤ K |g(N)| for all N > n₀    (2)

In less formal terms, as N gets larger, a multiple of the function g() eventually gets above f() and stays there indefinitely. Thus, even if you speeded up the algorithm represented by f() by any constant multiplier (e.g., by running the program on a computer K times as fast), g() would still be more efficient for large problems.

Because of the asymmetry of this definition, the O() notation specifically establishes an upper bound (worst case) on algorithm efficiency. There are other, related notations ("big omega" and "big theta") that denote lower bounds and exact (upper and lower) bounds, respectively.

In practice, this definition is rarely used; instead people tend to use a few rules of thumb to simplify calculations. For example, if f(N) is the sum of several terms, only the largest term (the one with the largest power of N) is of interest. If f is the product of several factors, only factors that depend on N are of interest. Thus if f() were the function

f(N) = 21N³ + 3N² + 17N − 4    (3)

the first rule tells us that only the first term (21N³) matters, and the second rule tells us that the constant 21 does not matter. Hence

f(N) = O(N³)    (4)

as indeed would any cubic polynomial function.

An even simpler rule of thumb is that the deepest number of nested loops in a computer program or algorithm controls the complexity of the overall program. A program that loops over all the data will need to examine each point and hence is at least O(N). A program that contains two nested loops (such as Algorithms 1 and 2) will be O(N²), and so forth.

Some Examples

As discussed above, the most naïve sorting algorithms are O(N²) as they involve comparing each item to most if not all other items in the array. Fast sorting algorithms such as mergesort and heapsort are O(N log₂(N)). Searching for an item in an unsorted list is O(N) because every element must potentially be examined. Searching for an item in a sorted list is O(log₂(N)) because binary search can be used to eliminate large sections of the list.

Problems that can be solved in constant time (such as determining if a number is positive or negative) are said to be O(1).

A particularly important class of algorithms are those for which the fastest known algorithm is exponential (O(c^N)) or worse. For example, the so-called travelling salesman problem involves finding the shortest closed path through a given set of points. These problems are generally considered to be very hard to solve as the best-known algorithm is still very complex and time-consuming.

Further Reading

Aho, A. V., & Ullman, J. D. (1983). Data structures and algorithms. Pearson.
Knuth, D. (1976, Apr–June). Big omicron and big omega and big theta. SIGACT News.
Knuth, D. E. (1998). Sorting and searching (2nd ed.). The art of computer programming, vol. 3 (p. 780). Pearson Education.
Sedgewick, R., & Wayne, K. (2011). Algorithms. Addison-Wesley Professional.

American Bar Association

Jennifer J. Summary-Smith
Florida SouthWestern State College, Fort Myers, FL, USA
Culver-Stockton College, Canton, MO, USA

The American Bar Association (ABA) is one of the world's largest voluntary associations of lawyers, law students, and legal and law professionals in the United States. Its national headquarters is located in Chicago, Illinois, with a large branch office in Washington, D.C. According to the ABA's website, it has nearly 400,000 members and more than 3,500 entities. The American Bar Association was established when 75 lawyers from 20 states and the District of Columbia came together on August 21, 1878, in Saratoga Springs, New York. Since its founding in 1878, the ABA has played an important role in the development of the legal profession in the United States. The ABA website states that "the ABA is committed to supporting the legal profession with practical resources for legal professionals while improving the administration of justice, accrediting law schools, establishing model ethical codes, and more." The ABA is also committed to serving its members, refining the legal profession, eradicating bias and promoting diversity, and advancing the rule of law in the entire United States and around the globe. Thus, becoming a member of the ABA has several benefits in terms of access to exclusive data.

Membership Benefits Involving Big Data

A benefit of becoming a member of the ABA is that it allows access to free career services for job seekers and employers. As it states on the ABA's website, job seekers can search and apply for more than 450 legal jobs across the nation. The ABA's website provides the opportunity to upload one's resume, receive email alerts, and access monthly webinars by experts who provide career advice. Employers have access to more than 5,400 resumes, email alerts, and reach more than 16,500 visitors monthly. The career services provide members the opportunity to network with potential employers, granting access to valuable data and personal information.

Other benefits for members include access to the ABA's 22 sections, 6 divisions, and 6 forums. Members can participate in a community where they can interact with professionals in a variety of practice specialties. Each of the groups provides members the opportunity to facilitate in-depth examinations of trends, issues, and regulations in specific areas of law and other special interests. Members can also enrich their careers with ABA's committees and task forces, which provide access to specialty groups and internal ABA departments; the groups range from antitrust and family law to the law student division and judicial division. The ABA also advocates for inclusion and diversity initiatives committed to eliminating bias and promoting diversity. The ABA publishes annual reports on the following issues: persons with disabilities participation rates, racial and ethnic diversity, women in leadership positions, and lesbian, gay, bisexual, and transgender participation. Through the use of the ABA's data, members are able to learn and understand valuable information regarding the ever-changing landscape of law and society. Members use the data to help guide their own practices, influencing decision-making and public policy. Although technology can positively affect ABA members' careers by providing vital information, there are concerns in the legal profession in regard to its influence on social interaction within the work environment.

Benefits and Concerns of Technology in the Workplace

In a recent study, Glen Vogel analyzed issues associated with the generational gap in the use of technology and social media by legal professionals. According to the article, Internet social media (ISM) is a concern within the profession, blurring the lines between professional and personal tasks.
ISM can also foster technology overload, resulting in a need for reevaluating workplace etiquette and rules of professional conduct. Vogel posits that over the past decade legal professionals have been using ISM for more than connecting with people. Users are participating on a global front, engaging in ISM to influence society. With a surge in the number of users in the legal workplace, there are growing concerns with confidentiality and the traditional work environment. As younger generations enter the workforce, the gap between younger and older generations widens. Vogel adds that it is important for every generation to be willing to accept new technologies because they can prove to be useful tools within the workplace.

The American Bar Association is a proprietor of big data, influencing the legal profession in the United States and around the world. The ABA continues to expand its information technology services, recently partnering with ADAR IT. Marie Lazzara writes that ADAR is the provider of the private cloud, supporting law firms with benefits such as remote desktop access and disaster recovery. As more organizations, such as the ABA, make strides to bridge this gap, one thing is certain: the big data phenomenon has an influence on the legal profession.

Cross-References

▶ Cloud Services
▶ Data Brokers
▶ Ethical and Legal Issues
▶ LexisNexis

Further Reading

American Bar Association. http://www.americanbar.org/aba.html. Accessed July 2014.
Lazzara, M. ADAR IT named premium solutions provider by American Bar Association. http://www.prweb.com/releases/2014/ADAR/prweb12053119.htm. Accessed July 2014.
Vogel, G. (2013). A review of the International Bar Association, LexisNexis technology studies, and the American Bar Association's Commission on Ethics 20/20: The legal profession's response to the issues associated with the generational gap in using technology and Internet social media. The Journal of the Legal Profession, 38, 95.

American Civil Liberties Union

Doug Tewksbury
Communication Studies Department, Niagara University, Niagara, NY, USA

The American Civil Liberties Union (ACLU) is an American legal advocacy organization that defends US Constitutional rights through civil litigation, lobbying efforts, educational campaigns, and community organization. While not its sole purpose, the organization has historically focused much of its attention on issues surrounding the freedom of expression, and as expression has become increasingly mediated through online channels, the ACLU has fought numerous battles to protect individuals' First and Fourth Amendment rights of free online expression, unsurveilled by government or corporate authorities.

Founded in 1920, the ACLU has been at the forefront of a number of precedent-setting cases in the US court system. It is perhaps most well-known for its defense of First Amendment rights, particularly in its willingness to take on unpopular or controversial cases, but has also regularly fought for equal access and protection from discrimination (particularly for groups of people who have traditionally been denied these rights under the law), Second Amendment protection for the right to bear arms, and due process under the law, amongst others. The ACLU has provided legal representation or amicus curiae briefs for a number of notable precedent-setting legal cases, including Tennessee v. Scopes (1925), Gitlow v. New York (1925), Korematsu v. United States (1944), Brown v. Board of Education (1954), Miranda v. Arizona (1966), Roe v. Wade (1973), and dozens of others. Its stated mission is "to defend and preserve the individual rights and liberties guaranteed to every person in this country by the Constitution and laws of the United States."

The balance between civil liberties and national security is an always-contentious relationship, and the ACLU has come down strongly on the side of privacy for citizens. The passage of the controversial USA PATRIOT Act in 2001 and its subsequent renewals led to sweeping governmental powers of warrantless surveillance, data collection, wiretapping, and data mining, many of which continue today. Proponents of the bill defended its necessity in the name of national security in the digital age; opponents argued that it would fundamentally violate the civil rights of American citizens and create a surveillance state. The ACLU would be among the leading organizations challenging a number of practices resulting from the passage of the bill.

The cases during this era are numerous, but several are particularly noteworthy in their relationship to governmental and corporate data collection and surveillance. In 2004, the ACLU represented Calyx Internet Access, a New York internet service provider, in Doe v. Ashcroft. The FBI had ordered the ISP to hand over user data through issuing a National Security Letter, a de facto warrantless subpoena, along with issuing a gag order on discussing the existence of the inquiry, a common provision of this type of letter. In ACLU v. National Security Agency (NSA) (2006), the organization unsuccessfully led a lawsuit against the federal government arguing that its practice of warrantless wiretapping was a violation of Fourth Amendment protections. Similar lawsuits were filed against AT&T, Verizon, and a number of other telecommunication corporations during this era. The ACLU would represent the plaintiffs in Clapper v. Amnesty International (2013), an unsuccessful attempt to challenge the Foreign Intelligence Surveillance Act's provision that allows for the NSA's warrantless surveillance and mass data collection and analysis of individuals' electronic communications. It has strongly supported the whistleblower revelations of Edward Snowden in his 2013 leak of classified NSA documents detailing the extent of the organization's electronic surveillance of the communications of over a billion people worldwide, including millions of domestic American citizens.

In terms of its advocacy campaigns, the organization has supported Digital 4th, a Fourth Amendment activist group, advocating for a non-partisan focus on new legislative action to update the now-outdated Electronic Communications Privacy Act (ECPA), a 30-year-old bill that still governs much of online privacy law. Similarly, the ACLU has strongly supported Net Neutrality, the equal distribution of high-speed broadband traffic. The Free Future campaign has made the case for governmental uses of technology in accountable, transparent, and constitutionally sound ways on such issues as body-worn cameras for police, digital surveillance and data mining, hacking and data breaches, and traffic cameras, amongst other technological issues, as well as through the Demand Your DotRights campaign. In 2003, the CAN-SPAM act was on its way through Congress, and the ACLU took the unpopular position that the act unjustly restricted the freedom of speech online and would have a chilling effect on speech, as it has continued to argue in several other anti-spam legislative bills.

The ACLU has built its name on defending civil rights, and the rise of information-based culture has resulted in a greatly expanded practice and scope of the organization's focus. However, with cases such as the 2013–2014 revelations that came from the Snowden affair on NSA surveillance, it is clear that the ongoing tension between the rise of new information technologies, the government's desire for surveillance in the name of national security, and the public's right to Constitutional protection under the Fourth Amendment is far from resolved.

Further Reading

American Civil Liberties Union. (2014). Key issues/About us. Available at https://www.aclu.org/key-issues.
Herman, S. N. (2011). Taking liberties: The war on terror and the erosion of American democracy. New York: Oxford University Press.
Klein, W., & Baldwin, R. N. (2006). Liberties lost: The endangered legacy of the ACLU. Santa Barbara: Greenwood Publishing Group.
Walker, S. (1999). In defense of American liberties: A history of the ACLU. Carbondale, IL: SIU Press.

American Library Association

David Brown
Southern New Hampshire University, University of Central Florida College of Medicine, Huntington Beach, CA, USA
University of Wyoming, Laramie, WY, USA

The American Library Association (ALA) is a voluntary organization that represents libraries and librarians around the world. Worldwide, the ALA is the largest and oldest professional organization for libraries, librarians, information science centers, and information scientists. The association was founded in 1876 in Philadelphia, Pennsylvania. Since its inception, the ALA has provided leadership for the development, promotion, and improvement of libraries, information access, and information science. The ALA is primarily concerned with learning enhancement and information access for all people. The organization strives to advance the profession through its initiatives and divisions within the organization. The primary action areas for the ALA are advocacy, education, lifelong learning, intellectual freedom, organizational excellence, diversity, equitable access to information and services, expansion of all forms of literacy, and library transformation to maintain relevance in a dynamic and increasingly global, digitalized environment. While the ALA is composed of several different divisions, there is no single division devoted exclusively to big data. Rather, a number of different divisions are working to develop and implement policies and procedures that will enhance the quality of, the security of, the access to, and the utility of big data.

ALA Divisions Working with Big Data

At this time, the Association of College & Research Libraries (ACRL) is a primary division of the ALA that is concerned with big data issues. The ACRL has published a number of papers, guides, and articles related to the use of, promise of, and the risks associated with big data. Several other ALA divisions are also involved with big data. The Association for Library Collections & Technical Services (ALCTS) division discusses issues related to the management, organization, and cataloging of big data and its sources. The Library Information Technology Association (LITA) is an ALA division that is involved with the technological and user services activities that advance the collection, access, and use of big data and big data sources.

Big Data Activities of the Association of College & Research Libraries (ACRL)

The Association of College & Research Libraries (ACRL) is actively involved with the opportunities and challenges presented by big data. As science and technology advance, our world becomes more and more connected and linked. These links in and of themselves may be considered big data, and much of the information that they transmit is big data. Within the ACRL, big data is conceptualized in terms of the three Vs: its volume, its velocity, and its variety. Volume refers to the tremendously large size of the big data. However, ACRL stresses that the size of the data set is a function of the particular problem one is investigating, and size is only one attribute of big data. Velocity refers to the speed at which data is generated, needed, and used. As new information is generated exponentially, the need to catalogue, organize, and develop user-friendly means of accessing these big data increases exponentially as well. The utility of big data is a function of the speed at which it can be accessed and used. For maximum utility, big data needs to be accurately catalogued, interrelated, and integrated with other big data sets. Variety refers to the many different types of data that are typically components of and are integrated into big data. Traditionally, data sets consist of a relatively small number of different types of data, like word-processed documents, graphs, and pictures.
Big data, on the other hand, is typically concerned with many additional types of information, such as emails, audio- and videotapes, sketches, artifacts, data sets, and many other kinds of quantitative and qualitative data. In addition, big data information is usually presented in many different languages, dialects, and tones. A key point that ACRL stresses is that as disciplines advance, the need for and the value of big data will increase. However, this advancement can be facilitated or inhibited by the degree to which the big data can be accessed and used. Within this context, librarians who are also information scientists are and will continue to be invaluable resources that can assist with the collection, storage, retrieval, and utilization of big data. Specifically, ACRL anticipates needs for specialists in the areas of big data management, big data security, big data cataloguing, big data storage, big data updating, and big data accessing.

Conclusion

The American Library Association and its member libraries, librarians, and information scientists are involved in shaping the future of big data. As disciplines and professions continue to advance with big data, librarians and information scientists' skills need to advance to enable them to provide valuable resources for strategists, decision-makers, policymakers, researchers, marketers, and many other big data users. The ability to effectively use big data will be a key to success as the world economy and its data sources expand. In this rapidly evolving environment, the work of the ALA will be highly valuable and an important human resource for business, industry, government, academic and research planners, decision-makers, and program evaluators who want and need to use big data.

Cross-References

▶ Automated Modeling/Decision Making
▶ Big Data Concept
▶ Big Data Quality
▶ Data Preservation
▶ Data Processing
▶ Data Storage

Further Reading

American Library Association. About ALA. http://www.ala.org/aboutala/. Accessed 10 Aug 2014.
American Library Association. Association for Library Collections and Technical Services. http://www.ala.org/alcts/. Accessed 10 Aug 2014.
American Library Association. Library Information Technology Association (LITA). http://www.ala.org/lita/. Accessed 10 Aug 2014.
Bieraugel, M. Keeping up with... big data. American Library Association. http://www.ala.org/acrl/publications/keeping_up_with/big_data. Accessed 10 Aug 2014.
Carr, P. L. (2014). Reimagining the library as a technology: An analysis of Ranganathan's five laws of library science within the social construction of technology framework. The Library Quarterly, 84(2), 152–164.
Federer, L. (2013). The librarian as research informationist: A case study. Journal of the Medical Library Association, 101(4), 298–302.
Finnemann, N. O. (2014). Research libraries and the Internet: On the transformative dynamic between institutions and digital media. Journal of Documentation, 70(2), 202–220.
Gordon-Murnane, L. (2012). Big data: A big opportunity for librarians. Online, 36(5), 30–34.

Animals

Marcienne Martin
Laboratoire ORACLE [Observatoire Réunionnais des Arts, des Civilisations et des Littératures dans leur Environnement], Université de la Réunion, Saint-Denis, France; Montpellier, France

The digital world compiles an exponentially growing mass of data. As Microsoft notes, "data volume is expanding tenfold every five years. Much of this new data is driven by devices from the more than 1.2 billion people who are connected to the Internet worldwide, with an average of 4.3 connected devices per person" (Microsoft_Modern_Data_Warehouse_white_paper.pdf, 2016, p. 6 – https://www.microsoft.com/fr-fr/sql-server/big-data-data-warehousing). How these data are distributed and redistributed, however, varies according to the topic concerned.
Thus, the animal world can be broken down according to a descriptive and analytical mode (biology, for example) but also through the emotional field of the human being.

The living world is based on the synthesis of complex molecular developments which have evolved towards an autocatalytic, reproductive, and evolutionary system (Calvin). Darwin was the precursor of many studies on the origin and evolution of species. In this regard, Philippe et al. (1995) indicate that present-day species carry in their genomes sequences inherited from a common progenitor. Eukaryotes form a set of lineages in which, along with animals, plants, and fungi, all the great biological groups are found. For most of us, these groups appear to constitute the bulk of the diversity of the living world and, moreover, contain our own species. This tree-like structure is shown in the diagram below (Lecointre and Le Guyader 2001) (Fig. 1).

Animals, Fig. 1 Diagram of the living world (the tree's labels in the original figure are Eubactéries (Eubacteria), Eucaryotes (Eukaryotes), and Archées (Archaea))

Communication, in whatever form, is the substructure that allows the various species of the living world to continue to exist in space and in time. The transmission of information also relies on various means of signaling and detection. At once predator and prey, living beings have developed their ways of life around the search for food and the protection of themselves and of their species. This mode of functioning corresponds to level 1 of Maslow's pyramid of needs, that of basic needs such as food and shelter. With the emergence of language in the hominid, a primate belonging to the simian group, communication began to use other tools. Indeed, the particularity of human beings is their thought, and more precisely their consciousness of their own existence, as affirmed by Descartes (2000) in his famous formula: Cogito ergo sum. Thought is associated with language, whose intentionality serves the adaptation of Homo sapiens to their environment through the creation and transmission of informative messages to their congeners.

In addition, both cognitive and language structures are subdivided into various layers, such as the representation of objects in the world and their symbolization. The relation between humans and animals in their functions of predator and prey is the basis of a reconstruction of the animal by the human being as part of a symbolic approach. The anthropologist Lévi-Strauss demonstrated that the concept of the totem was born from a human being's identification with certain animal characteristics, as among the Chippewa tribe, North American Indians, where people of the "fish clan" had little hair, those of the "bear clan" were distinguished by long, black hair and an angry and combative temperament, and those of the "clan of the crane" by a screaming voice (1962, p. 142). In contrast, we find anthropomorphized animals in some fairy tales, such as "Little Red Riding Hood" by Grimm, where the wolf plays the role of a carnivorous grandmother, or in fables, like those of La Fontaine.

The human imagination has also contributed to the reconstruction of the animal, as in Greek mythology with the Centaurs, hybrid beings, half human and half Equidae, or Medusa, one of the three Gorgons, whose hair was made of snakes. Some divine entities wear accessories belonging to the animal world, such as the devil with the horns worn by Bovidae or the angels whose wings refer to the species of birds. Superstition gives some animals protective or destructive powers: a black cat, for example, was associated with witchcraft in the Middle Ages, and in those times, when human beings found a swarm of bees attached to a tree in their garden, this phenomenon was considered a bad sign and they had to give a silver coin to these insects as a New Year's gift (Lacarrière 1987). The sacralization of the animal is another special relationship of the human being with animals, as with the bull-headed god Apis or the sacred cat in ancient Egypt.
public personality. The projection of the human being and animals between human entry in a register other than his own species or that of the animal in the human species may be born out of a telescoping of predator and prey roles played by all living and questioning the Human being. Modern technologies are at the origin of a new animal mythology with well-known animated films, such as those of Walt Disney and its various characters, such as Mickey Mouse or Donald Duck.
The representation of an object of the world evolves according to various factors, including the progress of science. Various studies have tried to understand the mode of thinking in the animal in comparison with that of the human being. Dortier (1998) specifies as well that everywhere in the living world animals exhibit more or less elaborated cognitive abilities. Furthermore, primatology, which is the science dedicated to the study of the species of primates, shows in the context of the phylogenetic filiations of the pygmy chimpanzees of Zaire and African chimpanzees that we share 98% of their genetic program (Diamond 1992, p. 10). This new approach to the human being in relation to animals, where it mentions his belonging to the animal world, may have changed the perception regarding the animal world. The protection of the animal, which is considered a sensitive being, has become widespread in the societies of the twenty-first century.
In its relation with the human being, the term animal includes two categories: the wild animal and the domestic animal. The latter lives on the personal territory of the human being and also enters their emotional field. In a search made with the help of the Google search engine (https://www.google.fr/search?q=hashtag&oq=Hastag&aqs=chrome.1.69i57j0l5.3573j0j7&sourceid=chrome&ie=UTF-8#q=twitter+animaux), the number of sites which express themselves with Twitter (https://twitter.com/?lang=fr) – a service which is used to relay short information from user to user – approximates the figure of 32,400,000 results. It is worth noting that the term "twitter" refers to the different songs emitted by birds (class of the Aves). There are also applications such as Hashtag (https://fr.wikipedia.org/wiki/Hashtag), which is a "meaningful continuation sequence of written characters without a space, beginning with the # sign (sharp)" (http://www.programme-tv.net/news/buzz/44259-twitter-c-est-quoi-un-hashtag/); YouTube (https://www.youtube.com/?gl=FR&hl=fr), which offers every user the possibility to create videos and put them online, allowing any Internet user to share their different experiences, whatever their nature, in their relationship with animals; or, again, Instagram (https://www.instagram.com/?hl=fr), which opens up the sharing of photos and videos between friends. An example that made the buzz on Instagram is that of Koyuki, the grumpy cat (https://fr.pinterest.com/pin/553802085412023724/).

Further Reading

Calvin, M. (1975). L'origine de la vie. La recherche en biologie moléculaire (pp. 201–222). Paris: Editions du Seuil.
Darwin, C. (1973). L'origine des espèces. Verviers: Marabout Université.
Descartes, R. (2000). Discours de la méthode. Paris: Flammarion.
Diamond, J. (1992). Le troisième singe – Essai sur l'évolution et l'avenir de l'animal humain. Paris: Gallimard.
Dortier, J. F. (1998). Du calamar à Einstein... L'évolution de l'intelligence. Le cerveau et la pensée – La révolution des sciences cognitives (pp. 303–309). Paris: Éditions Sciences humaines.
Lacarrière, J. (1987). Les évangiles des quenouilles. Paris: Imago.
Lecointre, G., & Le Guyader, H. (2001). Classification phylogénétique du vivant. Paris: Belin.
Lévi-Strauss, C. (1962). La pensée sauvage. Paris: Librairie Plon.
Maslow, A. (2008). Devenir le meilleur de soi-même – Besoins fondamentaux, motivation et personnalité. Paris: Eyrolles.
Philippe, H., Germot, A., Le Guyader, H., & Adoutte, A. (1995). Que savons-nous de l'histoire évolutive des eucaryotes ? 1. L'arbre universel du vivant et les difficultés de la reconstruction phylogénétique. Med Sci, 11, 8 (I–XIII), 1–2. http://www.ipubli.inserm.fr/bitstream/handle/10608/2438/MS_1995_8_I.pdf.
Anomaly Detection

Feras A. Batarseh
College of Science, George Mason University, Fairfax, VA, USA

Synonyms

Defect detection; Error tracing; Testing and evaluation; Verification

Definition

Anomaly Detection is the process of uncovering anomalies, errors, bugs, and defects in software to eradicate them and increase the overall quality of a system. Finding anomalies in big data analytics is especially important. Big data is "unstructured" by definition; hence, the process of structuring it is continually presented with anomaly detection activities.

Introduction

Data engineering is a challenging process. Different stages of the process affect the outcome in a variety of ways. Manpower, system design, data formatting, variety of data sources, size of the software, and project budget are among the variables that could alter the outcome of an engineering project. Nevertheless, software and data anomalies pose one of the most challenging obstacles in the success of any project. Anomalies have postponed space shuttle launches, caused problems for airplanes, and disrupted credit card and financial systems. Anomaly detection is commonly referred to as a science as well as an art. It is clearly an inexact process, as no two testing teams will produce the same exact testing design or plan (Batarseh 2012).

Anomaly Examples

The cost of failed software can be high indeed. For example, in 1996, a test flight of a European launch system, Ariane 5 # 501, failed as a result of an anomaly. Upon launch, the rocket veered off its path and was destroyed by its self-destruction system to avoid further damage. This loss was later analyzed and linked to a simple floating number anomaly. Another famous example is regarding a wholesale pharmaceutical distribution company in Texas (called: Fox Meyer Drugs). The company developed a resources planning system that failed right after implementation, because the system was not tested thoroughly. When Fox Meyer deployed the new system, most anomalies floated to the surface, and caused lots of users' frustration. That put the organization into bankruptcy in 1996. Moreover, three people died in 1986 when a radiation therapy system called Therac erroneously subjected patients to lethal overdoses of radiation. More recently however, in 2005, Toyota recalled 160,000 Prius automobiles from the market because of a software anomaly in the car's software. The mentioned examples are just some of the many projects gone wrong (Batarseh and Gonzalez 2015); therefore, anomaly detection is a critical and difficult issue to address.

Anomaly Detection Types

Although anomalies can be prevented, it is not an easy task to build fault-free software. Anomalies are difficult to trace, locate, and fix; they can occur due to multiple reasons, examples include: due to a programming mistake, miscommunication among the coders, a misunderstanding between the customer and the developer, a mistake in the data, error in the requirements document, a politically biased managerial decision, a change in the domain market standards, and multiple other reasons. In most cases, however, anomalies fall under one of the following categories (Batarseh 2012):
1. Redundancy – Having the same data in two or more places.
2. Ambivalence – Mixed data or unclear representation of knowledge.
3. Circularity – Closed loops in software; a function or a system leading to itself as a solution.
4. Deficiency – Inefficient representation of requirements.
5. Incompleteness – Lack of representation of the data or the user requirements.
6. Inconsistency – Any untrue representation of the expert's knowledge.

Different anomaly detection approaches that have been widely used in many disciplines are presented and described in Table 1.

Anomaly Detection, Table 1 Anomaly detection approaches
Anomaly detection approach | Short description
Detection through analysis of heuristics | Logical validation with uncertainty, a field of artificial intelligence
Detection through simulation | Result-oriented validation through building simulations of the system
Face/field validation and verification | Preliminary approach (used with other types of detection). This is a usage-oriented approach
Predictive detection | A software engineering method, part of testing
Subsystem testing | A software engineering method, part of testing
Verification through case testing | Result-oriented validation, achieved by running tests and observing the results
Verification through graphical representations | Visual validation and error detection
Decision trees and directed graphs | Visual validation – observing the trees, and the structure of the system
Simultaneous confidence intervals | Statistical/quantitative verification
Paired T-tests | Statistical/quantitative verification
Consistency measures | Statistical/quantitative verification
Turing testing | Result-oriented validation, one of the commonplace artificial intelligence methods
Sensitivity analysis | Result-oriented data analysis
Data collection and outlier detection | Usage-oriented validation through statistical methods and data mining
Visual interaction verification | Visual validation through user interfaces

However, based on a recent study by the National Institute of Standards and Technology (NIST), the data anomaly itself is not the quandary; it is actually the ability to identify the location of the anomaly. That is listed as the most time-consuming activity of testing. In their study, NIST researchers compiled a vast number of software and data projects and reached the following conclusion: "If the location of bugs can be made more precise, both the calendar time and resource requirements of testing can be reduced. Modern data and software products typically contain millions of lines of code. Precisely locating the source of bugs in that code can be very resource consuming." Based on that, it can be concluded that anomaly detection is an important area of research that is worth exploring (NIST 2002; Batarseh and Gonzalez 2015).

Conclusion

Similar to most engineering domains, software and data require extensive testing and evaluation. The main goal of testing is to eliminate anomalies, in a process referred to as anomaly detection. It is not possible to perform data analysis if the data has anomalies. Data scientists usually perform steps such as data cleaning, aggregation, filtering, and many others. All these activities require anomaly detection to be able to verify the data and provide valid outcomes. Additionally, detection leads to a better overall quality of a data system; therefore, it is a necessary and an unavoidable process. Anomalies occur for many reasons and in many parts of the system; many practices lead to anomalies (listed in this entry); locating them, however, is an interesting engineering problem.
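One of the approaches in Table 1, data collection and outlier detection, can be made concrete with a small statistical sketch. The example below flags values whose z-score exceeds a threshold; the readings and the threshold of 2 are illustrative assumptions rather than a prescribed procedure.

```python
# A minimal sketch of statistical outlier detection (the "data collection and
# outlier detection" row of Table 1). The readings and the z-score threshold
# are illustrative assumptions.
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [40, 42, 39, 41, 43, 38, 4050]
print(flag_outliers(readings))  # [4050] is flagged as a likely data anomaly
```

A data scientist would typically run such checks during the cleaning and filtering steps mentioned above, before any aggregation or analysis.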
Cross-References

▶ Data Mining

Further Reading

Batarseh, F. (2012). Incremental lifecycle validation of knowledge-based systems through CommonKADS. Ph.D. Dissertation Registered at the University of Central Florida and the Library of Congress.
Batarseh, F., & Gonzalez, A. (2015). Predicting failures in contextual software development through data analytics. Proceedings of Springer's Software Quality Journal.
Planning Report for NIST. (2002). The economic impacts of inadequate infrastructure for software testing. A report published by the US Department of Commerce.

Anonymity

Pilar Carrera
Universidad Carlos III de Madrid, Madrid, Spain

What matter who's speaking (Beckett)

"Anonymity" refers to the quality or state of being anonymous, from Greek anonymos and Latin anonymus, "what doesn't have a name or it is ignored, because it remains occult or unknown," according to the Diccionario de Autoridades of the Real Academia Española.
It designates not so much an absence as the presence of an absence, as Roland Barthes put it. The concept points out to the absence of a name for the receiver of a message (reader, viewer, critic, etc., reception instance which is constituent of "anonymity"), the absence of "signature," following Derrida. Anonymity is therefore closely linked to the forms of mediation, including writing. It implies the power to remain secret (without name) as author for a given audience. Seen from a discursive point of view, anonymity concerns associated with big data analysis are related to the generation of consistent narratives from massive and diverse amounts of data.
If we examine the concept from a textual perspective, we have to relate it to that of "author." When speaking of "anonymous author," we are already establishing a difference, taking up Foucault's terms, between the concepts of proper name (corresponding to the civilian, the physical, empirical individual; as Derrida pointed out: "the proper name belongs neither to language nor to the element of conceptual generality") and name of the author (situated in the plane of language, operating as a catalyst of textualities, as the lowest common denominator which agglutinates formal, thematic, or rhetoric specificities from different texts unified by a "signature"). If there is no author's name – "signature" – that is, if the text appears to be anonymous (from an "anonymous author"), this rubric loses the function of catalyst to become a generator of intransitive and individualized textualities, unable to be gathered into a unified corpus.
It is important to understand that the author we are talking about is not an empirical entity but a textual organizer. It does not necessarily match either the name of the empirical author, since a pseudonym could as well perform this function of textual organizer, because it keeps secret the proper name of the emitter. Foucault (1980: 114) clearly explained this point, in relation to the presence of names of authors in his work, and what the name of the author (linked to what he calls the "authorial function") meant for him in theoretical terms, which some critics, dealing with his writings, confused with the empirical subject (the "proper name"): "They ignored the task I had set myself: I had no intention of describing Buffon or Marx or of reproducing their statements or implicit meanings, but, simply stated, I wanted to locate the rules that formed a certain number of concepts and theoretical relationships in their works." Barthes (1977: 143) also alluded to the confusion between the proper name and the name of the author and its consequences: "Criticism still consists for the most part in saying that Baudelaire's work is the failure of Baudelaire the man." Jean-Paul Sartre (1947) was one of the most famous victims of that misunderstanding, reading Baudelaire's poems from a Freudian approach to the author's life and family traumas.
The words "Marx," "Buffon," or "Baudelaire" do not point to certain individuals with certain convictions, biographical circumstances, or specific styles, but to a set of textual regularities. In this sense, anonymity, in the context of the authorial function, points toward a relational deficit. To identify regularities, a minimum number of texts is required (a textual "family") that permit to be gathered together through their "belonging" to the same "signature." This socializing function of the nonanonymous author (becoming the centripetal force which allows that different texts live together) vanishes in the case of anonymous authors (or those which made use of different pseudonyms).
Let's think, for example, of a classic novel, arrived to us under the rubric "anonymous," whose author is, by chance, identified and given a name. From that moment on, the work will be "charged" with meanings from its incorporation to the collection of works signed by the author now identified. Similarly, the image we have today, for example, of a writer, politician, or philosopher, would be altered, i.e., reconstituted, if we found out that, avoiding his public authorial name, he had created texts whose ideological or aesthetic significance were inconsistent with his official production. Let us consider, for example, the eighteenth century fabulists (for instance, the French La Fontaine or the Spaniard Samaniego), whose official logic was one of a strict Christian morality, whereas some of their works, which remained anonymous for a while and are today attributed to them, could be placed within the realm of vulgar pornography.
In the textual Internet's ecosystem, anonymity has become a hotspot for different reasons, and the issue is usually related to:

1. Power, referring to those who control the rules underlying Internet narratives (the programming that allows content display by users) and are able to take over the system (including hackers and similar characters; the Anonymous organization would be a good example, denomination included, and because of the paradox manifested on it of branded, i.e., publicized anonymity). Those who are able to determine the expressive and discursive modalities, subsequently fed by users' activity, usually remain hidden or secret, i.e., anonymous.
2. Extension of the above: anonymity as the ability to see without being seen. In this case, anonymity deepens the information gap (inequality of knowledge and asymmetric positions in the communication process). Those who are able to remain nameless are situated in a privileged position with respect to those who hold a name, because, among other things, they do not leave traces, they can hardly be tracked, they have no "history," therefore no past. It is no coincidence that when the "right" to anonymity is claimed by Internet users, it is formulated in terms of "right to digital oblivion." Anonymous is the one that cannot be remembered. Anonymous is also the one who can see without being seen. In all cases, it implies an inequality of knowledge and manifests the oscillation between ignorance and knowledge.
3. Anonymity as a practice that permits some acts of speech to go "without consequences" for the civilian person (for example, in the case of anonymous defamation or defamation practiced under false names, or in the case of leaks), eluding potential sanctions (this brings us back to those authors forced to remain secret and hide their names in order to avoid being punished for their opinions, etc.). In this sense, anonymity may contribute to the advancement of knowledge by allowing the expression of certain individuals or groups whose opinions or actions won't be accepted by the generality of the society (for example, the case of Mary Shelley, Frankenstein's author, whose novel remained unsigned for a quarter of a century).
4. Anonymous is also the author who renounces the fame of the name, leaving the "auctoritas" to the text itself; the text itself would assume that role, backing what is stated by the strength of its own argumentative power. In this sense,
anonymity is very well suited to permeate habits and customs. Anonymity also facilitates appropriation (for instance, in the case of plagiarism), reducing the risks of sanctions derived from the "private property of meaning" (which is what the signature incorporates to the text). As Georg Simmel wrote: "The more separate is a product from the subjective mental activity of its creator, the more it accommodates to an objective order, valid in itself, the more specific is its cultural significance, the more appropriate is to be included as a general means in the improvement and development of many individual souls (. . .) realisations that are objectified at great distance from the subject and to some extent lend 'selflessly' to be the seasons of mental development."
5. Anonymity, in the unfolding discourse about mass media, has also been associated with the condition of massive and vicarious reception, made possible by the media, by the anonymous masses. In this sense, anonymity is associated with the indistinct, the lack of individuality, and the absence of the shaping and differentiating force of the name. As we see, extremes meet and connotations vary depending on the historical moment. Anonymity can both indicate a situation of powerlessness (referring, for example, to the masses) and a position of power (in the case, for example, of hackers or organizations or individuals who "watch" the Internet traffic without being noticed). Users' "empowerment" through the Internet and the stated passage from massive audiences to individualized users does not necessarily incorporate changes in authorial terms, because, as we have seen, we should not confuse the author's name and the proper name. In the same way, authorial commitment and civilian commitment should be distinguished. In this sense, Walter Benjamin wrote in "The Author as Producer" (Benjamin 1998: 86): "For I hope to be able to show you that the concept of commitment, in the perfunctory form in which it generally occurs in the debate I have just mentioned, is a totally inadequate instrument of political literary criticism. I should like to demonstrate to you that the tendency of a work of literature can be politically correct only if it is also correct in the literary sense. That means that the tendency which is politically correct includes a literary tendency. And let me add at once: this literary tendency, which is implicitly or explicitly included in every correct political tendency, this and nothing else makes up the quality of a work. It is because of this that the correct political tendency of a work extends also to its literary quality: because a political tendency which is correct comprises a literary tendency which is correct." In this sense, all writing that makes a difference is anonymous from the point of view of the "proper name." This means that a consummated writing process inevitably leads to the loss of the proper name and designates the operation by which the individual who writes reaches anonymity and then becomes an author (anonymous or not).
6. Anonymity concerns related to big data should take into account the fact that those that "own" and sell data are not necessarily the same that generate those narratives, but in both cases the economic factor and the logic of profit optimization, along with the implementation of control and surveillance programs, are paramount. The "owners" are situated at the informational level, according to Shannon and Weaver's notion of information. They established the paradigmatic context, the "menu," within whose borders syntagmatic storytelling takes place through a process of data selection and processing. Users' opinions and behaviors, tracked through different devices connected to the Internet, constitute the material of what we may call software-driven storytelling. The fact that users' information may be turned against them when used by the powers that be, which is considered one of the main privacy threats related to big data, reflects the fact that individuals, in the realm of mass media, have become "storytelling fodder," which is probably the most extreme and oppressive
form of realism. Driven by institutionalized sources and power structures, the reading contract that lies beneath these narratives and the modes of existence of these discourses are structurally resilient to dissent.
In all these cases, the absence or the presence of the name of the author, and specifically anonymity, has to be considered as an institutionalized textual category consummated during the moment of reception/reading (emitting by itself does not produce anonymity, a process of reception is required; there is no anonymity without a "reading"), because it implies not so much a quality of the text as a "reading contract." As Foucault (1980: 138) said, in a context of authorial anonymity, the questions to be asked will not be such as: "Who is the real author?," "Have we proof of his authenticity and originality?," "What has he revealed of his most profound self in his language?," but questions of a very different kind: "What are the modes of existence of this discourse?," "Where does it come from; how it is circulated; who controls it?," "What placements are determined for possible subjects?," "Who can fulfill these diverse functions of the subject?" It seems clear that the implications of considering one or another type of questions are not irrelevant, not only artistically or culturally, but also from a political perspective.

Further Reading

Barthes, R. (1977). The death of the author (1967). In Image, music, text. London: Fontana Press.
Benjamin, W. (1998). The author as producer (1934). In Understanding Brecht. London: Verso.
Derrida, J. (1988). Signature event context (1971). In Limited Inc. Chicago: Northwestern University Press.
Derrida, J. (2013). Biodegradables (1988). In Signature Derrida. Chicago: University of Chicago Press.
Foucault, M. (1980). What is an author (1969). In Language, counter-memory, practice: Selected essays and interviews. New York: Cornell University Press.
Sartre, J.-P. (1947). Baudelaire. Paris: Gallimard.
Simmel, G. (1908). Das Geheimnis und die geheime Gesellschaft. In Soziologie. Untersuchungen über die Formen der Vergesellschaftung. Leipzig: Duncker & Humblot.

Anonymization Techniques

Mick Smith1 and Rajeev Agrawal2
1North Carolina A&T State University, Greensboro, NC, USA
2Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, USA

Synonyms

Anonymous data; Data anonymization; Data privacy; De-Identification; Personally identifiable information

Introduction

Personal information is constantly being collected on individuals as they browse the internet or share data electronically. This collection of information has been further exacerbated with the emergence of the Internet of things and the connectivity of many electronic devices. As more data is disseminated into the world, interconnected patterns are created connecting one data record to the next. The massive data sets that are collected are of great value to businesses and data scientists alike. To properly protect the privacy of these individuals, it is necessary to de-identify or anonymize the data. In other words, personally identifiable information (PII) needs to be encrypted or altered so that a person's sensitive data remains indiscernible to outside sources and readable to the pre-approved parties. Some popular anonymization techniques include noise addition, differential privacy, k-anonymity, l-diversity, and t-closeness.
The need for anonymizing data has come from the availability of data through big data. Cheaper storage, improved processing capabilities, and a greater diversity of analysis techniques have created an environment in which big data can thrive. This has allowed organizations to collect massive amounts of data on the customer/client base. This information in turn can then be subjected to a
variety of business intelligence applications so as to improve the efficiency of the collecting organization. For instance, a hospital can collect various patient health statistics over a series of visits. This information could include vital statistics measurements, family history, frequency of visits, test results, or any other health-related metric. All of this data could be analyzed to provide the patient with an improved plan of care and treatment, ultimately improving the patient's overall health and the facility's ability to provide a diagnosis.
However, the benefits that can be realized from the analysis of massive amounts of data come with the responsibility of protecting the privacy of the entities whose data is collected. Before the data is released, or in some instances analyzed, the sensitive personal information needs to be altered. The challenge comes in deciding upon a method that can achieve anonymity and preserve the data integrity.

Noise Addition

The belief with noise addition is that by adding noise to data sets the data becomes ambiguous and the individual subjects will not be identified. The noise refers to the skewing of an attribute so that it is displayed as a value within a range. For instance, instead of giving one static value for a person's age, it could be adjusted ±2 years. If the subject's age is displayed as 36, the observer would not know the exact value, only that the age may be between 34 and 38. The challenge with this technique comes in identifying the appropriate amount of noise. There needs to be enough to mask the true attribute value, while at the same time preserving the data mining relationships that exist within the dataset.
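A minimal sketch of this idea, assuming a small set of records and the ±2-year window from the example above (the field name and the records are hypothetical), could look as follows:

```python
# A minimal sketch of noise addition: shift a numeric attribute by a small random
# offset. The records, the "age" field, and the +/-2-year window are assumptions.
import random

def add_noise(records, field="age", spread=2):
    """Return copies of the records with the chosen field randomly perturbed."""
    noisy = []
    for rec in records:
        perturbed = dict(rec)
        perturbed[field] = rec[field] + random.randint(-spread, spread)
        noisy.append(perturbed)
    return noisy

patients = [{"id": 1, "age": 36}, {"id": 2, "age": 52}]
print(add_noise(patients))  # e.g. [{'id': 1, 'age': 34}, {'id': 2, 'age': 53}]
```

Choosing the width of the offset is exactly the trade-off described above: too little noise fails to mask the true value, while too much destroys the data mining relationships in the set.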

Differential Privacy

Differential privacy is similar to the noise addition technique in that the original data is altered slightly to prevent any de-identification. However, it is done in a manner that if a query is done on two databases that differ in only one row, the information contained in the missing row is not discernable. Cynthia Dwork provides the following definition:

A randomized function K gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(K),

Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S]

As an example, think of a database containing the incomes of 75 people in a neighborhood where the average income is $75,000. If one person were to leave the neighborhood and the average income dropped to $74,000, it would be easy to identify the income of the departing individual. To overcome this, it would be necessary to apply minimum noise so that the average income before and after would not be representative of the change. At the same time, the computational integrity of the data is maintained. The amount of noise and whether an exponential or Laplacian mechanism is used is still subject to ongoing research/discussion.
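A hedged sketch of the Laplace mechanism applied to the neighborhood income example is given below; the income cap, the value of ε, and the sample incomes are assumptions chosen for illustration, not recommended settings.

```python
# A minimal sketch of the Laplace mechanism for an epsilon-differentially-private
# average. The income cap, epsilon, and the sample incomes are assumptions.
import math
import random

def dp_average(values, epsilon=0.5, upper_bound=500_000):
    """Release a noisy average of values clipped to [0, upper_bound]."""
    clipped = [min(max(v, 0), upper_bound) for v in values]
    true_avg = sum(clipped) / len(clipped)
    # Changing one record moves the average by at most upper_bound / n.
    scale = (upper_bound / len(clipped)) / epsilon
    # Draw Laplace(0, scale) noise by inverse transform sampling.
    u = random.uniform(-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_avg + noise

incomes = [72_000, 81_000, 65_000, 90_000, 77_000]
print(round(dp_average(incomes)))  # true average is 77,000; with only five
                                   # records the released value is very noisy
```

The large noise on such a small group is the point of the definition: the released average should look much the same whether or not any single resident's income is in the database.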
K-Anonymity

In the k-anonymity algorithm, two common methods for anonymizing data are suppression and generalization. By using suppression, the values of a categorical variable, such as name, are removed entirely from the data set. With generalization, quantitative variables, such as age or height, are replaced with a range. This in turn makes each record in a data set indistinguishable from at least k–1 other records. One of the major drawbacks to k-anonymity is that it may be possible to infer identity if certain characteristics are already known. As a simple example consider a data set that contains credit decisions from a bank (Table 1). The names have
been omitted, the age categorized, and the last two digits of the zip code have been removed.

Anonymization Techniques, Table 1 K-anonymity credit example
Age | Gender | Zip | Credit decision
18–25 | M | 149** | Yes
18–25 | M | 148** | No
32–39 | F | 149** | Yes
40–47 | M | 149** | Yes
25–32 | F | 148** | No
32–39 | M | 149** | Yes

This obvious example is for the purposes of demonstrating the weakness of a potential homogeneity attack in k-anonymity. In this case, if it was known that a 23-year-old man living in zip code 14999 was in this data set, the credit decision information for that particular individual could be inferred.
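The generalization and suppression steps behind a table like Table 1 can be sketched as follows; the 7-year age buckets, the zip masking, and the sample rows simply mirror the example and are assumptions, not prescribed parameters.

```python
# A minimal sketch of k-anonymity style generalization: ages become ranges and
# the last two zip digits are masked. Bucket width and sample rows are assumptions.
def generalize(record):
    age, gender, zip_code, decision = record
    lower = 18 + 7 * ((age - 18) // 7)      # generalize age into a coarse range
    masked_zip = zip_code[:3] + "**"        # hide the last two digits of the zip
    return (f"{lower}-{lower + 7}", gender, masked_zip, decision)

raw = [(23, "M", "14999", "Yes"), (36, "F", "14901", "Yes"), (44, "M", "14902", "Yes")]
for row in raw:
    print(generalize(row))
# ('18-25', 'M', '149**', 'Yes'), ('32-39', 'F', '149**', 'Yes'), ...
```

Names would be suppressed outright before release; the homogeneity attack described above remains possible whenever every record in a generalized group shares the same credit decision.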
L-Diversity

L-diversity can be viewed as an extension to k-anonymity in which the goal is to anonymize specific sensitive values of a data record. For instance, in the previous example, the sensitive information would be the credit decision. As with k-anonymity, generalization and suppression techniques are used to mask the true values of the target variable. The authors of the l-diversity principle, Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramniam, define it as follows:

A q*-block is l-diverse if it contains at least l well-represented values for the sensitive attribute S. A table is l-diverse if every q*-block is l-diverse.

The concept of well-represented has been defined in three possible methods: distinct l-diversity, entropy l-diversity, and recursive (c, l)-diversity. A criticism of the l-diversity model is that it does not hold up well when the sensitive value has a minimal number of states. As an example, consider the credit decision table from above. If that table were extended to include 1000 records and 999 of them had a decision of "yes," then l-diversity would not be able to provide sufficient equivalence classes.
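In its simplest, distinct form, the property can be checked by counting the sensitive values inside each generalized group, as in the sketch below; the grouping keys, the sample rows, and the requirement l = 2 are assumptions made for illustration.

```python
# A minimal sketch of a distinct l-diversity check over generalized records.
# Each record is (age range, gender, zip, credit decision); l = 2 is an assumption.
from collections import defaultdict

def is_l_diverse(records, l=2):
    """True if every quasi-identifier group holds at least l distinct sensitive values."""
    groups = defaultdict(set)
    for age_range, gender, zip_code, decision in records:
        groups[(age_range, gender, zip_code)].add(decision)
    return all(len(decisions) >= l for decisions in groups.values())

released = [("18-25", "M", "149**", "Yes"), ("18-25", "M", "149**", "No"),
            ("32-39", "F", "148**", "Yes"), ("32-39", "F", "148**", "Yes")]
print(is_l_diverse(released))  # False: the second group only ever answers "Yes"
```

This is also where the criticism above bites: with a sensitive attribute that is almost always "yes," very few groups can reach even l = 2.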
T-Closeness

Continuing with the refinement of de-identification techniques, t-closeness is an extension of l-diversity. The goal of t-closeness is to create equivalence classes that approximate the original distribution of the attributes in the initial database. Privacy can be considered a measure of information gain. T-closeness takes this characteristic into consideration by assessing an observer's prior and posterior belief about the content of a data set as well as the influence of the sensitivity attribute. As with l-diversity, this approach hides the sensitive values within a data set while maintaining association through "closeness." The algorithm uses a distance metric known as the Earth Mover Distance to measure the level of closeness. This takes into consideration the semantic interrelatedness of the attribute values. However, it should be noted that the distance metric may differ depending on the data types. This includes the following distance measures: numerical, equal, and hierarchical.

Conclusion

To be effective, each anonymization technique should protect against the following risks: singling out, linkability, and inference. Singling out is the process of isolating data that could identify an individual. Linkability occurs when two or more records in a data set can be linked to either an individual or grouping of individuals. Finally, inference is the ability to determine the value of the anonymized data through the values of other elements within the set. An anonymization approach that can mitigate these risks should be considered robust and will reduce the possibility of re-identification. Each of the techniques presented addresses each of these risks differently. The following table outlines their respective performance (Table 2):

Anonymization Techniques, Table 2 Anonymization algorithm comparison
Technique | Singling out | Linkability | Inference
Noise addition | At risk | Possibly | Possibly
K-anonymity | Not at risk | At risk | At risk
L-diversity | Not at risk | At risk | Possibly
T-closeness | Not at risk | At risk | Possibly
Differential privacy | Possibly | Possibly | Possibly

For instance, unlike k-anonymity, l-diversity and t-closeness are not subject to inference attacks that utilize the homogeneity or background knowledge of the data set. Similarly, the three generalization techniques (k-anonymity, l-diversity, and t-closeness) all present differing levels of association that can be made due to the clustering nature of each approach.
As with any aspect of data collection, sharing, publishing, and marketing, there is the potential
for malicious activity. However, the benefits that can be achieved from the potential analysis of such data cannot be overlooked. Therefore, it is extremely important to mitigate such risks through the use of effective de-identification techniques so as to protect sensitive personal information. As the amount of data becomes more abundant and accessible, there is an increased importance to continuously modify and refine existing anonymization techniques.

Further Reading

Dwork, C. (2006). Differential privacy. In Automata, languages and programming. Berlin: Springer.
Li, Ninghui, et al. (2007). t-Closeness: Privacy beyond k-anonymity and l-diversity. IEEE 23rd International Conference on Data Engineering, 7.
Machanavajjhala, A., et al. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), Article 3, 1–12.
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5).
The European Parliament and of the Council Working Party. (2014). Opinion 05/2014 on anonymisation techniques. http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf. Retrieved on 29 Dec 2014.

Anonymous Data

▶ Anonymization Techniques

Anthropology

Marcienne Martin
Laboratoire ORACLE [Observatoire Réunionnais des Arts, des Civilisations et des Littératures dans leur Environnement] Université de la Réunion Saint-Denis France, Montpellier, France

Irrespective of the medium applied, information compiles data relating to a given study object. This is the case with anthropology. Indeed, diversity translated through language, culture, or social structure is an important source of information. Concerning the study of human beings, answers have differed according to the different epochs and cultures. Anthropology as a field of scientific research began in the nineteenth century. It derived from anthropometry, a science dedicated to the dimensional particularities of the human being. Buffon, with the study Traité des variations de l'espèce humaine (Study on the Variation of the Human Species) (1749), and Pierre-Paul Broca, the founder of the Society of Anthropology of Paris (1859), are considered France's forerunners of this science. In the era of the Internet, data has become accessible to nearly anyone wishing to consult them. In the free encyclopedia Wikipedia, there are over 1,864,000 articles in the French language. As for the eleven thematic portals – art, geography, history, leisure, medicine, politics, religion, science, society, sport, technology – they subsume 1,636 portals, always in the French language. Anthropology is one of the entries of this encyclopedia. Anthropology refers to the science dedicated to the study of the human as a whole, either at the physical level, as it belongs to the animal world, or in the context of both its environment and history when analyzed from the perspective of different human groups who have been observed. From an etymological point of view, the term "anthropology" stems from the Greek "Anthropos," which contrasts the human to the gods; moreover, the Greek word "logos" refers to "science, speech." The anthropologist, a specialist in this field of research, is written in the Greek
language as follows: ἀνθρωπολόγος. Other related sciences, such as anthropology, sociology, etc., also study Homo sapiens, but in a particular context (ethnicity, sociocultural substrate . . .).
In general, a human being needs references. Homo sapiens has always responded to questioning, however strange, with no less singular explanations, sometimes validated by the phenomenon of beliefs or by hypotheses made by them, perhaps depending on the evolution of technology, through verifiable hypotheses (dark matter, dark energy . . .). These interrogations are at the origin of the creation of mythologies, of religions, and of philosophies. So, to answer the question of the origin and the meaning of natural phenomena, such as thunderstorms, volcanoes, and storms, diverse beliefs have attributed these creations to the deities, sometimes in response to human behaviors considered as negative; hence, these deities have developed these natural disasters. When these phenomena were understood scientifically, the responses related to these beliefs disappeared. Why we live in order to finally die is a question that has not been answered satisfactorily yet, except through many religions where beliefs and their responses are posed as postulates.
Nourished by the richness of imagination in humans, philosophy is a mode of thinking which tries to provide responses to various questions concerning the meaning of life, the various human behaviors and their regulations (moral principles), death, and the existence or inexistence of an "architect" at the origin of the world of matter. Concerning the language as a method of transmission, regardless of the tool used, the thought is based on complex phenomena. The understanding of the objects in the world induces different forms of reasoning, such as logical reasoning and analogical reasoning. Some types of reasoning include a discourse de facto, regardless of the type of reasoning (concessive reasoning, dialectic reasoning, by reductio ad absurdum) and whatever its form (oral, written). In contrast, both the inductive and deductive reasoning types, even if they are integrated in the processes of discursive types, are correlated to the description of the objects of the world and to their place in the human paradigm. As for logical reasoning, the observed object in its contextual relationships is taken into consideration and the concluding speech serves as the culmination of the progress of thought. There is also another type of reasoning, analogical reasoning. In this cognitive process, an unknown object, or one whose given parameters are partly incomprehensible, is put into relation with something known which it resembles, at least according to the observer's perception.
Between differentiation and analogy, the Human has built different paradigms, one of which has incorporated divine entities. In contrast, some are elaborated from elements of objects in the world, e.g., the Centaur, half horse, half human. Others use the imaginary as a whole, which corresponds to the rewriting of the objects of the world in a recomposition ad infinitum. For the Greeks, the principle of anthropomorphization of phenomena as yet unknown or of objects whose origin was still unexplainable (Gaia the Earth) has been extended to some objects such as the Night, the Darkness, the Death, the Sleep, the Love, and the Desire (Gomperz 1908).
The major revolution that has given a new orientation to the study of human beings, as a species which belongs to the world of the living, was the hypothesis made by the English naturalist Darwin in 1859 about the origin of species, their variability, their adaptation to their environment, and their evolution. If the laws of heredity were not yet known at this period, it is the Czech-German monk and botanist Gregor Johann Mendel who developed the three laws of heredity, known as Mendel's laws, in 1866 after 10 years of study on hybridization in plants. In the journal Nature, in a paper published in 1953, the researchers James Watson and Francis Crick demonstrated the existence of the double helical structure of DNA. According to Crick, the analysis of the human genome is an extraordinary approach which is applied to the human being with respect to both its development and physiology, with the benefits that it can give, such as in the medical field.
In addition, other researchers have worked on the phenomenon of entropy and negative entropy at the origin of the transformation of units composing the living world. Schrödinger (1993) has
put in relation entropy, energy exchanges, probability, and the order of life. Moreover, Monod (1960) evokes the emergence of life and, implicitly, that of the human as pure chance. Prigogine (1967) reassesses the question of asking about the nature of the living world and, therefore, the human, based on the main principles of physics as well as those of thermodynamics; the first principle affirms the conservation of energy by all systems and the second principle (Boltzmann's order principle) holds that an isolated system evolves spontaneously toward a state of balance which corresponds to maximum entropy.
Biosociology is a particular approach applied to the world of the living. So the research of the ethologist Jaisson (1993) addresses the social insects, including ants. This author shows that there is a kind of programming that is the cause of the behavior of the species Formicus (ant) belonging to the order of Hymenoptera. This study is similar to that done by Dawkins (2003), an ethologist who supported the evolutionary theory of Darwin but posits that natural selection would be initiated by the gene through an existing program, not by the species. This observation puts the innate and the acquired as parameters of human culture into question.
These studies have opened an exploratory field to the evolutionary biologists, such as Diamond (1992), who showed the phylogenetic similarity between the pygmy chimpanzee of Zaire, the common chimpanzee from Africa, and Homo sapiens. These results are based on the molecular genetic studies which have shown that we share over 98% of our genetic program with these primates. The 2% which make the difference are somehow the magical openings which allow us, in our role as human beings, to access the understanding of the universe in which we live. This understanding is correlated to the awareness of existence and its manifestation through discourse and language. Leroi-Gourhan (1964) has stipulated that two technical centers in many vertebrates result in anthropoids in the formation of two functional pairs (hand-tool and face-language). The emergence of the graphic symbol at the end of the reign of the Paleanthropien entails forging new relationships between two operative poles. In this new relationship, the vision holds the greatest place in the pairs: face-reading and hand-graphy.
If we continue the analysis between hominids and other members of the living world, we find that the observation of the environment made by the whole of the living world is intended to protect the species and the diffusion relative to their survival. These operating modes involve the instinct, a biological determinism which, in a particular situation, responds with a special behavior and refers to a basic programming more or less adaptable depending on the species. Moreover, whether breeding rituals, love rituals, or the answers to a situation of aggression, behaviors will be similar from one member of the species to another; indeed, the survival of the species takes priority over that of the member of the species. Among the large primates, these answers become more appropriate and they open the field of a form of creativity. In an article on chimpanzees, Servais (1993) states that these primates do not have any language; they communicate only by manipulating their behavior. They are able, for the most intelligent of them, to associate, to form coalitions, to conclude pacts, or to have access to a form of concept thought. They have forms of "protoculture"; the most famous is without doubt the habit of washing one's potatoes in some groups of Japanese macaques, but they have no cultural production; they have a typical social organization in relation with their species, but they have no written or oral laws. This punctual creativity among the great simians has grown exponentially in humans; it is that form of creativity which opened the field of the imaginary.
If the questioning concerning the innate and the acquired has been the subject of various experiences, the study of diverse ethnic groups demonstrates that through culture the adaptation of the human is highly diversified. This is due to the genealogical chain which shows the phenomenon of nomination, which, in turn, is in resonance with the construction of individual identity. The anthropologist and ethnologist Lévi-Strauss (1962) evokes different modes of naming in use in ethnic groups like the Penan of Borneo as, e.g., the tecknonym meaning "father of such a" or
"mother of such a," or the necronym, which expresses the family relationship existing with a deceased relative and the individually named. Emperaire (1955), a researcher at the Musée de l'Homme in Paris, gives the example of the Alakalufs, an ethnic group living in Tierra del Fuego, which does not name the newborn at birth; the children do not receive a name; it is only when they begin to talk and walk that the father chooses one. Other systems of the genealogical chain exist, such as those designated as "rope," which correspond to a link which groups a man, his daughter, and the son of his daughter, or a wife, son, and daughters of his son (Mead 1963).
Cultural identity is articulated around specific values belonging to a particular society and defining it; they have more or less strong emotional connotations; thus a taboo object should be lived out as a territory not to transgress, because the threat of various sanctions exists, including the death of the transgressor. The anthropologist Mead (1963) has exemplified this phenomenon by studying the ethnic group of the Arapesh, who were living in the Torricelli Mountains in New Guinea at the time when the author was studying their way of life (1948). The territories of the male group and the territories of the female group were separated by territories marked as taboos; concerning the flute, an object belonging to the male group, for the female group this object was prohibited. The semantic content of certain lexical items may differ from one ethnic group to another, and even sometimes may become an antinomy. Mead cites the Arapesh and the Mundugumor, two ethnic groups that have developed their identity through entirely different moral values and behaviors. Thus, Arapesh society considers each member as sweet and helpful and wants to avoid violence. In contrast, in the ethnic group of the Mundugumor, their values are the antonyms of those of the ethnic group of the Arapesh.
As for big data, the implications can vary from one culture to another: highlighting history, traditions, social structure, the official language, etc. The addition of data by Internet users, according to their desires and their competencies, contributes to the development of the free encyclopedia Wikipedia. Each user can contribute by adding an article or correcting it. The user of the Internet then plays the role of a contributor, i.e., writer and corrector; he or she can also report false information appearing in the context of articles written by other Internet users. This multiple role is equivalent to what the philosopher and sociologist Pierre Lévy calls "collective intelligence," namely, the interactions of the cognitive abilities of the members of a given group enabling them to participate in a common project.
Within the framework of more specialized research domains, many university websites on the Internet offer books and magazines, which cannot always be consulted free of charge or without prior registration. Today's access to big data differs from the period before the arrival of digital technologies, when only the university libraries were able to meet the demands of students and researchers, both in terms of the availability of works and of their varieties.
Access to online knowledge has exponentially multiplied the opportunity for anyone to improve their knowledge within a given field of study.

Further Reading

Dawkins, R. (2003). Le gène égoïste. Paris: Éditions Odile Jacob.
De Buffon, G.-L. (1749–1789). Histoire Naturelle générale et particulière: avec la description du Cabinet du Roy, par Buffon et Daubenton. Version en ligne au format texte. http://www.buffon.cnrs.fr/index.php?lang=fr#hn.
Diamond, J. (1992). Le troisième singe – Essai sur l'évolution et l'avenir de l'animal humain. Paris: Gallimard.
Emperaire, J. (1955). Les nomades de la mer. Paris: Gallimard.
Gomperz, T. (1908). Griechische Denker: eine Geschichte der antiken Philosophie. Les penseurs de la Grèce: histoire de la philosophie antique (Vol. 1). Lausanne: Payot. http://catalogue.bnf.fr/ark:/12148/cb30521143f.
Jaisson, P. (1993). La fourmi et le sociobiologiste. Paris: Éditions Odile Jacob.
Leroi-Gourhan, A. (1964). Le Geste et la Parole, première partie: Technique et langage. Paris: Albin Michel.
Lévi-Strauss, C. (1962). La pensée sauvage. Paris: Plon.
Lévy, P. (1997). L'intelligence collective – Pour une anthropologie du cyberespace. Paris: Éditions La Découverte Poche.
Mead, M. (1963). Mœurs et sexualité en Océanie – Sex and temperament in three primitive societies. Paris: Plon.
Monod, J. (1960). Le hasard et la nécessité. Paris: Seuil.
Prigogine, I. (1967). Introduction to thermodynamics of irreversible processes. New York: John Wiley Interscience.
Roger, J. (2006). Buffon. Paris: Fayard.
Schrödinger, E. (1993). Qu'est-ce que la vie ?: De la physique à la biologie. Paris: Seuil.
Servais, V. (1993). Les chimpanzés: un modèle animal de la relation clientélaire. Terrain, 21. https://doi.org/10.4000/terrain.3073. http://terrain.revues.org/3073.
Wiener, N. (1948). Cybernetics or control and communication in the animal and the machine. Cambridge, MA: MIT Press.

Antiquities Trade, Illicit

Layla Hashemi and Louise Shelley
Terrorism, Transnational Crime, and Corruption Center, George Mason University, Fairfax, VA, USA

The cyber environment and electronic commerce have democratized the antiquities trade. Previously, the antiquities trade consisted of niche networks of those associated with galleries and auction houses. While the trade was transnational, there were high barriers to entry preventing the average person from becoming involved. Today's antiquities market has been democratized by the internet and online platforms that allow for the free and often open sale of cultural property and the availability of ancient goods, particularly coins, often at affordable prices. Therefore, what was once a high-end and exclusive commodity is now available to a much broader and less sophisticated customer base. Identifying the actors behind this trade and understanding the extent and profits of this trade requires large-scale data analytics.
Many vendors use open web platforms such as VCoins, Etsy, eBay, and other marketplaces to advertise and sell their products. These sales are possible because Section 230 of the US Communications Decency Act releases websites from responsibility for the content posted on their platforms, allowing criminals to conduct their business online with near total impunity. Recent analysis in 2020 revealed that around 2.5 million ancient coins are being offered on eBay annually, with actual sales estimated at $26–59 million (Wartenberg and Brederova 2021). This massive supply of coins readily available to customers could only be achieved through the extensive looting of archaeological sites in the Middle East.
With a low barrier to entry, online marketplaces allow any individual interested in cultural heritage to collect information and even purchase the items at affordable costs. Because the antiquities trade is a gray trade where there is a mixing of licit and illicit goods, these transactions are often completed with impunity. Therefore, sellers currently do not need to trade on the dark web, as they seek to reach the largest number of customers and are not deterred by the actions of law enforcement, who rarely act against online sellers of antiquities (Brodie 2017).
Social media platforms are also often used in the antiquities trade. Platforms such as Facebook allow traffickers to reach a large audience with a casual interest in antiquities, thus normalizing the idea of looting for profit (Sargent et al. 2020). These online venues range from private Facebook groups to fora used to discuss the authenticity and value of specific items (Al-Azm and Paul 2018, 2019). Moreover, once the initial contact between the seller and the buyer is made, the trade often moves to encrypted channels, protecting both the seller and the purchaser from detection. While Facebook recently announced a ban on the sale of historical artifacts, there are still strong indications that sales have not ceased on the platform (Al-Azm and Paul 2019).
Detection of participants in the trade is difficult because of the large volume of relatively small transactions. One solution is the blending of manual and automated computational methods, such as machine learning and social network analysis, to efficiently process data and identify leads and gather investigative evidence. Using large sets of interoperable data, investigative leads can be supplemented with financial, transport, and other data to examine entire supply chains from source through transit to destination countries. Financial investigations of the transfer of digital assets allow for the mapping of
transnational transactions through the analysis of payment processing. Only with techniques using sophisticated data analytics will it be possible for investigators to address these crimes. At present, creative and innovative criminal actors frustrate the ability of governments to disrupt this pervasive online criminal activity causing irreparable damage to the international community's cultural heritage.

Cross-References

▶ Persistent Identifiers (PIDs) for Cultural Heritage
▶ Transnational Crime

Further Reading

Al-Azm, A., & Paul, K. A. (2018). How Facebook made it easier than ever to traffic middle eastern antiquities. https://www.worldpoliticsreview.com/insights/25532/how-facebook-made-it-easier-than-ever-to-traffic-middle-eastern-antiquities. Accessed 22 Dec 2020.
Al-Azm, A., & Paul, K. A. (2019). Facebook's black market in antiquities: Trafficking, terrorism, and war crimes. http://atharproject.org/wp-content/uploads/2019/06/ATHAR-FB-Report-June-2019-final.pdf. Accessed 22 Dec 2020.
Brodie, N. (2017). How to control the Internet market in antiquities? The need for regulation and monitoring. Antiquities coalition, policy brief no. 3.
Brodie, N. (2019). Final report: Countering looting of antiquities in Syria and Iraq. https://traccc.gmu.edu/sites/default/files/Final-TraCCC-CTP-CTAQM-17-006-Report-Jan-7-2019.pdf.
Brodie, N., & Sabrine, I. (2018). The illegal excavation and trade of Syrian cultural objects: A view from the ground. Journal of Field Archaeology, 43(1), 74–84. https://doi.org/10.1080/00934690.2017.1410919.
Sargent, M., et al. (2020). Tracking and disrupting the illicit antiquities trade with open source data. RAND Corporation. https://www.rand.org/pubs/research_reports/RR2706.html. Accessed 22 Dec 2020.
Wartenberg, U., & Brederova, B. (2021). Plenitudinous: An analysis of ancient coin sales on eBay. In L. Hashemi & L. Shelley (Eds.), Antiquities smuggling: In the real and the virtual world. Abingdon/New York: Routledge.
Westcott, T. (2020). Destruction or theft: Islamic state, Iraqi antiquities and organized crime. https://globalinitiative.net/wp-content/uploads/2020/03/Destruction-or-theft-Islamic-State-Iraqi-antiquities-and-organized-crime.pdf. Accessed 22 Dec 2020.

Apple

R. Bruce Anderson1,2 and Kassandra Galvez2
1Earth & Environment, Boston University, Boston, MA, USA
2Florida Southern College, Lakeland, FL, USA

In 1984, Steve Jobs and Steve Wozniak started a business in a garage. This business went on to change the global dynamic of computers as it is known today. Apple Inc. all started because Jobs and Wozniak wanted the machines to be smaller, cheaper, intuitive, and accessible to everyday consumers, but more important, user-friendly. Over the past 30 years, Apple Inc. has transformed this simple idea into a multi-billion dollar industry that includes: laptops, desktops, tablets, music players, and so much more. The innovative style and hard-wired simplicity of Apple's approach has proven to be a sustained leader for computer design.

After some successes and failures, Apple Inc. created one of the most revolutionary programs to date: iTunes. In 2001, Apple released the iPod – a portable music player. The iPod allowed consumers to place music files into the iPod for music "on the go"; however, instead of obtaining the music files from CD's, you would obtain them online, via a proprietary website, "iTunes". iTunes media players systematically changed the way music is played and purchased. Consumers could now purchase digital copies of albums instead of "hardcopy" discs. This affordable way of purchasing music impacted the music industry in an enormous way. Its impact is unparalleled in terms of how the music industry profits off music sales and how new artists have been able to break through. The iTunes impact, however, reaches far beyond the music industry. Podcasts have impacted the way we can access educational material, for example. Music and media are easily accessible from anywhere in the world via iTunes.

While Macs and MacBooks are one of their most profitable items, the iPhone, created in 2007, really changed the industry. When Steve Jobs first introduced the iPhone, it was so different from
other devices because it was a music player, a phone, and an internet device all in one. With the iPhone's touchscreen and other unique features, companies like Nokia and Blackberry were left in the dust, which resulted in many companies changing their phones' structural model to have similar features to the iPhone.

In 2010, Apple Inc. released a tablet known as the iPad. This multi-touch tablet features a camera, a music player, internet access, and applications. Additionally, the iPad has GPS functions, email, and video recording software. The iPad transformed the consumer image of having a laptop. Instead of carrying around a heavy laptop, consumers have the option of purchasing a lightweight tablet that has the same features as a laptop.

On top of these unique products, Apple Inc. utilizes the iOS operating system. iOS uses touch-based instructions such as: swipes, taps, and pinches. These various instructions provide specific definitions within the iOS operating system. Since its debut in 2007, iOS software has transformed the nature of phone technology. With its yearly updates, the iOS software has added more distinctive features from Siri to the Game Center.

Siri is a personal assistant and knowledge navigator that is integrated into the iOS software. Siri responds to spoken commands and allows the user to have constant hands-free phone access. With the sound of your voice and a touch of a button, Siri has full access to the user's phone. Siri can perform tasks such as calling, texting, searching the web, finding directions, and answering general questions. With the latest iOS update in the Fall of 2013, Siri expanded its knowledge and is now able to support websites such as: Bing, Twitter, and Wikipedia. Additionally, Siri's voice was upgraded to sound more man than machine.

Apple's entry into the world of Big Data was late – but a link-up with IBM has helped a great deal in examining how users actually use their products. The Apple Watch, which made its debut in 2015, is able to gather data of a personal nature that lifts usage data to a new level. The success of the applications associated with the Apple Watch and the continuing development of data-gathering apps for the iPad and other Apple products has brought Apple at least in line with other corporations using Big Data as a source for consumer reflection information.

Apple Inc.'s advanced technology has transformed the computer industry and has made it one of the most coveted industries. In 2014, Apple Inc. is the world's second-largest information technology company by revenue after Samsung and the world's third-largest mobile phone maker after Samsung and Nokia. One of the ways Apple Inc. has become such an empire is with its retail stores. As of August 2014, Apple has 434 retail stores in 16 countries and an online store available in 43 countries. The Apple Store is a chain of retail stores owned and operated by Apple Inc., which deals with computers and other various consumer electronics. These Apple Stores sell items from iPhones, iPads, MacBooks, and iPods, to third party accessories.

The access of third party applications to Apple hardware – though still relatively closely controlled – makes it possible for Apple to utilize Big Data in ways not anticipated in early development of what was essentially a sealed system.

Since its origin, one of Apple Inc.'s goals was to make computers accessible to everyday people. Apple Inc. accomplished this goal by partnering with President Barack Obama's ConnectED initiative. In June 2013, President Obama announced the ConnectED initiative, designed to enrich K-12 education for every student in America. ConnectED empowers teachers with the best technology and the training to make the most of it and empowers students through individualized learning and rich, digital content. President Obama's mission is to prepare America's students with the skills they need to get good jobs and compete with other countries which rely increasingly on interactive, personalized learning experiences driven by new technology. President Obama states that fewer than 30% of America's schools have the broadband they need to teach using today's technology. However, under ConnectED, 99% of American students will have access to next-generation broadband by 2017. That connectivity will help transform the classroom experience for all students, regardless of income. President Obama has also directed the
federal government to make better use of existing funds to get Internet connectivity and educational technology into classrooms, and into the hands of teachers trained on its advantages, and he called on businesses, states, districts, schools, and communities to support this vision, which requires no congressional action.

Furthermore, in 2011, Apple Inc. partnered with "Teach for America," a program that trains recent graduates from some of America's most prestigious universities to teach in the meanest and most dangerous schools throughout the nation, and donated over 9000 first generation iPads to teachers that work in impoverished and dangerous schools. These donated iPads came from customers who donated to Apple's public service program during the iPad 2 launch. These 9000 first generation iPads were distributed to teachers in 38 states.

In addition to President Obama's ConnectED initiative, Apple Inc. has also provided students and educators with special discounts which enable these devices to be much more accessible and affordable. During the months of June to August, Apple Inc. bestows up to $200.00 in savings on specific Apple products such as MacBooks and iPads to students and educators. The only requirement to receive this discount is student identification or an educator's identification. Once the proper verification is shown, students and educators receive the discount in addition to up to $100.00 in iTunes and/or App Store credit.

Apple Inc. caters to students not only with these discounts but also with various other education resources. Some of these education resources include: iBooks and iTunes U. iBooks is an online store through Apple Inc. that allows the user to purchase electronic books. These electronic books are linked to your account and allow you access wherever you may be with the device. iBooks include materials from novels, travel guides, and textbooks. iTunes U is a program for professors that enables students to access course materials, track assignments, and organize notes. Additionally, students can create discussion posts for that specific class, add material from outside sources, and generate a more specialized course. iTunes U not only offers elements for courses but it also provides other education facets such as: interviews, journals, self-taught books, and more.

With the variety of products that consumers can buy through Apple, iCloud has proven to be a distinct source to store all users' information. iCloud is a cloud storage and computing service that was launched in 2011. iCloud allows users to store data such as music and other iOS applications on computer servers for download to multiple devices. Additionally, iCloud is a data syncing center for email, contacts, calendars, bookmarks, notes, reminders, documents, photos, and other data. As such, much of the data in a "macro" setting is available to developers as aggregated material. However, in order to use these functions, users must create an Apple ID. An Apple ID is the email you use as a login for every Apple function such as buying songs on iTunes and purchasing apps from the App Store. By choosing to use the same Apple ID, customers have the ability to keep all their data in one location. When customers set up their iPhone, iPad, or iPod touch, they can use the same Apple ID for iCloud services and purchases on the iTunes Store, App Store, and iBooks Store. On top of that, users can set up their credit card and billing information through the Apple ID. This Apple ID allows users to have full access to any purchases on the go through iCloud. How much data is shared by Apple through the cloud remains something of a mystery. While individual user information is unlikely to be available, Big Data – data at the aggregate level measuring consumer usage – almost certainly is.

iCloud allows users to back up the settings and data on any iOS device. This data includes photos and videos, device settings, app data, messages, ringtones, and visual voicemails. These iCloud backups occur daily the minute one of the consumers' iOS devices is connected to Wi-Fi and a power source. Additionally, iCloud backs up contact information, email accounts, and calendars. Once the data is backed up, customers will have all the same information on every single iOS device. For example, if a user has an iPad, an iPhone, and a MacBook and starts adding schedules to their iPhone calendar, the minute the backup begins, he or she will be able to access that same calendar with the new schedule on his or
her iPad and MacBook. Again, Apple provides users with this unique connectivity by the use of an Apple ID and iCloud. When signing up for iCloud, users automatically get 5 gigabytes of free storage. In order to access more gigabytes, users can either go to their iCloud account to delete data or users can upgrade. When upgrading, users have three choices: a 10 gigabyte upgrade, a 20 gigabyte upgrade, and a 50 gigabyte upgrade. These three choices are priced at $20.00, $40.00, and $100.00, respectively. These storage upgrades are billed annually.

The last two unique features that iCloud provides are Find my iPhone and iCloud Keychain. Find my iPhone allows users to track the location of their iOS device or Mac. By accessing this feature, users can see the device's approximate location on a map, display a message and/or play a sound on the device, change the password on the device, and remotely erase its contents. In recent upgrades, iOS 6 introduced Lost Mode, which is a new feature that allows users to mark a device as "lost," making it easier to protect and find. The feature also allows someone that finds the user's lost iPhone to call the user directly without unlocking it. This feature has proved to be useful in situations where devices are stolen. Since the release of this application in 2010, similar phone finders have become available for other "smart" phones.

The iCloud Keychain functions as a secure database that allows information including a user's website login passwords, Wi-Fi network passwords, credit/debit card management, and other account data, to be securely stored for quick access and auto-fill on webpages and elsewhere when the user needs instant access to them. Once passwords are in the iCloud Keychain, they can be accessed on all devices connected to the Apple ID. Additionally, to view the running list of passwords, credit and debit card information, and other account data, the user must put in a separate password in order to see the list of secure data. iCloud also has a security function. If users enter an incorrect iCloud Security Code too many times when using iCloud Keychain, the user's iCloud Keychain is disabled on that device, the keychain in the cloud is deleted, and the user will receive one of these alerts: "Security Code Incorrectly Entered Too Many Times. Approve this iPhone from one of your other devices using iCloud Keychain. If no devices are available, reset iCloud Keychain." or "Your iCloud Security Code has been entered too many times. Approve this Mac from one of your other devices using iCloud Keychain. If no devices are available, reset iCloud Keychain."

In 2013, Apple Inc. released its most innovative feature yet: Touch ID. Touch ID is a fingerprint scanner which doubles as password protection on the iPhone 5s, which is the latest version of the iPhone. The reason for making Touch ID is because more than 50% of users do not use a passcode: with Touch ID, creating and using a passcode is seamless. When accessing the iPhone 5S, users register every single finger into the system. By allowing this registration, users are able to unlock their iPhones with any finger. To unlock the iPhone, users simply place their finger on the home button; the Touch ID sensor reads the fingerprint and immediately unlocks the iPhone. Touch ID is not only for passwords but it also authorizes purchases onto your Apple ID such as: iTunes, iBooks, and the App Store. On announcing this feature, Apple stated that Touch ID doesn't store any images of your fingerprint. It stores only a mathematical representation of your fingerprint. The iPhone 5S also includes a new advanced security architecture called the Secure Enclave within the A7 chip, which was developed to protect passcode and fingerprint data, which means that the fingerprint data is encrypted and protected with a key available only to the Secure Enclave. Therefore, your fingerprint is never accessed by iOS or other apps, never stored on Apple servers, and never backed up to iCloud or anywhere else.

As secure as Apple's data likely is to outside investigators or hackers, it seems very likely that sampling at the Big Data level is constant.

Cross-References

▶ Business Intelligence
▶ Cell Phone Data
▶ Cloud Computing
▶ Cloud Services
▶ Data Storage
▶ Education and Training
▶ Voice Data

Further Reading

Atkins, R. Top stock in news: Apple (NASDAQ AAPL) plans on building the world's biggest retail store. Tech News Analysis. 21 Aug. 2014. Web. 26 Aug. 2014.
ConnectED Initiative. The White House. The White House, 1 Jan. 2014. Web. 26 Aug. 2014.
Happy Birthday, Mac. Apple. Apple, 1 Jan. 2014. Web. 26 Aug. 2014.
Heath, A. Apple donates 9,000 iPads to teachers working in impoverished schools. Cult of Mac. 20 Sept. 2011. Web. 26 Aug. 2014.
iCloud: About iCloud security code alert messages. Apple Support. Apple, 20 Oct. 2013. Web. 26 Aug. 2014.
iPhone 5s: About touch ID security. Apple Support. Apple, 28 Mar. 2014. Web. 26 Aug. 2014.
Marshal, G. 10 Ways Apple Changed the World. TechRadar. 10 Mar. 2013. Web. 26 Aug. 2014.

Archaeology

Stuart Dunn
Department of Digital Humanities, King's College London, London, UK

Introduction

In one sense, archaeology deals with the biggest dataset of all: the entire material record of human history, from the earliest human origins c. 2.2 million years Before Present (BP) to the present day. However, this dataset is, by its nature, incomplete, fragmentary, and dispersed. Archaeology therefore brings a very particular kind of challenge to the concept of big data. Rather than real-time analyses of the shifting digital landscape of data produced by the day to day transactions of millions of people and billions of devices, approaches to big data in archaeology refer to the sifting and reverse-engineering of masses of data derived from both primary and secondary investigation into the history of material culture.

Big Data and the Archaeological Research Cycle

Whether derived from excavation, post-excavation analysis, experimentation, or simulation, archaeologists have only tiny fragments of the "global" dataset that represents the material record, or even the record of any specific time period or region. If one takes any definition of "Big Data" as it is generally understood, a corpus of information which is too massive for desktop-based or manual analysis or manipulation, no single archaeological dataset is likely to have these attributes of size and scale. The significance of Big Data for archaeology lies not so much in the analysis and manipulation of single or multiple collections of vast datasets but rather in the bringing together of multiple data, created at different times, for different purposes and according to different standards; and in the interpretive and critical frameworks needed to create knowledge from them. Archaeology is "Big Data" in the sense that it is "data that is bigger than the sum of its parts."

Those parts are massively varied. Data in archaeology can be normal photographic images, images and data from remote sensing, tabular data of information such as artifact findspots, numerical databases, or text. It should also be noted that the act of generating archaeological data is rarely, if ever, the end of the investigation or project. Any dataset produced in the field or the lab typically forms part of a larger interpretation and interpolation process and – crucially – archaeological data is often not published in a consistent or interoperable manner; although approaches to so-called Grey Literature, which constitutes reports from archaeological surveys and excavations that typically do not achieve a wide readership, are discussed below. This fits with a general characteristic of Big Data, as opposed to the "e-Science/Grid Computing" paradigm of the 2000s. Whereas the latter was primarily concerned with "big infrastructure," anticipating the need for
scientists to deal with a "deluge" of monolithic data emerging from massive projects such as the Large Hadron Collider, as described by Tony Hey and Anne Trefethen, Big Data is concerned with the mass of information which grows organically as the result of the ubiquity of computing in everyday life and in everyday science. In the case of archaeology, it may be considered more as a "complexity deluge," where small data, produced on a daily basis, forms part of a bigger picture.

There are exceptions: Some individual projects in archaeology are concerned with terabyte-scale data. The most obvious example in the UK is North Sea Paleolandscapes, led by the University of Birmingham, a project which has reconstructed the Early Holocene landscape of the bed of the North Sea, which was an inhabitable landscape until its inundation between 20,000 and 8,000 BP – so-called Doggerland. As Vince Gaffney and others describe, drawing on 3D seismic data gathered during the process of oil prospection, this project has used large-scale data analytics and visualization to reconstruct the topography of the preinundation land surface spanning an area larger than the Netherlands, and thus to allow inferences as to what environmental factors might have shaped human habitation of it; although it must be stressed that there is no direct evidence at all of that human occupation. While such projects demonstrate the potential of Big Data technologies for conducting large-scale archaeological research, they remain the exception. Most applications in archaeology remain relatively small scale, at least in terms of the volume of data that is produced, stored, and preserved.

However, this is not to say that approaches which are characteristic of Big Data are not changing the picture significantly in archaeology, especially in the field of landscape studies. Data from geophysics, the science of scanning subterranean features using techniques such as magnetometry and resistivity, typically produce relatively large datasets, which require holistic analysis in order to be understood and interpreted. This trend is accentuated by the rise of more sophisticated data capture techniques in the field, which is increasing the capacity of data that can be gathered and analyzed. Although still not "big" in the literal sense of "Big Data," this class of material undoubtedly requires the kinds of approaches in thinking and interpretation familiar from elsewhere in the Big data agenda. Recent applications in landscape archaeology have highlighted the need both for large capacity and interoperation. For example, integration of data in the Stonehenge Hidden Landscapes Project, also directed by Gaffney, provides for "seamless" capture of reams of geophysical data from remote sensing, visualizing the Neolithic landscape beneath modern Wiltshire to a degree of clarity and comprehensiveness that would only have been possible hitherto with expensive and laborious manual survey. Due to improved capture techniques, this project succeeded in gathering a quantity of data in its first two weeks equivalent to that of the landmark Wroxeter survey project in the 1990s.

These early achievements of big data in an archaeological context fall against a background of falling hardware costs, lower barriers to usage, and the availability of generic web-based platforms where large-scale distributed research can be conducted. This combination of affordability and usability is bringing about a revolution in applications such as those described above, where remote sensing is reaching new concepts and applications. For example, coverage of freely available satellite imagery is now near-total; graphical resolution is finer for most areas than ever before (1 m or less); and pre-georeferenced satellite and aerial images are delivered to the user's desktop, removing the costly and highly specialized process of locating imagery of the Earth's surface. Such platforms also allow access to imagery of archaeological sites in regions which are practically very difficult or impossible to survey, such as Afghanistan, where declassified CORONA spy satellite data are now being employed to construct inventories of the region's (highly vulnerable) archaeology. If these developments cannot be said to have removed the boundaries within which archaeologists can produce, access, and analyze data, then they have certainly made them more porous.

As in other domains, strategies for the storage and preservation of data in archaeology have a fundamental relationship with relevant aspects of
the Big Data paradigm. Much archaeological information lives on the local servers of institutions, individuals, and projects; this has always constituted an obvious barrier to their integration into a larger whole. However, weighing against this is the ethical and professional obligation to share, especially in a discipline where the process of gathering the data (excavation) destroys its material context. National strategies and bodies encourage the discharge of this obligation. In the UK, as well as data standards and collections held by English Heritage, the main repository for archaeological data is the Archaeology Data Service, based at the University of York. The ADS considers for accession any archaeological data produced in the UK in a variety of formats. This includes most of the data formats used in day-to-day archaeological workflows: Geographic Information System (GIS) databases and shapefiles, images, numerical data, and text. In the latter case, particular note should be given to the "Grey Literature" library of archaeological reports from surveys and excavations, which typically present archaeological information and data in a format suitable for rapid publication, rather than the linking and interoperation of that data. Currently, the Library contains over 27,000 such reports, and the total volume of the ADS's collections stands at 4.5 Tb (I thank Michael Charno for this information). While this could be considered "big" in terms of any collection of data in the humanities, it is not of a scale which would overwhelm most analysis platforms; however, what is key here is that it is most unlikely to be useful to perform any "global" scale analysis across the entire collection. The individual datasets therein relate to each other only inasmuch as they are "archaeological." In the majority of cases, there is only fragmentary overlap in terms of content, topic, and potential use. A 2007 ADS/English Heritage report on the challenges of Big Data in archaeology identified four types of data format potentially relevant to Big Data in the field: LIDAR (Light Detection and Ranging or Laser Imaging Detection and Ranging) data, which models terrain elevation from airborne sensors, 3D laser scanning, maritime survey, and digital video. At first glance this appears to underpin an assumption that the primary focus is data formats which convey larger individual data objects, such as images and geophysics data, with the report noting that "many formats have the potential to be Big Data, for example, a digital image library could easily be gigabytes in size. Whilst many of the conclusions reached here would apply equally to such resources this study is particularly concerned with Big Data formats in use with technologies such as lidar surveys, laser scanning and maritime surveys."

However, the report also acknowledges that "If long term preservation and reuse are implicit goals data creators need to establish that the software to be used or toolsets exist to support format migration where necessary." It is true that any "Big Data" which is created from an aggregation of "small data" must interoperate. In the case of "social data" from mobile devices, for example, location is a common and standardizable attribute that can be used to aggregate Tb-scale datasets: heat maps of mobile device usage can be created which show concentrations of particular kinds of activity in particular places at particular times. In more specific contexts, hashtags can be used to model trends and exchanges between large groups. Similarly intuitive attributes that can be used for interoperation, however, elude archaeological data, although there is much emerging interest in Linked Data technologies, which allow the creation of linkages between web-exposed databases, provided they conform (or can be configured to conform) to predefined specifications in descriptive languages such as RDF. Such applications have proved immensely successful in areas of archaeology concerned with particular data types, such as geodata, where there is a consistent base reference (such as latitude and longitude). However, this raises a question which is fundamental to archaeological data in any sense. Big Data approaches here, even if the data is not "Big" in relative terms compared to the social and natural sciences, potentially allow an "n = all" picture of the data record. As noted above, however, this record represents only a tiny fragment of the entire picture. A key question, therefore, is: does "Big data" thinking risk technological determinism, constraining what
questions can be asked? This is a point which has concerned archaeologists since the very earliest days of computing in the discipline. In 1975, a skeptical Sir Moses Finley noted that "It would be a bold archaeologist who believed he could anticipate the questions another archaeologist or a historian might ask a decade or a generation later, as the result of new interests or new results from older researchers. Computing experience has produced examples enough of the unfortunate consequences . . . of insufficient anticipation of the possibilities at the coding stage."

Conclusion

Such questions probably cannot be predicted, but big data is (also) not about predicting questions. The kind of critical framework that Big Data is advancing, in response to the ever-more linkable mass of pockets of information, each themselves becoming larger in size as hardware and software barriers lower, allows us to go beyond what is available "just" from excavation and survey. We can look at the whole landscape in greater detail and at new levels of complexity. We can harvest public discourse about cultural heritage in social media and elsewhere and ask what that tells us about that heritage's place in the contemporary world. We can examine what are the fundamental building blocks of our knowledge about the past and ask what do we gain, as well as lose, by putting them into a form that the World Wide Web can read.

References

Archaeology data service. http://archaeologydataservice.ac.uk. Accessed 25 May 2017.
Austin, T., & Mitcham, J. (2007). Preservation and management strategies for exceptionally large data formats: 'Big Data'. Archaeology Data Service & English Heritage: York, 28 Sept 2007.
Gaffney, V., Thompson, K., & Finch, S. (2007). Mapping Doggerland: The Mesolithic landscapes of the Southern North Sea. Oxford: Archaeopress.
Gaffney, C., Gaffney, V., Neubauer, W., Baldwin, E., Chapman, H., Garwood, P., Moulden, H., Sparrow, T., Bates, R., Löcker, K., Hinterleitner, A., Trinks, I., Nau, W., Zitz, T., Floery, S., Verhoeven, G., & Doneus, M. (2012). The Stonehenge Hidden Landscapes Project. Archaeological Prospection, 19(2), 147–155.
Tudhope, D., Binding, C., Jeffrey, S., May, K., & Vlachidis, A. (2011). A STELLAR role for knowledge organization systems in digital archaeology. Bulletin of the American Society for Information Science and Technology, 37(4), 15–18.

Artificial Intelligence

Feras A. Batarseh
College of Science, George Mason University, Fairfax, VA, USA

Synonyms

AI; Intelligent agents; Machine intelligence

Definition

Artificial Intelligence (often referred to as AI) is a field in computer science that is concerned with the automation of intelligence and the enablement of machines to achieve complex tasks in complex environments. This definition is an augmentation of two preexisting commonplace AI definitions (Goebel et al. 2016; Luger 2005).

AI is an umbrella that has many subdisciplines; big data analytics is one of them. The traditional promise of machine intelligence is being partially rekindled into a new business intelligence promise through big data analytics.

This entry covers AI and its multiple subdisciplines.

Introduction
AI is a field that is built on centuries of thought; however, it became a recognized field only some 70 years ago. AI is challenged in many ways; identifying what's artificial versus what is real can be tricky in some cases, for example: "A tsunami is a large wave in an ocean caused by an earthquake or a landslide. Natural tsunamis occur from time to time. You could imagine an artificial tsunami that was made by humans, by exploding a bomb in the ocean for instance, yet, it still qualifies as a tsunami. One could also imagine fake tsunamis: using computer graphics, or natural, for example, a mirage that looks like a tsunami but is not one." (Poole and Mackworth 2010). However, intelligence is arguably different: you cannot create an illusion of intelligence or fake it. When a machine acts intelligently, it is then intelligent. There is no known way that a machine would demonstrate intelligence randomly.

The field of AI continuously poses a series of questions: How to define or observe intelligence? Is AI safe? Can machines achieve superintelligence? among many other questions. In his famous manuscript, "Computing Machinery and Intelligence" (Turing 1950), Turing paved the way for many scientists to think about AI through answering the following: Can machines think? To be able to imitate, replicate, or augment human intelligence, it is crucial to first understand what intelligence exactly means. For that, AI becomes a field that overlaps other areas of study, such as biology (the ability to understand the human brain and nervous system); philosophy is another field that has been highly concerned with AI (understanding how AI would affect the future of humanity – among many other philosophical discussions).

AI Disciplines

There have been many efforts towards achieving intelligence in machines, which has led to the creation of many disciplines in AI, such as:

1. Machine Learning: is when intelligent agents learn by exploring their surroundings and figuring out what actions are the most rewarding.
2. Neural Networks: are a learning paradigm inspired by the human nervous system. In neural networks, information is processed by a set of interconnected nodes called neurons.
3. Genetic Algorithms (GA): is a method that finds a solution or an approximation to the solution for optimization and search problems. GAs use biological techniques such as mutation, crossover, and inheritance (a minimal illustrative sketch of these operators appears at the end of this entry).
4. Natural Language Processing (NLP): is a discipline that deals with linguistic interactions between humans and computers. It is an approach dedicated to improving human-computer interaction. This approach is usually used for audio recognition.
5. Knowledge-based systems (KBS): are intelligent systems that reflect the knowledge of a proficient person, also referred to as expert systems. KBS are known to be one of the earliest disciplines of modern AI.
6. Computer Vision: is a discipline that is concerned with injecting intelligence to enforce the ability of perceiving objects. It occurs when the computer captures and analyzes images of the 3D world. This includes making the computer recognize objects in real-time.
7. Robotics: is a central field of AI that deals with building machines that imitate human actions and reactions. Robots in some cases have human features such as arms and legs, and in many other cases are far from how humans look. Robots are referred to as intelligent agents in some instances.
8. Data Science and Advanced Analytics: is a discipline that aims to increase the level of data-driven decision-making and provide improved descriptive and predictive pointers. This area has been the focus of recent business AI applications, to the degree that many interchangeably (wrongly though) refer to it as AI. Many organizations are adopting this area of research and development. It has been used in many domains (such as healthcare, government, and banking). Intelligence methods are applied to structured data, and results are usually
presented in what is referred to as a data visualization (using tools such as Tableau, R, SPSS, and PowerBI).

Computer agents are a type of intelligent system that can interact with humans in a realistic manner. They have been known to beat the world's best chess player and locate hostages in a military operation. A computer agent is an autonomous or semiautonomous entity that can emulate a human. It can be either physical such as a robot or virtual such as an avatar. The ability to learn should be part of any system that claims intelligence.

AI Challenges and Successes

Intelligent agents must be able to adapt to changes in their environment. Such agents, however, have been challenged by many critics and thinkers for many reasons. Major technical and philosophical challenges to AI include: (1) The devaluation of humans: many argue that AI would replace humans in many areas (such as jobs and day-to-day services). (2) The lack of hardware that can support AI's extensive computations. Although Moore's law sounds intriguing (which states that the number of transistors on an integrated circuit doubles roughly every two years), that is still a fairly slow pace for what AI is expected to require in terms of hardware. (3) The effect of AI: Whenever any improvement in AI is accomplished, it is disregarded as a calculation in a computer that is driven by a set of instructions, and not real intelligence. This was one of the reasons the AI winter occurred (lack of research funding in the field). The field kept providing exploratory studies but there was a lack of real applications to justify the funding. Recently, however, with technologies such as Deep Blue and Watson, AI is gaining attention and attracting funding (in academia, government, and industry). (4) Answering the basic questions of when and how to achieve AI. Some researchers are looking to achieve Artificial General Intelligence (AGI) or Superintelligence, which is a form of intelligence that can continuously learn and replicate human thought, understand context, and develop emotions, intuitions, fears, hopes, and reasoning skills. That is a much wider goal of AI than the existing narrow intelligence, which presents machines that have the ability to perform a predefined set of tasks intelligently. Narrow intelligence is currently being deployed in many applications such as driverless cars and intelligent personal assistants. (5) Turing's list of potential AI roadblocks: presented in his famous paper, those challenges are still deemed relevant (among many other potential challenges).

In spite of the listed major five challenges, AI has already presented multiple advantages such as: (1) greater calculation precision, accuracy, and the lack of errors, (2) performing tasks that humans are not able to or ones that are deemed too dangerous (such as space missions and military operations), and (3) accomplishing very complex tasks such as fraud detection, events prediction, and forecasting. Furthermore, AI has had many successful deployments such as: Deep Blue (a chess computer), autonomous cars (produced by Tesla, Google, and other technology and automotive companies), IBM's Watson (a Jeopardy! computer), and intelligent personal assistants (such as Apple's Siri and Amazon's Alexa).

Conclusions

AI is a continuously evolving field; it overlaps with multiple other areas of research such as computer science, psychology, math, biology, philosophy, and linguistics. AI is both feared by many due to the challenges listed in this entry and loved by many as well due to its many technological advantages in critical areas of human interest. AI is often referred to as the next big thing, similar to the industrial revolution and the digital age. Regardless of its pros, cons, downfalls, or potential greatness, it is an interesting field that is worth exploring and expanding.
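To make the genetic-algorithm discipline listed above (item 3) concrete, the following is a minimal sketch of the technique rather than an implementation drawn from any of the systems named in this entry. The toy problem (maximizing the number of 1-bits in a fixed-length bit string) and all parameter values are illustrative assumptions chosen only to show selection, crossover, and mutation in runnable form.

```python
# Minimal, illustrative genetic algorithm: evolve bit strings toward all ones
# using the three operators named above -- selection, crossover, and mutation.
import random

STRING_LENGTH = 20     # length of each candidate bit string (assumed value)
POPULATION_SIZE = 30   # illustrative population size
MUTATION_RATE = 0.02   # per-bit probability of flipping
GENERATIONS = 100      # upper bound on evolutionary cycles

def fitness(individual):
    """Count of 1-bits: higher is better for this toy problem."""
    return sum(individual)

def select(population):
    """Tournament selection: return the fitter of two random individuals."""
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent_a, parent_b):
    """Single-point crossover combines genetic material from two parents."""
    point = random.randint(1, STRING_LENGTH - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(individual):
    """Flip each bit with a small probability (mutation)."""
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit
            for bit in individual]

# Random initial population of bit strings.
population = [[random.randint(0, 1) for _ in range(STRING_LENGTH)]
              for _ in range(POPULATION_SIZE)]

for generation in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POPULATION_SIZE)]
    best = max(population, key=fitness)
    if fitness(best) == STRING_LENGTH:   # stop once the optimum is reached
        break

print(f"Best individual after {generation + 1} generations:", best)
```

In real applications, the toy fitness function is replaced by a domain-specific measure of solution quality; the overall loop of selection, crossover, and mutation remains the same.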
Further Reading

Goebel, R., Tanaka, Y., & Wolfgang, W. (2016). Lecture notes in artificial intelligence series. In: Proceedings of the ninth conference on artificial general intelligence, New York.
Luger, G. (2005). Artificial intelligence, structures and strategies for complex problem solving (5th ed.). Addison Wesley, ISBN: 0-321-26318-9.
Poole, D., & Mackworth, A. (2010). Artificial intelligence: Foundation of computer agents (1st ed.). Cambridge University Press, ISBN: 978-0-511-72946-1.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460.

Arts

Marcienne Martin
Laboratoire ORACLE [Observatoire Réunionnais des Arts, des Civilisations et des Littératures dans leur Environnement], Université de la Réunion, Saint-Denis, France, Montpellier, France

Big Data is a procedure that allows anyone connected to the Internet to access data whose content is as varied as the perception that every human being can have of objects of the world. This is true for art.

Art is a cognitive approach that is applied to the objects in the world, which is quite unique because it uses the notion of "qualia," the qualitative aspect of a particular experience: "Qualia are such things as colors, places and times" (Dummett 1978). Moreover, in Plato's myth of the cave, the concept of "beautifulness" belongs to the world of Ideas. What is more, for Hegel, art is a sensitive representation of the truth approached through an individual form. In other words, art is a transcription of the objects of Reality through the artistic sensibility of the author. This phenomenon is at the origin of new artistic currents giving direction for the writing of artwork, irrespective of the domain (painting, sculpture, writing, music . . .), as well as featuring performers who are in resonance with these new views about art. In painting, we mention Gothic art in connection with Fra Angelico, Renaissance art with Leonardo da Vinci and Michelangelo, impressionism with Claude Monet, Pierre Auguste Renoir, and Edouard Manet, and more recently cubism with Georges Braque, Pablo Picasso, Lyonel Feininger, and Fernand Leger, and futurism with Luigi Russolo and Umberto Boccioni, just to name a few. There are also artists who did not join any artistic current, such as Facteur Cheval, whose work was labeled a posteriori "naive art."

Literary movements are part of a specific formatting of writing. This is the case with, for example, in France, the Middle Ages with the epic or the courtly romance; in the nineteenth century, Romanticism with, in Germany, the circle of Jena; in England with Lord Byron's works; or else in the USA, Edgar Allan Poe, who included the story of horror as artwork, or Herman Melville's internationally known novel Moby Dick (1851); in the twentieth century, various movements have emerged, including the new novel illustrated by Alain Robbe-Grillet's works; in the United States, writers like Scott Fitzgerald and Ernest Hemingway belong to contemporary history. Science fiction is a new scriptural approach created from imagination with no relation to reality. In music, its writing is at the origin of the creation of innovative rhythms transcribed through diverse instruments. Monody and polyphony have created the song and the opera. Around the Classical Age (eighteenth century), the art of music is transcribed in the form of a sonata, symphony, string quartet or chamber music, etc. Popular music such as jazz, rock and roll, etc., is an appreciated art form. From bitonality to polytonality, the art of music has been enhanced ad infinitum. Finally, architecture is an art form which values monuments and houses; it is the case with modern architecture founded by the French architect Le Corbusier.

Some theories in psychology consider art as an act that would allow the sublimation of unfortunate experiences. For example, Frida Kahlo transcribed her physical suffering in her painting "The Broken Column" (1944). This technique, called "art therapy," was developed based on the relationship between suffering and one's ability to express oneself through art and, thus, sublimate
suffering, which expresses one's resilience capacity. For example, Jean Dubuffet, an artist, discovered the artworks of people who were suffering from psychiatric disorders; he named this art form "rough art" (art brut). Other psychological theorists have analyzed this phenomenon.

Through its specific manifestations, art stands out from the usual human paradigms, like pragmatism (the objects of reality approached as such), representation (lexical-semantic fields, doxa, culture . . .), or their symbolization (flags, military decorations . . .). Art uses the fields of imagination, emotions, and the sensibility of the author, which makes each work unique in its essence.

If art is difficult to define in its specificity (quale or feeling), it can be analyzed in its manifestations with the informational decryption realized through various perspectives on the work by art critics, authors specialized in this area, magazines, etc. Goodman tried to find common invariants from one feeling to another concerning the same object, which he expressed as a matching criterion of the qualia by the following equation: q(x) = q(y) iff Mxy (q = quale, M = matching). Qualia are phenomena which belong to the domain of individual perception and which cannot be transmitted as such from one individual to another; this phenomenon refers to the concept of solipsism, which covers the meaning of: "attitude of the thinking subject for which his own consciousness is the only reality, the other consciousnesses and the outside world being only representations."

The rewriting of the concept of qualia through an informational system will consider the objects causing these feelings and not the qualia. The information system is the substratum upon which the living world is based. Indeed, whatever the content of information is, its transfer from a transmitter X will influence the perception of the environment and the responses given to it for a receiver Y, which will result, sooner or later, in the "butterfly effect" discovered by Lorenz and developed by Gleick (1989). Furthermore, the oscillation of objects in the universe between entropy and negative entropy is articulated around the factor time; without it, neither the evolution of objects in the world nor their transformation would exist; as for information coupled with the advancement of the living world, it would have no reality.

The exchange of information is a start that allows the world of the living to exist and perpetuate itself through time. The informational exchange has been the subject of numerous studies. The understanding of the nature of an object in the world is multifaceted: an unknown object generates more information, in the form of hypotheses and beliefs posited as postulates, than a known object. Norbert Wiener came up with a concept defined by the term "cybernetics"; this concept refers to the notion of the quantity of information, which is connected to a classical notion in statistical mechanics, that is, entropy. As information in a system is a measure of the degree of organization, entropy is a measure of the degree of disruption of a system. One is simply the opposite of the other. In the world of common realities, the transfer of information is realized from an object X to an object Y, and this will affect particular environments, provided that such information be updated and sent to an object Z. In the virtual world, or the Internet, information is available to any user and it is consulted, generally, at the discretion of people; a contrario, in the living world, information is part of requirements that are linked to survival and continuity.

In addition, information on the Internet is reticular, which refers to the basic idea of points which communicate with each other, or differently, with intersecting lines. The networked structure generates very different behavior from that found in tree-like or pyramidal social structures. As part of the "network of networks," or the Internet, each internet user occupies a dual role: the network node and the link. Indeed, from a point X (surfer), a new network can be created and can aggregate new members around a common membership (politics, art, fashion . . .), mediated by tools called "social networks" such as Facebook and Twitter. The specificity of this type of fast-growing network can lead to the discovery of an artist whose work is put on the Internet, as with the South Korean artist Psy and his song "Gangnam Style" (December 2012) presented on Youtube (https://www.youtube.com/watch?v=9bZkp7q19f0).
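The inverse relationship between information and entropy invoked above ("one is simply the opposite of the other") is often stated compactly in Shannon's formulation; the following is offered only as an illustrative aside, not as part of Martin's original text:

H(X) = -\sum_{i} p_i \log_2 p_i, \qquad \text{organization (negentropy)} = H_{\max} - H(X)

so that a source whose outcomes are maximally unpredictable has maximal entropy and, in Wiener's sense, minimal organization.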
The database on the Internet has increased exponentially. For each new piece of information, feedback is given, and this ad infinitum. However, if the information is available, it is not solicited by each user of the Web. Only a personal choice directs the user to a particular type of information based on his or her own requests. The amount of existing information on the Internet, as well as its storage, transfer, new feedback made after consultation by web users, and its speed of transmission, are part of the concept of "big data," which is built around a double temporal link. Each user can connect to any Internet user regardless of where he or she is on the planet. The notion of time is in the immediacy. The amount of information available to each user is available to anyone, at any time, and regardless of the place of consultation, which refers to a timeless space.

To return to the concept of time, from the nineteenth century (the era of industrialization) with the creation of rail transport or automobiles, the perception of distance by people has changed. Air transport is the source of a change in concepts of time and distance. Indeed, e.g., a Paris–New York trip is no longer expressed in the form of the distance between these two points, but in the length of time taken to reach them; thus Paris is 8 h from New York and not 7000 km. The tilting of the spatial dimension into the time dimension is in resonance with the Internet, where contact is part of immediacy: time and distance become one; they are redefined in the form of binary information through satellites and various computer media.

The reticular structure of the digital society is composed of nodes and human links (Internet, experts in the field), but also of technological links and nodes (hardware, satellite, etc.).

Art is in resonance with the phenomenon of the rewriting of time and of space, with art called "ephemeral," in which the artist creates a work that will last only the time of its installation. Ephemeral art is a way of expressing time in the present and focusing on the feeling, i.e., the quale, not the sustainability. This approach is the opposite of artworks whose purpose was to last beyond the generation that witnessed their creation. Examples would be the Egyptian pyramids, the Venus of Milo, and the Mona Lisa. The Internet is also the source of new approaches to art. This is the case with works by artists which are retransformed by another artist; two or more works can coexist while being retransformed into a third artwork. "In writing, painting, or what might be referred to as overlapped art, I could say that art is connected from my feeling to the creation of the other" (Martin 2014). Mircea Bochis, a Romanian artist, has created original videos mixing the poetry of one author and a video created by another. After Dark is a new project for visiting a museum at night from one's computer with the help of a robot. The National Gallery of British Art in London, or Tate Britain, has similar innovative projects.

Technological development has opened a new artistic approach to photography and filmography. Moreover, art has long been known in privileged social backgrounds, and it was not until 1750 that the first French museum was opened to the public. If art in all its forms, through its digital rewriting (photographs, various montages, movies, videos, books online, etc.), is open to everyone, only personal choices will appeal to big data for consultation.

Further Reading

Bochis, M. (2014). Artist. http://www.bochis.ro/. Accessed 15 August 2014.
Botet Pradeilles, G., & Martin, M. (2014). Apologie de la névrose suivi de En écho. Paris: L'Harmattan.
Buci-Glucksmann, C. (2003). Esthétique de l'éphémère. Paris: Éditions Galilée.
Denizeau, G. (2011). Palais idéal du facteur cheval: Le palais idéal, le tombeau, les écrits. Paris: Nouvelles Éditions Scala.
Dokic, J., & Égré, P. L'identité des qualia et le critère de Goodman. http://j.dokic.free.fr/philo/pdfs/goodman_de1.pdf, https://fr.scribd.com/document/144118802/L-identite-des-qualia-et-le-critere-de-Goodman-pdf.
Dummett, M. (1978). Truth and other enigmas. Cambridge, MA: Harvard University Press.
Gleick, J. (1989). La Théorie du Chaos. Paris: Flammarion.
Hegel, G. F. W. (1835). Esthétique, tome premier. Traduction française de Ch. Bénard. (posth.). http://www.uqac.uquebec.ca/zone30/Classiques_des_sciences_sociales/index.html.
Herrera, H. (2003). Frida: biographie de Frida Kahlo. Paris: Livre de poche.
Martin, M. (2017). La nomination dans l'art – Étude des œuvres de Mircea Bochis, peintre et sculpteur. Paris: Éditions L'Harmattan.
Melville, H. (2011). Moby Dick. Paris: Editions Phébus.
Platon. (1879). L'État ou la République de Platon. Traduction nouvelle par Bastien, Augustin. Paris: Garnier frères. http://catalogue.bnf.fr/ark:/12148/cb31121998c.
Wiart, C. (1967). Expression picturale et psychopathologie. Essai d'analyse et d'automatique documentaires (principe – méthodes – codification). Paris: Editions Doin.
Wiener, N. (1948). Cybernetics or control and communication in the animal and the machine. Cambridge, MA: MIT Press.

Asian Americans Advancing Justice

Francis Dalisay
Communication & Fine Arts, College of Liberal Arts & Social Sciences, University of Guam, Mangilao, GU, USA

Asian Americans Advancing Justice (AAAJ) is a national nonprofit organization founded in 1991. It was established to empower Asian Americans, Pacific Islanders, and other underserved groups, ensuring a fair and equitable society for all. The organization's mission is to promote justice, unify local and national constituents, and empower communities. To this end, AAAJ dedicates itself to developing public policy, educating the public, litigating, and facilitating the development of grassroots organizations. Some of their recent accomplishments have included increasing Asian Americans and Pacific Islanders' voter turnout and access to polls, enhancing immigrants' access to education and employment opportunities, and advocating for greater protections of rights as they relate to the use of "big data."

The Civil Rights Principles for the Era of Big Data

In 2014, AAAJ joined a diverse coalition comprising civil, human, and media rights groups, such as the ACLU, the NAACP, and the Center for Media Justice, to propose, sign, and release the "Civil Rights Principles for the Era of Big Data." The coalition acknowledged that progress and advances in technology would foster improvements in the quality of life of citizens and help mitigate discrimination and inequality. However, because various types of "big data" tools and technologies – namely, digital surveillance, predictive analytics, and automated decision-making – could potentially ease the degree to which businesses and governments are able to encroach upon the private lives of citizens, the coalition found it critical that such tools and technologies are developed and employed with the intention of respecting equal opportunity and equal justice.

According to civilrights.org (2014), the Civil Rights Principles for the Era of Big Data proposes five key principles: (1) stop high-tech profiling, (2) guarantee fairness in automated decisions, (3) maintain constitutional protections, (4) enhance citizens' control of their personal information, and (5) protect citizens from inaccurate data. These principles were intended to inform law enforcement, companies, and policymakers about the impact of big data practices on racial justice and the civil and human rights of citizens.

1. Stop high-tech profiling. New and emerging surveillance technologies and techniques have made it possible to piece together comprehensive details on any citizen or group, resulting in an increased risk of profiling and discrimination. For instance, it was alleged that police in New York had used license plate readers to document vehicles that were visiting certain mosques; this allowed the police to track where the vehicles were traveling. The accessibility and convenience of this technology meant that this type of surveillance could happen without policy constraints. The principle of stopping high-tech profiling was thus intended to limit such acts through setting clear limits and establishing auditing procedures for surveillance technologies and techniques.
2. Ensure fairness in automated decisions. Today, computers are responsible for making critical decisions that have the potential to affect the lives of citizens in the areas of health, employment, education, insurance, and lending. For example, major auto insurers are able to use monitoring devices to track drivers' habits, and as a result, insurers could potentially deny the best coverage rates to those who often drive when and where accidents are more likely to occur. The principle of ensuring fairness in automated decisions advocates that computer systems should operate fairly in situations and circumstances such as the one described. The coalition recommended, for instance, that independent reviews be employed to assure that systems are working fairly.
3. Preserve constitutional protections. This principle advocates that government databases must be prohibited from undermining core legal protections, including those concerning citizens' privacy and their freedom of association. Indeed, it has been argued that data from warrantless surveillance conducted by the National Security Agency have been used by federal agencies, including the DEA and the IRS, even though such data were gathered outside the policies that rule those agencies. Individuals with access to government databases could also potentially use them for improper purposes. The principle of preserving constitutional protections is thus intended to limit such instances from occurring.
4. Enhance citizens' control of their personal information. According to this principle, citizens should have direct control over how corporations gather data from them, and how corporations use and share such data. Indeed, personal and private information known and accessible to a corporation can be shared with companies and the government. For example, unscrupulous companies can find vulnerable customers through accessing and using highly targeted marketing lists, such as one that might contain the names and contact information of citizens who have cancer. In this case, the principle of enhancing citizens' control of personal information ensures that the government and companies should not be able to disclose private information without a legal process to do so.
5. Protect citizens from inaccurate data. This principle advocates that when it comes to making important decisions about citizens – particularly the disadvantaged (the poor, persons with disabilities, the LGBT community, seniors, and those who lack access to the Internet) – corporations and the government should work to ensure that their databases contain accurate personal information about citizens. Ensuring the accuracy of data could require disclosing the underlying data and granting citizens the right to correct information that is inaccurate. For instance, government employment verification systems have had higher error rates for legal immigrants and individuals with multiple surnames (including many Hispanics) than for other legal workers; this has created a barrier to employment. In addition, some individuals have lost job opportunities because of inaccuracies in their criminal history information, or because their information had been expunged.

The five principles above continue to help inspire subsequent movements highlighting the growing need to strengthen and protect civil rights in the face of technological change. Asian Americans Advancing Justice and the other members of the coalition also continue to advocate for these rights and protections.

Cross-References

▶ American Civil Liberties Union
▶ Centers for Disease Control and Prevention (CDC)

Further Reading

Civil rights and big data: Background material. http://www.civilrights.org/press/2014/civil-rights-and-big-data.html. Accessed 20 June 2016.
Association Analysis

▶ Data Mining

Association Versus Causation

Weiwu Zhang1 and Matthew S. VanDyke2
1College of Media and Communication, Texas Tech University, Lubbock, TX, USA
2Department of Communication, Appalachian State University, Boone, NC, USA

Scientific knowledge provides a general understanding of how things in the world are connected to one another. It is useful in providing a means of categorizing things (typology), a prediction of future events, an explanation of past events, and a sense of understanding about the causes of a phenomenon (causation). Association, also called correlation or covariation, is an empirical and statistical relationship between two variables such that changes in one variable are connected to changes in the other. However, association in and of itself does not necessarily imply a causal relationship between the two variables. It is only one of several necessary criteria for establishing causation. The other two criteria for causal relationships are time order and non-spurious relationships. While the advance of big data makes it possible and more effective to capture a tremendous number of correlations and predictions than ever before, and statistical analyses may assess the degree of association between variables with continuous data analyzed from big datasets, one must consider the theoretical underpinning of the study and how the data were collected (i.e., in a manner in which measurement of an independent variable precedes measurement of a dependent variable) in order to determine whether a causal relationship is valid.

The purpose of this entry is to focus on association and one function of scientific knowledge – causation: what they are, how they relate to and differ from each other, and what role big data plays in this process.

Association

A scientific theory states the relationships between concepts or variables in ways that describe, predict, and explain how the world operates. One type of relationship between variables is association or covariation. In this relationship, changes in the values of one variable are related to changes in the values of the other variable. In other words, the two variables shift their values together. Some statistical procedures are needed to establish association. To determine whether variable A is associated with variable B, we must see how the values of variable B shift when two or more values of variable A occur. If values in variable B shift systematically with each of the levels of variable A, then we can say there is an association between variables A and B. For example, to determine whether aggressiveness is really associated with exposure to violent television programs, we must observe aggressiveness under at least two levels of exposure to violent television programs, such as high exposure and low exposure. If a higher level of aggressiveness is found under the condition of higher exposure to violent television programs than under the condition of lower exposure, we can conclude a positive association between exposure to television violence and aggressiveness. If a lower level of aggressiveness is observed under the condition of higher exposure to violent television programs than under the condition of lower exposure, we can conclude a negative or inverse association between the two variables. Both situations indicate that exposure to television violence and aggressiveness are associated, or covary.

To claim that variable A is a cause of variable B, the two variables must be associated with one another. If high and low viewing of violent programs on television are equally related to the level of aggressiveness, then there is no association between watching television violence and aggressiveness. In other words, knowing a person's viewing of violent programs on television does not help in any way in predicting that person's level of aggressiveness. In this case, watching television violence cannot be a cause of aggressiveness. On the other hand, simple association between these
two variables does not imply causation. Other criteria are needed to establish causation.

A dominant theoretical framework in media communication research is agenda-setting theory. McCombs and colleagues' research suggests that there is an association between prominent media coverage and what people tend to think about. That is, media emphasis on certain issues tends to be associated with the perceived importance of those issues among the public. Recent research has examined the agenda-setting effect in the context of big data, for example, assessing the relationship between digital content produced by traditional media outlets (e.g., print, television) and user-generated content (i.e., blogs, forums, and social media). While agenda-setting research typically identifies associations between the prominence of media coverage of some issues and the importance the public attaches to those issues, research designs must account for the sequence (i.e., time order) in which variables occur. For example, while it is plausible to think that media coverage influences what the public thinks about, in the age of new media the public also plays an increasingly important role in influencing what is covered by news media outlets. Such explorations are questions of causality and would require a consideration of the time order sequence between variables. Additionally, potential external causes of variation must be considered in order to truly establish causation.

Time Order

A second criterion for establishing causality is that a cause (independent variable) should take place before its effect (dependent variable). This means that changes in the independent variable should influence changes in the dependent variable, but not vice versa. This is also called the direction of influence (from independent variable to dependent variable). For some relationships in social research, the time order or direction of influence is clear. For instance, one's parents' education always occurs before their children's education. For others, the time order is not easy to determine. For example, while it is easy to find that viewing television violence and aggressiveness are related, it is much harder to determine which variable causes the changes in the other. One plausible explanation is that the more one views television violence, the more one imitates the violent behavior on television and becomes more aggressive (per social learning theory). An equally plausible interpretation is that an aggressive person is usually attracted to violent television programs. Without any convincing evidence about the time order or direction of influence, there is no sound basis for determining which is the cause (independent variable) and which is the effect (dependent variable).

Some research designs, such as controlled experiments, make it easier to decide on the time order of influence. Recent research examining people's use of mobile technology employed a field experiment to understand people's political web-browsing behavior. For example, Hoffman and Fang tracked individuals' web-browsing behavior over 4 months to determine predictors (e.g., political ideology) of the amount of time individuals spend browsing certain political content over others. Such research is able to establish that some preexisting characteristic predicts, or some manipulation causes, a change in the outcome of web-browsing behavior.

Non-spurious Relationships

This is the third essential criterion for establishing a causal relationship: the relationship between two variables must not be caused by variation in a third or extraneous variable. This means that the seeming association between two variables might be caused by a common third or extraneous variable (a spurious relationship) rather than by an influence of the presumed independent variable on the dependent variable. One well-known example is the association between a person's foot size and verbal ability in the 2010 US Census. If you believed that association or correlation implies causation, you might think that one causes the other. But the apparent relationship between foot size and verbal ability is a spurious one, because foot size and verbal ability are both linked to a common third
variable – age. As one grows older, one's foot size becomes larger, and as one grows older, one becomes better at communicating, but there is no logical and inherent relationship between foot size and verbal ability. To return to the agenda-setting example, perhaps a third variable would influence the relationship between media issue coverage and the importance the public attaches to issues. For example, perhaps the nature of issue coverage (e.g., emotional coverage, coverage of issues of personal importance) would influence what the public thinks about issues presented by the media. Therefore, when we infer a causal relationship from an observed association, we need to rule out the influence of a third variable (or rival hypothesis) that might have created a spurious relationship between the variables.
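The logic of this third-variable check can be made concrete with a short, self-contained sketch. The Python code below is illustrative only and is not drawn from the entry: it simulates foot size, verbal ability, and age for a sample of children, then compares the raw correlation between foot size and verbal ability with the correlation that remains once the common third variable, age, is statistically controlled (a simple partial correlation). The variable names and simulated values are assumptions made for the example.

# Illustrative sketch (assumed data): an apparent association between foot
# size and verbal ability disappears once the common third variable (age)
# is controlled for.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(5, 18, n)                        # the third (extraneous) variable
foot_size = 15 + 0.6 * age + rng.normal(0, 1, n)   # driven by age
verbal = 20 + 3.0 * age + rng.normal(0, 5, n)      # also driven by age

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def residuals(y, x):
    # Remove the linear effect of x from y
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

print("raw correlation, foot size vs. verbal ability:",
      round(corr(foot_size, verbal), 2))             # strong, but spurious
print("correlation after controlling for age:",
      round(corr(residuals(foot_size, age), residuals(verbal, age)), 2))  # near zero

In this simulation the raw correlation is high even though neither variable influences the other; once age is partialled out, the association essentially vanishes, which is exactly the pattern that marks a spurious relationship.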
In conclusion, despite the accumulation of an enormous number of associations or correlations in the era of big data, association still does not supersede causation. To establish causation, the criteria of time order and non-spurious relationships must also be met, with a sound theoretical foundation, in the broader context of big data.

Cross-References

▶ Association Analysis
▶ Correlation Versus Causation
▶ Social Sciences

Further Reading

Babbie, E. (2007). The practice of social research (11th ed.). Belmont: Wadsworth.
Hermida, A., Lewis, S., & Zamith, R. (2014). Sourcing the Arab spring: A case study of Andy Carvin's sources on Twitter during the Tunisian and Egyptian revolutions. Journal of Computer-Mediated Communication, 19(3), 479.
Hoffman, L., & Fang, H. (2014). Quantifying political behavior on mobile devices over time: A user evaluation study. Journal of Information Technology & Politics, 11(4), 435.
Mahrt, M., & Scharkow, M. (2013). The value of big data in digital media research. Journal of Broadcasting & Electronic Media, 57(1), 20.
McCombs, M. (2004). Setting the agenda: The mass media and public opinion. Cambridge, UK: Polity.
Reynolds, P. (2007). A primer in theory construction. Boston: Pearson/Allyn & Bacon.
Shoemaker, P., Tankard, J., & Lasorsa, D. (2004). How to build social science theories. Thousand Oaks: Sage.
Singleton, R., & Straits, B. (2010). Approaches to social research (5th ed.). New York: Oxford University Press.

Astronomy

R. Elizabeth Griffin
Dominion Astrophysical Observatory, British Columbia, Canada

Definition

The term "Big Data" is severally defined and redefined by many in the fields of scientific observations and the curation and management thereof. Probably first coined in reference to the large volumes of images and similar information-rich records promised by present-day and near-future large-scale, robotic surveys of the night sky, the term has come to be used in reference to the data that result from almost any modern experiment in astronomy, and in so doing has lost most of the special attributes which it was originally intended to convey. There is no doubt that "big" is only relative, and scientific data have always presented the operator with challenges of size and volume, so a reasonable definition is also a relative one: "big data" refers to any set or series of data that are too large to be managed by existing methods and tools without a major rethink and redevelopment of technique and technology, be they hardware or software.

According to the above definition, "big data" have always featured in astronomy. The early observers were acutely aware of visible changes manifested by certain stars, so the attendant need to make comparisons between observations called for an ability to recover past details. Those details necessarily involved accurate meta-data such as object ID, date, time, and location of the observer, plus some identifier for the observer. The catalogues of observations that were therefore kept (hand-written at first) needed to refer to object
names that were also catalogued elsewhere, so a chain of ever bigger data thus sprang up and developed. Hand-written catalogues gave way to typed or printed ones, each a work of substantial size; the Henry Draper Catalogue of positions and spectral types of the 225,000 brightest stars occupied nine volumes of the Harvard Annals between 1918 and 1924, and established a precedent for compiling and blending catalogued information to collate usefully the most up-to-date information available. The pattern has continued and has expanded as space missions have yielded observations at wavelengths or frequencies not attainable from the ground, or have reached objects that are far too faint for inclusion in the earlier catalogues. New discoveries therefore bear somewhat unglamorous numerical identifiers that reflect the mission or survey concerned (e.g., Comet PanSTARRS C/2012 K1, nova J18073024+4551325, or pulsar PSR J1741–2054). Even devising and maintaining a comprehensive and unique nomenclature system is a challenge in itself to the "big data" emanating from the multiobject spectroscopy and multichannel sweeps of the sky which are now being finalized for operation on large telescopes.

Astronomy's hierarchical development in nomenclature belies an ability to resolve at all adequately either long-standing or newly minted mysteries which their data present. The brighter the star, the older the scheme for naming it, but that does not imply that astronomers have been able to work systematically through solving the puzzles posed by the brighter ones and that only the new observations recorded in the era of "big data" still await attention. Far from it. One of the most puzzling challenges in stellar astronomy involves a star of visual magnitude 2.9 (so it is easily visible to the unaided eye); many stars of magnitudes 3, 4, or 5 also present problems of multiplicity, composition, evolution, or status that call for many more data at many different wavelengths before some of the unknowns can confidently be removed. "Big data" in that sense are therefore needed for all those sorts and conditions, and even then astronomers will probably find good reason to call for yet more.

The concept of "big data" came vividly to the fore at the start of the twenty-first century with the floating of an idea to federate large sets of digital data, such as those produced by the Sloan Digital Survey, the Hubble Space Telescope, or 2MASS, in order to uncover new relationships and unexpected correlations which then-current thinking had not conceived. Even if the concept was not unique in science at the time, it could be soluble only in astronomy because of the highly proficient schemes for data management that then existed uniquely in that science (and which continue to lead other sciences). The outcome – the Virtual Observatory (VO) – constitutes an ideal whereby specified data sets, distributed and maintained at source, can be accessed remotely and filtered according to specified selection criteria; as the name implies, the sources of the observations are data sets rather than telescopes. But since the astronomical sources involved had resulted from multinational facilities, national VO projects soon became aligned under an International VO Alliance. The ideal of enabling data from quite disparate experiments to be merged effectively required adherence to certain initial parameters such as data format and descriptors, and the definition of minimum meta-data headings; those are now set out in the "VO Table."

Because objects in the cosmos can change, on any time-scale, and either periodically or unexpectedly (or both), the matter of storing astronomy's "big data" has posed prime storage challenges to whichever age bred the equipment responsible. Storage deficits have always been present, and simply assumed different forms depending on whether the limiting factors were scribes to prepare hand-written copies, photographic plates that were not supported by equipment to digitize efficiently all the information on them, or (in pre-Internet days) computers and magnetic tapes of sufficient capacity and speed to cope with the novel demands of CCD frames. Just as modern storage devices now dwarf the expectations of the past, so there is a comfortable assumption that expansions in efficiency, tools, and technologies will somehow cope with the ever increasing demands of the future ("Moore's law"); certainly the current lack of proven adequate devices has not damped astronomy's enthusiasm for multiscale surveys, nor required the planners to keep within data-storage bounds that can actually be guaranteed today.
An important other side to the "data deluge" coin is the inherent ability to share and reuse, perhaps in an interdisciplinary application, whatever data are collected. While astronomers themselves, however ambitious, may not have all the resources at their command to deal exhaustively on their own with all the petabytes of data which their experiments will deliver and will continue to deliver, the multifaceted burdens of interpretation can nowadays be shared by an ever broadening ensemble of brains, from students and colleagues in similar domains to citizen scientists by the thousand. Not all forms of data analysis can be tackled adequately or even correctly by all categories of willing analysts, though the Galaxy Zoo project has illustrated the very considerable potential of amateurs and nonastronomical manpower to undertake fundamental classification tasks at the expenditure of very modest training. Indeed, a potential ability to share data widely and intelligently is the chief saving grace in astronomy's desire to observe more than it can conceivably manage.

Broad sharing of astronomical data has become a possibility and now a reality because ownership of the data has increasingly been defined as public, though that has not always been the case. Observing time has routinely been awarded to individual researchers or research groups on the basis of competitive applications, and (probably stemming from the era of photographic records, when an observation consisted of a tangible product that needed a home) such observations were traditionally regarded as the property of the observatory where they were recorded. Plate archivists kept track of loans to PIs while new observations were being analyzed, but returns were then firmly requested – with the result that, on the whole, astronomical plate stores are remarkably complete. Difficulties arose in the case of privately (as opposed to publicly or nationally) owned observatories, as rulings for publicly funded organizations were not necessarily accepted by privately funded ones, but those problems have tended to evaporate as many modern facilities (and all the larger ones) are multinationally owned and operated.

Authoritarianism

Layla Hashemi
Terrorism, Transnational Crime, and Corruption Center, George Mason University, Fairfax, VA, USA

Individuals in authoritarian societies typically lack freedom of assembly and freedom of the press, but the internet and social media have provided an important voice to many members of authoritarian societies. Social media allows individuals to connect with people of similar minds, share opinions, and find a powerful way to counter the isolation often associated with life in authoritarian societies. However, advances in data collection, sharing, and storage have also drastically reshaped the policies and practices of authoritarian regimes. Digital technologies have not only expanded opportunities for public expression and discussion, they have also improved government capabilities to surveil and censor users and content. This is a complex situation in which, in authoritarian and repressive contexts, technology can be used to stifle and silence dissenting voices. The development of facial recognition and surveillance technology, which have been used to counter crime, can also pose threats to privacy and freedom.

Illustrative of this are the minority populations of Uighurs in China, who are strategically tracked and denied human rights and just treatment by the authoritarian government. Artificial intelligence and digital technology are key in state surveillance of the millions of members of this Chinese minority. One of the great challenges of addressing the use of digital technology and big data in authoritarian contexts is determining what kinds of data are available. These technologies often range from relatively simple information communication technologies (ICTs) such as short message service (SMS) or email to complex data and tools such as geolocation, machine learning, and artificial intelligence.

Residents of authoritarian societies have embraced different forms of social media for
personal and public expression. Currently, among the most popular is Twitter, presently used, for example, by nearly three million people in Iran – despite the platform being banned in the country – as well as in other authoritarian countries such as Turkey (over 13 million users) and Saudi Arabia (over 12 million users). Examining the Twitter activity of millions of activists allows social movement researchers to conduct detailed analyses and determine sources of discontent and the events that trigger mobilization. Also, popular online campaigns (e.g., #MeToo and #BlackLivesMatter) have facilitated mobilization for social justice and public discussion of controversial topics such as sexual harassment, police brutality, and violence across borders. ICT was widely used during the Green Movement and the Arab Spring in the Middle East and North Africa, and media has often served as a means to speak truth to power and express public discontent. Before the establishment of digital technology, forms of media such as newspapers, radio, photography, and film were used by those living in authoritarian contexts to express discontent and grievances.

In the era of the internet, massive amounts of information and data can be shared rapidly at a global scale with few resources. The shift from publishers and legacy media to online platforms and internet communications has expanded the responsibilities of technology and communications regulation in nation-states and in international institutions and transnational corporations, emphasizing the need for increased corporate social responsibility. Even in authoritarian societies, the shift of power to regulate information beyond state actors demonstrates the growing role of technology and complex data flows in personal, corporate, and civic spheres in the digital era.

Further Reading

Jumet, K. D. (2018). Contesting the repressive state: Why ordinary Egyptians protested during the Arab spring. New York: Oxford University Press.
Kabanov, Y., & Karyagin, M. (2018). Data-driven authoritarianism: Non-democracies and big data. In D. A. Alexandrov, A. V. Boukhanovsky, A. V. Chugunov, Y. Kabanov, & O. Koltsova (Eds.), Digital transformation and global society (Communications in computer and information science) (Vol. 858, pp. 144–155). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-02843-5_12.
Mechkova, V., Pemstein, D., Seim, B., & Wilson, S. (2020). Digital Society Project Dataset v2.
Tufekci, Z. (2017). Twitter and tear gas: The power and fragility of networked protest. New Haven/London: Yale University Press.

Authorship Analysis and Attribution

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Synonyms

Authorship profiling; Authorship verification; Stylistics; Stylometry

Introduction

Authorship attribution is a text classification technique used to infer the authorship of a document. By identifying features of writing style in a document and comparing them to features from other documents, a human analyst or a computer can make a determination of stylistic similarity and thus of the plausibility of authorship by any specific person. There are many applications, including education (plagiarism detection), forensic science (identifying the author of a piece of evidence such as a threatening letter), history (resolving questions of disputed works), and journalism (identifying the true authors behind pen names), among others.

Theory of Authorship Attribution

Human language is a complex system that is underconstrained, in the sense that there are
normally many ways to express roughly the same idea. Writers and speakers are therefore forced to make (consciously or unconsciously) choices about the best way to express themselves in any given situation. Some of the choices are obvious – for example, what speakers of American Standard English call "chips," speakers of Commonwealth dialects call "crisps" (and their "chips" Americans call "French fries"). Some authors may use the passive voice often, while others largely avoid it. Sometimes the choice is less noticeable – when you set the table, do you set the fork "to" the left of the plate, "on" the left of the plate, or "at" the left of the plate? While all are grammatically (and semantically) correct, some people have a marked preference for one form over another. If this preference can be detected, it can provide evidence for or against authorship of another document that does or does not match this pattern of preposition use.

Examples of Authorship Attribution in Practice

After some proposals dating back to the nineteenth century [see Juola (2008) for some history], one of the first examples of authorship attribution was the analysis by Mosteller and Wallace (1963) of The Federalist Papers and their authorship. They found, for example, that Alexander Hamilton never used the word "while" (he used the word "whilst" instead), while James Madison was the opposite. More subtly, they showed that, though both men used the word "by," Madison used it much more frequently. From word-based observations like this, Mosteller and Wallace were able to apply Bayesian statistics to infer the authorship of each of the anonymously published The Federalist Papers.

Binongo (2003) used a slightly different method to address the authorship of the 15th book in the Oz series. Originally by L. Frank Baum, the series was taken over after Baum's death by another writer named Ruth Plumly Thompson. The 15th book, The Royal Book of Oz, was published during this gap and has been variously attributed to both authors. Binongo analyzed the fifty most common words in the Oz novels as a whole (a collection of fairly simple words like "the," "of," "after," "with," "that," and so forth) and was able to show via standard statistical techniques that Baum and Thompson had notably different writing styles and that The Royal Book clearly matched Thompson's.

Among the highest profile authorship attribution analyses is Juola's analysis of The Cuckoo's Calling. The novel was published under the pen name "Robert Galbraith," and an anonymous tip on Twitter suggested that the real author was J.K. Rowling of Harry Potter fame. Juola analyzed several different aspects of "Galbraith's" writing style and showed that Galbraith had a very similar grammar to Rowling, used many of the same types of words as Rowling, had about the same complexity of vocabulary as Rowling, used morphemes in the same way Rowling did, and even put words together into pairs like Rowling did. The obvious conclusion, which he drew, is that Rowling was, in fact, Galbraith, a conclusion that Rowling herself confirmed a few days later.

How Does It Work?

The general method in these cases (and many others) is very similar and relies on a well-known data classification framework. From a set of known documents, extract a set of features (e.g., Binongo's features were simply the fifty most common words, while Juola's features included the lengths of words, the set of all adjacent word pairs, and the set of letter clusters like the "tion" at the end of "attention"). These sets of features can then be used as elements to classify unknown works using a standard classification system such as a support vector machine, a nearest-neighbor classifier, a deep learning system, or many others. Similarly, scholars have documented more than 1000 proposed feature types that could be used in such a system. Despite or perhaps because of the open-ended nature of this framework, the search for best practices and most accurate attribution methods continues.

One key component of this search is the use of bake-off style competitive evaluations, where
researchers are presented with a common data set and invited to analyze it. Juola (2008) describes a 2004 competition in detail. Other examples of this kind of evaluation include the Plagiarism Action Network (PAN) workshops held annually from 2011 to this writing (as of 2018) and the 2017 Forensic Linguistics Dojo sponsored by the International Association of Forensic Linguists. Activities like these help scholars identify, evaluate, and develop promising methods.
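As a rough illustration of this framework (not a reproduction of any particular published study), the short Python sketch below builds features from the most common words with scikit-learn and trains a standard classifier on documents of known authorship. The sample texts, the author labels, the fifty-word feature limit, and the choice of a linear support vector machine are all placeholder assumptions; a real analysis would use full-length documents, many more feature types, and careful validation.

# Minimal sketch of the feature-and-classifier framework described above.
# The tiny "corpora" below are placeholders; a real study would use full
# documents by each candidate author.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

known_texts = [
    "the road was long and the night was cold and dark",    # author A
    "she said that the letter had been lost in the night",  # author A
    "upon the hill there stood an old mill by the stream",  # author B
    "it is a truth acknowledged upon every hill and field",  # author B
]
known_authors = ["A", "A", "B", "B"]

# Features: counts of the most common words (word pairs, character
# clusters, or word lengths could be added in the same way).
vectorizer = CountVectorizer(max_features=50)
X = vectorizer.fit_transform(known_texts)

# Any standard classifier can sit on top of the feature vectors.
classifier = LinearSVC()
classifier.fit(X, known_authors)

questioned = ["the night was cold and the long road was dark"]
print(classifier.predict(vectorizer.transform(questioned)))  # closed-class guess: 'A' or 'B'

Note that this is the closed-class setting; open-class attribution and authorship verification require an explicit way of answering "none of the above," for example by thresholding a similarity or confidence score rather than always returning the nearest known author.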
Using techniques derived from big data research,
individual and group attributes of language can be
Other Related Problems identified and used to identify authors by the
attributes their writings share.
Authorship attribution as defined above is actu-
ally only one of the many similar problems.
Scholars traditionally divide attribution prob- Cross-References
lems into two types; the open-class problem
and closed-class problem. In the (easier) ▶ Bibliometrics/Scientometrics
closed-class problem, the analyst is told to
assume that the answer is one of the known
possible authors – for example, it must be Further Reading
Baum or Thompson, but it can’t be an unknown
third party. In the open-class variation, “none of Binongo, J. N. G. (2003). Who wrote the 15th book of Oz?
the above” is an acceptable answer; the An application of multivariate analysis to authorship
attribution. Chance, 16, 9.
unknown document might have been written
Juola, P. (2008). Authorship attribution. Foundations and
by someone else. Open-class problems with trends ® in information retrieval, 1(3), 233–334.
only one known author are typically referred to Juola, P. (2015). The Rowling case: A proposed stan-
as “authorship verification” problems, as the dard analytic protocol for authorship questions. Dig-
ital Scholarship in the Humanities, 30(Suppl. I),
problem is more properly framed as “was the
fqv040.
unknown document written by this (known) Koppel, M., Schler, J., & Argamon, S. (2009). Computa-
person, or wasn’t it?”. Authorship verification tional methods in authorship attribution. Journal of the
is widely recognized as being more difficult Association for Information Science and Technology,
60(1), 9–26.
than simple attribution among a closed-class
Mosteller, F., & Wallace, D. L. (1963). Inference in an
group. authorship problem: A comparative study of discrimi-
Sometimes, authorship scholars are asked to nation methods applied to the authorship of the dis-
infer not the identity but the characteristics of puted federalist papers. Journal of the American
Statistical Association, 58, 275.
the author of a document. For example, was
Stamatatos, E. (2009). A survey of modern authorship
this document written by a man or a woman? attribution methods. Journal of the American Society
Where was the author from? What language for Information Science and Technology, 60, 538.
did the author grow up speaking? These ques-
tions and many others like them have been
studied as part of the “authorship profiling”
problem. Authorship profiling can be Authorship Profiling
addressed using largely the same methods, for
example, by extracting features from a large ▶ Authorship Analysis and Attribution
Authorship Verification

▶ Authorship Analysis and Attribution

Automated Modeling/Decision Making

Murad A. Mithani
School of Business, Stevens Institute of Technology, Hoboken, NJ, USA

Big data promises a significant change in the nature of information processing and, hence, decision making. The general reaction to this trend is that the access and availability of large amounts of data will improve the quality of individual and organizational decisions. However, there are also concerns that our expectations may not be entirely correct. Rather than simplifying decisions, big data may actually increase the difficulty of making effective choices. I synthesize the current state of research and explain how the fundamental implications of big data offer both a promise for improvement and a challenge to our capacity for decision making.

Decision making pertains to the identification of the problem, understanding of the potential alternatives, and the evaluation of those alternatives to select the ones that optimally resolve the problem. While the promise of big data relates to all aspects of decision making, it more often affects the understanding, the evaluation, and the selection of alternatives. The resulting implications comprise the dual decision model, higher granularity, objectivity, and transparency of decisions, and bottom-up decision making in organizational contexts. I explain each of these implications in detail to illustrate the associated opportunities and challenges.

With data and information exceeding our capacity for storage, there is a need for decisions to be made on the fly. While this does not imply that all decisions have to be immediate, our inability to store large amounts of data that are often generated continuously suggests that decisions pertaining to the use and storage of data, and therefore the boundaries of the eventual decision-making context, need to be defined earlier in the process. With the parameters of the eventual decision becoming an a priori consideration, big data is likely to overcome the human tendency of procrastination. It imposes the discipline to recognize the desired information content early in the process. Whether this entails decision processes that prefer immediate conclusions, or early choices that are limited to the identification of critical information to be used for later evaluation, the dual decision model, with a preliminary decision far removed from the actual decision, offers an opportunity to examine the available alternatives more comprehensively. It allows decision makers to have a greater understanding of the alignment between goals and alternatives. Compare this situation to the recruitment model for a human resource department that screens as well as finalizes prospective candidates in a single round of interviews, or separates the process into two stages where the potential candidates are first identified from the larger pool and then selected from the short-listed candidates in the second stage. The dual decision model not only facilitates greater insights, it also eliminates the fatigue that can seriously dampen the capacity for effective decisions. Yet this discipline comes at a cost. Goals, values, and biases that are part of the early phase of a project can leave a lasting imprint. Any realization later in the project that was not deliberately or accidentally situated in the earlier context becomes more difficult to incorporate into the decision. In the context of recruitment, if the skills desired of the selected candidate change after the first stage, it is unlikely that the short-listed pool will rank highly in that skill. The more unique the requirement that emerges in the later stage, the greater the likelihood that it will not be sufficiently fulfilled. This tradeoff suggests that an improvement in our understanding of the choices comes at the cost of limited maneuverability of an established decision context.
In addition to the benefits and costs of early decisions in the data generation cycle, big data allows access to information at a much more granular level than was possible in the past. Behaviors, attitudes, and preferences can now be tracked in extensive detail, fairly continuously, and over longer periods of time. They can in turn be combined with other sources of data to develop a broader understanding of consumers, suppliers, employees, and competitors. Not only can we understand in much more depth the activities and processes that pertain to various social and economic landscapes, a higher level of granularity makes decisions more informed and, as a result, more effective. Unfortunately, granularity also brings with it the potential for distraction. All the data that pertain to a choice may not be necessary for the decision, and excessive understanding can overload our capacity to make inferences. Imagine the human skin, which is continuously sensing and discarding thermal information generated from our interaction with the environment. What if we had to consciously respond to every signal detected by the skin? It is this loss of granularity, achieved by a human mind responsive only to significant changes in temperature, that saves us from being overwhelmed by data. Even though information granularity makes it possible to know what was previously impossible, information overload can lead us astray towards inappropriate choices, and at worst, it can incapacitate our ability to make effective decisions.

The third implication of big data is the potential for objectivity. When a planned and comprehensive examination of alternatives is combined with a deeper understanding of the data, the result is more accurate information. This makes it less likely for individuals to come to an incorrect conclusion. It eliminates the personal biases that can prevail in the absence of sufficient information. Since the traditional response to overcome the effect of personal bias is to rely on individuals with greater experience, big data predicts an elimination of the critical role of experience. In this vein, Andrew McAfee and Erik Brynjolfsson (2012) find that, regardless of the level of experience, firms that extensively rely on data for decision making are, on average, 6% more profitable than their peers. This suggests that as decisions become increasingly imbibed with an objective orientation, prior knowledge becomes a redundant element. This, however, does not eliminate the value of domain-level experts. Their role is expected to evolve into individuals who know what to look for (by asking the right questions) and where to look (by identifying the appropriate sources of data). Domain expertise, and not just experience, is the mantra to identify people who are likely to be the most valuable in this new information age. However, it needs to be acknowledged that this belief in objectivity is based on a critical assumption: individuals endowed with identical information that is sufficient and relevant to the context reach identical conclusions. Yet anyone watching the same news story reported by different media outlets knows the fallacy of this assumption. The variations that arise when identical facts lead individuals to contrasting conclusions are a manifestation of the differences in the way humans work with information. Human cognitive machinery associates meanings to concepts based on personal history. As a result, even while being cognizant of our biases, the translation of information into conclusion can be unique to individuals. Moreover, this effect compounds with the increase in the amount of information that is being translated. While domain experts may help ensure consistency with the prevalent norms of translation, there is little reason to believe that all domain experts are generally in agreement. Consensus is possible in the domains of the physical sciences, where objective solutions, quantitative measurements, and conceptual boundaries leave little ambiguity. However, the larger domain of human experience is generally devoid of standardized interpretations. This may be one reason that a study by the Economist Intelligence Unit (2012) found a significantly higher proportion of data-driven organizations in industrial sectors such as natural resources, biotechnology, healthcare, and financial services. Lack of extensive reliance on data in the other industries is symptomatic of our limited ability for consensual interpretation in areas that challenge the positivistic approach.

The objective nature of big data produces two critical advantages for organizations. The first is transparency. A clear link between data, information, and decision implies the absence of personal
This suggests that as decisions become tion, and decision implies the absence of personal
Automated Modeling/Decision Making 63

and organizational biases. Interested stakeholders information, lean organizations of the future may
can take a closer look at the data and the associ- decrease the flow of information altogether,
ated inferences to understand the basis of conclu- replacing it with data-driven, contextually rich, A
sions. Not only does this promise a greater buy-in and objective findings. In fact, this is imminent
from participants that are affected by those deci- since the dual decision model defines the bound-
sions, it develops a higher level of trust between aries of subsequent choices. Any attempt to disen-
decision makers and the relevant stakeholders, gage the later decision from the earlier one is likely
and it diminishes the need for external monitoring to eliminate the advantages of granularity and
and governance. Thus, transparency favors the objectivity. Flatter organizations of the future will
context in which human interaction becomes eas- delegate not because managers have greater faith in
ier. It paves the way for richer exchange of infor- the lower cadres of the organization but because
mation and ideas. This in turn facilitates the individuals at the lower levels are the ones that are
quality of future decisions. But due to its very likely to be best positioned to make timely deci-
nature, big data makes replications rather difficult. sions. As a result, big data is moving us towards a
The time, energy, and other resources required to bottom-up model of organizational decisions
fully understand or reexamine the basis of choices where people at the interface between data and
makes transparency not an antecedent but a con- findings determine the strategic priorities within
sequence of trust. Participants are more likely to which higher-level executives can make their call.
believe in transparency if they already trust the Compare this with the traditional top-down model
decision makers, and those that are less receptive of organizational decisions where strategic choices
to the choices remain free to accuse the process as of the higher executives define the boundaries of
opaque. Regardless of the comprehensiveness of actions for the lower-level staff. However, the
the disclosed details, transparency largely remains bottom-up approach is also fraught with chal-
a symbolic expression of the participants’ faith in lenges. It minimizes the value of executive vision.
the people managing the process. The subjective process of environmental scanning
A second advantage that arises from the objec- allows senior executives to imbibe their valued
tive nature of data is decentralization. Given that preferences into organizational choices through
decisions made in the presence of big data are more selective attention to information. It enables orga-
objective and require lower monitoring, they are nizations to do what would be uninformed and at
easier to delegate to people who are closer to the times, highly irrational. Yet it sustains the spirit of
action. By relying on proximity and exposure as beliefs that take the form of entrepreneurial action.
the basis of assignments, organizations can save By setting up a mechanism where facts and find-
time and costs by avoiding the repeated concentra- ings run supreme, organization of the future may
tion and evaluation of information that often occurs constrain themselves to do only what is
at the various hierarchical levels as the information measureable. Extensive reliance on data can impair
travels upwards. So unlike the flatter organizations our capacity to imagine what lies beyond the hori-
of the current era which rely on the free flow of zon (Table 1).

Automated Modeling/Decision Making, Table 1 Opportunities and challenges for the decision implications of
big data
Big data implication Opportunity Challenge
1. Dual decision model Comprehensive examination of Early choices can constrain later considerations
alternatives
2. Granularity In-depth understanding Critical information can be lost due to
information overload
3. Objectivity Lack of dependence on experience Inflates the effect of variations in translation
4. Transparency Free-flow of ideas Difficult to validate
5. Bottom-up decision Prompt decisions Impairment of vision
making
In sum, the big data revolution promises a change in the way individuals and organizations make decisions. But it also brings with it a host of challenges. The opportunities and threats discussed in this article reflect different facets of the implications that are fundamental to this revolution. They include the dual decision model, granularity, objectivity, transparency, and the bottom-up approach to organizational decisions. Table 1 summarizes how the promise of big data is an opportunity as well as a challenge for the future of decision making.

Cross-References

▶ Big Data Quality
▶ Data Governance
▶ Decision Theory

Further Reading

Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
Economist Intelligence Unit. (2012). The deciding factor: Big data & decision making. New York: Capgemini/The Economist.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 61–67.

Aviation

Kenneth Button
Schar School of Policy and Government, George Mason University, Arlington, VA, USA

Introduction

In 2018 globally there were some 38.1 million commercial flights carrying about 4.1 billion passengers. Air freight is also now one of the major contributors to the global supply chain. The world's 1770 dedicated cargo aircraft did 255 billion ton-kilometers in 2017 and, along with the cargo carried in the belly holds of scheduled passenger planes, combined to carry about 40% of world trade by value. Aviation is a large, complex, multifaceted, capital-intensive industry. It requires considerable amounts of data and their efficient analysis to function safely and economically.

Big data are widely used by commercial aviation and in a variety of different ways. They are important for weather predictions, maintenance planning, crew scheduling, fare setting, and so on. They play a crucial role in the efficiency of air navigation service providers and in the economics of airline operations. They are also important for the safety and security of passengers and cargo.

Weather

Flying is an intrinsically dangerous activity. It involves a continual fight against gravity and confronting other natural elements, especially weather. As enhanced technology has allowed longer and higher flights over more difficult terrains, the demands for more accurate weather forecasts have grown. The predictions made for weather now involve high-altitude wind directions and intensity and, with flights of 19 hours or more possible, take a much longer perspective than in the past. Fog or very low ceilings can prevent aircraft from landing and taking off, while turbulence and icing are also significant in-flight hazards. Thunderstorms are a problem because of severe turbulence and icing due to the heavy precipitation, as well as hail, strong winds, and lightning, which can cause severe damage to an aircraft in flight. Locally, prior knowledge of shifts in wind intensity and direction allows airports to plan for changes in runway use, permitting advantage to be taken of prevailing headwinds.

Traditionally, forecasting largely relied on data gathered from ground stations and weather balloons that recorded changes in barometric pressure, current weather conditions, and sky condition or cloud cover, and on manual calculations using simple models for forecasting. Distance between reporting sites, limited measuring techniques, lack of
adequate mechanical computational equipment, and inadequate models resulted in poor reliability. It was only in 1955, with the advent of computer simulation, that numerical weather predictions became possible. Today, with satellite data gathering, as well as strategically placed instruments to measure temperature, pressure, and other parameters on the surface of the planet as well as in the atmosphere, massive amounts of data are available in real time. Manipulating the vast data sets and performing the complex calculations necessary for modern numerical weather prediction then require some of the most powerful supercomputers in the world (Anaman et al. 2017).

Weather data is increasingly being combined with other big data sets. The European Aviation Safety Agency's Data4Safety program collects and gathers all data that may support the management of safety risks at the European level. This includes safety reports (or occurrences), telemetry data generated by an aircraft via its flight data recorders, and surveillance data from air traffic, as well as weather data. Similarly, near real-time weather data can be used not only to enhance safety and to avoid the discomforts of turbulence but also to reduce airline costs by adjusting routings to minimize fuel consumption – aviation fuel is about 17% of a commercial airline's costs.

Maintenance and Monitoring

Aircraft are complex machines. Correct maintenance and repair are important – the "airplane's health." To this end there are regularly scheduled inspections and services. This has been supplemented more recently by in-service monitoring and recording of a plane's performance in flight. The aerospace industry is now utilizing the enormous amount of data transmitted via sensors embedded on airplanes to preempt problems (Chen et al. 2016). Boeing, for example, analyzes two million conditions daily across 4000 aircraft as a part of its Airplane Health Management system. Pratt and Whitney have fitted about 5000+ sensors on their PW1000G engines for the Bombardier C Series and are generating about 10 GB of data per second. This provides engineers with millions of pieces of data, so they can respond to emerging problems immediately after landing rather than waiting for the next scheduled service. For example, if a part needs to be replaced, the system can send a message and the part is available on landing, and plane turnaround times are reduced.

From the business perspective, this allows more rapid rescheduling of hardware and crew should the problem-solving and remedy require significant downtime. As in most cases, it is not the big data themselves that are important; it is rather the ability to look at the data through machine learning, data mining, and so on that allows relevant information to be extracted, isolated, and analyzed. This helps airlines make better commercial decisions as well as improving safety.
and gathers all data that may support the manage-
ment of safety risks at European level. This Business Management
includes safety reports (or occurrences), telemetry
data generated by an aircraft via their flight data On the demand side, since the world began to
recorders, and surveillance data from air traffic, as deregulate air transportation in the late 1970s, air-
well as weather data. Similarly, near real-time lines have been increasingly free to set fares and
weather data can be used not only to enhance cargo rates, and to determine their routes served
safety and to avoid the discomforts of turbulence and schedules. To optimize these, and thus their
but also to reduce airline costs by adjusting rout- revenues, airlines use big data sets regarding their
ings to minimize fuel consumption – aviation fuel consumers behavior, and this includes data on indi-
is about 17% of a commercial airline’s costs. viduals’ computer searches even when no seats are
booked (Carrier and Fiig 2018). Many millions of
pieces of data are collected daily regarding the type
Maintenance and Monitoring of ticket individuals buy, the frequency individuals
fly, where to and in what class of seat, and their
Aircraft are complex machines. Correct maintenance special needs, if any. They also obtain information
and repair are important – the “airplane’s health.” To on add-on purchases, such as additional baggage
this end there are regularly scheduled inspections and allowances and meals. Added to this there are data
services. This has been supplemented more recently available from credit card, insurance, and car rental
by in-service monitoring and recording of plane’s companies, hotels, and other sectors whose prod-
performance in flight. The aerospace industry is ucts are marketed with a flight.
now utilizing the enormous amount of data transmit- This enables airlines to build up profiles of their
ted via sensors embedded on airplanes to preempt customer bases and tailor service/fare packages to
problems (Chen et al. 2016). Boeing, for example, these. For example, easyJet uses an artificially intel-
analyzes two million conditions daily across 4000 ligent algorithm that determines seat pricing auto-
aircraft as a part of its Airplane Health Management matically, depending on demand and to allows it
system. Pratt and Whitney have fitted about 5000+ to analyze historical data to predict demand patterns
sensors on its PW1000G engines for the Bombardier up to a year in advance. United Airlines use their
C Series and are generating about 10 GB of data per “collect, detect, act” protocol to analyze over 150
second. It provides engineers with millions of pieces variables in each customer profile with the objective
of information that can be used to inspect and of enhancing their yield management model. Delta
66 Aviation
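The arithmetic behind overbooking can be illustrated with a deliberately simplified expected-value model. In the sketch below, every ticketed passenger is assumed to show up independently with the same probability, and the airline chooses how many tickets to sell on a 180-seat aircraft by weighing fare revenue against denied-boarding compensation; all of the numbers are invented for illustration, and real revenue-management systems are considerably richer.

```python
from math import comb

SEATS = 180
SHOW_PROB = 0.92   # assumed independent show-up probability (illustrative)
FARE = 200         # revenue per ticket sold (illustrative)
BUMP_COST = 800    # denied-boarding compensation per bumped passenger (illustrative)

def expected_profit(tickets_sold: int) -> float:
    """Expected revenue minus expected bumping cost under a binomial show-up model."""
    profit = 0.0
    for shows in range(tickets_sold + 1):
        p = comb(tickets_sold, shows) * SHOW_PROB**shows * (1 - SHOW_PROB)**(tickets_sold - shows)
        bumped = max(0, shows - SEATS)
        profit += p * (tickets_sold * FARE - bumped * BUMP_COST)
    return profit

best = max(range(SEATS, SEATS + 21), key=expected_profit)
print(f"Sell {best} tickets for an expected profit of {expected_profit(best):,.0f}")
```

Estimating the show-up probability itself, by passenger type, route, day of week, and so on, is where the big data described above enter the calculation.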

Big data, and the associated growth in computation power, also facilitate more commercially efficient networking of scheduled airline services and, in particular, have contributed to the development of hub-and-spoke structures (Button 2002). This involves the use of interconnecting services to channel passengers moving from A to B along a routing involving an intermediate, hub airport, C. By consolidating traffic at C originating not only from A but also from D, E, F, etc., and then taking them together to their final destination B, fewer total flights are needed and larger, more economically efficient planes may be used. To make the system efficient, there is a need for all the incoming flights to arrive at approximately the same time and to leave at approximately the same time – i.e., scheduling banks of flights. This is complex partly because of the need to coordinate the ticketing of passengers interchanging at the hub and partly because there is a need to ensure cabin crews, with adequate flying hours, are free and planes are available for ongoing flights. The costs of missed connections are high, financially for the airline and in terms of wasted time for travelers. Weather forecasting and monitoring of aircraft performance have improved the reliability of hubbing as well as of passenger management.

In terms of cargo aviation, air transportation carries considerable amounts of high-value, low-volume consignments. This has been the result of freer markets and technical improvements in such things as the size and fuel economy of aircraft, the use of air cargo containerization, and a general shift to higher-value manufacturing production. But at the forefront of it has been the development of computerized information systems that, just like passenger air transportation, allow easier and more knowledgeable booking, management of demand by differential pricing, tracking of consignments, and allocation of costs. Big data have been the key input to this dynamic logistics chain. It has allowed aviation to be a central player in the success of businesses using just-in-time production practices and has been a major factor in the growth of companies like FedEx (the largest cargo airline, with 681 planes that did 17.5 billion freight ton-kilometers in 2018), UPS, and other express package carriers. The big aviation data is linked to that available for connecting modes, such as trucking, and for warehouse inventories to provide seamless supply chains.

...and the Military

Inevitably the military side of aviation is also a large user of big data. Equally, and inevitably for security reasons, we know less about it. The objectives of the military are not commercial, much of the equipment used is often very different, and the organization is based upon command-and-control mechanisms rather than on the price mechanism (Hamilton and Kreuzer 2018). Nevertheless, and disregarding the ultimate motivations of military aviation, some aspects of its use of big data are similar to those in the civilian industry. Military aviation has similar demands for reliable weather forecasting and for monitoring the health of planes. It also uses it for managing its manpower and its air bases. And there is an inevitable interface of military and civilian aviation where their airspace needs overlap.

Further Reading

Anaman, K. A., Quaya, R., & Owusu-Brown, B. (2017). Benefits of aviation weather services: A review of the literature. Research in World Economy, 8, 45–58.
Button, K. J. (2002). Airline network economics. In D. Jenkins (Ed.), Handbook of airline economics (2nd ed., pp. 27–34). New York: Aviation Week.
Carrier, E., & Fiig, T. (2018). Special issue: Future of airline revenue management. Journal of Revenue and Pricing Management, 17, 45–120.
Chen, J., Lyu, Z., Liu, Y., Huang, J., Zhang, G., Wang, J., & Chen, X. (2016). A big data analysis and application platform for civil aircraft health management. In 2016 IEEE Second International Conference on Multimedia Big Data (BigMM), Taipei.
Hamilton, S. P., & Kreuzer, M. P. (2018). The big data imperative: Air force intelligence for the information age. Air and Space Power Journal, 32, 4–18.
B

BD Hubs

▶ Big Data Research and Development Initiative (Federal, U.S.)

BD Spokes

▶ Big Data Research and Development Initiative (Federal, U.S.)

Behavioral Analytics

Lourdes S. Martinez
School of Communication, San Diego State University, San Diego, CA, USA

Behavioral analytics can be conceptualized as a process involving the analysis of large datasets comprised of behavioral data in order to extract behavioral insights. This definition encompasses three goals of behavioral analytics intended to generate behavioral insights for the purposes of improving organizational performance and decision-making as well as increasing understanding of users. Coinciding with the rise of big data and the development of data mining techniques, a variety of fields stand to benefit from the emergence of behavioral analytics and its implications. Although there exists some controversy regarding the use of behavioral analytics, it has much to offer organizations and businesses that are willing to explore its integration into their models.

Definition

The concept of behavioral analytics has been defined by Montibeller and Durbach as an analytical process of extracting behavioral insights from datasets containing behavioral data. This definition is derived from previous conceptualizations of the broader overarching idea of business analytics put forth by Davenport and Harris as well as Kohavi and colleagues. Business analytics in turn is a subarea within business intelligence and described by Negash and Gray as systems that integrate data processes with analytics tools to demonstrate insights relevant to business planners and decision-makers. According to Montibeller and Durbach, behavioral analytics differs from traditional descriptive analysis of behavioral data by focusing analyses on driving action and improving decision-making among individuals and organizations. The purpose of this process is threefold. First, behavioral analytics facilitates the detection of users' behavior, judgments, and choices. For example, a health website that tracks the click-through behavior, views, and downloads of its visitors may offer an opportunity to personalize user experience based on profiles of different types of visitors.

Second, behavioral analytics leverages findings from these behavioral patterns to inform decision-making at the organizational level and improve performance. If personalizing the visitor experience to a health website reveals a mismatch between certain users and the content provided on the website's navigation menu, the website may alter the items on its navigation menu to direct this group of users to relevant content in a more efficient manner. Lastly, behavioral analytics informs decision-making at the individual level by improving judgments and choices of users. A health website that is personalized to unique health characteristics and demographics of visitors may help users fulfill their informational needs so that they can apply the information to improve decisions they make about their health.

Applications

According to Kokel and colleagues, the largest behavioral databases can be found at Internet technology companies such as Google as well as online gaming communities. The sheer size of these datasets is giving rise to new methods, such as data visualization, for behavioral analytics. Fox and Hendler note the opportunity in implementing data visualization as a tool for exploratory research and argue for a need to create a greater role for it in the process of scientific discovery. For example, Carneiro and Mylonakis explain how Google Flu relies on data visualization tools to predict outbreaks of influenza by tracking online search behavior and comparing it to geographical data. Similarly, Mitchell notes how Google Maps analyzes traffic patterns through data provided via real-time cell phone location to provide recommendations for travel directions. In the realm of social media, Bollen and colleagues have also demonstrated how analysis of Twitter feeds can be used to predict public sentiments.

According to Jou, the value of behavioral analytics has perhaps been most notably observed in the area of commercial marketing. The consumer marketing space has borne witness to the progress made through extracting actionable and profitable insights from user behavioral data. For example, between recommendation search engines for Amazon and teams of data scientists for LinkedIn, behavioral analytics has allowed these companies to transform their plethora of user data into increased profits. Similarly, advertising efforts have turned toward the use of behavioral analytics to glean further insights into consumer behavior. Yamaguchi discusses several tools on which digital marketers rely that go beyond examining data from site traffic.

Nagaitis notes observations that are consistent with Jou's view of behavioral analytics' impact on marketing. According to Nagaitis, in the absence of face-to-face communication, behavioral analytics allows commercial marketers to examine e-consumers through additional lenses apart from the traditional demographic and traffic tracking. In approaching the selling process from a relationship standpoint, behavioral analytics uses data collected via web-based behavior to increase understanding of consumer motivations and goals, and to fulfill their needs. Examples of these sources of data include keyword searches, navigation paths, and click-through patterns. By inputting data from these sources into machine learning algorithms, computational social scientists are able to map human factors of consumer behavior as it unfolds during purchases. In addition, behavioral analytics can use web-based behaviors of consumers as proxies for cues typically conveyed through in-person face-to-face communication. Previous research suggests that web-based dialogs can capture rich data pointing toward behavioral cues, the analysis of which can yield highly accurate predictions comparable to data collected during face-to-face interactions. The significance of this ability to capture communication cues is reflected in marketers' increased ability to speak to their consumers with greater personalization that enhances the consumer experience.

Behavioral analytics has also enjoyed increasingly widespread application in game development. El-Nasr and colleagues discuss the growing significance of assessing and uncovering insights related to player behavior, both of which have emerged as essential goals for the game industry and catapulted behavioral analytics into a central role with commercial and academic implications for game development. A combination of evolving mobile device technology and shifting business models that focus on game distribution via online platforms has created a situation for behavioral analytics to make important contributions toward building profitable businesses.

Increasingly available data on user behavior has given rise to the use of behavioral analytic approaches to guide game development. Fields and Cotton note the premium placed in this industry on data mining techniques that decrease behavioral datasets in complexity while extracting knowledge that can drive game development. However, determining cutting-edge methods in behavioral analytics within the game industry is a challenge due to reluctance on the part of various organizations to share analytic methods. Drachen and colleagues observe a difficulty in assessing both data and analytical methods applied to data analysis in this area due to a perception that these approaches represent a form of intellectual property. Sifa further notes that to the extent that data mining, behavioral analytics, and the insights derived from these approaches provide a competitive advantage over rival organizations in an industry that already exhibits fierce competition in the entertainment landscape, organizations will not be motivated to share knowledge about these methods.

Another area receiving attention for its application of behavioral analytics is business management. While much interest in applying behavioral analytics has focused on modeling and predicting consumer experiences, Géczy and colleagues observe a potential for applying these techniques to improve employee usability of internal systems. More specifically, Géczy and colleagues describe the use of behavioral analytics as a critical first step to user-oriented management of organizational information systems through identification of relevant user characteristics. Through behavioral analytics, organizations can observe characteristics of usability and interaction with information systems and identify patterns of resource underutilization. These patterns are important in providing implications for designing streamlined and efficient user-oriented processes and services. Behavioral analytics can also offer prospects for increasing personalization during the user experience by drawing from user information provided in user profiles. These profiles contain information about how the user interacts with the system, and the system can accordingly adjust based on clustering of users.
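The clustering step mentioned here can be illustrated with a brief, self-contained sketch in which hypothetical per-user interaction counts (page views, downloads, and searches, all invented rather than drawn from any particular system) are grouped with k-means so that an interface could adapt to each group.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical per-user interaction profiles: [page views, downloads, searches].
casual = rng.poisson([5, 0.5, 2], size=(200, 3))
power = rng.poisson([60, 12, 25], size=(50, 3))
profiles = np.vstack([casual, power])

# Standardize features, then group users into two behavioral clusters.
X = StandardScaler().fit_transform(profiles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for k in range(2):
    print(f"cluster {k}: {np.sum(labels == k)} users, "
          f"mean profile {profiles[labels == k].mean(axis=0).round(1)}")
```

In a production setting, the resulting cluster assignments would feed back into the information system, which could then streamline menus or workflows for each group of users.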

Despite advances made in behavioral analytics within the commercial marketing and game industries, several areas are ripe with opportunities for integrating behavioral analytics to improve performance and decision-making practices. One area that has not yet reached its full potential for capitalizing on the use of behavioral analytics is security. Although Brown reports on exploration in the use of behavioral analytics to track cross-border smuggling activity in the United Kingdom through vehicle movement, the application of these techniques under the broader umbrella of security remains understudied. Along these lines, and in the context of an enormous amount of available data, Jou discusses the possibilities for implementing behavioral analytics techniques to identify insider threats posed by individuals within an organization. Inputting data from a variety of sources into behavioral analytics platforms can offer organizations an opportunity to continuously monitor users and machines for early indicators and detection of anomalies. These sources may include email data, network activity via browser activity and related behaviors, intellectual property repository behaviors related to how content is accessed or saved, end-point data showing how files are shared or accessed, and other less conventional sources such as social media or credit reports. Connecting data from various sources and aggregating them under a comprehensive data plane can provide enhanced behavioral threat detection. Through this, robust behavioral analytics can be used to extract insights into patterns of behavior consistent with an imminent threat. At the same time, the use of behavioral analytics can also measure, accumulate, verify, and correctly identify real insider threats while preventing inaccurate classification of nonthreats. Jou concludes that the result of implementing behavioral analytics in an ethical manner can provide practical and operative intelligence while raising the question as to why implementation in this field has not occurred more quickly.
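One common way to operationalize this kind of multi-source monitoring is to summarize each user's activity as a feature vector and apply an unsupervised outlier detector. The sketch below does so with scikit-learn's IsolationForest on fabricated features (emails sent, off-hours logins, files copied); it illustrates the general approach rather than any vendor's actual insider-threat product, and in practice such scores would only ever be one input to a human review process.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

# Fabricated daily activity features per user: [emails sent, off-hours logins, files copied].
normal_users = rng.normal([40, 1, 20], [10, 1, 8], size=(500, 3))
suspicious = np.array([[45.0, 9.0, 400.0], [30.0, 12.0, 250.0]])  # heavy off-hours copying
activity = np.vstack([normal_users, suspicious])

detector = IsolationForest(contamination=0.01, random_state=0).fit(activity)
scores = detector.decision_function(activity)  # lower scores = more anomalous

flagged = np.argsort(scores)[:5]
print("users flagged for review:", flagged)
```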

In conclusion, behavioral analytics has been previously defined as a process in which large datasets consisting of behavioral data are analyzed for the purpose of deriving insights that can serve as actionable knowledge. This definition includes three goals underlying the use of behavioral analytics, namely, to enhance organizational performance, improve decision-making, and generate insights into user behavior. Given the burgeoning presence of big data and the spread of data mining techniques to analyze these data, several fields have begun to integrate behavioral analytics into their approaches for problem-solving and performance-enhancing actions. While concerns related to accuracy and ethical use of these insights remain to be addressed, behavioral analytics can present organizations and businesses with unprecedented opportunities to enhance business, management, and operations.

Cross-References

▶ Big Data
▶ Business Intelligence Analytics
▶ Data Mining
▶ Data Science
▶ Data Scientist

Further Reading

Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. Proceedings of the Fifth International Association for Advancement of Artificial Intelligence Conference on Weblogs and Social Media.
Brown, G. M. (2007). Use of Kohonen self-organizing maps and behavioral analytics to identify cross-border smuggling activity. Proceedings of the World Congress on Engineering and Computer Science.
Carneiro, H. A., & Mylonakis, E. (2009). Google trends: A web-based tool for real-time surveillance of disease outbreaks. Clinical Infectious Diseases, 49(10).
Davenport, T., & Harris, J. (2007). Competing on analytics: The new science of winning. Boston: Harvard Business School Press.
Drachen, A., Sifa, R., Bauckhage, C., & Thurau, C. (2012). Guns, swords and data: Clustering of player behavior in computer games in the wild. Proceedings of the IEEE Computational Intelligence and Games.
El-Nasr, M. S., Drachen, A., & Canossa, A. (2013). Game analytics: Maximizing the value of player data. New York: Springer Publishers.
Fields, T. (2011). Social game design: Monetization methods and mechanics. Boca Raton: Taylor & Francis.
Fox, P., & Hendler, J. (2011). Changing the equation on scientific data visualization. Science, 331(6018).
Géczy, P., Izumi, N., Shotaro, A., & Hasida, K. (2008). Toward user-centric management of organizational information systems. Proceedings of the Knowledge Management International Conference, Langkawi, Malaysia (pp. 282–286).
Kohavi, R., Rothleder, N., & Simoudis, E. (2002). Emerging trends in business analytics. Communications of the ACM, 45(8).
Mitchell, T. M. (2009). Computer science: Mining our reality. Science, 326(5960).
Montibeller, G., & Durbach, I. (2013). Behavioral analytics: A framework for exploring judgments and choices in large data sets. Working Paper LSE OR13.137. ISSN 2041-4668.
Negash, S., & Gray, P. (2008). Business intelligence. Berlin/Heidelberg: Springer.
Sifa, R., Drachen, A., Bauckhage, C., Thurau, C., & Canossa, A. (2013). Behavior evolution in Tomb Raider: Underworld. Proceedings of the IEEE Computational Intelligence and Games.

Bibliometrics/Scientometrics

Staša Milojević1 and Loet Leydesdorff2
1 Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
2 Amsterdam School of Communication Research (ASCoR), University of Amsterdam, Amsterdam, The Netherlands

"Scientometrics" and "bibliometrics" can be used interchangeably as the name of a scientific field at the interface between library and information science (LIS), on the one side, and the sociology of science, on the other. On the applied side, this field is well known for the analysis and development of evaluative indicators such as the journal impact factor, h-index, and university ranking.

The term "bibliometrics" was coined by Pritchard (1969) to describe research that utilizes mathematical and statistical methods to study written records. "Scientometrics" emerged as the quantitative study of science in the 1970s (Elkana et al. 1978), alongside the development of citation databases (indexes) by Eugene Garfield (1955). The pioneering work in this area by the historian of science Derek de Solla Price (e.g., 1963, 1965) proposed studying the sciences as networks of documents. The citation indexes provided the measurement tools for testing hypotheses in the Mertonian sociology of science, in which one focuses on the questions of stratification and the development of scientific fields using, for example, co-citation analysis (e.g., Mullins 1973).

Research has also focused, for example, on questions related to the social structures that lead to the advancement of science. While some researchers study larger units, such as scientific fields or disciplines, others are interested in identifying the roles played by elite and non-elite scientists. This latter question is known as the Ortega Hypothesis after the Spanish philosopher Ortega y Gasset (1932), who proposed that non-elite scientists also play a major role in the advancement of science. However, Newton's aphorism that leading scientists "stand on the shoulders of giants" provides an alternative view of an elite structure operating as a relatively independent layer (Bornmann et al. 2010; cf. Merton 1965). In her book The New Invisible College, Caroline Wagner (2008) argued that international co-authorship relations have added a new layer in knowledge production during the past decades, but not in all disciplines to the same extent. In general, understanding the processes of knowledge creation is of paramount importance not only for understanding science but also for making informed decisions about the allocation of resources.

The use of citation analysis in research evaluation followed on the applied side of the field. The US National Science Board launched the biennial Science Indicators series in 1972. This line of research has grown significantly, prompting a whole field of "evaluative bibliometrics" (Narin 1976). The interest in developing useful indicators has been advanced by the Organisation for Economic Co-operation and Development (OECD) and their Frascati Manual (OECD [1962] 2015) for the measurement of scientific and technical activities and the Oslo Manual (OECD [1972] 2018) for the measurement of innovations. Patents thus emerged as a useful data source to study the process of knowledge diffusion and the transfer of knowledge between science and technology (Price 1984; Rosenberg 1982). The analysis of patents has helped to make major advances in understanding processes of innovation (Jaffe and Trajtenberg 2002), and patent statistics has become a field in itself, but with strong connections to scientometrics.

While researchers have been developing ever more sophisticated indicators, some of the earlier measures became widely used in research management and policy making. The best-known of these are journal impact factors, university rankings, and the h-index. The impact factor, defined as a 2-year moving citation average at the level of journals, was proposed by Eugene Garfield as a means to assess journal quality for potential inclusion in the database (Garfield and Sher 1963; Garfield 1972). However, it is not warranted to use this measure for the assessment of individual publications or individual scholars for the purposes of funding and promotion, given the skewness of citation distributions. Instead, citation data can also be analyzed using nonparametric statistics (e.g., percentiles) after proper normalization for a field. However, the normalization of publication and citation counts for different fields of science has remained a hitherto insufficiently solved problem.

The h-index (Hirsch 2005) is probably the most widely used metric for assessing the impact of individual authors and has been praised as a simple-to-understand indicator capable of capturing both productivity and impact. However, Waltman and Van Eck (2012) have shown that the h-index is mathematically inconsistent. The publication of the first Academic Ranking of World Universities (ARWU) of the Shanghai Jiao Tong University in 2004 (Shin et al. 2011) has further enhanced the attention to evaluation and ranking.
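Both of these indicators are straightforward to compute once citation counts are in hand, which is part of their appeal. The following sketch shows the arithmetic with invented numbers: a journal's impact factor for year Y divides the citations received in Y to items published in the two preceding years by the number of those items, and an author's h-index is the largest h such that h of the author's papers have received at least h citations each.

```python
def impact_factor(citations_to_prior_two_years: int, items_prior_two_years: int) -> float:
    """Impact factor for year Y: citations in Y to items from Y-1 and Y-2,
    divided by the number of citable items published in Y-1 and Y-2."""
    return citations_to_prior_two_years / items_prior_two_years

def h_index(citation_counts: list[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)

print(impact_factor(citations_to_prior_two_years=1200, items_prior_two_years=400))  # 3.0
print(h_index([25, 8, 5, 3, 3, 2, 1, 0]))  # 3
```

The simplicity of such calculations helps explain their popularity, even though, as noted above, skewed citation distributions and field differences make them problematic for evaluating individuals.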

capable of capturing both productivity and together with major advances in analysis tech-
impact. However, Waltman and Van Eck (2012) niques, such as network analysis, machine
have shown that the h-index is mathematically learning, and natural language processing,
inconsistent. The publication of the first Aca- holds a great potential for using these tools
demic Ranking of World Universities (ARWU) and techniques not only to advance scientific
of the Shanghai Jiao Tong University in 2004 knowledge but also as a basis for improving
(Shin et al. 2011) has further enhanced the atten- decision-making when it comes to the alloca-
tion to evaluation and ranking. tion of resources.
In the 2000s, new citation indexes were cre-
ated, including Scopus, Google Scholar, and
Microsoft Academic. As scholarly communica-
Cross-References
tion diversified and expanded to the web, new
sources for gathering alternative metric
▶ Scientometrics
(“altmetric”) data became available as well (e.g.,
▶ Bibliometrics/Scientometrics
Mendeley) leading to the development of indica-
tors capable of capturing the changes in research
practices, such as bookmarking, citing, reading,
References
and sharing.
More recently, massive investments in knowl- Borgman, C. L. (2015). Big data, little data, no data:
edge infrastructures and the increased awareness Scholarship in the networked world. Cambridge: The
of the importance of data sharing (Borgman 2015) MIT Press.
Bornmann, L., De Moya-Anegón, F., & Leydesdorff,
have led to the attempts to provide proper incen-
L. (2010). Do scientific advancements lean on the
tives and recognition for authors who make their shoulders of giants? A bibliometric investigation of
data available to a wider community. Data cita- the Ortega hypothesis. PLoS One, 5(10), e13327.
tion, for example, prompted the creation of the de Solla Price, D. J. (1963). Little science, big science.
New York: Columbia University Press.
Data Citation Index, and a body of research
de Solla Price, D. J. (1965). Networks of scientific papers.
focused on better understanding of data life cycles Science, 149(30), 510–515.
within the sciences. de Solla Price, D. J. (1984). The science/technology rela-
In summary, the analysis of written records in tionship, the craft of experimental science, and policy
for the improvement of high technology innovation.
quantitative science studies has significantly
Research Policy, 13(1), 3–20.
advanced the knowledge about the structures and Elkana, Y., Lederberg, J., Merton, R. K., Thackray, A., &
dynamics of science and the process of innovation Zuckerman, H. (Eds.). (1978). Toward a metric of
(Leydesdorff 1995). The field of scientometrics science: The advent of science indicators. New York:
Wiley.
has developed into a scientific community with
Garfield, E. (1955). Citation indexes for science: A new
an intellectual core of research agendas (Milojević dimension in documentation through association of
and Leydesdorff 2013; Wouters and Leydesdorff ideas. Science, 122(3159), 108–111.
1994). In addition to intensified efforts to improve Garfield, E. (1972). Citation analysis as a tool in journal
evaluation. Science, 178, 471–479.
indicators, the increased availability of data
Garfield, E., & Sher, I. H. (1963). New factors in the
sources has brought a renewed interest in funda- evaluation of scientific literature through citation
mental questions, such as theory of citation, clas- indexing. American Documentation, 14(3), 195–201.
sifications of the sciences, and the nature of Hirsch, J. E. (2005). An index to quantify an individual’s
scientific research output. PNAS, 102(46),
collaboration across disciplines and in university-
16569–16572.
industry-government relations, and brought about Jaffe, A. B., & Trajtenberg, M. (2002). Patents, citations,
great advances in the mapping of science, tech- and innovations: A window on the knowledge economy.
nology, and innovation. Cambridge: The MIT Press.
Leydesdorff, L. (1995). The challenge of scientometrics:
The rapid increase in the number, size, and
The development, measurement, and self-organization
quality of data sources that are widely avail- of scientific communications. Leiden: DSWO Press,
able and amenable to automatic processing, Leiden University.
Big Data and Theory 75

Big Data and Theory

Wolfgang Maass1, Jeffrey Parsons2, Sandeep Purao3, Alirio Rosales4, Veda C. Storey5 and Carson C. Woo4
1 Saarland University, Saarbrücken, Germany
2 Memorial University of Newfoundland, St. John's, Canada
3 Bentley University, Waltham, USA
4 University of British Columbia, Vancouver, Canada
5 J Mack Robinson College of Business, Georgia State University, Atlanta, GA, USA

The necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences – arguably the key scientific theme of our times. (Diebold 2012)

Introduction

Big data is the buzzword du jour in diverse fields in the natural, life, social, and applied sciences, including physics (Legger 2014), biology (Howe et al. 2008), medicine (Collins and Varmus 2015), economics (Diebold 2012), and management (McAfee and Brynjolfsson 2012; Gupta and George 2016). The traditional Vs of big data – volume, variety, and velocity – reflect the unparalleled quantity, diversity, and immediacy of data generated by sensing, measuring, and social computing technologies. The result has been significant new research opportunities, as well as unique challenges. Computer and information scientists have responded by developing tools and techniques for big data analytics, intended to discover patterns (statistical regularities among variables) in massive data sets (Fukunaga 2013), reconcile the variety in diverse sources of data (Halevy et al. 2009), and manage data generated at a high velocity.

With the success of these tools and techniques, some have proclaimed the "end of theory," arguing that "the data deluge makes the scientific method obsolete" (Anderson 2008) and that any question can now be answered by data analysis (Halevy et al. 2009; Baesens et al. 2016). This position has led to a radical rethinking of scientific practice, as well as an assessment of the impact of big data research in specific disciplines, such as astronomy (Pankratius and Mattmann 2014) and biology (Shaffer and Purugganan 2013).

However, a primary focus on statistical pattern finding in big data has limited potential to advance science because the extracted patterns can be misleading or reveal only idiosyncratic relationships (Bentley et al. 2014). Research based on big data analytics should be complemented by theoretical principles that evaluate which patterns are meaningful. Big data and big theory should complement each other. Researchers, thus, need to integrate theory and big data analytics for conducting science in the era of big data (Rai 2016).

Framework for Science in the Era of Big Data

Big data analytics and domain theories should have complementary roles as scientific practice moves toward a desirable future that combines "big data" with "big theory." This "big data, big theory" is in contrast with the traditional scientific focus on "small data, big theory," going beyond data-driven emphasis on "big data, small theory," by explicating interactions between big data analytics and theory building.

Figure 1 presents a framework for science in the era of big data that represents these interactions. The intent of the framework is to identify possibilities that researchers can use to position their work, thereby encouraging closer interactions between research communities engaged in big data analytics versus theory-driven research. The value of the framework can be explored by examining scientific practice, which has primarily been driven by the cost and effort required for data collection and analysis.

Big Data and Theory, Fig. 1 Framework for science in the era of big data (big data analytics and theory building are linked by arrows labeled "reduced to statistical patterns," "feedback to theory development," "hypotheses derived from analyzing big data," and "feedback on theory," spanning the modes "big data, small theory," "small data, big theory," and "big data, big theory")

Work in natural sciences has focused on developing or testing theory with relatively small data sets, often due to the cost of experimental design and data collection. As recently as the first decade of the twenty-first century, population genetic models were based only upon the analysis of one or two genes, but now evolutionary biologists can use data sets at the scale of the full genome of an increasing number of species (Wray 2010). Analogous examples exist in other fields, including sociology and management (Schilpzand et al. 2014). This mode of research emphasizes a tight link between big data analytics and theory building and testing, exemplifying a mode of research we call "small data, big theory."

Sensor technologies and massive computing power have transformed data collection and analysis by reducing effort and cost. Scientists can now extract statistical patterns from very large data sets with advanced analytical techniques (e.g., Dhar 2013). Biomedical scientists can analyze full genomes (International Human Genome Sequencing Consortium 2004; ENCODE Project Consortium 2012). Likewise, astronomy is becoming a computationally intensive field due to an "exciting evolution from an era of scarce scientific data to an era of overabundant data" (Shaffer and Purugganan 2013). Research in these domains is being transformed with the use of big data techniques that may have little or no connection to prior theories in the scientific discipline. This practice exemplifies a mode of research we call "big data, small theory."

This emphasis on big data analytics risks severing the connection between data and theory, threatening our ability to understand and interpret extracted statistical patterns. Overcoming this threat requires purposeful interactions between theory development and data collection and analysis. The framework highlights these interactions via labeled arrows.

We are already beginning to witness such interactions. Population geneticists, for example, can delve deeper into our evolutionary past by postulating the genetic structure of extinct and ancestral populations and investigating them with the help of novel sequencing technologies and other methods of data analysis (Wall and Slatkin 2012). A new field of "paleopopulation genetics" was not possible without proper integration of big data and theory. In astronomy, statistical patterns can easily be extracted from large data sets, although theory is required to interpret them properly (Shaffer and Purugganan 2013). The standard model, a fundamental theory in particle physics, places requirements on energies needed for producing experimental conditions for the Higgs boson (Aad et al. 2012). Based upon these theory-derived requirements, scientists have verified the theoretical prediction of the Higgs boson by analyzing big data created by the Large Hadron Collider.

Core research areas in computer science are affected by big data analytics. For instance, computer vision witnessed a major shift as the concept of deep learning (Krizhevsky et al. 2012) significantly improved the success rates of many applications such as facial recognition. It also changed research to an algorithmic understanding of computer vision. Image recognition now obtains results that are getting close to those of humans (Sermanet et al. 2013), which was not feasible with prior declarative theories. When images can be classified with humanlike accuracy, even better scientific questions can be posed, such as "what really is vision?", by generating procedural theories that replicate and explain how the human brain operates (LeCun et al. 2015). In this sense, research is moving from "what is built" to "how to build."

These examples highlight a desirable mode of research, "big data, big theory." This form of research includes extracting statistical patterns from large, and often heterogeneous, data sets. Pattern extraction is not a stand-alone activity, but rather one that shapes, and is shaped by, theory building and testing.

Application of Framework

The framework for science in the era of big data in Fig. 1 depicts and promotes interdisciplinary interactions between researchers in the big data analytics field (L.H.S.) and those in disciplines or domains related to various sciences (R.H.S.).

The top left arrow indicates that a data scientist has the capability to reduce big data to statistical patterns using analytical techniques, as in the identification of homologous sequences from evolutionary genomics (Saitou 2013).

The top right arrow shows that statistical patterns can suggest novel theoretical explanations to domain scientists. These patterns may extend theory to new domains or reveal inconsistencies in current theory. Without integration with theory, statistical patterns may remain as merely curious facts (Bentley et al. 2014). In some situations, big data may significantly increase empirical support for existing theoretical predictions. In other cases, big data may simplify theory testing (e.g., by modifying measurement parameters), facilitate theory refinement (e.g., based upon more or new data), or radically extend the scope of a theory. An example is paleopopulation genetics (Wall and Slatkin 2012), which has made possible studies of extinct populations.

The bottom right arrow indicates that a domain scientist can identify specific data that may need to be acquired, by revealing gaps in testing existing theories. This leads to the bottom left arrow, where data scientists can close a gap by extracting, cleaning, integrating, and analyzing relevant data. Doing so might also reveal alternative perspectives on how to manipulate, analyze, or synthesize existing (big) data. For example, Metcalfe's law could only be properly tested when large amounts of data became available (e.g., membership growth numbers from Facebook) (Metcalfe 2013).
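The Metcalfe's law example can be made concrete with a brief curve-fitting sketch: given user counts and some proxy for network value, one can estimate the exponent that best relates them and compare it with the value of 2 that the law asserts. The numbers below are synthetic stand-ins generated for illustration, not Facebook's actual figures.

```python
import numpy as np

# Synthetic illustration: user counts (millions) and a value proxy (arbitrary units).
n = np.array([50, 100, 200, 400, 600, 800, 1000], dtype=float)
v = 0.002 * n**2 * (1 + np.random.default_rng(1).normal(0, 0.05, n.size))

# Fit v = a * n**k by linear regression in log-log space; Metcalfe's law predicts k close to 2.
k, log_a = np.polyfit(np.log(n), np.log(v), deg=1)
print(f"fitted exponent k = {k:.2f} (Metcalfe's law predicts k = 2)")
```

This is exactly the sense in which large observational data sets allow a long-standing theoretical claim to be confronted with evidence rather than remaining an untested assertion.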

Another example of "big data, big theory" is the emerging discipline of astroinformatics. It would be incorrect to view computing in astronomy as applied computer science. Clearly, computer science impacts astronomy, but computer scientists do not have effective techniques that can be easily adapted to astronomy (Shaffer and Purugganan 2013). Through interaction with astronomers, techniques are created and evolve. This interdisciplinary integration highlights a crucial aspect of the changing nature of scientific practice. Considering big data and theory-driven research as complementary endeavors can produce outcomes not achievable by considering either in isolation.

Conclusion

Computing technologies provide an exciting opportunity for a new mode of research (big data, big theory), as the scientific community moves from a time of "data poverty" to "data wealth." The science in the era of big data framework provides both data and domain scientists with an understanding of how to position themselves at the desired intersection of big data and big theory, to realize the potential for unprecedented scientific progress.

Reasonable actions that researchers should consider from the data analytics perspective include: (1) using larger and more complete data sets (e.g., physics, biology, and medicine); (2) increasing computational capabilities (e.g., astronomy); (3) mining heterogeneous data sets for predictive analytics, text mining, and sentiment (e.g., business applications); (4) adopting new machine learning techniques (e.g., computer vision); and (5) generating new, and novel, questions. From a theoretical perspective, researchers should consider: (1) what impactful theoretical questions can now be addressed that could not be answered using the traditional "big theory, small data" approach; (2) how interpretability of patterns extracted can be supported by or drive theory development; and (3) how theoretical concepts can be mapped onto available data variables and vice versa.

Minimally, the framework can enable scientists to reflect on their practices and better understand why theory remains essential in the era of big data. An extreme interpretation of the framework is a reconceptualization of the scientific endeavor itself; indeed, one that recognizes the synergy between big data and theory building as intrinsic to future science. The framework has been illustrated in several domains to demonstrate its applicability across disciplines.

Further Reading

Aad, G., et al. (2012). Observation of a new particle in the search for the standard model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1), 1–29.
Anderson, C. (2008). The end of theory. Wired Magazine, 16(7), 16–07.
Baesens, B., Bapna, R., Marsden, J. R., Vanthienen, J., & Zhao, J. L. (2016). Transformational issues of big data and analytics in networked business. MIS Quarterly, 40(4), 807–818.
Bentley, R. A., O'Brien, M. J., & Brock, W. A. (2014). Mapping collective behavior in the big-data era. Behavioral and Brain Sciences, 37(01), 63–76.
Collins, F. S., & Varmus, H. (2015). A new initiative on precision medicine. New England Journal of Medicine, 372(9), 793–795.
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.
Diebold, F. X. (2012). On the origin(s) and development of the term 'Big Data' (PIER working paper). Philadelphia: PIER.
ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74.
Fukunaga, K. (2013). Introduction to statistical pattern recognition. Cambridge, MA: Academic Press.
Gupta, M., & George, J. F. (2016). Toward the development of a big data analytics capability. Information & Management, 53(8), 1049–1064.
Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8–12.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., . . . Twigger, S. (2008). Big data: The future of biocuration. Nature, 455(7209), 47–50.
International Human Genome Sequencing Consortium. (2004). Finishing the euchromatic sequence of the human genome. Nature, 431(7011), 931–945.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105). Red Hook, NY: Curran Associates.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Legger, F. (2014). The ATLAS distributed analysis system. Journal of Physics: Conference Series, 513(3), 032053.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–68.
Metcalfe, B. (2013). Metcalfe's law after 40 years of ethernet. Computer, 46(12), 26–31.
Pankratius, V., & Mattmann, C. (2014). Computing in astronomy: To see the unseen. Computer, 9(47), 23–25.
Rai, A. (2016). Synergies between big data and theory. MIS Quarterly, 40(2), iii–ix.
Saitou, N. (2013). Introduction to evolutionary genomics. London: Springer.
Schilpzand, P., Hekman, D. R., & Mitchell, T. R. (2014). An inductively generated typology and process model of workplace courage. Organization Science, 26(1), 52–77.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
Shaffer, H. B., & Purugganan, M. D. (2013). Introduction to theme "Genomics in Ecology, Evolution, and Systematics". Annual Review of Ecology, Evolution, and Systematics, 44, 1–4.
Wall, J. D., & Slatkin, M. (2012). Paleopopulation genetics. Annual Review of Genetics, 46, 635–649.
Wray, G. A. (2010). Integrating genomics into evolutionary theory. In M. Pigliucci & G. B. Muller (Eds.), Evolution: The extended synthesis (pp. 97–116). Cambridge, MA: MIT Press.

Big Data Concept

Connie L. McNeely and Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Big data is one of the most critical features of today's increasingly digitized and expanding information society. While the term "big data" has been invoked in various ways in relation to different stakeholders, groups, and applications, its definition has been a matter of some debate, changing over time and focus. However, despite a lack of consistent definition, typical references point to the collection, management, and analysis of massive amounts of data, with some general agreement signaling the size of datasets as the principal defining factor. As such, big data can be conceptualized as an analytical space marked by encompassing processes and technologies that can be employed across a wide range of domains and applications. Big data are derived from various sources – including sensors, observatories, satellites, the World Wide Web, mobile devices, crowdsourcing mechanisms, and so on – to the extent that attention is required to both instrumental and intrinsic aspects of big data to understand their meanings and roles in different circumstances, sectors, and contexts. This means considering conceptual delineations and analytical uses relative to issues of validity, credibility, applicability, and broader implications for society today and in the future.

Conceptual Dimensions

The explosion of big data references the breadth and depth of the phenomenon in and of itself as a core operational feature of society relative to how we understand and use it in that regard. Big data is a multidimensional concept which, despite different approaches, largely had been interpreted initially according to the "3 Vs": volume, variety, and velocity. Volume refers to the increasing amount of data; variety refers to the complexity and range of data types and sources; and velocity refers to the speed of data, particularly the rate of data creation and availability. That is, big data generally refers to massive volumes of data characterized by variety that reflects the heterogeneity and types of structured and unstructured data collected and the velocity at which the data are acquired and made available for analysis. These dimensions together constitute a basic conceptual model for describing big data.

a multidimensional concept which, despite differ- validity, and viscosity – although these generally
ent approaches, largely had been interpreted ini- are encompassed in the other Vs. They typically
tially according to the “3 Vs”: volume, variety, are raised as specific points rather than in overall
and velocity. Volume refers to the increasing reference to big data.)
amount of data; variety refers to the complexity Big data contains disparate formats, structures,
and range of data types and sources; and velocity semantics, granularity, and so on, along with other
refers to the speed of data, particularly the rate of dimensions related to exhaustivity, identification,
data creation and availability. That is, big data relationality, extensionality, and scalability. Spe-
generally refers to massive volumes of data char- cifically, big data can be described in terms of how
acterized by variety that reflects the heterogeneity well it captures a complete system or population;
and types of structured and unstructured data col- its resolution and ability to be indexed; the ease
lected and the velocity at which the data are with which it can be reduced, expanded, or inte-
acquired and made available for analysis. These grated; and its potential to expand in size rapidly.
dimensions together constitute a basic conceptual Big data also can be delineated by spatial and/or
model for describing big data. temporal features and resolution. The idea that
However, beyond the initial basic model, two things can be learned from a large body of data
additional Vs – variability and veracity – have that cannot be comprehended from smaller
been noted, to the extent that reference to the “5 amounts also links big data to notions of complex-
Vs” became common. Variability is reflected in ity, such that complex structures, behaviors, and
inconsistencies in data flows that attend the vari- permutations of datasets are basic considerations
ety and complexity that mark big data. Veracity in labeling data as big. Big data need not incorpo-
refers to the quality and reliability of the data, rate all of the same characteristics and, in fact, few
i.e., indicating data integrity and the extent to big data sets possess all possible dimensions, to
which the data can be trusted for analytical and the effect that there can be multiple forms of big
decision-making purposes. Veracity is of special data.
note given that big data often are compromised
by incompleteness, redundancy, bias, noise, and
other imperfections. Accordingly, methods of Conceptual Sources and Allocations
data verification and validation are especially
relevant in this regard. Following these lines, a Massive datasets derive from a variety of sources.
sixth V – vulnerability – has been recognized as a For example, the digitization of large collections
fundamental and encompassing characteristic of of documents has given rise to massive corpora
big data. Vulnerability is an integrated notion that and archives of unstructured data. Additionally,
speaks to security and privacy challenges posed social media, crowdsourcing platforms, e-com-
by the vast amounts, range of sources and formats, merce, and other web-based sources are contrib-
and the transfer and distribution of big data, uting to a vast and complex collection of
with broad social implications. Also, a seventh V information on social and economic exchanges
– value – is typically discussed in this regard as and interactions among people, places, and orga-
yet another consideration. Value, as a basic nizations from moment-to-moment around the
descriptive dimension, highlights the capacity world. Satellites, drones, telescopes, and other
for adding and extracting value from big data. modes of surveillance are collecting massive
Thus, allusions to the “7 Vs” – volume, variety, amounts of information on the physical and
velocity, variability, veracity, vulnerability, and human-made environment (buildings, nightlights,
value – have been increasingly invoked as broadly land use cover, meteorological conditions, water
indicative of big data and are now considered the quality, etc.) as well as the cosmos. Big data also
principal determinant features that mark related has been characterized as “organic,” i.e., continu-
conceptual delineations. (Note that, on occasion, ously produced and observational transaction data
three other Vs also have been included – volatility, from the everyday behaviors of people. Along
those lines, for example, mobile devices (e.g., “smart phones”) and location acquisition technologies are producing reams of detailed information on people, animals, the world, and various phenomena. The Internet of Things (IoT), which comprises a large and growing assemblage of interconnected devices, actively monitors and processes everything from the contents of refrigerators to the operational characteristics of large-scale infrastructures. Large-scale simulations based on such data provide additional layers of data.

Big data has abundant applications and is being used across virtually all sectors of society. Some industries in particular have benefitted from big data availability and use relative to others: healthcare, banking and finance, media, retail, and energy and utilities. Beyond those, industries that are rapidly being encompassed and marked by big data include medicine, construction, and transportation. Worldwide, the industries that are investing the most in big data are banking, manufacturing, professional services, and government, with effects manifesting across all levels of analysis.

Analytical and Computational Capabilities

The generation, collection, manipulation, analysis, and use of big data can make for a number of challenges, including, for example, dealing with highly distributed data sources, tracking and validating data, confronting sampling biases and heterogeneity, working with variably formatted and structured data, ensuring data integrity and security, developing appropriately scalable and incremental algorithms, and enabling data discovery, integration, and sharing. Accordingly, the tools and techniques that are used to process – to search, aggregate, and cross-reference – massive datasets play key roles in producing, manipulating, and recognizing big data as such. Related capabilities rest on robust systems at each point along the data pipeline, and algorithms for handling related tasks, including allocating, pooling, and coordinating data resources, are central to managing the volume, velocity, and variety of big data.

The computational strategies and technologies that are used to handle large datasets also offer a conceptual frame for understanding big data, and artificial intelligence (AI) and machine learning (ML) are being employed to make sense of the complex and massive amounts of data. Moreover, the expanding amounts, availability, and variety of data further empower and support AI and ML applications. Along with tools for processing language, images, video, and audio, AI and ML are advancing capacities to glean insights and intelligence from big data, and other technologies, such as cloud computing, are enhancing the ability to store and process big data. However, while tools and methods are available for handling the complexity of big data in general, more effective approaches are needed for greater specification in dealing with various types of big data (e.g., spatial data) and for assessing and comparing data quality, computing efficiency, and the performance of algorithms under different conditions and across different contexts.

Conclusion

Big data is one of the most pertinent and defining features of the world today and will be even more so in the future. Broadly speaking, big data refers to the collection, analysis, and use of massive amounts of digital information for decision making and operational applications.

With data expected to grow even bigger in terms of pervasiveness, scale, and value in the near future (a process that arguably has accelerated due to the pandemic-intensified growth of “life online”), big data tools and technologies are being developed to allow for the real-time processing and management of large volumes of a variety of data (e.g., generated from IoT devices) to reveal and specify trends and patterns and to indicate relationships and connections to inform relevant decision making, planning, and research.

Big data, as a term, indicates large volumes of information from a variety of sources coming at
very high speeds. However, the wide range of sources and quality of big data can lead to problems and challenges to its use. The opportunities and challenges brought on by the explosion of information require considerations of problems that occur with and because of big data. The massive size and high dimensionality of big datasets present computational challenges and problems of validation linked to not only selection bias and measurement errors, but also to spurious correlations, storage and scalability blockages, noise accumulation, and incidental endogeneity. Moreover, the bigger the data, the bigger the potential not just for its use, but, importantly, for its misuse, including ethical violations, discrimination, and bias. Accordingly, policies and basic approaches are needed to ensure that the possible benefits of big data are maximized, while the downsides are minimized. The bottom line is that the future will be affected by how big data are collected, managed, used, and understood. Data is the foundation of the digital world, and big data are and will be fundamental to determining and realizing value in that context.

Further Reading

boyd, d., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication, and Society, 15(5), 662–679.
Economist. (2010, 27 February). Data data everywhere: A special report on managing information. https://www.emc.com/collateral/analyst-reports/ar-the-economist-data-data-everywhere.pdf.
Ellingwood, J. (2016, 28 September). An introduction to big data concepts and terminology. Digital Ocean. https://www.digitalocean.com/community/tutorials/an-introduction-to-big-data-concepts-and-terminology.
Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review, 1(2), 293–314.
Frehill, L. M. (2015). Everything old is new again: The big data workforce. Journal of the Washington Academy of Sciences, 101(3), 49–62.
Galov, N. (2020, 24 November). 77+ big data stats for the big future ahead | Updated 2020. https://hostingtribunal.com/blog/big-data-stats.
Groves, R. (2011, 31 May). ‘Designed data’ and ‘organic data.’ U.S. Census Bureau Director’s Blog. https://www.census.gov/newsroom/blogs/director/2011/05/designed-data-and-organic-data.html.
Kitchin, R., & McArdle, G. (2016). What makes big data, big data? Exploring the ontological characteristics of 26 datasets. Big Data and Society, 3(1), 1–10.
Lohr, S. (2013, 19 June). Sizing up big data, broadening beyond the internet. New York Times. http://bits.blogs.nytimes.com/2013/06/19/sizing-up-big-data-broadening-beyond-the-internet/?_r=0.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. New York: Houghton Mifflin Harcourt.
McNeely, C. L. (2015). Workforce issues and big data analytics. Journal of the Washington Academy of Sciences, 101(3), 1–11.
McNeely, C. L., & Hahm, J. (2014). The big (data) bang: Policy, prospects, and challenges. Review of Policy Research, 31(4), 304–310.
Schintler, L. A. (2020). Regional policy analysis in the era of spatial big data. In Z. Chen, W. Bowen, & D. Whittington (Eds.), Development studies in regional science (pp. 93–109). Singapore: Springer.
Schintler, L. A., & Chen, Z. (Eds.). (2017). Big data for regional science. New York: Routledge.
Smartym Pro. (2020). How to protect big data? The key big data security challenges. https://smartym.pro/blog/how-to-protect-big-data-the-main-big-data-security-challenges.
Ward, J. S., & Barker, A. (2013). Undefined by data: A survey of big data definitions. http://arxiv.org/pdf/1309.5821v1.pdf.

Big Data Hubs and Spokes

▶ Big Data Research and Development Initiative (Federal, U.S.)

Big Data Integration Tools

▶ Data Integration

Big Data Literacy

Padmanabhan Seshaiyer and Connie L. McNeely
George Mason University, Fairfax, VA, USA

Over the last decade, the need to understand big data has emerged as an increasingly important
area that is making an impact on the least to the most advanced enterprises and societies in the world. Whether it is about analyzing terabytes to petabytes of data or recognizing and revealing or predicting patterns of behavior and interactions, the growth of big data has emphasized the pressing need to develop next generation data scientists who can anticipate user needs and develop “intelligent services” to address business, academic, and government challenges. This would require the engagement of big data proficient students, faculty, and professionals who will help to bridge the big data to knowledge gap in communities and organizations at local, national, and global levels. However, the sheer pervasiveness of big data also makes clear the need for the population in general to have a better understanding of the collection and uses of big data as it affects them directly and indirectly, within and across sectors and circumstances. Especially relevant for addressing community needs in regard to sustainability and development, in public, personal, and professional milieus, at least a basic understanding of big data is increasingly required for life in the growing information society. In other words, there is a general call for “big data literacy,” and in this sense, big data literacy refers to the basic empowerment of individuals and groups to better understand, use, and apply big data assets for effective decision making and societal participation and contribution. So, how is a big data literate populace created, with the ability to address related challenges in real-world settings?

Reality reflects a critical disconnect between big data understanding and awareness cutting across different groups and communities. Related possibilities and expectations must be assessed accordingly, especially since education is linked to financial, social, psychological, and cognitive conditions that promote or hinder literacy development. Underscoring the content dimension of literacy asymmetries, even if people have technological access (e.g., access to computers and the Internet), they may be limited by knowledge gaps that separate them from any meaningful understanding of big data roles, uses, and effects.

In education, acquiring big data acumen is based on exposure to material from multiple areas of study – most notably, mathematical, statistical, and computational foundations. Particularly in reference to developing a data science ready workforce, big data literacy is a complex undertaking, involving knowledge of data acquisition, modeling, management and curation, visualization, workflow and reproducibility, communication and teamwork, domain-specific considerations, and ethical problem solving (cf. NASEM 2018). However, given the need not only for a big data literate workforce, but also for a general population with a basic grasp of big data collection and uses, it is important to look beyond training individuals in data science analytics and skills to a more broad-based competency in big data.

In today’s increasingly digitized real-world contexts, the need to understand, use, and apply big data to address everyday challenges is expected. An example can be found in the contributions of big data to critical questions of sustainability and development (Sachs 2012). For instance, big data can help provide real-time information on income levels via spending patterns on mobile devices. Another example is tracking access to clean water through data collected from sensors connected to water pumps. Big data also is used in analysis of social media in revealing public opinions on various issues, such as effective governance or human rights. Examples such as these suggest the importance of big data literacy in solving global challenges.

Contexts can be engaged to help build a meaningful map to data competency. Frameworks such as design thinking or system thinking provide user-centered approaches to big data problem solving. Identifying and defining problems, providing guidelines for accessible solutions, and assessing implementation feasibility are integral to building big data literacy and competency. Such frameworks can be used to decompose big problems into small problems (big data to small data) and design adaptive and effective strategies. Central to this process – and key for understanding – are capacities for big data interpretation, planning, and application for making decisions and navigating life in a digitized environment.

Big data literacy and related skills concern capacities for effective and efficient uses of data resources, encompassing awareness of what is possible, knowing how to find applicable information, being able to assess content validity and perform related tasks, and engaging and managing relevant knowledge. Defined in terms of understanding, use, and application in real-world contexts, big data literacy is critical for sustainability and development for individuals and communities today and in the future. Having said that, a cautionary note is also relevant for big data literacy, particularly regarding the ethical implications and impacts – including bias and representation – of big data collection and uses through artifacts such as images, social media, videos, etc. (Newton et al. 2005). For example, questions of surveillance, monitoring, and privacy are particularly relevant in terms of big data collection and uses and related effects (Xu and Jia 2015). Awareness of these kinds of issues is central to developing big data literacy.

Cross-References

▶ Data Scientist
▶ Digital Literacy
▶ Ethics
▶ Privacy

Further Reading

National Academies of Sciences, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. Washington, DC: National Academies Press.
Newton, E. M., Sweeney, L., & Malin, B. (2005). Preserving privacy by de-identifying face images. IEEE Transactions on Knowledge and Data Engineering, 17(2), 232–243.
Sachs, J. D. (2012). From millennium development goals to sustainable development goals. The Lancet, 379(9832), 2206–2211.
Xu, H., & Jia, H. (2015). Privacy in a networked world: New challenges and opportunities for privacy research. Journal of the Washington Academy of Sciences, 101(3), 73–84.

Big Data Quality

Subash Thota
Synectics for Management Decisions, Inc., Arlington, VA, USA

Introduction

Data is the most valuable asset for any organization. Yet in today’s world of big and unstructured data, more information is generated than can be collected and properly analyzed. The onslaught of data presents obstacles to making data-driven decisions. Data quality is an essential characteristic of data that determines the reliability of data for making decisions in any organization or business. Errors in data can cost a company millions of dollars, alienate customers, and make implementing new strategies difficult or impossible (Redman 1995).

In practically every business instance, project failures and cost overruns are due to fundamental misunderstanding about the data quality that is essential to the initiative. A global data management survey by PricewaterhouseCoopers of 600 companies across the USA, Australia, and Britain showed that 75% of reported significant problems were a result of data quality issues, with 33% of those saying the problems resulted in delays in getting new business intelligence (BI) systems running or in having to scrap them altogether (Capehart and Capehart 2005). The importance and complexity related to data and its quality compounds incrementally and could potentially challenge the very growth of the business that acquired the data. This entry showcases challenges related to data quality and approaches to mitigating data quality issues.

Data Defined

Data is “. . . language, mathematical or other symbolic surrogates which are generally agreed upon to represent people, objects, events and concepts” (Liebenau and Backhouse 1990). Vayghan et al.
(2007) argued that most enterprises deal with three types of data: master data, transactional data, and historical data. Master data are the core data entities of the enterprise, i.e., customers, products, employees, vendors, suppliers, etc. Transactional data describe an event or transaction in an organization, such as sales orders, invoices, payments, claims, deliveries, and storage records. Transactional data is time bound and changes to historical data once the transaction has ended. Historical data contain facts, as of a certain point in time (e.g., database snapshots), and version information.

Data Quality

Data quality is the capability of data to fulfill and satisfy the stated business, framework, system, and technical requirements of an enterprise. A classic definition of data quality is “fitness for use,” or more specifically, the extent to which some data successfully serve the purposes of the user (Tayi and Ballou 1998; Cappiello et al. 2003; Lederman et al. 2003; Watts et al. 2009).

To be able to correlate data quality issues to business impacts, we must be able to classify both our data quality expectations and our business impact criteria. In order to do that, it is valuable to understand these common data quality dimensions (Loshin 2006):

– Completeness: Is all the requisite information available? Are data values missing, or in an unusable state? In some cases, missing data is irrelevant, but when the information that is missing is critical to a specific business process, completeness becomes an issue.
– Conformity: Are there expectations that data values conform to specified formats? If so, do all the values conform to those formats? Maintaining conformance to specific formats is important in data representation, presentation, aggregate reporting, search, and establishing key relationships.
– Consistency: Do distinct data instances provide conflicting information about the same underlying data object? Are values consistent across data sets? Do interdependent attributes always appropriately reflect their expected consistency? Inconsistency between data values plagues organizations attempting to reconcile different systems and applications.
– Accuracy: Do data objects accurately represent the “real-world” values they are expected to model? Incorrect spellings of products, personal names or addresses, and even untimely or not current data can impact operational and analytical applications.
– Duplication: Are there multiple, unnecessary representations of the same data objects within your data set? The inability to maintain a single representation for each entity across your systems poses numerous vulnerabilities and risks.
– Integrity: What data is missing important relationship linkages? The inability to link related records together may actually introduce duplication across your systems. Not only that, as more value is derived from analyzing connectivity and relationships, the inability to link related data instances together impedes this valuable analysis.
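Several of these dimensions can be checked mechanically once the expectations are stated explicitly. The following is a minimal sketch of such checks using the pandas library; the table, the column names, and the 5-digit ZIP format rule are hypothetical stand-ins for whatever a given enterprise’s data model actually specifies.

import pandas as pd

# Hypothetical customer records; columns and the ZIP rule are illustrative assumptions only.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104, 105],
    "email": ["a@example.com", "b@example", None, "d@example.com", "e@example.com"],
    "zip_code": ["22030", "2203", "22031", None, "22032"],
})

# Completeness: share of non-missing values in each column.
completeness = df.notna().mean()

# Conformity: share of values matching the expected format (here, a 5-digit ZIP code).
zip_conformity = df["zip_code"].dropna().str.fullmatch(r"\d{5}").mean()

# Duplication: repeated values in a field that should identify each entity uniquely.
duplicate_ids = int(df["customer_id"].duplicated().sum())

print(completeness)
print(f"ZIP conformity: {zip_conformity:.0%}; duplicate customer IDs: {duplicate_ids}")

Accuracy and integrity checks typically also require a trusted reference source (e.g., a master customer table) to compare against, which is one reason the master data management approaches discussed below matter.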
Causes and Consequences

The “Big Data” era comes with new challenges for data quality management. Beyond volume, velocity, and variety lies the importance of the fourth “V” of big data: veracity. Veracity refers to the trustworthiness of the data. Due to the sheer volume and velocity of some data, one needs to embrace the reality that when data is extracted from multiple datasets at a fast and furious clip, determining the semantics of the data – and understanding correlations between attributes – becomes of critical importance.

Companies that manage their data effectively are able to achieve a competitive advantage in the marketplace (Sellar 1999). On the other hand, bad data can put a company at a competitive disadvantage (Greengard 1998). It is therefore important to understand some of the causes of bad data quality:

• Lack of data governance standards or validation checks.
• Data conversion, which usually involves transfer of data from an existing data source to a new database.
• Increasing complexity of data integration and enterprise architecture.
• Unreliable and inaccurate sources of information.
• Mergers and acquisitions between companies.
• Manual data entry errors.
• Upgrades of infrastructure systems.
• Multidivisional or line-of-business usage of data.
• Misuse of data for purposes different from the capture reason.

Different people performing the same tasks have a different understanding of the data being processed, which leads to inconsistent data making its way into the source systems. Poor data quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits (Friedman and Smith 2011). Marsh (2005) summarizes the consequences in one of his articles:

• Eighty-eight percent of all data integration projects either fail completely or significantly overrun their budgets.
• Seventy-five percent of organizations have identified costs stemming from dirty data.
• Thirty-three percent of organizations have delayed or canceled new IT systems because of poor data.
• $611B per year is lost in the USA to poorly targeted bulk mailings and staff overheads.
• According to Gartner, bad data is the number one cause of customer-relationship management (CRM) system failure.
• Less than 50% of companies claim to be very confident in the quality of their data.
• Business intelligence (BI) projects often fail due to dirty data, so it is imperative that BI-based business decisions are based on clean data.
• Only 15% of companies are very confident in the quality of external data supplied to them.
• Customer data typically degenerates at 2% per month or 25% annually.

To Marsh, organizations typically overestimate the quality of their data and underestimate the cost of data errors. Business processes, customer expectations, source systems, and compliance rules are constantly changing – and data quality management systems must reflect this. Vast amounts of time and money are spent on custom coding and “firefighting” to dampen an immediate crisis rather than dealing with the long-term problems that bad data can present to an organization.

Data Quality: Approaches

Due to the large variety of sources from which data is collected and integrated, and given its sheer volume and changing nature, it is impractical to specify data quality rules entirely by hand. Below are a few approaches to mitigating data quality issues:

1. Enterprise Focus and Discipline

Enterprises should be more focused and engaged toward data quality issues; views toward data cleansing must evolve. Clearly defining roles and outlining the authority, accountability, and responsibility for decisions regarding enterprise data assets provides the necessary framework for resolving conflicts and driving a business forward as the data-driven organization matures. Data quality programs are most efficient and effective when they are implemented in a structured, governed environment.

2. Implementing MDM and SOA

The goal of a master data management (MDM) solution is to provide a single source of truth of data, thus providing a reliable foundation for that data across the organization. This prevents business users across an organization from using different versions of the same data. Another approach to big data governance is the deployment of cloud-based models and service-oriented architecture (SOA). SOA enables the tasks associated with a data quality program to be deployed as a set of services that can be called dynamically by applications. This allows business rules for data quality enforcement to be moved outside of applications and applied universally at
a business process level. These services can either be called proactively by applications as data is entered into an application system, or by batch after the data has been created.

3. Implementing Data Standardization and Data Enrichment

Data standardization usually covers reformatting of user-entered data without any loss of information or enrichment of information. Such solutions are most suitable for applications that integrate data. Data enrichment covers the reformatting of data with additional enrichment or addition of useful referential and analytical information.

Data Quality: Methodology in Profiling

Data profiling provides a proactive way to manage and comprehend an organization’s data. Data profiling is explicitly about discovering and reviewing the underlying data available to determine the characteristics, patterns, and essential statistics about the data. Data profiling is an important diagnostic phase that furnishes quantifiable and tangible facts about the strength of the organization’s data. These facts not only help in establishing what data is available in the organization but also how accurate, valid, and usable the data is. Data profiling covers numerous techniques and processes:

– Data Ancestry: This covers the lineage of the dataset. It describes the source from which the data is acquired or derived and the method of acquisition.
– Data Accuracy: This is the closeness of the attribute data associated with an object or feature to the true value. It is usually recorded as the percentage correctness for each topic or attribute.
– Data Latency: This is the level at which the data is current or accurate to date. This can be measured by having appropriate data reconciliation procedures to gauge any unintended delays in acquiring the data due to technical issues.
– Data Consistency: This is the fidelity or integrity of the data within data structures or interfaces.
– Data Adherence: This is a measure of compliance or adherence of the data to the intended standards or logical rules that govern the storage or interpretation of data.
– Data Duplicity: This is a measure of duplicate records or fields in the system that can be consolidated to reduce maintenance costs and improve the efficiency of system storage processes.
– Data Completeness: This is a measure of the correspondence between the real world and the specified dataset.

In assessing a dataset for veracity, it is important to answer core questions about it:

• Do the patterns of the data match expected patterns?
• Do the data adhere to appropriate uniqueness and null value rules?
• Are the data complete?
• Are they accurate?
• Do they contain information that is easily understood and unambiguous?
• Do the data adhere to specified required key relationships across columns and tables?
• Are there inferred relationships across columns, tables, or databases?
• Are there redundant data?
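Several of these questions can be answered with straightforward profiling code. The sketch below assumes two hypothetical pandas tables, orders and customers, with illustrative keys and values; it checks value patterns, uniqueness and null rules, and a cross-table key relationship.

import pandas as pd

# Illustrative tables; the names, keys, and values are assumptions for the sketch.
orders = pd.DataFrame({"order_id": [1, 2, 3, 4],
                       "customer_id": [101, 102, 999, 101],
                       "amount": ["10.50", "7", "N/A", "12.00"]})
customers = pd.DataFrame({"customer_id": [101, 102, 104]})

# Pattern profiling: reduce each value to a crude signature (digits -> 9, letters -> A)
# and count how often each signature occurs; rare or unexpected signatures flag suspect values.
def signature(value) -> str:
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in str(value))

print(orders["amount"].map(signature).value_counts())

# Uniqueness and null rules on the primary key.
print("order_id unique:", orders["order_id"].is_unique)
print("order_id nulls:", int(orders["order_id"].isna().sum()))

# Key relationship across tables: every order should reference a known customer.
orphans = ~orders["customer_id"].isin(customers["customer_id"])
print("orders with unknown customer_id:", int(orphans.sum()))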
Data in an enterprise is often derived from different sources, resulting in data inconsistencies and nonstandard data. Data profiling helps analysts dig deeper to look more closely at each of the individual data elements and establish which data values are inaccurate, incomplete, or ambiguous. Data profiling allows analysts to link data in disparate applications based on their relationships to each other or to a new application being developed. Different pieces of relevant data spread across many individual data stores make it difficult to develop a complete understanding of an enterprise’s data. Therefore, data profiling helps one understand how data sources interact with other data sources.
Metadata

Metadata is used to describe the characteristics of a data field in a file or a table and contains information that indicates the data type, the field length, whether the data should be unique, and if a field can be missing or null. Pattern matching determines if the data values in a field are in the likely format. Basic statistics about data such as minimum and maximum values, mean, median, mode, and standard deviation can provide insight into the characteristics of the data.
the data. Systems, 20(3), 71–91.
Friedman, T., & M. Smith. (2011). Measuring the business
value of data quality (Gartner ID# G00218962). Avail-
Conclusion able at: http://www.data.com/export/sites/data/com
mon/assets/pdf/DS_Gartner.pdf.
Greengard, S. (1998). Don’t let dirty data derail you. Work-
Ensuring data quality is one of the most pressing force, 77(11), 107–108.
challenges today for most organizations. With Knolmayer, G., & Röthlin, M. (2006). Quality of material
applications constantly receiving new data and master data and its effect on the usefulness of distrib-
uted ERP systems. Lecture Notes in Computer Science,
undergoing incremental changes, achieving data
4231, 362–371.
quality cannot be a onetime event. As organiza- Lederman, R., Shanks, G., Gibbs, M.R. (2003). Meeting
tions’ appetite for big data grows daily in their privacy obligations: the implications for information
quest to satisfy customers, suppliers, investors, systems development. Proceedings of the 11th Euro-
pean Conference on Information Systems. Paper pre-
and employees, the common obstacle of impedi- sented at ECIS: Naples, Italy.
ment is data quality. Improving data quality is the Liebenau, J., & Backhouse, J. (1990). Understanding
lynchpin to a better enterprise, better decision- information: an introduction. Information systems.
making, and better functionality. Palgrave Macmillan, London, UK.
Loshin, D. (2006). The data quality business case: Pro-
Data quality can be improved, and there are
jecting return on investment (White paper). Available
methods for doing so that are rooted in logic and at: http://knowledge-integrity.com/Assets/data_qual
experience. On the market are commercial off-the- ity_business_case.pdf.
shelf (COTS) products which are simple, intuitive Marsh, R. (2005). Drowning in dirty data? It’s time to sink
or swim: A four-stage methodology for total data qual-
methods to manage and analyze data – and establish
ity management. Database Marketing & Customer
business rules for an enterprise. Some can imple- Strategy Management, 12(2), 105–112. Available at:
ment a data quality layer that filters any number of http://link.springer.com/article/10.1057/palgrave.dbm.
sources for quality standards; provide real-time 3240247.
Redman, T. C. (1995). Improve data quality for competi-
monitoring; and enable the profiling of data prior tive advantage. MIT Sloan Management Rev., 36(2),
to absorption and aggregation with a company’s pp. 99–109.
core data. At times, however, it will be necessary Sellar, S. (1999). Dust off that data. Sales and Marketing
to bring in objective, third-party subject-matter Management, 151(5), 71–73.
Tayi, G. K., & Ballou, D. P. (1998). Examining data qual-
experts for an impartial analysis and solution of an
ity. Communications of the ACM, 41(2), 54–57.
enterprise-wide data problem. Vayghan, J. A., Garfinkle, S. M., Walenta, C., Healy, D. C.,
Whatever path is chosen, it is important for an & Valentin, Z. (2007). The internal information trans-
organization to have a master data management formation of IBM. IBM Systems Journal, 46(4), 669–
684.
(MDM) plan no differently than it might have a
Watts, S., Shankaranarayanan, G., & Even, A. (2009). Data
recruiting plan or a business development plan. A quality assessment in context: A cognitive perspective.
sound MDM creates an ever-present return Decision Support Systems, 48(1), 202–211.

Big Data R&D

▶ Big Data Research and Development Initiative (Federal, U.S.)

Big Data Research and Development Initiative (Federal, U.S.)

Fen Zhao1 and Suzi Iacono2
1Alpha Edison, Los Angeles, CA, USA
2OIA, National Science Foundation, Alexandria, VA, USA

Synonyms

BD hubs; BD spokes; Big data hubs and spokes; Big data R&D; Data science; Harnessing the data revolution; HDR; NSF

Introduction

On March 29, 2012, the Office of Science and Technology Policy (OSTP) and the Committee on Technology’s Networking and Information Technology Research and Development Subcommittee (NITRD) launched the Federal Big Data Research and Development (R&D) Initiative. Since then, the National Science Foundation has been a leader across federal agencies in supporting and catalyzing Big Data research, development, and innovation across the scientific, public, and private sectors. This entry summarizes the timeline of Big Data and data science activities initiated by the NSF since the start of the initiative.

The Fourth Paradigm

Over the course of history, advances in science and engineering have depended on the development of new research infrastructure. The advent of microscopes, telescopes, undersea vehicles, sensor networks, particle accelerators, and scientific observatories has opened windows into our observation and understanding of natural, engineered, and social phenomena. For scientists, access to these instruments unlocks a myriad of new theories and approaches to the discovery process to help them develop new and different kinds of insights and breakthroughs. Many of these results have big payoffs for society – helping find novel solutions to challenges in health, education, national security, disaster prevention and mitigation, the economy, and the scientific discovery process itself.

Today, most research instruments produce very large amounts of data. Extracting knowledge from these increasingly large and diverse datasets requires a transformation in the culture and conduct of scientific discovery. Some have referred to this movement as the Fourth Paradigm (Tansley and Tolle 2009), where the unique capabilities of data science have defined a new mode of scientific discovery and a new era of scientific progress.

Take, as an illustrative example, the revolution occurring in oceanography. Oceanographers are no longer limited, as they had been in previous decades, to the small amounts of data they can collect in summer research voyages. Now, they can remotely collect data from big sensor networks at the bottom of the sea or in estuaries and rivers and conduct real-time analysis of that data. This story of innovation is echoed in countless scientific disciplines, as having more access to complete datasets and advanced analytic techniques spurs faster insights and improved hypotheses.

Harnessing this so-called data revolution has enormous potential impact. That is why the National Science Foundation (NSF) recently announced that Harnessing Data for 21st Century Science and Engineering (HDR) would be one of NSF’s 10 Big Ideas. The Big Ideas are a set of bold ideas for the Foundation that look ahead to the most impactful trends in the future of science and society:

. . . NSF proposes Harnessing Data for 21st Century Science and Engineering, a bold initiative to
develop a cohesive, national-scale approach to research data infrastructure and a 21st-century workforce capable of working effectively with data. This initiative will support basic research in math, statistics and computer science that will enable data-driven discovery through visualization, better data mining, machine learning and more. It will support an open cyberinfrastructure for researchers and develop innovative educational pathways to train the next generation of data scientists.

This initiative builds on NSF’s history of data science investments. As the only federal agency supporting all fields of S&E, NSF is uniquely positioned to help ensure that our country’s future is one enriched and improved by data.

What’s All the Excitement About?

By 2017, the term “Big Data” had become a common buzzword in both the academic and commercial sectors. International Data Corporation (IDC) regularly makes predictions about the growth of data: “The total amount of data in the world was 4.4 zettabytes in 2013. That is set to rise steeply to 44 zettabytes by 2020. To put that in perspective, 44 zettabytes is equivalent to 44 trillion gigabytes. This sharp rise in data will be driven by rapidly growing daily production of data” (Turner et al. 2014). Now, IDC believes that by 2025 the total will hit 180 zettabytes.

Complementary to the continuing focus on Big Data, many thought leaders today have started to emphasize the importance of “little data” or data at the long tail of science. There are thousands of scientists who rely on their own methods to collect and store data at small or medium scales in their offices and labs. The ubiquity and importance of data at all scales within scientific research has led NSF to mandate that every proposed project include a data management plan. This data management plan is a critical part of the merit review of that project, regardless of the size of the datasets being collected and analyzed.

The importance of data at the long tail of science illustrates that some of the most exciting facets of data innovation do not center only on scale. Often experts will cite a framework of the so-called Four Vs of Big Data, which help summarize the challenges and opportunities for data: volume, variety, velocity, and value. The ability to integrate a variety of types of data across different areas of science allows us to target grand challenges whose solution continues to elude us when approached from a single discipline viewpoint – for example, leveraging data between or among fields observing the earth, ocean, and/or atmosphere can help us to answer the biggest and most challenging research questions about our environment as a whole. Similarly, our ability to handle data at the velocity of life is necessary for addressing the many challenges that must be acted upon in real time – for example, responding to storms and other disasters. Yet, technology imposes many limitations on updating simulations and models and supporting decision-making on the ground in the pressing moment of need. Finally, understanding and quantifying the value of the massive numbers of diverse datasets now collected by all sectors of society is still an open question for the data science research community. Understanding the value and use of datasets is critical to finding solutions to major challenges around curation, reproducibility, storage, and long-term data management as well as to privacy and security considerations. If our research communities can resolve today’s many hard problems around data science, benefits will be global while strengthening opportunities for US leadership.

Kicking Off a Federal Big Data Research and Development Initiative

In December 2010, the President’s Council of Advisors on Science and Technology (PCAST) report, Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology (Holdren 2010), challenged the federal research agencies to take actions to support more research and development (R&D) on Big Data.

Shortly after that, the Office of Science and Technology Policy (OSTP) responded to
this call and chartered an interagency Big Data Senior Steering Group (BDSSG) under the Committee on Technology’s Networking and Information Technology Research and Development Subcommittee (NITRD). NSF and the National Institutes of Health (NIH) have co-chaired this group over the years, while approximately 18 other research agencies have sent representatives to the meetings. Several non-research federal agencies with interests in data technologies also participated informally in the BDSSG.

Over the course of the years following its establishment, the BDSSG inventoried the existing Big Data programs and projects at each of the participating agencies and began coordinating across the agencies. Efforts were divided into four main areas: investments in Big Data foundational research, development of cyberinfrastructure in support of domain-specific data-intensive science and engineering, support for data science education and workforce development, and activities in support of increased collaboration and partnerships with the private sector. Other important areas, such as privacy and open access, were also identified as critical to Big Data by the group but became the focus areas of new NITRD subgroups – for example, the Privacy Research and Development Interagency Working Group (Privacy R&D IWG). Big Data R&D remained the central focus of the work of the BDSSG.

On March 29, 2012, OSTP and NITRD launched the Federal Big Data Research and Development Initiative across federal agencies. The NSF press release states:

At an event led by the White House Office of Science and Technology Policy in Washington, D.C., (then NSF Director) Suresh joined other federal science agency leaders to discuss cross-agency big data plans and announce new areas of research funding across disciplines in this field.

NSF announced new awards under its Cyberinfrastructure for the 21st Century framework and Expeditions in Computing programs, as well as awards that expand statistical approaches to address big data. The agency is also seeking proposals under a Big Data solicitation, in collaboration with the National Institutes of Health (NIH), and anticipates opportunities for cross-disciplinary efforts under its Integrative Graduate Education and Research Traineeship program and an Ideas Lab for researchers in using large datasets to enhance the effectiveness of teaching and learning.

About 11 other agencies also participated. The White House Big Data Fact Sheet included their announcements. Here are some examples:

• DARPA launched the XDATA program, which sought to develop computational techniques and software tools for analyzing large volumes of semi-structured and unstructured data. Central challenges to be addressed included scalable algorithms for processing imperfect data in distributed data stores and effective human-computer interaction tools that are rapidly customizable to facilitate visual reasoning for diverse missions.
• DHS announced the Center of Excellence on Visualization and Data Analytics (CVADA), a collaboration among researchers at Rutgers University and Purdue University (with three additional partner universities each) that leads research efforts on large, heterogeneous data that First Responders could use to address issues ranging from man-made or natural disasters to terrorist incidents, law enforcement to border security concerns, and explosives to cyber threats.
• NIH highlighted The Cancer Genome Atlas (TCGA) project – a comprehensive and coordinated effort to accelerate understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

Collectively, these activities had an impressive impact; in a 2013 report, PCAST commended the agencies for their efforts to push Big Data into the forefront of their research priorities: “Federal agencies have made significant progress in supporting R&D for data collection, storage, management, and automated large-scale analysis.”
Taking the Next Steps: Developing National, Multi-stakeholder Big Data Partnerships

Entering the second year of the Big Data Initiative, the BDSSG expanded the framing of the Federal Big Data R&D Initiative as a coordinated national endeavor rather than just a federal government effort. To encourage the participation of stakeholders in private industry, academia, state and local government, nonprofits, and foundations to develop and participate in Big Data Initiatives across the country, in April 2013, NSF issued a request for information (RFI) about Big Data. This RFI encouraged non-federal stakeholders to identify the kinds of Big Data projects they were willing to participate in to further Big Data innovation across the country. Of particular interest were cross-sector partnerships designed to advance core Big Data technologies, harness the power of Big Data to advance national goals, initiate new competitions and challenges, and foster regional innovation.

In November 2013, the BDSSG and NITRD convened Data to Knowledge to Action (Data 2Action), an event in Washington, DC, that highlighted a number of high-impact, novel, multi-stakeholder partnerships surfaced through the RFI and later outreach efforts. These projects embraced collaboration between the public and private sectors and promoted the sharing of data resources and the use of new sophisticated tools to plumb the depths of huge datasets and derive greater value for American consumers while growing the nation’s economy.

The event featured scores of announcements by corporations, educational institutions, professional organizations, and others that – in collaboration with federal departments and agencies and state and local governments – enhance national priorities such as economic growth and job creation, education and health, energy and sustainability, public safety and national security, and global development. About 30 new partnerships were announced with a total of about 90 partners. Some examples include:

• A new Big Data analytics platform, Spark, created by UC Berkeley’s AMPLab and funded by NSF, DARPA, DOE, and a host of private companies such as Google and SAP.
• A summer program called Data Science for Social Good (funded by the Schmidt Family Foundation and University of Chicago with partners including the City of Chicago, Cook County Land Bank, Cook County Sheriff, Lawrence Berkeley National Labs, and many others) that hosted fellows to create applications to solve data science challenges as defined by their partners.
• Global corporations Novartis, Pfizer, and Eli Lilly partnered to improve access to information about clinical trials, including matching individual health profiles to applicable clinical trials.

While the Data 2Action event catalyzed federal outreach on Big Data research to communities beyond academia, continuing Big Data innovation on a national scale required sustained community investment beyond federal coordination. To help achieve this goal of sustained community dialogue and partnerships around Big Data, in the fall of 2014, NSF’s Directorate for Computer and Information Science and Engineering (CISE) announced a plan to establish a National Network of Big Data Regional Innovation Hubs (BD Hubs). Released in winter 2015, the program solicitation to create the Hubs states:

Each BD Hub [is] a consortium of members from academia, industry, and/or government. . . [and is] across distinct geographic regions of the United States, including the Northeast, Midwest, South, and West... and focus[es] on key Big Data challenges and opportunities for its region of service. The BD Hubs aim to support the breadth of interested local stakeholders within their respective regions, while members of a BD Hub should strive to achieve common Big Data goals that would not be possible for the independent members to achieve alone.

To foster collaboration among prospective partners within a region, in April 2015, NSF sponsored a series of intensive 1-day “charrettes” to convene
stakeholders, explore Big Data challenges, and aid in the establishment of the Hub consortia. “Charrettes” are meetings in which all stakeholders in a project attempt to resolve conflicts and map solutions. NSF convened a charrette in each of the four Hub geographic regions. To facilitate discussion beyond the charrette, a HUBzero community portal was established over the course of the initial Hub design and implementation process. Potential partners used this portal to communicate with other members or potential partners within their Hub.

In November 2015, NSF announced seven awards totaling more than $5 million to establish four regional Hubs for data science innovation. The consortia are coordinated by top data scientists at Columbia University (Northeast Hub), Georgia Institute of Technology with the University of North Carolina (South Hub), University of Illinois at Urbana-Champaign (Midwest Hub), and University of California, San Diego, University of California, Berkeley, and University of Washington (West Hub).

Covering all 50 states, they include initial partnership commitments from more than 250 organizations. These organizations ranged from universities and cities to foundations and Fortune 500 corporations, and the four Hubs developed plans to expand the consortia further over time. The network of four Hubs established a “big data brain trust” geared toward conceiving, planning, and supporting Big Data partnerships and activities to address regional challenges. Among the benefits of the program for Hub members are greater ease in initiating partnerships by reducing coordination costs; opportunities for sharing ideas, resources, and best practices; and access to top data science talent.

While Hubs focused primarily on ideation and coordination of regional Big Data partnerships, additional modes of support were needed for the actual projects that were to become the outputs of those coordination efforts. These projects were called the “Spokes” of the Big Data Hub network. The Spokes were meant to focus on data innovation in specific areas of interest, for example, drought data archives in the west or health data on underrepresented minorities in the south. In Fall 2015, NSF solicited proposals for Spokes projects that would work in concert with their corresponding regional BD Hub to address one of three broad challenges in Big Data:

• Accelerating progress towards addressing societal grand challenges relevant to regional and national priority areas;
• Helping automate the Big Data lifecycle; and
• Enabling access to and increasing use of important and valuable available data assets, also including international datasets....

Similar to a Hub, each Big Data Spoke takes on a convening and coordinating role as opposed to conducting fundamental research. Unlike a Hub, each Spoke would have a specific goal-driven scope within an application or technology area. Typical Spoke activities included, for example, gathering important stakeholders via forums, meetings, or workshops; engaging with end users and solution providers via competitions and community challenges; and forming multidisciplinary teams to tackle questions no single field could solve alone. Strategic leadership guiding both the Big Data Hubs and Spokes comes from each Hub’s Steering Committee – a group of Big Data experts and thought leaders across sectors that act as advisors and provide general guidance.

In 2016 and 2017, NSF awarded $13 million to 11 Spoke projects, 10 planning grants, and a number of other Spoke-related projects. Project topics range from precision agriculture to personalized education and from data sharing to reproducibility. The range of Spoke topics reflected the unique priorities and capabilities of the four Big Data Hubs and their regional interests. A second Spoke solicitation was released in March 2017, and new awards are expected by the end of fiscal year 2018.

Developing an Interagency Strategic Plan

Starting in 2014, the BDSSG began work on an interagency strategic plan to help coordinate
future investments in Big Data R&D across the federal research agencies. A key assumption of this plan was that it would not be prescriptive at any level, but instead would be a potential enabler of future agency actions by surfacing areas of commonalities and priority to support agency missions. The development of this strategic plan was supported through a number of cross-agency workshops, a request for information from the public, and a workshop with non-federal stakeholders to gauge their opinions.

Building upon all the work that had been carried out to date on the National Big Data R&D Initiative, the Federal Big Data Research and Development Strategic Plan (Plan) [NITRD 2016] aimed to “build upon the promise and excitement of the myriad applications enabled by Big Data with the objective of guiding Federal agencies as they develop and expand their individual mission-driven programs and investments related to Big Data.” The Plan described a vision for Big Data innovation shared across federal research agencies (Fig. 1):

We envision a Big Data innovation ecosystem in which the ability to analyze, extract information from, and make decisions and discoveries based upon large, diverse, and real-time datasets enables new capabilities for Federal agencies and the Nation at large; accelerates the process of scientific discovery and innovation; leads to new fields of research and new areas of inquiry that would otherwise be
impossible; educates the next generation of 21st century scientists and engineers; and promotes new economic growth.

Big Data Research and Development Initiative (Federal, U.S.), Fig. 1 The cover of the Federal Big Data Research and Development Strategic Plan (2016)

The Plan articulates seven strategies that represent key areas of importance for US Big Data R&D. These are:

Strategy 1: Create next-generation capabilities by leveraging emerging Big Data foundations, techniques, and technologies.
Strategy 2: Support R&D to explore and understand trustworthiness of data and resulting knowledge, to make better decisions, enable breakthrough discoveries, and take confident action.
Strategy 3: Build and enhance research cyberinfrastructure that enables Big Data innovation in support of agency missions.
Strategy 4: Increase the value of data through policies that promote sharing and management of data.
Strategy 5: Understand Big Data collection, sharing, and use with regard to privacy, security, and ethics.
Strategy 6: Improve the national landscape for Big Data education and training to fulfill increasing demand for both deep analytical talent and analytical capacity for the broader workforce.
Strategy 7: Create and enhance connections in the national Big Data innovation ecosystem.

While the strategic plan addresses the challenges outlined by the Four Vs of Big Data, it encompasses a broader vision for the future of data science and its application toward mission and national goals. Strategy 2 emphasizes the need to move beyond managing scale to enabling better use of data analytics outputs; rather than focusing on the first part of the “Data to Knowledge to Action” pipeline, which is usually focused on purely technological solutions, it recognizes the need to understand the sociotechnical needs to derive actionable insight from data-driven knowledge. Strategies 3 and 4 both address the national need to sustain an ecosystem of open data and the tools to analyze that data; such an infrastructure supports not only federal agency missions but the utility of Big Data to the private sector and the public at large. Agencies also recognized the risks and challenges in using Big Data in developing Strategy 5, focusing not only on the privacy, security, and ethical challenges that come with Big Data analytics today but pressing for more research on how to reduce risk and maximize benefits for the data-driven technologies of the future.

Multiple industry reports (Manyika et al. 2011) have forewarned of a dramatic and continuing deficit of data analytics talent within the USA. This deficit ranges from data-savvy knowledge workers to PhD-trained data scientists. Research agencies acknowledge the need for programs that support the development of a workforce with data skills at all levels to staff, support, and communicate their mission programs.

Through the strategic planning process, agencies saw many synergies between different agency missions in their use of Big Data. Strategy 7 acknowledges ways that agencies could act collectively to create interagency programs and infrastructures to share benefits across the federal government.

Moving Toward the Future: Harnessing the Data Revolution

NSF’s Harnessing the Data Revolution (HDR) Big Idea builds on past investments and lays the foundations for the future transformation of science and society by data technologies. HDR has a number of major themes, which are outlined in Fig. 2.

Given NSF’s breadth of influence over almost all fields of science, the Foundation can bring together science, engineering, and education experts into convergence teams to make this vision a reality. NSF has a unique role within universities (which are critical participants) in the support of research, sustainable research infrastructure, and development of human capital. NSF also has strong connections with industry and with funding agencies around the world. Given the trend toward global science and the value of sharing research data internationally, NSF is well positioned to work with other research agencies when moving forward on research priorities.

HDR’s thesis on foundational theory-based research in data science is that it must exist at the intersection of math, statistics, and computer
96 Big Data Research and Development Initiative (Federal, U.S.)

Experts in each of these three disciplines must leverage the unique perspective of their field in unison to develop the next generation of methods for data science. The TRIPODS program recognizes this needed convergence by funding data science center-lets (pre-center scale grants) that host experts across all three disciplines.

Today, research into the algorithms and systems for data science must be sociotechnical in nature. New data tools must manage not only the Four Vs but also the challenges of human error or misuse. New systems are needed to help data users understand the limits of their statistical analysis, manage privacy and ethical issues, ensure reproducibility of their results, and efficiently share data with others.

Promoting progress in the scientific disciplines is the core of NSF's mission. At the heart of the HDR Big Idea is the tantalizing potential of leapfrogging progress in multiple sciences through applications of translational data science. The benefits of advanced data analytics and management could be leveraged by all sizes of science research projects, from individual to center scale.

One of the key components of the HDR Big Idea is the design and construction of a national data infrastructure of use to a wide array of science and engineering communities supported by NSF. Past investments by NSF have built some basic components of this endeavor, but others have yet to be imagined. The ultimate goal is a co-designed, robust, comprehensive, open, science-driven, research cyberinfrastructure (CI) ecosystem capable of accelerating a broad spectrum of data-intensive research, including research in large scale and Major Research Equipment and Facilities Construction (MREFC).

Innovative learning opportunities and educational pathways are needed to build a twenty-first-century data-capable workforce. These opportunities must be grounded in an
education research-based understanding of the knowledge and skill demands needed by that workforce. NSF's current education programs span the range from informal, K-12, undergraduate, graduate, and postgraduate education and can be leveraged to train the full diversity of this nation's workforce needs. Of particular interest is undergraduate and graduate data science training, to create pi-shaped scientists who have broad skills but with deep expertise in data science in addition to their scientific domain.

In Summary

This short entry summarizes some of the work done for the National Big Data Research and Development Initiative since the beginning of calendar year 2011. It is written from the point of view of NSF because that is where the authors are located. It should be noted that one could imagine different narratives if it were written from the NIH or DARPA perspectives, with stories that would be equally compelling.

Through a reorganization of the NITRD Subcommittee, the BDSSG was renamed the Big Data Interagency Working Group (BDIWG), showing its intention to be a permanent part of that organization. The current co-chairs are Chaitan Baru from NSF and Susan Gregurick from NIH. But the work continues.

Further Reading

Holdren, J. P., Lander, E., & Varmus, H. (2010). Report to the president and congress: Designing a digital future: Federally funded research and development in networking and information technology. Executive Office of the President and President's Council of Advisors on Science and Technology. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company. https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Digital/Our%20Insights/Big%20data%20The%20next%20frontier%20for%20innovation/MGI_big_data_exec_summary.ashx.

Tansley, S., & Tolle, K. M. (2009). In T. Hey (Ed.), The fourth paradigm: Data-intensive scientific discovery (Vol. 1). Redmond: Microsoft Research.

Turner, V., Gantz, J. F., Reinsel, D., & Minton, S. (2014). The digital universe of opportunities: Rich data and the increasing value of the internet of things. IDC Analyze the Future. https://scholar.google.com/scholar?cluster=2558441206898490167&hl=en&as_sdt=2005&sciodt=0,5.

Big Data Theory

Melanie Swan
New School University, New York, NY, USA

Definition/Introduction

Big Data Theory is a set of generalized principles that explain the foundations, knowledge, and methods used in the practice of data-driven science.

Part I: Theory of Big Data in General

In general a theory is an explanatory mechanism. A theory is a supposition or a system of ideas intended to explain something, especially one based on general principles independent of the thing to be explained. Big Data Theory explains big data (data-driven science), what it is and its foundations, approaches, methods, tools, practices, and results. A theory explains something in a generalized way. A theory attempts to capture the core mechanism of a situation, behavior, or phenomenon.

A theory is a class of knowledge. Different classes of knowledge have different proof standards. The overall landscape of knowledge includes observation, conjecture, hypothesis, prediction, theory, law, and proof. Consider Newton's laws, for example, and the theory of gravity; laws have a more-established proof standard than theories. An explanation of a phenomenon is called a theory, whereas a law is a more formal description of an observed phenomenon.
Many theories do not become laws but serve as a useful tool for practitioners to understand and work with a phenomenon practically.

Here are some examples of theories in which the same structure would apply to theories of data-driven science (a theory provides an explanatory mechanism):

• Darwin's theory of evolution states that all species of organisms arise and develop through the natural selection of small, inherited variations that increase the individual's ability to compete, survive, and reproduce.
• Pythagoras's theorem is a fundamental relation in Euclidean geometry among the three sides of a right triangle, stating that the square of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the other two sides.
• Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event (both this and Pythagoras's theorem are written out as formulas following this list).
• A theory of error estimation is a set of principles on which the practice of the activity of error estimation is based.
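As a brief illustration (standard textbook statements, not drawn from this entry), the two mathematical examples above can be written compactly in LaTeX as

\[ c^{2} = a^{2} + b^{2} \qquad \text{(Pythagoras's theorem)} \]

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad \text{(Bayes' theorem)} \]

where, in Bayes' theorem, A is the event of interest, B is the observed evidence or condition, P(A) is the prior probability, and P(A | B) is the probability of A updated in light of B.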
There is the Theory of Big Data and the Philosophy of Big Data. Theory relates to the internal practices of a field, and philosophy includes both the internal practices of the field and the external impact, the broader impact of the field on the individual and society. The Philosophy of Big Data is the branch of philosophy concerned with the definition, methods, and implications of big data and data-driven science in two domains:

• Internal Industry Practice: Internal to the field as a generalized articulation of the concepts, theory, and systems that comprise the overall use of big data and data science
• External Impact: External to the field, considering the impact of big data more broadly on individuals, society, and the world, for example, addressing data concerns such as security, privacy, and accessibility.

The Philosophy of Big Data and Data-driven Science may have three areas. These include ontology (existence; the definition, dimensions, and limitations of big data), epistemology (knowledge; the knowledge obtained from big data and corresponding proof standards), and axiology (valorization; ethical practice, the parts of big data practices and results that are valorized as being correct, accurate, elegant, right).

The Philosophy of Big Data or Data Science is a branch of philosophy concerned with the foundations, methods, and implications of big data and data science. Big data science is scientific practices that extract knowledge from data using techniques and theories from mathematics, statistics, computing, and information technology. The philosophical concerns of the Philosophy of Big Data Science include the definitions, meaning, knowledge production possibilities, conceptualizations of science and discovery, definitions of knowledge, proof standards, and practices in situations of computationally intensive science involving large-scale, high-dimensional modeling, observation, and experimentation in network environments with very-large data sets.

Part II: Theory in Big Data and Data-Driven Science

The Theories of Big Data and data-driven science correspond to topic areas of big data and data-driven science. Instead of having a "data science theory" overall, which might be too general a topic to have an explanatory theorem, theories are likely to relate to topics within the field of data science. For example, there are theories of machine learning, Bayesian updating, and classification and unstructured learning.

Big data, data science, or data-driven science is an interdisciplinary field using scientific methods, processes, and systems to extract knowledge and insights from various forms of data. The primary method of data science is statistics; however, other mathematical methods are also involved such as linear algebra and calculus. In addition to mathematics, data science may involve information theory, computation, visualization, and other methods of data collection, modeling, analysis, and decision-making. The term "data science"
was used by Peter Naur in 1960 as a synonym for computer science.

Theories of data science may relate to the kinds of activities in the field such as description, prediction, evaluation, data gathering, results communication, and education. Theories may correspond to the kinds of concepts used in the field such as causality, validity, inference, and deduction. Theories may address foundational concerns of the field, for example, general principles (the bigger the data corpus, the better), and commonly used practices (how p-values are used). Theories may be related to methods within an area, for example, a theory of structured or unstructured learning. There may be theories of shallow learning (1–2 layers) relating to methods specific to that topical area such as Bayesian inference, support vector machines, decision trees, K-means clustering, and K-nearest neighbor analysis. Similarly, there may be theories of deep learning (5–20 layers) relating to methods of practice specific to that area such as neural nets, convolutional neural nets in the case of image recognition, and recurrent neural nets in the case of text and speech recognition.

Specific data science topics are the focus in Big Data Theory workshops to address situations where a theory with an explanatory mechanism would be useful in a generalized sense beyond specific use cases. Some of these topics include:

• Practices related to model-fit, "map-territory," explanandum-explanans (fit of explanatory model to that which is to be explained), scale-free models, and model-free learning
• Challenges of working with big data problems such as spatial and temporal analysis, time series, motion data, topological data analysis, hypothesis-testing, computational limitations and cloud computing, distributed and network-based computing models, graph theory, complex applications, results visualization, and big data visual analytics
• Concepts such as randomization, entropy, evaluation, adaptive estimation and prediction, error, multivariate volatility, high-dimensional operations (causal inference, estimation, learning), and hierarchical modeling
• Mathematical methods such as matrix operations, optimization, least-squares, gradient descent, structural optimization, Bayesian updating, regression analysis, linear transformation, scale inference, variable clustering, model-free learning, pattern analysis, and kernel methods
• Industry norms related to data-sharing and collaboration models, peer review, and experimental results replication

Conclusion

Big Data Theory is a set of generalized principles that explain the foundations, knowledge, and methods used in the practice of data-driven science. The Philosophy of Big Data or Data Science is a branch of philosophy concerned with the foundations, methods, and implications of big data and data science. Big data science is scientific practices that extract knowledge from data using techniques and theories from mathematics, statistics, computing, and information technology. The philosophical concerns of the Philosophy of Big Data Science include the definitions, meaning, knowledge production possibilities, conceptualizations of science and discovery, definitions of knowledge, proof standards, and practices in situations of computationally intensive science involving large-scale, high-dimensional modeling, observation, and experimentation in network environments with very-large data sets.

Further Reading

Harris, J. (2013). The need for data philosophers. The Obsessive-Compulsive Data Quality (OCDQ) blog. Available online at http://www.ocdqblog.com/home/the-need-for-data-philosophers.html.

Swan, M. (2015). Philosophy of Big Data: Expanding the human-data relation with Big Data science services. IEEE BigDataService 2015. Available online at http://www.melanieswan.com/documents/Philosophy_of_Big_Data_SWAN.pdf.

Symons, J., & Alvarado, R. (2016). Can we trust Big Data? Applying philosophy of science to software. Big Data & Society, 3(2), 1–17. Available online at http://journals.sagepub.com/doi/abs/10.1177/2053951716664747.
Big Data Workforce

Connie L. McNeely and Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Big data, its engagement, and its applications are growing within and across all sectors of society and, as such, require a big data workforce. Big data encompasses processes and technologies that can be applied across a wide range of domains, from business to science, from government to the arts (Economist 2010). Understanding this workforce situation calls for examination from not only technical but, importantly, social and organizational perspectives. Also, the skills and training necessary for big data related jobs in industry, government, and academia have become a focus of discussions on educational attainment relative to workforce trajectories.

The complex and rapidly changing digital environment is marked by the growth and spread of big data and by related technologies and activities that pose workforce development challenges and opportunities that require specific and evolving skills and training for the changing jobs landscape. Big data and calls for related workers appear across virtually all sectors (Galov 2020). The range of areas in which big data plays an increasingly central role requires agile and flexible workers with the ability to rapidly analyze massive datasets derived from multiple sources in order to provide information to enable actions in real-time. As one example, big data may enable more precise dosing of medications and has been used to develop sensor technologies to determine when a football player needs to be side-lined due to heightened risks of concussion (Frehill 2015, p. 58). Yet another example is high-frequency trading, which draws upon various sources of real-time data to create real-time actionable insights.

Understanding the relationship between the workforce needs of big data employers and the supply of workers with skills adapted to related positions is a key consideration in determining what constitutes the big data workforce. Although posing challenges to strict labor classification, its principal characterizing elements can be distinguished according to 1) skill and task identification, 2) disciplinary and field delineations, and 3) organizational specifications.

Skills-Based Classification

Engaging and applying data in work processes has become a basic requirement in many jobs and has led to new job creation in a number of areas. Big data are calling for more and more workers with "deep skills and talent," pointing to the relationship between higher education and the development of the big data workforce. However, organizational needs in this regard are variable. For example, data analytics are now fundamental to positions such as management analysts and market research analysts, both of which require only short-term certification, as opposed to longer degree terms of study (Carantit 2018). Some of the basic technical skills required to handle big data are accessing databases to query data, ingesting data into databases, gathering data from various sources using web scraping, and parsing and tokenizing texts in big data storage environments (NASEM 2018).
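As a hedged, minimal sketch of two of the basic skills just listed – querying data from a database and parsing/tokenizing text – the following Python example uses only the standard library; the table, record, and column names are invented for illustration:

import re
import sqlite3

# Stand-in for an existing data store (here an in-memory SQLite database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, body TEXT)")
conn.execute("INSERT INTO posts VALUES (1, 'Demand for big data skills keeps growing')")

# Accessing a database to query data.
rows = conn.execute("SELECT body FROM posts").fetchall()

# Parsing and tokenizing the retrieved text.
tokens = [re.findall(r"[a-z]+", body.lower()) for (body,) in rows]
print(tokens)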
Technically speaking, training and education for many big data jobs typically require a basic knowledge of statistics, quantitative methods, or programming, upon which applicable skillsets can be built. More than the computing sciences, such background can be acquired in a number of fields that have long incorporated related preparation. As one example, "social scientists have worked with exceptionally large datasets for quite some time, historically accessing remote space, writing code, analyzing data, and then telling stories about human social behavior from these complex sources." Indeed, many "techniques, tools, and protocols developed by social science research communities to manage and share large datasets – including attention to the ethical issues associated with collecting these data – hold important implications for the big data workforce" (Frehill 2015, pp. 49, 52).

General descriptions have indicated that to exploit big data – characterized particularly by
velocity, variety, and volume – workers are needed with "the skills of a software programmer, statistician, and storyteller/artist to extract nuggets of gold hidden under mountains of data" (Economist 2010). These characteristics are encompassed in the broad occupational category of "data scientist" (Hammerbacher 2009). Considering the combination of disparate skills required to capture value from big data, three key types of workers have been identified under the rubric of data scientist (Manyika et al. 2011):

1. Deep analytical talent – people with technical skills in statistics and machine learning, for example, capable of analyzing large volumes of data to derive business insights.
2. Data-savvy managers and analysts who have the skills to be effective consumers of big data insights, that is, capable of posing the right questions for analysis, interpreting and challenging the results, and making appropriate decisions.
3. Supporting technology personnel who develop, implement, and maintain the hardware and software tools, such as databases and analytic programs, needed to make use of big data.

Note that such skills and workers – deep analytical talent, data-savvy managers and analysts, and supporting technology personnel – principally apply to capacities and capabilities to extract information from massive amounts of data and to enabling related data-driven decision-making in work settings. However, disciplinary silos complicate the picture of the big data workforce and associated occupational needs. These skills are required in various fields. The arena from which data scientists are drawn and in which associated skills are developed is broader than the pool of those trained in computing and information technology disciplines, with many being basic requirements in various liberal arts fields, including social sciences and other science and technology areas, ranging from, for example, architects to sociologists to engineers. Moreover, technology, sources, and applications of big data, big data analytics, big data hardware, and big data storage and processing capacities are constantly evolving. As such, the skillset for the big data worker is something of a moving target, and depending on how skill requirements are specified, the type and size of the pool of big data workforce talent can vary accordingly (Frehill 2015).

Skill Mismatch Dilemmas

Frankly, relative to employer practices and workforce needs, when an industry or field is growing rapidly, "it is not unusual for a shortage of workers to occur until educational institutions and training organizations build the capacity to teach more individuals, and more people are attracted to the needed occupations" (CEA 2014, p. 41), a point that is reflected in the growing number of analytics and data science programs (Topi and Markus 2015). However, rapidly accelerating big data growth and technological change can pose limits to skills forecasting. Accordingly, some recommendations focus on gaining adaptable core, transversal skills and on building technical learning capacities, rather than on planning education and training to meet specified forecasts of requirements, especially since they may change before curricular programs can adjust. "Shorter training courses, which build on solid general technical and core skills, can minimize time lags between the emergence of skill needs and the provision of appropriate training" (ILO 2011, p. 22). Be that as it may, especially given assertions of a skill mismatch and gap for manipulating, analyzing, and understanding big data, the relationship between education and the development of the big data workforce is a critical point of departure for delineating the field in general.

Skill mismatch, as a term, can relate to many forms of labor market friction and imbalance, including educational vertical and horizontal mismatches, skill gaps, skill shortages, and skill obsolescence (McGuinness et al. 2017). In general, skill mismatch refers to labor market imbalances and workforce situations in which skill supply and skill demand diverge. Such is the case with big data analytics and digital skill requirements relative to employer-asserted shortages and needs.
Workforce Participation and Access

The rapid and dramatic changes brought about by big data in today's increasingly digitized society have led to challenges and opportunities impacting the related workforce. Education and training, hiring, and career patterns point to social and labor market conditions that reflect changes in workforce participation and representation. The ubiquitous nature of big data has meant an expanded need for workers with a variety of applicable skills (many of which entail relatively good earning potential). Against this backdrop, important questions have been raised about those who use it and those who work with it. Big data and related technologies and activities can mean increased demands and wages for highly skilled workers and, arguably, will hold more possibilities for employment opportunities and participation.

However, especially in light of socio-cultural and structural dynamics relative to labor market processes that shape and are subsequently shaped by demographic factors such as race, ethnicity, gender, disability, etc., questions of worker identity and skills are brought to the big data agenda, with particular attention to disparities in terms of educational and workforce dynamics. For example, minorities and women constitute only a small percentage of the big data workforce, signaling attention to capacity building and to questions of big data skill attainment and of workforce opportunity, access, participation, and mobility. Also, the use of big data in human resource activities affects recruitment and retention practices, with specific algorithms developed to monitor trends and gauge employee potential, in addition to general performance tracking and surveillance (Carantit 2018). Keeping in mind that a variety of social, political, and economic factors affect educational and skill attainment in the first place, such issues involve attention to the allocation of occupational roles, upgrading of skills, and access to employment opportunities. Big data and related digital skill requirements leave some individuals and groups at higher risk of unemployment and wage depression (e.g., women, minorities, and older, lower-educated, and low-skill workers). All in all, the role of big data in shaping social, political, and economic relations (and power) comes into play as reflected in educational and workforce opportunity and access.

Conclusion

New opportunities and prospects, but also new challenges, controversies, and vulnerabilities, have marked the explosion of big data and, so too, the workforce associated with it. Indeed, there is a need for big data workers "who are sensitive to data downsides as well as upsides" to achieve the benefits of big data while avoiding harmful consequences (Topi and Markus 2015, p. 39). The use of big data, along with machine learning and AI, is transforming economies and, arguably, delivering new waves of productivity (Catlin et al. 2015). Accordingly, educating, training, and facilitating access to workers with big data analytical skills is the sine qua non of the future.

Further Reading

Berman, J. J. (2013). Principles of big data: Preparing, sharing, and analyzing complex information. Burlington: Morgan Kaufman.

Carantit, L. (2018). Six ways big data has changed the workforce. https://ihrim.org/2018/06/six-ways-big-data-has-changed-the-workforce.

Catlin, T., Scanlan, J., & Willmott, P. (2015, June 1). Raising your digital quotient. McKinsey Quarterly. https://www.mckinsey.com/business-functions/strategy-and-corporate-finance/our-insights/raising-your-digital-quotient.

Chmura Economics and Analytics (CEA). (2014). Big data and analytics in Northern Virginia and the Potomac region. Northern Virginia Technology Council. https://gwtoday.gwu.edu/sites/gwtoday.gwu.edu/files/downloads/BigData%20report%202014%20for%20Web.pdf.

Economist. (2010). Data, data everywhere. http://www.economist.com/node/15557443.

Frehill, L. M. (2015). Everything old is new again: The big data workforce. Journal of the Washington Academy of Sciences, 101(3), 49–62.

Galov, N. (2020, November 24). 77+ Big data stats for the big future ahead | updated 2020. https://hostingtribunal.com/blog/big-data-stats.
Hammerbacher, J. (2009). Information platforms and the rise of the data scientist. In T. Segaran & J. Hammerbacher (Eds.), Beautiful data: The stories behind elegant data solutions (pp. 73–84). Sebastapol: O'Reilly.

International Labour Organization (ILO). (2011). A skilled workforce for strong, sustainable, and balanced growth: A G20 training strategy. Geneva: International Labour Organization.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/big-data-the-next-frontier-for-innovation.

Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. New York: Houghton Mifflin Harcourt.

McGuinness, S., Pouliakas, K., & Redmond, P. (2017). How useful is the concept of skills mismatch? Geneva: International Labour Organization.

McNeely, C. L. (2015). Workforce issues and big data analytics. Journal of the Washington Academy of Sciences, 101(3), 1–11.

National Academies of Sciences, Engineering, and Medicine (NASEM). (2018). Data science for undergraduates: Opportunities and options. Washington, DC: National Academies Press.

Topi, H., & Markus, M. L. (2015). Educating data scientists in the broader implications of their work. Journal of the Washington Academy of Sciences, 101(3), 39–48.

Big Geo-data

Song Gao
Department of Geography, University of California, Santa Barbara, CA, USA

Synonyms

Big georeferenced data; Big geospatial data; Geospatial big data; Spatial big data

Definition/Introduction

Big geo-data is an extension of the concept of big data with emphasis on the geospatial component and under the context of geography or the geosciences. It is used to describe the phenomenon that large volumes of georeferenced data (including structured, semi-structured, and unstructured data) about various aspects of the Earth environment and society are captured by millions of environmental and human sensors in a variety of formats such as remote sensing imageries, crowdsourced maps, geotagged videos and photos, transportation smart card transactions, mobile phone data, location-based social media content, and GPS trajectories. Big geo-data is "big" not only because it involves a huge volume of georeferenced data but also because of the high velocity of generation streams, high dimensionality, high variety of data forms, the veracity (uncertainty) of data, and the complex interlinkages with (small) datasets that cover multiple perspectives, topics, and spatiotemporal scales. It poses grand research challenges during the life cycle of large-scale georeferenced data collection, access, storage, management, analysis, modeling, and visualization.

Theoretical Aspects

Geography has a long-standing tradition of duality in research methodologies: the law-seeking approach and the descriptive or explanatory approach. With the increasing popularity of data-driven approaches in geography, a variety of statistical methods and machine learning methods have been applied in geospatial knowledge discovery and modeling for predictions. Miller and Goodchild (2015) discussed the major challenges (i.e., populations not samples, messy not clean data, and correlations not causality) and the role of theory in data-driven geographic knowledge discovery and spatial modeling, addressing the tensions between idiographic versus nomothetic knowledge in geography. Big geo-data is leading to new approaches to research methodologies in capturing complex spatiotemporal dynamics of the Earth and society directly at multiple spatial and temporal scales instead of just snapshots. The data streams play a driving-force role in data-driven methods rather than a test or
calibration role behind the theory or models in conventional geographic analyses.

While data-driven science and predictive analytics evolve in geography and provide new insights, sometimes it is still very challenging for humans to interpret the meanings of machine learning or analytical results or to relate findings to underlying theory. To solve this problem, Janowicz et al. (2015) proposed a semantic cube to illustrate the need for semantic technologies and domain ontologies to address the role of diversity, synthesis, and definiteness in big data research.

Social and Human Aspects

The emergence of big geo-data brings new opportunities for researchers to understand our socioeconomic and human environments. In the journal Dialogues in Human Geography (volume 3, Issue 3, November 2013), several human geographers and GIScience researchers discussed a series of theoretical and practical challenges and risks to geographic scholarship and raised a number of epistemological, methodological, and ethical questions related to the studies of big data in geography. With the advancements in location-awareness technology, information and communication technology, and mobile sensing technology, researchers have employed emerging big geo-data for investigating the geographical perspective of human dynamics research within such contexts in the special issue on Human Dynamics in the Mobile and Big Data Era in the International Journal of Geographical Information Science (Shaw et al. 2016). By synthesizing multiple sources of big data, such research can uncover interesting human behavioral patterns that are difficult or impossible to uncover with traditional datasets. However, challenges still exist in the scarcity of demographics and cross-validation, or in getting at the identity of individual behaviors rather than aggregated patterns. Moreover, location-privacy concerns and discussions arise in both the academic world and society. There exist social tensions between big data accessibility and privacy protection.

Technical Aspects

Cloud computing technologies and their distributed deployment models offer scalable computing paradigms to enable big geo-data processing for scientific research and applications. In the geospatial research world, cloud computing has attracted increasing attention as a way of solving data-intensive, computing-intensive, and access-intensive geospatial problems and challenges, such as supporting climate analytics, land-use and land-cover change analysis, and dust storm forecasting (Yang et al. 2017). Geocomputation facilitates fundamental geographical science studies by synthesizing high-performance computing capabilities with spatial analysis operations, providing a promising solution to the aforementioned geospatial research challenges.

There are a variety of big data analytics platforms and parallelized database systems emerging in the new era. They can be classified into two categories: (1) massively parallel processing data warehousing systems like Teradata, which are designed for holding large-scale structured data and support standard SQL queries, and (2) distributed file storage systems and cluster-computing frameworks like Apache Hadoop and Apache Spark. The advantages of Hadoop-based systems mainly lie in their high flexibility, scalability, low cost, and reliability for managing and efficiently processing a large volume of structured and unstructured datasets, as well as in providing job schedules for balancing data, resources, and task loads. A MapReduce computation paradigm on Hadoop takes advantage of a divide-and-conquer strategy and improves processing efficiency. However, big geo-data has added complexity in its spatial and temporal components and requires new analytical frameworks and functionalities compared with nonspatial big data. Gao et al. (2017) built a scalable Hadoop-based geoprocessing platform (GPHadoop) and ran big geo-data analytical functions to solve crowdsourced gazetteer harvesting problems.
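As a hedged, minimal sketch of the MapReduce (divide-and-conquer) pattern described above applied to big geo-data – not a description of the GPHadoop platform itself – the following Python example uses Apache Spark's RDD interface to count GPS points per 0.1-degree grid cell; the input path and the "id,lon,lat" record layout (with no header row) are assumptions made for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="GridCellCounts")

def to_cell(line):
    # Map step: parse one "id,lon,lat" record and assign it to a 0.1-degree grid cell.
    _, lon, lat = line.split(",")
    return ((round(float(lon), 1), round(float(lat), 1)), 1)

counts = (sc.textFile("hdfs:///data/gps_points.csv")   # assumed input location
            .map(to_cell)                               # map: point -> (cell, 1)
            .reduceByKey(lambda a, b: a + b))           # reduce: sum the counts per cell

counts.saveAsTextFile("hdfs:///output/cell_counts")
sc.stop()

The same grouping-and-aggregation idea is what the Hadoop MapReduce model parallelizes across a cluster; only the grid-cell key function would change for other spatial partitioning schemes.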
Recently, more efforts have been made in connecting the traditional GIS analysis research community to the cloud computing research community for the next frontier of big geo-data analytics.
In one special issue on big data in the journal Annals of GIS (volume 20, Issue 4, 2014), researchers further discussed several key technologies (e.g., cloud computing, high-performance geocomputation cyberinfrastructures) for dealing with the quantitative and qualitative dynamics of big geo-data. Advanced spatiotemporal big data mining and geoprocessing methods should be developed by optimizing the elastic storage, balanced scheduling, and parallel computing resources in high-performance geocomputation cyberinfrastructures.

Conclusion

With the advancements in location-awareness technology and mobile distributed sensor networks, large-scale high-resolution spatiotemporal datasets about the Earth and society have become available for geographic research. The research on big geo-data involves interdisciplinary collaborative efforts. There are at least three research areas that require further work: (1) the systematic integration of various big geo-data sources in geospatial knowledge discovery and spatial modeling, (2) the development of advanced spatial analysis functions and models, and (3) the advancement of quality assurance for big geo-data. Finally, there will still be ongoing comparisons between data-driven and theory-driven research methodologies in geography.

Further Reading

Gao, S., Li, L., Li, W., Janowicz, K., & Zhang, Y. (2017). Constructing gazetteers from volunteered big geo-data based on Hadoop. Computers, Environment and Urban Systems, 61, 172–186.

Janowicz, K., van Harmelen, F., Hendler, J., & Hitzler, P. (2015). Why the data train needs semantic rails. AI Magazine, Association for the Advancement of Artificial Intelligence (AAAI), pp. 5–14.

Miller, H. J., & Goodchild, M. F. (2015). Data-driven geography. GeoJournal, 80(4), 449–461.

Shaw, S. L., Tsou, M. H., & Ye, X. (2016). Editorial: Human dynamics in the mobile and big data era. International Journal of Geographical Information Science, 30(9), 1687–1693.

Yang, C., Huang, Q., Li, Z., Liu, K., & Hu, F. (2017). Big data and cloud computing: Innovation opportunities and challenges. International Journal of Digital Earth, 10(1), 13–53.

Big Georeferenced Data

▶ Big Geo-data

Big Geospatial Data

▶ Big Geo-data

Big Humanities Project

Ramon Reichert
Department for Theatre, Film and Media Studies, Vienna University, Vienna, Austria

"Big Humanities" are a heterogeneous field of research between IT, cultural studies, and the humanities in general. Recently, because of the greater availability of digital data, they have gained even more importance. The term "Big Humanities Data" has prevailed due to the wider usage of the Internet, and it replaced terms like "computational science" and "humanities computing," which have been used since the beginning of the computer era in the 1960s. These terms related mostly to the methodological and practical development of digital tools, infrastructures, and archives.

In addition to the theoretical explorations on science according to Davidson (2008), Svensson (2010), Anne et al. (2010), and Gold (2012), "Big Humanities Data" are divided into three trendsetting theoretical approaches, simultaneously covering the historical development and changes in the field of research according to the epistemological policy:

1. The usage of computers and the digitalization of "primary data" within the humanities and cultural studies are at the center of digital humanities. On the one hand, the digitization projects relate to the digitalized portfolios; on the other hand, they relate to the computerized philology tools
for the application of secondary data or results. Even today these elementary methods of digital humanities are based on the philological tradition, which sees the evidence-driven collection and management of data as the foundation of hermeneutics and interpretation. Beyond the narrow discussions about methods, computer-based measuring within the humanities and cultural studies lays claim to the media-like postulates of objectivity within modern sciences. Contrary to the curriculum of text studies in the 50s and 60s within "Humanities Computing" (McCarty 2005), the research area of related disciplines has been differentiated and broadened to the history of art, culture and sociology, media studies, technology, archaeology, history, and musicology (Gold 2012).

2. According to the second phase, in addition to the quantitative digitalization of texts, research practices are being developed in accordance with the methods and processes of production, analysis, and modeling of digital research environments for work within the humanities with digital data. This approach stands behind the enhanced humanities and tries to find new methodological approaches for the qualitative application of generated, processed, and archived data for the reconceptualization of traditional research subjects (Ramsey and Rockwell 2012, pp. 75–84).

3. The development from humanities 1.0 to humanities 2.0 (Davidson 2008, pp. 707–717) marks the transition from the digital development of methods within the "Enhanced Humanities" to the "Social Humanities," which use the possibilities of web 2.0 to construct the research infrastructure. Social humanities use the interdisciplinarity of scientific knowledge by making use of software for open access, social reading, and open knowledge and by enabling online cooperative and collaborative work on research and development. On the basis of the new digital infrastructure of the social web (hypertext systems, wiki tools, crowdfunding software, etc.), these products transfer the computer-based processes from the early phase of digital humanities into the network culture of the social sciences. Today it is Blogging Humanities (work on digital publications and mediation in peer-to-peer networks) and Multimodal Humanities (presentation and representation of knowledge within multimedia software environments) that stand for the technical modernization of academic knowledge (McPherson 2008). Because of them, Big Social Humanities claims the right to represent a paradigmatically alternative form of knowledge production. In this context one should reflect on the technical fundamentals of the computer-based process of gaining insights within the research of the humanities and cultural studies while critically considering data, knowledge genealogy, and media history in order to properly evaluate its role in the context of digital knowledge production and distribution (Thaller 2012, pp. 7–23).

History of Big Humanities

Big Humanities have been considered only occasionally from the perspective of science and media history in the course of the last few years (Hockey 2004). Historical approaches to the interdependent relation between the humanities and cultural studies and the usage of computer-based processes relativize the aspiration of digital methods to evidence and truth and support the argument that digital humanities developed from a network of historical cultures of knowledge and media technologies with their roots at the end of the nineteenth century.

In the relevant research literature on the historical context and genesis of Big Humanities, a concordance of Thomas Aquinas based on punch cards, created by Roberto Busa, is regarded as one of the first projects of genuinely humanistic usage of the computer (Vanhoutte 2013, p. 126). Roberto Busa (1913–2011), an Italian Jesuit priest, is considered a pioneer of Digital Humanities. This project enabled the achievement of uniformity in the historiography of computational science in its early stage (Schischkoff 1952). Busa, who in 1949 developed the linguistic corpus of the "Index Thomisticus" together with Thomas J. Watson,
the founder of IBM (Busa 1951, 1980, pp. 81–90), is regarded as a founder of the point of intersection between the humanities and IT. The first digital edition on punch cards initiated a series of subsequent philological projects: "In the 60s the first electronic version of 'Modern Language Association International Bibliography' (MLAIB) came up, a specific periodical bibliography of all modern philologies, which could be searched through with a telephone coupler. The retrospective digitalization of cultural heritage started after that, having had ever more works and lexicons such as German vocabulary by Grimm brothers, historical vocabularies as the Krünitz or regional vocabularies" (Lauer 2013, p. 104).

At first, a large number of other disciplines and non-philological areas were formed, such as literature, library, and archive studies. They had a longer epistemological history in the field of philological case studies and practical information studies. Since the introduction of punch card methods, they have been dealing with quantitative and IT procedures for facilities of knowledge management. As one can see, neither the research question nor Busa's methodological procedure was without its predecessors, so they can be seen as part of a larger and longer history of knowledge and media archeology. Sketch models of a mechanical knowledge apparatus capable of combining information were found in the manuscripts of the Swiss archivist Karl Wilhelm Bührer (1861–1917; Bührer 1890, pp. 190–192). This figure of thought – the flexible and modularized information unit – became a conceptual core of mechanical data processing. Archive and library studies took part directly in the historical paradigm change of information processing. It was John Shaw Billings, the doctor and later director of the National Medical Library, who worked further on the development of an apparatus for machine-driven processing of statistical data, a machine developed by Hermann Hollerith in 1886 (Krajewski 2007, p. 43). The technology of punch cards has its roots in the technical pragmatics of library knowledge organization, even if the librarian's working procedures were automated in specific areas only later, within the rationalization movement of the 1920s. Other projects of data processing show that the automated production of an index or a concordance marks the beginning of computer-based humanities and cultural studies for the lexicography and catalogue apparatus of libraries. Until the late 1950s, it was the automated method of processing large text data with the punch card system following the Hollerith procedure that stood at the center of the first applications. The technical procedure of punch cards changed the reading practice of text analysis by transforming a book into a database and by turning the linear-syntagmatic structure of text into a factual and term-based system. As early as 1951, the academic debate among contemporaries started in academic journals. This debate saw the possible applications of the punch card system as largely positive and placed them into the context of economically motivated rationality. Between December 13 and 16, 1951, the German Society for Documentation and the Advisory Board of the German Economic Chamber organized a working conference on the study of the mechanization and automation of the documentation process, which was enthusiastically discussed by the philosopher Georgi Schischkoff. He talked about a "significant simplification and acceleration [. . .] by mechanical remembrance" (Schischkoff 1952, p. 290). The representatives of computer-based humanities saw in "literary computing," starting in the early 1950s, the first autonomous research area, which could provide an "objective analysis of exact knowledge" (Pietsch 1951). In the 1960s, the first studies in the field of computer linguistics concerning the automated indexing of large text corpora appeared, publishing computer-based analyses of word indexing, word frequency, and word groups.

The automated evaluation procedure of texts for editorial work within literary studies was described already in the early stages of "humanities computing" (mostly within its areas of "computer philology" and "computer linguistics") on the ground of two discourse figures relevant even today. The first figure of discourse describes the achievements of the new tool usage with the instrumental availability of data ("helping tools"); the other figure of discourse focuses on the
economical disclosure of data and emphasizes the efficiency and effectivity of machine methods of documenting. The media figure of automation was finally combined with the expectation that interpretative and subjective influences in the processing and analysis of information can be systematically removed. In the 1970s and 1980s, computer linguistics was established as an institutionally positioned area of research with its university facilities, its specialist journals (Journal of Literary and Linguistic Computing, Computing in the Humanities), discussion panels (HUMANIST), and conference activities. Computer-based work in historical-sociological research had its first large rise, but in work reports it remained regarded less as an autonomous method and mostly as a tool for critical text examination and as a simplification measure for quantifying the prospective subjects (Jarausch 1976, p. 13).

A sustainable media turn, both in the field of production and in the field of reception aesthetics, appeared with the application of standardized markup texts such as the Standard Generalized Markup Language, established in 1986, and software-driven programs for text processing. They made available an additional series of digital modules, analytical tools, and text functions and transformed the text into a model of a database. Texts could be loaded as structured information and were available as (relational) databases. In the 1980s and 1990s, the technical development and text reception were dominated by the paradigm of the database.

With the domination of the World Wide Web, research and teaching practices changed drastically: specialized communication experienced a lively dynamics through the digital network culture of publicly accessible online resources, e-mail distribution, chats, and forums, and it became largely responsive through the media-driven feedback mentality of rankings and voting. With its aspiration to go beyond the hierarchical structures of the academic system through the reengineering of scientific knowledge, Digital Humanities 2.0 made the ideals of equality, freedom, and omniscience attainable again.

As opposed to its beginnings in the 1950s, the Digital Humanities today also have an aspiration to reorganize the knowledge of society. Therefore, they regard themselves "both as a scientific as well as a socioutopistic project" (Hagner and Hirschi 2013, p. 7). With the usage of social media in the humanities and cultural studies, the technological possibilities and the scientific practices of Digital Humanities not only developed further but also brought to life new phantasmagoria of scientific distribution, quality evaluation, and transparency in the World Wide Web (Haber 2013, pp. 175–190). In this context, Bernhard Rieder and Theo Röhle identified five central problematic perspectives for the current "Digital Humanities" in their 2012 text "Five Challenges." These are the following: the temptation of objectivity, the power of visual evidence, black-boxing (fuzziness, problems of random sampling, etc.), institutional turbulences (rivaling service facilities and teaching subjects), and the claim of universality. Computer-based research is usually dominated by the evaluation of data, so that some researchers see the advanced analysis within the research process even as a substitute for substantial theory construction. That means that the research interests are almost completely data driven. This evidence-based concentration on the possibilities of data can tempt the researcher to neglect the heuristic aspects of his own subject.

Since the social net is not only a neutral reading channel of research, writing, and publication resources without any power but also a governmental structure of power over scientific knowledge, the epistemological probing of the social, political, and economic contexts of Digital Humanities also includes a data-critical and historical questioning of its computer-based reformation agenda (Schreibman 2012, pp. 46–58).

What did the usage of computer technology change for cultural studies and the humanities on the basis of theoretical essentials? Computers reorganized and accelerated the quantification and calculation process of scientific knowledge; they entrenched the metrical paradigm in cultural
studies and the humanities and promoted the hermeneutical-interpretative approaches with a mathematical formalization of the respective subject field. In addition to these epistemological shifts, the research practices within the Big Humanities have shifted, since research and development are seen as project-related, collaborative, and network-formed, and on the network horizon they become the subject of research of network analysis. Network analysis itself has the goal of revealing the correlations and relation patterns of the digital communication of scientific networks and of making the Big Humanities themselves the subject of reflection within a social-constructivist actor-network theory.

Further Reading

Anne, B., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2010). Digital_humanities. Cambridge, MA: MIT Press, 201(2). Online: http://mitpress.mit.edu/sites/default/files/titles/content/9780262018470_Open_Access_Edition.pdf.

Bührer, K. W. (1890). Ueber Zettelnotizbücher und Zettelkatalog. Fernschau, 4, 190–192.

Busa, R. (1951). S. Thomae Aquinatis Hymnorum Ritualium Varia Specimina Concordantiarum. Primo saggio di indici di parole automaticamente composti e stampati da macchine IBM a schede perforate. Milano: Bocca.

Busa, R. (1980). The annals of humanities computing: The index Thomisticus. Computers and the Humanities, 14(2), 83–90.

Davidson, C. N. (2008). Humanities 2.0: Promise, perils, predictions. Publications of the Modern Language Association (PMLA), 123(3), 707–717.

Gold, M. K. (Ed.). (2012). Debates in the digital humanities. Minneapolis: University of Minnesota Press.

Haber, P. (2013). 'Google Syndrom'. Phantasmagorien des historischen Allwissens im World Wide Web. Zürcher Jahrbuch für Wissensgeschichte, 9, 175–190.

Hagner, M., & Hirschi, C. (2013). Editorial Digital Humanities. Zürcher Jahrbuch für Wissensgeschichte, 9, 7–11.

Hockey, S. (2004). History of humanities computing. In S. Schreibman, R. Siemens, & J. Unsworth (Eds.), A companion to digital humanities. Oxford: Blackwell.

Jarausch, K. H. (1976). Möglichkeiten und Probleme der Quantifizierung in der Geschichtswissenschaft. In: ders., Quantifizierung in der Geschichtswissenschaft. Probleme und Möglichkeiten (pp. 11–30). Düsseldorf: Droste.

Krajewski, M. (2007). In Formation. Aufstieg und Fall der Tabelle als Paradigma der Datenverarbeitung. In D. Gugerli, M. Hagner, M. Hampe, B. Orland, P. Sarasin, & J. Tanner (Eds.), Nach Feierabend. Zürcher Jahrbuch für Wissenschaftsgeschichte (Vol. 3, pp. 37–55). Zürich/Berlin: Diaphanes.

Lauer, G. (2013). Die digitale Vermessung der Kultur. Geisteswissenschaften als Digital Humanities. In H. Geiselberger & T. Moorstedt (Eds.), Big Data. Das neue Versprechen der Allwissenheit (pp. 99–116). Frankfurt/M: Suhrkamp.

McCarty, W. (2005). Humanities computing. London: Palgrave.

McPherson, T. (2008). Dynamic vernaculars: Emergent digital forms in contemporary scholarship. Lecture presented to HUMLab Seminar, Umeå University, 4 Mar. http://stream.humlab.umu.se/index.php?streamName=dynamicVernaculars.

Pietsch, E. (1951). Neue Methoden zur Erfassung des exakten Wissens in Naturwissenschaft und Technik. Nachrichten für Dokumentation, 2(2), 38–44.

Ramsey, S., & Rockwell, G. (2012). Developing things: Notes toward an epistemology of building in the digital humanities. In M. K. Gold (Ed.), Debates in the digital humanities (pp. 75–84). Minneapolis: University of Minnesota Press.

Rieder, B., & Röhle, T. (2012). Digital methods: Five challenges. In D. M. Berry (Ed.), Understanding digital humanities (pp. 67–84). London: Palgrave.

Schischkoff, G. (1952). Über die Möglichkeit der Dokumentation auf dem Gebiete der Philosophie. Zeitschrift für Philosophische Forschung, 6(2), 282–292.

Schreibman, S. (2012). Digital humanities: Centres and peripheries. In M. Thaller (Ed.), Controversies around the digital humanities (Historical social research, Vol. 37(3), pp. 46–58). Köln: Zentrum für Historische Sozialforschung.

Svensson, P. (2010). The landscape of digital humanities. Digital Humanities Quarterly (DHQ), 4(1). Online: http://www.digitalhumanities.org/dhq/vol/4/1/000080/000080.html.

Thaller, M. (Ed.). (2012). Controversies around the digital humanities: An agenda. Historical Social Research, 37(3), 7–23.

Vanhoutte, E. (2013). The gates of hell: History and definition of digital | humanities. In M. Terras, J. Nyhan, & E. Vanhoutte (Eds.), Defining digital humanities (pp. 120–156). Farnham: Ashgate.

Big O Notation

▶ Algorithmic Complexity
Big Variety Data

Christopher Nyamful¹ and Rajeev Agrawal²
¹Department of Computer Systems Technology, North Carolina A&T State University, Greensboro, NC, USA
²Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, USA

Introduction

Massive data generated from daily activities and accumulated over the years may contain valuable information and insights that can be leveraged to assist decision-making for greater competitive advantage. Data from weather, traffic control, satellite imagery, geography, and social media through to daily sales figures contain inherent patterns that, if discovered, can be used to forecast likely future occurrences. The huge amount of user-generated data is unstructured; its content has no conceptual data-type definition. It is typically stored as files, such as Word documents, PowerPoint presentations, photos, videos, web pages, blog posts, tweets, and Facebook posts. Unstructured data is the most common kind, and people use it every day. For example, the use of video surveillance has increased, as has satellite-based remote sensing and aerial photography of both optical and multispectral imagery. The smartphone is also a good example of how a mobile device produces an additional variety of data sources that are captured for reuse. Most of the files that organizations want to keep are image based. Industry and government regulations require that a significant portion, if not all, unstructured data be stored for long-term retention and access. This data must be appropriately classified and managed for future analysis, search, and discovery. The notion of data variety entails the idea of using multiple sources of data to help understand and solve a problem.

Large data sets in a big data environment are made up of varying data types. Data can be classified as structured, semi-structured, and unstructured based on how it is stored and analyzed. Semi-structured data is a kind of structured data that is not raw or strictly typed; it has no underlying data model and hence cannot be stored directly in a relational database. The web provides numerous examples of semi-structured data, such as hypertext markup language (HTML) and extensible markup language (XML). Structured data is organized in a strict format of rows and columns. It makes use of a data model that determines the schema for the data. Data types under structured data can be organized by index and queried in various ways to yield required results. Relational database management systems are used to analyze and manage structured data.

About 90 percent of big data is highly unstructured (Cheng et al. 2012). The primary concern of businesses and organizations is how to manage unstructured data, because it forms the bulk of the data received and processed. Unstructured data requires a significant amount of storage space. Storing large collections of digital files and streaming videos has become common in today's era of big data. For instance, YouTube receives one billion unique users every day, and 100 hours of video is uploaded each minute ("YouTube Data Statistics" 2015). Clearly, there is a massive increase in video file storage requirements, in terms of both capacity and IOPS.

As data sets increase in both structured and unstructured forms, analysis and management become more diverse. The real-time component of the big data environment poses a great challenge. For example, web advertising based on users' purchase and search histories requires real-time analytics. In order to effectively manage these huge quantities of unstructured data, a high-IOPS storage environment is required. A wide range of technologies and techniques have been developed to analyze, manipulate, aggregate, and visualize big data.
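As a concrete illustration of the structured/semi-structured distinction drawn above, the following minimal Python sketch represents the same (hypothetical) customer record three ways: as a row in a relational table with a fixed schema, and as JSON and XML documents, which carry their structure with them rather than conforming to a predefined schema. All field names are invented for illustration.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

# Structured: a fixed schema enforced by a relational engine (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Ada Lovelace", "London"))
row = conn.execute("SELECT name, city FROM customers WHERE id = 1").fetchone()

# Semi-structured: JSON carries field names with the data; fields may vary per record.
doc = json.loads('{"id": 1, "name": "Ada Lovelace", "city": "London", "tags": ["vip"]}')

# Semi-structured: XML, parsed without any predefined relational schema.
xml_doc = ET.fromstring("<customer id='1'><name>Ada Lovelace</name><city>London</city></customer>")

print(row, doc["tags"], xml_doc.find("name").text)
```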
Current Systems

A scale-out network-attached storage (NAS) system is a network storage system used to simplify storage management through a centralized point of control. It pools multiple storage nodes in a cluster, and this cluster performs NAS processes as a single entity. Unlike traditional NAS, scale-out NAS allows nodes or heads to be added as processing power demands. It supports file-I/O-intensive applications and scales to petabytes of data. Scale-out NAS provides a platform for flexible management of the diverse data types in big data. It combines moderate per-node performance and availability to produce a complete system with better aggregate computing power and availability. The use of Gigabit Ethernet allows scale-out NAS to be deployed over a wide geographical area and still maintain high throughput.

Most storage vendors are showing more interest in scale-out NAS to deal with the challenges of big data with media-rich files – unstructured data. Storage vendors differ in the way they architect scale-out network-attached storage. EMC Corporation offers Isilon OneFS scale-out NAS to its clients. Isilon provides the capabilities to meet big data challenges. It comes with a specialized operating system known as OneFS. Isilon OneFS consolidates the file system, volume manager, and Redundant Array of Independent Disks (RAID) into a unified software layer and a single file system that is distributed across all nodes in the cluster. EMC scale-out NAS simplifies storage infrastructure and reduces cost by consolidating unstructured data sets and large-scale files, eliminating storage silos. It provides massive scalability for unstructured big data storage needs, ranging from 16 TB to roughly 20 PB of capacity per cluster. Isilon's native Hadoop Distributed File System support can be leveraged to run Hadoop analytics on both structured and unstructured data. EMC Isilon performance can reach up to 2.6 million file operations per second with over 200 gigabytes per second of aggregate throughput to support the demands posed by big data workloads. Other storage vendors, such as IBM, NetApp Inc., Hitachi Data Systems, Hewlett-Packard (HP), and Dell Inc., among others, offer scale-out NAS to address unstructured big data needs.

Object-based storage (OSD) offers an innovative platform for storing and managing unstructured data. It stores data in the form of objects based on their content and other attributes. An object has a variable length and can be used to store any type of data. It provides an integrated solution that supports file-, block-, and object-level access to storage devices. Object-based storage devices organize and store unstructured data such as movies, photos, and documents as objects. OSD uses a flat address space to store data and a unique identifier to access that data. The use of the unique identifier eliminates the need to know the specific location of a data object. Each object is associated with an object ID, generated by a special hash function, which guarantees that each object is uniquely identified. The object is also composed of data, attributes, and rich metadata. The metadata keeps track of the object's content and makes access, discovery, distribution, and retention much more feasible. Object storage brings structure to unstructured data, making it easier to store, protect, secure, manage, organize, search, sync, and share file data. The features provided by OSD allow organizations to leverage a single storage investment for a variety of workloads.

Hitachi Data Systems offers an object-based storage solution that treats data files, metadata, and file attributes as a single object that is tracked and retained across a variety of storage tiers. It provides multiple fields for metadata so that different users and applications can use their own metadata and tags without conflict. The EMC Atmos storage system is designed to support object-based storage for unstructured data such as videos and pictures. Atmos integrates massive scalability with high performance to address challenges associated with vast amounts of unstructured data. It enhances operational efficiency by distributing content automatically based on business policy. Atmos also provides data services such as replication, deduplication, and compression. Atmos' multitenancy feature allows multiple applications to be processed from the same infrastructure.
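To make the object-ID mechanism described above concrete, here is a minimal sketch of a content-addressed object store: each stored object receives an identifier derived from a cryptographic hash of its content, so it can be retrieved from a flat address space without knowing any physical location. The in-memory dictionary stands in for a real object store, and the class and method names are illustrative, not any vendor's API.

```python
import hashlib

class ToyObjectStore:
    """Flat, content-addressed object store: object ID = SHA-256 of the content."""

    def __init__(self):
        self._objects = {}   # object_id -> (data, metadata)

    def put(self, data: bytes, **metadata) -> str:
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        data, _ = self._objects[object_id]
        return data

store = ToyObjectStore()
oid = store.put(b"holiday_video_frame_0001", content_type="video/raw", owner="alice")
assert store.get(oid) == b"holiday_video_frame_0001"
print(oid)  # the flat-address identifier; identical content always maps to the same ID
```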
Distributed Systems

The Apache Hadoop project has developed open-source software for reliable, scalable, and efficient distributed computing. The Hadoop Distributed File System (HDFS) is a distributed file system that stores data on low-cost machines, providing high aggregate bandwidth across the cluster (Shvachko et al. 2010). HDFS stores huge files across multiple nodes and ensures reliability by replicating data across multiple hosts in a cluster. HDFS is composed of two agents, namely the NameNode and the DataNode. The NameNode is responsible for managing metadata, and DataNodes manage data input/output. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS and uses the standard IP network for communication. HDFS has a master/slave architecture. Input files distributed across the cluster are automatically split into even-sized chunks that are managed by different nodes in the cluster; this ensures scalability and availability. For example, Yahoo uses Hadoop to manage 25 PB of enterprise data on 25,000 servers.

Hadoop uses a distributed processing architecture known as MapReduce for mapping tasks to servers for processing. Amazon Elastic MapReduce (Amazon EMR) uses Hadoop to analyze and process huge amounts of data, both structured and unstructured. It achieves this by distributing workloads across virtual servers running in the Amazon cloud. Amazon EMR simplifies the use of Hadoop and big data-intensive applications. Amazon Elastic Compute Cloud (EC2) is being used by various organizations to process vast amounts of unstructured data; The New York Times, for example, rented 100 virtual machines to convert 11 million scanned articles to PDFs.

The relational database is not able to support the vast variety of unstructured data being received from all sources of digital activity. NoSQL databases are now being deployed to address some of the big unstructured data challenges. NoSQL represents a class of data management technologies designed to address high-volume, high-variety, and high-velocity data. Comparatively, they are more scalable and support superior performance. MongoDB is a cross-platform NoSQL database designed to overcome the limitations of the traditional database. MongoDB is optimized for efficiency, and its features include:

1. Scale-out architecture, instead of expensive monolithic architecture
2. Support for large volumes of structured, semi-structured, and unstructured data
3. Agile sprints, quick iteration, and frequent code pushes
4. Flexibility in the use of object-oriented programming
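As an illustration of the document model behind the features listed above, the short sketch below stores and queries records of varying shapes in MongoDB via the third-party pymongo driver; the connection string, database, and collection names are placeholders, and a local MongoDB instance is assumed.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
posts = client["demo_db"]["posts"]                  # database/collection created lazily

# Documents in one collection need not share a schema (structured and semi-structured mix).
posts.insert_one({"user": "alice", "text": "hello", "tags": ["intro"]})
posts.insert_one({"user": "bob", "video_url": "http://example.org/v.mp4", "views": 42})

# Query by any field; MongoDB scales such collections out across shards.
for doc in posts.find({"user": "alice"}):
    print(doc)
```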
An alternative avenue for addressing big data-related issues is the cloud. The emergence of cloud computing has eased the IT burdens of many organizations, since storage and analysis can be outsourced to the cloud. In the era of big data, the cloud offers a potential self-service consumption model for data analytics. Cloud computing allows organizations and individuals to obtain IT resources as a service, offered according to several fundamental models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Amazon Elastic Compute Cloud (EC2) is an example of IaaS that provides on-demand services in the cloud for clients. IaaS allows client applications to be deployed onto the provider's infrastructure, which shares and manages a pool of configurable and scalable resources such as network and storage servers. Google App Engine is an example of PaaS that allows clients to develop and run their software solutions on the cloud platform. PaaS guarantees computing platforms able to meet the different workloads of clients, with cloud providers managing the required computing infrastructure and software support services. The SaaS model allows clients to use a provided application to meet their business needs; the cloud uses its multitenant feature to accommodate a large number of users. Flickr, Amazon, and Google Docs are well-known examples of SaaS. Both cloud computing and big data analytics are extensions of virtualization technologies. Virtualization abstracts physical resources such as storage, compute, and network and makes them appear as logical resources. Cloud infrastructure is usually built on virtualized data centers that provide resource pooling, and organizations are deploying virtualization techniques across data centers to optimize their use.

Conclusion

Big variety data is on the rise and touches all areas of life, especially with the high degree of Internet usage. Methods for simplifying big variety data in terms of storage, integration, analysis, and visualization are complex. Current storage systems, to a considerable extent, are addressing some big data-related issues. A high-performance system that ensures maximum data transfer rates and analysis has become a research focus. Current systems can be improved in the near future to handle efficiently the vast amount of unstructured data in the big data environment.

Further Reading

Cheng, Y., Qin, C., & Rusu, F. (2012). GLADE: Big data analytics made easy. Paper presented at the Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale.
Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop distributed file system. Paper presented at the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE.
YouTube Data Statistics. (2015). Retrieved 15 Jan 2015, from http://www.youtube.com/yt/press/statistics.html.

Bioinformatics

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Background

Bioinformatics constitute a specialty in the informatics domain that applies information technologies (IT) to the study of human biology. Having its basis in the study of genotypes and phenotypes, the bioinformatics domain is extensive, encompassing genomics, metagenomics, proteomics, pharmacogenomics, and metabolomics. With advances in IT-supported data storage and management, very large data sets (Big Data) have become available from diverse sources at greatly accelerated rates, providing unprecedented opportunities to engage in increasingly sophisticated biological data analytics.

Although the formal study of biology has its origins in the seventeenth century CE, the application of computer science to biological research is relatively recent. In 1953, Watson and Crick published the structure of DNA. In 1975, Sanger and the team of Maxam and Gilbert independently developed DNA sequencing methods. In 1980, the US Supreme Court ruled that patents on genetically modified bacteria are allowed. This ruling made pharmaceutical applications a primary motive for human genomic research: the profits from drugs based on genomic research could be enormous. Genomic research became a race between academe and the commercial sector. Academic consortia, comprising universities and laboratories, rushed to place their gene sequences in the public domain to prevent commercial companies from applying for patents on those sequences. The 1980s also saw the sequencing of human mitochondrial DNA (1981; 16,589 base pairs) and the Epstein-Barr virus genome (1984; 172,281 base pairs). In 1990, the International Human Genome Project was launched with a projected 15-year duration. On 26 June 2000, the draft of the human genome was published, reflecting the successful application of informatics to genomic research.

Biology, Information Technology, and Big Data Sets

For much of its history, biology was considered to be a science based on induction. The introduction of computer science to the study of biology and the evolution of data management from mainframe- to cloud-based computing provide the impetus for the progression from strictly observation-based biology to bioinformatics.
Current relational database management systems, originally developed to support business transaction processes, were not designed to support data sets of one or more petabytes (10^15 bytes) or exabytes (10^18 bytes). Big biological data sets are likely to contain structured as well as unstructured data, including structured quantitative data, images, and unstructured text data. Advances in Big Data storage management and distribution support large genomic data mappings; for example, nucleotide databases may contain more than 6 × 10^11 base pairs, and a single sequenced human genome may be approximately 140 gigabytes in size. With the adoption and promulgation of Big Data bioinformatics, human genomic data can be embedded in electronic health records (EHR), facilitating, for example, individualized patient care.

Bioinformatics Transparency and International Information Exchange

The number of genomic projects is growing. Started in 2007 as a pilot project, the Encyclopaedia of DNA Elements (ENCODE) is an attempt to understand the functions of the human genome. The results produced by projects such as ENCODE are expected to generate more genomics-focused research projects. The National Human Genome Research Institute (NHGRI), a unit of the National Institutes of Health (NIH), provides a list of educational resources on its website. The volume of bioinformatics-generated research has led to the development of large, online research databases, of which PubMed, maintained by the US National Library of Medicine, is just one example.

Genomic and biological research is an international enterprise, and there is a high level of transnational collaboration to assure data sharing. For example, the database of nucleic acid sequences is maintained by a consortium comprising institutions from the US, UK, and Japan. The UK-based European Bioinformatics Institute (EBI) maintains the European Nucleotide Archive (ENA). The US National Center for Biotechnology Information maintains the International Nucleotide Sequence Database Collaboration. The Japanese National Institute of Genetics supports the DNA Data Bank of Japan and the Center for Information Biology. To ensure synchronized coverage, these organizations share information on a regular basis. For organizations that manage bioinformatics databases, providing browsers to access and explore their contents has become a de facto standard; for example, the US National Center for Biotechnology Information offers ENTREZ to execute parallel searches in multiple databases.

Translational Bioinformatics

Bioinformatics conceptualize biology at the molecular level and organize data to store and manage them efficiently for research and presentation. Translational bioinformatics focus on the transformation of biomedical data into information to support, for example, the development of new diagnostic techniques, clinical interventions, or new commercial products and services. In the pharmacological sector, translational bioinformatics provide the genetic data necessary to repurpose existing drugs. Genetic data may also prove useful in differentiating drug efficacy based on gender or age. In effect, translational bioinformatics enable bi-directional data sharing between the research laboratory and the medical clinic. Translational bioinformatics enable hospitals to develop clinical diagnostic support capabilities that incorporate correlations between individual genetic variations and clinical risk factors, disease presentations, or responses to treatment.
Bioinformatics and Health Informatics

Translational bioinformatics-generated data can be incorporated in EHRs that are structured to contain both structured and unstructured data. For example, with an EHR that complies with the Health Level 7 Consolidated Clinical Document Architecture (HL7 C-CDA), it is possible to share genetic data (personal genomic data), X-ray images, and diagnostic data, as well as a clinician's free-form notes. Because EHR data are usually entered to comply with predetermined standards, there is substantially less likelihood of error in interpretation or legibility. Combined, Health Information Exchange (HIE) and EHR provide the foundation for biomedical data sharing. HIE operationalizes the Meaningful Use (MU) provisions of the Health Information Technology for Economic and Clinical Health (HITECH) Act, enacted as Titles IV and XIII of the American Recovery and Reinvestment Act (ARRA) of 2009, by enabling information sharing among clinicians, patients, payers, care givers, and federal and state agencies.

Biomedical Data Analytics

Biomedical research depends on quantitative data as well as unstructured text data. With the availability of Big Data sets, selecting the appropriate analytical model depends on the kind of analysis we plan to undertake, the kinds of data we have, and the size of the data set. Data mining models, such as artificial neural networks, statistics-based models, and Bayesian models that focus on probabilities and likelihoods, support predictive analytics. Classification models are useful in determining the category to which an individual object belongs based on identifiable properties. Clustering models are useful for identifying population subsets based on shared parameters. Furthermore, advances in computer science have also led to the development and analysis of algorithms, not only in terms of complexity but also in terms of performance. Induction-based algorithms are useful in unsupervised learning settings, for example, text mining or topic analysis for the purpose of exploration, where we are not trying to prove or disprove a hypothesis but are simply exploring a body of documents for lexical clusters and patterns. In contrast, deduction-based algorithms can be useful in supervised learning settings, where there are research questions to be answered and hypotheses to be tested. In the health domain, randomized clinical trials (RCT) are archetypal examples of hypothesis-based model development.

Challenges and Future Trends

To remain epistemically viable, bioinformatics, like health informatics, require the capabilities to ingest, store, and manage Big Data sets. However, these capabilities are still in their infancy. Similarly, data analytics tools may not be sufficiently efficient to support Big Data exploration in a timely manner. Because personal genomic information can now be used in EHRs, translational bioinformatics, like health informatics, must incorporate stringent anonymization controls. Bioinformatics are beginning to develop computational models of disease processes. These can prove beneficial not only to the development or modification of clinical diagnostic protocols and interventions but also for epidemiology and public health. In academe, bioinformatics are increasingly accepted as multidiscipline programs, drawing their expertise from biology, computer science, statistics, and medicine (translational bioinformatics).

Based on the evolutionary history of IT development, dissemination, and acceptance, it is likely that IT-based technical issues in Big Data-focused bioinformatics will be addressed and that the requisite IT capabilities will become available over time. However, there are a number of ethical and moral issues that attend the increasing acceptance of translational bioinformatics-provided information in the health domain. For example, is it ethical or moral for a health insurance provider to deny coverage based on personal genomics? Also, is it appropriate, from a public policy perspective, to use bioinformatics-generated data to institute eugenic practices, even if only de facto, to support the social good? As a society it behooves us to address questions such as these.

Further Reading

Butte, A. (2008). Translational bioinformatics: Coming of age. Journal of the American Medical Informatics Association, 15(6), 709–714.
Cohen, I. G., Amarasingham, R., Shah, A., Xie, B., & Lo, B. (2014). The legal and ethical concerns that arise from using complex predictive analytics in health care. Health Affairs, 33(7), 1139–1147.
Kumari, D., & Kumari, R. (2014). Impact of biological big data in bioinformatics. International Journal of Computer Applications, 10(11), 22–24.
Maojo, V., & Kulikowski, C. A. (2003). Bioinformatics and medical informatics: Collaborations on the road to genomic medicine? Journal of the American Medical Informatics Association, 10(6), 515–522.
Ohno-Machado, L. (2012). Big science, big data, and the big role for biomedical informatics. Journal of the American Medical Informatics Association, 19(e1), e1.
Shah, N. H., & Tenenbaum, J. D. (2012). The coming of age of data-driven medicine: Translational bioinformatics' next frontier. Journal of the American Medical Informatics Association, 19, e1–e2.

Biomedical Data

Qinghua Yang¹ and Fan Yang²
¹Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA
²Department of Communication Studies, University of Alabama at Birmingham, Birmingham, AL, USA

Thanks to the development of modern data collection and analytic techniques, biomedical research generates increasingly large amounts of data in various formats and at all levels, which is referred to as big data. Big data is a collection of data sets that are large in volume and complex in structure. To illustrate, the data managed by America's leading healthcare provider Kaiser is 4,000 times more than the amount of information stored in the Library of Congress. As to data structure, the range of nutritional data types and sources makes such data very difficult to normalize. This volume and complexity make big data difficult to process with traditional data analytic techniques. Therefore, to further knowledge and uncover hidden value, there is an increasing need to better understand and mine biomedical big data with innovative techniques and new approaches, which requires interdisciplinary collaborations involving data providers and users (e.g., biomedical researchers, clinicians, and patients), data scientists, funders, publishers, and librarians.

The collection and analysis of big data in the biomedical area have demonstrated their ability to enable efficiencies and accountability in health care, which provides strong evidence for the benefits of big data usage. Electronic health records (EHRs), an example of biomedical big data, can provide timely data to assist the monitoring of infectious diseases, disease outbreaks, and chronic illnesses, which could be particularly valuable during public health emergencies. By collecting and extracting data from EHRs, public health organizations and authorities could receive an extraordinary amount of information. By analyzing the massive data from EHRs, public health researchers could conduct comprehensive observational studies with countless patients who are treated in real clinical settings over years. Disease progression, clinical outcomes, treatment effectiveness, and public health intervention efficacies can also be studied by analyzing EHR data, which may influence public health decision-making (Hoffman and Podgurski 2013).

As a crucial juncture for addressing the opportunities and challenges presented by biomedical big data, the National Institutes of Health (NIH) has initiated a Big Data to Knowledge (BD2K) initiative to maximize the use of biomedical big data. BD2K, a response to the Data and Informatics Working Groups (DIWG), focuses on enhancing:

(a) the ability to locate, access, share, and apply biomedical big data,
(b) the dissemination of data analysis methods and software,
(c) the training in biomedical big data and data science,
(d) the establishment of centers of excellence in data science (Margolis et al. 2014)
First, the BD2K initiative fosters the emergence of data science as a discipline relevant to biomedicine by developing solutions to specific high-need challenges confronting the research community. For instance, the Centers of Excellence in Data Science initiated the first BD2K Funding Opportunity to test and validate new ideas in data science. Second, BD2K aims to enhance the training of methodologists and practitioners in data science by improving their skills in demand under the data science "umbrella," such as computer science, mathematics, statistics, biomedical informatics, biology, and medicine. Third, given the complex questions posed by the generation of large amounts of data requiring interdisciplinary teams, the BD2K initiative facilitates the development of investigators in all parts of the research enterprise for interdisciplinary collaboration to design studies and perform subsequent data analyses (Margolis et al. 2014).

Besides these promotive initiatives proposed by national research institutes such as NIH, great endeavors to improve biomedical big data processing and analysis have also been made by biomedical researchers and for-profit organizations. National cyberinfrastructure has been suggested by biomedical researchers as one of the systems that could efficiently handle many of the big data challenges facing the medical informatics community. In the United States, the national cyberinfrastructure (CI) refers to an existing system of research supercomputer centers and the high-speed networks that connect them (LeDuc et al. 2014). CI has been widely used by physical and earth scientists, and more recently biologists, yet it has been little used by biomedical researchers. It has been argued that more comprehensive adoption of CI could address many challenges in the biomedical area. One example of an innovative biomedical big data technique provided by for-profit organizations is GENALICE MAP, a next-generation sequencing (NGS) DNA processing software launched by the Dutch software company GENALICE. Processing biomedical big data one hundred times faster than conventional data analytic tools, MAP demonstrated robustness and spectacular performance and raised NGS data processing and analysis to a new level.

Challenges

Despite the opportunities brought by biomedical big data, certain noteworthy challenges also exist. First, to use big biomedical data effectively, it is imperative to identify the potential sources of healthcare information and to determine the value of linking them together (Weber et al. 2014). The "bigness" of biomedical data sets is multidimensional: some big data, such as EHRs, provide depth by including multiple types of data (e.g., images, notes, etc.) about individual patient encounters; others, such as claims data, provide longitudinality, which refers to patients' medical information over a period of time. Moreover, social media, credit cards, census records, and various other types of data can help assemble a holistic view of a patient and shed light on social and environmental factors that may be influencing health.

The second technical obstacle in linking big biomedical data results from the lack of a national unique patient identifier (UPI) in the United States (Weber et al. 2014). To address the absence of a UPI and enable precise linkage, hospitals and clinics have developed sophisticated probabilistic linkage algorithms based on other information, such as demographics. By requiring enough variables to match, hospitals and clinics are able to reduce the risk of linkage errors to an acceptable level even when two different patients share some of the same characteristics (e.g., name, age, gender, zip code). In addition, the same techniques used to match patients across different EHRs can be extended to data sources outside of health care, which is an advantage of probabilistic linkage.
Third, besides the technical challenges, privacy and security concerns turn out to be a social challenge in linking biomedical big data (Weber et al. 2014). As more data are linked, they become increasingly more difficult to de-identify. For instance, although clinical data from EHRs offer considerable opportunities for advancing clinical and biomedical research, unlike most other forms of biomedical research data, clinical data are typically obtained outside of traditional research settings and must be converted for research use. This process raises important issues of consent and protection of patient privacy (Institute of Medicine 2009). Possible constructive responses could be to regulate legality and ethics, to ensure that benefits outweigh risks, to include patients in the decision-making process, and to give patients control over their data. Additionally, changes in policies and practices are needed to govern research access to clinical data sources and facilitate their use for evidence-based learning in healthcare. Improved approaches to patient consent and risk-based assessments of clinical data usage, enhanced quality and quantity of clinical data available for research, and new methodologies for analyzing clinical data are all needed for ethical and informed use of biomedical big data.

Cross-References

▶ Biometrics
▶ Data Sharing
▶ Health Informatics

Further Reading

Hoffman, S., & Podgurski, A. (2013). Big bad data: Law, public health, and biomedical databases. The Journal of Law, Medicine & Ethics, 41(8), 56–60.
Institute of Medicine. (2009). Beyond the HIPAA privacy rule: Enhancing privacy, improving health through research. Washington, DC: The National Academies Press.
LeDuc, R., Vaughn, M., Fonner, J. M., Sullivan, M., Williams, J. G., Blood, P. D., et al. (2014). Leveraging the national cyberinfrastructure for biomedical research. Journal of the American Medical Informatics Association, 21(2), 195–199.
Margolis, R., Derr, L., Dunn, M., Huerta, M., Larkin, J., Sheehan, J., et al. (2014). The National Institutes of Health's Big Data to Knowledge (BD2K) initiative: Capitalizing on biomedical big data. Journal of the American Medical Informatics Association, 21(6), 957–958.
Weber, G., Mandl, K. D., & Kohane, I. S. (2014). Finding the missing link for big biomedical data. Journal of the American Medical Association, 311(24), 2479–2480.

Biometrics

Jörgen Skågeby
Department of Media Studies, Stockholm University, Stockholm, Sweden

Biometrics refers to measurable and distinct (preferably unique) biological, physiological, or behavioral characteristics. Stored in both commercial and governmental biometric databases, these characteristics are subsequently used to identify and/or label individuals. This entry summarizes common forms of biometrics, their different applications, and the societal debate surrounding biometrics, including its connection to big data, as well as its potential benefits and drawbacks.

Typical applications for biometrics include identification in its own right, but also verification of access privileges. As mentioned, biometrical technologies rely on physiological or, in some cases, behavioral characteristics. Some common physiological biometric identifiers are eye retina and iris scans, fingerprints, palm prints, face recognition, and DNA. Behavioral biometric identifiers can include typing rhythm (keystroke dynamics), signature recognition, voice recognition, or gait. While common sites of use include border controls, education, crime prevention, and health care, biometrics are also increasingly deployed in consumer devices and services, such as smartphones (e.g., fingerprint verification in the iPhone 5 and onwards, as well as facial recognition in Samsung's Galaxy phones) and various web services and applications making use of keystroke dynamics.
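As a toy illustration of the keystroke-dynamics idea mentioned above, the sketch below compares the inter-key timing profile of a typing sample against a stored template; the timings and acceptance threshold are invented for illustration only and are far simpler than the statistical models used in real behavioral-biometric systems.

```python
def intervals(timestamps):
    """Convert key-press timestamps (seconds) into inter-key intervals."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def timing_distance(sample, template):
    """Mean absolute difference between two interval profiles of equal length."""
    return sum(abs(s - t) for s, t in zip(sample, template)) / len(template)

enrolled = intervals([0.00, 0.18, 0.35, 0.61, 0.80])   # stored template for a user
attempt  = intervals([0.00, 0.20, 0.33, 0.64, 0.79])   # new login attempt

THRESHOLD = 0.05  # arbitrary acceptance threshold for this sketch
print("accept" if timing_distance(attempt, enrolled) <= THRESHOLD else "reject")
```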
Although some biometric technologies have been deployed for over a century, recent technological development has spurred a growing interest in a wider variety of biometrical technologies and their respective potential benefits and drawbacks. As such, current research on biometrics is not limited to technical or economic details. Rather, political, ontological, social, and ethical aspects and implications are now often considered alongside technical advances. As a consequence, questions around the convergence of the biological domain and the informational domain are given new relevance. Because bodily individualities are increasingly being turned into code/information, digital characteristics (e.g., being easily transferrable, combinable, searchable, and copyable) are progressively applicable to aspects of the corporeal, causing new dilemmas to emerge. Perhaps not surprisingly, then, both the suggested benefits and drawbacks of biometric technologies and the connected big data repositories tie into larger discussions on privacy in the digital age.

One of the most commonly proposed advantages of biometrics is that it provides a (more or less) unique identity verifier. This, in turn, makes it harder to forge or steal identities. It is also argued that such heightened fidelity in biometrical measures improves the general level of societal security. Proponents will also argue that problems caused by lost passports, identity cards, or driver's licenses, as well as forgotten passwords, are virtually eliminated. More ambivalent advantages of biometrics include the possibility to automatically and unequivocally tie individuals to actions, geographical positions, and moments in time. While such logs can certainly be useful in some cases, they also provide opportunities for more pervasive surveillance.

Consequently, these more ambivalent consequences, combined with a more careful consideration of the risks connected to biometric data retention, have generated many concerns about the widened deployment of biometric technologies. For example, biometric databases and archives are often regarded as presenting too many risks. Even though the highest security must be maintained to preserve the integrity of these digital archives, they can still be hacked or used (by both governmental and commercial actors as well as individuals who have access to the information in their daily work) in ways not anticipated, sanctioned, or legitimized. There is also the question of continuously matching physical bodies to the information stored in databases. Injuries, medical conditions, signs of aging, and voluntary bodily modifications may cause individuals to end up without a matching database entry, effectively causing citizens and users to be locked out from their own identity. Disbelievers will also argue that there is still a prevailing risk of stolen or forged identities. While the difficulty of counterfeiting biometric data can be seen as directly related to the sophistication of the technology used to authenticate biometric data, a general argument made is that of a "technological balance of terror." That is, as biometric technologies develop to become more sophisticated and sensitive, so will the technologies capable of forging identities. More so, however, the risk of stolen proofs of identity still presents a very real risk, particularly when conducted in a networked digital environment. Digital files can be easily copied, and unlike a physical lock, which can be exchanged if the key is stolen, once such a digital file (of, e.g., a fingerprint or iris scan) is stolen it may present serious and far-reaching repercussions. Thus, detractors argue that even with the proper policies in place the related incidents can be problematic. Furthermore, it will become even harder to foresee the potential problems and misuses of biometrics, and as a consequence the public trust in biometric archives and technologies will be hard to maintain.

On a larger scale, the potential cooperation between commercial actors and governments has become a cause for concern for critics. They argue that practices previously reserved to states (and even then questionable) have now been adopted by commercial actors, turning biometrical technologies into potential tools for general surveillance. Critics argue that instead of limiting the collection of biometric data to those who are convicted, registration of individuals has now spread to the general population. As such, opponents argue that under widespread biometric application all citizens are effectively treated as potential threats or suspected criminals.
In summary, biometrics refers to ways of using the human body as a verification of identity. Technologies making use of biometric identifiers are becoming increasingly common and will likely be visible in a growing number of applications in the everyday lives of citizens. Due to the ubiquitous collection and storage of biometric data through an increasingly sophisticated array of pervasive technologies and big data repositories, many critical questions are raised around the ontological, social, ethical, and political consequences of their deployment. Larger discussions of surveillance, integrity, and privacy put biometric technologies and databases in a position where public trust will be a crucial factor in their overall success or failure.

Further Reading

Ajana, B. (2013). Governing through biometrics: The biopolitics of identity. London: Palgrave Macmillan.
Gates, K. (2011). Our biometric future. New York: New York University Press.
Magnet, S. (2011). When biometrics fail. Durham: Duke University Press.
Payton, T., & Claypoole, T. (2014). Privacy in the age of big data. Lanham: Rowman & Littlefield.

Biosurveillance

Ramón Reichert
Department for Theatre, Film and Media Studies, Vienna University, Vienna, Austria

Internet biosurveillance, or Digital Disease Detection, represents a new paradigm of Public Health Governance. While traditional approaches to health prognosis operated with data collected in clinical diagnosis, Internet biosurveillance studies use the methods and infrastructures of Health Informatics. That means, more precisely, that they use unstructured data from different web-based sources and derive from the collected and processed data information about changes in health-related behavior. The two main tasks of Internet biosurveillance are (1) the early detection of epidemic diseases and of biochemical, radiological, and nuclear threats (Brownstein et al. 2009) and (2) the implementation of strategies and measures of sustainable governance in the target areas of health promotion and health education (Walters et al. 2010). Biosurveillance established itself as an independent discipline in the mid-1990s, as military and civilian agencies began to take an interest in automatic monitoring systems. In this context, the biosurveillance program of the Applied Physics Laboratory of Johns Hopkins University has played a decisive and pioneering role (Burkom et al. 2008).

Internet biosurveillance uses the accessibility of data and analytic tools provided by the digital infrastructures of social media, participatory sources, and non-text-based sources. The structural change generated by digital technologies, as the main driver of Big Data, offers a multitude of applications for sensor technology and biometrics as key technologies. Biometric analysis technologies and methods are finding their way into all areas of life, changing people's daily lives. In particular, the areas of sensor technology and biometric recognition processes, and the general tendency toward convergence of information and communication technologies, are stimulating Big Data research. The conquest of mass markets by sensor and biometric recognition processes can partly be explained by the fact that mobile, web-based terminals are equipped with a large variety of different sensors. More and more users come into contact in this way with sensor technology or with the measurement of individual body characteristics. Due to more stable and faster mobile networks, many people are permanently connected to the Internet through their mobile devices, giving connectivity an extra boost.

With the development of apps, application software for mobile devices such as smartphones (iPhone, Android, BlackBerry, Windows Phone) and tablet computers, the application culture of biosurveillance changed significantly, since these apps are strongly influenced by the dynamics of bottom-up participation. Andreas Albrechtslund speaks in this context of "Participatory Surveillance" (2008) on social networking sites, in which biosurveillance increasingly becomes a site of open production of meaning and permanent negotiation, by providing comment functions, hypertext systems, and ranking and voting procedures through collective framing processes. This is the case of the sports app Runtastic, which monitors different sports activities using GPS, mobile devices, and sensor technology and makes information such as distance, time, speed, and burned calories accessible and visible to friends and acquaintances in real time.
The Eatery app is used for weight control and requires of its users the ability to pursue self-optimization through self-tracking. Considering that health apps also aim to influence the attitudes of their users, they can additionally be understood as persuasive media of Health Governance. With their feedback technologies, the apps not only facilitate issues related to healthy lifestyles but also multiply the social control over compliance with health regulations in peer-to-peer networks. Taking into consideration the networked connection of information technology equipment, as well as the commercial availability of biometric tools (e.g., "Nike Fuel," "Fitbit," "iWatch") and infrastructure (apps), biosurveillance is frequently associated, in public debates, with dystopian ideas of a biometrically organized society of control.

Organizations and networks for health promotion, health information, and health education have observed with great interest that, every day, millions of users worldwide search for information about health using the Google search engine. During the influenza season, searches for flu increase considerably, and the frequency of certain search terms can provide good indicators of flu activity. Back in 2006, Eysenbach evaluated, in a study on "Infodemiology" or "Infoveillance," the Google AdSense click quotas, with which he analyzed indicators of the spread of influenza and observed a positive correlation between increasing search engine entries and increased influenza activity. Further studies of the volume of search patterns have found that there is a significant correlation between the number of flu-related search queries and the number of people showing actual flu symptoms (Freyer-Dugas et al. 2012). This epidemiological correlation structure was subsequently extended to provide early warning of epidemics in cities, regions, and countries through Google Flu Trends, established in 2008 in cooperation with the US authority for the surveillance of epidemics (CDC). On the Google Flu Trends website, users can visualize the development of influenza activity both geographically and chronologically. Some studies criticize that the predictions of the Google project are far above the actual flu cases.

Ginsberg et al. (2009) point out that, in the case of an epidemic, it is not clear whether the search engine behavior of the public remains constant and thus whether the significance of Google Flu Trends is secured or not. They refer to the medialized presence of the epidemic as a distorting cause of an "Epidemic of Fear" (Eysenbach 2006, p. 244), which can lead to miscalculations concerning the impending influenza activity. Subsequently, the prognostic reliability of the correlation between increasing search engine entries and increased influenza activity has been questioned. In recent publications on digital biosurveillance, communication processes in online networks are more intensely analyzed. Especially in the field of Twitter research (Paul and Dredze 2011), researchers have developed specific techniques and knowledge models for the study of future disease development and, backed up by context-oriented sentiment analysis and social network analysis, hold out the prospect of a socially and culturally differentiated biosurveillance.

Further Reading

Albrechtslund, A. (2008). Online social networking as participatory surveillance. First Monday, 13(3). Online: http://firstmonday.org/ojs/index.php/fm/article/viewArticle/2142/1949.
Brownstein, J. S., et al. (2009). Digital disease detection – Harnessing the web for public health surveillance. The New England Journal of Medicine, 360(21), 2153–2157.
Burkom, H. S., et al. (2008). Decisions in biosurveillance: Tradeoffs driving policy and research. Johns Hopkins Technical Digest, 27(4), 299–311.
Eysenbach, G. (2006). Infodemiology: Tracking flu-related searches on the Web for syndromic surveillance. In AMIA Annual Symposium Proceedings, 8/2, 244–248.
Freyer-Dugas, A., et al. (2012). Google Flu Trends: Correlation with emergency department influenza rates and crowding metrics. Clinical Infectious Diseases, 54(15), 463–469.
Ginsberg, J., et al. (2009). Detecting influenza epidemics using search engine query data. Nature, 457, 1012–1014.
Paul, M. J., & Dredze, M. (2011). You are what you Tweet: Analyzing Twitter for public health. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media. Online: www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/.../3264.
Walters, R. A., et al. (2010). Data sources for biosurveillance. In J. G. Voeller (Ed.), Wiley handbook of science and technology for homeland security (Vol. 4, pp. 2431–2447). Hoboken: Wiley.

Blockchain

Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Overview of Blockchain Technology

Blockchain technology is one of the hallmarks of the Fourth Industrial Revolution (4IR). A blockchain is essentially a decentralized, distributed, and immutable ledger. The first significant application of blockchain technology was to Bitcoin, the cryptocurrency introduced in 2008–2009. Since then, the uses of blockchain technology have expanded enormously, going well beyond Bitcoin and cryptocurrency. Indeed, blockchain is now a pervasive technology in academia and government, and across and within industries and sectors. In this regard, some emerging applications include banking and financial payments and transfers, supply chain management, insurance, voting, energy management, retail trade, crowdfunding, public records, car leasing, cybersecurity, transportation, charity, scholarly communications, government, health care, online music, real estate, criminal justice, and human resources.

Unlike centralized ledgers, blockchain records transactions between parties directly without third-party involvement. Each transaction is vetted and authenticated by powerful computer algorithms running across all the blocks and all the users. Such algorithms typically require consensus across the nodes, where different algorithms are used for this purpose. The decentralized distributed ledgers are updated asynchronously by adding a new block that is cryptographically "mined" based on a preset number of compiled transactions. Typically, mining a block involves finding a solution to a cryptographic puzzle with varying levels of difficulty set by an algorithm. Each new block contains cryptographically hashed information on the most recent transactions and all previous transactions. Blocks are integrated in a chain-like manner, hence the name blockchain. All data on a blockchain is encrypted and hashed. Once validated and added to the blockchain, a transaction can never be tampered with or removed. In some cases, blockchain transactions are automated via "smart contracts," which are agreements between two or more parties in the form of computer code; a transaction is only triggered if the agreed-upon conditions are met.

Blockchains are a way to establish trust in organizational (e.g., corporate) and personal transactions, which is fundamental to removing uncertainty. Several types of uncertainty face all transactions: (1) knowing the identity of the partners in a transaction, i.e., knowing whom one is dealing with; (2) transparency, including the prehistory of conditions leading up to the transaction; and (3) recovering the loss associated with a transaction that fails. Establishing trust is how each of these uncertainties gets managed, which has traditionally been done through intermediaries that exact a cost for ensuring trust among the transacting parties. Since blockchain removes and replaces third parties, and no trust is required between those involved in transactions, it is characterized as a "trustless" system. (On the other hand, one can argue that the network of machines in a blockchain constitutes the intermediary, but one in a different guise than a conventional institutional third party.)

So, how specifically does one trust the identity of a transacting party? Each party is assigned a pair of cryptographically generated and connected electronic keys: a public key that is known to the world, and a private key that is known only to the party that owns it. These pairs of keys can be stored in the form of hard copies (paper copies) or relatively secure digital wallets owned by individuals.
Blockchain 123

cryptography-based algorithms. One of the most widely used algorithms is known as the “Elliptic Curve Digital Signature Algorithm” (ECDSA).

Any transaction encrypted with a private key can only be decrypted by its associated public key and vice versa. For example, if Sender A wants to transact with Receiver B, then A encrypts the transaction with B’s public key and then signs it with the private key owned and known only to A. The receiving party B can verify the identity of A by using A’s public key and subsequently decrypt the transaction with his/her private key. There are many variations on how to use public-key cryptography for transactions, including transactions that involve a multisignature protocol for transactions among multiple parties. In brief, public-key cryptography technology has enabled trustworthy peer-to-peer transactions among total strangers.
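The signing and verification step can be made concrete with a minimal sketch. The example below uses the third-party Python ecdsa package and the secp256k1 curve; the keys, message, and curve choice are illustrative assumptions rather than a description of any particular blockchain.

```python
# Minimal ECDSA signing/verification sketch using the third-party "ecdsa"
# package (pip install ecdsa); keys and message are purely illustrative.
import ecdsa

# Sender A generates a key pair; the private key stays with A,
# while the public key can be shared with anyone.
private_key = ecdsa.SigningKey.generate(curve=ecdsa.SECP256k1)
public_key = private_key.get_verifying_key()

# A signs a transaction payload with the private key known only to A.
transaction = b"A transfers 5 units to B"
signature = private_key.sign(transaction)

# Anyone holding A's public key can confirm that A produced the signature
# and that the transaction was not altered in transit.
try:
    public_key.verify(signature, transaction)
    print("Signature valid: transaction accepted")
except ecdsa.BadSignatureError:
    print("Signature invalid: transaction rejected")
```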
There are other ways in which blockchain promotes trust. First, it is generally a transparent system in which all blocks of information (history and changes) reside in a copy of the distributed ledger maintained by nodes (users). Second, with a blockchain system, there is no need for transacting parties to know each other, as information about them and their transactions is known not only by each party but also by all users (parties). Therefore, the information is verifiable by all blockchain participants. Finally, blockchain provides mechanisms for recourse in the event of failed transactions. Specifically, recourse can be built into the block data or executed through smart contracts.

Blockchain and Big Data

Blockchain and big data is a “marriage made in heaven.” On the one hand, big data analytics are needed to vet the massive amounts of information added to a blockchain and arrive at a consensus regarding a transaction’s validity. On the other hand, blockchain provides a means for addressing the limitations of big data and the challenges associated with its use and application.

Indeed, big data is far from perfect. It tends to be fraught with noise, biases, redundancies, and other imperfections. Another set of issues relates to data provenance and data lineage, i.e., where the data comes from and how it has been used along the way. Once data is acquired, transmitted, and stored, such information should ideally be recorded for future use. However, with big data, this poses some difficulties. Big data tends to change hands frequently, where at each stop, it gets repurposed, repackaged, and reprocessed. Thus, the history of the data can get lost as it travels from one person, place, or organization to another. Moreover, in the case of proprietary and personally sensitive information, the data attributes tend to be hidden, complicating matters further. Finally, big data raises various kinds of privacy concerns. Many big data sources – including from transactions – contain sensitive, detailed, and revealing information about individuals, e.g., about their finances, personal behavior, or health and medical conditions. Such information may be intentionally or inadvertently exposed or used in ways that violate someone’s privacy. Data produced, consumed, stored, and transmitted in cyberspace are particularly vulnerable in these regards.

Blockchain technology can help to improve the quality, trustworthiness, traceability, transparency, privacy, and security of big data in several ways. As blockchains are immutable ledgers, unauthorized modification of any data added to them is virtually impossible. In other words, once the data is added to the blockchain, there is only a minimal chance that it can be deleted or modified. Data integrity and data security are further enhanced, given that transactions must be vetted and authenticated before being added to the blockchain. Additionally, since no form of personal identification is needed to initiate and use a blockchain, there is no central server with this information that could be compromised. Lastly, blockchain automatically creates a detailed and permanent record of all transactions (data) added to it, thus facilitating activities tied to data provenance and documentation.
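The tamper-evidence that underlies these provenance benefits can be illustrated with a simplified sketch of hash-chained blocks. This is a sketch of the general principle only, with invented data, and not the design of any particular blockchain platform.

```python
# Simplified sketch of why chaining block hashes makes tampering detectable.
# Data and structure are invented for illustration only.
import hashlib
import json

def block_hash(body: dict) -> str:
    """Hash a block's contents, including the previous block's hash."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"index": len(chain), "transactions": transactions, "prev_hash": prev_hash}
    chain.append({**body, "hash": block_hash(body)})

def chain_is_valid(chain: list) -> bool:
    """Recompute every hash; editing an earlier block breaks all links after it."""
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        if block["hash"] != block_hash(body):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

ledger = []
append_block(ledger, ["A pays B 5"])
append_block(ledger, ["B pays C 2"])
print(chain_is_valid(ledger))                  # True
ledger[0]["transactions"] = ["A pays B 500"]   # tamper with recorded data
print(chain_is_valid(ledger))                  # False
```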
The Dark Side of Blockchain

While blockchain does improve data security, it is not a completely secure system. More specifically, decentralized distributed tamper-proof blockchain technologies, although secure against most common cyber threats, can be vulnerable. For example, the blockchain mining operation can be susceptible to a 51% attack, where a party or a group of parties possess enough computing power to control the mining of new blocks. There is also a possibility of orphan blocks with legitimate transactions being created, which never get integrated into the parent blockchain. Moreover, while blockchain technology is generally hack-proof due to its decentralized and distributed nature of operations, the rise of central exchanges for facilitating transactions across blockchains is open to cyberattacks. So are the digital wallets used by individuals to store their public/private keys.
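A brief proof-of-work sketch helps explain what “mining” involves and why a majority of computing power matters: whoever can try candidate blocks fastest dominates the creation of new blocks. The difficulty level and data below are invented, and the sketch is a generic illustration rather than the scheme used by any specific cryptocurrency.

```python
# Generic proof-of-work sketch: "mining" a block means searching for a nonce
# whose hash meets a difficulty target. Parameters here are illustrative.
import hashlib

def mine(block_data: str, difficulty: int = 4) -> tuple[int, str]:
    """Try nonces until the block hash starts with `difficulty` zero characters."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = mine("block 42: A pays B 5")
print(f"nonce={nonce} hash={digest}")
# A party controlling most of the network's hashing power can, on average,
# find such nonces before everyone else and thus decide which blocks are added.
```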
Blockchain also raises privacy concerns. Indeed, blockchain technology is not entirely anonymous; instead, it is “pseudonymous,” where data points do not refer to any particular individuals, but multiple transactions by a single person can be combined and correlated to reveal their identity. This problem is compounded in public blockchains, which are open to many individuals and groups. The immutable nature of blockchain also makes it easy for anyone to “connect the dots” about individuals on the blockchain.

Finally, blockchain is touted as a democratizing technology where any person, organization, or place in the world can use and access it. However, in reality, specific skills and expertise, and enabling technologies (e.g., the Internet, broadband), are required to use and exploit blockchain. In this regard, the digital divides are a barrier to blockchain adoption for certain individuals and entities.

Considering all these issues and challenges, we need to develop technical strategies and public policy to ensure that all can benefit from blockchain while no one is negatively impacted or harmed by the technology.

Cross-References

▶ Data Integrity
▶ Data Provenance
▶ Fourth Industrial Revolution
▶ Privacy

Further Reading

Deepa, N., Pham, Q. V., Nguyen, D. C., Bhattacharya, S., Gadekallu, T. R., Maddikunta, P. K. R., et al. (2020). A survey on blockchain for big data: Approaches, opportunities, and future directions. arXiv preprint arXiv:2009.00858.
Karafiloski, E., & Mishev, A. (2017, July). Blockchain solutions for big data challenges: A literature review. In IEEE EUROCON 2017 – 17th international conference on smart technologies (pp. 763–768). IEEE.
Nofer, M., Gomber, P., Hinz, O., & Schiereck, D. (2017). Blockchain. Business & Information Systems Engineering, 59(3), 183–187.
Schintler, L. A., & Fischer, M. M. (2018). Big data and regional science: Opportunities, challenges, and directions for future research.
Swan, M. (2015). Blockchain: Blueprint for a new economy. Sebastopol: O’Reilly Media.
Zheng, Z., Xie, S., Dai, H., Chen, X., & Wang, H. (2017, June). An overview of blockchain technology: Architecture, consensus, and future trends. In 2017 IEEE international congress on big data (BigData congress) (pp. 557–564). IEEE.

Blogs

Ralf Spiller
Macromedia University, Munich, Germany

A blog or Web log is a publicly accessible diary or journal on a website in which at least one person, the blogger, posts issues, records results, or writes down thoughts. Often the core of a blog is a list of entries in chronological order. The blogger or publisher is responsible for the content; contributions are often written from a first-person perspective. A blog is for authors and readers an easy tool to cover all kinds of topics. Often, comments or discussions about
an article are permitted. Thus, blogs serve as a medium to gather, share, and discuss information, ideas, and experiences.

History

The first weblogs appeared in the mid-1990s. They were called online diaries and were websites on which Internet users periodically made entries about their own lives. From 1996, services such as Xanga were set up that enabled Internet users to easily set up their own weblogs.

In 1997 one of the first blogs that still exists was started. It was called Scripting News, set up by Dave Winer. After a rather slow start, such sites grew rapidly from the late 1990s. Xanga, for example, grew from 100 blogs in 1997 to 20 million in 2005. In recent years, blogging has also been used for business in so-called corporate blogs. Also many news organizations like newspapers and TV stations operate blogs to expand their audience and get feedback from readers and listeners.

According to Nielsen Social, a consumer research company, in 2006 there were about 36 million public blogs in existence, in 2009 about 127 million, and in 2011 approximately 173 million. In September 2014, there were around 202 million Tumblr and more than 60 million WordPress blogs in existence worldwide. The total number of blogs can only be estimated but should be far more than 300 million in 2014.

Technical Aspects

Weblogs can be divided into two categories. First, those operated by a commercial provider, allowing usage after a simple registration. Second, those that are operated by the respective owners on their individual server or webspace, mostly under their own domain. Well-known providers of blog communities are Google’s Blogger.com, WordPress, and Tumblr. Several social networks also offer blog functionalities to their members.

For the operation of an individual weblog on one’s own web space, one needs at least special weblog software and a rudimentary knowledge of HTML and the server technology used. Since blogs can be customized easily to specific needs, they are also often used as pure content management systems (CMS). Under certain circumstances, such websites are not perceived as blogs. From a purely technical point of view, all blogs are content management systems.

One of the important features of Web log software is online maintenance, which is performed through a browser-based interface (often called a dashboard) that allows users to create and update the contents of their blogs from any online browser. This software also supports the use of external client software to update content using an application programming interface. Web log software commonly includes plugins and other features that allow automatic content generation via RSS or other types of online feeds.

The entries, also called postings, blog posts, or posts, are the main components of a weblog. They are usually listed in reverse chronological order; the most recent posts can be found at the top of the weblog. Older posts are usually listed in archives. The consecutive posts on a specific topic within a blog are called a thread. Each entry, and in some weblog systems each comment, has a unique and unchanging, permanent Web address (URL). Thus, other users or bloggers can directly link to the post. Web feeds, for example, rely on these permanent links (permalinks).

Most weblogs provide the possibility to leave a comment. Such a post is then displayed on the same page as the entry itself or as a popup. A web feed contains the contents of a weblog in a unified manner, and it can be subscribed to via a feed reader. With this tool, the reader can take a look at several blogs at the same time and monitor new posts. There are several technical formats for feeds. The most common are RSS and Atom.
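Reading such a feed programmatically is straightforward; the short sketch below uses the third-party Python feedparser package with a placeholder feed URL, so the address and field values are illustrative assumptions.

```python
# Minimal sketch of subscribing to a web feed with the third-party
# "feedparser" package; the URL below is a placeholder.
import feedparser

feed = feedparser.parse("https://example.com/blog/feed.xml")
print(feed.feed.get("title", "untitled feed"))

# Each entry exposes the permalink and publication metadata described above.
for entry in feed.entries[:5]:
    print(entry.get("published", "no date"), "-", entry.get("title"), "-", entry.get("link"))
```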
A blogroll is a list of other blogs or websites that a blogger endorses, commonly references, or is affiliated with. A blogroll is generally found on one of the blog’s side columns.

A weblog client (blog client) is an external client software that is used to update blog content
through an interface other than the typical Web-based version provided by blog software. There are desktop or mobile interfaces for blog posting. They provide additional features and capabilities, such as offline blog posting, better formatting, and cross-posting of content to multiple blogs.

Typology

Blogs can be segmented according to various criteria. From a content perspective, certain kinds of blogs are particularly popular: travel blogs, fashion blogs, technology blogs, corporate blogs, election blogs, warblogs, watch blogs, and blog novels. Other kinds of blogs are pure link blogs (annotated link collections), moblogs (mobile blogs), music blogs (MP3 blogs), audio blogs (podcasts), and vlogs (video blogs). Microblogging is another type of blogging, featuring very short posts like quotes, pictures, and links that might be of interest.

Blog lists like Blogrank.io provide useful information about the most popular blogs on a diverse range of topics. Several blog search engines are used to search blog contents, such as Bloglines.

The collective community of all blogs is known as the blogosphere. It has become an invaluable source for citizen journalism – that is, real-time reporting about events and conditions in local areas that large news agencies or newspapers do not or cannot cover. Discussions in the blogosphere are frequently used by the media as a reflection of public opinion on various topics.

Characteristics

Empirical studies show that blogs emphasize personalization, audience participation in content creation, and story formats that are fragmented and interdependent with other websites. Sharon Meraz (2011) shows that blogs exert social influence and are able to weaken the influence of elite, traditional media as a singular power in issue interpretation within networked political environments. Thus, blogs provide an alternative space to challenge the dominant public discourse. They are able to question mainstream representations and offer oppositional counter-discourses. Sometimes they are viewed as a form of civic, participatory journalism. Following this idea, they represent an extension of media freedom.

In certain topics like information technology, blogs challenge classic news websites and compete directly with them for readers. This sometimes leads to innovations in journalistic practice, for example, when online news sites adopt blog features.

Farrell and Drezner (2008) argue that under specific circumstances, blogs can socially construct an agenda or interpretive frame that acts as a focal point for mainstream media, shaping and constraining the larger political debate. This happens, for example, when key Web logs focus on a new or neglected issue.

Policy

Many human rights activists, especially in countries like Iran, Russia, or China, use blogs to publish reports on human rights violations, censorship, and current political and social issues without censorship by governments. Bloggers, for example, reported on the violent protests during the presidential elections in Iran in 2009 or the political upheavals in Egypt in 2012 and 2013. These blogs were an important source of news for Western media.

Blogs are harder to control than broadcast or even print media. As a result, totalitarian and authoritarian regimes often seek to suppress blogs and/or to punish those who maintain them.

Many politicians use blogs and similar tools like Twitter and Facebook, particularly during election campaigns. US president Barack Obama was one of the first who used them effectively during his two presidential campaigns in 2008 and 2012. President Trump’s tweets are carefully observed all over the world. Also nongovernmental organizations (NGOs) use blogs for their campaigns.
Blogs and Big Data

Big data usually refers to data sets defined by their volume, velocity, and variety. Volume refers to the magnitude of data, velocity to the rate at which data are generated and the speed at which they should be analyzed and acted upon, and variety to the structural heterogeneity in a dataset.

Blogs are usually composed of unstructured data. This is the largest component of big data and is available as audio, images, video, and unstructured text. It is estimated that analytics-ready structured data forms only a subset of big data of about 5% (Gandomi and Haider 2015). Analyzing blog content implies dealing with imprecise data. This is a characteristic of big data, which is addressed by using tools and analytics developed for the management and mining of uncertain data.

Performing business intelligence (BI) on blogs is quite challenging because of the vast amount of information and the lack of a commonly adopted methodology for effectively collecting and analyzing such information. But the software is continually advancing and delivering useful results, for example, about product information in blogs.

According to Gandomi and Haider (2015), analytics of blogs can be classified into two groups: content-based analytics and structure-based analytics. The first focuses on data posted by users, such as customer feedback and product reviews. Such content is often noisy and dynamic. The second puts emphasis on the relationships among the participating entities. It is also called social network analytics. The structure of a social network is modeled through a framework of nodes and edges, representing participants and relationships. They can be visualized via social graphs and activity graphs.

Analytic tools can extract implicit communities within a network. One application is helping companies develop more effective product recommendation systems. Social influence analysis evaluates the participants’ influence, quantifies the strengths of connections, and uncovers the patterns of influence diffusion in a network. This information can be used for viral marketing to enhance brand awareness and adoption.
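The node-and-edge modeling described above can be sketched with the third-party Python networkx package; the participants, interactions, and metrics below are invented purely for illustration.

```python
# Illustrative structure-based (social network) analytics with the third-party
# "networkx" package; the interaction data below is made up.
import networkx as nx

# Nodes are participants; edges are interactions such as comments or links.
G = nx.Graph()
G.add_edges_from([
    ("anna", "ben"), ("anna", "carla"), ("ben", "carla"),
    ("carla", "dan"), ("dan", "emma"), ("emma", "felix"),
])

# Degree centrality is one simple proxy for a participant's influence.
influence = nx.degree_centrality(G)
print(sorted(influence.items(), key=lambda item: item[1], reverse=True))

# Community detection surfaces the implicit groups mentioned above.
communities = nx.algorithms.community.greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```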
Various techniques can be used to extract information from the blogosphere: Blogs can be analyzed via sentiment analysis. Sentiment can vary by demographic group, news source, or geographic location. Results show opinion tendencies in popularity or market behavior and might also serve for forecasts regarding certain issues. Sentiment maps can identify geographical regions of favorable or adverse opinions for given entities.

Blogs can also be analyzed via content analysis methods. These are ways to gather rich, authentic, and unsolicited customer feedback. Information technology advances continuously, and increasingly large numbers of blogs facilitate blog monitoring as a cost-effective method for service providers like hotels, restaurants, or theme parks.

Trends

Some experts see the emergence of Web logs and their proliferation as a new form of grassroots journalism. Mainstream media increasingly rely on information from blogs, and certain prominent bloggers can play a relevant role in the agenda-setting process of news. In conclusion, blogs have become, together with other social media tools, an irrefutable part of the new media ecosystem, with the Internet at its technical core.

Corporate blogs are used internally to enhance the communication and culture in a corporation or externally for marketing, branding, or public relations purposes. Some companies try to take advantage of the popularity of certain blogs and encourage these bloggers via free tests and other measures to post positive statements about products or services. Most bloggers do not see themselves as journalists and are open to cooperation with companies. Some blogs have become serious competitors for mainstream media since they are able to attract large readerships.

Cross-References

▶ Content Management System (CMS)
▶ Sentiment Analysis
Further Reading

Blumenthal, M. M. (2005). Toward an open-source methodology. What we can learn from the blogosphere. Public Opinion Quarterly, 69(5, Special Issue), 655–669.
Farrell, H., & Drezner, D. (2008). The power and politics of blogs. Public Choice, 134, 15–30.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods and analytics. International Journal of Information Management, 35, 137–144.
Godbole, N., Srinivasaiah, M., & Skiena, S. (2007). Large-scale sentiment analysis for news and blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM).
Meraz, S. (2011). The fight for ‘how to think’: Traditional media, social networks, and issue interpretation. Journalism, 12(1), 107–127.

Border Control/Immigration

Btihaj Ajana
King’s College London, London, UK

Big Borders: Smart Control Through Big Data

Investments in the technologies of borders and in the securitization of movement continue to be one of the top priorities of governments across the globe. Over the last decade, there has been a notable increase in the deployment of various information systems and biometric solutions to control border crossing and fortify the digital as well as physical architecture of borders. More recently, there has been a growing interest in the techniques of big data analytics and in their capacity to enable advanced border surveillance and more informed decision-making and risk management. For instance, in Europe, programs such as Frontex and EUROSUR are examples of big data surveillance currently used to predict and monitor movements across EU borders. While in Australia, a recent big data system called Border Risk Identification System has been developed by IBM for the Australian Customs and Border Protection Service for the purpose of improving border management and targeting so-called “risky travellers.”

In this discussion, I argue that with big data come “big borders” through which the scope of control and monopoly over the freedom of movement can be intensified in ways that are bound to reinforce “the advantages of some and the disadvantages of others” (Bigo 2006, p. 57) and contribute to the enduring inequalities underpinning international circulation. Relatedly, I will outline some of the ethical issues pertaining to such developments.

To begin with, let us consider some of the definitions of big data. Generally, big data are often defined as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze” (McKinsey Global Institute 2011). They therefore require more enhanced technologies and advanced analytic capacities. Although emphasis is often placed on the “size” aspect, big data are by no means merely about large data. Instead they are more about the networked and relational aspect (Manovich 2011; Boyd and Crawford 2011). It is the power of connecting, creating/unlocking patterns, and visualizing correlations that makes big data such a seductive field of investment and enquiry for many sectors and organizations.

Big data can be aggregated from a variety of sources including web search histories, social media, online transactions records, mobile technologies and sensors that generate and gather information about location, and any other source where digital trails are left behind knowingly or unknowingly. The purpose of big data analytics is primarily about prediction and decision-making, focusing on “why events are happening, what will happen next, and how to optimize the enterprise’s future actions” (Parnell in Field Technologies Online 2013). In the context of border management, the use of big data engenders a “knowledge infrastructure” (Bollier 2010, p. 1) involving the aggregation, computation, and analysis of complex and large size contents which attempt to establish patterns and connections that can inform the process of deciding on border access, visa granting, and other immigration and asylum related issues. Such process is part and parcel of the wholesale automation of border securitization whereby border control is increasingly being conducted remotely, at a distance and well before
the traveller reaches the physical border (Broeders 2007; Bigo and Delmas-Marty 2011). More specifically, this involves, for instance, the use of Advance Passenger Records (APR) and information processing systems to enable information exchange and passenger monitoring from the time an intending passenger purchases an air ticket or applies for a visa (see for example the case of Australia’s Advance Passenger Processing and the US Advance Passenger Information System). Under such arrangements, airlines are required to provide information on all passengers and crew, including transit travellers. This information is collected and transmitted to border agencies and authorities for processing and issuing passenger boarding directives to airlines prior to the arrival of aircraft (Wilson and Weber 2008). A chief purpose of these systems is the improvement of risk management and securitization techniques through data collection and processing.

However, and as Bollier (2010, p. 14) argues, “more data collection doesn’t mean more knowledge. It actually means much more confusion, false positives and so on.” Big data and their analytical tools are, as such, some of the techniques that are being fast-tracked to enable more sophisticated ways of tracking the movement of perceived “risky” passengers. Systems such as the Australian Border Risk Identification System function through the scanning and analysis of massive amounts of data accumulated by border authorities over the years. They rely on advanced data mining techniques and analytical solutions to fine-tune the knowledge produced out of data processing and act as a digital barrier for policing border movement and a tool for structuring intelligence, all with the aim to identify in advance suspected “high risk” passengers and facilitate the crossing of low risk ones. The argument is that automated big data surveillance systems make border control far more rigorous than what was previously possible. However, these data-driven surveillance systems raise a number of ethical concerns that warrant some reflection.

Firstly, there is the issue of categorization. Underlying border surveillance through big data is a process of sorting and classification, which enables the systematic ordering, profiling, and categorization of the moving population body into pattern types and distinct categories. This process contributes to labeling some people as risky and others as legitimate travellers and demarcating the boundaries between them. In supporting the use of big data in borders and in the security field, Alan Bersin, from the US Department of Homeland Security, describes the profiling process in the following terms: “‘high-risk’ items and people are as ‘needles in haystacks’. [Instead] of checking each piece of straw, [one] needs to ‘make the haystack smaller’, by separating low-risk traffic from high-risk goods or people” (in Goldberg 2013).

Through this rationality of control and categorization, there is the danger of augmenting the function of borders as spaces of “triage” whereby some identities are given the privilege of quick passage, whereas other identities are arrested (literally). The management of borders through big data technology is indeed very much about creating the means by which freedom of mobility can be enabled, smoothened, and facilitated for the qualified elite, the belonging citizens, all the while allowing the allocation of more time and effort for additional security checks to be exercised on those who are considered as “high risk” or “problematic” categories. The danger of such rationality and modality of control, as Lee (2013) points out, is that governments can target and “track undocumented migrants with an unheard of ease, prevent refugee flows from entering their countries, and track remittances and travel in ways that put migrants at new risks.” The deployment of big data can thus become an immobilizing act of force that suppresses the movement of certain categories and restricts their access to spaces and services. With big data, the possibilities of control might be endless: governments might be able to

predict the next refugee wave by tracking purchases, money transfers and search terms prior to the last major wave. Or connect the locations of recipients of text messages and emails to construct an international network and identify people vulnerable to making the big move to join their family or spouse abroad. (If the NSA can do it, why not Frontex?) Or, an even more sinister possibility –
identify undocumented migrant clusters with greater accuracy than ever before by comparing identity and location data with government statistics on who is legally registered. (ibid.)

Another ethical concern relates to the issue of projection, which is at the heart of big data techniques. Much of big data analytics and the risk management culture within which it is embedded are based on acts of projection whereby the future itself is increasingly becoming the object of calculative technologies of simulation and speculative algorithmic probabilities. This techno-culture is based on the belief that one can create “a grammar of futur antérieur” by which the future can be read as a form of past in order to manage risk and prevent unwanted events (Bigo 2006). Big data promise to offer such grammar through their visualization techniques and predictive algorithms, through their correlations and causations. However, as Kerr and Earle (2013) argue, big data analytics raises concerns vis-à-vis its power to enable “a dangerous new philosophy of preemption,” one that operates by unduly making assumptions and forming views about others without even “encountering” them. In the context of border management and immigration control, this translates into acts of power, performed from the standpoint of governments and corporations, which result in the construction of “no-fly lists” and the prevention of activities that are perceived to generate risk, including the movement of potential asylum seekers and refugees.

What is at issue in this preemption philosophy is also a sense of reduced individual agency. The subjects of big data predictions are often unaware of the content and the scale of information generated about them. They are often unable to respond to or contest the “categorical assumptions” made about their behaviors and activities and the ensuing projections that affect many aspects of their lives, rights, and entitlements. Given the lack of transparency and the one-way character of big data surveillance, people are often kept unaware of the nature and extent of such surveillance and left without the chance to challenge the measures and policies that affect them in fundamental ways, such as criteria of access and so on. Autonomy and the ability to act in an informed and meaningful way are significantly impaired as a result. We are, as such, at risk of “being defined by algorithms we can’t control” (Lowe and Steenson 2013) as the management of life and the living becomes increasingly reliant on data and feedback loops. In this respect, one of the ethical challenges is certainly a matter of “setting boundaries around the kinds of institutional assumptions that can and cannot be made about people, particularly when important life chances and opportunities hang in the balance” (Kerr and Earle 2013). Circulation and movement are no exception.

The third and final point to raise here relates to the implications of big data on understandings and practices of identity. In risk management and profiling mechanisms, identity is “assumed to be anchored as a source of prediction and prevention” (Amoore 2006). With regard to immigration and border management, identity is indeed one of the primary targets of security technologies, whether in terms of the use of biometrics to fix identity to the person’s “body” for the purpose of identification and identity authentication (Ajana 2013) or in terms of the deployment of big data analytics to construct predictive profiles to establish who might be a “risky” traveller. Very often, the identity that is produced by big data techniques is seen as disembodied and immaterial, and individuals as being reduced to bits and digits dispersed across a multitude of databases and networks and identified by their profiles rather than their subjectivities. The danger of such perception lies in the precluding of social and ethical considerations when addressing the implications of big data on identity, as individuals are seldom regarded in terms of their anthropological embeddedness and embodied nature. An embodied approach to the materiality of big data and identity is therefore needed to contest this presumed separation between data and their physical referent and the ever-increasing abstraction of people. This is crucial, especially when the identities at issue are those of vulnerable groups such as asylum seekers and refugees whose lives and potentialities are increasingly being caught up in the biopolitical machinery of bureaucratic institutions and their sovereign web of biopower.
Finally, it is hoped that this discussion has managed to raise awareness of some of the pertinent ethical issues concerning the use of big data in border management and to stimulate further debates on these issues. Although the focus of this discussion has been on the negative implications of big data, it is worth bearing in mind that big data technology also carries the potential to benefit vulnerable groups if deployed with an ethics of care and in the spirit of helping migrants and refugees as opposed to controlling them. For instance, and as Lee (2013) argues, big data can provide migration scholars and activists with more accurate statistics and help them fight back against “fear-mongering false statistics in the media,” while enabling new ways of understanding the flows of migration and enhancing humanitarian processes. As such, conducting further research on the empowering and resistance-enabling aspects of big data is certainly worth pursuing.

Further Reading

Ajana, B. (2013). Governing through biometrics: The biopolitics of identity. Basingstoke: Palgrave Macmillan.
Amoore, L. (2006). Biometric borders: Governing mobilities in the war on terror. Political Geography, 25, 336–351.
Bigo, D. (2006). Security, exception, ban and surveillance. In D. Lyon (Ed.), Theorising surveillance: The panopticon and beyond. Devon: Willan Publishing.
Bigo, D., & Delmas-Marty, M. (2011). The state and surveillance: Fear and control. http://cle.ens-lyon.fr/anglais/the-state-and-surveillance-fear-and-control-131675.kjsp?RH=CDL_ANG100100#P4.
Bollier, D. (2010). The promise and peril of big data. http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/The_Promise_and_Peril_of_Big_Data.pdf.
Boyd, D., & Crawford, K. (2011). Six provocations for big data. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431.
Broeders, D. (2007). The new digital borders of Europe: EU databases and the surveillance of irregular migrants. International Sociology, 22(1), 71–92.
Field Technologies Online. (2013). Big data: Datalogic predicts growth in advanced data collection as business analytics systems drive need for more data and innovation. http://www.fieldtechnologiesonline.com/doc/big-data-datalogic-data-collection-systems-data-innovation-0001.
Goldberg, H. (2013). Homeland Security official gives lecture on borders and big data. http://www.michigandaily.com/news/ford-school-homeland-security-lecture.
Kerr, I., & Earle, J. (2013). Prediction, preemption, presumption: How big data threatens big picture privacy. http://www.stanfordlawreview.org/online/privacy-and-big-data/prediction-preemption-presumption.
Lee, C. (2013). Big data and migration – What’s in store? http://noncitizensoftheworld.blogspot.co.uk/.
Lowe, J., & Steenson, M. (2013). The new nature vs. nurture: Big data & identity. http://schedule.sxsw.com/2013/events/event_IAP5064.
Manovich, L. (2011). Trending: The promises and the challenges of big social data. http://www.manovich.net/DOCS/Manovich_trending_paper.pdf.
McKinsey Global Institute. (2011). Big data: The next frontier for innovation, competition, and productivity. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation.
Wilson, D., & Weber, L. (2008). Surveillance, risk and preemption on the Australian border. Surveillance and Society, 5(2), 124–141.

Brain Research Through Advancing Innovative Neurotechnologies

▶ White House BRAIN Initiative

Brand Monitoring

Chiara Valentini
Department of Management, Aarhus University, School of Business and Social Sciences, Aarhus, Denmark
linked to other external datasets (Bizer 2009). Given that most data is today complex and unstructured and requires different storage and processing, brand monitoring often relies on big data analytics, which consists of collecting, organizing, and analyzing large and diverse datasets from various databases. Brand monitoring is a central activity for the strategic brand management of an organization and any organized entity. A brand is an identifier of a product, a service, an organization, or even a person’s main characteristics and qualities. Its main scope is to differentiate products, services, or an individual’s qualities from those of competitors through the use of a specific name, term, sign, symbol, or design, or a combination of them. In marketing, Kotler (2000) defines a brand as a “name associated with one or more items in the product line that is used to identify the source of character of the item(s)” (p. 396). Brands have existed for centuries; yet the modern understanding of a brand as something related to trademarks, attractive packaging, etc., that signifies a guarantee of product or service authenticity, is a phenomenon of the late nineteenth century (Fullerton 1988).

Brands and Consumers

The brand concept became popular in the marketing discipline as a company tactic to help customers and consumers to identify specific products or services from competitors but also to communicate their intangible qualities (Kapferer 1997). Today, anything can be branded, for example, an organization, a person, and a discipline. The concept of branding, that is, the act of publicizing a product, service, organization, person, etc., through the use of a specific brand name, has become more than a differentiation tactic. It has turned into a strategic management discipline. The scope is to create a strong product, service, organization, and personal identity that can lead to positive images and attitudes among existing and potential consumers, customers, and even the general public.

Reflecting on the impact of branding in business organizations, Kapferer (1997) noted a shift in consumers’ interests from desiring a specific commodity to desiring a precise, branded type of good or service. He observed that certain brands represent something more than a product or service; they own a special place in the minds of consumers. Companies are, indeed, trying to gain a special place in the minds of consumers by focusing on creating brand values and charging consumers who purchase those brands extra dollars for these specific values. Brand values can be of a functional nature, that is, they have specific characteristics related to product or service quality. Brand values can also be of a symbolic nature, that is, they can possess intangible characteristics such as particular meanings resulting from holding and owning specific brands, as well as from the act of brand consumption.

Brand monitoring is an important step for evaluating brand performance and brand values and, in general, for managing a brand.

Understanding Consumption Motives and Practices

Kotler (2000) argues that the most important function of marketers is to create, maintain, protect, and enhance brands. Along the same line of thought, Tuominen (1999) postulates that the real capital of companies is their brands and the perception of these brands in the minds of potential buyers. Because brands have become so important for organizations, the study of strategic brand management, integrated marketing communication, and consumer behavior has become more and more important in marketing research. Due to the tangible and intangible nature of brand values, an important area of study in brand management is the identification and assessment of how changing elements of the marketing mix impact customer attitudes and behaviors. Another important area of study is understanding consumption motives and practices. Diverse dataset analytics have become important tools for the study of both these areas.

In explaining how consumers consume, Holt (1995) identifies four typologies of consumption
practices: consuming as experience, consuming as integration, consuming as classification, and consuming as play. Consuming as experience represents the act of consuming a product or service because consuming it provokes some enjoyment, a positive experience on its own. Consuming as integration is the act of consuming with the scope of transferring the meanings that specific brands have into one’s own identity. It is an act that serves the purpose of constructing a personal identity. Consumers can use brands to strengthen a specific social identification and use consumption as a practice for making themselves recognizable by the objects they own and use. This act is consuming as classification. Finally, consuming as play reflects the idea that through consumption people can develop relationships and relate to others. This last purpose of consuming was later used to explain a specific type of brand value, called the linking value (Cova 1997). The linking value often refers to a product or service’s contribution to establishing or reinforcing bonds between individuals.

These four typologies of consumption practices show that in developed economies people buy and consume products not only for their basic human needs and the functional values of products, but often for their symbolic meanings (Sassatelli 2007). Symbolic meanings are not created in a vacuum, but are often trends and tendencies coming from different social and cultural phenomena. They are, in the words of McCracken (1986), borrowed from the culturally constituted world, which is a world where meanings about objects are shaped and changed by a diverse variety of people. For example, companies such as Chanel have long employed cultural icons, such as the French actress Catherine Deneuve in the Chanel No. 5 perfume campaigns, in their product advertisements to transpose the cultural meanings of such icons into the product brand value. Research indicates that these iconic associations have a strong attitudinal effect on consumers, since consumers think about brands as if they were celebrities or famous historical figures (Aaker 1997). Studies on brand association emerged as well as those related to brand awareness. In marketing it is well recognized that established brands reduce marketing costs because they increase brand visibility and thus help in getting consumer consideration, provide reasons to buy, attract new customers via awareness and reassurance, and create positive attitudes and feelings that can lead to brand loyalty (Aaker 1991). Therefore, brands can produce equities.

Brand Monitoring for Marketing Research

Besides identifying the benefits that brands can provide to organizations, marketing research has been interested in investigating how to measure brand value to explain its contribution to organizations’ business objectives. This value is measured through brand equity. Brand equity is “a set of brand assets and liabilities linked to a brand, its name and symbol, that adds to or subtracts from the value provided by a product or service to a firm and/or to that firm’s customer” (Aaker 1991, p. 15). Aaker’s brand equity comprises four dimensions: brand loyalty, brand awareness, brand associations, and perceived quality. Positive brand equity is obtained through targeted marketing communications activities that generate “customer awareness of the brand and the customer holds strong, unique, and favorable brand associations in memory” (Christodoulides and de Chernatony 2004, p. 169). Traditionally brand equity was measured through indicators such as price premium, customer satisfaction or loyalty, perceived quality, brand leadership or popularity, perceived brand value, brand personality, organizational associations, brand awareness, market share, market price, and distribution coverage. Yet with the increased popularity of the Internet and social media, a lot of organizations are using online channels to manage their brands and the values they can offer to their customers and consumers. Christodoulides and de Chernatony (2004) propose to include other brand equity indicators to assess online brand value. These are online brand experience, interactivity, customization, relevance, site design, customer service, order fulfillment, quality of brand
relationships, communities, as well as website log statistics.

Conclusion

Brand monitoring helps organizations to collect large sets of data on online brand experiences, which encompass all points of interaction between the customer and the brand in the virtual space. Online experiences are also about the level of interactivity, quality of the site, or social media design and customization that organizations can offer to their online customers. Through the use of specific software computing big data analytics, organizations can collect large and diverse datasets on individuals’ preferences, and they can systematically analyze and interpret them to provide unique content of direct relevance to each customer. Companies like Amazon track their customers’ purchases and provide a customized list of suggested items when customers revisit their company websites (Ansari and Mela 2003). Other brand equity indicators are Web log metrics, the number of hits, the number of revisits and view time per page, and the number of likes, shares, and re-tweets (Christodoulides and de Chernatony 2004). Information on viewers is collected via web bugs. A Web bug (also known as a tracking bug, pixel tag, Web beacon, or clear gif) is a graphic in a website or a graphic-enabled e-mail message. Sentiment analysis, also known as opinion mining, is another type of social media analytics that allows companies to monitor the status of consumers’ and publics’ opinions on their brands (Stieglitz et al. 2014). Behavioral analytics is another approach to collect and analyze large-scale datasets on consumers or simply web visitors’ behaviors. According to the Privacy Rights Clearinghouse (2014, October), companies regularly engage in behavioral analytics with the purpose of monitoring individuals, their web searches, the visited pages, the viewed content, their interactions on social networking sites, and the products and services they purchase.
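As a hedged illustration of how such a tracking pixel can work in practice, the following sketch uses the Flask web framework to serve a one-pixel image and log each request; the endpoint name, query parameter, and logging are hypothetical rather than a description of any vendor's product.

```python
# Hypothetical sketch of a tracking pixel ("web bug") endpoint using Flask.
# Endpoint name, parameters, and logging below are illustrative only.
import base64
import datetime
from flask import Flask, Response, request

app = Flask(__name__)

# A commonly used minimal 1x1 transparent GIF, base64-encoded.
PIXEL = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)

@app.route("/pixel.gif")
def pixel():
    # Every time the embedded image loads, record who viewed the page or e-mail.
    print(datetime.datetime.utcnow(), request.args.get("campaign"), request.remote_addr)
    return Response(PIXEL, mimetype="image/gif")
```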
Brand monitoring is an important component of brand equity measurement, and today there exist a number of software applications and even sites that allow companies to monitor and assess the status of their brands. For instance, Google offers Google Trends and Google Analytics; these are tools that monitor search traffic of a company and its brand. Integrated monitoring services such as Hootsuite, for monitoring Twitter, Facebook, LinkedIn, WordPress, Foursquare, and Google+ conversations in real time, and SocialMention, for searching the web for user-generated content such as blogs, comments, bookmarks, events, news, videos, and microblogging services, have also been used for collecting specific brand dataset content. Social media companies collect large strings of social data on a regular basis, sometimes for company-related purposes, other times for selling to other companies that seek information on their brands on those social networking sites (boyd and Crawford 2012). When they do not buy such datasets, organizations can acquire information on consumers’ opinions on their brand experience, satisfaction, and overall impression of their brands by simply scanning and monitoring online conversations in social media, Internet fora, and sites. Angry customers, dissatisfied employees, and consumer activists use the Web and social media as weapons to attack brands, organizations, political figures, and celebrities. Therefore, it has become paramount for any organization and prominent individual to have in place mechanisms to gather, analyze, and interpret big data on people’s opinions on their brands. Yet as boyd and Crawford (2012) pointed out, there are still issues in relying only on large datasets from Web sources, as these are often unreliable, not necessarily objective or accurate. Big data analytics are often taken out of context, and this means that datasets lose meaning and value, especially when organizations are seeking to assess their online brand equity. Furthermore, ethical concerns about anonymity and privacy of individuals can emerge when organizations collect datasets online.
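At its simplest, scanning online conversations for a brand amounts to counting mentions and tallying a crude opinion signal, as in the sketch below; the brand name, posts, and word lists are invented for illustration, and production monitoring tools rely on far richer language models.

```python
# Toy sketch of brand-mention monitoring over a stream of social posts.
# The brand, posts, and opinion lexicons are invented for illustration.
from collections import Counter

BRAND = "acme"
POSITIVE = {"love", "great", "recommend"}
NEGATIVE = {"awful", "broken", "refund"}

posts = [
    "I love my new Acme kettle, great design",
    "Acme support never answered, I want a refund",
    "Thinking about switching phone providers",
]

mentions = Counter()
for post in posts:
    words = post.lower().split()
    if BRAND in words:
        mentions["total"] += 1
        mentions["positive"] += any(w in POSITIVE for w in words)
        mentions["negative"] += any(w in NEGATIVE for w in words)

print(dict(mentions))  # e.g., {'total': 2, 'positive': 1, 'negative': 1}
```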
Cross-References

▶ Behavioral Analytics
▶ Business Intelligence Analytics
▶ Facebook
▶ Google Analytics
▶ Online Advertising
▶ Privacy
▶ Sentiment Analysis

Further Reading

Aaker, D. A. (1991). Managing brand equity. New York: The Free Press.
Aaker, J. L. (1997). Dimensions of brand personality. Journal of Marketing Research, 34(3), 347–356.
Ansari, A., & Mela, C. F. (2003). E-customization. Journal of Marketing Research, 40(2), 131–145.
Bizer, C. (2009). The emerging web of linked data. Intelligent Systems, 24(5), 87–92.
boyd, d., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
Christodoulides, G., & de Chernatony, L. (2004). Dimensionalising on- and offline brands’ composite equity. Journal of Product & Brand Management, 13(3), 168–179.
Cova, B. (1997). Community and consumption: Towards a definition of the linking value of products or services. European Journal of Marketing, 31(3/4), 297–316.
Fullerton, R. A. (1988). How modern is modern marketing? Marketing’s evolution and the myth of the production era. Journal of Marketing, 52(1), 108–125.
Holt, D. B. (1995). How consumers consume: A typology of consumption practices. Journal of Consumer Research, 22(1), 1–16.
Kapferer, J.-N. (1997). Strategic brand management. London, UK: Kogan Page.
Kotler, P. (2000). Marketing management. The millennium edition. Upper Saddle River: Prentice Hall.
McCracken, G. (1986). Culture and consumption: A theoretical account of the structure and movement of the cultural meaning of consumer goods. Journal of Consumer Research, 13(1), 71–84.
Privacy Rights Clearinghouse. (2014, October). Fact sheet 18: Online privacy: Using the internet safely. https://www.privacyrights.org/online-privacy-using-internet-safely. Accessed on 7 Nov 2014.
Sassatelli, R. (2007). Consumer culture: History, theory and politics. London, UK: Sage.
Stieglitz, S., Dang-Xuan, L., Bruns, A., & Neuberger, C. (2014). Social media analytics: An interdisciplinary approach and its implications for information systems. Business & Information Systems Engineering, 6(2), 89–96.
Tuominen, P. (1999). Managing brand equity. Liiketaloudellinen Aikakauskirja – The Finnish Journal of Business Economics, 48(1), 65–100.

Business

Magdalena Bielenia-Grajewska
Division of Maritime Economy, Department of Maritime Transport and Seaborne Trade, University of Gdansk, Gdansk, Poland
Intercultural Communication and Neurolinguistics Laboratory, Department of Translation Studies, University of Gdansk, Gdansk, Poland

Business consists of the “profit seeking activities and enterprises that provide goods and services necessary to an economic system” (Boone and Kurtz 2013, p. 5) and can be understood from different perspectives. Taking a macroapproach, business can be treated as the sum of activities directed at gaining money or any other profit. The meso-point of investigation concentrates on business as a sector or a type of industry. Examples may include, among others, automotive business, agricultural business, and media business. As in the case of other types, the appearance of new enterprises is connected with different factors. For example, the growing role of digital technologies has resulted in the advent of different types of industries, such as e-commerce or e-medicine. Applying a more microperspective, a business is an organization, a company, or a firm which focuses on offering goods or services to customers. In addition, business can refer to one’s profession or type of conducted work. No matter which perspective is taken, business is a complex entity being influenced by different factors and, at the same time, influencing other entities in various ways. Moreover, irrespective of the level of analysis, big data plays an increasingly central role in business intelligence, management, and analysis, such that basic studies and understandings of business rely on big data as a determinant feature.

Subdomains in Business Studies

Different aspects of business have been incorporated as various subdomains or subdisciplines by
which it has been studied. For example, a broad subdomain is marketing, focusing on how to make customers interested in products or services and facilitating their selection. Researchers and marketers are interested in, e.g., branding and advertising. Branding is connected with making customers aware that a brand exists, whereas advertising is related to using verbal and nonverbal tools as well as different channels of communication to make products visible on the market and then selected by users.

Another subdomain is corporate social responsibility (CSR), referring to ethical dimensions of corporate performance, with particular attention to social and environmental issues, since modern companies pay more and more attention to ethics. One of the reasons that modern companies are increasingly concerned with ethics is market competitiveness; firms observe the ways they are viewed by different stakeholders since many customers, prior to selecting products or services, might consider how a company respects the environment, takes care of local communities, or supports social initiatives. CSR can be categorized according to different notions. For example, audience is the basis for many CSR activities, analyzed through the prism of general stakeholders, customers, local communities, or workers. CSR can also be studied by examining the types of CSR activities focused on by a company (Bielenia-Grajewska 2014). Thus, the categorization can include strengthening the potential of workers and taking care of the broadly understood environment. The internal dimension is connected with paying attention to the rights and wishes of workers, whereas the external one deals with the needs and expectations of stakeholders located outside the company.

Since modern organizations are aimed at achieving particular goals and targets, management stands out as a subdomain of contemporary business. It can be classified as the set of tools and strategies for reaching corporate aims by using different types of capital and resources. It is the role of managers to make available human and nonhuman assets in order to achieve company goals. Regarding the human aspect of management, there are different profiles of managers, depending on the type of a company, its organizational structure, and organizational culture. In some organizations, especially large ones and corporations, there are top (or senior), middle, and first-line managers who perform different functions. In small businesses, the structure is less complex, often confined to a single manager responsible for different jobs. Moreover, there are different characteristics that define a good manager. For example, he/she must be a good leader; leadership skills include the possession of social power to influence or persuade people to achieve certain goals. Studies on leadership have stressed issues such as inborn features or learned expertise (soft and occupational skills), which make a person a good leader. It should also be stated that leadership styles, although sharing some common features, may be culture-specific. Thus, no leadership style applies to all situations and communities; the way a group of people is to be guided depends on the features, expectations, needs, and preferences of a given community. In the management literature, the most often discussed types of leadership are autocratic (authoritarian) leadership, democratic leadership, and laissez-faire (free-rein) leadership. Autocratic leaders do not discuss their visions and ideas with subordinates but implement their decisions without any prior consultation with workers. Democratic leaders, on the other hand, allow for the active role of employees in decision-making processes. Thus, this way of leadership involves mutual cooperation in the process of making and implementing decisions. Laissez-faire leadership makes the workers responsible for decision-making. The role of a supervisor is mainly to monitor and communicate with employees as far as their decisions are concerned. As provided in the definition, management is connected with organizing human capital. Thus, the way individuals work and contribute to the performance of an organization determines the way a company is organized. As a concept, human resources denote the set of tools and strategies used to optimize the performance of workers in a company. The staff employed by a department of human resources is responsible for, among other things, recruiting, employing, and laying off workers, along with organizing vocational
training and other methods of improving the profes- represented in, e.g., how types of management
sional skills of personnel. Thus, such concepts as are related to the past political systems in a given
human capital and talent management are used to state. As far as branding products is concerned,
denote the skills and abilities of employees within history is used to stress the long tradition or expe-
the area of human resources. rience of a given company to produce merchan- B
A prominent subdomain of business studies is dise or deliver services. Thus, a company
finance, since running a business is intimately operating for a long time on the market is often
related to managing its financial sphere. Finance regarded as trustworthy and experienced in a
is connected with assets, liabilities, and equities of given industry.
a company. For example, accounting (or financial Also important is geography since location
reporting) in companies focuses on keeping a determines contemporary business in different
financial record of corporate activities, paying ways. For one thing, it shapes the profile of a
taxes, and providing information on the financial company since a given business has to adjust to
situation of companies for interested stakeholders, available geographical conditions. For example,
an area that is increasingly complex and growing there are fisheries located near the water reservoirs
with the incorporation of big data analytics. such as seas, oceans, or lakes. In that case, setting
Yet another domain of study important for con- up a business near a given geographical location
temporary business is business law, comprising the limits the cost of transportation. In other situa-
set of regulations connected with corporate perfor- tions, geographical characteristics may serve as a
mance. It concerns the legal sphere of creation, barrier in running a given type of company due to
production, and sale of products. It should also be the lack of required resources or limited access to
stressed that the various spheres do not only deter- its premises for most customers. Moreover, a par-
mine contemporary business as such but also influ- ticular geographical location may work as a tool
ence themselves. For example, the financial sphere of branding a given company. For example,
is determined by business law, whereas manage- mountains or villages may be associated with
ment depends on available human resources in a fresh air and nature, and thus, products offered
company and its crucial characteristics. Thus, con- by the companies operating in such areas are
temporary business is not only shaped in terms of associated with healthy food.
the subdomains but also shapes them. Another factor is politics since decisions made
by politicians influence the way companies func-
tion. For example, the type of governing can
Factors Shaping Contemporary Business enhance or diminish the amount of people inter-
ested in running private companies. Economics is
An important factor shaping the way business another determinant of contemporary business.
functions is its environment, broadly conceived. From a macroeconomic perspective, such notions
Today’s competitive business world is affected by as production, consumption, inflation, growth,
the competitive environment, the global environ- and unemployment influence the way contempo-
ment, the technological environment, and the eco- rary businesses function. For example, low unem-
nomic environment (Pride et al. 2012). Analysis ployment may result in companies having to
of contemporary businesses from different per- increase wages or salaries. The microeconomic
spectives allows enumeration of at least seven consideration of households, sellers, or buyers
crucial determinants of contemporary business, shapes the way companies adjust prices and the
as indicated in Fig. 1. level of production. Also, technology is a factor;
First, history determines the performance of the twenty-first century can be characterized as the
companies since the way business entities func- age of extensive technological developments in
tion mirrors the history not only of a company but all spheres of life. For example, modern compa-
also of a state. In addition, the traces of history are nies rely on the Internet in advertising their prod-
observed in the way firms are managed, being ucts and communicating with stakeholders. In that
Business, Fig. 1 Main determinants of contemporary business: history, geography, politics, economics, technology, law, and culture
way, technology is responsible for lowering the interactions taking place in the web. It is
costs of communication and making it more effec- connected with sending e-mails, using social
tive in comparison with standard methods of media networking tools, such as Facebook or
exchanging information, such as face-to face Twitter, posting information at websites, etc.
interactions or regular mail. It should also be Off-line communication is understood as all
stressed that companies do not exist in a vacuum types of communicative exchanges that do not
and their performance is determined by the legal involve the Internet. Thus, this notion entails
sphere, i.e., the law. Laws and other legal regula- direct and indirect interaction without the
tions shape the way a company is set up, run, and online exchange of data.
closed. Communication can also be discussed in
Finally, culture is probably the most com- terms of its elements. The main typology
plex factor of all the determinants. It can be involves verbal and nonverbal tools of commu-
understood in many ways, with the perspective nication. As far as the verbal sphere is
of constituents being presented as the first concerned, language is responsible for shaping
approach. Communication is one of the most corporate linguistic policies, determining the
important and most visible representations of role of corporate lingo as well as the national
culture in modern business, understood by tak- language of a country a company operates in and
ing inner and outer dimensions into account. the usage of professional languages and dialects.
The division of internal (with and among In addition, it also shapes the linguistic sphere of
workers) and external (with broadly under- those going abroad, e.g., expatriates who also
stood stakeholders) communication can be have to face linguistic challenges in a new coun-
applied. Also online and offline communica- try. Moreover, language is also not only the
tions are a main determinant of interaction. means of corporate communication, but it is
Online communication involves all types of also a sphere of activity that is regulated in
companies. Corporate linguistic rights and cor- Communication can also be divided by
porate linguistic capital are issues that should be discussing the type of stakeholders participating
handled with care by managers since properly in interaction. Taking into account the partici-
managed company linguistic identity capital pants, communication can be classified as internal
offers possibilities to create a friendly and effi- and external. Internal corporate communication B
cient professional environment for both external entails interactions taking place between
and internal communication (Bielenia- employees of a company. This dimension incor-
Grajewska 2013a). As far as the linguistic aspect porates different types of discourse among
is concerned, related tools can be divided into workers themselves as well as between workers
literal and nonliteral ones. Nonliteral tools and employees, including such notions as hierar-
encompass symbolic language, represented by, chy, power distance, and organizational commu-
e.g., metonymies and metaphors. Taking meta- nication policy. On the other hand, external
phors into consideration, they turn out to be corporate communication focuses on interactions
efficient in intercultural communication, relying with the broadly understood stakeholders. It
on broadly understood symbols, irrespective of involves communication with customers, local
one’s country of origin. Using a well-known community, mass media, administration, etc.
domain to describe a novel one is a very effec- Communication is also strongly influenced by
tive strategy of introducing new products or cultural notions, such as the type of culture shared
services on the market. Paying attention to avail- or not shared by interlocutors. Apart from the
able connotations, metaphors often make stake- discussed element-approach, one may also dis-
holders attracted to the merchandise. In addition, cuss culture as a unique set of norms and rules
metaphors can be used to categorize types of shared by a given community (be it ethnic,
contemporary business. For example, a modern national, professional, etc.).
company may be perceived as a teacher; intro- Another approach for investigating culture
ducing novel technologies and advancements in and its role for contemporary business is by
a given field results in customers having access looking at cultural differences. No matter which
to the latest achievements and products. typology is taken into account, modern compa-
On the other hand, taking the CSR perspective nies are often viewed through similarities and
into account, companies may become a paragon differences in terms of values they praise. The
of protecting the environment, teaching the local differences may be connected with national
community how to take care of their neighbor- dichotomies or organizational varieties. For
hood. In addition, companies may promote a example, such notions as the approach to power
given lifestyle, such as eating healthy food, or hierarchy in a given national culture are taken
exercising more, or spending free time in an edu- as a factor determining the way a given company
cational way. Another organizational metaphor is is run. In addition, companies are also viewed
a learner, with companies playing the role of stu- through the prism of differences related to orga-
dents in the process of organizational learning nizational values and leadership styles. It should
since they observe the performance of other com- be mentioned, however, that contemporary busi-
panies and the behavior of customers. Apart from ness is not only an entity influenced by different
the verbal sphere, a company communicates itself factors, but it is also an entity that influences
through pictorial, olfactory, and audio channel. others. Contemporary business influences the
The pictorial corporate dimension is represented environment at both individual and societal
by, e.g., logos or symbols, whereas the audio one level. Applying the microposition, modern com-
is visible in songs used in advertising or corporate panies construct the life of individuals. Starting
jingles. The olfactory level is connected with all with the tangible sphere, they shape the type of
the smells related in some way to the company competence and skills required from the individ-
itself (e.g., the scent used in corporate offices) or uals, being the reason why people decide to edu-
its products. cate or upgrade their qualification. They are often
the reasons why people decide to migrate or opt for e-commerce, offering goods or services
reorganize their private life. Taking the meso- on the web.
level into account, companies determine the per-
formance of other companies, through such
notions as competitiveness, providing necessary Big Data and Researching Contemporary
resources, etc. The macrodimension is related to Business
the way companies influence the state.
The growing expectations of customers who are
faced with multitudes of goods and services have
Types of Contemporary Business led to the emergence of cross-domain studies that
contribute to a complex picture of how stake-
One way of classifying businesses is by taking holders’ expectations can be met. Thus, many
into account the type of ownership (Pride et al. researchers opt for multidisciplinary methods. A
2012). This classification encompasses different popular approach to investigating contemporary
types of individual and social ownership. For business is a network perspective. Network stud-
example, as in the case of sole proprietorship, ies offer a multidimensional picture of the inves-
one can run his or her own business, without tigated phenomenon, drawing attention to
employing any workers. When an individual different elements that shape the overall picture.
runs a business with another person, it is called Analyzing the application of network approaches
partnership (e.g., limited liability partnership, in contemporary business, e.g., Actor-Network-
general partnership). On the other hand, corpo- Theory, stresses the role of living and nonliving
rations are companies or groups of companies entities in determining corporate performance. It
acting as a single legal entity, having rights and may be used to show how not only individuals but
liabilities toward their workers, shareholders, also, e.g., technology, mobile phones, and office
and other stakeholders. There is also a state (pub- artifacts influence the operational aspects of con-
lic) ownership when a business is run by the temporary business (Bielenia-Grajewska 2011).
state. Another type of running a business is fran- The selection of methods is connected to the
chising. A franchisee is given the right to use the object of analysis. Thus, researchers use
brand and marketing strategies of a given com- approaches such as observation to investigate cor-
pany to run its own store, restaurant, service porate culture, interviews to focus on hierarchy
point, etc. issues, or expert panels to study management
Modern companies also can be studied from styles. Moreover, contemporary business can be
the perspective of profit and nonprofit statuses. researched by applying qualitative or quantitative
Contemporary business can also be divided by approaches. The first are focused on researching a
taking into account different types of business carefully selected group of individuals, factors,
and looking through the prism of how they are and notions in order to observe some tendencies,
run and managed. In that case, leadership styles whereas the quantitative studies are related to
as well as corporate cultures can be examined. dealing with relatively high numbers of people
Contemporary business can also be discussed by or concepts. Moreover, there are types of research,
analyzing its scope of activities. Thus, division generally associated with other fields, which can
into regional, national, and international compa- provide novel data on modern companies. Taking
nies can be used. Business also can be classified linguistics as an example, discourse studies can be
by analyzing type of industry. With the advent of used to research how the selection of words and
new technologies and the growing popularity of phrases influences the attitude of clients toward
the Internet, companies can also be sub- offered products. One of the popular domains in
categorized according to online-off-line distinc- contemporary business is neuroscience. Its grow-
tions. Thus, apart from standard types of ing role in different modern disciplines has
business, nowadays more and more customers influenced, e.g., management, and resulted in
such areas of study as international neurobusiness, international neurostrategy, neuromarketing, neuroentrepreneurship, and neuroethics (Bielenia-Grajewska 2013b).
No matter which approach is taken into account, both researchers studying contemporary business and managers running the companies have to deal with an enormous amount of data of different types and sources (Roxburgh 2019). They can include demographic data (names, addresses, sex, ethnicity, etc.), financial data (income, expenditures), retail data (shopping habits), and data connected with transportation, education, health, and social media use. Sources of information include public, health, social security, and retail repositories as well as the Internet. The variety and volume of data and its complex features lead to many techniques that can be used by organizations to deal with these types of data. These can include techniques such as data mining, text mining, web mining, graph mining, network analysis, machine learning, deep learning, neural networks, genetic algorithms, spatial analysis, and search-based applications (Olszak 2020).


Cross-References

▶ Ethics
▶ Human Resources
▶ International Nongovernmental Organizations (INGOs)
▶ Semiotics


Further Reading

Bielenia-Grajewska, M. (2011). A potential application of actor network theory in organizational studies: The company as an ecosystem and its power relations from the ANT perspective. In A. Tatnall (Ed.), Actor-network theory and technology innovation: Advancement and new concepts. Hershey: Information Science Reference.
Bielenia-Grajewska, M. (2013a). Corporate linguistic rights through the prism of company linguistic identity capital. In C. Akrivopoulou & N. Garipidis (Eds.), Digital democracy and the impact of technology on governance and politics: New globalized perspectives. Hershey: IGI Global.
Bielenia-Grajewska, M. (2013b). International neuromanagement. In D. Tsang, H. H. Kazeroony, & G. Ellis (Eds.), The Routledge companion to international management education. Abingdon: Routledge.
Bielenia-Grajewska, M. (2014). CSR online communication: The metaphorical dimension of CSR discourse in the food industry. In R. Tench, W. Sun, & B. Jones (Eds.), Communicating corporate social responsibility: Perspectives and practice (Critical studies on corporate responsibility, governance and sustainability) (Vol. 6). Bingley: Emerald Group Publishing.
Boone, L. E., & Kurtz, D. L. (2013). Contemporary business. Hoboken: Wiley.
Olszak, C. M. (2020). Business intelligence and big data: Drivers of organizational success. Boca Raton: CRC Press.
Pride, W., Hughes, R., & Kapoor, J. (2012). Business. Mason: South-Western.
Roxburgh, E. (2019). Business and big data: Influencing consumers. New York, NY: Lucent Press.


Business Intelligence

▶ Business Intelligence Analytics


Business Intelligence Analytics

Feras A. Batarseh
College of Science, George Mason University, Fairfax, VA, USA


Synonyms

Advanced analytics; Big data; Business intelligence; Data analytics; Data mining; Data science; Data visualizations; Predictive analytics


Definition

Business Intelligence Analytics is a wide set of solutions that could directly and indirectly influence the decision-making process of a business organization. Many vendors build Business Intelligence (BI) platforms that aim to plan, organize, share, and present data at a company, hospital, bank, airport, federal agency, university, or any other type of organization. BI is the business umbrella that has analytics and big data under it.


Introduction

Nowadays, business organizations need to carefully gauge markets and take key decisions quicker than ever before. Certain decisions can steer the direction of an organization and halt its progress, while other decisions can improve its place in the market and even increase profits. If BI is broken into categories, three organizational areas would emerge: technological intelligence (understanding the data, advancing the technologies used, and the technical footprint), market intelligence (studying the market, predicting where it is heading and how to react to its many variables), and strategic intelligence (which dictates how to organize, employ, and structure an organization from the inside, and how strategies affect the direction of an organization in general).


Main BI Features and Vendors

The capabilities of BI include decision support, statistical analysis, forecasting, and data mining. Such capabilities are achieved through a wide array of features that a BI vendor should inject into their software offering. Most BI vendors provide such features; however, the leading global BI vendors are IBM (Watson Analytics), Tibco (Spotfire), Tableau, MicroStrategy, SAS, SAP (Lumira), Oracle, and Microsoft (PowerBI). Figure 1 below illustrates BI market leaders based on the power of execution and performance and the clarity of vision.

Business Intelligence Analytics, Fig. 1 Leading BI vendors (Forrester 2015)

BI vendors provide a wide array of software, data, and technical features for their customers; the most commonplace features include database management, data organization and
augmentation, data cleaning and filtering, data normalization and ordering, data mining and statistical analysis, data visualization, and interactive dashboards (Batarseh et al. 2017).
Many industries (such as healthcare, finance, athletics, government, education, and the media) have adopted analytical models within their organizations. Although data mining research has been of interest to many academic researchers around the world for a long time, data analytics (a form of BI) did not see much light until it was adopted by industry. Many software vendors (SAS, SPSS, Tableau, Microsoft, and Pentaho) shifted the focus of their software development to include a form of BI analytics, big data, statistical modeling, and data visualization.


BI Applications

As mentioned previously, BI has been deployed in many domains; some famous and successful BI applications include: (1) healthcare records collection and analysis, (2) predictive analytics for the stock market, (3) airport passenger flow and management analytics, (4) federal government decision and policy making, (5) geography, remote sensing, and weather forecasting, and (6) defense and army operations, among many other successful applications.
However, to achieve such decision-making support functions, BI relies heavily on structured data. Obtaining structured data is quite challenging in many cases, and data are usually raw, unstructured, and unorganized. Business organizations have data in the form of emails, documents, surveys, sheets, tables, and even meeting notes; furthermore, they have data for customers that can be aggregated at many different levels (such as weekly, monthly, or yearly). To achieve successful applications, most organizations need to have a well-defined BI lifecycle. The BI lifecycle is introduced in the next section.


The BI Development Lifecycle

Based on multiple long and challenging deployments in many fields, trials and errors, and many consulting exchanges with customers from a variety of domains, BI vendors coined a data management lifecycle model for BI. SAS provided that model (illustrated in Fig. 2).

Business Intelligence Analytics, Fig. 2 BI lifecycle model (SAS 2017)
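The data cleansing, filtering, and normalization features listed above correspond to the early stages of this lifecycle. The fragment below is a minimal, illustrative sketch of those preparation and exploration steps using pandas; the file name (sales.csv) and the column names are hypothetical and are not drawn from the sources cited in this entry.

```python
# Illustrative only: a hypothetical "sales.csv" extract with made-up column names.
import pandas as pd

df = pd.read_csv("sales.csv")                      # prepare: load the raw extract

# Data cleansing: drop exact duplicates and rows missing key fields
df = df.drop_duplicates()
df = df.dropna(subset=["customer_id", "amount"])

# Filtering and normalization: keep one region, rescale amount to [0, 1]
df = df[df["region"] == "Northeast"]
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)

# Exploration: summary statistics of the kind a BI tool would chart
print(df["amount"].describe())
print(df.groupby("product_line")["amount"].agg(["count", "mean", "sum"]))
```

In practice, a BI platform carries out these steps through its graphical tooling rather than hand-written scripts, but the underlying operations are of this kind.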
Business Intelligence Analytics, Fig. 3 A dashboard (Tableau 2017) with panels showing calls by service type and weekday, calls by service type (SMS, MMS, MBox, Phone, WLAN, GPRS/3G, unknown), and calls by daytime and service type

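The dashboard in Fig. 3 aggregates raw call records into panels such as calls by service type, by weekday, and by hour of day. The sketch below shows, under stated assumptions, how such panel data might be derived before being handed to a visualization tool; the file name (calls.csv) and column names (call_start, service_type) are hypothetical.

```python
# Illustrative only: "calls.csv" is a hypothetical call log with columns
# call_start (timestamp) and service_type (SMS, MMS, Phone, GPRS/3G, ...).
import pandas as pd

calls = pd.read_csv("calls.csv", parse_dates=["call_start"])

# Panel 1: share of calls by service type (percentages)
by_type = calls["service_type"].value_counts(normalize=True).mul(100).round(2)

# Panel 2: call counts by weekday and service type
by_weekday = (
    calls.groupby([calls["call_start"].dt.day_name(), "service_type"])
    .size()
    .unstack(fill_value=0)
)

# Panel 3: call counts by hour of day and service type
by_hour = (
    calls.groupby([calls["call_start"].dt.hour, "service_type"])
    .size()
    .unstack(fill_value=0)
)

print(by_type, by_weekday, by_hour, sep="\n\n")
```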
The model includes the following steps: identify and formulate the problem; prepare the data (pivoting and data cleansing); data exploration (through summary statistics charts); data transformation and selection (select ranges and create subsets); statistical model development (data mining); validation, verification, and deployment; evaluate and monitor results of models; deliver the best model; and observe the results and refine (Batarseh et al. 2017). The main goal of the BI lifecycle is to allow BI engineers to transform big data into useful reports, graphs, tables, and dashboards. Dashboards and interactive visualizations are the main outputs of most BI tools. Figure 3 shows an example output – a Tableau dashboard (Tableau 2017).
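The later lifecycle stages (statistical model development, validation, and evaluation) can be sketched in a few lines. The following illustration uses scikit-learn on synthetic data and is not the model, data, or system described by Batarseh et al. (2017); it only indicates the shape of these steps.

```python
# Illustrative only: the modeling and validation stages on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # stands in for transformed, selected features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stands in for a business target, e.g., churn

# Statistical model development and validation on a hold-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Evaluate and monitor before delivering the best model
print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
```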
BI outputs are usually presented on top of a data warehouse. The data warehouse is the main repository of all data that are created, collected, generated, organized, and owned by the organization. Data warehouses can host databases (such as Oracle databases) or big data that is unstructured (but organized through tools such as Hadoop). Each of the mentioned technologies has become essential in the lifecycle of BI and its outputs. Different vendors have different weaknesses and strengths, most of which are presented in market analysis studies published by McKinsey & Company, Gartner, and Forrester (Forrester 2015).


Conclusion

Business Intelligence (analytics built on top of data, in many cases big data) is a rapidly growing field of research and development and has attracted interest from academia and government but mostly from industry. BI analytics depend on many other software technologies and research areas of study such as data mining, machine learning, statistical analysis, art, user interfaces, market intelligence, artificial intelligence, and big data. BI has been used in many domains, and it is still witnessing a growing demand with many new applications. BI is a highly relevant and a very interesting area of study that is worth investing in at all venues and exploring at all levels.


Further Reading

Batarseh, F., Yang, R., & Deng, L. (2017). A comprehensive model for management and validation of federal big data analytical systems. Published in Springer's journal Big Data Analytics.
Evelson, B. (2015). The Forrester wave: Agile business intelligence platforms. A report published by Forrester Research, Inc.
SAS website and reports. (2017). Available at: http://www.sas.com/en_us/home.html.
Tableau website and dashboards. (2017). Available at: http://www.tableau.com/.


Business-to-Community (B2C)

Yulia A. Levites Strekalova
College of Journalism and Communications, University of Florida, Gainesville, FL, USA

Organizations of all types and sizes reap the benefits of web interactivity and actively seek to engage with their customers. Social communities and online engagement functionality offered by the web allow businesses to foster the creation of communities of customers and benefit from multi-way conversations and information exchanges that happen in these communities. Moreover, big data drives web development and the creation of online communities such that principles of business-to-community (B2C) engagement have applications beyond traditional marketing efforts and create value for associations, nonprofits, and civic groups. Online communities and social networks facilitate the creation, organization, and sharing of knowledge. As social organisms, communities go through life cycles of development and maturity and show different characteristics at different stages of development. Interactive communication and engagement techniques in the enterprise promise to have profound and far-reaching effects on how organizations reach and support customers as communities. Community building, therefore, requires a long-term commitment and ongoing efforts from organizations. Overall, communities may become an integral part of a business's operations and become an
asset if strategically engaged, or a liability if communities: the host-created model, the audi-
mismanaged. ence-created model, and the co-creation model.
Traditional business-to-consumer interactions The host-created model of community building
relied on one-way conversations between busi- is a top-down approach, where an organization
nesses and their consumers. These conversations builds a community for its target audience and
could take a form of interviews or focus groups as encourages its customers to actively participate
part of a market research initiative or a customer in the online knowledge exchange. This
satisfaction survey. These methods are still effec- approach relies on the organization’s staff rather
tive in collecting customer feedback, but they are than volunteers to keep the community in the
limited in their scope as they are usually guided by community active and offers the most control to
predefined research questions. As such, extended the organization. It also requires the most con-
data collection activities force customers to pro- trol, strategic planning, and ongoing effort to
vide information of interest to an organization grow and maintain the community. The audi-
rather than collect unaided information on the ence-created model is a bottom-up approach,
areas or products of specific interest or concern which is driven by the consumers themselves
to the customers. Conversely, in a community based on shared hobbies and fandom. In this
communication environment, community mem- case, organizations support and cultivate passion
bers can pose questions themselves indicating for their product through a loyal group of advo-
what interests or concerns them most. cates. While this approach may be more cost-
effective for organizations, its outcomes are
also a lot less predictable and may be hard to
Community Development measure. The last, co-creation, model is a hybrid
of the first two, where an organization may pro-
Community planning requires organizations to vide technical support and a platform for com-
establish a general strategy for community devel- munication and exchange of ideas, but
opment, make decisions about the type of leader- individuals are drawn to the community to sat-
ship in the community, decide on desired cultural isfy their information needs through group inter-
characteristics of the knowledge, define the level action and knowledge exchange.
of formality of the community management,
decide on the content authorship, and develop a
set of metrics to establish and measure the out- Community Life Cycle
comes of community engagement.
The decision to maintain an online community Margaret Brooks and her colleagues, discussing
creates an asset with a lot of benefits for the business-to-business (B2B) social communities,
organizations, but it also creates a potential liabil- describe a four-stage community life cycle
ity. Ill-maintained, unmoderated community com- model. This four-stage model describes commu-
munication may create negative company and nities and their development as an onboarding-
brand perceptions among potential customers established-mature-mitotic continuum. Onboarding
and prompt them to consider a different brand. communities are new and forming. These com-
Similarly, existing customers who do not get pro- munities usually attract early adopters and cham-
mpt responses in online community may not pions who want to learn more about a product or
develop a strong tie to the company and its prod- an organization. Members can contribute to the
uct and failing to engage with the product to the development of the community and a virtual col-
extent they could. These situations may have a laborative space. At this stage, new community
long-term effect on the future product choices for members are interested in supporting the commu-
the existing customers. nity and creating benefits mutually shared
Rich Millington identifies three models facil- between them and an organization. Onboarding
itated by big data for building large active online communities are most vulnerable and need to
create interest among active community members valuable insights for companies. If organiza-
to gain momentum and attract additional fol- tions can build systems to collect and analyze
lowers. Established communities are character- data on consumer insights, community commu-
ized by an established membership with leaders, nication can feed ideas for new product devel-
advocates and followers. Members of these com- opment from the perspective of those who will B
munities start to create informal networks and be using the products. Communities can also
share knowledge among them. Mature communi- drive marketing efforts by creating and spread-
ties, which have existed for a few years, are said to ing the buzz about new products and services.
be virtually self-sustaining. These networks fea- Here, the necessary ingredient for effective viral
ture internal teams, area experts, and topical col- marketing is understanding of the audience that
laborations. The goal of mature communities is will support a new product and tailoring of
not to increase their membership but to keep communication to generate interest in this audi-
existing members engaged in active communica- ence. Finally, communities can self-support
tion and information exchange. Finally, mitotic themselves in real time. This support can pro-
communities are compared to a mother cell that vide a robust, powerful and sustaining environ-
grows and separates into daughter cells with iden- ment for training and education and act as a
tical characteristics. The separation could be product demonstration ground for potential
based on a regional level for large communities new customers.
or based on a development of different product Active engagement is key to successful online
lines. This process could be an indication that a community efforts and several strategies can
community lost its core focus. Yet, it could also be help in increasing participation of the members.
a natural process that will lead to the creation of Lack of participation in an online community
new established and mature communities with may be associated with several factors. For
narrower focus. example, community groups are too segmented,
which makes it hard for the existing and new
community members to find the right group to
Active Communities ask a question. One indicator of this problem
may be a low number of members per group or
Interaction functionality afforded by the web a low number of new questions posted to a dis-
creates new opportunities for organizations to cussion group.
contact and connect with their customers by pro- Content relevance is another issue that may
viding rich data and informational support and contribute to the community inactivity. The anal-
interactive communication through wikis and ysis of audience interests and the questions that
blogs. The use of product champions and com- audiences look to resolve through online commu-
munity experts allows organizations to assist nity participation may not overlap with the
with problem resolution through discussion company’s immediate priorities, yet if content is
groups and online forums. Web-based modes of not of value to the audience, large quantities of
customer support creates opportunities to service irrelevant content will not lead to large quantities
tens of thousands of customers with just hun- of community engagement.
dreds of employees and ongoing support with Ongoing engagement is the third area that may
community expert volunteers. Additionally, cus- contribute to the lack of online participation. The
tomer-to-customer advocacy for an organization rates of new members vising and registering in the
or its products may be more persuasive and pow- community, the amount of information each mem-
erful than organization’s own marketing and ber views in a session, participation in discus-
advertising efforts. sions, initiation of new discussions and other
Engaged communities create competitive online communication behavior are all areas that
advantage for companies that succeed in may require an intervention. Combination of all
forming and supporting them and provide these points may create audience profiles and
facilitate further audience analysis and compari- Cross-References


son of individual community members against
exiting and desired models of member ▶ Cluster Analysis
engagement. ▶ Profiling
Numerous studies have shown that only 5– ▶ Sentiment Analysis
20% of online community members engage in ▶ Social Media
active communication while the rest are passive
information consumers. The attitudes of the lat-
ter, dominant group is harder to assess, yet the Further Reading
passive consumption of information does not
mean the lack of engagement with the company’s Brooks, M. (2013). Developing B2B social communities:
keys to growth, innovation, and customer loyalty. CA:
products. This does mean, however, that a few
CA Technologies Press.
active online community members may act as Millington, R. (2014). Buzzing communities: how to build
opinion leaders, thus having a strong effect on bigger, better, and more active online communities.
the rest of the customers participating in an Lexington: FeverBee.
Simon, P. (2013). Too big to ignore: the business case for
online community created largely through big
big data. Hoboken: Wiley.
data processes.

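The participation indicators discussed in this entry, such as the share of members who actively post, views per session, and posting rates, can be computed directly from community activity logs. The brief sketch below uses hypothetical field names and values for illustration only.

```python
# Illustrative only: per-member engagement indicators from a hypothetical
# activity log (member_id, posts, replies, sessions, pages_viewed).
activity = [
    {"member_id": 1, "posts": 4, "replies": 9, "sessions": 12, "pages_viewed": 310},
    {"member_id": 2, "posts": 0, "replies": 0, "sessions": 3,  "pages_viewed": 25},
    {"member_id": 3, "posts": 0, "replies": 1, "sessions": 7,  "pages_viewed": 140},
]

active = [m for m in activity if m["posts"] + m["replies"] > 0]
active_share = 100 * len(active) / len(activity)          # cf. the 5-20% figure noted above
avg_views_per_session = sum(m["pages_viewed"] for m in activity) / sum(
    m["sessions"] for m in activity
)

print(f"active members: {active_share:.0f}%")
print(f"pages viewed per session: {avg_views_per_session:.1f}")
```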
Cancer stages within the cancer continuum. Sources of


data include laboratory investigations, feasibility
Christine Skubisz studies, clinical trials, cancer registries, and
Department of Communication Studies, Emerson patient medical records. The paragraphs that fol-
College, Boston, MA, USA low describe current practices and future direc-
Department of Behavioral Health and Nutrition, tions for cancer-related research in the era of big
University of Delaware, Newark, DE, USA data.

Cancer is an umbrella term that encompasses Cancer Prevention and Early Detection
more than 100 unique diseases related to the
uncontrolled growth of cells in the human body. Epidemiology is the study of the causes and pat-
Cancer is not completely understood by scientists, terns of human diseases. Aggregated data allows
but it is generally accepted to be caused by both epidemiologists to study why and how cancer
internal genetic factors and external environmen- forms. Researchers study the causes of cancer
tal factors. The US National Cancer Institute and ultimately make recommendations about
describes cancer on a continuum, with points of how to prevent cancer. Data provides medical
significance that include prevention, early detec- practitioners with information about populations
tion, diagnosis, treatment, survivorship, and end- at risk. This can facilitate proactive and preventive
of-life care. This continuum provides a frame- action. Data is used by expert groups including
work for research priorities. Cancer prevention the American Cancer Society and the United
includes lifestyle interventions such as tobacco States Preventive Services Task Force to write
control, diet, physical activity, and immunization. recommendations about screening for detection.
Detection includes screening tests that identify Screening tests, including mammography and
atypical cells. Diagnosis and treatment involves colonoscopy, have advantages and disadvantages.
informed decision making, the development of Evidence-based results, from large representative
new treatments and diagnostic tests, and outcomes samples, can be used to recommend screening for
research. Finally, end-of-life care includes pallia- those who will gain the largest benefit and sustain
tive treatment decisions and social support. Large the fewest harms. Data can be used to identify
data sets can be used to uncover patterns, view where public health education and resources
trends, and examine associations between vari- should be disseminated.
ables. Searching, aggregating, and cross- At the individual level, aggregated information
referencing large data sets is beneficial at all can guide lifestyle choices. With the help of

technology, people have the ability to quickly and Data is also being used to predict which med-
easily measure many aspects of their daily lives. ications may be good candidates to move for-
Gary Wolf and Kevin Kelly coined this rapid ward into clinical research trials. Clinical trials
accumulation of personal data the quantified self are scientific studies that are designed to deter-
movement. Individual-level data can be collected mine if new treatments and diagnostic proce-
through wearable devices, activity trackers, and dures are safe and effective. Margaret Mooney
smartphone applications. The data that is accumu- and Musa Mayer estimate that only 3% of adult
lated is valuable for cancer prevention and early cancer patients participate in clinical trials. Much
detection. Individuals can track their physical of what is known about cancer treatment is based
activity and diet over time. These wearable on data from this small segment of the larger
devices and applications also allow individuals population. Data from patients who do not par-
to become involved in cancer research. Individ- ticipate in clinical trials exists, but this data is
uals can play a direct role in research by contrib- unconnected and stored in paper and in elec-
uting genetic data and information about their tronic medical records. New techniques in big
health. Health care providers and researchers can data aggregation have the potential to facilitate
view genetic and activity data to understand the patient recruitment for clinical trials. Thousands
connections between health behaviors and of studies are in progress worldwide at any given
outcomes. point in time. The traditional, manual, process of
matching patients with appropriate trials is both
time consuming and inefficient. Big data
Diagnosis and Treatment approaches can allow for the integration of med-
ical records and clinical trial data from across
Aggregated data that has been collected over long multiple organizations. This aggregation can
periods of time has made a significant contribu- facilitate the identification of patients for inclu-
tion to research on the diagnosis and treatment of sion in an appropriate clinical trial. Nicholas
cancer. The Human Genome Project, completed LaRusso writes that IBM’s supercomputer Wat-
in 2003, was one of the first research endeavors to son will soon be used to match cancer patients
harness large data sets. Researchers have used with clinical trials. Patient data can be mined for
information from the Human Genome Project to lifestyle factors and genetic factors. This can
develop new medicines that can target genetic allow for faster identification of participants
changes or drivers of cancer growth. The ability that meet inclusion criteria. Watson, and other
to sequence the DNA of large numbers of tumors supercomputers, can shorten the patient identifi-
has allowed researchers to model the genetic cation process considerably, matching patients in
changes underlying certain cancers. seconds. This has the potential to increase enroll-
Genetic data is stored in biobanks, repositories ment in clinical trials and ultimately advance
in which samples of human DNA are stored for cancer research.
testing and analysis. Researchers draw from these Health care providers’ access to large data sets
samples and analyze genetic variation to observe can improve patient care. When making a diagno-
differences in the genetic material of someone sis, providers can access information from
with a specific disease compared to a healthy patients exhibiting similar symptoms, lifestyle
individual. Biobanks are run by hospitals, choices, and demographics to form more accurate
research organizations, universities, or other med- conclusions. Aggregated data can also improve a
ical centers. Many biobanks do not meet the needs patient’s treatment plan and reduce the costs of
of researchers due to an insufficient number of conducting unnecessary tests. Knowing a
samples. The burgeoning ability to aggregate patient’s prognosis helps a provider decide how
data across biobanks, within the United States aggressively to treat cancer and what steps to take
and internationally, is invaluable and has the after treatment. If aggregate data from large and
potential to lead to new discoveries in the future. diverse groups of patients were available in a
single database, providers would be better will always be incomplete and will fail to cover
equipped to predict long-term outcomes for the entire population. Data from diverse sources
patients. Aggregate data can help providers select will vary in quality. Self-reported survey data
the best treatment plan for each patient, based on will appear alongside data from randomized,
the experiences of similar patients. This can also clinical trials. Second, the major barrier to using
allow providers to uncover patterns to improve big data for diagnosis and treatment is the task of
care. Providers can also compare their patient out- integrating information from diverse sources. C
comes to outcomes of their peers. Harlan Allen Lichter explained that 1.6 million Ameri-
Krumholz, a professor at the Yale School of Med- cans are diagnosed with cancer every year, but in
icine, argued that the best way to study cancer is to more than 95% of cases, details of their treat-
learn from everyone who has cancer. ments are in paper medical records, file drawers,
or electronic systems that are not connected to
each other. Often, the systems in which useful
Survivorship and End-of-Life Care information is currently stored cannot be easily
integrated. The American Association of Clinical
Cancer survivors face physical, psychological, Oncology is working to overcome this barrier
social, and financial difficulties after treatment and has developed software that can accept infor-
and for the remaining years of their lives. As sci- mation from multiple formats of electronic health
ence advances, people are surviving cancer and records. A prototype system has collected
living in remission. A comprehensive database on 100,000 breast cancer records from 27 oncology
cancer survivorship could be used to develop, test, groups. Third, traditional laboratory research is
and maintain patient navigation systems to facili- necessary to understand the context and meaning
tate optimal care for cancer survivors. of the information that comes from the analysis
Treating or curing cancer is not always possi- of big data. Large data sets allow researchers to
ble. Health care providers typically base patient explore correlations or relationships between
assessments on past experiences and the best data variables of interest. Danah Boyd and Kate
available for a given condition. Aggregate data Crawford point out that data are often reduced
can be used to create algorithms to model the to what can fit into a mathematical model. Taken
severity of illness and predict outcomes. This out of context, results lose meaning and value.
can assist doctors and families who are making The experimental designs of clinical trials will
decisions about end-of-life care. Detailed infor- ultimately allow researchers to show causation
mation, based on a large number of cases, can and identify variables that cause cancer. Bigger
allow for more informed decision making. For data, in this case more data, is not always better.
example, if a provider is able to tell a patient’s Fourth, patient privacy and security of informa-
family with confidence that it is extremely tion must be prioritized at all levels. Patients are,
unlikely that the patient will survive, even with and will continue to be, concerned with how
radical treatment, this eases the discussion about genetic and medical profiles are secured and
palliative care. who will have access to their personal
information.

Challenges and Limitations


Cross-References
The ability to search, aggregate, and cross-refer-
ence large data sets has a number of advantages ▶ Evidence-Based Medicine
in the prevention and treatment of cancer. Yet, ▶ Health Care Delivery
there are multiple challenges and limitations to ▶ Nutrition
the use of big data in this domain. First, we are ▶ Prevention
limited to the data that is available. The data set ▶ Treatment
Further Reading

Murdoch, T. B., & Detsky, A. S. (2013). The inevitable application of big data to health care. Journal of the American Medical Association, 309(13), 1351–1352.


Cell Phone Data

Ryan S. Eanes
Department of Business Management, Washington College, Chestertown, MD, USA

Cell phones have been around since the early 1980s, but their popularity and ubiquity expanded dramatically at the turn of the twenty-first century as prices fell and coverage improved; this demand for mobile phones has steadily grown worldwide. The introduction of the iPhone and subsequent smartphones in 2007, however, drove dramatic change within the underlying functionality of cellular networks, given these devices' data bandwidth requirements for optimal function. Prior to the advent of the digital GSM (Global System for Mobile Communications, originally Groupe Spécial Mobile) and CDMA (code division multiple access) networks in Europe and North America in the 1990s, cell phones solely utilized analog radio-based technologies. While modern cellular networks still use radio signals for the transmission of information, the content of these transmissions has changed to packets of digital data. Indeed, the amount of data generated, transmitted, and received by cell phones is tremendous, given that virtually every cell phone handset sold today utilizes digital transmission technologies. Furthermore, cellular systems continue to be upgraded to handle this enormous (and growing) volume of data.
This switch to digitally driven systems represents both significant improvements in data bandwidth and speeds for users as well as potential new products, services, and areas of new research based on the data generated by our devices. Therefore, this article will first outline in general terms the mechanics of cell phone data transmission and reception, and the ways in which this data volume is managed via cellular networks. Consideration will also be given to consumer-facing industry practices related to cell data access, including data pricing and roaming charges. The article will conclude with a brief examination of the types of information that can be collected on cell phone users who access and utilize cellular data services.


Cell Phone Data Transmission and Reception

Digital cell phones still rely on radio technology for reception and transmission, just as analog cell phones do. However, digital information, which simply consists of 1's and 0's, is much more easily compressed than analog content; in other words, more digital "stuff" can fit into the same radio transmission that might otherwise only carry a solitary analog message. This switch to digital has meant that the specific radio bandwidths reserved for cell phone transmissions can now handle many more messages than analog technologies could previously.
When a call is placed using a digital cell phone on certain systems, including the AT&T and T-Mobile GSM networks in the United States, the cell phone's onboard analog-to-digital converter (ADC) converts the analog sound wave into a digital signal, which is then compressed and transmitted via radio frequency (RF) to the closest cell phone tower, with GSM calls in the USA utilizing the 850 MHz and 1.9 GHz bands. As mentioned, multiple cell phone signals can occupy the same radio frequency thanks to time division multiple access, or TDMA. For example, say that three digital cell phone users are placing calls simultaneously and have been assigned to the same radio frequency. A TDMA system will break the radio wave into three sequential timeslots that repeat in succession, and bits of each user's signal will be assigned and transmitted in the proper slot. The TDMA technique is often combined with wideband transmission techniques and "frequency hopping," or rapidly switching between available frequencies, in order to minimize interference.

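The time-slot sharing described above can be illustrated with a toy simulation. The three callers and their bit streams below are purely hypothetical, and real TDMA framing, synchronization, and error handling are considerably more involved.

```python
# Illustrative only: three callers sharing one carrier frequency via repeating
# time slots, as in the TDMA scheme described above.
callers = {
    "A": list("AAAAAA"),   # stand-ins for each caller's digitized voice bits
    "B": list("BBBBBB"),
    "C": list("CCCCCC"),
}

frame = []  # the shared radio channel: slot 0 -> A, slot 1 -> B, slot 2 -> C, repeat
for chunk in zip(callers["A"], callers["B"], callers["C"]):
    frame.extend(chunk)

print("transmitted:", "".join(frame))          # ABCABCABC...

# The receiver reassembles each call by reading every third slot
for offset, name in enumerate(callers):
    print(name, "->", "".join(frame[offset::3]))
```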
Other types of digital networks, such as the CDMA ("code division multiple access") system employed by Verizon, Sprint, and most other American carriers, take an entirely different approach. Calls originate in the same manner: a caller's voice is converted to a digital signal by the phone's onboard ADC. However, outgoing data packets generated by the ADC are tagged with a unique identifying code, and these small packets are transmitted over a number of the frequencies available to the phone. In the USA, CDMA transmissions occur on the 800 MHz and 1.9 GHz frequency bands; each of these bands consists of a number of possible frequencies (e.g., the specific frequency 806.2 MHz is part of the larger 800 MHz band). Thus, a CDMA call's packets might be transmitted on a number of frequencies simultaneously, such as 806.2 MHz, 808.8 MHz, 811.0 MHz, and so forth, as long as these frequencies are confined to the specific band being used by the phone. The receiver at the other end of the connection uses the unique identifying code that tags each packet to reassemble the message.
Because of the increased demands for data access that smartphones and similar technologies put on cell phone networks, a third generation of digital transmission technology – often referred to as "3G" – was created, which includes a variety of features to facilitate faster data transfer and to handle larger multimedia files. 3G is not in and of itself a cellular standard, but rather a group of technologies that conform to specifications issued by the International Telecommunication Union (ITU). Widely used 3G technologies include the UMTS ("Universal Mobile Telecommunications System") system, used in Europe, Japan, and China, and EDGE, DECT, and CDMA2000, used in the United States and South Korea. Most of these technologies are backwards compatible with earlier 2G technologies, ensuring interoperability between handsets.
4G systems are the newest standards and include the LTE ("Long-Term Evolution") and WiMAX standards. Besides offering much faster data transmission rates, 4G systems can move much larger packets of data much faster; they also operate on totally different frequency bands than previous digital cell phone systems. Each of these 4G systems builds upon modifications to previous technologies; WiMAX, for example, uses OFDM (orthogonal frequency division multiplexing), a technique similar to CDMA that divides data into multiple channels, with the transmission recombined at the destination. However, WiMAX was a standard built from scratch and has proven slow and difficult to deploy, given the expense of building new infrastructure. Many industry observers, however, see LTE as the first standard that could be adopted universally, given its flexibility and ability to operate on a wide range of radio bands (from 700 MHz to 2.6 GHz); furthermore, LTE could build upon existing infrastructure, potentially reaching a much wider range of users in short order.


Data Access and the Telecom Industry

Modern smartphones require robust, high-speed, and consistent access to the Internet in order for users to take full advantage of all of their features; as these devices have increased significantly in popularity, the rollout of advanced technologies such as 4G LTE has accelerated in recent years. Indeed, it is not uncommon for carriers to market themselves on the basis of the strengths of their respective networks; commercials touting network superiority are not at all uncommon.
Despite these advances and the ongoing development of improved digital transmission technologies via radio, which has produced a significant reduction in the costs associated with operating cellular networks, American cellular companies continue to charge relatively large amounts for access to their networks, particularly when compared to their European counterparts. Most telecom companies charge smartphone users a data fee on top of their monthly cellular service charges simply for the "right" to access data, despite the fact that smartphones necessarily require data access in order to fully function. Consider this example: as of this writing, for a new smartphone subscriber, AT&T charges $25 per month for access to just 1 gigabyte of data (extra fees are charged if one exceeds this allotment) on top of
Because of the increased demands for data access that smartphones and similar technologies put on cell phone networks, a third generation of digital transmission technology – often referred to as "3G" – was created, which includes a variety of features to facilitate faster data transfer and to handle larger multimedia files. 3G is not in and of itself a cellular standard, but rather a group of technologies that conform to specifications issued by the International Telecommunication Union (ITU). Widely used 3G technologies include the UMTS ("Universal Mobile Telecommunications System") system, used in Europe, Japan, and China, and EDGE, DECT, and CDMA2000, used in the United States and South Korea. Most of these technologies are backwards compatible with earlier 2G technologies, ensuring interoperability between handsets.

4G systems are the newest standards and include the LTE ("Long-Term Evolution") and WiMAX standards. Besides offering much faster data transmission rates, 4G systems can move much larger packets of data much faster; they also operate on totally different frequency bands than older technologies, which requires carriers to take on the expense of building new infrastructure. Many industry observers, however, see LTE as the first standard that could be adopted universally, given its flexibility and ability to operate on a wide range of radio bands (from 700 MHz to 2.6 GHz); furthermore, LTE could build upon existing infrastructure, potentially reaching a much wider range of users in short order.

Data Access and the Telecom Industry

Modern smartphones require robust, high-speed, and consistent access to the Internet in order for users to take full advantage of all of their features; as these devices have increased significantly in popularity, the rollout of advanced technologies such as 4G LTE has accelerated in recent years. Indeed, it is not uncommon for networks to market themselves on the basis of the strengths of their respective networks; commercials touting network superiority are not at all uncommon.

Despite these advances and the ongoing development of improved digital transmission technologies via radio, which has produced a significant reduction in the costs associated with operating cellular networks, American cellular companies continue to charge relatively large amounts for access to their networks, particularly when compared to their European counterparts. Most telecom companies charge smartphone users a data fee on top of their monthly cellular service charges simply for the "right" to access data, despite the fact that smartphones necessarily require data access in order to fully function. Consider this example: as of this writing, for a new smartphone subscriber, AT&T charges $25 per month for access to just 1 gigabyte of data (extra fees are charged if one exceeds this allotment) on top of monthly subscription fees. Prices rapidly escalate from there; 6 gigabytes cost $80 a month, and 20 gigabytes go for $150 a month. Lower prices might be had if a customer is able to find a service that offers a "pay as you go" model rather than a contractual agreement; however, these types of services have some downsides, may not be as much of a bargain as advertised, and may not fully take advantage of smartphone capabilities. Technology journalist Rick Broida, for example, notes that MMS (multimedia messaging service) messages and visual voicemail do not work on certain no-contract carriers.

Many customers outside of the United States, on the other hand, particularly those in Europe, purchase their handsets individually and can freely choose which carrier's SIM card to install; as a result, data prices are much lower and extremely competitive (though handsets themselves are much more expensive, as American carriers are able to largely subsidize handset costs by committing users to multi-year contracts). In fact, the European Union addressed data roaming charges in recent years and has put caps in place; as a result, as of July 1, 2014, companies may not charge more than €0.20 (approximately US$0.26) per megabyte for cellular data. Companies are free to offer lower rates, and many do; travel writer Donald Strachan notes that a number of prepaid SIM cards can be had that offer 2 gigabytes of data for as little as €10 (approximately US$13).

Data from Data

The digitalization of cell phones has had another consequence: digital cell phones interacting with digital networks produce a tremendous amount of data that can be analyzed for a variety of purposes. In particular, call detail records, or CDRs, are generated automatically when a cell phone connects to a network and receives or transmits; despite what the name might suggest, CDRs are generated for text messages as well as phone calls. CDRs contain a variety of metadata related to a call, including the phone numbers of the originator and receiver, call time and duration, routing information, and so forth, and are often audited by wireless networks to facilitate billing and to identify weaknesses in infrastructure.

Law enforcement agencies have long used CDRs to identify suspects, corroborate alibis, reveal behavior patterns, and establish associations with other individuals, but more recently, scholars have begun to use CDRs as a data source that can reveal information about populations. Becker et al., for example, used anonymized CDRs from cell phone users in Los Angeles, New York, and San Francisco to better understand a variety of behaviors related to human mobility, including daily travel habits, traffic patterns, and carbon emission generation; indeed, this type of work could have significant implications for urban planning, mass transit planning, alleviation of traffic congestion, combatting of carbon emissions, and more.

That said, CDRs are not the only digital "fingerprints" that cell phone users – and particularly smartphone users – leave behind as they use apps, messaging services, and the World Wide Web via their phones. Users, even those who are not knowingly or actively generating content, nevertheless create enormous amounts of information in the form of Instagram posts, Twitter messages, emails, Facebook posts, and more, virtually all of which can be identified as having originated from a smartphone thanks to meta tags and other hidden data markers (e.g., EXIF data). In some cases, these hidden data are quite extensive and can include such information as the user's geolocation at the date and time of posting, the specific platform used, and more. Furthermore, many of these data can be harvested and examined without individual users ever knowing; Twitter sentiment analysis, for example, can be conducted specifically on messages generated via mobile platforms.

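One way to see the kind of hidden markers just described is to read the EXIF block embedded in a photo taken with a phone. The sketch below uses the Pillow library; the file name is hypothetical, and the exact EXIF-reading calls vary somewhat across Pillow versions.

```python
# Minimal sketch: read EXIF tags (including the GPSInfo block) from a JPEG.
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def read_exif(path):
    """Return the photo's EXIF tags as a {tag name: value} dictionary."""
    exif = Image.open(path)._getexif() or {}
    named = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
    # GPSInfo is itself a dictionary of GPS sub-tags (latitude, longitude,
    # timestamp, ...) - the geolocation markers discussed above.
    if "GPSInfo" in named:
        named["GPSInfo"] = {GPSTAGS.get(k, k): v for k, v in named["GPSInfo"].items()}
    return named

tags = read_exif("photo_from_phone.jpg")  # hypothetical file
print(tags.get("DateTime"), tags.get("GPSInfo"))
```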
Passive collection of data generated by cell phones is not the only method available for studying cell phone data. Some intrepid researchers and research firms, recognizing that cell phones are a rich source of information on people, their interpersonal interactions, and their mobility, have developed various pieces of software that can (voluntarily) be installed on smartphones to facilitate the tracking of subjects and their behaviors. Such studies have included the collection of communiqués as well as proximity data; in fact, researchers have found that this type of data alone is, in many cases, enough to infer friendships and close interpersonal relationships between individuals (see Eagle, Pentland, and Lazer for one example). The possibilities for such research driven by data collected via user-installed apps are potentially limitless, for both academia and industry; however, as such efforts undoubtedly increase, it will be important to ensure that end users are fully aware of the risks inherent in disclosing personal information and that users have fully consented to participation in data collection activities.

Cross-References

▶ Cell Phone Data
▶ Data Mining
▶ Network Data

Further Reading

Ahmad, A. (2005). Wireless and mobile data networks. Hoboken: Wiley.
Becker, R., et al. (2013). Human mobility characterization from cellular network data. Communications of the ACM, 56(1), 74. https://doi.org/10.1145/2398356.2398375.
Broida, R. Should you switch to a no-contract phone carrier? CNET. http://www.cnet.com/news/have-you-tried-a-no-contract-phone-carrier/. Accessed Sept 2014.
Cox, C. (2012). An introduction to LTE: LTE, LTE-advanced, SAE and 4G mobile communications. Chichester: Wiley.
Eagle, N., Pentland, A. (Sandy), & Lazer, D. (2009). Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences of the United States of America, 106(36), 15274. https://doi.org/10.1073/pnas.0900282106.
Gibson, J. D. (Ed.). (2013). Mobile communications handbook (3rd ed.). Boca Raton: CRC Press.
Mishra, A. R. (2010). Cellular technologies for emerging markets: 2G, 3G and beyond. Chichester: Wiley.
Strachan, D. The best local SIM cards in Europe. The Telegraph. http://www.telegraph.co.uk/travel/travel-advice/9432416/The-best-local-SIM-cards-in-Europe.html. Accessed Sept 2014.
Yi, S. J., Chun, S. D., Lee, Y. D., Park, S. J., & Jung, S. H. (2012). Radio protocols for LTE and LTE-advanced. Singapore: Wiley.
Zhang, Y., & Arvidsson, Å. (2012). Understanding the characteristics of cellular data traffic. ACM SIGCOMM Computer Communication Review, 42(4), 461. https://doi.org/10.1145/2377677.2377764.

Census Bureau (U.S.)

Stephen D. Simon
P. Mean Consulting, Leawood, KS, USA

The United States Bureau of the Census (hereafter Census Bureau) is a federal agency that produces big data that is of direct value and which also provides the foundation for analyses of other big data sources. It also produces information critical for geographic information systems in the United States.

The Census Bureau is mandated by Article I, Section II of the US Constitution to enumerate the population of the United States to allow the proper allocation of members of the House of Representatives to each state. This census was first held in 1790 and then every 10 years afterwards. Full data from the census is released 72 years after the census was held. With careful linking across multiple censuses, researchers can track individuals such as Civil War veterans (Costa et al. 2017) across their full lifespan, or measure demographic changes in narrowly defined geographic regions, such as marriage rates during the boll weevil epidemic of the early 1900s (Bloome et al. 2017).

For more recent censuses, samples of microdata are available, though with steps taken to protect confidentiality (Dreschler and Reiter 2012). Information from these sources as well as census microdata from 79 other countries is available in a standardized format through the Integrated Public Use Microdata Series International Partnership (Ruggles et al. 2015).

Starting in 1940, the Census Bureau asked additional questions for a subsample of the census. These questions, known informally as "the long form," asked about income, occupation, education, and other socioeconomic issues. In 2006, the long form was replaced with the American Community Survey (ACS), which covered similar issues, but which was run continuously rather than once every 10 years (Torrieri 2007). The ACS has advantages associated with the timeliness of the data, but some precision was lost compared to the long form (Spielman et al. 2014; Macdonald 2006).

Both the decennial census and the ACS rely on the Master Address File (MAF), a list of all the addresses in the United States where people might live. The MAF is maintained and updated by the Census Bureau from a variety of sources, but predominantly the delivery sequence file of the United States Postal Service (Loudermilk and Li 2009).

Data from the MAF are aggregated into contiguous geographic regions. The regions are chosen to follow, whenever possible, permanent visible features like streets, rivers, and railroads and to avoid crossing county or state lines, with the exception of regions within Indian reservations (Torrieri 1994, Chapter 10). The geographic regions defined by the Census Bureau have many advantages over other regions, such as those defined by zip codes (Krieger et al. 2002). Shapefiles for various census regions are available for free download from the Census Bureau website.

The census block, the smallest of these regions, typically represents what would normally be considered a city block in an urban setting, though the size might be larger for suburban and rural settings. There are many census blocks with zero reported population, largely because the areas are uninhabitable or because residence is prohibited (Freeman 2014).

Census blocks are aggregated into block groups that contain roughly 600 to 3000 people. The census block group is the smallest geographic region for which the Census Bureau provides aggregate statistics and sample microdata (Croner et al. 1996).

Census block groups are aggregated into census tracts. Census tracts are relatively homogeneous in demographics and self-contained within county boundaries or American Indian reservations. Tracts are relatively stable over time, with merges and partitions as needed to keep the number of people in a census tract reasonably close to 4000 (Torrieri 1994, Chapter 10).

The Census Bureau also aggregates geographic regions into Metropolitan Statistical Areas (MSA), categorizes regions on an urban/rural continuum, and clusters states into broad national regions. All of these geographic regions provide a framework for many big data analyses and help make research more uniform and replicable.

The geographic aggregation is possible because of another product that is of great value to big data applications, the Topologically Integrated Geographic Encoding and Referencing (TIGER) System (Marx 1990). The TIGER System, a database of land features like roads and rivers and administrative boundaries like county and state lines, has formed the foundation of many commercial mapping products used in big data analysis (Croner et al. 1996). The TIGER system allows many useful characterizations of geographic regions, such as whether a region contains a highway ramp, a marker of poor neighborhood quality (Freisthler et al. 2016), and whether a daycare center is near a busy road (Houston et al. 2006).

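The nested geographies described above can be worked with programmatically. The sketch below is illustrative only: the GEOID strings and population counts are made up, and in real work they would come from Census Bureau files or TIGER shapefiles (which can be read with tools such as geopandas.read_file).

```python
# Rolling hypothetical block-group records up the census geography hierarchy.
# A block-group GEOID is built from its parents: state (2 digits) + county (3)
# + tract (6) + block group (1).
import pandas as pd

block_groups = pd.DataFrame({
    "geoid": ["290950001001", "290950001002", "290950002001"],  # hypothetical
    "population": [1200, 950, 2100],
})

block_groups["tract"] = block_groups["geoid"].str[:11]   # state + county + tract
block_groups["county"] = block_groups["geoid"].str[:5]   # state + county

# Aggregating block groups to tracts and counties mirrors the hierarchy
# described above.
print(block_groups.groupby("tract")["population"].sum())
print(block_groups.groupby("county")["population"].sum())
```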
The ACS is the flagship survey of the Census Bureau and has value in and of itself, but it is also important in supplementing other big data sources. The ACS is a self-report mail survey with a telephone follow-up for incomplete or missing surveys. It targets roughly 300,000 households per month. Response to the ACS is mandated by law, but the Census Bureau does not enforce this mandate. The ACS releases 1-year summaries for large census regions, 3-year summaries for smaller census regions, and 5-year summaries for every census region down to the block group level. This release schedule represents the inevitable trade-off between the desire for a large sample size and the desire for up-to-date information. The ACS has been used to describe health insurance coverage (Davern et al. 2009), patterns of residential segregation (Louf and Barthelemy 2016), and disability rates (Siordia 2015). It has also been used to supplement other big data analysis by developing neighborhood socioeconomic status covariates (Kline et al. 2017) and obtaining the denominators needed for local prevalence estimates (Grey et al. 2016). The National Academies Press has a detailed guide on how to use the ACS (Citro and Kalton 2007), available in book form or as a free PDF download.

The Census Bureau conducts many additional surveys in connection with other federal agencies. The American Housing Survey (AHS) is a joint effort with the Department of Housing and Urban Development that surveys both occupied and vacant housing units in a nationally representative sample and a separate survey of large MSAs. The AHS conducts computer-assisted interviews of roughly 47,000 housing units biennially. The AHS allows researchers to see whether mixed use development influences commuting choices (Cervero 1996) and to assess measures of the house itself (such as peeling paint) and the neighborhood (such as nearby abandoned buildings) that can be correlated with health outcomes (Jacobs et al. 2009).

The Current Population Survey, a joint effort with the Bureau of Labor Statistics, is a monthly survey of 60,000 people that provides unemployment rates for the United States as a whole and for local regions and specific demographic groups. The survey includes supplements that allow for the analysis of tobacco use (Zhu et al. 2017), poverty (Pac et al. 2017), food security (Jernigan et al. 2017), and health insurance coverage (Pascale et al. 2016).

The Consumer Expenditure Survey, also a joint effort with the Bureau of Labor Statistics, is an interview survey of major expenditure components combined with a diary study of detailed individual purchases that is integrated to provide a record of all expenditures of a family. The purchasing patterns form the basis for the market basket of goods used in computation of a measure of inflation, the Consumer Price Index. Individual level data from this survey allow for detailed analysis of purchasing habits, such as expenditures in tobacco-consuming households (Rogers et al. 2017) and food expenditures of different ethnic groups (Ryabov 2016).

The National Crime Victimization Survey, a joint effort with the Bureau of Justice Statistics, is a self-report survey of 160,000 households per year on nonfatal personal crimes and household property crimes. The survey has supplements for school violence (Musu-Gillette et al. 2017) and stalking (Menard and Cox 2016).

While the Census Bureau conducts its own big data analyses, it also provides a wealth of information to anyone interested in conducting large-scale nationally representative analyses. Statistics within the geographic regions defined by the Census Bureau serve as the underpinnings of analyses of many other big data sources. Finally, the Census Bureau provides free geographic information system resources through their TIGER files.

Further Reading

Bloome, D., Feigenbaum, J., & Muller, C. (2017). Tenancy, marriage, and the boll weevil infestation, 1892–1930. Demography, 54(3), 1029–1049.
Cervero, R. (1996). Mixed land-uses and commuting: Evidence from the American Housing Survey. Transportation Research Part A: Policy and Practice, 30(5), 361–377.
Citro, C. F., & Kalton, G. (Eds.). (2007). Using the American Community Survey: Benefits and challenges. Washington, DC: The National Academies Press.
Costa, D. L., DeSomer, H., Hanss, E., Roudiez, C., Wilson, S. E., & Yetter, N. (2017). Union army veterans, all grown up. Historical Methods, 50, 79–95.
Croner, C. M., Sperling, J., & Broome, F. R. (1996). Geographic Information Systems (GIS): New perspectives in understanding human health and environmental relationships. Statistics in Medicine, 15, 1961–1977.
Davern, M., Quinn, B. C., Kenney, G. M., & Blewett, L. A. (2009). The American Community Survey and health insurance coverage estimates: Possibilities and challenges for health policy researchers. Health Services Research, 44(2 Pt 1), 593–605.
Dreschler, J., & Reiter, J. P. (2012). Sampling with synthesis: A new approach for releasing public use census microdata. Journal of American Statistical Association, 105(492), 1347–1357.
Freeman, N. M. (2014). Nobody lives here: The nearly 5 million census blocks with zero population. http://tumblr.mapsbynik.com/post/82791188950/nobody-lives-here-the-nearly-5-million-census. Accessed 6 Aug 2017.
Freisthler, B., Ponicki, W. R., Gaidus, A., & Gruenewald, P. J. (2016). A micro-temporal geospatial analysis of medical marijuana dispensaries and crime in Long Beach California. Addiction, 111(6), 1027–1035.

Grey, J. A., Bernstein, K. T., Sullivan, P. S., Purcell, D. W., Chesson, H. W., Gift, T. L., & Rosenberg, E. S. (2016). Estimating the population sizes of men who have sex with men in US states and counties using data from the American Community Survey. JMIR Public Health Surveill, 2(1), e14.
Houston, D., Ong, P. M., Wu, J., & Winer, A. (2006). Proximity of licensed childcare to near-roadway vehicle pollution. American Journal of Public Health, 96(9), 1611–1617.
Jacobs, D., Wilson, J., Dixon, S. L., Smith, J., & Evens, A. (2009). The relationship of housing and population health: A 30-year retrospective analysis. Environmental Health Perspectives, 117(4), 597–604.
Jernigan, V. B. B., Huyser, K. R., Valdes, J., & Simonds, V. W. (2017). Food insecurity among American Indians and Alaska Natives: A national profile using the current population survey-food security supplement. Journal of Hunger and Environmental Nutrition, 12(1), 1–10.
Kline, K., Hadler, J. L., Yousey-Hindes, K., Niccolai, L., Kirley, P. D., Miller, L., Anderson, E. J., Monroe, M. L., Bohm, S. R., Lynfield, R., Bargsten, M., Zansky, S. M., Lung, K., Thomas, A. R., Brady, D., Schaffner, W., Reed, G., & Garg, S. (2017). Impact of pregnancy on observed sex disparities among adults hospitalized with laboratory-confirmed influenza, FluSurv-NET, 2010–2012. Influenza and Other Respiratory Viruses, 11(5), 404–411.
Krieger, N., Waterman, P., Chen, J. T., Soobader, M. J., Subramanian, S. V., & Carson, R. (2002). Zip code caveat: Bias due to spatiotemporal mismatches between zip codes and US census-defined geographic areas – The Public Health Disparities Geocoding Project. American Journal of Public Health, 92(7), 1100–1102.
Loudermilk, C. L., & Li, M. (2009). A national evaluation of coverage for a sampling frame based on the Master Address File. Proceedings of the Joint Statistical Meeting. American Statistical Association, Alexandria, VA.
Louf, R., & Barthelemy, M. (2016). Patterns of residential segregation. PLoS One, 11(6), e0157476.
Macdonald, H. (2006). The American Community Survey: Warmer (more current), but fuzzier (less precise) than the decennial census. Journal of the American Planning Association, 72(4), 491–503.
Marx, R. W. (1990). The Census Bureau's TIGER system. Cartography and Geographic Information Systems, 17(1), 17–113.
Menard, K. S., & Cox, A. K. (2016). Stalking victimization, labeling, and reporting: Findings from the NCVS stalking victimization supplement. Violence Against Women, 22(6), 671–691.
Musu-Gillette, L., Zhang, A., Wang, K., Zhang, J., & Oudekerk, B. A. (2017). Indicators of school crime and safety: 2016. https://www.bjs.gov/content/pub/pdf/iscs16.pdf. Accessed 6 Aug 2017.
Pac, J., Waldfogel, J., & Wimer, C. (2017). Poverty among foster children: Estimates using the supplemental poverty measure. Social Service Review, 91(1), 8–40.
Pascale, J., Boudreaux, M., & King, R. (2016). Understanding the new current population survey health insurance questions. Health Services Research, 51(1), 240–261.
Rogers, E. S., Dave, D. M., Pozen, A., Fahs, M., & Gallo, W. T. (2017). Tobacco cessation and household spending on non-tobacco goods: Results from the US Consumer Expenditure Surveys. Tobacco Control; pii: tobaccocontrol-2016-053424.
Ruggles, S., McCaa, R., Sobek, M., & Cleveland, L. (2015). The IPUMS collaboration: Integrating and disseminating the world's population microdata. Journal of Demographic Economics, 81(2), 203–216.
Ryabov, I. (2016). Examining the role of residential segregation in explaining racial/ethnic gaps in spending on fruit and vegetables. Appetite, 98, 74–79.
Siordia, C. (2015). Disability estimates between same- and different-sex couples: Microdata from the American Community Survey (2009–2011). Sexuality and Disability, 33(1), 107–121.
Spielman, S. E., Folch, D., & Nagle, N. (2014). Patterns and causes of uncertainty in the American Community Survey. Applied Geography, 46, 147–157.
Torrieri, N. K. (1994). Geographic areas reference manual. https://www.census.gov/geo/reference/garm.html. Accessed 7 Aug 2017.
Torrieri, N. (2007). America is changing, and so is the census: The American Community Survey. The American Statistician, 61(1), 16–21.
Zhu, S. H., Zhuang, Y. L., Wong, S., Cummins, S. E., & Tedeschi, G. J. (2017). E-cigarette use and associated changes in population smoking cessation: Evidence from US current population surveys. BMJ (Clinical Research Ed.), 358, j3262.

Centers for Disease Control and Prevention (CDC)

Stephen D. Simon
P. Mean Consulting, Leawood, KS, USA

The Centers for Disease Control and Prevention (CDC) is a United States government agency self-described as "the nation's health protection agency" (CDC 2017a). CDC responds to new and emerging health threats and conducts research to track chronic and acute diseases. Of greatest interest to readers of this article are the CDC efforts in surveillance using nationwide cross-sectional surveys to monitor the health of diverse populations (Frieden 2017).

These surveys are run annually, in some cases across more than five decades. The National Center for Health Statistics (NCHS), a branch of the CDC, either directly conducts or supervises the collection and storage of the data from most of these surveys.

The National Health Interview Survey (NHIS) conducts in-person interviews about the health status and health care access for 35,000 households per year, with information collected about the household as a whole and for one randomly selected adult and one randomly selected child (if one is available) in that household (Parsons et al. 2014). NHIS has been used to assess health insurance coverage (Martinez and Ward 2016), the effect of physical activity on health (Carlson et al. 2015), and the utilization of cancer screening (White et al. 2017).

The National Health and Nutrition Examination Survey (NHANES) conducts in-person interviews about the diet and health of roughly 5000 participants per year combined with a physical exam for each participant (Johnson et al. 2014). Sera, plasma, and urine are collected during the physical exam. Genetic information is extracted from the sera specimens, although consent rates among various ethnic groups are uneven (Gabriel et al. 2014). NHANES has been used to identify dietary trends in patients with diabetes (Casagrande and Cowie 2017), the relationship between inadequate hydration and obesity (Chang et al. 2016), and the association of Vitamin D levels and telomere length (Beilfuss et al. 2017).

The Behavioral Risk Factor Surveillance System (BRFSS) conducts telephone surveys of chronic conditions and health risk behaviors using random digit dialing (including cell phone numbers from 2008 onward) for 400,000 participants per year. This represents the largest telephone survey in the world (Pierannunzi et al. 2013). This survey has been used to identify time trends in asthma prevalence (Bhan et al. 2015), fall injuries among the elderly (Bergen et al. 2016), and mental health disparities between male and female caregivers (Edwards et al. 2016).

There are additional surveys of other patient populations as well as surveys of hospitals (both inpatient and emergency room visits), physician offices, and long-term care providers. The microdata from all of these surveys are publicly available, usually in compressed ASCII format and/or comma-separated value format. CDC also provides code in SAS, SPSS, and STATA for reading some of these files.

Since these surveys span many years, researchers can examine short- and long-term trends in health. Time trend analysis, however, does require care. The surveys can change from year to year in the sampling frame, the data collected, the coding systems, and the handling of missing values.

To improve efficiency, many of the CDC databases use a complex survey approach where geographic regions are randomly selected and then patients are selected within those regions. Often minority populations are oversampled to allow sufficient sample sizes in these groups. Both the complex survey design and the oversampling require use of specialized statistical analysis approaches (Lumley 2010; Lewis 2016).

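A minimal sketch of why the design weights matter when some groups are oversampled is given below. The column names and values are hypothetical; a proper analysis of CDC complex survey data also requires the strata and cluster variables and the specialized software discussed in the references above.

```python
# Comparing a naive (unweighted) prevalence with a design-weighted estimate.
import pandas as pd

respondents = pd.DataFrame({
    "has_condition": [1, 0, 0, 1, 0, 1],
    "sample_weight": [5200.0, 4800.0, 610.0, 590.0, 605.0, 4950.0],  # hypothetical
})

unweighted = respondents["has_condition"].mean()
weighted = (
    (respondents["has_condition"] * respondents["sample_weight"]).sum()
    / respondents["sample_weight"].sum()
)

# Oversampled respondents carry small weights; ignoring the weights lets them
# pull the estimate away from the population value.
print(f"unweighted prevalence: {unweighted:.3f}")
print(f"weighted prevalence:   {weighted:.3f}")
```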
The CDC has removed any information that could be used to personally identify individual respondents, particularly geocoding. Researchers requiring this level of information can apply for access through the NCHS Research Data Center (CDC 2017b).

The CDC maintains the National Death Index (NDI), a centralized database of death certificate data collected from each of the 50 states and the District of Columbia. The raw data is not available for public use, but researchers can apply for access that lets them submit a file of patients that they are studying to see which ones have died (CDC 2016). Many of the CDC surveys described above are linked automatically to NDI. While privacy concerns restrict direct access to the full information on the death certificate, the CDC does offer geographically and demographically aggregated data sets on deaths (and births) as well as reduced data sets on individual deaths and births with personal identifiers removed.

The CDC uses big data in its own tracking of infectious diseases. The US Influenza Hospitalization Surveillance Network (FluSurv-Net) monitors influenza hospitalizations in 267 acute care hospitals serving over 27 million people (Chaves et al. 2015).

Real time reporting on influenza is available at the FluView website (https://www.cdc.gov/flu/weekly/). This site reports on a weekly basis the outpatient visits, hospitalizations, and death rates associated with influenza. It also monitors the geographic spread, the strain type, and the drug resistance rates for influenza.

FoodNet tracks laboratory-confirmed foodborne illnesses in ten geographic areas with a population of 48 million people (Crim et al. 2015). The Active Bacterial Core Surveillance collects data on invasive bacterial infections in ten states representing up to 42 million people (Langley et al. 2015).

The hallmark of every CDC data collection effort is the great care taken in either tracking down every event in the regions being studied, or in collecting a nationally representative sample. These efforts ensure that researchers can extrapolate results from these surveys to the United States as a whole. These data sets, most of which are available at no charge, represent a tremendous resource to big data researchers interested in health surveillance in the United States.

Further Reading

Beilfuss, J., Camargo, C. A. Jr, & Kamycheva, E. (2017). Serum 25-Hydroxyvitamin D has a modest positive association with leukocyte telomere length in middle-aged US adults. Journal of Nutrition. https://doi.org/10.3945/jn.116.244137.
Bergen, G., Stevens, M. R., & Burns, E. R. (2016). Falls and fall injuries among adults aged 65 years – United States, 2014. Morbidity and Mortality Weekly Report, 65(37), 993–998.
Bhan, N., Kawachi, I., Glymour, M. M., & Subramanian, S. V. (2015). Time trends in racial and ethnic disparities in asthma prevalence in the United States from the Behavioral Risk Factor Surveillance System (BRFSS) Study (1999–2011). American Journal of Public Health, 105(6), 1269–1275. https://doi.org/10.2105/AJPH.2014.302172.
Carlson, S. A., Fulton, J. E., Pratt, M., Yang, Z., & Adams, E. K. (2015). Inadequate physical activity and health care expenditures in the United States. Progress in Cardiovascular Disease, 57(4), 315–323. https://doi.org/10.1016/j.pcad.2014.08.002.
Casagrande, S. S., & Cowie, C. C. (2017). Trends in dietary intake among adults with type 2 diabetes: NHANES 1988–2012. Journal of Human Nutrition and Dietetics. https://doi.org/10.1111/jhn.12443.
Centers for Disease Control and Prevention. (2016). About NCHS – NCHS fact sheets – National death index. https://www.cdc.gov/nchs/data/factsheets/factsheet_ndi.htm. Accessed 10 Mar 2017.
Centers for Disease Control and Prevention. (2017a). Mission, role, and pledge. https://www.cdc.gov/about/organization/mission.htm. Accessed 13 Feb 2017.
Centers for Disease Control and Prevention. (2017b). RDC – NCHS research data center. https://www.cdc.gov/rdc/index.htm. Accessed 6 Mar 2017.
Chang, T., Ravi, N., Plegue, M. A., Sonneville, K. R., & Davis, M. M. (2016). Inadequate hydration, BMI, and obesity among US adults: NHANES 2009–2012. Annals of Family Medicine, 14(4), 320–324. https://doi.org/10.1370/afm.195.
Chaves, S. S., Lynfield, R., Lindegren, M. L., Bresee, J., & Finelli, L. (2015). The US influenza hospitalization surveillance network. Emerging Infectious Diseases, 21(9), 1543–1550. https://doi.org/10.3201/eid2109.141912.
Crim, S. M., Griffin, P. M., Tauxe, R., Marder, E. P., Gilliss, D., Cronquist, A. B., et al. (2015). Preliminary incidence and trends of infection with pathogens transmitted commonly through food – Foodborne Diseases Active Surveillance Network, 10 U.S. Sites, 2006–2014. Morbidity and Mortality Weekly Report, 64(18), 495–499.
Edwards, V. J., Anderson, L. A., Thompson, W. W., & Deokar, A. J. (2016). Mental health differences between men and women caregivers, BRFSS 2009. Journal of Women & Aging. https://doi.org/10.1080/08952841.2016.1223916.
Frieden, T. (2017). A safer, healthier U.S.: The centers for disease control and prevention, 2009–2016. American Journal of Preventive Medicine, 52(3), 263–275. https://doi.org/10.1016/j.amepre.2016.12.024.
Gabriel, A., Cohen, C. C., & Sun, C. (2014). Consent to specimen storage and continuing studies by race and ethnicity: A large dataset analysis using the 2011–2012 National Health and Nutrition Examination Survey. Scientific World Journal. https://doi.org/10.1155/2014/120891.
Johnson, C. L., Dohrmann, S. M., Burt, V. L., & Mohadjer, L. K. (2014). National health and nutrition examination survey: Sample design, 2011–2014. Vital and Health Statistics, 2(162).
Langley, G., Schaffner, W., Farley, M. M., Lynfield, R., Bennett, N. M., Reingold, A. L., et al. (2015). Twenty years of active bacterial core surveillance. Emerging Infectious Diseases, 21(9), 1520–1528. https://doi.org/10.3201/eid2109.141333.
Lewis, T. H. (2016). Complex survey data analysis with SAS. New York: Chapman and Hall.
Lumley, T. (2010). Complex surveys. A guide to analysis using R. New York: Wiley.

Martinez, M. E., & Ward, B. W. (2016). Health care access and utilization among adults aged 18–64, by poverty level: United States, 2013–2015. NCHS Data Brief, 262, 1–8.
Parsons, V. L., Moriarity, C., Jonas, K., Moore, T. F., Davis, K. E., & Tompkins, L. (2014). Design and estimation for the national health interview survey, 2006–2015. Vital and Health Statistics, 165, 1–53.
Pierannunzi, C., Hu, S. S., & Balluz, L. (2013). A systematic review of publications assessing reliability and validity of the Behavioral Risk Factor Surveillance System (BRFSS), 2004–2011. BMC Medical Research Methodology. https://doi.org/10.1186/1471-2288-13-49.
White, A., Thompson, T. D., White, M. C., Sabatino, S. A., de Moor, J., Doria-Rose, P. V., et al. (2017). Cancer screening test use – United States, 2015. Morbidity and Mortality Weekly Report, 66(8), 201–206.

Charter of Fundamental Rights (EU)

Chiara Valentini
Department of Management, Aarhus University, School of Business and Social Sciences, Aarhus, Denmark

Introduction

The Charter of Fundamental Rights is a legal document that protects individuals and legal entities from actions that disregard fundamental rights. It covers personal, civic, political, economic, and social rights of people within the European Union. The Charter also safeguards so-called "third generation" fundamental rights, such as data protection, bioethics, and transparent administration matters, which includes protection from the misuse of massive datasets on individuals' online behaviors collected by organizations. Diverse organizations have taken advantage of large data sets and big data analytics to bolster competitiveness, innovation, market predictions, political campaigns, targeted advertising, scientific research, and policymaking, and to influence elections and political outcomes through, for instance, targeted communications (European Parliament 2017, February 20).

The Charter rights concern six macro areas: dignity, freedoms, equality, solidarity, citizens' rights, and justice. These six areas represent those "fundamental rights and freedoms recognized by the European Convention on Human Rights, the constitutional traditions of the EU member states, the Council of Europe's Social Charter, the Community Charter of Fundamental Social Rights of Workers and other international conventions to which the European Union or its member states are parties" (European Parliament 2001, February 21). Europeans can use judicial and political mechanisms to hold EU institutions, and in certain circumstances, member states, accountable in those situations where they do not comply with the Charter. The Charter can be used as a political strategy to influence decision-makers so as to develop policies and legislation that are in line with human rights standards (Butler 2013).

Historical Development

The Charter was drafted during the European Convention and was solemnly proclaimed by the three major EU decision-making institutions, that is, the European Parliament, the Council of the European Union, and the European Commission, on December 7, 2000. Before the Charter was written, the EU already had internal rules on human rights, but these were not incorporated in a legal document. They were only part of the general principles governing EU law. In practice, the lack of a legal document systematically addressing questions of human rights permitted a number of EU law infringements. For instance, in situations where member states needed to transpose EU law into their national ones, in some cases national courts refused to apply EU law, contending that it conflicted with rights protected by their national constitutions (Butler 2013). To solve the issue of EU law infringement as well as to harmonize EU legislation in relation to fundamental rights, the European Council entrusted a group of people during the 1999 Cologne meeting to form the European Convention, a body set up to deal with the task of drafting the Charter of Fundamental Rights (Nugent 2010; Butler 2013).

The endorsement of the Charter of Fundamental Rights by the three major EU political institutions sparked a new political discussion on whether the Charter should be included in the EU Constitutional Treaty, which was at the top of the political agenda in early 2000, and on whether the EU should sign up to the European Convention of Human Rights (ECHR). The Charter was amended a number of times and ended up not being included in the Constitutional Treaty (Nugent 2010). Nonetheless, several of the underpinning rights became legally binding with the entry into force of the Lisbon Treaty in December 2009. De facto, the Charter has gained some legal impact in the EU legal system. Today the Charter can regulate the activities of national authorities that implement EU laws at national level. But it cannot be used in cases of infringements of rights for actions dealing with national legislation. The Charter also has limited influence in certain countries that have obtained some opt-outs. Member states that are granted opt-outs are allowed not to implement certain EU policies. In March 2017, the European Parliament voted on a nonlegislative resolution about the fundamental rights implications of big data, including privacy, data protection, nondiscrimination, security, and law enforcement. Essentially, the resolution sets out recommendations for digital literacy, ethical frameworks, and guidelines for algorithmic accountability and transparency. It also seeks to foster cooperation among authorities, regulators, and the private sector and to promote the use of security measures like privacy by design and by default, anonymization techniques, encryption, and mandatory privacy impact assessments (European Parliament 2017, February 20).

Surveillance Practices and Protection of Human Rights

Because the rights in the Charter are binding EU legislation, the European Parliament, the Council of the European Union, and the European Commission have specialized bodies and procedures to help ensure that proposals are consistent with the Charter (Butler 2013). Furthermore, to increase awareness and knowledge of fundamental rights, the European Council decided on 13 December 2003 to extend the duties of an existing agency, the European Monitoring Centre on Racism and Xenophobia, to include the monitoring of human rights. A newly formed community agency, the European Union Agency for Fundamental Rights (FRA), was established in 2007 and is based in Vienna, Austria (CEC 2007, February 22). As declared, the main scope of this agency is "to collect and disseminate objective, reliable and comparable data on the situation of fundamental rights in all EU countries within the scope of EU law" (European Commission 2013, July 16). During the past years, the agency has been involved in investigating the status of surveillance practices in Europe, specifically in relation to the respect for private and family rights and the protection of personal data (FRA 2014, May). The agency receives the mandate to carry out investigations from the European Parliament. Its scope is to gather information on the status of privacy, security, and transparency in the EU. Specifically, the agency appraises when, how, and for which purposes member states collect data on the content of communications and metadata and follow citizens' electronic activities, in particular in their use of smartphones, tablets, and computers (European Parliament 2014, February 21). In 2014, the agency found that mass surveillance programs were in place in some member states, breaching EU fundamental rights (LIBE 2014). On the basis of the agency's investigative work, the European Parliament voted on a resolution addressing the issue of mass surveillance. The agency continues to recognize the importance of big data for today's information society as a way of boosting innovation, yet it acknowledges the importance of finding the right balance between the challenges linked to security and respect for fundamental rights, by helping EU policymakers and its member states with updated research on how large sets of data collections are conducted within the EU.

Cross-References

▶ European Commission
▶ European Commission: Directorate-General for Justice (Data Protection Division)
▶ European Union

Further Reading

Butler, I. (2013). The European charter of fundamental rights: What can I do? Background paper of the open society European Policy Institute. http://www.opensocietyfoundations.org/sites/default/files/eu-charter-fundamental-rights-20130221.pdf. Accessed on 10 Oct 2014.
CEC. (2007, February 22). Council Regulation (EC) No. 168/2007 of 15 February 2007 establishing a European Union Agency for Fundamental Rights. http://fra.europa.eu/sites/default/files/fra_uploads/74-reg_168-2007_en.pdf. Accessed on 10 Oct 2014.
European Commission. (2013, July 16). The European Union agency for fundamental rights. http://ec.europa.eu/justice/fundamental-rights/agency/index_en.htm. Accessed on 10 Oct 2014.
European Parliament. (2001, February 21). The charter of fundamental rights of the European Union. http://www.europarl.europa.eu/charter/default_en.htm. Accessed on 10 Oct 2014.
European Parliament. (2014, February 21). Report on the US NSA surveillance programme, surveillance bodies in various Member States and their impact on EU citizens' fundamental rights and on transatlantic cooperation in Justice and Home Affairs. http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//NONSGML+REPORT+A7-2014-0139+0+DOC+PDF+V0//EN. Accessed on 10 Oct 2014.
European Parliament. (2017, February 20). Report on fundamental rights implications of big data: Privacy, data protection, non-discrimination, security and law-enforcement (2016/2225-INI). http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//NONSGML+REPORT+A8-2017-0044+0+DOC+PDF+V0//EN. Accessed on 20 June 2017.
FRA. (2014, May). National intelligence authorities and surveillance in the EU: Fundamental rights safeguards and remedies. http://fra.europa.eu/en/project/2014/national-intelligence-authorities-and-surveillance-eu-fundamental-rights-safeguards-and. Accessed on 10 Oct 2014.
LIBE. (2014). Libe Committee inquiry. Electronic mass surveillance of EU citizens. Protecting fundamental rights in a digital age. Proceedings, outcome and background documents. Document of the European Parliament, http://www.europarl.europa.eu/document/activities/cont/201410/20141016ATT91322/20141016ATT91322EN.pdf. Accessed on 31 Oct 2014.
Nugent, N. (2010). The government and politics of the European Union (7th ed.). New York: Palgrave Macmillan.

Chemistry

Colin L. Bird and Jeremy G. Frey
Department of Chemistry, University of Southampton, Southampton, UK

Chemistry has always been data-dependent, but as computing power has increased, chemical science has become increasingly data-intensive, a development recognized by several contributors to the book edited by Hey, Tansley, and Tolle, The Fourth Paradigm (Hey et al. 2009). In one article, chemistry is given as one example of "a genuinely new kind of computationally driven, interconnected, Web-enabled science."

The study of chemistry can be perceived as endeavoring to obtain big information – big in the sense of significance – from data relating to molecules, which are small in the physical sense. The transition from data to big information is perhaps well illustrated by the role of statistical mechanics as we see the move from modeling departures from ideal gas behavior through to the measurement of single molecule properties: a journey from simple information about lots of similar molecules to complex information about individual molecules, paralleled in the development of machine learning from large data sets (Ramakrishnan et al. 2015; Barrett and Langdon 2006) (Fig. 1).

Chemistry, Fig. 1 Aspects of Big Data Chemistry – highlighting the rise of Big Data, the velocity (reactivity) of Big Data, and the ultimate extraction of a small amount of knowledge from this data

As the amounts of data available have increased to the extent that chemists now handle Big Data routinely, the influence of that data tracks the evolution of the topics or disciplines of chemometrics and cheminformatics. Starting from the application of statistical methods to the analysis and modeling of small but critical data, chemical science has moved with the increasing quantity and complexity of chemical data, to the creation of a chemical informatics discipline to handle the need to link complex heterogeneous data.

Chemists were generating large quantities of data before the end of the twentieth century, for example, with combinatorial chemistry experiments (Lowe 1995) and, at the beginning of this century, e-Science techniques opened up prospects of greater diversity in what was achievable. High throughput chemistry created yet more data at greater speed. The term Big Data has been employed for about the same length of time, although only more recently has it been used in chemistry. Combinatorial and high throughput methods led the way for chemists to work with even greater data volumes, the challenge being to make effective use of the flood of data, although there is a point of view that chemical data, although diverse and heterogeneous, is not necessarily "big" data.

In terms of the Gartner definition (Gartner), chemical data is high-variety, can be high-volume, and sometimes high-velocity. Four other "Vs" are also relevant to chemistry data: value, veracity, visualization, and virtual. The challenge for chemists is to make effective use of the broad range of information available from multiple sources, taking account of all seven "Vs". Chemists therefore take a "big picture" view of their data, whatever its numerical size. Pence and Williams point out the distinction between big databases and Big Data, noting that students in particular are familiar with searching the former but may be unaware of the latter and its significance (Pence and Williams 2016).

One potentially simplifying aspect of many large sets of complex and linked chemical data relating to physical and social processes is their characterization by power-law distributions. This has not been very apparent in chemical data so far, perhaps because of limited data volumes or bias within the collection of the data; however, some interesting distributions have been suggested, and this may lead to new approaches to curation and searching of large sets of linked chemical data (Benz et al. 2008).

The concept of chemical space puts into perspective not only the gamut of chemical structures but also the scale and scope of the associated chemical data. When chemists define a space with a set of descriptors, that space will be multidimensional and will comprise a large number of molecules of interest, resulting in a big dataset. For example, a chemical space often referred to in medicinal chemistry is that of potential pharmacologically active molecules. The size, even when limited to small drug-like molecules, is huge; estimates vary from 10^30 to 10^60 (the GDB-17 enumeration of chemical space, with up to 17 atoms of C, N, O, S, and halogens, contains 166.4 billion molecules (Reymond 2015)).

However, a key question for Big Data analytics is for how many of these molecules do we have any useable data? For comparison, the Cambridge Structural Database (CSD) has 875,000 curated structures (https://www.ccdc.cam.ac.uk), and the Protein Data Bank (PDB) over 130,000 structures (https://www.rcsb.org/pdb/statistics/holdings.do). PubChem (Bolton et al. 2016) reaches into the Big Data regime as it contains over 93 million compounds, but with differing quantity and quality of information for these entries (https://pubchem.ncbi.nlm.nih.gov/). Other comparisons of chemical data repositories are given in Tetko et al. (2016) and Bolstad et al. (2012).

The strategies for navigating the chemical space of potential molecules for new and different molecular systems have led to the development of a number of search/navigation strategies to cope with both the graphical and virtual nature (Sayle et al. 2013; Hall et al. 2017) and the size of the databases (Andersen et al. 2014).

Much of the discussion about the size of chemical space focuses on the molecular structures, but an even greater complexity is apparent in the discussion of the reactions that link these molecules, and many chemists desire the data to be bigger than it currently is. This is an area where the extraction of information from the literature is crucial (Schneider et al. 2016; Jessop et al. 2011; Swain and Cole 2016). The limited publication of reactions that do not work well, or the reporting of only a selection of the reaction conditions investigated, has been and will continue to be a major hindrance in the modeling of chemical reactivity [see for example the Dial-a-Molecule data initiative (http://generic.wordpress.soton.ac.uk/dial-a-molecule/phase-iii-themes/data-driven-synthesis/)].

Chemists, irrespective of the domain in which they are interested, will be concerned with many aspects: the preservation and curation of chemical data sets; accessing and searching for chemical data; analysis of the data, including interpreting spectroscopic data; and also with more specific information, such as thermodynamic data. Obtaining value from high volumes of varied and heterogeneous data, sometimes at high velocity, requires "cost-effective, innovative forms of information processing that enable enhanced insight, decision-making, and process automation" (Gartner). Chemical data might be multidimensional, might be virtual, might be of dubious validity, and might be difficult to visualize. The seven "Vs" appear again.

One approach that has gained popularity in recent years is to use Semantic Web techniques to deal programmatically with the meaning of the data, whether "Big" or of a more easily managed size (Frey and Bird 2013). However, chemists must acknowledge that such techniques, while potentially of great value, do lead to a major increase in the scale and complexity of the data held.

Medicinal chemistry, which comprises drug discovery, characterization, and toxicology, is the high-value field most commonly associated with Big Data in chemistry. The chemical space of candidate compounds contains billions of molecules, the efficient scanning of which requires advanced techniques, exploiting the broad variety of data sources. Data quality (veracity) is important. This vast chemical space is represented by large databases consisting of virtual compounds that require intelligent search strategies and smart visualization techniques.

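A minimal sketch of one such search strategy is a fingerprint-based similarity screen over a list of candidate structures, here using the open source RDKit toolkit mentioned below. The compound list is tiny and made up for illustration; real virtual libraries hold millions to billions of compounds and require far more sophisticated infrastructure.

```python
# Screening a toy SMILES library against a query molecule by Tanimoto similarity.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as the query
library = {
    "salicylic acid": "O=C(O)c1ccccc1O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "paracetamol": "CC(=O)Nc1ccc(O)cc1",
}

def fingerprint(mol):
    # Morgan (circular) fingerprints are a common choice for similarity screening.
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

query_fp = fingerprint(query)
for name, smiles in library.items():
    fp = fingerprint(Chem.MolFromSmiles(smiles))
    score = DataStructs.TanimotoSimilarity(query_fp, fp)
    print(f"{name}: Tanimoto similarity {score:.2f}")
```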
The future of drug studies is likely to lie in the integration of chemical, biological (molecular biology, genomics, metabolomics) (Bon and Waldmann 2010; Araki et al. 2008; Butte and Chen 2016; Tormay 2015; Dekker et al. 2013), materials, and environmental datasets (Buytaert et al. 2015), with toxicological predictions of new materials being a very significant application of Big Data (Hartung 2016). The validation of this data (experimental or calculated) is an important problem, which is exacerbated by the data volumes but potentially ameliorated by the comparisons between datasets that are opened up with increasing amounts of linked data. Automated structure validation (Spek 2009), the automated application of thermodynamics aided by ThermoML (https://www.nist.gov/mml/acmd/trc/thermoml), and potentially a change in approach toward making large datasets publicly available earlier (Ekins et al. 2012; Gilson et al. 2012), along with the databases cited earlier, will be influential.

Drug design and discovery is now highly dependent on web-based services, as reviewed by Frey and Bird in 2011 (Frey and Bird 2011). The article highlighted the growing need for services and systems able to cope with the vast amounts of chemical, biochemical, and bioinformatics information.

Tools for the manipulation of chemical data are very important. Open Babel (Banck et al. 2011) is one of the leading tools, converting between over 110 different formats, and the open source tools CDK (Han et al. 2003) and RDKit (http://www.RDKit.org) are widely used in the exploration of chemical data, but further work on tools optimized for very large datasets is ongoing.

Environmental chemistry is another field in which large volumes of data are processed as we tackle the issues raised by climate change and seek ways to mitigate the effect. Edwards et al. define big environmental data as large or complex sets of structured or unstructured data that might relate to an environmental issue (Edwards et al. 2015). They note that multiple and/or varied datasets might be required, presenting challenges for analysis and visualization.

Toxicology testing is as vital for environmental compounds as for drug candidates, and similar high throughput methods are used (Zhu et al. 2014), which is providing plenty of data for subsequent model building. Efficient toxicology testing of candidate compounds is of prime importance. Automated high throughput screening techniques make feasible the in vitro testing of up to 100,000 compounds per day (Szymański et al. 2012), thereby generating large amounts of data that require rapid – high velocity – analysis. Mohimani et al. have recently reported a new technique that enables high-throughput scanning of mass spectra to identify potential antibiotics and other bioactive peptides (Mohimani et al. 2017).

Computational chemistry uses modeling, simulation, and complex calculations to enable chemists to understand the structure, properties, and behavior of molecules and compounds. As such it is very much dependent on data, which might be virtual, varied, and high in volume. Molecular Dynamics simulations present significant Big Data challenges (Yeguas and Casado 2014).

Pence and Williams (2016) note that Big Data tools can enhance collaboration between separate research groups. The EU-funded BIGCHEM project considers how data sharing can operate effectively in the context of the pharmaceutical industry; the project "mainly aims to develop computational methods specifically for Big Data analysis", facilitating data sharing across institutions and companies (Tetko et al. 2016).

The Chemical Science community makes extensive use of large scale science research infrastructures (e.g., Synchrotron, Neutron, Laser), and the next generation of such facilities are already coming online (e.g., LCLS (https://lcls.slac.stanford.edu/), XFEL (https://www.xfel.eu/)), as are the massive raw datasets generated by cryo-EM (Belianinov et al. 2015). These experiments are generating data on a scale that will challenge even the data produced by CERN (https://home.cern/). The social and computational infrastructures to deal with these new levels of production of data are the next challenge faced by the community.

The chemical industry recognizes that harnessing Big Data will be vital for every sector, the two areas of particular interest being pricing strategy and market forecasting (ICIS Chemical Business 2013). Moreover, Big Data is seen as valuable in the search for new products that cause less emission or pollution (Lundia 2015). A recent article published by consultants KPMG asserts that the "global chemical industry has reached a tipping point" (Kaestner 2016). The article suggests that data and analytics should now be considered among the pillars of a modern business.

By using analytics to integrate information from a range of sources, manufacturers can increase efficiency and improve quality: "In developed markets, companies can use Big Data to reduce costs and deliver greater innovation in products and services."

The importance of educating present and future chemists about Big Data is gaining increased recognition. Pence and Williams argue that Big Data issues now pervade chemical research and industry to an extent that the topic should become a mandatory part of the undergraduate curriculum (2016). They acknowledge the difficulties of fitting a new topic into an already full curriculum, but believe it sufficiently important that the addition is necessary, suggesting that a chemical literature course might provide a medium.

Big Data in Chemistry should be seen in the context of the challenges and opportunities in the wider physical sciences (Clarke et al. 2016). Owing to limited space in this article, we have concentrated on the use of chemical Big Data for molecular chemistry. The complexity of the wider discussion of polymer and materials chemistry is elegantly illustrated by the discussion of the "Chemical gardens" which, as the authors state, "are perhaps the best example in chemistry of a self-organizing non-equilibrium process that creates complex structures" (Barge et al. 2015), and in this light, Whitesides highlights that "Chemistry is in a period of change, from an era focused on molecules and reactions, to one in which manipulations of systems of molecules and reactions will be essential parts of controlling larger systems" (Whitesides 2015).

Further Reading

Andersen, J. L., Flamm, C., Merkle, D., & Stadler, P. F. (2014). Generic strategies for chemical space exploration. International Journal of Computational Biology and Drug Design, 7(2–3), 225–258.
Araki, M., Gutteridge, A., Honda, W., Kanehisa, M., & Yamanishi, Y. (2008). Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13), i232–i240.
Banck, M., Hutchison, G. R., James, C. A., Morley, C., O'Boyle, N. M., & Vandermeersch, T. (2011). Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3, 33.
Barge, L. M., Cardoso, S. S., Cartwright, J. H., Cooper, G. J., Cronin, L., Doloboff, I. J., Escribano, B., Goldstein, R. E., Haudin, F., Jones, D. E., Mackay, A. L., Maselko, J., Pagano, J. J., Pantaleone, J., Russell, M. J., Sainz-Díaz, C. I., Steinbock, O., Stone, D. A., Tanimoto, Y., Thomas, N. L., & Wit, A. D. (2015). From chemical gardens to chemobrionics. Chemical Reviews, 115(16), 8652–8703.
Barrett, S. J., & Langdon, W. B. (2006). Advances in the application of machine learning techniques in drug discovery, design and development. In A. Tiwari, R. Roy, J. Knowles, E. Avineri, & K. Dahal (Eds.), Applications of soft computing. Advances in intelligent and soft computing (Vol. 36). Berlin/Heidelberg: Springer.
Belianinov, A., et al. (2015). Big data and deep data in scanning and electron microscopies: Deriving functionality from multidimensional data sets. Advanced Structural and Chemical Imaging, 1, 6. https://doi.org/10.1186/s40679-015-0006-6.
Benz, R. W., Baldi, P., & Swamidass, S. J. (2008). Discovery of power-laws in chemical space. Journal of Chemical Information and Modeling, 48(6), 1138–1151.
Bolstad, E. S., Coleman, R. G., Irwin, J. J., Mysinger, M. M., & Sterling, T. (2012). ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7), 1757–1768.
Bolton, E., Bryant, S. H., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Kim, S., Shoemaker, B. A., Thiessen, P. A., Wang, J., Yu, B., & Zhang, J. (2016). PubChem substance and compound databases. Nucleic Acids Research, 44, D1202–D1213.
Bon, R. S., & Waldmann, H. (2010). Bioactivity-guided navigation of chemical space. Accounts of Chemical Research, 43(8), 1103–1114.
Butte, A., & Chen, B. (2016). Leveraging big data to transform target selection and drug discovery. Clinical Pharmacology and Therapeutics, 99(3), 285–297.
Buytaert, W., El-khatib, Y., Macleod, C. J., Reusser, D., & Vitolo, C. (2015). Web technologies for environmental Big Data. Environmental Modelling and Software, 63, 185–198.
Clarke, P., Coveney, P. V., Heavens, A. F., Jäykkä, J., Korn, A., Mann, R. G., McEwen, J. D., Ridder, S. D., Roberts, S., Scanlon, T., Shellard, E. P., Yates, J. A., & Royal Society (2016). https://doi.org/10.1098/rsta.2016.0153.
Dekker, A., Ennis, M., Hastings, J., Harsha, B., Kale, N., Matos, P. D., Muthukrishnan, V., Owen, G., Steinbeck, C., Turner, S., & Williams, M. (2013). The ChEBI reference database and ontology for biologically relevant chemistry: Enhancements for 2013. Nucleic Acids Research, 41, D456–D463.
Edwards, M., Aldea, M., & Belisle, M. (2015). Big Data is changing the environmental sciences. Environmental Perspectives, 1. Available from http://www.exponent.com/files/Uploads/Documents/Newsletters/EP_2015_Vol1.pdf.
Ekins, S., Tkachenko, V., & Williams, A. J. (2012). Towards a gold standard: Regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discovery Today, 17(13–14), 685–701.
Frey, J. G., & Bird, C. L. (2011). Web-based services for drug design and discovery. Expert Opinion on Drug Discovery, 6(9), 885–895.
Frey, J. G., & Bird, C. L. (2013). Cheminformatics and the semantic web: Adding value with linked data and enhanced provenance. Wiley Interdisciplinary Reviews: Computational Molecular Science, 3(5), 465–481. https://doi.org/10.1002/wcms.1127.
Gartner. From the Gartner IT glossary: What is Big Data? Available from https://www.gartner.com/it-glossary/big-data.
Gilson, M. K., Liu, T., & Nicola, G. (2012). Public domain databases for medicinal chemistry. Journal of Medicinal Chemistry, 55(16), 6987–7002.
Groth, P. T., Gray, A. J., Goble, C. A., Harland, L., Loizou, A., & Pettifer, S. (2014). API-centric linked data integration: The Open PHACTS discovery platform case study. Web Semantics: Science, Services and Agents on the World Wide Web, 29, 12–18.
Hall, R. J., Murray, C. W., & Verdonk, M. L. (2017). The fragment network: A chemistry recommendation engine built using a graph database. Journal of Medicinal Chemistry, 60(14), 6440–6450. https://doi.org/10.1021/acs.jmedchem.7b00809.
Han, Y., Horlacher, O., Kuhn, S., Luttmann, E., Steinbeck, C., & Willighagen, E. L. (2003). The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. Journal of Chemical Information and Computer Sciences, 43(2), 493–500.
Hartung, T. (2016). Making big sense from big data in toxicology by read-across. ALTEX, 33(2), 83–93.
Hey, A., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond: Microsoft Research. ISBN 978-0-9825442-0-4.
http://generic.wordpress.soton.ac.uk/dial-a-molecule/phase-iii-themes/data-driven-synthesis/. Accessed 30 Oct 2017.
https://home.cern/. Accessed 30 Oct 2017.
https://lcls.slac.stanford.edu/. Accessed 30 Oct 2017.
https://pubchem.ncbi.nlm.nih.gov/. Accessed 30 Oct 2017.
https://www.ccdc.cam.ac.uk. Accessed 30 Oct 2017.
https://www.nist.gov/mml/acmd/trc/thermoml. Accessed 30 Oct 2017.
http://www.RDKit.org. Accessed 30 Oct 2017.
https://www.rcsb.org/pdb/statistics/holdings.do. Accessed 30 Oct 2017.
https://www.xfel.eu/. Accessed 30 Oct 2017.
ICIS Chemical Business. (2013). Big data and the chemical industry. Available from https://www.icis.com/resources/news/2013/12/13/9735874/big-data-and-the-chemical-industry/.
Jessop, D. M., Adams, S. E., Willighagen, E. L., Hawizy, L., & Murray-Rust, P. (2011). OSCAR4: A flexible architecture for chemical text-mining. Journal of Cheminformatics, 3, 41. https://doi.org/10.1186/1758-2946-3-41.
Kaestner, M. (2016). Big Data means big opportunities for chemical companies. KPMG REACTION, 16–29.
Lowe, G. (1995). Combinatorial chemistry. Chemical Society Review, 24, 309–317. https://doi.org/10.1039/CS9952400309.
Lundia, S. R. (2015). How big data is influencing chemical manufacturing. Available from https://www.chem.info/blog/2015/05/how-big-data-influencing-chemical-manufacturing.
Mohimani, H., et al. (2017). Dereplication of peptidic natural products through database search of mass spectra. Nature Chemical Biology, 13, 30–37. https://doi.org/10.1038/nchembio.2219.
Pence, H. E., & Williams, A. J. (2016). Big data and chemical education. Journal of Chemical Education, 93(3), 504–508. https://doi.org/10.1021/acs.jchemed.5b00524.
Coveney, P. V., Dougherty, E. R., & Highfield, R. R. (2016). Big data need big theory too. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2080), 20160153.
Ramakrishnan, R., Dral, P. O., Rupp, M., & Anatole von Lilienfeld, O. (2015). Big data meets quantum chemistry approximations: The Δ-machine learning approach. Journal of Chemical Theory and Computation, 11(5), 2087–2096. https://doi.org/10.1021/acs.jctc.5b00099.
Reymond, J. (2015). The chemical space project. Accounts of Chemical Research, 48(3), 722–730.
Sayle, R. A., Batista, J., & Grant, A. (2013). An efficient maximum common subgraph (MCS) searching of large chemical databases. Journal of Cheminformatics, 5(1), O15. https://doi.org/10.1186/1758-2946-5-S1-O15.
Schneider, N., Lowe, D. M., Sayle, R. A., Tarselli, M. A., & Landrum, G. A. (2016). Big data from pharmaceutical patents: A computational analysis of medicinal chemists' bread and butter. Journal of Medicinal Chemistry, 59(9), 4385–4402. https://doi.org/10.1021/acs.jmedchem.6b00153.
Spek, A. L. (2009). Structure validation in chemical crystallography. Acta Crystallographica. Section D, Biological Crystallography.
Swain, M. C., & Cole, J. M. (2016). ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. Journal of Chemical Information and Modeling, 56(10), 1894–1904. https://doi.org/10.1021/acs.jcim.6b00207.
Szymański, P., Marcowicz, M., & Mikiciuk-Olasik, E. (2012). Adaptation of high-throughput screening in drug discovery – Toxicological screening tests. International Journal of Molecular Sciences, 13, 427–452. https://doi.org/10.3390/ijms13010427.
Tetko, I. V., Engkvist, O., Koch, U., Reymond, J.-L., & Chen, H. (2016). BIGCHEM: Challenges and opportunities for big data analysis in chemistry. Molecular Informatics, 35, 615.
Tormay, P. (2015). Big data in pharmaceutical R&D: Creating a sustainable R&D engine. Pharmaceutical Medicine, 29(2), 87–92.
Whitesides, G. M. (2015). Reinventing chemistry. Angewandte Chemie, 54(11), 3196–3209.
Yeguas, V., & Casado, R. (2014). Big Data issues in computational chemistry. 2014 international conference on future internet of things and cloud. Available from http://ieeexplore.ieee.org/abstract/document/6984225/.
Zhu, H., et al. (2014). Big data in chemical toxicity research: The use of high-throughput screening assays to identify potential toxicants. Chemical Research in Toxicology, 27(10), 1643–1651. https://doi.org/10.1021/tx500145h.

Clickstream Analytics

Hans C. Schmidt
Pennsylvania State University – Brandywine, Philadelphia, PA, USA

Clickstream analytics is a form of Web usage mining that involves the use of predictive models to analyze records of individual interactions with websites.

These records, known as clickstreams, are gathered whenever a user connects to the Web and include the totality of a Web user's browsing history. A clickstream comprises a massive amount of data, making clickstream data a type of big data. Each click, keystroke, server response, or other action is logged with a time stamp and the originating IP address, as well as information about the geographic location of individual users, the referring or previously visited website, the amount of time spent on a website, the frequency of visits by users to a website, and the purchase history of users on a website. Technical details are also gathered, and clickstream records also often include information regarding the user's Web browser, operating system, screen size, and screen resolution. Increasingly, clickstream records also include data transmitted over smartphones, game consoles, and a variety of household appliances connected via the emerging "Internet of Things." The most comprehensive clickstream records are compiled by Internet service providers, which provide the portal through which most individual web traffic travels. Other clickstream data are logged by both benign and malicious parties through the use of JavaScript code, CGI scripts, and tracking cookies that embed signals within individual users' Web browsers. Such passive tracking methods help to differentiate between real human interaction and machine-generated Web traffic and also automatically share clickstream data across hundreds or thousands of affiliated websites.

Clickstream data are used for a variety of purposes. One of the primary ways in which clickstream data are used involves Web advertising. By analyzing a variety of metrics, such as a website's contact efficiency, which refers to the number of unique page views, and a website's conversion efficiency, which refers to the percentage of page views that lead to purchases, advertisers can decide where to place advertisements. Similarly, clickstream logs allow the performance of individual advertisements to be monitored by providing data about how often an advertisement is shown (impression), how often an ad is clicked on (click-through rate), and how many viewers of an ad proceed to buy the advertised product (conversion rate).

Clickstream logs are also used to create individual consumer and Web user profiles. Profiles are based on personal characteristics, such as address, gender, income, and education, and online behaviors, such as purchase history, spending history, search history, and click path history. Once profiles have been generated, they are used to help directly target advertisements, construct personalized Web search results, and create customized online shopping experiences with product recommendations.

Online advertising and e-commerce are not the only uses for clickstream analytics. Clickstream data are also used to improve website design and create faster browsing experiences. By analyzing the most frequent paths through which users navigate websites, designers can redesign Web pages to create a more intuitive browsing experience.
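The advertising metrics described above can be computed directly from a clickstream log. The following is a minimal sketch in Python, assuming a hypothetical log format in which each record is one page view with flags for whether an ad was shown, whether it was clicked, and whether the visit ended in a purchase; the field names are illustrative and not drawn from any particular analytics product.

from collections import Counter  # not required, shown only if per-page tallies are wanted

# Toy clickstream log: one dictionary per logged page view (hypothetical fields).
page_views = [
    {"user": "u1", "ad_shown": True,  "ad_clicked": True,  "purchased": False},
    {"user": "u2", "ad_shown": True,  "ad_clicked": False, "purchased": False},
    {"user": "u2", "ad_shown": False, "ad_clicked": False, "purchased": False},
    {"user": "u3", "ad_shown": True,  "ad_clicked": True,  "purchased": True},
]

contact_efficiency = len({v["user"] for v in page_views})                 # unique visitors generating page views
conversion_efficiency = sum(v["purchased"] for v in page_views) / len(page_views)  # share of page views ending in a purchase

impressions = sum(v["ad_shown"] for v in page_views)                      # how often the ad was shown
clicks = sum(v["ad_clicked"] for v in page_views)                         # how often the ad was clicked
click_through_rate = clicks / impressions if impressions else 0.0
conversion_rate = sum(v["purchased"] and v["ad_shown"] for v in page_views) / impressions if impressions else 0.0

print(contact_efficiency, conversion_efficiency, click_through_rate, conversion_rate)

In practice these counts would be aggregated per website or per advertisement rather than over a single small list, but the arithmetic is the same.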
Further, by knowing the most frequently visited websites, Internet service providers and local network administrators can increase connectivity speeds by caching popular websites and connecting users to the locally cached pages instead of directing them to the host server.

Clickstream analytics have also come to be used by law enforcement agencies and in national defense and counterterrorism initiatives. In such instances, similar tools are employed in order to identify individuals involved with criminal, or otherwise nefarious, activities in an online environment.

While clickstream analytics have become an important tool for many organizations involved with Web-based commerce, law enforcement, and national defense, the widespread use of personal data is not without controversy. Some object to the extensive collection and analysis of clickstream data because many Web users operate under the assumption that their actions and communications are private and anonymous and are unaware that data are constantly being collected about them. Similarly, some object because much clickstream data are collected surreptitiously by websites that do not inform visitors that data are being gathered or that tracking cookies are being used.

To this end, some organizations, like the Electronic Frontier Foundation, have advocated for increased privacy protections and suggested that Internet service providers should collect less user information. Similarly, many popular Web browsers now offer do-not-track features that are designed to limit the extent to which individual clickstreams are recorded. Yet, because there are so many points at which user data can be logged, and because technology is constantly evolving, new methods for recording online behavior continue to be developed and integrated into the infrastructure of the Web.

Cross-References

▶ Data Mining
▶ National Security Administration (NSA)
▶ Privacy

Further Reading

Croll, A., & Power, S. (2009). Complete web monitoring: Watching your visitors, performance, communities and competitors. Sebastopol: O'Reilly Media.
Jackson, S. (2009). Cult of analytics: Driving online marketing strategies using web analytics. Oxford: Butterworth-Heinemann.
Kaushik, A. (2007). Web analytics: An hour a day. Indianapolis, IN: Wiley.

Climate Change, Hurricanes/Typhoons/Cyclones

Patrick Doupe and James H. Faghmous
Arnhold Institute for Global Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Introduction

Hurricanes, typhoons, and tropical cyclones (hereafter TCs) refer to intense storms that start at sea over warm water and sometimes reach land. Since 1970, around 90 of these storms occur annually, causing large amounts of damage (Schreck et al. 2014). The expected annual global damage from these hurricanes is US$ 21 billion, affecting around 19 million people annually with approximately 17,000 deaths (Guha-Sapir et al. 2015). Damages from hurricanes also lower incomes, with a 90th percentile event reducing per capita incomes by 7.4% after 20 years (Hsiang and Jina 2014). This reduction is comparable with a banking crisis.

Despite these high costs, our current understanding of hurricanes is limited. For example, we lack basic understanding on cyclogenesis, or how cyclones form and how hurricanes link to basic variables like sea surface temperatures (SSTs). For instance, there is a positive correlation between SSTs and TC frequency in the North Atlantic Ocean but no relationship in the Pacific and Indian Ocean. Overall, the Intergovernmental Panel on Climate Change's (IPCC) current conclusion is that we have low confidence that there
are long-term robust changes in tropical cyclone activity (Pachauri et al. 2014).

Much of our limited understanding can be explained through an understanding of data. We currently have good, global data since 1970 on TC events (Schreck et al. 2014). This includes data on wind speed, air temperature, etc. The current data challenge is twofold: understanding the relationships between TCs and other variables and predicting future TCs. Progress on these topics will be made by understanding and overcoming data limitations. In this article, we show this through a review of the literature attempting to predict TCs.

Characteristics of TC Data

We can characterize TC data as being noisy, having a short-time series and large spatial dimensionality, and highly autocorrelated with rare events. This, in combination with limited understanding of the physical processes, means that it is easy for researchers to overfit data points.

The data in which we do have confidence is the relatively short-time series we have on TCs (Landsea 2007; Chang and Guo 2007). The most reliable data is satellite data, which provides us with information only back to 1970. Prior to the satellite era, storm reconnaissance was through coastal or ship observations. This data is not as reliable. Given that a large number of storms never reach land and that ships generally try to avoid storms, a large number of storms might have gone undetected. Another source of undercounts is low coastal population density during the earlier record (Landsea 2007). Furthermore, it has been suggested that a storm per year undercount as late as the 2003–2006 period is possible due to changes in data processing methods (Landsea 2007). Therefore, researchers face a trade-off: longer, less reliable datasets or shorter, more reliable datasets. This trade-off is important: when we control for this observation bias, global trends in cyclone frequency become insignificant (Knutson et al. 2010).

In addition to missing data, other factors confound our estimates of the relationship between the climate and cyclones. First, and in contrast to the short-time series, we have large spatial dimensionality. So although there is a reasonable amount of cyclones over time, when taken over space and time, we have few events for very many observations. Second, there are large amplitude fluctuations in present-day storms (Ding and Reiter 1981). Last, there are large knowledge gaps concerning the exact influence various climate factors have on TC activity (Gray and Brody 1967; Emanuel 2008). These characteristics of TC data mean that it is easy for researchers to overfit explanatory variables to poorly understood noise or autocorrelations in the data. We now investigate how these constraints affect cyclone forecasting.

Forecasting Cyclones

We can group TC predictions into three broad categories. First, centennial projections are simulations used to model TC activity under various warming scenarios. Centennial projections generally look at TC activity beyond the twenty-first century. Second, seasonal forecasts of TC activity are issued in December (for the Atlantic) the previous year and are periodically updated throughout the TC season. Last, short-term forecasts are issued 7–14 days before TC genesis and generally predict intensity and tracks instead of genesis.

For centennial projections, researchers use physics-based climate models. These models project a global decrease in the total number of TCs yet are highly uncertain in individual basins. These uncertainties stem from the major roadblocks outlined above. First, we don't understand relationships that lead to cyclogenesis (Knutson and Tuleya 2004; LaRow et al. 2008; Zhao et al. 2009) or the climate-TC feedback relationship (Pielke et al. 2005; Emanuel 2008). Second, the observations are too coarse (20–120 km) to fully model TC properties (Chen et al. 2007).

Seasonal basin-wide predictions of TC activity are issued as early as April in the previous year (to forecast activity in
August–October the following year). Forecasts are generated by both dynamical and statistical models. Similar to physics-based models, dynamical models predict the state of future climate, and the response of the TC-like vortices in the models is used to estimate future hurricane activity (Vitart 2006; Vitart et al. 2007; Vecchi et al. 2011; Zhao et al. 2009). An approach analogous to model simulations is the statistical approach, where one infers relationships solely based on observational data (Elsner and Jagger 2006; Klotzbach and Gray 2009; Gray 1984). These models have limitations based on TC data characteristics. One limitation is that, given the relatively short record of observational data, statistical models are subject to overfitting. Another limitation is that the relatively few events make it difficult to interpret a model's output. For instance, if a model predicts a below average season, all it takes is a single strong hurricane to inflict major damage, therefore rendering the forecast uninformative. This was the case in 1983, when hurricane Alicia struck land during a below average season (Elsner et al. 1998). It is no surprise then that these models have yet to impact climate science (Camargo et al. 2007).

Short-term forecasts are used mainly by weather services but have also received attention in the scientific literature. For dynamical models, Belanger et al. (2010) test the European Center for Medium-Range Weather Forecasts (ECMWF) Monthly Forecast System's (ECMFS) ability to predict Atlantic TC activity. For the 2008 and 2009 seasons, the model was able to forecast TCs a week in advance with skill above climatology for the Gulf of Mexico and the MDR on intraseasonal time scales.

Conclusion

We see that forecasting is constrained by characteristics of the data. These constraints provide fertile ground for the ambitious researcher. For instance, we do have good evidence about TC forecasts in the North Atlantic (Pachauri et al. 2014). This suggests that one potential route out of this bind is to focus on small spatial windows, rather than large basins. Rather than trying to identify relationships in an overparameterized, highly nonlinear, and autocorrelated environment, a focus on smaller, manageable windows may generate insights that can be scaled up.

Further Reading

Belanger, J. I., Curry, J. A., & Webster, P. J. (2010). Predictability of north Atlantic tropical cyclone activity on intraseasonal time scales. Monthly Weather Review, 138(12), 4362–4374.
Camargo, S. J., Barnston, A. G., Klotzbach, P. J., & Landsea, C. W. (2007). Seasonal tropical cyclone forecasts. WMO Bulletin, 56(4), 297.
Chang, E. K. M., & Guo, Y. (2007). Is the number of north Atlantic tropical cyclones significantly underestimated prior to the availability of satellite observations? Geophysical Research Letters, 34(14), L14801.
Chen, S. S., Zhao, W., Donelan, M. A., Price, J. F., & Walsh, E. J. (2007). The CBLAST-hurricane program and the next-generation fully coupled atmosphere–wave–ocean models for hurricane research and prediction. Bulletin of the American Meteorological Society, 88(3), 311–317.
Ding, Y. H., & Reiter, E. R. (1981). Large-scale circulation conditions affecting the variability in the frequency of tropical cyclone formation over the North Atlantic and the North Pacific Oceans. Fort Collins, CO: Colorado State University.
Elsner, J. B., & Jagger, T. H. (2006). Prediction models for annual U.S. hurricane counts. Journal of Climate, 19(12), 2935–2952.
Elsner, J. B., Niu, X., & Tsonis, A. A. (1998). Multi-year prediction model of north Atlantic hurricane activity. Meteorology and Atmospheric Physics, 68(1), 43–51.
Emanuel, K. (2008). The hurricane-climate connection. Bulletin of the American Meteorological Society, 89(5), ES10–ES20.
Gray, W. M. (1984). Atlantic seasonal hurricane frequency. Part I: El Niño and 30 mb quasi-biennial oscillation influences. Monthly Weather Review, 112(9), 1649–1668.
Gray, W. M., & Brody, L. (1967). Global view of the origin of tropical disturbances and storms. Fort Collins, CO: Colorado State University, Department of Atmospheric Science.
Guha-Sapir, D., Below, R., & Hoyois, P. (2015). EM-DAT: International disaster database. Brussels: Catholic University of Louvain.
Hsiang, S. M., & Jina, A. S. (2014). The causal effect of environmental catastrophe on long-run economic
growth: Evidence from 6,700 cyclones. Technical report, National Bureau of Economic Research.
Klotzbach, P. J., & Gray, W. M. (2009). Twenty-five years of Atlantic basin seasonal hurricane forecasts (1984–2008). Geophysical Research Letters, 36(9), L09711.
Knutson, T. R., & Tuleya, R. E. (2004). Impact of CO2-induced warming on simulated hurricane intensity and precipitation: Sensitivity to the choice of climate model and convective parameterization. Journal of Climate, 17(18), 3477–3495.
Knutson, T. R., McBride, J. L., Chan, J., Emanuel, K., Holland, G., Landsea, C., Held, I., Kossin, J. P., Srivastava, A. K., & Sugi, M. (2010). Tropical cyclones and climate change. Nature Geoscience, 3(3), 157–163.
Landsea, C. (2007). Counting Atlantic tropical cyclones back to 1900. Eos, Transactions American Geophysical Union, 88(18), 197–202.
LaRow, T. E., Lim, Y.-K., Shin, D. W., Chassignet, E. P., & Cocke, S. (2008). Atlantic basin seasonal hurricane simulations. Journal of Climate, 21(13), 3191–3206.
Pachauri, R. K., Allen, M. R., Barros, V. R., Broome, J., Cramer, W., Christ, R., Church, J. A., Clarke, L., Dahe, Q., & Dasgupta, P., et al. (2014). Climate change 2014: Synthesis report. Contribution of working groups I, II and III to the fifth assessment report of the Intergovernmental Panel on Climate Change. IPCC.
Pielke, R. A., Jr., Landsea, C., Mayfield, M., Laver, J., & Pasch, R. (2005). Hurricanes and global warming. Bulletin of the American Meteorological Society, 86(11), 1571–1575.
Schreck, C. J., III, Knapp, K. R., & Kossin, J. P. (2014). The impact of best track discrepancies on global tropical cyclone climatologies using IBTrACS. Monthly Weather Review, 142(10), 3881–3899.
Vecchi, G. A., Zhao, M., Wang, H., Villarini, G., Rosati, A., Kumar, A., Held, I. M., & Gudgel, R. (2011). Statistical-dynamical predictions of seasonal north Atlantic hurricane activity. Monthly Weather Review, 139(4), 1070–1082.
Vitart, F. (2006). Seasonal forecasting of tropical storm frequency using a multi-model ensemble. Quarterly Journal of the Royal Meteorological Society, 132(615), 647–666.
Vitart, F., Huddleston, M. R., Déqué, M., Peake, D., Palmer, T. N., Stockdale, T. N., Davey, M. K., Ineson, S., & Weisheimer, A. (2007). Dynamically-based seasonal forecasts of Atlantic tropical storm activity issued in June by EUROSIP. Geophysical Research Letters, 34(16), L16815.
Zhao, M., Held, I. M., Lin, S.-J., & Vecchi, G. A. (2009). Simulations of global hurricane climatology, interannual variability, and response to global warming using a 50-km resolution GCM. Journal of Climate, 22(24), 6653–6678.

Climate Change, Rising Temperatures

Elmira Jamei¹, Mehdi Seyedmahmoudian² and Alex Stojcevski²
¹College of Engineering and Science, Victoria University, Melbourne, VIC, Australia
²School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC, Australia

Climate Change and Big Data

In its broadest sense, the term climate refers to a statistical description and condition of weather, oceans, land surfaces, and glaciers (considering averages and extremes). Therefore, climate change is the alteration in the climate pattern over a long period of time due to both natural and human-induced activities.

The climate of the earth has changed over the past century. Global warming and increased air temperature have significantly altered the ocean, atmospheric condition, sea level, and glaciers. Global climate change, particularly its impact on lifestyle and public health, has become one of the greatest challenges of our era. Human activities and rapid urbanization are known as the main contributors of greenhouse gas emissions. The first scientific assessment of climate change, which was published in June 1990, is proof of this claim (Houghton et al. 1990). The report is a comprehensive statement on the scientific and climatic knowledge regarding the state of climate change and the role of mankind in exacerbating global warming (Intergovernmental Panel on Climate Change 2014).

To address this rapidly changing climate, there is an urgency to monitor the climate condition, forecast its behavior, and identify the most efficient adaptation and mitigation strategies against global warming. This need has already resulted in fruitful outcomes in certain fields, such as science, information technology, and participatory urban
planning. However, despite the urgency of data in climatology, the number of studies highlighting this necessity is lacking.

At present, the amount of climatic data collected rapidly increases. As the volume of climate data increases, data collection, representation, and implementation in decision-making have become equally important. One of the main challenges with the increased amount of climatic data lies in information and knowledge management of the collected data on an hourly or even second basis (Flowers 2013).

Big data analytics is one of the methods that help in data monitoring, modeling, and interpretation, which are necessary to better understand causes and effects of climate change and formulate appropriate adaptation strategies. Big data refers to different aspects of data, including data size (such as massive, rapid, or complex) and technological requirements for data processing and analysis.

Big data has had great success in various fields, such as advertisement and electronic commerce; however, big data is still less employed in climate science. In the era of explosively increasing global data, big data is used to explain and present the massive datasets. In contrast to conventional traditional datasets, big data requires real-time types of analysis. In addition, big data assists in exploring new opportunities and values and achieving an in-depth understanding of hidden values. Big data also addresses a few major questions regarding effective dataset organization and management (Chen et al. 2014).

Exploratory data analysis is the first step prior to releasing data. This analysis is critical in understanding data variabilities and intricacies and is particularly more important in areas such as climate science, where data scientists should be aware of the collection process.

Four different sources of climate data include on-site measurements, remote sensing, modeling, and paleoclimate. Each source has its set of strengths and weaknesses which should be fully understood before any data exploration. Table 1 presents each data source with its key strengths and weaknesses.

Climate Change, Rising Temperatures, Table 1 Advantages and disadvantages of climatic data sources (Faghmous and Kumar 2014)

Climate data source | Main strength | Main drawback
Climatic modeling | Capacity to run forward simulations | Only based on physics
On-site measurements and observations | Only based on direct observations | Possibility of having spatial bias
Satellite | Large coverage | Unstable and only lasts for the duration of the mission
Paleoclimate | Capability of using proxy data to infer preindustrial climate trends | Technologies for analyzing such data are still under development

Dealing with continuously changing observation systems is another challenge encountered by climatologists. Data monitoring instruments, especially for satellites and other remote sensing tools, undergo alterations. The change in instrumentations and data processing algorithms poses a question regarding the applicability of such data.

Availability of climatic data before data exploration is another barrier for climatologists. A few datasets have been developed only a decade ago or less. These datasets have spatial resolutions but short temporal duration.

Another barrier in climatic data collection is data heterogeneity. Climate is governed by several interacting variables defined by the earth's system. These variables are monitored and measured using various methods and techniques. However, a few variables cannot be totally observed. For example, a few climatic variables may rely on ground stations; therefore, these variables may be influenced by spatial bias. Other variables may be obtained from satellites whose missions last only 5 or 10 years; thus, continuous monitoring and long recording time is difficult. Despite the fact that the climatic data are sourced from different sources, they belong to the same system and are inter-related. As a result, merging data
from heterogeneous sources is necessary but redundant.

Data representation is another important task for climatologists. Conventional data science is based on attribute–value data. However, certain climatic phenomena (e.g., hurricanes) cannot be represented in the form of attribute–value. For example, hurricanes have their own special patterns; thus, they cannot be represented with binary values. Such evolutionary phenomena can be demonstrated through equations used in climate models. However, there is still a significant need for similar abstractions within the broader data science.

Conclusion

The rapid acceleration of climate change and global warming is among the most significant challenges of the twenty-first century. Thus, innovative and effective solutions are urgently needed. Understanding the changing world and finding adaptation and mitigation strategies have forced researchers with different backgrounds to join together to overcome such issues through a global "data revolution" known as big data. Big data effectively supports climate change research communities in addressing collection, analysis, and dissemination of massive amounts of data and information to enlighten possible future climates under different scenarios, address major challenges encountered by climatologists, and provide guidance to governments in making future decisions.

Further Reading

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171–209. https://doi.org/10.1007/s11036-013-0489-0.
Faghmous, J. H., & Kumar, V. (2014). A big data guide to understanding climate change: The case for theory-guided data science. Big Data, 2(3), 155–163.
Flowers, M. (2013). Beyond open data: The data-driven city. Beyond transparency: Open data and the future of civic innovation (pp. 185–198). http://beyondtransparency.org/chapters/part-4/beyond-open-data-the-data-driven-city/.
Houghton, J. T., Jenkins, G., & Ephraums, J. (1990). Climate change: The IPCC scientific assessment. Report prepared for Intergovernmental Panel on Climate Change by working group I. Cambridge: Cambridge University Press. http://www.ipcc.ch/ipccreports/far/wg_I/ipcc_far_wg_I_full_report.pdf. Accessed 11 June 2012.
Intergovernmental Panel on Climate Change. (2014). Climate change 2014: Mitigation of climate change (Vol. 3). Cambridge University Press.

Cloud

▶ Data Center

Cloud Computing

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Cloud-based computing provides important tools for big dataset analytics and management. The cloud-based computing model is a network-based distributed delivery model for providing virtual, on-demand computing services to customers. Cloud-based applications usually operate on multiple Internet-connected computers and servers that are accessible not only via machine-to-machine interactions but also via personal devices, such as smart phones and web browsers. Cloud-based computing is customer focused, offering information technology (IT) capabilities as subscription-based services that require minimal user-direct oversight and management.

Although, in actuality, cloud-based computing may not be the safest option for sensitive data, it may be described via various advantages: no geographical restrictions, cost-effectiveness, reliability, scalability to reflect customers' needs, and minimal direct requirements for customer-provided active management of cloud-based resources. Additional features include user-initiated self-service (on-demand access to network-enabled data storage, server time, applications,
etc.); network access (computing capabilities available via a network for use on heterogeneous thick or thin client platforms, e.g., mobile phones, laptops, workstations, etc.); economies of scale (resources are pooled and dynamically allocated to meet the demands of multiple customers); service flexibility (computer resources provisioned and released to meet customer needs from individual customer perspectives, providing the illusion of access to unlimited resources); and utilization-based resource management (resource consumption monitored, measured, and reported to customers and providers of the services).

Infrastructure Implementation Models

Cloud computing configurations of such resources as networks, servers, storage, applications, and services collectively provide enhanced user access to those resources. Cloud infrastructure implementations generally are categorized in relation to four different forms:

1. Private cloud – the cloud infrastructure is dedicated to a single organization that may include multiple customers.
2. Community cloud – the cloud infrastructure is dedicated to a community of organizations, each of which may have customers that frequently share common requirements, such as security, legal compliance requirements, and missions.
3. Public cloud – the cloud infrastructure is open to the public.
4. Hybrid cloud – a composition of two or more distinct cloud infrastructures (private, community, or public).

In regard to implementation, cloud-based computing supports different pay-for-use service options. These options include, for example, Software as a Service (SaaS) applications that are available by subscription; Platform as a Service (PaaS) by which cloud computing providers deploy the cloud infrastructure on which customers can develop and run their own applications; and Infrastructure as a Service (IaaS) based on virtual servers, networks, operating systems, applications, and data storage drives. Note that IaaS is usually an outsourced pay-for-use service; the user usually does not control the underlying cloud structure but may control operating systems, storage, and deployed applications.

Conclusion

Cloud-based computing offers customers a cost-effective, generally reliable means to access and use pooled computer resources on demand, with minimal direct management of those resources. These resources are Internet based and may be geographically dispersed.

Further Reading

Buyya, R., & Vecchiola, C. (2013). Mastering cloud computing. Burlington: Morgan Kaufmann.
Ji, C., Li, Y., Qiu, W., Awada, U., & Li, K. (2012). Big data processing in cloud computing environments. In IEEE international symposium on pervasive systems, algorithms and networks.
Mell, P., & Grance, T. (2011). NIST special publication 800–145. Available from https://csrc.nist.gov/publications/detail/sp/800-145/final.

Cloud Services

Paula K. Baldwin
Department of Communication Studies, Western Oregon University, Monmouth, OR, USA

As consumers and institutions congregate larger and larger portions of data, hardware storage has become inadequate. These additional storage needs led to the development of virtual data centers, also known as the cloud, cloud computing, or, in the case of the cloud providers, cloud services. The origin of the term, cloud computing, is somewhat unclear, but a cloud-shaped symbol is often used as a representation of the cloud on the Internet. The cloud symbol also represents the remote, complex system infrastructure used to store and manage the consumer's data.
The first reference to cloud computing in the contemporary age appeared in the mid-1990s, and it became popular in the mid-2000s. As cloud services become much more versatile and economical, consumers' use is increasing. The cloud offers users immediate access to a shared pool of computer resources. As processors continue to develop both in power and economic feasibility, these data centers (the cloud) have expanded on an enormous scale. Cloud services incentivize migration to the cloud as users recognize the elastic potential for data storage at a reasonable cost. Cloud services are the new generation of computing infrastructures, and there are multiple cloud vendors providing a range of cloud services. The fiscal benefit of cloud computing is that the consumer only pays for the resources they use without any concern over compromising their physical storage areas. The cloud service manages the data on the back end. In an era where physical storage limitations have become problematic with increased downloads of movies, books, graphics, and other high data memory products, cloud computing has been a welcome development.

Choosing a Cloud Service

As the cloud service industry grows, choosing a cloud service can be confusing for the consumer. One of the first areas to consider is the unique cloud service configurations. Cloud services are configured in four ways. One, public clouds may be free or bundled with other services or offered as pay per usage. Generally speaking, public cloud service providers like Amazon AWS, Microsoft, and Google own and operate their own infrastructure data centers, and access to these providers' services is through the Internet. Private cloud services are data management infrastructures created solely for one particular organization. Management of the private cloud may be internal or external. Community cloud services exist when multiple organizations from a specific community with common needs choose to share an infrastructure. Again, management of the community cloud service may be internal or external, and fiscal responsibility is shared between the organizations. Hybrid clouds are a grouping of two or more clouds, public, private, or community, where the cloud service is comprised of a variant combination that extends the capacity of the service through aggregation, integration, or customization with another cloud service. Sometimes a hybrid cloud is used on a temporary basis to meet short-term data needs that cannot be fulfilled by the private cloud. Having the ability to use the hybrid cloud enables the organization to only pay for the extra resources when they are needed, so this exists as a fiscal incentive for organizations to use a hybrid cloud service.

The other aspect to consider when evaluating cloud services is the specific service models offered for the consumer or organization. Cloud computing offers three different levels of service: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). The SaaS has a specific application or service subscription for the customer (e.g., Dropbox, Salesforce.com, and QuickBooks). With the SaaS, the service provider handles the installation, setup, and running of the application with little to no customization. The PaaS allows businesses an integrated platform on which they can create and deploy custom apps, databases, and line-of-business services (e.g., Microsoft Windows Azure, IBM Bluemix, Amazon Web Services (AWS), Elastic Beanstalk, Heroku, Force.com, Apache Stratos, Engine Yard, and Google App Engine). The PaaS service model includes the operating system, programming language execution environment, database, and web server designed for a specific framework with a high level of customization. With Infrastructure as a Service (IaaS), businesses can purchase infrastructure from providers as virtual resources. Components include servers, memory, firewalls, and more, but the organization provides the operating system. IaaS providers include Amazon Elastic Compute Cloud (Amazon EC2), GoGrid, Joyent, AppNexus, Rackspace, and Google Compute Engine.

Once the correct cloud service configuration is determined, the next step is to match user needs with the correct service level. When looking at cloud services, it is important to examine four different
aspects: application requirements, business expectations, capacity provisioning, and cloud information collection and processing. These four areas complicate the process of selecting a cloud service. First, the application requirements refer to the different features such as data volume, data production rate, data transfer and updating, communication, and computing intensities. These factors are important because the differences in these factors will affect the CPU (central processing unit), memory, storage, and network bandwidth for the user. Business expectations fluctuate depending on the applications and potential users, which, in turn, affect the cost. The pricing model depends on the level of the service required (e.g., voicemail, a dedicated service, amount of storage required, additional software packages, and other custom services). Capacity provisioning is based on the concept that, according to need, different IT technologies are employed and, therefore, each technology has its own unique strengths and weaknesses. The downside for the consumer is the steep learning curve required. The final challenge requires that the consumers invest a substantial amount of time to investigate individual websites, collect information about each cloud service offering, collate their findings, and employ their own assessments to determine their best match. If an organization has an internal IT department or employs an IT consultant, the decision is easier to make; for the individual consumer, without an IT background, the choice may be considerably more difficult.

Cloud Safety and Security

For the consumer, two primary issues are relevant to cloud usage: a check and balance system on the usage versus service level purchased and data safety. This on-demand computation model of cloud computing is processed through large virtual data centers (clouds), offering storage and computation needs for all types of cloud users. These needs are based on service level agreements. Although cloud services are relatively low cost, there is no way for consumers to know if the services they are purchasing are equivalent to the service level purchased. Although being able to determine that a consumer's usage in relationship to the service level purchased is appropriate, the more serious concern for consumers is data safety. Furthermore, because users do not have physical possession of their data, public cloud services are underutilized due to trust issues. Larger organizations use privately held clouds, but if a company does not have the resources to develop their own cloud service, most organizations are unlikely to use public cloud services due to safety concerns. Currently, there is no global standardization of data encryption between cloud services, and there have been some concerns raised by experts who say there is no way to be completely sure that data, once moved to the cloud, remains secure. With most cloud services, control of the encryption keys is retained by the cloud service, making your data vulnerable to a rogue employee or a governmental request to see your data.

The Electronic Frontier Foundation (EFF) is a privacy advocacy group that maintains a section on their website (Who Has Your Back) that rates the largest Internet companies on their data protections. The EFF site uses six criteria to rate the companies: requires a warrant for content, tells users about government data requests, publishes transparency reports, publishes law enforcement guidelines, fights for user privacy rights in courts, and fights for user privacy rights in Congress. Another consumer and corporate data protection group is the Tahoe Least Authority File System (Tahoe-LAFS) project. Tahoe-LAFS is a free, open-source storage system created and developed by Zooko Wilcox-O'Hearn with the goal of data security and protection from hardware failure. The strength of this storage system is its encryption and integrity checks – data first goes through gateway servers, and after the process is complete, the data is stored on a secondary set of servers that cannot read or modify the data.

Security for data storage via cloud services is a global concern whether for individuals or organizations. From a legal perspective, there is a great deal of variance in how different countries and regions deal with security issues. At this point in time, until there are universal rules or laws specifically addressing data privacy, consumers must take responsibility for their own data. There are five strategies for keeping
your data secure in the cloud, outside of what the cloud services offer. First, consider storing crucial information somewhere other than the cloud. For this type of information, perhaps utilizing the available hardware storage might be the best solution rather than using a cloud service. Second, when choosing a cloud service, take the time to read the user agreement. The user agreement should clearly delineate the parameters of their service level, and that will help with the decision-making. Third, take creating passwords seriously. Oftentimes, the easy route for passwords is familiar information such as dates of birth, hometowns, and pet's or children's names. With the advances in hardware and software designed specifically to crack passwords, it is particularly important to use robust, unique passwords for each of your accounts. Fourth, the best way to protect data is through encryption. The way encryption works in this instance is to use encryption software on a file before you move the file to the cloud. Without the password to the encryption, no one will be able to read the file content. Fifth, when considering a cloud service, investigate their encryption services. Some cloud services encrypt and decrypt user files locally as well as provide storage and backup. Using this type of service ensures that data is encrypted before it is stored in the cloud and after it is downloaded from the cloud, providing the optimal safety net for consumer data.

Cross-References

▶ Cloud
▶ Cloud Computing
▶ Cloud Services

Further Reading

Ding, S., et al. (2014). Decision support for personalized cloud service selection through multi-attribute trustworthiness evaluation. PLoS One, 9(6), e97762.
Gui, Z., et al. (2014). A service brokering and recommendation mechanism for better selecting cloud services. PLoS One, 8(8), e105297. https://doi.org/10.1371/journal.pone.0105297.
Hussain, M., et al. (2014). Software quality in the clouds: A cloud-based solution. Cluster Computing, 17(2), 389–402.
Kun, H., et al. (2014). Securing the cloud storage audit service: Defending against frame and collude attacks of third party auditor. IET Communications, 8(12), 2106–2113.
Mell, P., et al. (2011). National Institute of Standards and Technology, U.S. Department of Commerce. The NIST definition of cloud computing. Special Publication 800-145, 9–17.
Qi, Q., et al. (2014). Cloud service-aware location update in mobile cloud computing. IET Communications, 8(8), 1417–1424.
Rehman, Z., et al. (2014). Parallel cloud service selection and ranking based on QoS history. International Journal of Parallel Programming, 42(5), 820–852.

Cluster Analysis

▶ Data Mining

Collaborative Filtering

Ashrf Althbiti and Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Data reduction; Network data; Recommender systems

Introduction

Collaborative filtering (CF) entirely depends on users' contributions such as ratings or reviews about items. It exploits the matrix of collected user-item ratings as the main source of input. It ultimately provides the recommendations as an output that takes the following two forms: (1) a numerical prediction for items that might be liked by an active user U and (2) a list of top-rated items as top-N items. CF claims that similar users express similar patterns of rating behavior. Also, CF claims that similar items obtain similar ratings. There are two primary approaches of CF algorithms: (1) neighborhood-based and (2) model-based (Aggarwal 2016).
The neighborhood-based CF algorithms (aka memory-based) directly utilize stored user-item ratings to predict ratings for unseen items. There are two primary forms of neighborhood-based algorithms: (1) user-based nearest neighbor CF and (2) item-based nearest neighbor CF (Aggarwal 2016). In the user-based CF, two users are similar if they rate several items in a similar way. Thus, it recommends to a user the items that are the most preferred by similar users. In contrast, the item-based CF recommends to a user the items that are the most similar to the user's previous purchases. In such an approach, two items are similar if several users have rated these items in a similar way.

The model-based CF algorithms (aka learning-based models) form an alternative approach by sending both items and users to the same latent factor space. The algorithms utilize users' ratings to learn a predictive model (Ning et al. 2015). The latent factor space attempts to interpret ratings by characterizing both items and users on factors automatically inferred from previous users' ratings (Koren and Bell 2015).

Methodology

Neighborhood-Based CF Algorithms

User-Based CF
User-based CF claims that if users rated items in a similar fashion in the past, they will give similar ratings to new items in the future. For instance, Table 1 shows a user-item ratings matrix, which includes four users' ratings of four items. The task is to predict the rating of the unrated item3 by the active user Andy.

Collaborative Filtering, Table 1 User-item rating dataset

User name | Item1 | Item2 | Item3 | Item4
Andy | 3 | 3 | ? | 5
U1 | 4 | 2 | 2 | 4
U2 | 1 | 1 | 4 | 2
U3 | 5 | 2 | 3 | 4

In order to solve the task presented above, the following notations are given. The set of users is symbolized as U = {U1, .., Uu}, the set of items is symbolized as I = {I1, .., Ii}, the matrix of ratings is symbolized as R, where r_{u,i} means the rating of a user U for an item I, and the set of possible ratings is symbolized as S, where its values take a range of numerical ratings {1, 2, 3, 4, 5}. Most systems consider the value 1 as strongly dislike and the value 5 as strongly like. It is worth noting that r_{u,i} should only take one rating value.

The first step is to compute the similarity between Andy and the other three users. In this example, the similarity between the users is simply computed using Pearson's correlation coefficient (1).

sim(u, v) = \frac{\sum_{i \in I}(r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I}(r_{u,i} - \bar{r}_u)^2} \sqrt{\sum_{i \in I}(r_{v,i} - \bar{r}_v)^2}}    (1)

where \bar{r}_u and \bar{r}_v are the average rating of the available ratings made by users u and v.

By applying Eq. (1) to the rating data in Table 1, given that \bar{r}_{Andy} = (3 + 3 + 5)/3 ≈ 3.6 and \bar{r}_{U1} = (4 + 2 + 2 + 4)/4 = 3, the similarity between Andy and U1 is calculated as follows:

sim(Andy, U1) = \frac{(3 - 3.6)(4 - 3) + (3 - 3.6)(2 - 3) + (5 - 3.6)(4 - 3)}{\sqrt{(3 - 3.6)^2 + (3 - 3.6)^2 + (5 - 3.6)^2} \sqrt{(4 - 3)^2 + (2 - 3)^2 + (4 - 3)^2}} = 0.49    (2)

It is worth noting that the results of Pearson's correlation coefficient are in the range of (+1 to −1), where +1 means high positive correlation and −1 means high negative correlation. The similarities between Andy and U2 and U3 are 0.15 and 0.19, respectively. Referring to the previous calculations, it seems that U1 and U3 similarly rated several items in the past. Thus, U1 and U3 are utilized in this example to predict the rating of item3 for Andy.
Collaborative Filtering 181

P
The second step is to compute the prediction v  K simðu, vÞ  ðrv, i  rvÞ
r^ðu, iÞ ¼ ru þ P
for item3 using the ratings of Andy’s K-neighbors v  K simðu, vÞ
(U1 and U3). Thus, Eq. (3) is introduced where r^ ð3Þ
means the predicted rating.

C
r^ðAndy, item3Þ ¼ rAndy
   
simðAndy, U1Þ  rU 1, item3  rU 1 þ simðAndy, U3Þ  rU 3, item3  rU 3
þ
simðAndy, U1Þ þ simðAndy, U3Þ
¼ 4:45
ð4Þ

P  
Given the result of the prediction computed by simði, jÞ ¼ u  U ru, i  ri ðru, j  rjÞ
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P  2
Eq. (4), it is most likely that item3 will be a good = u  U ru, i, i  ri
choice to be included in the recommendation list qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P 2
for Andy.  u  U ðru, j  rjÞ
ð5Þ
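The user-based computation above can be illustrated with a short, self-contained sketch. The snippet below is not from the original entry; it is a minimal Python illustration of Eqs. (1) and (3), with the Table 1 ratings stored in a plain dictionary (the names `ratings`, `pearson`, and `predict` are ours). With the exact user mean it yields sim(Andy, U1) ≈ 0.47; Eq. (2) obtains 0.49 because it rounds Andy's mean to 3.6.

```python
from math import sqrt

# Ratings from Table 1; Andy's unrated item3 is simply absent.
ratings = {
    "Andy": {"item1": 3, "item2": 3, "item4": 5},
    "U1":   {"item1": 4, "item2": 2, "item3": 2, "item4": 4},
    "U2":   {"item1": 1, "item2": 1, "item3": 4, "item4": 2},
    "U3":   {"item1": 5, "item2": 2, "item3": 3, "item4": 4},
}

def mean(user):
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def pearson(u, v):
    """Eq. (1): similarity over the items co-rated by u and v."""
    common = set(ratings[u]) & set(ratings[v])
    mu, mv = mean(u), mean(v)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den_u = sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
    den_v = sqrt(sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / (den_u * den_v) if den_u and den_v else 0.0

def predict(u, item, neighbors):
    """Eq. (3): mean-centered weighted average over the K neighbors."""
    num = sum(pearson(u, v) * (ratings[v][item] - mean(v)) for v in neighbors)
    den = sum(pearson(u, v) for v in neighbors)
    return mean(u) + num / den if den else mean(u)

print(round(pearson("Andy", "U1"), 2))   # ~0.47 (0.49 in Eq. (2) with the rounded mean)
print(round(predict("Andy", "item3", ["U1", "U3"]), 2))
```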
Item-Based CF
Item-based CF algorithms are introduced to solve serious challenges when applying user-based nearest neighbor CF algorithms. The main challenge is that when the system has massive records of users, the complexity of the prediction task increases sharply. Accordingly, if the number of items is less than the number of users, it is ideal to adopt the item-based CF algorithms.

This approach computes the similarity between items instead of an enormous number of potential neighbor users. Also, this approach considers the ratings of user u to make a prediction for item i, as item i will be similar to the items previously rated by user u. Therefore, users may prefer to utilize their own ratings rather than other users' ratings when making the recommendations.

Equation (5) is used to compute the similarity between two items:

sim(i, j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2} \, \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}}    (5)

In Equation (5), \bar{r}_i and \bar{r}_j are the average ratings of the available ratings made by users for items i and j.

Then, the prediction for item i for user u is made by applying Eq. (6), where K denotes the set of neighboring items of item i:

\hat{r}(u, i) = \bar{r}_i + \frac{\sum_{j \in K} sim(i, j) \, (r_{u,j} - \bar{r}_j)}{\sum_{j \in K} sim(i, j)}    (6)
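For completeness, the item-based variant of Eqs. (5) and (6) can be sketched in the same style. This is again an illustrative snippet rather than code from the entry; it reuses the `ratings` dictionary from the previous example, and the helper names are ours.

```python
from math import sqrt

def item_mean(item, ratings):
    vals = [r[item] for r in ratings.values() if item in r]
    return sum(vals) / len(vals)

def item_sim(i, j, ratings):
    """Eq. (5): similarity of items i and j over the users who rated both."""
    users = [u for u, r in ratings.items() if i in r and j in r]
    mi, mj = item_mean(i, ratings), item_mean(j, ratings)
    num = sum((ratings[u][i] - mi) * (ratings[u][j] - mj) for u in users)
    di = sqrt(sum((ratings[u][i] - mi) ** 2 for u in users))
    dj = sqrt(sum((ratings[u][j] - mj) ** 2 for u in users))
    return num / (di * dj) if di and dj else 0.0

def item_predict(u, i, neighbors, ratings):
    """Eq. (6): prediction from the user's own ratings of neighboring items."""
    num = sum(item_sim(i, j, ratings) * (ratings[u][j] - item_mean(j, ratings))
              for j in neighbors if j in ratings[u])
    den = sum(item_sim(i, j, ratings) for j in neighbors if j in ratings[u])
    return item_mean(i, ratings) + num / den if den else item_mean(i, ratings)
```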
Model-Based CF Algorithms
Model-based CF algorithms take the raw data that has been preprocessed in the offline step, where the data typically requires to be cleansed, filtered, and transformed, and then generate the learned model to make a prediction. This solves several issues that appear in the neighborhood-based CF algorithms. These issues are (1) limited coverage, which means finding neighbors is based on the rating of common items, and (2) sparsity in the rating matrix, which reflects the diversity of items rated by different users.

Model-based CF algorithms compute the similarities between users or items by developing a parametric model that investigates their relationships and patterns. This family is classified into two main categories: (1) factorization methods and (2) adaptive neighborhood learning methods (Ning et al. 2015).

Factorization Methods
Factorization methods aim to define the characterization of ratings by projecting users and items to a reduced latent vector space. This helps discover more expressive relations between each pair of users, items, or both. It has two main types: (1) factorization of a sparse similarity matrix and (2) factorization of an actual rating matrix (Jannach et al. 2010).

The factorization is done by using the singular value decomposition (SVD) or the principal component analysis (PCA). The original sparse ratings or similarities matrix is decomposed into a smaller-rank approximation which captures the highly correlated relationships. It is worth mentioning that the SVD theorem (Golub and Kahan 1965) claims that a matrix M can be collapsed into a product of three matrices as follows:

M = U \Sigma V^T    (7)

where U and V contain the left and right singular vectors and the values on the diagonal of \Sigma are the singular values.
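The following NumPy sketch illustrates Eq. (7) and a rank-k reconstruction of the Table 1 matrix. It is not from the original entry; the unknown Andy/item3 rating is imputed with the item mean purely so that the matrix is complete (the entry does not prescribe an imputation strategy).

```python
import numpy as np

# Table 1 as a matrix (rows: Andy, U1, U2, U3); the unknown Andy/item3
# entry is filled with the mean of the known item3 ratings (2, 4, 3).
R = np.array([
    [3, 3, 3.0, 5],
    [4, 2, 2.0, 4],
    [1, 1, 4.0, 2],
    [5, 2, 3.0, 4],
])

# Eq. (7): R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k strongest singular values for a smaller-rank approximation.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(R_k, 2))        # rank-2 approximation of the rating matrix
print(np.round(R_k[0, 2], 2))  # reconstructed estimate for Andy's item3
```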
Adaptive Neighborhood Learning Methods
This approach combines the original neighborhood-based and model-based CF methods. The main difference of this approach, in comparison with the basic neighborhood-based one, is that the learning of the similarities is directly inferred from the user-item ratings matrix, instead of adopting pre-defined neighborhood measures.

Conclusion

This article discusses a general perception of CF. CF is one of the early approaches proposed for information filtering and recommendation making. However, CF still ranks among the most popular methods that people employ nowadays for research on the Web, big data, and data mining.

Cross-References

▶ Data Aggregation
▶ Data Cleansing
▶ Network Analytics

References

Aggarwal, C. C. (2016). An introduction to recommender systems. In Recommender systems (pp. 1–28). Cham: Springer.
Golub, G., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2), 205–224.
Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2010). Recommender systems: An introduction. Cambridge, UK: Cambridge University Press.
Koren, Y., & Bell, R. (2015). Advances in collaborative filtering. In Recommender systems handbook (pp. 77–118). Boston: Springer.
Ning, X., Desrosiers, C., & Karypis, G. (2015). A comprehensive survey of neighborhood-based recommendation methods. In Recommender systems handbook (pp. 37–76). Boston: Springer.

Column-Based Database

▶ NoSQL (Not Structured Query Language)

Common Sense Media

Dzmitry Yuran
School of Arts and Communication, Florida Institute of Technology, Melbourne, FL, USA

The rise of big data has brought us to the verge of redefining our understanding of privacy. The possibility of high-tech profiling, identification, and discriminatory treatment based on information often provided unknowingly (and

sometimes involuntarily) brings us to the fore- on the selected age group: ON (appropriate),
front of a new dimension of the use and protec- PAUSE (some content could be suitable for
tion of personal information. Even less aware some children of a selected age group), OFF (not
than adults of the means to the ends of their age-appropriate), and NOT FOR KIDS
digital product consumption, children become (inappropriate for kids of any age). A calculated
more vulnerable to the risks of the digital world score from a series of five-point scale categories
defined by big data. (such as “positive role models,” “positive mes- C
The issue of children’s online privacy and pro- sages,” “violence & scariness,” etc.) determines
tection of their safety and rights in today’s virtu- the assignment of an ON, PAUSE, or OFF rating
ally uncontrolled Internet environment is among to content for an age group.
the main concerns of Common Sense Media Some media, such as software applications,
(CSM), an independent non-profit organization video games, and websites, are evaluated
providing parents, educators, and policymakers according to their learning potential on a five-
with tools and advice to aid making children’s point scale (three-point scale before 2013), rang-
use of media and technology a safer and more ing from BEST (excellent learning approach) to
positive experience. NOT FOR KIDS (not recommended for learning).
Protecting data that students and parents pro- Each rated item receives a series of scores in
vide to education institutions from commercial dimensions such as engagement, learning
interest and other third parties is the key concern approach, feedback, and other. Combined score
behind CSM’s School Privacy Zone campaign. across the dimensions determines overall learning
The organization does more to advocate potential of given media content.
safeguarding the use of media by youth. It also A one to five star rating assesses media’s over-
provides media ratings and reviews, designs, and all quality. Parents, children, and educators can
promotes educational tools. review and rate media after creating an account on
the Common Sense Media website. User reviews
and rating are displayed separately and are broken
Media Reviews and Ratings down into two groups: parents and kids reviews.
As of summer 2014, the Common Sense Media
On their website, www.commonsensemedia.org, review library was exceeding 20,000 reviews.
Common Sense Media publishes independent
expert reviews of movies, games, television pro-
grams, books, application software, websites and Education
music. The website enables sorting through the
reviews by media type, age, learning rating, topic, Common Sense Media provide media and tech-
genre, required hardware and software platforms, nology resources for educators, including Graph-
skills the media are aimed at developing or ite, a free tool for discovery and sharing of
improving, recommendations by editors, parents, curricula, educational software, sites and games.
as well as popularity among children. As a part of Graphite, App Flows, an interactive
Common Sense Media does not accept pay- lesson plan-building framework allows educators
ments for its reviews in order to avoid bias and to fit discovered digital tools into a dynamic plat-
influence by creators and publishers (charitable form to create and share lesson plans.
donations are welcome, however). Media content Common Sense also hosts Appy Hours, a
is reviewed and rated by a diverse trained staff series of videos on the organization’s YouTube
(from reviewers for major publications to librar- channel, designed to bring educators together for
ians, teachers, and academics) and edited by a a discussion of ways in which digital tools could
group of writers and media professionals. be used for learning.
All reviewed media content is assigned one of Editorial picks for educational digital content,
the four age-appropriateness ratings, depending discussion boards, and blogs are incorporated into

the Common Sense Graphite site in order to Research


enhance educators’ experience with the system.
Common Sense Media carries out a variety of
research projects in order to inform its ratings
Advocacy and aid its advocacy efforts. They collect, analyze,
and disseminate data on children’s use of media
Common Sense Media works with lawmakers and and technology and media impact on their devel-
policymakers nationwide in the pursuit of an opment. Both original data (collected via commis-
improved media landscape for children and fam- sioned by Common Sense Media to Knowledge
ilies. The main issues the organization addresses Networks online surveys) and secondary data
are represented in three areas: children’s online from large national surveys and databases are
privacy, access to digital learning, violence and used in their analyses. The organization also
gender roles in media. produces overviews of the state of research on
In an attempt to give more control over kids’ certain topics, such as advertising for children
digital footprints to children themselves as well as and teens and media connection to violence. Full
their families, CSM supports several legislative texts of featured reports as well as summaries
projects, including the Do Not Track Kids bill and infographics highlighting main findings
concerned with collecting location information are available for viewing and download on
and sending targeted adds to teens as well as the commonsensemedia.org free of charge.
Eraser Button bill which requires apps and Results of Commons Sense Media research
websites to allow teens to remove their postings regularly make their way into mainstream mass
and prohibits advertisement of illegal or harmful media. Among others, NPR, Time, and the
products, like alcohol or tobacco, to minors. The New York Times featured CSM findings in news
organization also promotes the need for updates to stories.
the Federal Trade Commission’s 1999 Children’s
Online Privacy Protection Act designed to give
parents more control over information about their Organization History, Structure and
children collected and shared by companies Partnerships
online.
In their effort to promote digital learning tech- Common Sense Media was founded in 2003 by
nology, Common Sense Media supports Federal James P. Steyer, the founder of Children Now
Communication Commission’s E-Rate program Group and a lecturer at Stanford University at
aimed at bringing high-speed Internet to Ameri- the time. With initial investment of $500,000
can schools. The CSM’s School Privacy Zone from various backers (including among others
initiative seeks to safeguard data about students Charles R. Schwab of the Charles Schwab Corpo-
collected by schools and educators from adver- ration, Philip F. Anschutz of Qwest Communica-
tisers and other private interests. tions International, George R. Roberts of
The organization addresses the concern with Kohlberg Kravis Roberts & Company, and
the impact that video games and other media James G. Coulter of Texas Pacific Group) the
content could have on development of children, first office opened in San Francisco,
as well as the contribution of these media to the CA. William E. Kennard and Newton N. Minow,
culture of violence in the United States. CSM two former Federal Communications Commis-
highlights the gaps in research of portrayal of sion chairmen, were among the first board mem-
violence in media and encourages Congress to bers for the young organization.
promote further scientific inquiry in order to Since 2003, Common Sense Media has grown
address overwhelming concern among parents. to 25 members on the board of directors and

25 advisors. It employs extensive teams of Common Sense Media. Program for the study of children
reviewers and editors. The organization added and media. https://www.commonsensemedia.org/
research. Accessed Sept 2014.
three regional offices: in Los Angeles, New York Rutenberg, J. (2003). A new attempt to monitor media
City, and Washington D.C. and established a pres- content. The New York Times. http://www.nytimes.
ence in social media (Facebook, Twitter, com/2003/05/21/business/a-new-attempt-to-monitor-
YouTube, Google+, and Pinterest). media-content.html. Accessed Sept 2014.
The evolution of the Internet and the ever- C
growing number of its applications turned it into a
virtual world with almost endless possibilities.
While the rules and the laws of this new world Communication Quantity
are yet to take shape and be recorded, young people
spend more and more time in this virtual reality. It Martin Hilbert
affects their lives both inside and outside the virtual Department of Communication,
world, affecting their development, and physical University of California, Davis, Davis, CA, USA
and emotional state. Virtually any activity we
engage in online produces data which could be
used in both social research and with commercial An increasing share of the world’s data capacity
purposes. Collection and use of these data (for is centralized in “the cloud.” The gatekeeper
advertising or other purposes) could pose serious to obtain access to this centralized capacity is
legal and ethical issues which raises serious con- telecommunication access. Telecommunication
cern among parents of young media consumers and channels are the necessary (but not sufficient)
educators. As some of the most vulnerable media condition to provide access to the mass of the
consumers, children need additional protection and world’s data storage.
guidance in the virtual world. Organizations like In this inventory we mainly follow the
Common Sense Media, parents, educators, law methodology of what has become a standard ref-
makers, and policy makers begin paying closer erence in estimating the world’s technological
attention to the kids’ place on virtual reality and information capacity: Hilbert and López (2011).
the impact that reality can have on children. The total communication capacity is calculated as
the sum of the product of technological devices
and their bandwidth performance, where the latter
is normalized on compression rates. We measure
Cross-References the “installed capacity” (not the effectively used
capacity), which implies that it is assumed that all
▶ Media
technological capacities are used to their maxi-
▶ Online Advertising
mum. For telecommunication, this describes the
▶ Social Media
“end-user bandwidth potential” (“if all end-users
would use their full bandwidth”). This is merely
a “potential,” because in reality, negative net-
Further Reading work externalities create a trade-off in bandwidth
among users. For example, estimating that the
Common Sense Media. Graphite™: About us http://www.
graphite.org/about-us. Accessed Sept 2014. average broadband connection is 10 Mbps in a
Common Sense Media. Our mission. https://www. given country does not mean that all users could
commonsensemedia.org/about-us/our-mission#about- use this average bandwidth at the same second.
us. Accessed Sept 2014.
The network would collapse. The normalization
Common Sense Media. Policy priorities. https://www.
commonsensemedia.org/advocacy. Accessed Sept on software compression rates is important for the
2014. creation of meaningful time series, as

compression algorithms have enable to send more dynamic of the number of subscriptions follows
information through the same hardware infra- existing patterns in population distribution.
structure over recent decades (Hilbert 2014a; Hil- Especially the diffusion of mobile phones during
bert and López 2012a). We normalize on recent decades has contributed to the fact that both
“optimally compressed bits” as if all content distributions align. The number of subscriptions
were compressed with the best compression algo- reaches a saturation limit at about 2–2.5 subscrip-
rithms possible in 2014 (Hilbert and López tions per capita worldwide, and therefore leads to a
2012b). For the estimation of compression rates natural closure of the divide over time. On
of different content, justifiable estimates are elab- the contrary, communication capacity in kbps (and
orated for 7-year intervals (1986, 1993, 2000, therefore access to the global Big Data
2007, 2014). The subscriptions data stem mainly infrastructure) follows the signature of economic
from ITU (2015) with completions from other capacities. After only a few decades, both processes
sources. One of the main sources for internet align impressively well. This shows that the digital
bandwidth is NetIndex (Ookla 2014), which has divide in terms of data capacity is far from being
gathered the results of end-user-initiated band- closed but is rather becoming a structural character-
width velocity tests per country per day over istic of modern societies, which is as persistent as
recent years (e.g., an average 180,000 test per the existing income divide (Hilbert 2014b, 2016).
day already in 2010 through Speedtest.net and Figure 1a also reveals that the evolution of
Pingtest.net). For more see Hilbert (2015) and communication capacities in kbps is not a mono-
López and Hilbert (2012). tone process. Increasing and decreasing shares
Figure 1a looks at the total telecommunication between high income and upper middle income
capacity in optimally compressed kbps in terms of countries suggest that the evolution of bandwidth
global income groups (following the classification is characterized by a complex nonlinear interplay
of the World Bank of 2015). The world’s installed of public policy, private investments, and techno-
telecommunication capacity has grown with a logical progress. Some countries in this income
compound annual growth rate of 35% during the range seem to (at least temporarily) do much
same period, (from 7.5 petabites to 25 exabits). better than their economic capacity would
The last three decades show a gradual loss of suggest. This is a typical signature of effective
dominance of global information capacities for public policy.
today’s high-income countries. High-income Figure 2 shows the same global capacity in
countries dominated 86% of the globally installed optimally compressed kbps per geographic
bandwidth potential, but merely 66% in 2013. It regions (following the World Bank classification
is interesting to compare this presentation with of 2015). Asia has notably increased its global
the more common method to assess the advance- share at the expense of North America and
ment approximation in terms of the number of Europe, with a share of less than a quarter of the
telecommunication subscriptions (Fig. 1b). Both global capacity in 1986 (23%) and a global major-
dynamics are quite different, which stems from ity of 51% in 2013 (red-shaded areas in Fig. 2).
the simple fact that not all subscriptions are equal Figure 2 reveals that the main driver of this expan-
in their communicational performance. This intu- sion during the early 2000s were Japan and
itive difference is the main reason why the statis- South Korea, both of which famously pursued a
tical accounting of subscriptions is an obsolete very aggressive public sector policy agenda in the
and very often misleading indicator. This holds expansion of fiber optic infrastructure in the early
especially true in an age of Big Data, where the 2000s. The more recent period since 2010 is char-
focus of development is set on informational bits, acterized by the expansion of bandwidth in both
not on the number of technological devices (for China and Russia. Notably, most recent broad-
the complete argument, see Hilbert (2014b, 2016)). band policy efforts in the USA seems to show
Comparing these results with the global shares of some first detectable effects on a macrolevel, as
Gross National Income (GNI) and population North America has started to return its tendency of
(Fig. 1c, d), it becomes clear that the diffusion a shrinking global share during recent years.




Communication Quantity, Fig. 1  International income groups: (a) telecommunication capacity in optimally compressed kbps; (b) telecommunication subscriptions; (c) World Gross National Income (GNI, current USD); (d) World population

Communication Quantity, Fig. 2 Telecommunication capacity in optimally compressed kbps per world region

Expressed in installed kbps per capita (per had access to an average of 100 kbps of installed
inhabitant), we can obtain a clearer picture about bandwidth potential, while the average inhabitant
the increasing and decreasing nature of the evolv- of the rest of the world had access to merely 9 kbps.
ing digital divide in terms of bandwidth capacity. In absolute terms, this results in a difference of
First and foremost, Fig. 3a shows that the divide some 90 kbps. As shown in Fig. 3a, this divide
continuously increases in absolute terms. In 2003, increased with an order of magnitude every
the average inhabitant of high-income countries 5 years, reaching almost 900 kbps in 2007 and

Communication Quantity, Fig. 3  (a) Telecommunication capacity per capita in optimally compressed kbps: high-income groups (World Bank classification) versus rest of world. (b) Ratio of telecommunication capacity per capita in high-income countries versus rest of world, and of subscriptions per capita

over 10,000 kbps by 2013. This increasing divide by the global diffusion of narrowband internet and
in absolute terms is important to notice in the 2G telephony. The increasing nature of the divide
context of a Big Data world, in which the amount between 2001 and 2008 is due to the global intro-
of data is becoming a crucial ingredient for growth. duction of broadband for fixed and mobile solu-
In relative terms, this results in an increasing tions. The most recent decreasing nature of the
and decreasing evolution of the divide over time. divide is evidence of the global diffusion of broad-
Figure 3b contrasts this tendency with the mono- band. The digital divide in terms of data capacities
tonically decreasing tendency of the digital divide is a continuously moving target, which opens up
in terms of telecommunication subscriptions. with each new innovation that is introduced into
It shows that the divide in terms of data capacities is the market (Hilbert 2014b, 2016).
much more susceptible to both technological Finally, another aspect with important implica-
change and technology interventions. The decreas- tions for the Big Data paradigm is the relation
ing divide during the period until 2000 is explained between uplink and downlink capacity. Uplink

Communication Quantity, Fig. 4  Telecommunication capacity in optimally compressed kbps per uplink and downlink

and downlinks show the potential of contribution between 1986 and 2010. Journal of the Association for
and exploitation of the digital Big Data footprint. Information Science and Technology, 65(4), 821–835.
https://doi.org/10.1002/asi.23020.
Figure 4 shows that the global telecommunication Hilbert, M. (2015). Quantifying the data deluge and the
landscape has evolved from being a media of data drought (SSRN scholarly paper no. ID 2984851).
equal up- and downlink, toward to more down- Rochester: Social Science Research Network. Retrieved
load heavy medium. Up until 1997, global tele- from https://papers.ssrn.com/abstract¼2984851.
Hilbert, M. (2016). The bad news is that the digital
communication bandwidth potential was equally
access divide is here to stay: Domestically installed
split with 50% up- and 50% down-link. The intro- bandwidths among 172 countries for 1986–2014.
duction of broadband and the gradual introduction Telecommunications Policy, 40(6), 567–581. https://
of multimedia video and audio content changed doi.org/10.1016/j.telpol.2016.01.006.
this. In 2007, the installed uplink potential was as Hilbert, M., & López, P. (2011). The world’s technological
capacity to store, communicate, and compute informa-
little as 22%. The global diffusion of fiber optic tion. Science, 332(6025), 60–65. https://doi.org/10.
cables seems to reverse this trend, reaching a share 1126/science.1200970.
of 30% uplink in 2013. It can be expected that the Hilbert, M., & López, P. (2012a). How to measure the
share of effectively transmitted bits through this world’s technological capacity to communicate, store
and compute information? Part I: Results and scope.
installed bandwidth potential leads to an even International Journal of Communication, 6, 956–979.
larger share of fixed-line broadband (for more in Hilbert, M., & López, P. (2012b). How to measure
these methodological differences, see Hilbert and the world’s technological capacity to communicate,
López (2012a, b)). store and compute information? Part II: Measurement
unit and conclusions. International Journal of
Communication, 6, 936–955.
ITU (International Telecommunication Union). (2015).
Further Reading World Telecommunication/ICT Indicators Database.
Geneva: International Telecommunication Union.
Hilbert, M. (2014a). How much of the global information Retrieved from http://www.itu.int/ITU-D/ict/statistics/.
and communication explosion is driven by more, López, P., & Hilbert, M. (2012). Methodological and
and how much by better technology? Journal of the statistical background on the world’s technological
Association for Information Science and Technology, capacity to store, communicate, and compute informa-
65(4), 856–861. https://doi.org/10.1002/asi.23031. tion (online document). Retrieved from http://www.
Hilbert, M. (2014b). Technological information martinhilbert.net/WorldInfoCapacity.html.
inequality as an incessantly moving target: The redis- Ookla. (2014). NetIndex source data. Retrieved from
tribution of information and communication capacities http://www.netindex.com/source-data/.

These networks can be archived from social net-


Communications working sites such as Twitter or Facebook, or
alternatively can be constructed through surveys
Alison N. Novak of people within a group, organization, or com-
Department of Public Relations and Advertising, munity. The automated data aggregation of digital
Rowan University, Glassboro, NJ, USA social networks makes the method appealing to
Communications researchers because it produces C
large networks quickly and with limited possibil-
There is much debate about the origins and his- ity of human error in recording nodes. Addition-
tory of the field of Communications. While ally, the subfield of Health Communications has
many researchers point to a rhetorical origin in adopted the integration of big datasets in an effort
ancient Greece, others suggest the field is much to study how healthcare messages are spread
newer, developing from psychology and propa- across a network.
ganda studies of the 1940s. The discipline Natural language processing is another area of
includes scholars exploring subtopics such as big data inquiry in the field of Communications.
political communication, media effects, and In this vein of research, scholars explore the way
organizational relationships. The field generally that computers can develop an understanding of
uses both qualitative and quantitative language and generate responses. Often studied
approaches, as well as developing a variety of along with Information Science researchers and
mixed-methods techniques to understand social Artificial intelligence developers, natural lan-
phenomena. guage processing draws from Communications
Russell W. Burns argues that the field of Com- association with linguistics and modern lan-
munications developed from a need to explore the guages. Natural language processing is an attempt
ways in which media influenced people to behave, to build communication into computers so they
support, or believe in a certain idea. Much of can understand and provide more sender-tailored
Communication studies investigates the idea of messages to users.
media and texts, such as newspaper discourses, The field of communication has also been out-
social media messages, or radio transcripts. As the spoken about the promises levied with big data
field has developed, it has investigated new tech- analytics as well as the ethics of big data use.
nologies and media, including those still in their Recognizing that the field is still early in its devel-
infancies. opment, scholars point to the lifespan of other
Malcom R. Parks states that the field of Com- technologies and innovations as examples of
munications has not adopted one set definition of how optimism early in the lifecycle often turns
big data, but rather sees the term as a means to into critique. Pierre Levy is one Communications
identify datasets and archival techniques. Singu- scholar who explains that although new datasets
larly thinking of big data as a unit of measurement and technologies are viewed as positive changes
or a size fails to underscore the many uses and with big promises early in their trajectory, as more
methods used by Communications to explore big information is learned about their effects, scholars
datasets. often begin to challenge their use and ability to
One frequent source of big data analysis in provide insight.
Communications is that of network analysis or Communications scholars often refer to big data
social network analysis. This method is used to as the “datafication” of society, meaning turning
explore the ways in which individuals are everyday interactions and experiences into quanti-
connected in physical and digital spaces. Commu- fiable data that can be segmented and analyzed
nications research on social networks particularly using brad techniques. This in particular refers to
investigates how close individuals are to each analyzing data that has not been previously viewed
other, whom they are connected through, and as data before. Although this is partially where the
what resources can be shared amongst networks. value of big data develops from, for

Communications researchers, this complicates the ability to think holistically or qualitatively.

Specifically, big datasets in Communications research include information taken from social media sites, health records, media texts, political polls, and brokered language transcriptions. The wide variety of types of datasets reflects the truly broad nature of the discipline and its subfields.

Malcom Parks offers suggestions on the future of big data research within the field of Communications. First, the field must situate big data research within larger theoretical contexts. One critique of the data revolution is the false identification of this form of analysis as being new. Rather than consider big data as an entirely new phenomenon, by situating it within a larger history of Communications theory, more direct comparisons between past and present datasets can be drawn. Second, the field requires more attention to the topic of validity in big data analysis. While quantitative and statistical measurements can support the reliability of a study, validity asks researchers to provide examples or other forms of support for their conclusions. This greatly challenges the ethical notions of anonymity in big data, as well as the consent process for individual protections. This is one avenue in which the quality of big data research needs more work within the field of communications.

Communications asserts that big data is an important technological and methodological advancement within research; however, due to its newness, researchers need to exercise caution when considering its future. Specifically, researchers must focus on the ethics of inclusion in big datasets, along with the quality of analysis and the long-term effects of this type of dataset on society.

Further Reading

Burns, R. W. (2003). Communications: An international history of the formative years. New York: IEE History of Technology Series.
Levy, P. (1997). Collective intelligence: Mankind's emerging world in cyberspace. New York: Perseus Books.
Parks, M. R. (2014). Big data in communication research: Its contents and discontents. Journal of Communication, 64, 355–360.

Community Management

▶ Content Moderation

Community Moderation

▶ Content Moderation

Complex Event Processing (CEP)

Sandra Geisler
Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany

Synonyms

Complex event recognition; Event stream processing

Overview

In the CEP paradigm, simple and complex events can be distinguished. A simple event is the representation of a real-world occurrence, such as a sensor reading, a log entry, or a tweet. A complex event is a composite event, also called a situation, which has been detected by identifying a pattern based on the input stream values, which may constitute either simple or complex events. As an example for a situation, we consider detecting a material defect in a production line process based on the specific reading values of multiple hardware sensors, that is, the thickness and the flexibility. If the thickness is less than 0.3 and at the same time the flexibility is higher than 0.8, a material defect has been detected. An event is characterized by a set of attributes and additionally contains one or more timestamps indicating the time the event has been produced (either assigned by the original source or by the CEP system) or the

duration of its validity, respectively. In our example, the complex event could look like this: defect(timestamp, partid). The timestamps play a very important role, as especially in CEP systems their time-based relationship in terms of order or parallel occurrence may be crucial to detect a certain complex event. An event is instantiated when the attributes are filled with concrete values and can be represented in different data formats, for example, a relational schema or a nested schema format. This instantiation is also called a tuple. Events with the same set of attributes are subsumed under an event type, for example, defect. Events are created by an event producer which observes a source, for example, a sensor. A potentially unbounded series of events coming from the same source is termed an event stream, which may contain events of different types (heterogeneous event stream) or contains only the same event type (homogeneous event stream) (Etzion and Niblett 2011). The events are pipelined into an event network consisting of operators processing the events and routing them according to the task they fulfill. These operators usually comprise tasks to filter and transform events and detect patterns on them. The network represents rules which determine which complex events should be detected and what should happen if they have been detected. The rule processing can be separated into two phases: the detection phase (does the event lead to a rule match?) and the production phase (what is the outcome, what has to be done if a rule matched?) (Cugola and Margara 2012b). Usually, also a history of events is kept to represent the context of an event and to keep partial results (if a rule has not been completely matched yet). Finally, event consumers or sinks wait for notifications that a complex event has been detected, which in our example could be a monitoring system alerting the production manager, or it may be used to sum up the number of defects in a certain time period. Hence, CEP systems are also regarded as extensions of the Publish-Subscribe scheme (Cugola and Margara 2012b), where producers publish data and consumers simply filter out the data relevant to them. In some systems, the produced complex events can serve as an input to further rules, enabling event hierarchies and recursive processing. CEP is also related to the field of Data Stream Management Systems (DSMS). DSMS constitute a more general way to process data streams and enable generic user queries, while CEP systems focus on the detection of specific contextual composite occurrences using ordering relationships and patterns.

Key Research Findings

Query Components
As described, the filtering, transformation, and pattern detection on events can be expressed using operators, usually combined in the form of rules. The term rule is often used interchangeably with the term query. Depending on the system implementation, rules may be added at design time or run time. Rules are either evaluated when an event arrives (event-based) or on a regular temporal basis (time-based).

The filtering of events only allows certain events of interest to participate in the following processing steps. The filter operator is either applied as part of another operator or before the next operators (as a separate operator). A filter expression can be applied to the metadata of an event or the content of an event and can be stateless or stateful. Stateless filters only process one event at a time. Stateful filters are applied to a certain time context (i.e., a time window) where, for example, the first x elements, the most recent x elements, or a random set of elements in the time context are dropped to reduce the amount of data flowing through the system. This is often done in conjunction with performance measures and policies to fulfill latency requirements (also called load shedding).

Transformation operators take an input and produce different result events based on this input. They can also either be stateless (only one input at a time is processed) or stateful (based on multiple input elements). Common operators are projection, value assignment to attributes, copying, modifying, or inserting new attributes, data enrichment, splitting into several event streams, and merging of event streams (join).
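To make these building blocks concrete, the following short Python sketch (not part of the original entry) wires together a stateless filter, a time-window context, and a simple conjunction pattern for the material-defect example from the Overview. The event format, the 2-second window, and all function names are illustrative assumptions.

```python
from collections import deque

WINDOW = 2.0  # seconds; assumed size of the time context

def is_relevant(event):
    """Stateless filter: only the two sensor readings participate further."""
    return event["type"] in ("thickness", "flexibility")

def detect_defects(stream):
    """Conjunction pattern: thickness < 0.3 AND flexibility > 0.8
    for the same part within one time window emits a complex event."""
    history = deque()  # stateful context of recent simple events
    for event in filter(is_relevant, stream):
        history.append(event)
        # Drop events that fell out of the time window.
        while history and event["ts"] - history[0]["ts"] > WINDOW:
            history.popleft()
        thin = {e["part"] for e in history
                if e["type"] == "thickness" and e["value"] < 0.3}
        flexible = {e["part"] for e in history
                    if e["type"] == "flexibility" and e["value"] > 0.8}
        for part in thin & flexible:
            yield {"type": "defect", "ts": event["ts"], "part": part}

stream = [
    {"type": "thickness",   "ts": 0.5, "part": "p1", "value": 0.25},
    {"type": "temperature", "ts": 0.7, "part": "p1", "value": 80.0},
    {"type": "flexibility", "ts": 1.1, "part": "p1", "value": 0.92},
]
print(list(detect_defects(stream)))  # -> one complex 'defect' event for p1
```

In a real engine, consumption policies (discussed below) would additionally control whether the same simple events may contribute to further matches.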

To define pattern detection mechanisms, a set of common operators is used. Usually, conjunction (all events happen in the specified time), disjunction (at least one event happens in the specified time), sequence (the events happen sequentially), the Kleene operator (the event may happen zero or multiple times), or negation (the event does not occur in the specified time) are used to formulate queries to detect composite events in a specific time frame. So-called functor patterns apply aggregation functions such as average, min, max, count, standard deviation, etc., and compare the results against a certain threshold (Etzion and Niblett 2011). Other patterns select events based on values of an attribute in a top-k manner in the set of events to be processed in this step. Finally, patterns with regard to time and space, so-called dimensional patterns, can be defined (Etzion and Niblett 2011). Temporal patterns include the detection of a sequence of events, the top k events in terms of time (the most recent k or first k events in a time period), and trend patterns (a set ordered by time fulfills a criterion, e.g., a value is increasing, decreasing, remains the same, etc.). Spatial patterns are applied to spatial attributes and may fire when events fulfill a certain spatial pattern, such as a minimum, maximum, or average distance between two events, and can also be combined with temporal aspects, such as detecting spatial trends over time.

An often required feature of a query language is also the combination of streaming data with static, historical data. Some query languages offer different representations for such inputs, such as CQL or StreamSQL. The definition of selection policies, that is, whether a rule is fired only once, k times, or each time a pattern has been matched (Cugola et al. 2015), may also be of interest to control the production of events. Similarly, some languages support restricting events as input if they do not fulfill a certain contiguity, for example, the rule is only matched if two combined events are contiguous (Flouris et al. 2017). Additionally, consumption policies can be defined, which control if an input event may be used in the same rule pattern again or if it should be forgotten after it has led to a match.

Query Languages
There are different ways to distinguish the available CEP query languages and how rules are expressed. On the one hand, the way complex events are retrieved, that is, declaratively or imperatively, can be distinguished. Declarative languages describe what should be the result of the query, and these are often variants of SQL or very similar to it, such as the Continuous Query Language (CQL) or the SASE+ language. Imperative languages describe how a result is retrieved and are often manifested in code or visual representations of it, that is, operators are visual components which can be connected to each other to process the event streams. On the other hand, languages can be distinguished based on their theoretical foundation to describe the most important operators. Etzion and Niblett differentiate roughly stream-oriented (e.g., CQL) and rule-oriented languages (e.g., Drools).

Eckert, Bry, Brodt, Poppe, and Hausmann (2011) define more fine-granulated language categories, which we will summarize here briefly. Composition-based languages use a combination or nesting of operators to compose events, such as conjunction, negation, and disjunction, to detect events in a certain time frame. A well-known example for this category is the SASE+ language (http://avid.cs.umass.edu/sase). Data stream management languages are mainly SQL-based declarative languages which can be used to define CEP-style queries in DSMS. The Continuous Query Language (CQL) (Arasu et al. 2006) is a famous deputy of this class comprising the aforementioned operators. It is used in various systems, such as the STREAM system, Oracle CEP, or an extension of Apache Spark (https://github.com/Samsung/spark-cep). Other languages of this type are ESL, Esper, StreamInsight (LINQ), or SPADE. Further language types contain state-machine based languages which utilize formalizations of finite state machines to describe rules detecting the pattern. The events may lead to state transitions (in a certain order) where the complex event is detected when a specific state is reached. Production rule-based languages define the pattern in terms of if-then rules (usually using a higher level programming language such

as Java) where events are represented as facts or objects and are matched against defined rules. An example is the Business Rules Management System Drools (https://www.drools.org). If a match is found, a corresponding action is evoked (namely, the creation of the complex event). Finally, languages based on logical languages, such as Prolog, allow for the definition of queries and pattern recognition tasks using corresponding rules and facts.
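The two broad styles can be contrasted with a small sketch. The query strings below are hypothetical pseudo-syntax written in the spirit of a CQL/StreamSQL-flavored declarative query and a Drools-flavored production rule; they are not verbatim examples from any of the cited systems, and the stream, type, and attribute names are assumptions.

```python
# Declarative, stream-oriented style (CQL/StreamSQL-flavored pseudo-syntax):
DEFECT_QUERY = """
SELECT s.part, s.ts
FROM   SensorReadings [RANGE 2 SECONDS] AS s
WHERE  s.thickness < 0.3 AND s.flexibility > 0.8
"""

# Rule-oriented, production-rule style (Drools-flavored pseudo-syntax):
DEFECT_RULE = """
rule "MaterialDefect"
when
    t : Thickness(value < 0.3)
    f : Flexibility(value > 0.8, part == t.part)
then
    insert(new Defect(t.part));
end
"""
```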
Time and Order
We already emphasized the prominent status of timestamps, as time may play an important role in detecting complex events, expressed in the variety of temporal and spatiotemporal operators. A timestamp is always handled as a specific attribute which is not part of the common attribute set, for example, in the TESLA language (Cugola et al. 2015). A monotonic domain for time can be defined as an ordered, infinite set of discrete time instants. For each timestamp there exists a finite number of tuples (but it can also be zero). In the literature, there exist several ways to distinguish where, when, and how timestamps are assigned. First of all, the temporal domain from which the timestamps are drawn can be either a logical time domain or a physical clock-time domain. Logical timestamps can be simple consecutive integers, which do not contain any date or time information, but just serve for ordering. In contrast, physical clock-time includes time information (e.g., using UNIX timestamps). Furthermore, systems differ in which timestamps they accept and use for internal processing (ordering and windowing). In most of the systems implicit timestamps, also called internal timestamps or system timestamps, are supported. Implicit timestamps are assigned to a tuple when it arrives at the CEP system. This guarantees that tuples are already ordered by arrival time when they are pipelined through the system. Implicit timestamps also allow for estimating the timeliness of the tuple when it is output. Besides a global implicit timestamp (assigned on arrival), there exists also the concept of new (local) timestamps assigned at the input or output of each operator (time of tuple creation). In contrast, explicit timestamps, external timestamps, or application timestamps are created by the sources, and an attribute of the stream schema is determined to be the timestamp attribute. Additionally, in the Stream Mill language ESL there exists the concept of latent timestamps. Latent timestamps are assigned on demand (lazily), that is, only for operations dependent on a timestamp such as windowed aggregates, while explicit timestamps are assigned to every tuple. An interesting question is how timestamps should be assigned to results of, for example, binary operators and aggregates to ensure semantic correctness. The first option is to use the creation time of an output tuple when using an implicit timestamp model. The second option is to use the timestamp of the first stream involved in the operator, which is suited for explicit and implicit timestamp models. For aggregates, similar considerations can be made. For example, if a continuous or windowed minimum or maximum is calculated, the timestamp of the maximal or minimal tuple, respectively, could be used. When a continuous sum or count is calculated, the creation time of the result event or the timestamp of the latest element included in the result can be used. If an aggregate is windowed, there exist additional possibilities. The smallest or the highest timestamp of the events in the window can be used, as they reflect the oldest timestamp or most recent timestamp in the window, respectively. Both may be interesting when timeliness for an output tuple is calculated, but which one to use depends obviously on the desired outcome. Another possibility would be to take the median timestamp of the window.

Many of the systems and their operators rely on (and assume) the ordered arrival of tuples in increasing timestamp order to be semantically correct (also coined as the ordering requirement). But as already pointed out, this cannot be guaranteed, especially for explicit timestamps and data from multiple sources. In the various systems basically two main approaches to the problem of disorder have been proposed. One approach is to tolerate disorder in controlled bounds. For example, a slack parameter is defined for order-sensitive operators denoting how many out-of-order tuples may arrive between the last and the next in-order event. All further out-of-order tuples will be discarded. The second way

to handle disorder is to dictate the order of tuples and reorder them if necessary. While the use of implicit timestamps is a simple way of ordering tuples on arrival, the application semantics often requires the use of explicit timestamps though. Heartbeats are events sent with the stream including at least a timestamp. These markers indicate to the processing operators that all following events have to have a timestamp greater than the timestamp in the punctuation. Some systems buffer elements and output them in ascending order as soon as a heartbeat is received, that is, they are locally sorted. The sorting can be either integrated in a system component or be a separate query operator. Heartbeats are only one possible form of punctuation. Punctuations, in general, can contain arbitrary patterns which have to be evaluated by operators to true or false. Therefore, punctuations can also be used for approximation. They can limit the evaluation time or the number of tuples which are processed by an otherwise blocking or stateful operator. Other methods are, for example, compensation-based techniques, where operators in the query are executed in the same way as if all events would be ordered, or approximation-based techniques, where either streams are summarized and events are approximated, or approximation is done on a recent history of events (Giatrakos et al. 2019).
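The heartbeat-based strategy can be sketched in a few lines. The snippet below is an illustration, not code from any of the cited systems: a buffer releases events in timestamp order whenever a heartbeat (punctuation) arrives and simply discards late arrivals, which corresponds to tolerating a slack of zero; names and the event format are assumptions.

```python
import heapq

class ReorderBuffer:
    """Buffers events and emits them in timestamp order on heartbeats."""

    def __init__(self):
        self.heap = []        # min-heap ordered by event timestamp
        self.watermark = 0.0  # latest heartbeat timestamp seen so far

    def insert(self, event):
        if event["ts"] < self.watermark:
            return  # late arrival beyond the allowed slack: discarded
        heapq.heappush(self.heap, (event["ts"], id(event), event))

    def heartbeat(self, ts):
        """All later events are guaranteed to carry timestamps greater than ts."""
        self.watermark = max(self.watermark, ts)
        released = []
        while self.heap and self.heap[0][0] <= ts:
            released.append(heapq.heappop(self.heap)[2])
        return released

buf = ReorderBuffer()
buf.insert({"ts": 3.0, "type": "thickness"})
buf.insert({"ts": 1.0, "type": "flexibility"})
print([e["ts"] for e in buf.heartbeat(5.0)])  # -> [1.0, 3.0]
```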
Rule Evaluation Strategies
There are basically two strategies to evaluate rules defined in the CEP system (Cugola et al. 2015). Either the rules are evaluated incrementally on each incoming event, or the processing is delayed until events in the history fulfill all conditions. The latter requires that all primitive events are stored until a rule fires, which may reduce latency (Cugola and Margara 2012a). The first option is the usual case, and there have been proposed different strategies how the partial matches are stored (Flouris et al. 2017). Many systems use non-deterministic finite automata or finite state machines, where each state represents a partial or complete match and incoming events trigger state transitions. Further structures comprise graphs, where the leaves represent the simple events which are incrementally combined to partial matches representing the inner nodes. The root node constitutes the overall match of the rule. Similarly, events may be pipelined through operator networks which forward and transform the events based on their attributes (Etzion and Niblett 2011). Finally, graphs are also used as a combination of all activated rules to detect event dependencies (Flouris et al. 2017).
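A minimal illustration of the incremental, state-based strategy follows. The sketch is ours, not taken from the cited systems: it tracks partial matches of the sequence pattern thickness → flexibility for the same part and reports a complete match as soon as the final state is reached; removing the opening event on a match corresponds to a simple consume-once policy.

```python
def sequence_matches(stream):
    """Incremental evaluation of the sequence SEQ(thickness, flexibility)."""
    partial = {}  # part id -> thickness event that opened a partial match
    for event in stream:
        if event["type"] == "thickness" and event["value"] < 0.3:
            # New partial match: wait for a matching flexibility reading.
            partial[event["part"]] = event
        elif event["type"] == "flexibility" and event["value"] > 0.8:
            first = partial.pop(event["part"], None)  # consume the partial match
            if first is not None and first["ts"] <= event["ts"]:
                # Final state reached: report the complete match.
                yield (first, event)

stream = [
    {"type": "thickness",   "ts": 0.5, "part": "p1", "value": 0.25},
    {"type": "flexibility", "ts": 1.1, "part": "p1", "value": 0.92},
]
print(list(sequence_matches(stream)))
```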
Further Directions for Research

Uncertainty in CEP
Uncertainty has been studied intensively for Database Management Systems, and several systems implementing probabilistic query processing have been proposed. Also for CEP this is an interesting aspect, as it often handles data from erroneous sources and uncertainty may be considered on various levels. For data streams in general, Kanagal and Deshpande distinguished two main types of uncertainties. First, the existence of a tuple in a stream can be uncertain (how probable is it for this tuple to be present at the current time instant?), which is termed tuple existence uncertainty. Second, the value of an attribute in a tuple can be uncertain, which is called attribute value uncertainty (what is the probability of attribute X to have a certain value?). The latter aspect is crucial, as many data sources may inherently create erroneous data, such as sensors or other devices, or are estimates of a certain kind. Naturally, both aspects have also been considered specifically for CEPs (Flouris et al. 2017; Cugola et al. 2015). There are two ways to represent attribute value uncertainty. The attribute value can be modeled as a random variable which is accompanied by a corresponding probability density function (pdf) describing the deviation from the exact value. The second option is to attach to each attribute value a concrete probability value. In consequence, the tuple existence uncertainty for a certain configuration of values may be modeled by a joint distribution function, which multiplies the probabilities of the single attribute values. Finally, each event may then have a value which indicates the probability of its occurrence (Cugola et al. 2015). Depending on the

assumption if attributes are independent of each other, probability values can be propagated to complex events depending on the operators applied (combination, aggregation, etc.). A further instance of this problem is temporal uncertainty. The time of the occurrence of an event, synchronization problems between clocks of different processing nodes, or different granularities of event occurrences may be observed. Zhang, Diao, and Immerman introduced a specific temporal uncertainty model to express the uncertain time of occurrence by an interval. Another level of uncertainty can be introduced in the rules. For example, Cugola et al. use Bayesian networks to model the relationships between simple and complex events in the rules to reflect the uncertainty which occurs in complex systems. How uncertainty is handled on the data and rule level can again be categorized based on the theoretical foundation. Alevizos, Skarlatidis, Artikis, and Paliouras (2017) distinguish between automata-based approaches, first-order logic and graph models, petri nets, and syntactical approaches using grammars.
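Under the common independence assumption, propagating occurrence probabilities to a composite event reduces to a multiplication, as in this illustrative sketch (the probability field, the threshold, and the function name are our assumptions, not prescriptions from the cited literature):

```python
def conjunction_probability(events):
    """Occurrence probability of a complex event built by conjunction,
    assuming the member events are independent."""
    p = 1.0
    for event in events:
        p *= event.get("prob", 1.0)  # tuple existence / attribute confidence
    return p

members = [
    {"type": "thickness",   "value": 0.25, "prob": 0.9},
    {"type": "flexibility", "value": 0.92, "prob": 0.8},
]
p_defect = conjunction_probability(members)
print(p_defect)         # 0.72
print(p_defect >= 0.5)  # e.g., only report the complex event above a threshold
```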
Rule Learning
An interesting direction to follow in CEP is the automatic learning of rules. On the one hand, the definition of rules is usually a lengthy manual task, which is done by domain experts and data scientists. It is done at design time and may need multiple cycles to adapt the rule to the use case at hand. On the other hand, depending on the data, the rules to detect certain complex events might not be obvious. Hence, it might not be possible to define a rule from scratch or in a reasonable time. In both cases, it is desirable to learn rules from sample data to support the process of rule definition. This can be done by using labeled historical data. Margara et al. (2014) identified several aspects which need to be considered to learn a rule, for example, the time frame or the event types and attributes to be considered. Consequently, they build custom learners for each identified subproblem (or constraint) and use labeled historical data divided into positive and negative event traces which either match or do not match a complex event. Other approaches use machine learning approaches specifically suited for data streams. For example, Mehdiyev et al. (2015) compare different rule-based classifiers to detect event patterns for CEP to derive rules for activity detection based on accelerometer data in phones. Usually, CEP is reactive, that is, the detected complex events lie in the past. Mousheimish, Taher, and Zeitouni present an approach for how predictive rules for CEP could be learned, such that events in the near future could be predicted. They use mining on labeled multivariate time series to create rules which are installed online, that is, are activated during run time.

Scalability
As CEPs are systems which operate on data streams, adaptability to varying workloads is important. Looking at the usual single centralized systems, there are multiple levels which can be considered. If the input load is too high and completeness of input data may not be of major importance, data may be sampled to decrease the system load. This can be done by measuring system performance using QoS parameters, such as the output latency or the throughput. Based on these measures, for example, load shedders integrated in the event processor or directly into operator implementations may drop events when performance cannot be kept on an acceptable level. Besides load balancing and load shedding, parallelization techniques can be applied to increase the performance of a system. Giatrakos et al. (2019) distinguish two kinds of parallelization for CEP, namely, task and data parallelization, where task parallelization comprises the distribution of queries and subqueries or single operators to different nodes where they are executed. Data parallelization considers the distribution of data to multiple instances of the same operator or query. A CEP system should be scalable in terms of queries, as some application contexts can get complex and require the introduction of several queries at the same time. Hence, for parallelization the execution of multiple queries can be distributed to different processing units. Furthermore, a query can be divided into subqueries or single operators and the parts can be distributed over multiple threads, parallelizing their
execution, bringing up the need for intra- and multiquery optimization, for example, by sharing the results of operators in an overall query plan. In data parallelization, the data is partitioned and distributed to equal instances of operators or subqueries, and the results are merged in the end. A possible strategy to implement CEP in a scalable way is elevating CEP systems to big data platforms, as these are designed to serve high workloads through workload distribution and elastic services. Giatrakos et al. (2019) show how CEP can be integrated with Spark Streaming, Apache Flink, and Apache Storm, taking advantage of the corresponding abilities for scalability.
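The following sketch illustrates the data-parallelization idea in plain Python, independent of any of the platforms named above: events are hash-partitioned by a key across several instances of the same operator, and the partial results are merged. The event fields and function names are illustrative assumptions.

# Sketch of data parallelization: events with the same key (here, sensor id) are routed
# to the same operator instance; partial results are merged afterwards.

from collections import defaultdict

def partition(events, num_instances, key):
    """Hash-partition the event stream so that all events sharing a key reach one instance."""
    parts = defaultdict(list)
    for event in events:
        parts[hash(event[key]) % num_instances].append(event)
    return parts

def count_alarms(partition_events):
    """Example operator instance: count alarm events per sensor."""
    counts = defaultdict(int)
    for event in partition_events:
        if event["level"] == "alarm":
            counts[event["sensor"]] += 1
    return counts

def merge(partials):
    merged = defaultdict(int)
    for partial in partials:
        for sensor, count in partial.items():
            merged[sensor] += count
    return dict(merged)

events = [{"sensor": "s1", "level": "alarm"}, {"sensor": "s2", "level": "ok"},
          {"sensor": "s1", "level": "alarm"}, {"sensor": "s3", "level": "alarm"}]
partials = [count_alarms(evts) for evts in partition(events, num_instances=2, key="sensor").values()]
print(merge(partials))  # counts per sensor: s1 -> 2, s3 -> 1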
Cross-References

▶ Big Data Quality
▶ Data Processing
▶ Data Scientist
▶ Machine Learning
▶ Metadata

Further Reading

Alevizos, E., Skarlatidis, A., Artikis, A., & Paliouras, G. (2017). Probabilistic complex event recognition: A survey. ACM Computing Surveys (CSUR), 50(5), 71.
Arasu, A., Babu, S., & Widom, J. (2006). The CQL continuous query language: Semantic foundations and query execution. The VLDB Journal, 15(2), 121–142.
Cugola, G., & Margara, A. (2012a). Low latency complex event processing on parallel hardware. Journal of Parallel and Distributed Computing, 72(2), 205–218.
Cugola, G., & Margara, A. (2012b). Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR), 44(3), 15.
Cugola, G., Margara, A., Matteucci, M., & Tamburrelli, G. (2015). Introducing uncertainty in complex event processing: Model, implementation, and validation. Computing, 97(2), 103–144. https://doi.org/10.1007/s00607-014-0404-y.
Eckert, M., Bry, F., Brodt, S., Poppe, O., & Hausmann, S. (2011). A CEP babelfish: Languages for complex event processing and querying surveyed. In Reasoning in event-based distributed systems (pp. 47–70). Berlin/Heidelberg: Springer.
Etzion, O., & Niblett, P. (2011). Event processing in action. Greenwich: Manning.
Flouris, I., Giatrakos, N., Deligiannakis, A., Garofalakis, M., Kamp, M., & Mock, M. (2017). Issues in complex event processing: Status and prospects in the big data era. Journal of Systems and Software, 127, 217–236.
Giatrakos, N., Alevizos, E., Artikis, A., Deligiannakis, A., & Garofalakis, M. (2019). Complex event recognition in the big data era: A survey. The VLDB Journal, 29, 313. https://doi.org/10.1007/s00778-019-00557-w.
Kanagal, B., & Deshpande, A. (2009). Efficient query evaluation over temporally correlated probabilistic streams. In IEEE 25th international conference on data engineering (ICDE'09) (pp. 1315–1318).
Luckham, D. C., & Frasca, B. (1998). Complex event processing in distributed systems (Technical report) (Vol. 28). Stanford: Computer Systems Laboratory, Stanford University.
Margara, A., Cugola, G., & Tamburrelli, G. (2014). Learning from the past: Automated rule generation for complex event processing. In Proceedings of the 8th ACM international conference on distributed event-based systems (pp. 47–58).
Mehdiyev, N., Krumeich, J., Enke, D., Werth, D., & Loos, P. (2015). Determination of rule patterns in complex event processing using machine learning techniques. Procedia Computer Science, 61, 395–401.
Mousheimish, R., Taher, Y., & Zeitouni, K. (2017). Automatic learning of predictive CEP rules: Bridging the gap between data mining and complex event processing. In Proceedings of the 11th ACM international conference on distributed and event-based systems (pp. 158–169).
Zhang, H., Diao, Y., & Immerman, N. (2010). Recognizing patterns in streams with imprecise timestamps. Proceedings of the VLDB Endowment, 3(1–2), 244–255.


Complex Event Recognition

▶ Complex Event Processing (CEP)


Complex Networks

Ines Amaral
University of Minho, Braga, Minho, Portugal
Instituto Superior Miguel Torga, Coimbra, Portugal
Autonomous University of Lisbon, Lisbon, Portugal

In recent years, the emergence of a large amount of data dispersed in several types of databases enabled the extraction of information on a never before seen scale. Complex networks allow the
connection of a vast amount of scattered and unstructured data in order to understand relations, construct models for their interpretation, analyze structures, detect patterns, and predict behaviors. The study of complex networks is multidisciplinary and covers several knowledge areas, such as computer science, physics, mathematics, sociology, and biology. Within the context of the Theory of Complex Networks, a network is a graph that represents a set of nodes connected by edges, which together form a network. This network or graph can represent relationships between objects or agents. Graphs can be used to model many types of relations and processes in physical, biological, social, and information systems.

A graph is a graphical representation of a pattern of relationships and is used to reveal and quantify important structural properties. In fact, graphs identify structural patterns that cannot be detected otherwise. The representation of a network or a graph consists of a set of nodes (vertices) that are connected by lines, which may be arcs or edges, depending on the type of relationship under study. Matrices are an alternative way to represent and summarize network data, containing exactly the same information as a graph.
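As a brief illustration of this equivalence, the sketch below (assuming the open-source NetworkX library; the names are arbitrary) expresses the same small network as an edge list, a graph object, and an adjacency matrix, and reads simple structural properties from it.

# Same relational data as an edge list, a graph object, and an adjacency matrix.
import networkx as nx

edges = [("Ana", "Bruno"), ("Ana", "Carla"), ("Bruno", "Carla"), ("Carla", "Diego")]

G = nx.Graph()
G.add_edges_from(edges)

print(nx.to_numpy_array(G, nodelist=sorted(G.nodes)))  # adjacency matrix
print(dict(G.degree()))                                # edges incident to each node
print(nx.clustering(G))                                # local clustering coefficient per node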
Studies of complex networks have their origin in Graph Theory and in Network Theory. In the eighteenth century, the Swiss mathematician Euler developed the founding bases of Graph Theory. Euler solved the problem of the bridges of Königsberg through the modeling of a graph that transformed the paths into straight lines and their intersections into points. It is considered that this was the first graph developed. The primacy of relations is explained in the work of Georg Simmel, a German sociologist who is often nominated as the theoretical antecedent of Network Analysis. Simmel argued that the social world was the result of interactions and not the aggregation of individuals. The author argued that society was no more than a network of relationships, taking the intersection of these relationships as the basis for defining the characteristics of social structures and individual units.

The modeling of complex networks is supported by the mathematical formalism of Graph Theory. Studying the topology of networks through Graph Theory, formalist authors seek to analyze situations in which the phenomena in question establish relations among themselves. The premise is that everything is connected and nothing happens in isolation, which is based on the formalist perspective of Network Theory.

Network Theory approaches the study of graphs as a representation of either symmetric or asymmetric relationships between objects. This theory assumes the perspective that social life is relational, which suggests that attributes by themselves have no meaning that can explain social structures or other networks. The focus of the analysis is the relationships established in a given system. Thus, the purpose of the different methodologies within Network Theory is to detect, accurately and systematically, patterns of interaction.

In the formalist perspective of Network Theory, a system is complex when its properties are not a natural consequence of its isolated elements. In this sense, the theoretical proposal is the application of models in order to identify common patterns of interaction in systems. The network models designed by authors of formalist inspiration have been used in numerous investigations and can be summarized in three different perspectives: random networks, small-world networks, and scale-free networks.

The model of random networks was proposed by Paul Erdös and Alfred Rényi in 1959 and is considered the simplest model of complex systems. The authors argued that the process of formation of networks was random. Assuming as true the premise that nodes aggregate randomly, the researchers concluded that all actors in a network have a number of close links and the same probability of establishing new connections. The theory focuses on the argument that the more complex the network, the greater the probability that its construction is random. From the perspective of Erdös and Rényi, the formation of networks is based on two principles: the equality or democracy of networks (all nodes have the same probability of belonging to the network) and the transition (from isolation to connectivity).
Barabási (2003) and other authors argue that the theory of randomness cannot explain the complex networks that exist in the world today.

Watts and Strogatz proposed the small-world model in 1998. The model assumes as its theoretical basis Milgram's studies of "small worlds," which argued that 5.2 degrees of separation mediate the distance between any two people in the world, and Granovetter's theories on the weak social ties between individuals, their structural importance, and the influence they have on the evolution and dynamics of networks. The researchers created a model where some connections were established by proximity and others randomly, which transforms networks into small worlds. Watts and Strogatz found that the separation increases much more slowly than the network grows. This theory, called the "small-world effect" or "neighborhood effect," argues that in contexts where there are very closely connected members, actors bind so that there are few intermediaries. Therefore, there is a high degree of clustering and a reduced distance between the nodes. According to the model developed by Watts and Strogatz, the average distance between any two people would not exceed a small number of other people, requiring only that there were a few random links between groups.

In a study that sought to assess the feasibility of applying the Theory of Small Worlds to the World Wide Web, Barabási and Albert demonstrated that networks are not formed randomly. The researchers proposed the model of "scale-free networks," which is grounded on the argument that networks that evolve are based on mechanisms of preferential attachment. Like Granovetter's theories and the studies of Watts and Strogatz, Barabási and Albert argued that there is an order in the dynamic structure of networks, and they defined preferential attachment as a structuring pattern in which the "rich get richer": the more connections a node has, the higher the probability of it acquiring more links. The model of scale-free networks is based on growth and preferential attachment. In this type of network, the main feature is the unequal distribution of connections among agents and the trend for new nodes to connect to others who already have a high degree of connectivity. Power laws are associated with this specific symmetry.
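These three model families can be generated and compared directly. The sketch below, assuming the NetworkX library, builds a random, a small-world, and a scale-free graph of the same size and prints the clustering coefficient, the average path length over the largest connected component, and the maximum degree, where their characteristic differences show up.

# Compare Erdős–Rényi, Watts–Strogatz, and Barabási–Albert graphs of the same size.
import networkx as nx

n = 1000
models = {
    "random (Erdos-Renyi)": nx.erdos_renyi_graph(n, p=0.01, seed=42),
    "small-world (Watts-Strogatz)": nx.watts_strogatz_graph(n, k=10, p=0.1, seed=42),
    "scale-free (Barabasi-Albert)": nx.barabasi_albert_graph(n, m=5, seed=42),
}
for name, g in models.items():
    giant = g.subgraph(max(nx.connected_components(g), key=len))  # largest component
    print(name,
          "clustering:", round(nx.average_clustering(g), 3),
          "avg path length:", round(nx.average_shortest_path_length(giant), 2),
          "max degree:", max(d for _, d in g.degree()))

Typically the small-world graph combines high clustering with short paths, while the scale-free graph stands out through a few very highly connected hubs.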
The Theory of Complex Networks emerged in the 1990s due to the Internet and to computers capable of processing big data. Despite the similarities, the Theory of Complex Networks differs from Graph Theory in three basic aspects: (i) it is related to the modeling of real networks, through the analysis of empirical data; (ii) the networks are not static, but evolve over time, changing their structure; (iii) the networks are structures in which dynamic processes (such as the spread of viruses or opinions) can be simulated.

The Theory of Complex Networks is currently widely applied both to the characterization and to the mathematical modeling of complex systems. Complex networks can be classified according to statistical properties such as the degree or the clustering coefficient. There are several tools to generate graphical representations of networks. Visualization of complex networks can represent large-scale data and enhance their interpretation and analysis.

Big data and complex networks share three properties: large scale (volume), complexity (variety), and dynamics (velocity). Big data can change the definition of knowledge, but by itself it is not self-explanatory. Therefore, the ability to understand, model, and predict behavior using big data can be provided by the Theory of Complex Networks.

As mathematical models of simpler networks do not display the significant topological features, modeling big data as complex networks can facilitate the analysis of multidimensional networks extracted from massive data sets. The clustering of data in networks provides a way to understand and obtain relevant information from large data sets, which allows learning, inferring, predicting, and gaining knowledge from large volumes of dynamic data sets.

Complex networks may promote collaboration between many disciplines towards large-scale
information management. Therefore, computational, mathematical, statistical, and algorithmic techniques can be used to model high-dimensional data, large graphs, and complex data in order to detect structures, communities, patterns, locations, and influence, and to model transmissions in interdisciplinary research at the interface between big data analysis and complex networks. Several areas of knowledge can benefit from the use of complex network models and techniques for analyzing big data.

Cross-References

▶ Computational Social Sciences
▶ Data Visualization
▶ Graph-Theoretic Computations/Graph Databases
▶ Network Analytics
▶ Network Data
▶ Social Network Analysis
▶ Visualization

Further Reading

Barabási, A.-L. (2003). Linked. Cambridge, MA: Perseus Publishing.
Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509.
Bentley, R. A., O'Brien, M. J., & Brock, W. A. (2014). Mapping collective behavior in the big-data era. Behavioral and Brain Sciences, 37, 63.
Boccaletti, S., et al. (2006). Complex networks: Structure and dynamics. Physics Reports, 424(4–5), 175.
McKelvey, K., et al. (2012). Visualizing communication on social media: Making big data accessible. arXiv preprint arXiv:1202.1367.
Strogatz, S. H. (2001). Exploring complex networks. Nature, 410(6825), 268.
Watts, D. (2003). Six degrees: The science of a connected age. New York: Norton.
Watts, D. (2004). The "new" science of networks. Annual Review of Sociology, 30(1), 243.


Computational Ontology

▶ Ontologies


Computational Social Sciences

Ines Amaral
University of Minho, Braga, Minho, Portugal
Instituto Superior Miguel Torga, Coimbra, Portugal
Autonomous University of Lisbon, Lisbon, Portugal

Computational social sciences is a research discipline at the interface between computer science and the traditional social sciences. This interdisciplinary and emerging scientific field uses computational methods to analyze and model social phenomena, social structures, and collective behavior. The main computational approaches to the social sciences are social network analysis, automated information extraction systems, social geographic information systems, complexity modeling, and social simulation models.

New areas of social science research have arisen due to the existence of computational and statistical tools, which allow social scientists to extract and analyze large datasets of social information. Computational social science diverges from conventional social science because of its use of mathematical methods to model social phenomena. As an intersection of computer science, statistics, and the social sciences, computational social science is an interdisciplinary subject which uses large-scale demographic, behavioral, and network data to analyze individual activity, collective behaviors, and relationships. Modern distributed computing frameworks, algorithms, statistics, and machine learning methods can improve several social science fields like anthropology, sociology, economics, psychology, political science, media studies, and marketing. Therefore, computational social sciences is an interdisciplinary scientific area which explores the social dynamics of society through advanced computational systems.

Computational social science is a relatively new field, and its development is closely related
to the computational sociology that is often associated with the study of social complexity, which is a useful conceptual framework for the analysis of society. Social complexity is a theory-neutral framework that accommodates both local and global approaches to social research. The theoretical background of this conceptual framework dates back to the work of Talcott Parsons on action theory, the integration of the study of social order with the structural features of macro and micro factors. Several decades later, in the early 1990s, the social theorist Niklas Luhmann began to work on the themes of complex behavior. By then, new statistical and computational methodologies were being developed for social science problems.

Nigel Gilbert, Klaus G. Troitzsch, and Joshua M. Epstein are the founders of modern computational sociology, merging social science research with simulation techniques in order to model complex policy issues and essential features of human societies. Nigel Gilbert is a pioneer in the use of agent-based models in the social sciences. Klaus G. Troitzsch introduced the method of computer-based simulation into the social sciences. Joshua M. Epstein developed, with Robert Axtell, the first large-scale agent-based computational model, which aims to explore the role of social experiences such as seasonal migrations, pollution, and transmission of disease.

As an instrument-based discipline, computational social sciences enables the observation and empirical study of phenomena through computational methods and quantitative datasets. Quantitative methods such as dynamical systems, artificial intelligence, network theory, social network analysis, data mining, agent-based modeling, computational content analysis, social simulations (macrosimulation and microsimulation), and statistical mechanics are often combined to study complex social systems.
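A stylized example of one of these methods, agent-based modeling, is sketched below: a simple threshold model of adoption spreading over a small-world network, assuming the NetworkX library. The model and its parameters are purely illustrative.

# Illustrative agent-based simulation: adoption spreads when at least 30% of a node's
# neighbors have already adopted.
import random
import networkx as nx

random.seed(1)
g = nx.watts_strogatz_graph(200, k=6, p=0.1, seed=1)

adopted = {node: False for node in g.nodes}
for early in random.sample(list(g.nodes), 5):   # a few initial adopters
    adopted[early] = True

for step in range(15):
    newly = [node for node in g.nodes
             if not adopted[node]
             and sum(adopted[nb] for nb in g.neighbors(node)) >= 0.3 * g.degree(node)]
    for node in newly:
        adopted[node] = True
    print("step", step, "adopters:", sum(adopted.values()))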
Technological developments are constantly changing society, ways of communication, behavioral patterns, the principles of social influence, and the formation and organization of groups and communities, enabling the emergence of self-organized movements. As technology-mediated behaviors and collectives are primary elements in the dynamics and in the design of social structures, computational approaches are critical to understanding the complex mechanisms that form part of many social phenomena in contemporary society. Big data can be used to understand many complex phenomena, as it offers new opportunities to work toward a quantitative understanding of our complex social systems. Technologically mediated social phenomena emerging over multiple scales are available in complex datasets. Twitter, Facebook, Google, and Wikipedia showed that it is possible to relate, compare, and predict opinions, attitudes, social influences, and collective behaviors. Online and offline big data can provide insights that allow the understanding of social phenomena like the diffusion of information, polarization in politics, the formation of groups, and the evolution of networks.

Big data is dynamic, heterogeneous, and interrelated. But it is also often noisy and unreliable. Even so, big data may be more valuable to the social sciences than small samples, because the overall statistics obtained from frequent patterns and correlation analysis often disclose hidden patterns and more reliable knowledge. Furthermore, when big data is connected, it forms large networks of heterogeneous information with data redundancy that can be exploited to compensate for the lack of data, to validate trust relationships, to disclose inherent groups, and to discover hidden patterns and models. Several methodologies and applications in the context of modern social science datasets allow scientists to understand and study different social phenomena, from political decisions to the reactions of economic markets to the interactions of individuals and the emergence of self-organized global movements.

Trillions of bytes of data can be captured by instruments or generated by simulation. Through better analysis of these large volumes of data that are becoming available, there is the potential to make further advances in many scientific disciplines and to improve social knowledge and the success of many companies. More than ever, science is now a collaborative activity. Computational systems and techniques have created new ways of collecting, crossing, and interconnecting data. Analysis of big data is now at the disposal of the social sciences, allowing the study of cases at macro- and microscales in connection to other scientific fields.
Cross-References

▶ Computer Science
▶ Data Visualization
▶ Network Analytics
▶ Network Data
▶ Social Network Analysis
▶ Visualization

Further Reading

Bankes, S., Lempert, R., & Popper, S. (2002). Making computational social science effective: Epistemology, methodology, and technology. Social Science Computer Review, 20(4), 377–388.
Bainbridge, W. S. (2007). Computational sociology. In The Blackwell encyclopedia of sociology. Malden, MA: Blackwell Publishing.
Cioffi-Revilla, C. (2010). Computational social science. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 259–271.
Conte, R., et al. (2012). Manifesto of computational social science. The European Physical Journal Special Topics, 214(1), 325–346.
Lazer, D., et al. (2009). Computational social science. Science, 323(5915), 721–723.
Miller, J. H., & Page, S. E. (2009). Complex adaptive systems: An introduction to computational models of social life. Princeton: Princeton University Press.
Oboler, A., et al. (2012). The danger of big data: Social media as computational social science. First Monday, 17(7). Retrieved from http://firstmonday.org/article/view/3993/3269/.


Computer Science

Ramón Reichert
Department for Theatre, Film and Media Studies, Vienna University, Vienna, Austria

Computer science is the scientific approach to the automatic processing of data and information using digital computing machines. Computer science has developed at the interface between two scientific disciplines: on the one hand, it emerges from the formal logical methods of mathematics; on the other hand, it takes up genuine problems of the engineering sciences and tries to develop machine-based designs for application-oriented questions.

Computer Science and Big Data

The buzzword "big data" is on everyone's lips and not only describes scientific data practices but also stands for societal change and a media culture in transition. On the assumption that digital media and technologies do not merely convey neutral messages but establish cultural memory and develop social potency, they may be understood as discourses of societal self-reflection. Over the past few years, big data research has become highly diversified and has yielded a number of published studies, which employ a form of computer-based social media analysis supported by machine-based processes such as text analysis (quantitative linguistics), sentiment analysis (mood recognition), social network analysis, image analysis, or other processes of a machine-based nature. Given this background, it would be ethically correct to regularly enlighten users of online platforms about the computer-based possibilities, processes, and results associated with the collection and analysis of large volumes of data. As a phenomenon and discipline developed only in the past several years, big data can be described as the collection, manipulation, and analysis of massive amounts of data – and the decisions made from that analysis. Moreover, big data is affecting and will affect almost all fields of study, from criminology to philosophy, business, government, transportation, energy, genetics, medicine, physics, and more. As we tackle big data in this encyclopedia, we objectively report on the negative effects of loss of privacy, surveillance, and the possible misuse of data in trade-offs for security. On the other hand, it is big data that is helping us to peer into the human genome for invaluable medical insights, or to reach deep across the universe, discovering planets much like our own. In the era of big data, the status of social networks has changed radically. Today, they increasingly act as gigantic data collectors for the observational requirements of social-statistical knowledge and serve as a
prime example of normalizing practices. Where extremely large quantities of data are analyzed, this now usually entails the aggregation of moods and trends. Numerous studies exist in which the textual data of social media has been analyzed in order to predict political attitudes, financial and economic trends, psychopathologies, and revolutions and protest movements. The statistical evaluation of big data promises a range of advantages, from increased efficiency in economic management via the measurement of demand and potential profit to individualized service offers and better social management.

The structural change generated by digital technologies, as the main driver of big data, offers a multitude of applications for sensor technology and biometrics as key technologies. The conquest of mass markets through sensor and biometric recognition processes can sometimes be explained by the fact that mobile, web-based terminals are equipped with a large variety of different sensors. More and more users come into contact in this way with sensor technology or with the measurement of individual body characteristics. Due to more stable and faster mobile networks, many people are permanently connected to the Internet using their mobile devices, giving connectivity an extra boost. With the development of apps, application software for mobile devices such as smartphones (iPhone, Android, BlackBerry, Windows Phone) and tablet computers, the application culture of biosurveillance changed significantly, since these apps are strongly influenced by the dynamics of bottom-up participation. Therefore the algorithmic prognosis of collective processes enjoys particularly high political status, with the social web becoming the most important data source for knowledge on governance and control.

Within the context of big data, on the other hand, a perceptible shift of all listed parameters has taken place, because the acquisition, modeling, and analysis of large amounts of data, accelerated by servers and by entrepreneurial individuals, is conducted without the users' knowledge or perusal. Consequently, the socially acceptable communication of big data research seeks to integrate the methods, processes, and models used for data collection into publication strategies, in order to inform the users of online platforms or to invite them to contribute to the design and development of partially open interfaces.

The Emergence of Computer Science

The basic principle of the computer is the conversion of all signs and sign processes into arithmetic operations. In this respect, the history of computer science refers to earlier traditions of thought which already had an automation of calculations in mind. Before the computer was invented as a tangible machine, there were already operational concepts of the use of symbols, which provided the guidelines for the upcoming development of the computer. The arithmetic, algebraic, and logical calculi of the operational use of symbols can be understood as pioneers of computer science. The key thought behind the idea of formalization is based on the use of written symbols, both schematic and open to interpretation. Inspired by the algebraic methods of mathematics, René Descartes proposed, in his "Regulae ad directionem ingenii" in 1628, for the first time the idea of the unity of a rational, reasoned thinking, a mathesis universalis. The idea of mathematics as a method of gaining knowledge that works without object bondage was taken up from the analytic geometry of Pierre de Fermat (1630) and further developed by Gottfried Wilhelm Leibniz in his early work "Dissertatio de Arte combinatoria," published in 1666. Leibniz intended his characteristica universalis to be a universal language of science and created a binding universal symbolism for all fields of knowledge, constructed according to the natural sciences and mathematical models. The Boolean algebra of George Boole resulted from the motivation to describe human thinking and action using precise formal methods. With his "Laws of Thought" (1854), George Boole laid the foundations of mathematical logic and established, in the form of Boolean algebra, the fundamental mathematical principles for the whole of technical computer science. This outlined development of a secure logical language of
symbols gives form to the operational basis of modern computer technology.

A comparative historical analysis of data processing, taking into account the material culture of data practices from the nineteenth to the twenty-first century, shows (Gitelman and Pingree 2004) that in the nineteenth century researchers' interests in taxonomic knowledge were strongly influenced by mechanical data practices – long before computer-based methods of data collection even existed (Driscoll 2012). Further studies analyze and compile the social and political conditions and effects of the transition from mechanical data counting, from the first census of 1890 through the electronic data processing of the 1950s, to the digital social monitoring of the immediate present (Bollier 2010, p. 3). Published in 1937, the work of the British mathematician and logician Alan Mathison Turing, On Computable Numbers, with an Application to the "Entscheidungsproblem," in which he developed a mathematical model machine, is to this day of vital importance for the history of theories of modern information and computer technology. With his invention of the universal Turing machine, Turing is widely considered to be one of the most influential theorists of early computer development. In 1946, the mathematician John von Neumann developed the key components of a computer which are in use until today: control and arithmetic unit, memory, and input/output facilities. During the 1960s, the first generation of informatics specialists from the field of the social sciences, such as Herbert A. Simon (1916–2001), Charles W. German (1912–1992), and Harold Guetzkow (1915–2008), started to systematically use calculating machines and punch cards for the statistical analysis of their data (Cioffi-Revilla 2010, pp. 259–271). The computer is an advanced calculator, which translates the complete information into a binary code and electrically transmits it in the form of signals. In this way the computer, as a comprehensive hypermedium, is able to store, edit, and deliver not only verbal texts but also visual and auditory texts in a multimedia convergence space.

Digital large-scale research, with its large data processing centers and server farms, has played, since the late twentieth century, a central role in the production, processing, and management of computer science knowledge. Concomitantly, media technologies of data collection and processing, as well as media that develop knowledge by using spaces of opportunity, move to the center of knowledge production and social control. In this sense we can speak of both data-based and data-driven sciences, since the production of knowledge has become dependent on the availability of computer technology infrastructures and on the development of digital applications and methods.

Computational Social Science

In the era of big data, the importance of social networking culture has changed radically. Social media acts today as a gigantic data collector and as a relevant data source for digital communications research: "Social media offers us the opportunity for the first time to both observe human behavior and interaction in real time and on a global scale" (Golder and Macy 2012, p. 7). Large amounts of data are collected in different domains of knowledge; fields such as biotechnology, genomics, labor and financial sciences, or trend research rely in their work and studies on the results of the information processing of big data and formulate on this basis significant models of the current status and future development of social groups and societies. Big data research has become significantly differentiated in recent years, as numerous studies have been published using machine-based methods such as text analysis (quantitative linguistics), sentiment analysis (mood detection), social network analysis, and image analysis, or otherwise machine-based processes of computer-based social media analysis.

The newly emerging discipline of "Computational Social Science" (Lazer et al. 2009, pp. 721–723; Conte et al. 2012, pp. 325–346) evaluates the large amounts of data on online use in the backend area and has emerged as a new leading science in the study of social media and Web 2.0. It provides a common platform for computer science and the social sciences, connecting the different expert opinions
on computer science, society, and cultural processes. Computer science deals with the computer-based elaboration of large databases which can no longer be handled with the conventional methods of the statistical social sciences. Its goal is to describe the social behavioral patterns of online users on the basis of methods and algorithms of data mining: "To date, research on human interactions has relied mainly on one-time, self-reported data on relationships" (Lazer et al. 2009, p. 722). In order to answer this question of social behavior in a relevant or meaningful way, computer science requires the methodological input of the social sciences. With their knowledge of theories and methods of social activity, the social sciences make a valuable contribution to the formulation of relevant issues.

Digital Methods

At the interface between "computational social science" (Lazer et al. 2009) and "cultural analytics" (Manovich 2009, pp. 199–212), an interdisciplinary theoretical field has emerged, reflecting the new challenges of digital Internet research. The representatives of the so-called digital methods pursue the aim of rethinking research about use (audience research) by interpreting the use practices of the Internet as a cultural change and as a social issue (Rogers 2013, p. 63). Analogous methods, though, that have been developed for the study of interpersonal or mass communication cannot simply be transferred to digital communication practices. Digital methods can be understood as approaches that focus on the genuine practice of digital media and not on existing methods adapted for Internet research. According to Rogers (2013), digital methods are research approaches that take advantage of large-scale digital communication data to, subsequently, model and manage this data using computational processes. Both the approach of "Computational Social Science" and the questioning of "Digital Methods" rest on the fundamental assumption that, by using the supplied data which creates social media platforms, new insights into human behavior, into social issues beyond these platforms, and into their software can be achieved. Numerous representatives of computer-based social and cultural sciences sustain the assumption that online data could be interpreted as social environments. To do so, they define the practices of Internet use by docking them to a positivist notion of data, which comprehends the user practices as an expression of specifiable social activity. The social positivism of "Computational Social Science" on social media platforms neglects, however, the meaningful and intervening/instructive role of the media in the production of social roles and stereotyped conducts in dealing with the medium itself. With respect to its postulate of objectivity, the social behaviorism of online research can, in this regard, be questioned.

The vision of such natively digital research methodology, whether in the form of a "computational social science" (Lazer et al. 2009, pp. 721–723) or "cultural analytics" (Manovich 2009, pp. 199–212), is, however, still incomplete and requires an epistemic survey of digital methods in Internet research in the following areas:

1. Digital methods as a validity-theoretical project. This stands for a specific process that claims the social recognition of action orientations. The economy of computer science, computational linguistics, and empirical communication sociology not only form a network of scientific fields and disciplines but also develop, in their strategic collaborative projects, certain expectations, describing and explaining the social world, and are, in this respect, intrinsically connected with epistemic and political issues. In this context, the epistemology questioning the self-understanding of digital methods deals with the social effectiveness of digital data science.
2. Digital methods as a constitution-theoretical construct. The relation to the object in big data research is heterogeneous and consists of different methods. Using interface technologies, the process of data tracking, of keyword tracking, of automatic network analysis, of
argument and sentiment analysis, or machine-based learning results in critical perspectivizations of data constructs. Against this background, Critical Code Studies try to make the media techniques of computer science power relations visible and study the technical and infrastructural controls over layer models, network protocols, access points, and algorithms.
3. Digital methods may ultimately be regarded as a founding theoretical fiction. The relevant research literature has dealt extensively with the reliability and validity of scientific data collection and came to the conclusion that the data interfaces of social networks (Twitter, Facebook, YouTube) act more or less like dispositive orders according to a gatekeeper. The filter interface generates the APIs' (application programming interfaces) economically motivated exclusionary effects for network research, which cannot be controlled by researchers' own efforts.

In this context of the problem-oriented development of computer science, the expectations placed on the science of the twenty-first century have significantly changed. In the debates, claims are increasingly being made that insist on addressing the historical, social, and ethical aspects of digital data practices – with the purpose of anchoring these aspects in the future scientific cultures and epistemologies of data generation and data analysis. Lazer et al. demand of future computer scientists a responsible use of available data and see in negligent handling a serious threat to the future of the discipline itself: "A single dramatic incident involving a breach of privacy could produce a set of statutes, rules, and prohibitions that could strangle the nascent field of computational social science in its crib. What is necessary, now, is to produce a self-regulatory regime of procedures, technologies, and rules that reduce this risk but preserve most of the research potential." (Lazer et al. 2009, p. 722) If research on social interaction is going to be carried out using computer science and big data, then the responsible handling of data as well as compliance with data protection regulations are key issues.

Further Reading

Bollier, D. (2010). The promise and peril of big data. Washington, DC: The Aspen Institute. Online: http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/The_Promise_and_Peril_of_Big_Data.pdf.
Cioffi-Revilla, C. (2010). Computational social science. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 259–271.
Conte, R., et al. (2012). Manifesto of computational social science. European Physical Journal: Special Topics, 214(1), 325–346.
Driscoll, K. (2012). From punched cards to 'Big Data': A social history of database populism. Communication +1, 1(1), 1. Online: http://kevindriscoll.info/.
Gitelman, L., & Pingree, G. B. (2004). New media: 1740–1915. Cambridge, MA: MIT Press.
Golder, S., & Macy, M. (2012). Social science with social media. Footnotes, 40(1), 7. Online: http://www.asanet.org/footnotes/jan12/socialmedia_0112.html.
Lazer, D., et al. (2009). Life in the network: The coming age of computational social science. Science, 323(5915), 721–723.
Manovich, L. (2009). How to follow global digital cultures: Cultural analytics for beginners. In K. Becker & F. Stalder (Eds.), Deep search: The politics of search beyond Google (pp. 198–212). Innsbruck: Studienverlag.
Rogers, R. (2013). Digital methods. Cambridge, MA: MIT Press.


Computer-Assisted Reporting

▶ Media


Consensus Methods

▶ Ensemble Methods


Console

▶ Dashboard
Content Management System (CMS)

Yulia A. Strekalova and Mustapha Bouakkaz
College of Journalism and Communications, University of Florida, Gainesville, FL, USA
University Amar Telidji Laghouat, Laghouat, Algeria

The use of content management systems (CMSs) dates back to the late 1990s. Taken broadly, CMSs include systems for strategic decision-making in relation to organizational knowledge management and sharing or, more narrowly, online applications for sharing this knowledge with internal and external users. CMSs are automated interfaces that allow system users to create, publish, edit, index, and control access to their content without having to learn hypertext markup language (HTML) or other programming languages. CMSs have several possible advantages, such as low cost, built-in pathways for customization and upgrades, flexibility in content access, and ease of use by nontechnical content producers. At the same time, especially in these days of big data and massive datasets, large-scale CMSs can require extended strategic planning and system pre-evaluation. Enterprise-level CMSs may require extensive staff training, hardware investments, and commitments for ongoing maintenance. However, a CMS may also present targets for cyberattacks and security threats when not managed as an integral part of an organization's overall information infrastructure.

Definition

CMS as a concept evolved organically, and there are no official standards that guide or prescribe the features a CMS should or should not have. Overall, CMS tools aim to manage comprehensive websites and online collaboration portals through the management of a process of collecting, managing, and publishing content, thus delivering and disseminating knowledge. To that end, a CMS provides a platform for linking and using large datasets in connection to tools for strategic planning and decision-making.

CMS has different meanings and foci across various disciplines and practical applications. The viewpoint of managerial applications of CMS puts more emphasis on knowledge management and the strategic decisions adopted to manage and deliver business knowledge, while the information processing perspective focuses on the process of collecting, managing, and publishing content. More specifically, this latter viewpoint defines a CMS as a computer application that allows users to publish, edit, and modify content online from a central, shared interface.

Early CMS applications aimed to simplify the task of coding to streamline the website development process. As technological applications grew, the definition of a CMS received several interpretations. CMS systems today carry a multitude of functions and facilitate big data centralization, editing, publishing, and modification through a single back-end interface which also includes organization-level rules and processes that govern content creation and management. The latter are frequently viewed as part of enterprise strategic decision-making and guide the selection, development, and implementation of a CMS. As content is produced, content managers can rely on the support of a CMS to create an infrastructure for multiple users to collaborate and contribute to all necessary knowledge management activities simultaneously.

A CMS may also provide tools for targeted online advertising, business-to-community communication, and audience engagement management.

Uses

Most frequently, examples of CMS use include blogs, news sites, and shopping portals. In a nutshell, a CMS can keep the look and feel of a
website consistent while saving content managers the time and effort of creating new web pages as necessary, informing subscribers of newly created content, or updating past content with new information. In a sense, a CMS can create standards of content management for small organizations and individual authors and make sure these standards are kept consistent.

A CMS can be used to manage the content of an externally focused organizational website or an internally targeted information sharing system. In either application, a CMS framework consists of two main elements: a content management application (CMA) and a content delivery application (CDA). The CMA functions as a tool for the content manager, who may not know HTML, to create, modify, and remove content from an online website independent of IT and webmaster support. Through the CDA, information is compiled and published online. Together, these elements create a central interface and make it possible for nontechnical users to add and edit text, control revisions, index data, and manage content for dissemination. In other words, a CMS allows what-you-see-is-what-you-get editing and formatting by content authors who are not IT and web development specialists.

A CMS, which may not require coding or direct management from an end user, provides more than content editing support. Robust systems may automatically generate navigation across the content, provide search functionality, facilitate the indexing of content entries, track content activity by the system's users, and define user groups with varying security permissions and access to content.

In business applications, a CMS can be used for one-on-one marketing by delivering product information tailored to specific users' interests based on their past interaction with content. While corporations are the most frequent CMS users in their marketing efforts, a wide range of organizations, including nonprofits and public service organizations, can benefit from CMS use in relation to knowledge management and information dissemination efforts. Depending on the size of an organization and the volume of published content, a CMS can be used to define roles and create workflow procedures for the collaborators who are involved in content management. The workflow may include manual steps and standard operating procedures, or be set up as an automated cascade of actions triggered by one of the content managers. As a central repository of data, a CMS may include documents, videos, pictures, customer and collaborator contact information, or scientific data. A CMS, aside from storing and publishing information, can provide the necessary content linkages to show how new entries fit in with and enhance previously existing content.

Functionality

Applying a systems approach to CMS evaluation and identifying the features of an ideal CMS, the following functions have been posited as most desirable (Garvin 2011; Han 2004): (1) a robust framework to facilitate knowledge use by end users, (2) stable access to and the ability to share information with other enterprise-level information systems, (3) a strategic plan for maintaining and disseminating relevant internally created and external knowledge, (4) a strategy for managing indexing and metadata associated with the content, and (5) a solution to reuse created content effectively.

CMSs offer complex and powerful functions. Most frequently, CMS features can be broken into three major application areas: data maintenance, user access management, and system modification. Data maintenance allows creating a uniform structure across the variety of shared content through standardization. Standardization, aside from ensuring the uniform presentation of content, creates structure for the shared data and allows for more robust analysis of the data themselves and of their use. Other examples of data maintenance include template automation, which standardizes content appearance; data versioning, which allows for effective updates; and overall data
management, which includes content publishing, editing, temporary removal from public access, and final archiving.
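A minimal sketch of what such data-maintenance support might look like is given below; the class, field names, and content states are illustrative assumptions rather than the interface of any real CMS.

# Illustrative content item with version history and lifecycle states
# (draft -> published -> unpublished -> archived).
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ContentItem:
    title: str
    state: str = "draft"
    versions: list = field(default_factory=list)

    def save_version(self, body, author):
        """Keep every revision so updates remain reversible."""
        self.versions.append({"body": body, "author": author,
                              "saved_at": datetime.now().isoformat()})

    def publish(self):
        self.state = "published"

    def unpublish(self):
        self.state = "unpublished"   # temporary removal from public access

    def archive(self):
        self.state = "archived"      # final archiving

    def current_body(self):
        return self.versions[-1]["body"] if self.versions else ""

item = ContentItem("About us")
item.save_version("First draft.", author="editor1")
item.save_version("Revised copy.", author="editor2")
item.publish()
print(item.state, len(item.versions), item.current_body())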
User access features include permission control and collaboration management. Permission control allows CMS administrators to create user groups and assign access privileges and rights, as well as to define how users, if any, can contribute their own content to an organization's website. Collaboration management features allow multiple users to work on the content and allow administrators to set up a workflow that cycles shared content through necessary review and editing and delegates tasks automatically to the assigned users.

Finally, system modification features include scalability and upgrades. Scalability features may include the ability to create microsites within one CMS installation or the addition of modules and plug-ins that extend the functionality of the original CMS. Plug-in and module functions may include additional search options, content syndication, or the moderation of user-shared content. Upgrade features refer to the regular updates to the CMS in accordance with the most current web standards.

Existing CMSs can be divided into two groups. The first group is proprietary or enterprise CMSs, created and managed by one particular developer with administrative access to edit and customize the website that uses the CMS. The second group is open-source CMSs, which are open to any number of administrators who can make changes using any device. Although not necessary for immediate use, open-source CMSs allow programmers to adapt and modify the code of a system to tailor a CMS to the needs of an organization. The two CMS groups also differ in their approaches to the management of data and workflow. The first frequently establishes standard operating procedures for content creation, review, and publishing, while the second usually lacks strict standardization. The fact that a CMS is open source does not mean that it cannot be used for enterprise-level content management. For example, Drupal and WordPress can support a broad range of content management needs, from a small blog to an enterprise-level application with complex access permissions and asynchronous, multiuser content publishing and editing.

CMS Options

A number of CMS systems are available, some of which are indicated in the brief descriptions and classifications below, delineating some differences found in CMS products.

Blogs are CMS systems that are most appropriate for central content creation and controlled dissemination management. Most modern blog systems, like WordPress or Tumblr, require little initial HTML knowledge, but more advanced features may require the application and use of scripts.

Online wikis are CMSs that frequently crowdsource content and are usually edited by a number of users. Although the presentation of the content is usually static, such CMS systems benefit from functionality that allows community members to add and edit content without coordination with a central knowledge repository.

Forums are dynamic CMS systems that provide community members with active conversation and discussion functionality, e.g., vBulletin or bbPress. Most forum systems are based on PHP and MySQL, but some online forum systems can be initiated without database or scripting knowledge.

Portals are another type of CMS that can include both static and interactive content management features. Most portals are comprehensive and include wikis, forums, news feeds, etc. Some projects that support portal solutions are Joomla, Drupal, and Xoops, which support the development of portal sites in a progressive, modular manner.

On the one hand, a CMS offers tools for the collection, storage, analysis, and use of large amounts of data. On the other hand, big data are used to assess CMS measures and outcomes and to explore the relationships between them.
Cross-References

▶ Business-to-Community (B2C)
▶ Content Moderation
▶ Semantic/Content Analysis/Natural Language Processing
▶ Sentiment Analysis
▶ Social Media

Further Reading

Barker, D. (2015). Web content management: Systems, features, and best practices. Sebastopol, CA: O'Reilly Media.
Frick, T., & Eyler-Werve, K. (2015). Return on engagement: Content strategy and web design techniques for digital marketing. Burlington, MA: Focal Press.
Garvin, P. (2011). Government information management in the 21st century: International perspectives. Farnham/Surrey: Ashgate Pub.
Han, Y. (2004). Digital content management: The search for a content management system. Library Hi Tech, 22.


Content Moderation

Sarah T. Roberts
Department of Information Studies, University of California, Los Angeles, Los Angeles, CA, USA

Synonyms

Community management; Community moderation; Content screening

Definition

Content moderation is the organized practice of screening user-generated content (UGC) posted to Internet sites, social media, and other online outlets, in order to determine the appropriateness of the content for a given site, locality, or jurisdiction. The process can result in UGC being removed by a moderator, acting as an agent of the platform or site in question. Increasingly, social media platforms rely on massive quantities of UGC data to populate them and to drive user engagement; with that increase has come the concomitant need for platforms and sites to enforce their rules and relevant or applicable laws, as the posting of inappropriate content is considered a major source of liability.

The style of moderation can vary from site to site and from platform to platform, as rules around what UGC is allowed are often set at a site or platform level and reflect that platform's brand and reputation, its tolerance for risk, and the type of user engagement it wishes to attract. In some cases, content moderation may take place in haphazard, disorganized, or inconsistent ways; in others, content moderation is a highly organized, routinized, and specific process. Content moderation may be undertaken by volunteers or, increasingly, in a commercial context by individuals or firms who receive remuneration for their services. The latter practice is known as commercial content moderation, or CCM. The firms who own social media sites and platforms that solicit UGC employ content moderation as a means to protect the firm from liability and negative publicity and to curate and control the user experience.

History

The Internet and its many underlying technologies are highly codified and protocol-reliant spaces with regard to how data are transmitted within them (Galloway 2006), yet the subject matter and nature of content itself has historically enjoyed much greater freedom. Indeed, a central claim to the early promise of the Internet as espoused by many of its proponents was that it was highly resistant, as a foundational part of both its architecture and ethos, to censorship of any kind.

Nevertheless, various forms of content moderation occurred in early online communities. Such content moderation was frequently undertaken by volunteers and was typically based on the
enforcement of local rules of engagement around community norms and user behavior. Moderation practices and style therefore developed locally among communities and their participants and could inform the flavor of a given community, from the highly rule-bound to the anarchic: the Bay Area-based online community the WELL famously banned only three users in its first 6 years of existence, and then only temporarily (Turner 2005, p. 499).

In social communities on the early text-based Internet, mechanisms to enact moderation were often direct and visible to the user and could include demanding that a user alter a contribution to eliminate offensive or insulting material, the deletion or removal of posts, the banning of users (by username or IP address), the use of text filters to disallow posting of specific types of words or content, and other overt moderation actions. Examples of sites of this sort of content moderation include many Usenet groups, BBSes, MUDs, listservs, and various early commercial services.

Motives for people participating in voluntary moderation activities varied. In some cases, users carried out content moderation duties for prestige, status, or altruistic purposes (i.e., for the betterment of the community); in others, moderators received non-monetary compensation, such as free or reduced-fee access to online services, e.g., AOL (Postigo 2003). The voluntary model of content moderation persists today in many online communities and platforms; one such high-profile site where volunteer content moderation is used exclusively to control site content is Wikipedia.

As the Internet has grown into large-scale adoption and a massive economic engine, the desire for major mainstream platforms to control the UGC that they host and disseminate has also grown exponentially. Early on in the proliferation of so-called Web 2.0 sites, newspapers and other news media outlets, in particular, began noticing a significant problem with their online comments areas, which often devolved into unreadable spaces filled with invective, racist and sexist diatribes, name-calling, and irrelevant postings. These media firms began to employ a variety of techniques to combat what they viewed as the misappropriation of the comments spaces, using in-house moderators, turning to firms that specialized in the large-scale management of such interactive areas, and deploying technological interventions such as word filter lists or disallowing anonymous posting, to bring the comments sections under control. Some media outlets went the opposite way, preferring instead to close their comments sections altogether.

Commercial Content Moderation and the Contemporary Social Media Landscape

The battle with text-based comments was just the beginning of a much larger issue. The rise of Friendster, MySpace, and other social media applications in the early part of the twenty-first century has given way to more persistent social media platforms of enormous scale and reach. As of the second quarter of 2016, Facebook alone approached two billion users worldwide, all of whom generate content by virtue of their participation on the platform. YouTube reported receiving upwards of 100 hours of UGC video per minute as of 2014.

The contemporary social media landscape is therefore characterized by vast amounts of UGC uploads made by billions of users to massively popular commercial Internet sites and social media platforms with a global reach. Mainstream platforms, often owned by publicly traded firms responsible to shareholders, simply cannot afford the risk – legal, financial, and to reputation – that unchecked UGC could cause. Yet, contending with the staggering amounts of transmitted data from users to platforms is not a task that can currently be addressed reliably and at large scale by computers. Indeed, making nuanced decisions about what UGC is acceptable and what is not currently exceeds the abilities of machine-driven processes, save for the application of some algorithmically informed filters or bit-for-bit or hash value matching, which occur at relatively low levels of computational complexity.
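The low-complexity, machine-driven checks mentioned above (word filter lists and bit-for-bit or hash value matching) can be illustrated with a short sketch. The following Python fragment is illustrative only; the blocklist terms, hash values, and function names are invented placeholders rather than any platform's actual tooling.

```python
import hashlib

# Hypothetical word filter list and hash blocklist (placeholder values only).
BANNED_TERMS = {"badword1", "badword2"}
KNOWN_BAD_HASHES = {"placeholder-hash-of-previously-removed-file"}

def violates_word_filter(post_text: str) -> bool:
    """Flag a post if any token matches the word filter list."""
    tokens = {t.strip(".,!?").lower() for t in post_text.split()}
    return bool(tokens & BANNED_TERMS)

def matches_known_content(upload_bytes: bytes) -> bool:
    """Flag an upload whose hash matches previously removed content."""
    return hashlib.sha256(upload_bytes).hexdigest() in KNOWN_BAD_HASHES

def triage(post_text: str, upload_bytes: bytes = b"") -> str:
    """Automated checks catch only exact or shallow matches;
    everything else is left for human (CCM) review."""
    if violates_word_filter(post_text):
        return "removed: word filter"
    if upload_bytes and matches_known_content(upload_bytes):
        return "removed: hash match"
    return "queued for human review"
```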
The need for adjudication of UGC – video- and image-based content, in particular – therefore calls on human actors who rely upon their own linguistic and cultural knowledge and competencies to make decisions about UGC's appropriateness for a given site or platform. Specifically, "they must be experts in matters of taste of the site's presumed audience, have cultural knowledge about location of origin of the platform and of the audience (both of which may be very far removed, geographically and culturally, from where the screening is taking place), have linguistic competency in the language of the UGC (that may be a learned or second language for the content moderator), be steeped in the relevant laws governing the site's location of origin and be experts in the user guidelines and other platform-level specifics concerning what is and is not allowed" (Roberts 2016). These human workers are the people who make up the legions of commercial content moderators: moderators who work in an organized way, for pay, on behalf of the world's largest social media firms, apps, and websites who solicit UGC.

CCM processes may take place prior to material being submitted for inclusion or distribution on a site, or they may take place after material has already been uploaded, particularly on high-volume sites. Specifically, content moderation may be triggered as the result of complaints about material from site moderators or other site administrators, from external parties (e.g., companies alleging misappropriation of material they own; from law enforcement; from government actors) or from other users themselves who are disturbed or concerned by what they have seen and then invoke protocols or mechanisms on a site, such as the "flagging" of content, to prompt a review by moderators (Crawford and Gillespie 2016). In this regard, moderation practices are often uneven, and the removal of UGC may reasonably be likened to censorship, particularly when it is undertaken in order to suppress speech, political opinions, or other expressions that threaten the status quo.

CCM workers are called upon to match and adjudicate volumes of content, typically at rapid speed, against the specific rules or community guidelines of the platform for which they labor. They must also be aware of the laws and statutes that may govern the geographic or national location from where the content emanates, for which the content is destined, and for where the platform or site is located – all of which may be distinct places in the world. They must be aware of the platform's tolerance for risk, as well as the expectations of the platform for whether or how CCM workers should make their presence known.

In many cases, CCM workers may work at organizational arm's length from the platforms they moderate. Some labor arrangements in CCM have workers located at great distances from the headquarters of the platforms for which they are responsible, in places such as the Philippines and India. The workers may be structurally removed from those firms, as well, via outsourcing companies who take on CCM contracts and then hire the workers under their auspices, in call center (often called BPO, or business process outsourcing) environments. Such outsourcing firms may also recruit CCM workers using digital piecework sites such as Amazon Mechanical Turk or Upwork, in which the relationships between the social media firms, the outsourcing company, and the CCM worker can be as ephemeral as one review.

Even when CCM workers are located on-site at a headquarters of a social media firm, they often are brought on as contract laborers and are not afforded the full status, or pay, of a regular full-time employee. In this regard, CCM work, wherever it takes place in the world and by whatever name, often shares the characteristic of being relatively low wage and low status as compared to other jobs in tech. These arrangements of institutional and geographic removal can pose a risk for workers, who can be exposed to disturbing and shocking material as a condition of their CCM work but can be a benefit to the social media firms who require their labor, as they can distance themselves from the impact of the CCM work on the workers. Further, the working conditions, practices, and existence of CCM workers in social media are little known to the general public, a fact that is often by design. CCM workers are frequently compelled to sign NDAs, or
nondisclosure agreements, that preclude them from discussing the work that they do or the conditions in which they do it. While social media firms often gesture at the need to maintain secrecy surrounding the exact nature of their moderation practices and the mechanisms they used to undertake them, claiming the possibility of users' being able to game the system and beat the rules if armed with such knowledge, the net result is that CCM workers labor in secret. The conditions of their work – its pace, the nature of the content they screen, the volume of material to be reviewed, and the secrecy – can lead to feelings of isolation, burnout, and depression among some CCM workers. Such feelings can be enhanced by the fact that few people know such work exists, assuming, if they think of it at all, that algorithmically driven computer programs take care of social media's moderation needs. It is a misconception that the industry has been slow to correct.

Conclusion

Despite claims and conventional wisdom to the contrary, content moderation has likely always existed in some form on the social Internet. As the Internet's many social media platforms grow and their financial, political, and social stakes increase, the undertaking of organized control of user expression through such practices as CCM will likewise only increase. Nevertheless, CCM remains a little discussed and little acknowledged aspect of the social media production chain, despite its mission-critical status in almost every case in which it is employed. The existence of a globalized CCM workforce abuts many difficult, existential questions about the nature of the Internet itself and the principles that have long been thought to undergird it, particularly, the free expression and circulation of material, thought, and ideas. These questions are further complicated by the pressures related to contested notions of jurisdiction, borders, application and enforcement of laws, social norms, and mores that frequently vary and often are in conflict with each other. The acknowledgement and understanding of the history of content moderation and the contemporary reality of large-scale CCM is central to many of these core questions of what the Internet has been, is now, and will be in the future, and yet the continued invisibility and lack of acknowledgment of CCM workers by the firms for which their labor is essential means that such questions cannot fully be addressed. Nevertheless, discussions of moderation practices and the people who undertake them are essential to the end of more robust, nuanced understandings of the state of the contemporary Internet and to better policy and governance based on those understandings.

Cross-References

▶ Algorithm
▶ Facebook
▶ Social Media
▶ Wikipedia

Further Reading

Crawford, K., & Gillespie, T. (2016). What is a flag for? Social media reporting tools and the vocabulary of complaint. New Media & Society, 18(3), 410–428.
Galloway, A. R. (2006). Protocol: How control exists after decentralization. Cambridge, MA: MIT Press.
Postigo, H. (2003). Emerging sources of labor on the internet: The case of America online volunteers. International Review of Social History, 48(S11), 205–223.
Roberts, S. T. (2016). Commercial content moderation: Digital laborers' dirty work. In S. U. Noble & B. Tynes (Eds.), The intersectional internet: Race, sex, class and culture online (pp. 147–160). New York: Peter Lang.
Turner, F. (2005). Where the counterculture met the new economy: The WELL and the origins of virtual community. Technology and Culture, 46(3), 485–512.

Content Screening

▶ Content Moderation

Context

▶ Contexts
Contexts

Feras A. Batarseh
College of Science, George Mason University, Fairfax, VA, USA

Synonyms

Context; Contextual inquiry; Ethnographic observation

Definition

Contexts refer to all the information available to a software system that characterizes the situation it is running within. Context can be found across all types of software systems (where it is usually intentionally injected); however, it is mostly contained by intelligent systems. Intelligent systems are driven by two main parts, the intelligent algorithm and the data. More data means better understanding of context; therefore, Big Data can be a major catalyst in increasing the level of systems' self-awareness (i.e., the context they are operating within).

Contextual Reasoning

Humans have the ability to perform the processes of reasoning, thinking, and planning effectively; ideas could be managed, organized, and even conveyed in a comprehensible and quick manner. Context awareness is a "trivial" skill for humans. That is because humans receive years of training while observing the world around them, use agreed-upon syntax (language), comprehend the context in which they are, and accommodate their understanding of events accordingly. Unfortunately, the same cannot be said about computers; this "understanding of context" is a major Artificial Intelligence (AI) challenge – if not the most important one in this age of technological transformations. With the latest massive diffusion of many new technologies such as smart mobile phones, Big Data, tablets, and the cloud, AI applications such as context-aware software systems are gaining much traction. Context-aware systems have the advantage of dynamically adapting to current events and occurrences in the system and its surroundings. One of the main characteristics of such systems is to adjust the behavior of the system without human-user intervention (Batarseh 2014).

Recent applications of context include: (1) intelligent user interfaces' design and development, (2) context in software development, (3) robotics, and (4) intelligent software agents, among many others.

For context to reach its intended goals, it must be studied from both the technical and non-technical perspectives. Highlighting human aspects in AI (through context) will reduce the fears that many critics and researchers have toward AI. The arguments against AI have been mostly driven by the complexity of the human brain that is characterized by psychology, philosophy, and biology. Such arguments – many scientists believe – could be tackled by context, while context could be heavily improved by leveraging Big Data.

Conclusion

From the early days of AI research, many argued against the possibility of complete and general machine intelligence. One of the strongest arguments is that it is not yet clear how AI will be able to replicate the human brain and its biochemistry; therefore, it will be very difficult to represent feelings, thoughts, intuitions, moods, and awareness in a machine. Context, however, is a gateway to many of these very challenging aspects of intelligence.

Further Reading

Batarseh, F. (2014). Chapter 3: Context-driven testing. In Context in computing: A cross-disciplinary approach for modeling the real world. Springer. ISBN: 978-1-4939-1886-7.
Contextual Inquiry

▶ Contexts

Control Panel

▶ Dashboard

Core Curriculum Issues (Big Data Research/Analysis)

Rochelle E. Tractenberg
Collaborative for Research on Outcomes and Metrics, Washington, DC, USA
Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics; and Rehabilitation Medicine, Georgetown University, Washington, DC, USA

Definition

A curriculum is defined as the material and content that comprises a course of study within a school or college, i.e., a formal teaching program. The construct of "education" is differentiated from "training" based on the existence of a curriculum, through which a learner must progress in an evaluable, or at least verifiable, way. In this sense, a fundamental issue about a "big data curriculum" is what exactly is meant by the expression. "Big data" is actually not a sufficiently concrete construct to support a curriculum, nor even the integration of one or more courses into an existing curriculum. Therefore, the principal "core curriculum issue" for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced through the curriculum. A second core issue is how to appropriately integrate those key knowledge, skills, and abilities (KSAs) into the curricula of those who will not obtain degrees or certificates in disciplines related to big data – but for whom training or education in these KSAs is still desired or intended. A third core issue is how to construct the curriculum – whether the degree is directly related to big data or some key KSAs relating to big data are proposed for integration into another curriculum – in such a way that it is evaluable. Since the technical attributes of big data and its management and analysis are evolving nearly constantly, any curriculum developed to teach about big data must be evaluated periodically (e.g., annually) to ensure that what is being taught is relevant; this suggests that core underpinning constructs must be identified so that learners in every context can be encouraged to adapt to new knowledge rather than requiring retraining or reeducation.

Role of the Curriculum in "Education" Versus "Training"

Education can be differentiated from training by the existence of a curriculum in the former and its absence in the latter. The Oxford English Dictionary defines education as "the process of educating or being educated, the theory and practice of teaching," whereas training is defined as "teaching a particular skill or type of behavior through regular practice and instruction." The United Nations Educational, Scientific and Cultural Organization (UNESCO) highlights the fact that there may be an articulated curriculum ("intended") but the curriculum that is actually delivered ("implemented") may differ from what was intended. There are also the "actual" curriculum, representing what students learn, and the "hidden" curriculum, which comprises all the bias and unintended learning that any given curriculum achieves (http://www.unesco.org/new/en/education/themes/strengthening-education-systems/quality-framework/technical-notes/different-meaning-of-curriculum/). These types of curricula are also described by the Netherlands Institute for Curriculum Development (SLO, http://international.slo.nl/) and worldwide in
multiple books and publications on curriculum development and evaluation.

When a curriculum is being developed or evaluated with respect to its potential to teach about big data, each of these dimensions of that curriculum (intended, implemented, actual, hidden) must be considered. These features, well known to instructors and educators who receive formal training to engage in the kindergarten–12th grade (US) or preschool/primary/secondary (UK/Europe) education, are less well known among instructors in tertiary/higher education settings whose training is in other domains – even if their main job will be to teach undergraduate, graduate, postgraduate, and professional students. It may be helpful, in the consideration of curricular elements around big data, for those in the secondary education/college/university setting to consider what attributes characterize the curricula that their incoming students have experienced relating to the same content or topics.

Many modern researchers in the learning domains reserve the term "training" to mean "vocational training." For example, Gibbs et al. (2004) identify training as specifically "skills acquisition" to be differentiated from instruction ("information acquisition"); together with socialization and the development of thinking and problem-solving skills, this information acquisition is the foundation of education overall. The vocational training is defined as a function of skills or behaviors to be learned ("acquired") by practice in situ. When considering big data trainees, defined as individuals who participate in any training around big data that is outside of a formal curriculum, it is important to understand that there is no uniform cognitive schema, nor other contextual support, that the formal curriculum typically provides. Thus, it can be helpful to consider "training in big data" as appropriate for those who have completed a formal curriculum in data-related domains. Otherwise, skills that are acquired in such training, intended for deployment currently and specifically, may actually limit the trainees' abilities to adapt to new knowledge, and thereby, lead to a requirement for retraining or reeducation.

Determining the Knowledge, Skills, and Abilities Relating to Big Data That Should Be Taught

The principal core curricular issue for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced through the curriculum. As big data has become an increasingly popular construct (since about 2010), different stakeholders in the education enterprise have articulated curricular objectives in computer science, statistics, mathematics, and bioinformatics for undergraduate (e.g., De Veaux et al. 2017) and graduate students (e.g., Greene et al. 2016). These stakeholders include longstanding national or international professional associations and new groups seeking to establish either their own credibility or to define the niche in "big data" where they plan to operate. However, "big data" is not a specific domain that is recognized or recognizable; it has been described as a phenomenon (Boyd and Crawford 2012) and is widely considered not to be a domain for training or education on its own. Instead, knowledge, skills, and abilities relating to big data are conceptualized as belonging to the discipline of data science; this discipline is considered as existing at the intersection of mathematics, computer science, and statistics. This is practically implemented as the articulation of foundational aspects of each of these disciplines together with their formal and purposeful integration into a formal curriculum.

With respect to data science, then, generally, there is agreement that students must develop abilities to reason with data and to adapt to a changing environment, or changing characteristics of data (preferably both). However, there is not agreement on how to achieve these abilities. Moreover, because existing undergraduate course requirements are complex and tend to be comprehensive for "general education" as well as for the content making up a baccalaureate, associate, or other terminal degree in the postsecondary context, in some cases just a single course may be considered for incorporation into either required
or elective course lists. This would represent the least coherent integration of big data into a college/university undergraduate curriculum. The construction of a program that would award a certificate, minor, or major – or of other programs intended to train or prepare people for jobs that either focus on, or simply "know about," big data – must follow the same curricular design principles that every formal educational enterprise should follow if it seeks to successfully prepare students for work in or with big data, statistics and data science, or analytics. If they do not, they risk underperforming on their advertising and promises.

It is important to consider the role of training in the development, or consideration of development, of curricula that feature big data. In addition to the creation of undergraduate degrees and minors, Master's degrees, post-baccalaureate certificate programs, and doctoral programs, all of which must be characterized by the curricula they are defined and created to deliver, many other "training" opportunities and workforce development initiatives also exist. These are being developed in corporate and other human resource-oriented domains, as well as in more open (open access) contexts. Unlike traditional degree programs, training and education around big data are unlikely to be situated specifically within a single disciplinary context – at least not exclusively. People who have specific skills, or who have created specific tools, often create free or easily accessible representations of the skills or tool – e.g., instructional videos on YouTube or as formal courses of varying lengths that can be read (slides, documentation) or watched as webinars. Examples can be found online at sites including Big Data University (bigdatauniversity.com), created by IBM and freely available, and Coursera (coursera.org), which offers data science, analytics, and statistics courses as well as eight different specializations, comprising curated series of courses – but also many other topics. Coursera has evolved many different educational opportunities and some curated sequences that can be completed to achieve "certification," with different costs depending on the extent of student engagement/commitment. The Open University (www.open.ac.uk) is essentially an online version of regular university courses and curricula (and so is closer to "education" than "training") – degree and certificate programs all have costs associated and also can be considered to follow a formal curriculum to a greater extent than any other option for widely accessible training/learning around big data. These examples represent a continuum that can be characterized by the attention to the curricular structure from minimal (Big Data University) to complete (The Open University). The individual who selects a given training opportunity, as well as those who propose and develop training programs, must articulate exactly what knowledge, skills, and abilities are to be taught and practiced. The challenge for individuals making selections is to determine how correctly an instructor or program developer has described the achievements the training is intended to provide. The challenge for those curating or creating programs of study is to ensure that the learning objectives of the curriculum are met, i.e., that the actual curriculum is as high a match to the intended curriculum as possible. Basic principles of curriculum design can be brought to bear for acceptable results in this matching challenge. The stronger the adherence to these basic principles, the more likely a robust and evaluable curriculum, with demonstrable impact, will result. This is not specific to education around big data, but with all the current interest in data and data science, these challenges rise to the level of "core curriculum issues" for this domain.

Utility of Training Versus a Curriculum Around Big Data

De Veaux et al. (2017) convened a consensus panel to determine the fundamental requirements for an undergraduate curriculum in "data science." They articulated that the main topical areas that comprise – and must be leveraged for appropriate baccalaureate-level training in – this domain are as follows: data description and curation, mathematical foundations, computational thinking, statistical thinking, data modeling, communication, reproducibility, and ethics. Since computational and statistical thinking, as well as data modeling, all require somewhat
different mathematical foundations, this list shows clearly the challenges in selecting specific "training opportunities" to support development of new skills in "big data" for those who are not already trained in quantitative sciences to at least some extent. Moreover, arguments are arising in many quarters (science and society, philosophy/ethics/bioethics, and professional associations like the Royal Statistical Society, American Statistical Association, and Association of Computing Machinery) that "ethics" is not a single entity but, with respect to big data and data science, is a complex – and necessary – type of reasoning that cannot be developed in a single course or training opportunity. The complexity of reasoning that is required for competent work in the domain referred to exchangeably as "data analytics," "data science," and "big data," which includes this ability to reason ethically, underscores the point that piecemeal training will be unsuccessful unless the trainee possesses the ability to organize the new material together with extant (high level) reasoning abilities, or at least a cognitive/mental schema within which the diverse training experiences can be integrated for a comprehensive understanding of the domain.

However, the proliferation of training opportunities around big data suggests a pervasive sense that a formal curriculum is not actually needed – just training is. This may arise from a sense that the technology is changing too fast to create a whole curriculum around it. Training opportunity creators are typically experts in the domain, but may not necessarily be sufficiently expert in teaching and learning theories, or the domains from which trainees are coming, to successfully translate their expertise into effective "training." This may lead to the development of new training opportunities that appear to be relevant, but which can actually contribute only minimally to an individual trainee's ability to function competently in a new domain like big data, because they do not also include or provide contextualization or schematic links with prior knowledge.

An example of this problem is the creation of "competencies" by subject matter expert consensus committees, which are then used to create "learning plans" or checklists. The subject matter experts undoubtedly can articulate what competencies are required for functional status in their domain. However, (a) a training experience developed to fill in a slot within a competency checklist often fails to support teaching and learning around the integration of the competencies into regular practice; and (b) curricula created in alignment with competencies often do not promote the actual development and refinement of these competencies. Instead, they may tend to favor the checking-off of "achievement of competency X" from the list.

Another potential challenge arises from the opposite side of the problem, learner-driven training development. "What learners want and need from training" should be considered together with what experts who are actually using the target knowledge, skills, and abilities believe learners need from training. However, the typical trainee will not be sufficiently knowledgeable to choose the training that is in fact most appropriate for their current skills and learning objectives. The construct of "deliberate practice" is instructive here. In their 2007 Harvard Business Review article, "The making of an expert," Ericsson, Prietula, and Cokely summarize Ericsson's prior work on expertise and its acquisition, commenting that "(y)ou need a particular kind of practice – deliberate practice – to develop expertise" (emphasis in original, p. 3). Deliberate practice is practice where weaknesses are specifically identified and targeted – usually by an expert both in the target skillset and perhaps more particularly in identifying and remediating specific weaknesses. If a trainee is not (yet) an expert, determining how best to address a weakness that one has self-identified can be another limitation on the success of a training opportunity, if it focuses on what the learner wants or believes they need without appeal to subject matter experts. This perspective argues for the incorporation of expert opinion into the development, descriptions, and contextualizations of training, i.e., the importance of deliberate practice in the assurance that as much as possible of the intended curriculum becomes the actual curriculum. Training opportunities around big data can be developed to support, or fill gaps in, a formal curriculum; without this context, training in big data may not be as successful as desired.
Conclusions

A curriculum is a formal program of study, and basic curriculum development principles are essential for effective education in big data – as in any other domains. Knowledge, skills, and abilities, and the levels to which these will be both developed and integrated, must be articulated in order to structure a curriculum to optimize the match between the intended and the actual curricula. The principal core curricular issue for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced. A second core issue is that the "big data" knowledge, skills, and abilities may require more foundational support for training of those who will not obtain, or have not obtained, degrees or certificates in disciplines related to big data. A third core issue is how to construct the curriculum in such a way that the alignment of the intended and the actual objectives is evaluable and modifiable as appropriate. Since the technical attributes of big data and its management and analysis are evolving nearly constantly, any curriculum developed to teach about big data must be evaluated periodically to ensure the relevance of the content; however, the alignment of the intended and actual curricula must also be regularly evaluated to ensure learning objectives are achieved and achievable.

Further Reading

Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication, & Society, 15(5), 662–679.
De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and its Applications, 4, 2.1–2.16. https://doi.org/10.1146/annurev-statistics-060116-053930. Downloaded from http://www.amstat.org/asa/files/pdfs/EDU-DataScienceGuidelines.pdf. 2 Jan 2017.
Ericsson, K. A., Prietula, M. J., & Cokely, E. T. (2007). The making of an expert. Harvard Business Review, 85(7–8), 114–121, 193. Downloaded from https://hbr.org/2007/07/the-making-of-an-expert. 5 June 2010.
Gibbs, T., Brigden, D., & Hellenberg, D. (2004). The education versus training and the skills versus competency debate. South African Family Practice, 46(10), 5–6. https://doi.org/10.1080/20786204.2004.10873146.
Greene, A. C., Giffin, K. A., Greene, C. S., & Moore, J. H. (2016). Adapting bioinformatics curricula for big data. Briefings in Bioinformatics, 17(1), 43–50. https://doi.org/10.1093/bib/bbv018.

Corporate Social Responsibility

Yon Jung Choi
Center for Science, Technology, and Innovation Policy, George Mason University, Fairfax, VA, USA

Big data has become a popular source for businesses to analyze various aspects of human psychology and behaviors and organizational processes and practices. However, the growing corporate use of big data has raised several ethical and social concerns, especially related to the use of personal data for commercial interests and possible infringement of fundamental rights (e.g., privacy) (Herschel and Miori 2017; Flyverbom et al. 2019). These concerns are primarily issues of corporate social responsibility (CSR), referring in general to the responsibilities of businesses for their impacts on society, including economic, legal, environmental, and social responsibilities. Despite growing public concerns, practical efforts to connect big data with CSR have been rarely made (Richards and King 2014; Zwitter 2014; Napier 2019). Corporate use of big data poses both risks and opportunities for society. The following sections summarize various CSR-related concerns and possible contributions to CSR brought by corporate use of big data that have been identified by scholars and practitioners.

CSR-Related Concerns

There are several social and ethical concerns in the use of big data by businesses. The following are
some of the issues that have been identified by scholars, the media, and the public.

• Infringement of privacy: The most well-publicized concern is the issue of a possible violation of privacy. A vast amount of personal data, including personal information, interests, movement, and relationships, are in the hands of internet companies that can be easily approached and exchanged among them with or without users' consent (Zwitter 2014; Herschel and Miori 2017; Flyverbom et al. 2019). Thus, users' privacy can be easily infringed upon by these companies under the current system, which lacks sufficient regulations.
• Transparency and consumer rights: In general, corporate information about how to create, manage, exchange, and protect big data composed of users' information is not open to the public because it is considered a proprietary asset. Consumers have little knowledge of how their information is gathered and handled by companies. Users' data that companies can access is extensive, including geolocation, photos and videos, contacts, text messages, and emails, and users are not fully aware of this (Flyverbom et al. 2019, p. 7). This raises not only concerns over privacy but also over consumers' "right-to-know" about the risks to their lives and well-being that can be caused by this.
• Data monopoly and manipulation of individual desires and needs: Because many internet companies enjoy exclusive rights to the big data generated by their platforms, there is an issue of data monopoly, as evidenced by the antitrust lawsuits filed against Facebook (Iyengar 2020; Flyverbom et al. 2019; Zwitter 2014). Scholars warn of internet companies' ability to shape "our views of the world by managing, editing, and controlling information in ways that have important consequences for individuals, organizations, and societies alike" (Flyverbom et al. 2019, p. 8; see also Flyverbom 2016; Helveston 2016). For example, how individuals represent themselves in the digital world has increasingly depended on decisions made by a handful of social network service (SNS) companies. People's public images and behaviors are greatly influenced by how these companies guide people through their digital platforms.
• Politicization and commodification of personal information: Another growing concern over big data handled by internet companies is the possibility of political use for mass surveillance and commodification of personal information, as evidenced by Edward Snowden's revelations about the US National Security Agency's mass surveillance, and the growing suspicion of surveillance by the Chinese government of online information provided by internet companies (Bauman et al. 2014; Hou 2017). Companies have the option to sell or provide their big data to governments and/or private companies for political or commercial purposes, which raises serious ethical concerns (Flyverbom et al. 2019).

Scholars have pointed out that laws and regulations over these issues are lacking at both national and global levels, and the lives and well-being of the public are significantly at stake. Some argue that companies, as "socially responsible" actors, should consider the ethical management of big data as part of their business account (Fernandes 2018; Flyverbom et al. 2019). In other words, socially conscious management of big data is argued to be part of the main areas of CSR and should be scrutinized by the public through sources such as CSR/transparency reports, especially for those companies creating and dealing with big data. In this regard, the main areas of CSR that are widely recognized, especially in academia and industry, are environmental protection, employee welfare, stakeholder involvement, anticorruption, human rights protection, and community development (UNGC 2020; GRI 2018). Others also insist on the necessity of developing more inclusive decision-making mechanisms either within corporate governance or collaboratively by inviting various stakeholders, enabling them to serve their interests more adequately (Flyverbom 2016; Flyverbom et al. 2019).
Implications of Big Data Contributions to CSR

Scholars and practitioners are increasingly engaging big data in relation to CSR, as summarized below:

• Measurement, Assessment, and Enhancement of CSR performance: Big data can be used to measure and evaluate CSR and sustainable development activities of companies by analyzing environmental/social data and communications of corporate stakeholders (Barbeito-Caamaño and Chalmeta 2020; Jeble et al. 2018). Big data analytics may help to ease the difficulty of measuring and evaluating intangible social values influenced by corporate practices (Jeble et al. 2018). In addition, new information technologies using big data can also help manage and enhance companies' social and environmental performances (Carberry et al. 2017; Akhtar et al. 2018; Napier 2019).
• Better management of stakeholders: Scholars and practitioners have recognized the potential of big data in stakeholder management. For instance, SAP Ariba, a software company, has developed procurement intelligence (a method of big data analytics) to identify "unethical or unsustainable business practices" of suppliers of companies and therefore enhance their supply chain management (York 2018). Big data analytics can also be used for better management of other stakeholders, such as employees and consumers, with a more in-depth understanding of their needs and preferences.
• Contributions to creating social goods: Companies arguably can generate significant social benefits with a deeper understanding of the people, organizations, culture, and values of a society if they use and manage big data more responsibly and implement more socially conscious practices. More specifically, companies can make a significant contribution to generating public goods such as "improvements in health care, education, and urban planning" through big data analytics (Napier 2019; Flyverbom et al. 2019, p. 12).

Although it is still at an early stage, the debate on corporate social responsibility surrounding big data has pointed out both risks and benefits to society. A better understanding is needed of the economic, political, environmental, and social impact and implications of corporate use of big data in order to discuss and establish the proper and ethical roles and responsibilities of business in society.

Further Reading

Akhtar, P., Khan, Z., Frynas, J., Tse, Y., & Rao-Nicholson, R. (2018). Essential micro-foundations for contemporary business operations: Top management tangible competencies, relationship-based business networks and environmental sustainability. British Journal of Management, 29, 43–62.
Barbeito-Caamaño, A., & Chalmeta, R. (2020). Using big data to evaluate corporate social responsibility and sustainable development practices. Corporate Social Responsibility & Environmental Management, 27(6), 2831–2848.
Bauman, Z., Bigo, D., Esteves, P., Guild, E., Jabri, V., Lyon, D., & Walker, R. B. J. (2014). After Snowden: Rethinking the impact of surveillance. International Political Sociology, 8(2), 121–144.
Carberry, E., Bharati, P., Levy, D., & Chaudhury, A. (2017). Social movements as catalysts for corporate social innovation: Environmental activism and the adoption of green information systems. Business & Society, 58(5), 1083–1127.
Fernandes, K. (2018, November 2). CSR in the era of big data. The CSR Journal. https://thecsrjournal.in/csr-era-big-data-analytics-private-companies/.
Flyverbom, M. (2016). Disclosing and concealing: Internet governance, information control, and the management of visibility. Internet Policy Review, 5(3), 1–15.
Flyverbom, M., Deibert, R., & Matten, D. (2019). The governance of digital technology, big data, and the internet: New roles and responsibilities for business. Business & Society, 58(1), 3–19.
Global Reporting Initiative (GRI). (2018). GRI standards. https://www.globalreporting.org/media/55yhvety/gri-101-foundation-2016.pdf.
Helveston, M. (2016). Consumer protection in the age of big data. Washington University Law Review, 93(4–5), 859.
Herschel, R., & Miori, V. (2017). Ethics & big data. Technology in Society, 49, 31–36.
Hou, R. (2017). Neoliberal governance or digitalized autocracy? The rising market for online opinion
surveillance in China. Surveillance & Society, 15(3/4), 418–424.
Iyengar, R. (2020, December 11). The antitrust case against Facebook: Here's what you need to know. CNN Business. https://www.cnn.com/2020/12/11/tech/facebook-antitrust-lawsuit-what-to-know/index.html.
Jeble, S., Dubey, R., Childe, S., Papadopoulos, T., Roubaud, D., & Prakash, A. (2018). Impact of big data and predictive analytics capability on supply chain sustainability. International Journal of Logistics Management, 29(2), 513–538.
Napier, E. (2019). Technology enabled social responsibility projects and an empirical test of CSR's impact on firm performance. (Doctoral dissertation, Georgia State University). ScholarWorks @ Georgia State University. https://scholarworks.gsu.edu/marketing_diss/50.
Richards, N. M., & King, J. H. (2014). Big data ethics. Wake Forest Law Review, 49, 393–432.
United Nations Global Compact (UNGC). (2020). UNGC principles. https://www.unglobalcompact.org/what-is-gc/mission/principles.
York, M. (2018, March 26). Intelligence-driven CSR: Putting big data to good use. COP Rising. https://cporising.com/2018/03/26/intelligence-driven-csr-putting-big-data-to-good-use/.
Zwitter, A. (2014). Big data ethics. Big Data & Society, 1(3), 1–6.

Corpus Linguistics

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Introduction

Corpus linguistics is, broadly speaking, the application of "big data" to the science of linguistics. Unlike traditional linguistic analysis [caricatured by Fillmore (1992) as "armchair linguistics"], which relies on native intuition and introspection, corpus linguists rely on large samples to quantitatively analyze the distribution of linguistic items. It has therefore tended to focus on what can be easily measured by computer and quantified, such as words, phrases, and word-based grammar, instead of more abstract concepts such as discourse or formal syntax. With the advent of high-powered computers and the increased availability of machine-readable texts, it has become a major force in modern linguistic research.

History

The use of corpora for language analysis long predates computers. Theologians were making Biblical concordances in the eighteenth century, and Samuel Johnson started a tradition followed to this day (e.g., most famously by the Oxford English Dictionary) of compiling collections of quotations from prestigious literature to form the basis of his dictionary entries. Dialect dictionaries such as the Dictionary of American Regional English (DARE) are typically compiled on the basis of questionnaires or interviews of hundreds or thousands of people.

The first and possibly most famous use of computer-readable corpora was the million-word Brown corpus (Kučera and Nelson Francis 1967). The Brown corpus consists of 500 samples, each of about 2000 words, collected from writings published in the United States in 1961. Genre coverage includes nine categories of "informative prose" and six of "imaginative prose," including inter alia selections of press reportage, learned journals, religious tracts, and mystery novels. For many years, the Brown corpus was the only large corpus available and the de facto standard. Even today, the Brown corpus has influenced the design and collection of many later corpora, including the LOB corpus (British English), the Kolhapur corpus (Indian English), the Australian Corpus of English, and the 100-million-word British National Corpus.

Improvements in computer science made three major innovations possible. First, more powerful computers made the actual task of processing data much easier and faster. Second, improvements in networking technology make it practical to distribute data more easily and even to provide corpus analysis as a service via web interfaces such as https://books.google.com/ngrams or https://corpus.byu.edu/coca. Finally, the development of online publishing via platforms such as the Web makes it much easier to collect data simply by
scraping databases; similarly, improvements in optical character recognition (OCR) technology have made large-scale scanning projects such as https://www.google.com/googlebooks/about/ more practical.

Theory

From the outset, corpus linguistics received pushback from some theoretical linguists. Chomsky, for example, stated that "Corpus linguistics doesn't mean anything" [cited in McEnery and Hardy 2012]. Meyer (2002) describes a leading generative grammarian as saying "'the only legitimate source of grammatical knowledge' about a language [is] the intuitions of the native speaker." Statistics often provide an inadequate explanatory basis for linguistic findings. For example, one can observe that the sentence *Studied for the exam does not appear in a sample of English writing, but He studied for the exam does. It requires substantial intuition, ideally by a native speaker, to observe that the first form is not merely rare but actively ungrammatical. More specifically, intuition is what tells us that English (unlike Italian) generally requires that all sentences have an explicit subject. (Worse, the sentence Studied for the exam might appear, perhaps as an example, or perhaps in an elliptical context. This might suggest to the naïve scholar that the only difference between the two forms is how common each is.)

Similarly, just because a phenomenon is common does not make it important or interesting. Fillmore (1992) caricatures this argument as "if natural scientists felt it necessary to portion out their time and attention to phenomena on the basis of their abundance and distribution in the universe, almost all of the scientific community would have to devote itself exclusively to the study of interstellar dust."

At the same time, intuitions are not necessarily reliable; perhaps more importantly, they are unshared. Fillmore (1992) cites as an example his theory that "the colloquial gesture-requiring yea, as in It was about yea big," couldn't be used in a context when the listener couldn't see the speaker. However, Fillmore acknowledged that people have been observed using this very expression over the telephone, indicating that his intuitions about acceptability (and the reasons for unacceptability) are not necessarily universally shared. Alternatively, people's intuitions may not accurately reflect their actual use of language, a phenomenon found in other studies of human expertise. Observation of actual use can often be made only by using empirical, that is, corpus-type, evidence.

Corpora therefore provide observational evidence about the use of language – which patterns are used, and by extension, which might not be – without necessarily diving deeper into a description of the types of patterns used or an explanation of the underlying processes. Furthermore, they provide a way of capturing the effects of the intuitions of hundreds, thousands, or millions of people instead of a single researcher or small team. They enable investigation of rare phenomena that researchers may not have imagined and allow quantitative investigations with greater statistical power to discern subtle effects.

Applications of Corpus Linguistics

Corpora are used for many purposes, including language description and as a resource for language learning. One long-standing application is compiling dictionaries. By collecting a large enough number of samples of a specific word in use, scholars can identify the major categories of meanings or contexts. For example, the English word risk typically takes three different types of direct objects – you can "risk" an action (I wouldn't risk that climb), what you might do as a consequence (... because you risk a fall on the slippery rocks), or even the consequence of the consequence, what you might lose (... and you would risk your life). The different meanings of polysemous terms (words with multiple meanings) like bank (the edge of a river, a financial institution, and possibly other meanings, such as a bank shot) can be identified from similar lists.
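The kind of evidence that lexicographers sort into sense categories can be gathered with a simple keyword-in-context (KWIC) listing. The Python sketch below is illustrative only (the miniature corpus and the function name are invented); it pulls out each occurrence of a word such as risk or bank together with a few words of surrounding context, ready for manual grouping into senses.

```python
import re

# A toy corpus; real corpora contain millions of words.
corpus = (
    "I wouldn't risk that climb. You risk a fall on the slippery rocks. "
    "They would risk their lives. We walked along the bank of the river. "
    "She opened an account at the bank."
)

def kwic(text: str, keyword: str, window: int = 4):
    """Return each occurrence of keyword with `window` words of context."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>28}  [{keyword}]  {right}")
    return lines

for line in kwic(corpus, "bank"):
    print(line)
```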
Corpora can help identify frequently occurring patterns (collocations) such as idioms and can identify grammatical patterns such as the types of grammatical structure associated with specific words. (For instance, the word give can take an indirect object, as in John gave Mary a taco, but donate typically does not – constructions such as *John donated the museum a statue are vanishingly rare.)

Another application is in detecting and illustrating language variation and change. For example, corpora of Early Modern English such as the ARCHER corpus can help illustrate differences between Early Modern and contemporary English. Similar analysis across genres can show different aspects of genre variations, such as what Biber (cited in McEnery and Hardy 2012) termed "narrative vs. non-narrative concerns," a concept describing the relation of specific past events with specific people. Other studies show differences between groups or even between specific individuals (Juola 2006), a capacity of interest to law enforcement (who may want to know which specific individual was associated with a specific writing, like a ransom note).

Corpora can also provide source material to train statistical language processors on. Many natural language tasks (such as identifying all the proper nouns in a document or translating a document from one language to another) have proven to be difficult to formalize in a rule-based system. A suitable corpus (perhaps annotated by partial markup done by humans) can provide the basis instead for a machine learning system to determine complex statistical patterns associated with that task. Rather than requiring linguists to list the specific attributes of a proper noun or the specific rules governing the exact translation of the verb to wear into Japanese, the system can "learn" patterns associated with these distinctions and generalize them to novel contexts. Other examples of such natural language processing problems include parsing sentences to determine their constituent structure, resolving ambiguities such as polysemous terms, providing automatic markup such as tagging the words in a document for their parts of speech, answering client questions ("Siri, where is Intel based?"), or determining whether the overall sentiment expressed in an online review is favorable or unfavorable.
related or linked. A correlation and a causation are two distinct and separate statistical terms that can each individually be used to describe and interpret different types of data. Sometimes the two terms are mistakenly used interchangeably, which could misrepresent important trends in a given data set. The danger of using these terms as synonyms has become even more problematic in recent years with the continued emergence of research projects relying on big data. Any time a researcher utilizes a large dataset with thousands of observations, they are bound to find correlations between variables; however, with such large datasets, there is an inherent risk that these correlations are spurious as opposed to causal.
A correlation, sometimes called an association, describes the linear relationship or lack of a linear relationship between two or more given variables. The purpose of measuring data sets for correlation is to determine the strength of association between different variables. There are several statistical methods to determine whether a correlation exists between variables, whether that correlation is positive or negative, and whether the correlation shows a strong association or a weak association. Correlations can be either positive or negative. A positive correlation occurs when one variable increases as another variable increases or when one variable decreases as another variable decreases. For a positive correlation to be evident, the variables being compared have to move in tandem with one another. A negative correlation behaves in the opposite pattern; as one variable increases another variable decreases, or as the first variable decreases, the second variable increases. With negative correlations, the variables in question need to move in the opposite direction of one another. It is also possible for no correlation to exist between two or more different variables.
Statistically, correlations are stated by the correlation coefficient r. A correlation coefficient with a value that is greater than zero up to and including one indicates a positive linear correlation, where a score of one is a perfect positive correlation and a positive score close to zero represents variables that have a very weak or limited positive correlation. One example of a strong positive correlation would be the amount of time spent exercising and the amount of calories burned off through the course of a workout. This is an example of positive correlation because as the amount of time spent exercising increases, so does the amount of calories being burned. A correlation coefficient value that ranges from anything less than zero to negative one indicates a negative linear correlation, where a score of negative one is a perfect negative correlation and a negative score close to zero represents variables that have a very weak or limited negative correlation. An example of a negative correlation would be the speed at which a car is traveling and the amount of time it takes that car to arrive at its destination. As the speed of the car decreases, the amount of time traveling to the destination increases. A correlation coefficient of zero indicates that there is no correlation between the variables in question. Additionally, when two variables result in a very small negative or positive correlation, such as −0.01 or 0.01, the corresponding negative or positive correlation is often considered to have very little substantive meaning, and thus variables in cases such as these are also often considered to have little to no correlation.
R, or the correlation coefficient, can be determined in several ways. One way to determine the correlation between two different variables is through the use of graphing methods. Scatterplots can be used to compare more than one variable. When using a scatterplot, one variable is graphed along the x-axis, while the other variable is graphed along the y-axis. Once all of the points are graphed, a line of best fit, a line running through the data points where half of the data points are positioned above the line and half of the data points are graphed below the line, can be used to determine whether there is a correlation between the variables being examined. If the line of best fit has a positive slope and the points cluster tightly around it, there is a strong positive correlation between the variables, and if the slope of the line of best fit is negative, the correlation between the variables is negative. Finally, if the slope of the line of best fit is close to zero, or close to being a straight line, there is little to no correlation between the variables. A correlation coefficient can also be determined numerically by the use of Karl Pearson’s coefficient of correlation formula.
Using this formula involves taking Sigma, or the sum, over all observations, of the product of the difference between each individual x value and the mean x value and the difference between the corresponding y value and the mean y value. This sum then becomes the numerator of the equation and is divided by the product of N, or the number of observations, the standard deviation of x, and the standard deviation of y. Fortunately, due to the ever advancing and expanding field of technology, correlation coefficients can now be determined practically instantly through the use of technology such as graphing calculators and different types of statistical analysis software. Due to the speed and increasing availability of these types of software, the practice of manually calculating correlation coefficients is limited mostly to classrooms.
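For readers who want to see the computation spelled out, the following short Python sketch applies Pearson’s formula directly to a small, made-up data set of workout lengths and calories burned; the numbers are illustrative only.

import math

def pearson_r(x, y):
    """Pearson's r: sum of products of deviations divided by N * sd(x) * sd(y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return cov / (sd_x * sd_y)

minutes = [20, 35, 45, 60, 75]        # hypothetical minutes of exercise
calories = [150, 260, 340, 450, 560]  # hypothetical calories burned
r = pearson_r(minutes, calories)
print(round(r, 3), round(r ** 2, 3))  # r and the coefficient of determination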
In relation to experimentation and data analysis, causation means that changes in one variable directly relate to and influence changes in another variable. When causation or causality is present between two variables, the first variable, the variable that is the cause, may bring the second variable into occurrence or influence the direction and movement of the second variable. While on the surface, or by reading definitions alone, correlation and causation may appear to be the same thing, or it may at least appear that correlations prove causation, this is not the case. Correlations do not necessarily indicate causation because causation is not always straightforward and is difficult to prove. Even a strong correlation cannot immediately be considered causation due to the possibility of confounding variables. Confounding variables are extraneous variables, or variables that are not being controlled for or measured in a particular experiment or survey but could still have an impact on the results. When examining variables x and y, it may be possible to determine that x and y have a positive correlation, but it is not as clear that x causes y, because confounding variables w, z, etc. could also be influencing the outcome of y unbeknownst to the individuals examining the data. For example, the number of people at the beach could increase as the daily temperature increases, but it is not possible to know that the increase in temperature caused the increased beach attendance. Other variables could be in play; more people could be at the beach because it is a federal holiday or because the local public swimming pool is closed for repairs. Due to the possibility of confounding variables, it is not possible to determine causation from correlation alone.
Despite this, it is possible to estimate whether there is a causal relationship between two variables. One way to evaluate causation is through the use of experimentation with random samples. Through the use of either laboratory or field experiments, researchers may be able to estimate causation since they will be able to control or limit the effect of possible confounding variables. For the beach example listed above, researchers could measure beach attendance on a daily basis for a given period of time. This would help them to eliminate potential extraneous variables, such as holidays, because data will be collected according to a set plan and not just based on one day’s worth of observations. Additionally, the coefficient of determination, or r², can be used to measure whether two variables cause the changes in one another. The coefficient of determination is equivalent to the correlation coefficient multiplied by itself (squared). Since any number squared is a positive number, the coefficient of determination will always have a positive value. As is the case with positive correlations, the closer to one that a coefficient of determination is, the more likely it is that the first variable being examined caused the second variable. This is because r² tells what percentage of the variation in the dependent variable y is explained by the independent variable x. The closer r² is to 1, the better x explains y. Of these two methods, experimental research is more widely accepted when claiming causality between two variables, but both methods provide better indicators of causality than does a single correlation coefficient. This is especially true when utilizing big data since the risk of finding spurious correlations increases as the number of observations increases.
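This scale effect can be illustrated with a small simulation; the Python sketch below (illustrative only) screens many unrelated random variables against a target variable and reports the largest correlation found, which tends to look impressive even though every relationship is pure noise.

import random
from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

random.seed(1)
n_obs, n_vars = 50, 500
target = [random.gauss(0, 1) for _ in range(n_obs)]
best = max(abs(pearson_r(target, [random.gauss(0, 1) for _ in range(n_obs)]))
           for _ in range(n_vars))
print(best)  # a "strong-looking" correlation often arises by chance alone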
Cross-References

▶ Association Versus Causation
▶ Data Integrity
▶ Data Processing
▶ Transparency
Further Reading

Correlation Coefficients. (2005, January 1). Retrieved 14 Aug 2014, from http://www.andrews.edu/~calkins/math/edrm611/edrm05.htm.
Green, N. (2012, January 6). Correlation is not causation. Retrieved 14 Aug 2014, from http://www.theguardian.com/science/blog/2012/jan/06/correlation-causation.
Jaffe, A. (2010, January 1). Correlation, causation, and association – What does it all mean? Retrieved 14 Aug 2014, from http://www.psychologytoday.com/blog/all-about-addiction/201003/correlation-causation-and-association-what-does-it-all-mean.

COVID-19 Pandemic

Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Overview

In 2020, COVID-19 took the world by storm. First discovered in China, the novel coronavirus quickly and aggressively spread throughout Asia and then to the rest of the world. As of November 2020, COVID-19 infections, deaths, and hospitalizations continue to rise with no end in sight. In attempts to manage the pandemic’s progression, big data are playing an innovative and instrumental role (Lin and Hou 2020; Pham et al. 2020; Vaishya et al. 2020). Specifically, big data are being used for:

1. Disease surveillance and epidemiological modeling
2. Understanding disease risk factors and triggers
3. Diagnosis, treatment, and vaccine development
4. Resource optimization, allocation, and distribution
5. Formulation and evaluation of containment policies

Various sources of structured and unstructured big data, such as mobile phones, social media platforms, search engines, biometrics sensors, genomics repositories, images and videos, electronic health records, wearable devices, satellites, wastewater systems, and scholarly articles and clinical studies, are being exploited for these purposes (Lin and Hou 2020).
Working hand-in-hand with big data is an integrated set of emerging digital, cyber-physical, and biological tools and technologies (Ting et al. 2020). Indeed, the COVID-19 pandemic has unfolded during a period of rapid, disruptive, and unprecedented technological change referred to as a Fourth Industrial Revolution. In this context, emerging technologies, such as Artificial Intelligence (AI) enabled by deep learning, have been invaluable in the fight against COVID-19, particularly for transforming big data into actionable insight (i.e., translating evidence to action). Other technologies such as blockchain, the Internet of Things (IoT), and smart mobile devices provide big data sources and the means for processing, storing, and vetting massive bits and bytes of information.

Benefits and Opportunities

Big data are helping to address the informational challenges of the COVID-19 pandemic in various ways. First, in a global disease outbreak like COVID-19, there is a need for timely information, especially given that conditions surrounding the disease’s spread and understanding of the disease itself are very fluid. Big data tends to have a high velocity, streaming in at a relatively fast pace – in some cases, second-by-second. In fact, data are produced now at an exponentially higher speed than in other recent pandemics, e.g., the SARS 2002–2003 outbreak. In COVID-19, such fast-moving big data enable continuous surveillance of epidemiological dynamics and outcomes and forecasting and on-the-fly predictions and assessments, i.e., “nowcasting.” For example, big data produced by Internet and mobile phone users are helping with the ongoing evaluation of non-pharmaceutical interventions, such as shutdowns, travel bans, quarantines, and social distancing mandates (Oliver et al. 2020).
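As a highly simplified illustration of the “nowcasting” idea, the Python sketch below (with invented numbers) rescales a fast but noisy same-day signal, such as symptom-related search queries, using lagged official case counts; real nowcasting models are far more sophisticated.

# Hypothetical daily data: official counts arrive with a one-day delay.
search_index = [110, 130, 160, 200, 260]   # e.g., query volume, available same day
lagged_cases = [55, 66, 80, 101]           # confirmed cases, one day behind

# Fit a crude scaling factor on the overlapping days, then apply it to today.
scale = sum(c / s for c, s in zip(lagged_cases, search_index)) / len(lagged_cases)
nowcast = scale * search_index[-1]
print(round(nowcast))  # rough same-day estimate of current cases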
In a pandemic, there is also a dire need to understand the pathology of and risk factors
behind the disease in question, particularly for the rapid development and discovery of effective preventative measures, treatments, and vaccines. In COVID-19, there have been active attempts to mine massive troves of data for these purposes (Pham et al. 2020; Vaishya et al. 2020). For instance, pharmaceutical, biomedical, and genetic data, along with scientific studies and clinical trials, are being combined and integrated for understanding how existing drugs might work in treating or preventing COVID-19.
Having accurate and complete information is also imperative in a pandemic. In this regard, conventional data sources are often inadequate. In the COVID-19 pandemic, official estimates of regional infection and fatality rates have been unreliable due to failures and delays in testing and reporting, measurement inconsistencies across organizations and places, and a high prevalence of undetected asymptomatic cases. Wastewater and sewage sensing, coupled with big data analytics, are being used in many communities to fill in these informational gaps and serve as an early warning system for COVID-19 outbreaks.
Finally, in a pandemic, it is vital to have disaggregated data, i.e., data with a high level of spatial and temporal resolution. Such data are crucial for enabling activities such as contact tracing, localized hotspot detection, and parameterization of agent-based epidemiological models. In this regard, traditional administrative records fall short, as they summarize information in an aggregated form. On the other hand, big geo-temporal data, such as that produced by “apps,” social media platforms, and mobile devices, have refined spatial and temporal granularity. Given the data are rich in information on individuals’ space-time movements and their social and spatial interaction from moment to moment, they have been an essential source of information in the pandemic.

Downsides and Dilemmas

With all that said, big data are not necessarily a magical or quick-and-easy panacea for any problem. Pandemics are no exception. First of all, there are computational and analytical challenges that come into play, from data acquisition and filtering to analysis and modeling. Such problems are compounded by the fact that there is an information overload due to the enormous amounts of data being produced via active and passive surveillance of people, places, and the disease itself. The quality and integrity of big data are a related matter. As with conventional sources of data, big data are far from perfect. Indeed, many big data sources used in the battle against the novel coronavirus are fraught with biases, noise, prejudices, and imperfections. For instance, social media posts, search engine queries, and Web “apps” are notoriously skewed toward particular demographics and geographies, owing to digital divides and differences in individual preferences, needs, and desires.
The use of big data and digital analytics for managing the COVID-19 pandemic also raises various ethical and legal issues and challenges (Gasser et al. 2020; Zwitter and Gstrein 2020). One problem of grave concern in this context is privacy. Many sources of big data being used for managing the pandemic contain sensitive and personally identifiable information, which can be used to “connect the dots” about individuals’ activities, preferences, and motivations. Big biometrics data, such as that produced by thermal recognition sensors, raises a similar set of concerns. While steps can be taken to mitigate privacy concerns (e.g., anonymization via the use of synthetic data), in a significant health crisis like COVID-19, there is an urgency to find solutions. Thus, the implementation of privacy protections may not be feasible or desirable, as they can hinder effective and timely public health responses.
Another set of ethical issues pertains to the use of big data-enabled AI systems for decision-making in the pandemic (Leslie 2020). One problem, in particular, is that AI has the potential to produce biased and discriminatory outcomes. In general, the accuracy and fairness of AI systems hinge crucially on having precise and representative big data in the first place. Accordingly, if such data are skewed, incomplete, or inexact, AI-enabled tools and models may produce unreliable, unsafe, biased, and prejudiced outcomes and decisions (Leslie 2019). For example, facial
recognition systems – used for surveillance purposes in the pandemic – have been criticized in this regard. Further, AI learns from patterns, relationships, and dynamics associated with real-world phenomena. Hence, if there are societal gaps and disparities in the first place, then AI is likely to mimic them unless appropriate corrective actions are employed. Indeed, COVID-19 has brought to the fore various social, economic, and digital inequities, including those propelled by the pandemic itself. Accordingly, conclusions, decisions, and actions based on AI systems for the pandemic have the potential to disadvantage certain segments of the population, which has broader implications for public health, human rights, and social justice.

Looking Forward

Big data will undoubtedly play an influential role in future pandemics, which are inevitable, given our increasingly globalized society. However, as highlighted, while big data for a significant health emergency like COVID-19 brings an array of benefits and opportunities, it also comes with various downsides and dilemmas. As technology continues to accelerate and advance, and new sources of big data and analytical and computational tools surface, the upsides and downsides may look quite different as well.

Cross-References

▶ Biomedical Data
▶ Epidemiology
▶ Ethical and Legal Issues
▶ Spatiotemporal Analytics

Further Reading

Gasser, U., Ienca, M., Scheibner, J., Sleigh, J., & Vayena, E. (2020). Digital tools against COVID-19: Taxonomy, ethical challenges, and navigation aid. The Lancet Digital Health, 2(8), e425–e434.
Leslie, D. (2019). Understanding artificial intelligence ethics and safety. arXiv preprint arXiv:1906.05684.
Leslie, D. (2020). Tackling COVID-19 through responsible AI innovation: Five steps in the right direction. Harvard Data Science Review.
Lin, L., & Hou, Z. (2020). Combat COVID-19 with artificial intelligence and big data. Journal of Travel Medicine, 27(5). https://doi.org/10.1093/jtm/taaa080.
Oliver, N., Letouzé, E., Sterly, H., Delataille, S., De Nadai, M., Lepri, B., et al. (2020). Mobile phone data and COVID-19: Missing an opportunity? arXiv preprint arXiv:2003.12347.
Pham, Q. V., Nguyen, D. C., Hwang, W. J., & Pathirana, P. N. (2020). Artificial intelligence (AI) and Big Data for coronavirus (COVID-19) pandemic: A survey on the state-of-the-arts. https://doi.org/10.20944/preprints202004.0383.v1.
Ting, D. S. W., Carin, L., Dzau, V., & Wong, T. Y. (2020). Digital technology and COVID-19. Nature Medicine, 26(4), 459–461.
Vaishya, R., Javaid, M., Khan, I. H., & Haleem, A. (2020). Artificial intelligence (AI) applications for COVID-19 pandemic. Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 14, 337.
Zwitter, A., & Gstrein, O. J. (2020). Big data, privacy and COVID-19 – learning from humanitarian expertise in data protection. Journal of International Humanitarian Action, 5(4), 1–7.

Crowdsourcing

Heather McIntosh
Mass Media, Minnesota State University, Mankato, MN, USA

Crowdsourcing is an online participatory culture activity that brings together large, diverse sets of people and directs their energies and talents toward varied tasks designed to achieve specific goals. The concept draws on the principle that the diversity of knowledge and skills offered by a crowd exceeds the knowledge and skills offered by an elite, select few. For big data, it offers access to abilities for tasks too complex for computational analysis. Corporations, government groups, and nonprofit organizations all use crowdsourcing for multiple projects, and the crowds consist of volunteers who choose to engage tasks toward goals determined by the organizations. Though these goals may benefit the organizations more so than the crowds helping them, ideally the benefit is shared between the two. Crowdsourcing
breaks down into basic procedures, the tasks and their applications, the crowds and their makeup, and the challenges and ethical questions.
Crowdsourcing follows a general procedure. First, an organization determines the goal or the problem that requires a crowd’s assistance in order to achieve or solve. Next, the organization defines the tasks needed from the crowd in order to fulfill its ambitions. After, the organization seeks the crowd’s help, and the crowd engages the tasks. In selective crowdsourcing, the best solution from the crowd is chosen, while in integrative crowdsourcing, the crowd’s solutions become worked into the overall project in a useful manner.
Working online is integral to the crowdsourcing process. It allows the gathering of diverse individuals who are geographically dispersed to “come together” for working on the projects. The tools the crowds need to engage the tasks also appear online. Since using an organization’s own tools can prove too expensive for big data projects, organizations sometimes use social networks for recruitment and task fulfillment. The documentary project Life in a Day, for example, brought together video footage from people’s everyday lives from around the world. When possible, people uploaded their footage to YouTube, a video-sharing platform. To address the disparities of countries without access to digital production technologies and the Internet, the project team sent cameras and memory storage cards through the mail. Other services assist with recruitment and tasks. LiveWork and Amazon Mechanical Turk are established online service marketplaces, while companies such as InnoCentive and Kaggle offer both the crowds and the tools to support an organization’s project goals.
Tasks vary depending on the project’s goals, and they vary in structure, interdependence, and commitment. Some tasks follow definite boundaries or procedures, while others are open-ended. Some tasks depend on other tasks for completion, while others stand alone. Some tasks require but a few seconds, while others demand more time and mental energy. More specifically, tasks might include finding and managing information, analyzing information, solving problems, and producing content. With big data, crowds may enter, clean, and validate data. The crowds may even collect data, particularly geospatial data, which prove useful for search and rescue, land management, disaster response, and traffic management. Other tasks might include transcription of audio or visual data and tagging.
When bringing crowdsourcing to big data, the crowd offers skills that benefit through matters of judgment, contexts, and visuals – skills that exceed computational models. In terms of judgment, people can determine the relevance of items that appear within a data set, identify similarities among items, or fill in holes within the set. In terms of contexts, people can identify the situations surrounding the data and how those situations influence them. For example, a person can determine the difference between the Statue of Liberty on Ellis Island in New York and the replica on The Strip in Las Vegas. The contexts then allow determination of accuracy or ranking, such as in this case differentiating the real from the replica. People also can determine more in-depth relationships among data within a set. For example, people can better decide the accuracy of search engine terms and results matches, determine better the top search result, or even predict other people’s preferences.
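A common pattern when crowds classify big data is to assign each item to several volunteers and keep the label most of them agree on. The Python sketch below, with made-up labels, shows this kind of majority-vote aggregation and flags items where the crowd is split; the category names and agreement threshold are hypothetical.

from collections import Counter

def aggregate(labels_by_item, min_agreement=0.6):
    """Keep the majority label per item; flag items without enough agreement."""
    results = {}
    for item, labels in labels_by_item.items():
        label, count = Counter(labels).most_common(1)[0]
        results[item] = label if count / len(labels) >= min_agreement else "needs review"
    return results

votes = {
    "galaxy_001": ["spiral", "spiral", "elliptical", "spiral"],
    "galaxy_002": ["merger", "elliptical", "spiral"],
}
print(aggregate(votes))  # {'galaxy_001': 'spiral', 'galaxy_002': 'needs review'}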
Properly managed crowdsourcing begins within an organization that has clear goals for its big data. These organizations can include government, corporations, and nonprofit organizations. Their goals can include improving business practices, increasing innovations, decreasing project completion times, developing issue awareness, and solving social problems. These goals frequently involve partnerships that occur across multiple entities, such as government or corporations partnering with not-for-profit initiatives.
At the federal level and managed through the Massachusetts Institute of Technology’s Center for Collective Intelligence, Climate CoLab brings together crowds to analyze issues related to global climate change, registering more than 14,000 members who participate in a range of contests. Within the contests, members create and refine proposals that offer climate change solutions. The proposals then are evaluated by the
community and, through voting, recommended for implementation. Contest winners presented their proposals to those who might implement them at a conference. Some contests build their initiatives on big data, such as Smart Mobility, which relies on mobile data for tracking transportation and traveling patterns in order to suggest ways for people to reduce their environmental impacts while still getting where they want to go.
Another government example comes from the city of Boston, wherein a mobile app called Street Bump tracks and maps potential potholes throughout the city in order to guide crews toward fixing them. The crowdsourcing for this initiative comes from two levels. One, the information gathered from the app helps city crews do their work more efficiently. Two, the app’s first iteration reported too many false positives, leading crews to places where no potholes existed. The city worked with a crowd drawn together through InnoCentive to improve the app and its efficiency, with the top suggestions coming from a hacker group, a mathematician, and a software engineer.
Corporations also use crowdsourcing to work with their big data. AOL needed help with cataloging the content on its hundreds of thousands of web pages, specifically the videos and their sources, and turned to crowdsourcing as a means to expedite and streamline the project’s costs. Between 2006 and 2010, Netflix, an online streaming and mail DVD distributor, sought help with perfecting its algorithm for predicting user ratings of films. The company developed a contest with a $1 million prize, and for the contest, it offered data sets consisting of multiple millions of units for analysis. The goal was to beat Netflix’s current algorithm by 10%, which one group achieved and took home the prize.
Not-for-profit groups also incorporate crowdsourcing as part of their initiatives. AARP Foundation, which works on behalf of older Americans, used crowdsourcing to tackle such issues as eliminating food insecurity and food deserts (areas where people do not have convenient or close access to grocery stores). Humanitarian Tracker crowdsources data from people “on the ground” about issues such as disease, human rights violations, and rape. Focusing particularly on Syria, Humanitarian Tracker aggregates these data into maps that show the impacts of systematic killings, civilian targeting, and other human tolls.
Not all crowdsourcing and big data projects originate within these organizations. For example, Galaxy Zoo demonstrates the expanses of both big data and crowds. The project asked people to classify a data set of one million galaxies into three categories: elliptical, merger, and spiral. By the project’s completion, 150,000 people had contributed 50 million classifications. The data feature multiple independent classifications as well, adding reliability. The largest crowdsourcing project involved searching satellite images for wreckage from Malaysia Airlines flight MH370, which went missing in March 2014. Millions of people searched for signs among the images made available by Colorado-based Digital Globe. The amount of crowdsourcing traffic even crashed websites.
Not all big data crowdsourced projects succeed, however. One example is the Google Flu tracker. The tracker included a map to show the disease’s spread throughout the season. It was later revealed that the tracker overestimated the expanse of the flu spreading, predicting twice as much as actually occurred.
In addition to their potentially not succeeding, another drawback to these projects is their overall management, which tends to be time-consuming and difficult. Several companies attempt to fulfill this role. InnoCentive and Kaggle use crowds to tackle challenges brought to them by industries, government, and nonprofit organizations. Kaggle in particular offers almost 150,000 data scientists – statisticians – to help companies develop more efficient predictive models, such as deciding the best order in which to show hotel rooms for a travel app or guessing which customers would leave an insurance company within a year. Both InnoCentive and Kaggle run their crowdsourcing activities as contests or competitions as these are often tasks that require a higher time and mental commitment than others.
Crowds bring wisdom to crowdsourced tasks on big data through their diversity of skills and knowledge. Determining the makeup of that crowd proves more challenging, but one study of
Mechanical Turk offers some interesting findings. It found that US females outnumber males by 2 to 1 and that many of the workers hold bachelor’s and even master’s degrees. Most live in small households of two or fewer people, and most use the crowdsourcing work to supplement their household incomes as opposed to being the primary source of income.
Crowd members choose the projects on which they want to work, and multiple factors contribute to their motivations for joining a project and staying with it. For some working on projects that offer no further incentive to participate, the project needs to align with their interests and experience so that they feel they can make a contribution. Others enjoy connecting with other people, engaging in problem-solving activities, seeking something new, learning more about the data at hand, or even developing a new skill. Some projects offer incentives such as prize money or top-contributor status. For some, entertainment motivates them to participate in that the tasks offer a diversion. For others, though, working on crowdsourced projects might be an addiction as well.
While crowdsourcing offers multiple benefits for the processing of big data, it also draws some criticism. A primary critique centers on the notion of labor, wherein the crowd contributes knowledge and skills for little-to-no pay, while the organization behind the data stands to gain much more financially. Some crowdsourcing sites offer low cash incentives for the crowd participants, and in doing so, they sidestep labor laws requiring minimum wage and other worker benefits. Opponents of this view cite that the labor involved frequently requires menial tasks and that those providing the labor face no obligation to complete the assigned tasks. They also cite that crowd participants engage the tasks because they enjoy doing so.
Ethical concerns come back to the types of crowdsourced big data projects and the intentions behind them, such as information gathering, surveillance, and information manipulation. With information manipulation, for example, crowd participants might create fake product reviews and ratings for various web sites, or they might crack anti-spam devices such as CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). Other activities involve risks and possible violations of other individuals, such as gathering large amounts of personal data for sale. Overall, the crowd participants remain unaware that they are engaging in unethical activities.

Cross-References

▶ Cell Phone Data
▶ Netflix
▶ Predictive Analytics

Further Reading

Brabham, D. C. (2013). Crowdsourcing. Cambridge, MA: MIT Press.
Howe, J. (2009). Crowdsourcing: Why the power of the crowd is driving the future of business. New York: Crown.
Nakatsu, R. T., Grossman, E. B., & Charalambos, L. I. (2014). A taxonomy of crowdsourcing based on task complexity. Journal of Information Science, 40(6), 823–834.
Shirky, C. (2009). Here comes everybody: The power of organizing without organizations. New York: Penguin.
Surowiecki, J. (2005). The wisdom of crowds. New York: Anchor.

Cultural Analytics

Tobias Blanke
Department of Digital Humanities, King’s College London, London, UK

Definition

Cultural analytics was originally introduced by Lev Manovich in 2007 in order to describe the use of “computational and visualization methods for the analysis of massive cultural data sets and flows” and “to question our basic cultural concepts and methods” (Software Studies Initiative 2014). Manovich was then especially concerned with “the exploration of large cultural data sets by
means of interactive and intuitive visual analysis techniques” (Yamaoka et al. 2011) and massive multimedia data sets (Manovich 2009).
In 2016, Manovich further elaborated that cultural analytics brings together the disciplines of digital humanities and social computing. “[W]e are interested in combining both in the study of cultures – focusing on the particular, interpretation, and the past from the humanities, while centering on the general, formal models, and predicting the future from the sciences” (Manovich 2016). Cultural analytics works with “historical artifacts” as well as “the study of society using social media and social phenomena specific to social networks.” It can thus be summarized as any kind of advanced computational technique to understand digital cultural expressions, as long as these reach a certain size.

Big Cultural Data

Big data is not limited to the sciences and large-scale enterprises. With more than seven billion people worldwide and counting, vast amounts of data are produced in social and cultural interactions. At the same time, we can look back onto several thousand years of human history that have delivered huge numbers of cultural records. Large-scale digitization efforts have recently begun to create digital surrogates for these records that are freely available online. In the USA, the HathiTrust published research data extracted from over 4,800,000 volumes of digitized books (containing 1.8 billion pages) – including parts of the Google Books corpus and the Internet Archive. The European Union continues to be committed to digitizing and presenting its cultural heritage online. At the time of writing, its cultural heritage aggregator Europeana has made available over 60 million digital objects (Heath 2014).
Nature covered the topic of big cultural data already in 2010 (Hand 2011) and compared typical data sets used in cultural research with those in the sciences that can be considered big. While data sets from the Large Hadron Collider are still by far the largest around, cultural data sets can easily compare to other examples of big sciences. The Sloan Digital Sky Survey, for instance, had brought together about 100 TB of astronomical observations by the end of 2010. This is big data, but not as big as some cultural heritage data sets. The Holocaust Survivor Testimonials’ Collections by the Shoah Foundation contained 200 TB of data in 2010. The Google Books corpus had hundreds of millions of books. Another typical digitization project, the Taiwanese TELDAP archive of Chinese and Taiwanese heritage objects, had over 250 TB of digitized content in 2011 (Digital Taiwan 2011).
Book corpora like the Google Books project or historical testimonials such as the Holocaust Survivor Testimonials are the primary type of data associated with digital culture. Quantitative work with these has been popularized in the work on a “Quantitative Analysis of Culture” by Michel et al. (2011), summarizing the big trends of human thoughts with the Google Ngram corpus and thus moving to corpora that cannot be read by humans alone, because they are too large. From a scholarly point of view, Franco Moretti (2000, 2005) has pioneered quantitative methods to study literature and advocates “distant reading” and “a unified theory of plot and style” (Schulz 2011). The methods Moretti uses, such as social network analysis, have been employed by social scientists for a long time but hardly in the study of culture. An exception is the work by Schich et al. (2014) to develop a network framework of cultural history of the lives of over 100,000 historical individuals. Other examples of new quantitative methods for the large-scale study of digital culture include genre detection in literature. Underwood et al. (2013) demonstrated how genres can be identified in the HathiTrust Digital Library corpus in order to “trace the changing proportions of first- and third-person narration,” while computational stylistics is able to discover differences in the particular choices made by individuals and groups using languages (Eder 2016). This has now become a fast-developing new field, brought together in the recently launched Journal of Cultural Analytics (http://culturalanalytics.org/).
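The flavor of this kind of quantitative reading can be conveyed in a few lines of Python. The sketch below, using invented texts and years, computes the relative frequency of a word per year across a corpus of dated documents, which is essentially what n-gram style trend lines plot.

from collections import defaultdict

def relative_frequency(docs, word):
    """Share of all tokens in each year that match the given word."""
    hits, totals = defaultdict(int), defaultdict(int)
    for year, text in docs:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals)}

docs = [(1900, "the machine age and the factory"),
        (1950, "the computer and the machine"),
        (2000, "the network the data the computer")]
print(relative_frequency(docs, "computer"))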
Next to such digitized cultural sources, cultural analytics, however, also works with the new digital materials that can be used to capture
contemporary culture. At its eighth birthday in 2013, the YouTube video-sharing site proudly announced that “more than 100 hours of video are uploaded to YouTube every minute” (YouTube 2013), while the site has been visited by over one billion users, many of whom have added substantial cultural artifacts to the videos such as annotations or comments. Facebook adds 350 million photos to its site every day. All these are also cultural artifacts and are already now subject to research on contemporary culture. It was against the background of this new cultural material that Manovich formulated his idea of cultural analytics, which is interested not just in great historical individuals but in “everything created by everybody” (Manovich 2016).

Commercial Implications

Because cultural analytics is interested in everything created by everybody, it rushes to work with new cultural materials in social media. It does so not solely because of the new kinds of research in digital culture but also because of new economic value from digital culture. Social media business models are often based on working with cultural analytics using social computing as well as digital humanities. The almost real-time gathering of opinions (Pang and Lee 2008) to read the state of mind of organizations, consumers, politicians, and other opinion makers continues to excite businesses around the world. Twitter, YouTube, etc. now appear to be the “echo chamber of people’s opinions” (Van Dijck and Poell 2013, p. 9). Companies, policy researchers, and many others have always depended on being able to track what people believe and think. Social media cultural artifacts can signal the performance of stocks by offering insights on the emotional state of those involved with companies. In this way, they allow for measuring the culture of a company, of groups, and of individuals and have led to a new research area called “social sensing” (Helbing 2011).
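In its simplest form, the opinion mining mentioned here is often approximated by counting words from positive and negative lexicons. The Python sketch below is such a bare-bones scorer, with a tiny invented lexicon, and is meant only to convey the idea rather than any production approach.

POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text):
    """Return (positive - negative) word counts as a crude opinion signal."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

posts = ["I love this product it is excellent", "terrible service really bad"]
print([sentiment_score(p) for p in posts])  # e.g., [2, -2]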
Cultural analytics has shown, for instance, that online political cultures are not that different from the real world (Rainie 2014). Twitter’s political culture is polarized around strong political ties and works often as an echo chamber for one’s already formed political opinions. Twitter users of the same political persuasion cluster together in fragmented groups and do not take input from the outside. This kind of cultural analytics is critical toward analyzing the democratizing effect of social media. Just because there is communication happening, this communication does not necessarily lead to political exchange across traditional boundaries.
Big money is currently flowing into building cultural analytics engines that help understand users’ preferences, social media likes, etc. This money is effectively spent on analyzing and monetizing our digital culture, which has become the currency with which we pay for online services. John Naughton has pointed out that we “pay” for all those free online services nowadays in a “different currency, namely your personal data” (Naughton 2013). Because this personal data is never just personal but involves others, he could have also said that we pay with our digital culture. Cultural analytics algorithms decide how we should fill our shopping basket, which political groups we should join on Facebook, etc. However, the data used in these commercial cultural analytics and produced by it is seldom open to those who produce it (Pybus et al. 2015). The companies and organizations like Google, who collect this data, also own the underlying cultural data and its life cycle and therefore our own construction of our digital identity.
The cultural analytics done by the large Internet intermediaries such as Google and Facebook means that cultural expressions are quantified and analyzed for their commercial value – in order to market and sell more products. However, not everything can be quantified, and a critical approach to cultural analytics also needs to understand what is lost in such quantification and what its limits are. We have already discussed, e.g., how Twitter’s organization around counting retweets and followers makes its users encounter mainly more of the same in political interactions. Understanding the limits of cultural analytics will be a major role for the study of digital culture in the future that will also need to identify how
opposition against such commercial cultural analytics practices can be formulated and practiced.

Critique of Cultural Analytics

The study of digital culture needs to remain critical of what is possible when algorithms are used to understand culture. It has already begun to do so in its critical analysis of the emerging tools and methods of cultural analytics. A good example is the reaction to the already discussed Google Ngram viewer (http://books.google.com/ngrams), which enables public access to the cultural analytics of millions of words in books. Erez Aiden and Jean-Baptiste Michel (pioneers of the Google Ngram Viewer) go as far as to promise a “new lens on human culture” and a transformation of the scientific disciplines concerned with observing it. The Ngram Viewer’s “consequences will transform how we look at ourselves (...). Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower” (Aiden and Michel 2013).
This kind of enthusiasm is at least a little surprising, because it is not easy to find the exciting new research that the Ngram viewer has made possible. Pechenick et al. (2015) have demonstrated the “strong limits to inferences of socio-cultural and linguistic evolution” the Ngram viewer allows because of severe shortages in the underlying data. Researchers also complain about the “trivial” results that research with the Ngram viewer delivers (Kirsch 2014). No cultural historian needs the Ngram viewer to understand that religion is in retreat in the nineteenth century. This does not mean that the Ngram viewer cannot produce new kinds of evidence or even new insights, but this needs to be carefully examined and involve the critical interpretation of primary and secondary sources using traditional approaches next to cultural analytics.
While the Ngram viewer is an interesting tool for research and education, it is exaggerated to claim that it is already changing cultural research in a significant way. Against Moretti and the Google Ngram efforts, for some researchers of culture, we should rather be interested in what “big data will never explain,” as Leon Wieseltier has put it:

In the riot of words and numbers in which we live so smartly and so articulately, in the comprehensively quantified existence in which we presume to believe that eventually we will know everything, in the expanding universe of prediction in which hope and longing will come to seem obsolete and merely ignorant, we are renouncing some of the primary human experiences. (Wieseltier 2013)

Leon Wieseltier has emerged as one of the strongest opponents of cultural analytics: “As even some partisans of big data have noted, the massive identification of regularities and irregularities can speak to ‘what’ but not to ‘why’: they cannot recognize causes and reasons, which are essential elements of humanistic research” (Wieseltier 2013). Responding to such criticisms, Manovich argues that cultural analytics should therefore not just focus on large trends but also on individuals, true to its humanistic foundations. “[W]e may combine the concern of social science, and sciences in general, with the general and the regular, and the concern of humanities with individual and particular” (Manovich 2016).

Conclusions

While the criticisms by Wieseltier and others should be taken seriously and emphasize important limitations and the dangers of a wrong kind of cultural analytics, the study of culture also needs to acknowledge that many of the digital practices associated with cultural analytics show promise. More than 10 years ago, Google and others revolutionized information access, while Facebook has allowed for new kinds of cultural connections since the early 2000s. This kind of effort has made us understand that there are many more books available than any single person in the world can read in a lifetime and that computers can help us with this information overload and stay on top of the analysis. It is the foremost task of cultural analytics to understand better how we can use new digital tools and techniques in cultural research, which includes understanding the boundaries and what they cannot explain.
Further Reading

Aiden, E., & Michel, J.-B. (2013). Uncharted: Big data as a lens on human culture. New York: Penguin.
Digital Taiwan. (2011). NDAP international conference. http://culture.teldap.tw/culture/index.php?option=com_content&view=article&id=23:ndap-international-conference-&catid=1:events&Itemid=215. Accessed 2 July 2016.
Eder, M. (2016). Rolling stylometry. Digital Scholarship in the Humanities, 31(3), 457–469.
Hand, E. (2011). Culturomics: Word play. Nature, 474(7352), 436–440.
Heath, P. (2014). Europe’s cultural heritage online. https://epthinktank.eu/2014/04/09/europes-cultural-heritage-online/. Accessed 2 July 2016.
Helbing, D. (2011). FuturICT – A knowledge accelerator to explore and manage our future in a strongly connected world. arXiv preprint arXiv:1108.6131.
Kirsch, A. (2014). Technology is taking over English departments. https://newrepublic.com/article/117428/limits-digital-humanities-adam-kirsch. Accessed 2 July 2016.
Manovich, L. (2009). Cultural analytics: Visualizing cultural patterns in the era of more media. Domus, 923.
Manovich, L. (2016). The science of culture? Social computing, digital humanities and cultural analytics. Cultural Analytics, 1(1).
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J., & Orwant, J. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176–182.
Moretti, F. (2000). Conjectures on world literature. New Left Review, 1, 54–68.
Moretti, F. (2005). Graphs, maps, trees: Abstract models for a literary history. London: Verso.
Naughton, J. (2013). To the internet giants, you’re not a customer. You’re just another user. The Guardian. http://www.theguardian.com/technology/2013/jun/09/internet-giants-just-another-customer. Accessed 2 July 2016.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.
Pechenick, E. A., Danforth, C. M., & Dodds, P. S. (2015). Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PloS One, 10(10), e0137041.
Pybus, J., Coté, M., & Blanke, T. (2015). Hacking the social life of big data. Big Data & Society, 2(2).
Rainie, L. (2014). The six types of twitter conversations. Pew Research Center. http://www.pewresearch.org/fact-tank/2014/02/20/the-six-types-of-twitter-conversations/. Accessed 2 July 2016.
Schich, M., Song, C., Ahn, Y.-Y., Mirsky, A., Martino, M., Barabási, A.-L., & Helbing, D. (2014). A network framework of cultural history. Science, 345(6196), 558–562.
Schulz, K. (2011). What is distant reading? http://www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant-reading.html?_r=0. Accessed 2 July 2016.
Software Studies Initiative. (2014). Cultural analytics. http://lab.softwarestudies.com/p/cultural-analytics.html. Accessed 2 July 2016.
Underwood, T., Black, M. L., Auvil, L., & Capitanu, B. (2013). Mapping mutable genres in structurally complex volumes. 2013 IEEE International Conference on Big Data. Washington, DC: IEEE.
Van Dijck, J., & Poell, T. (2013). Understanding social media logic. Media and Communication, 1(1), 2–14.
Wieseltier, L. (2013). What big data will never explain. New Republic. http://www.newrepublic.com/article/112734/what-big-data-will-never-explain. Accessed 2 July 2016.
Yamaoka, S., Manovich, L., Douglass, J., & Kuester, F. (2011). Cultural analytics in large-scale visualization environments. Computer, 44(12), 39–48.
YouTube. (2013). Here’s to eight great years. From http://youtube-global.blogspot.co.uk/2013/05/heres-to-eight-great-years.html. Accessed 2 July 2016.

Curriculum, Higher Education, and Social Sciences

Stephen T. Schroth
Department of Early Childhood Education, Towson University, Baltimore, MD, USA

Big data, which has revolutionized many practices in business, government, healthcare, and other fields, promises to radically change the curriculum offered in many of the social sciences. Big data involves the capture, collection, storage, collation, search, sharing, analysis, and visualization of enormous data sets so that this information may be used to spot trends, to prevent problems, and to proactively engage in activities that make success more likely. The social sciences, which include fields as disparate as anthropology, economics, education, political science, psychology, and sociology, are a heterogeneous area, and the tools of big data are being embraced differently within each. The economic demands of setting up systems that permit the use of big data in higher education have also hindered some efforts to use these processes, as these institutions often lack the infrastructure necessary to proceed with such efforts. Opponents of the trend toward using big
data tools for social science analyses often stress that while these tools may prove helpful for certain analyses, it is also crucial for students to receive training in more traditional methods. As equipment and training concerns are overcome, however, the use of big data in social science departments at colleges and universities seems likely to increase.

Background

A variety of organizations, including government agencies, businesses, colleges, universities, schools, hospitals, research centers, and others, collect data regarding their operations, clients, students, patients, and findings. Disciplines within the social sciences, which are focused upon society and the relationships among individuals within a society, often use such data to inform related studies. Such a volume of data has been generated, however, that many social scientists have found it impossible to use it in their work in a meaningful manner. The emergence of computers and other electronic forms of data storage resulted in more data than ever before being collected, especially during the last two decades of the twentieth century. This data was generally stored in separate databases, which made data from different sources inaccessible to most social science users. As a result, much of the information that could potentially be obtained from such sources was not used.

Over the past decade and a half, many businesses became increasingly interested in making use of data they had collected but did not use regarding customers, processes, sales, and other matters. Big data came to be seen as a way of organizing and using these numerous sources of information in ways that could benefit organizations and individuals. Infonomics, the study of how information can be used for economic gain, grew in importance as companies and organizations worked to make better use of the information they possessed, with the end goal of using it in ways that increased profitability. A variety of consulting firms and other organizations began working with large corporations and organizations in an effort to accomplish this. They defined big data as consisting of three "Vs": volume, variety, and velocity.

Volume, as used in this context, refers to the increase in data volume caused by technological innovation. This includes transaction-based data that has been gathered by corporations and organizations over time but also unstructured data derived from social media and other sources, as well as increasing amounts of sensor and machine-to-machine data. For years, excessive data volume was a storage issue, as the cost of keeping much of this information was prohibitive. As storage costs have decreased, however, cost has diminished as a concern. Today, how best to determine relevance within large volumes of data and how best to analyze data to create value have emerged as the primary issues facing those wishing to use it.

Velocity refers to the speed at which data streams in, which raises the issue of how best to deal with it in an appropriate way. Technological developments, such as sensors and smart meters, and client and patient needs emphasize the necessity of overseeing and handling inundations of data in near real time. Responding to data velocity in a timely manner represents an ongoing struggle for most corporations and other organizations. Variety in the formats in which data comes to organizations presents a problem for many. Data today includes structured numeric forms stored in traditional databases but has grown to include information created from business applications, e-mails, text documents, audio, video, financial transactions, and a host of others. Many corporations and organizations struggle with governing, managing, and merging these different forms of data.

Some have added two additional criteria to these: variability and complexity. Variability concerns the potential inconsistency that data can demonstrate at times, which can be problematic for those who analyze the data and can hamper the process of managing and handling it. Complexity refers to the intricate process that data management involves, in particular when large volumes of data come from multiple and disparate sources. For analysts and other users to fully understand the information contained in these data, the data must first be connected, correlated, and linked in a way that helps users make sense of them.
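To make the linking step concrete, the following is a minimal, hypothetical sketch in Python using pandas; the file names and column names are invented for illustration and do not come from the entry itself.

```python
# Hypothetical sketch: linking records from two separately collected sources.
# File names and column names below are invented for illustration only.
import pandas as pd

# Administrative records and survey responses gathered by different offices
admissions = pd.read_csv("admissions_records.csv")   # e.g., student_id, program, year
survey = pd.read_csv("graduate_survey.csv")          # e.g., student_id, employment_status

# Connect the two sources on a shared identifier so they can be analyzed together
linked = admissions.merge(survey, on="student_id", how="inner")

# A simple cross-tabulation becomes possible only once the sources are linked
print(pd.crosstab(linked["program"], linked["employment_status"]))
```

The point of the sketch is only that disparate sources need a common key and an explicit join before any joint analysis is meaningful.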
Big Data Comes to the Social Sciences

Colleges, universities, and other research centers have tracked the efforts of the business world to use big data in ways that help shape organizational decisions and increase profitability. Many working in the social sciences were intrigued by this process, as they saw it as a useful tool for their own research. The typical program in these areas, however, did not provide students, whether at the undergraduate or graduate level, the training necessary to engage in big data research projects. As a result, many programs in the social sciences have altered their curriculum in an effort to ensure that researchers will be able to carry out such work. For many programs across the social sciences that have pursued curricular changes to enable students to engage in big data research, these changes have resulted in more coursework in statistics, networking, programming, analytics, database management, and other related areas. As many programs already required a substantial number of courses in other areas, the drive toward big data competency has required many departments to reexamine the work required of their students.

This move toward more coursework that supports big data has not been without its critics. Some have suggested that changes in curricular offerings have come at a high cost, with students now being able to perform certain operations involved in handling data but unable to competently perform other tasks, such as establishing a representative sample or composing a valid survey. These critics also suggest that while big data analysis has been praised for offering tremendous promise, in truth the analysis performed is shallow, especially when compared to that done with smaller data sets. Indeed, representative sampling would negate the need for, and expense of, many big data projects. Such critics suggest that increased emphasis in the curriculum should focus on finding quality, rather than big, data sources and that efforts to train students to load, transform, and extract data are sublimating other more important skills.

Despite these criticisms, changes to the social sciences curriculum are occurring at many institutions. Many programs now require students to engage in work that examines the practices and paradigms of data science, providing a grounding in the core concepts of data science, analytics, and data management. Work in algorithms and modeling, which provides proficiency in basic statistics, classification, cluster analysis, data mining, decision trees, experimental design, forecasting, linear algebra, linear and logistic regression, market basket analysis, predictive modeling, sampling, text analytics, summarization, time series analysis, unsupervised learning, and constrained optimization, is also an area of emphasis in many programs. Students require exposure to tools and platforms, which provides proficiency in the modeling, development, and visualization tools to be used on big data projects, as well as knowledge about the platforms used for the execution, governance, integration, and storage of big data. Finally, many programs emphasize work with applications and outcomes, covering the primary applications of data science to one's field and how data science interacts with disciplinary issues and concerns.

Some programs have embraced big data tools but suggested that not every student needs mastery of them. Instead, these programs have suggested that big data has emerged as a field of its own and that certain students should be trained in these skills so that they can work with others within the discipline to provide support for those projects that require big data analysis. This approach offers more incremental changes to the social science curricular offerings, as it would require fewer changes for most students yet still enable departments to produce scholars who are equipped to engage in research projects involving big data.
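As one concrete illustration of the kind of modeling exercise named in the coursework above (classification, logistic regression, and the like), the following is a minimal sketch in Python with scikit-learn, using a toy dataset bundled with the library; it is not drawn from any particular program's syllabus.

```python
# Illustrative only: a small classification exercise of the sort such coursework assigns.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200)   # logistic regression classifier
model.fit(X_train, y_train)                # fit on the training split
print("held-out accuracy:", model.score(X_test, y_test))
```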
Cross-References

▶ Big Data Quality
▶ Correlation Versus Causation
▶ Curriculum, Higher Education, and Social Sciences
▶ Curriculum, Higher Education, Humanities
▶ Education

Further Reading

Foreman, J. W. (2013). Data smart: Using data science to transform information into insight. Hoboken: Wiley.
Lane, J. E., & Zimpher, N. L. (2014). Building a smarter university: Big data, innovation, and analytics. Albany: The State University of New York Press.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data. New York: Mariner Books.
Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley.

Curriculum, Higher Education, Humanities

Ulrich Tiedau
Centre for Digital Humanities, University College London, London, UK

Introduction

As a relatively new field, there is no generally accepted standard or reference curriculum for big data in the Humanities yet. This entry highlights some of the main strands and common themes that seem to be emerging and provides pointers to further resources.

While quantitative methods, often borrowed from the Social Sciences, especially statistics and content analysis, and corresponding software packages (e.g., SAS, SPSS, STATA), have been part of the curriculum of more social science–orientated subjects such as social and economic history for decades, introductions to big data analysis in literary studies, English, and modern foreign languages programs, let alone in other Humanities subjects, have been few and far between until fairly recently. A notable exception in this respect is Classical Studies, in which the availability of a limited, well-defined, and comparatively small corpus of ancient texts (in a pre-WWW age fitting on one or a couple of CD-ROMs, e.g., the Perseus Digital Library) has lent itself to some form of corpus analytics; and historical subjects with similarly well-described limited corpora, e.g., medieval studies, plus of course corpus linguistics itself, in which computational methods have long figured prominently.

Having said that, the size of corpora available since the introduction, success, and exponential growth of the World Wide Web, e.g., Google Books with its 25 million and growing number of digitized books (2015), outstrips the size of previously available corpora by several orders of magnitude, so that "big data" can here, too, be seen as a fairly recent development.

Types of Courses

Digital approaches to the Humanities are taught in departments across the whole spectrum of the Humanities, with subjects such as English and History leading in terms of numbers; followed by Media Studies and Library, Archive, and Information Studies; dedicated Digital Humanities programs and interdisciplinary programs, e.g., liberal arts and sciences degrees; not to forget, on the boundaries of engineering and the Humanities, Computer Science (Spiro 2011). Especially in the USA, English departments seem to have taken the lead in the Digital Humanities (Kirschenbaum 2010), whereas History also has a long and distinctive tradition in Digital History (Cohen and Rosenzweig 2005), leading to recent discussions whether or not Digital History can be considered a separate development from Digital Humanities or an integral part of it (e.g., Robertson 2014; McGinnis 2014). There are also general methodological courses aimed at students
of all Humanities subjects at a great number of institutions.

While most of these courses are self-contained and usually optional modules in their respective curricula, dedicated Digital Humanities programs provide systematic introduction. Since the mid-2000s, these specialist degree courses, some of which focus more on the cultural side of Digital Humanities, whereas others have a pronounced emphasis on technology, have been rapidly emerging at higher education institutions all over the world; Melissa Terras's visualization of the spread of Digital Humanities (2012), for example, counts 114 physical DH centers in 24 countries in 2012. Established postgraduate degree programs exist at places like King's College London, Loyola University Chicago, the University of Alberta, University College London, and University College Cork, Ireland (Gold 2011; cf. Centernet.org for a full list).

Common Themes

In a first analysis of 134 curricula of DH courses, Lisa Spiro (2011) observes three common general characteristics: firstly, that most courses make a connection between theory and practice by explicitly or implicitly employing a project-based learning (PBL) approach, requiring students to get involved in hands-on or practice learning by building digital artifacts. This reflects the double nature of Digital Humanities, that it is as much about building websites, databases, or demonstrators as about analyzing and interpreting them, a theory-practice dichotomy that only at first sight seems to be new, as Kathleen Fitzpatrick (2011) has pointed out, as it exists in other areas of the Humanities as well, e.g., in the separation of Creative Arts from Art History, or of Creative Writing from Literary Analysis. DH in this respect overarches the divide.

Secondly, in line with DH research culture, DH courses not only teach applying tools, methods, and technology but also group work and collaboration, an innovation just as transformative to traditional Humanities research culture, with its "lone scholar ideal," as the use of computational methods. Often this aspect of the curriculum also includes project management, thus training key skills that are also relevant in other contexts (cf. Mahony et al. 2012). And thirdly, again in line with DH culture, open practice and digital scholarship figure prominently, frequently requiring students to practice in the open, keeping learning journals, using blogs, wikis, social media like Twitter, etc.

In terms of technologies taught, most digital courses in the Humanities, traditionally predominantly text-based subjects, unsurprisingly focus on text analysis and text encoding. XML and TEI are the most frequently taught technologies here, and plenty of free teaching resources are available (e.g., TEI by Example, http://www.teibyexample.org; DHOER, http://www.ucl.ac.uk/dhoer). As tools for analysis and visualization, Google's n-gram viewer (http://books.google.com/ngrams) and Voyant Tools (http://www.voyant-tools.org) are popular, both due to their ease of use and their not requiring knowledge of any coding. Besides text encoding and analytics, processing of sound and still and moving images, databases and networks, simulations and games, maps and geospatial information systems (GIS), and data visualization are also part of course syllabi (Spiro 2011).

A major debate seems to be about whether or not the curriculum needs to include coding, in other words whether you can pursue Digital Humanities without being able to program. While there are certainly arguments in favor of coding, many modern tools, some of them specifically designed with a teaching and learning aspect in mind (e.g., Omeka for the presentation of digital artifacts (http://www.omeka.org), Neatline for geotemporal visualization of Humanities data (http://www.neatline.org)), do not require any coding skills at all. Neither do the frequently used Google n-gram viewer (http://books.google.com/ngrams) for basic and Voyant Tools (http://www.voyant-tools.org) for more advanced text-mining and textual analytics.
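As a minimal illustration of what such basic text analysis involves, the following Python sketch counts word and bigram frequencies in a short sample sentence; it is an invented example and is not taken from any of the tools or syllabi discussed here.

```python
# A minimal sketch of basic text analysis (word and bigram counts) in Python;
# the sample sentence is invented for illustration.
from collections import Counter
import re

text = "To be, or not to be, that is the question."
words = re.findall(r"[a-z']+", text.lower())   # crude tokenization
bigrams = list(zip(words, words[1:]))          # adjacent word pairs (2-grams)

print(Counter(words).most_common(3))    # most frequent words
print(Counter(bigrams).most_common(3))  # most frequent bigrams
```

Dedicated teaching texts extend exactly this kind of exercise to whole corpora and to the languages discussed below.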
On the other hand, straightforward textbooks and online introductions to computer-assisted text analysis using the programming and scripting languages R, Python, PHP (MySQL), SPARQL (RDF), and others are available, specifically directed at Humanities scholars wishing to acquire the necessary coding skills, whether in the classroom or in self-study (e.g., Jockers 2013; Hirsch 2012).

Further Resources

The largest collection of Digital Humanities course syllabi is openly available via Lisa Spiro's Digital Humanities Education Zotero Group (https://www.zotero.org/groups/digital_humanities_education); for selections of particularly relevant and recent ones, also see Gold 2012 and Hancher 2014. An important source and discussion platform for pedagogical and curricular information is the website of the Humanities, Arts, Science and Technology Alliance and Collaboratory (HASTAC), a virtual organization of more than 12,000 individuals and institutions dedicated to innovative new modes of learning and research in higher education (http://www.hastac.org).

Cross-References

▶ Big Humanities Project
▶ Humanities (Digital Humanities)

Further Reading

Cohen, D., & Rosenzweig, R. (2005). Digital history: A guide to gathering, preserving and presenting the past on the web. Philadelphia: University of Pennsylvania Press.
Fitzpatrick, K. The humanities done digitally. The chronicle of higher education. 8 May 2011. http://chronicle.com/article/The-Humanities-Done-Digitally/127382/. Accessed Aug 2014.
Gold, M. K. Digital humanities syllabi (6 June 2011). http://cunydhi.commons.gc.cuny.edu/2011/06/06/digital-humanities-syllabi/. Accessed Aug 2014.
Gold, M. (Ed.). (2012). Debates in the Digital Humanities. Minneapolis: Minnesota University Press.
Hancher, M. Recent digital humanities syllabi (18 January 2014). http://blog.lib.umn.edu/mh/dh2/2014/01/recent-digital-humanities-syllabi.html. Accessed Aug 2014.
Hirsch, B. D. (Ed.). (2012). Digital humanities pedagogy: Practices, principles and politics. Cambridge: Open Book Publishers.
Jockers, M. L. (2013). Macroanalysis: Digital methods and literary history. Urbana: University of Illinois Press.
Jockers, M. L. (2014). Text analysis with R for students of literature. Heidelberg/New York: Springer.
Kirschenbaum, M. G. (2010). What is digital humanities and what's it doing in English departments? ADE Bulletin, (150), 1–7. https://doi.org/10.1632/ade.150.55.
Mahony, S., Tiedau, U., & Sirmons, I. (2012). Open access and online teaching materials in digital humanities. In C. Warwick, M. Terras, & J. Nyhan (Eds.), Digital humanities in practice (pp. 168–191). London: Facet.
McGinnis, P. (2014). DH vs. DH, and Moretti's war. http://majining.com/?p=417. Accessed Aug 2014.
Robertson, S. The differences between digital history and digital humanities. http://drstephenrobertson.com/blog-post/the-differences-between-digital-history-and-digital-humanities/. Accessed Aug 2014.
Spiro, L. (2011). Knowing and doing: Understanding the digital humanities curriculum. June 2011. http://digitalscholarship.files.wordpress.com/2011/06/spirodheducationpresentation2011-4.pdf. Accessed Aug 2014.
Spiro, L. (2014). Shaping (digital) scholars: Design principles for digital pedagogy. https://digitalscholarship.files.wordpress.com/2014/08/spirodigitalpedagogyutsc2014.pdf.
Spiro, L. Digital Humanities Education Zotero Group. https://www.zotero.org/groups/digital_humanities_education. Accessed Aug 2014.

Cyber Espionage

David Freet1 and Rajeev Agrawal2
1Eastern Kentucky University, Southern Illinois University, Edwardsville, IL, USA
2Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, USA

Introduction

Cyber espionage or cyber spying is the act of obtaining personal, sensitive, or proprietary
information from individuals without their knowledge or consent. In an increasingly transparent and technological society, the ability to control the private information an individual reveals on the Internet and the ability of others to access that information are a growing concern. This includes storage and retrieval of e-mail by third parties, social media, search engines, data mining, GPS tracking, the explosion of smartphone usage, and many other technology considerations. In the age of big data, there is growing concern for privacy issues surrounding the storage and misuse of personal data and the non-consensual mining of private information by companies, criminals, and governments.

Concerning the growing threat of cyber espionage in the big data world, Sigholm and Bang write that unlike traditional crimes, companies cannot call the police and expect them to pursue cyber criminals. Affected organizations play a leading role in each and every investigation because it is their systems and data that are being stolen or leveraged. The fight against cybercrime must be waged on a collective basis, regardless of whether the criminal is a rogue hacker or a nation-state (Sigholm and Bang 2013).

In 1968, the US government passed the Omnibus Crime Control and Safe Streets Act, which included a wiretapping law that became commonly known as the Wiretap Act (Burgunder 2011, p. 462). This law made it illegal for any person to willfully use an electronic or mechanical device to intercept an oral communication unless prior consent was given or the interception occurred during the ordinary course of business. In 1986, Congress passed the Electronic Communications Privacy Act (ECPA), which amended the original Wiretap Act and also introduced the Stored Communications Act (SCA), which primarily prevents outsiders from hacking into facilities that are used to store electronic communications. These pieces of legislation form the cornerstone for defining protections against cyber espionage in the age of big data and social media.

In contrast to US privacy laws, the European Union (EU) has adopted significant legislation governing the collection and processing of personal information. This ensures that personal data is processed within acceptable privacy limits and that informed consent is present. Compared to the EU, the USA has relatively few laws that enforce information privacy. The USA has mostly relied on industry guidelines and practices to ensure the privacy of personal information, and the most significant feature of EU regulations in relation to the USA has been the prohibition of the transfer of personal data to countries outside the EU that do not guarantee an adequate level of protection (Burgunder 2011, p. 478). EU law affirms that the collection, storage, or disclosure of information relating to private life interferes with the right to private life and therefore requires justification.

Ironically, as we become a more technologically dependent society with increased public surveillance, data mining, transparency of private information, and social media, we come to expect less privacy and are consequently entitled to less of it (Turley 2011). This leads to the difficult question of how much privacy we as individuals can "reasonably" expect. Recently, the Jones v. United States case challenged existing privacy expectations over police surveillance using a GPS device that monitored the suspect's location. As the normalcy of warrantless surveillance increases, our expectations fall, allowing this type of surveillance to become more "common." This results in a move toward limitless police powers. These declining expectations are at the heart of the Obama administration's argument in this case, where it affirms that the government is free to track citizens without warrants because citizens "expect to be monitored" (Turley 2011).

Vulnerable Technologies

Smartphones have become a common fixture of daily life, and the enormous amount of personal data stored on these devices has led to unforeseen difficulties with the interpretation of laws meant to protect privacy. Current legislation has actually focused on defining a smartphone as an extension of an individual's home for the sake of protecting sensitive information. In 2009, the Supreme Court of Ohio issued the most clear-cut case, which held
that the search of a smartphone incident to an arrest is unreasonable under the Fourth Amendment. The court held in State v. Smith that because a smartphone allows for high-speed Internet access and is capable of storing "tremendous amounts of private data," it is unlike other containers for the purposes of Fourth Amendment analysis. Because of this large amount of personal information, its user has a "high expectation" of privacy. In short, the state may confiscate the phone in order to collect and preserve evidence but must then obtain a warrant before intruding into the phone's contents (Swingle 2012, p. 37). Smartphones have evolved into intimate collections of our most personal data and no longer just lists of phone numbers, the type of data that has traditionally been kept in the privacy of our homes and not in our pockets.

The phenomenon of social media has raised a host of security and privacy issues that never existed before. The vast amount of personal information displayed and stored on sites such as Facebook, Snapchat, MySpace, and Google makes it possible to piece together a composite picture of users in a way never before possible. Social networking sites offer various levels of privacy to users. Sites such as Facebook encourage users to provide real names and other personal information in order to develop a profile that is then available to the public. Many online dating sites allow people to remain anonymous and in more control of their personal data. This voluntary divulgence of so much personal information plays into the debate over what kind of privacy we can "reasonably expect."

In 2003, an individual named Kathleen Romano fell off an allegedly defective desk chair while at work. Romano claimed she sustained serious permanent injuries involving multiple surgeries and sued Steelcase Inc., the manufacturer of the chair. Steelcase refuted the suit, saying Romano's claims of being confined to her house and bed were unsubstantiated based on public postings on her Facebook and MySpace profiles, which showed her engaged in travel and other rigorous physical activities (Walder 2010). When Steelcase attempted to procure these pictures as evidence, Romano opposed the motion, claiming she "possessed a reasonable expectation of privacy in her home computer" (Walder 2010). Facebook also opposed releasing Romano's profile information without her consent because it violated the federal Stored Communications Act. Acting Justice Jeffrey Arlen Spinner of New York's Suffolk County Supreme Court rejected Romano's argument that the release of information would violate her Fourth Amendment right to privacy. Spinner wrote, "When Plaintiff created her Facebook and MySpace accounts she consented to the fact that her personal information would be shared with others, notwithstanding her privacy settings. Indeed, that is the very nature and purpose of these social networking sites or they would cease to exist" (Walder 2010). The judge ruled that "In light of the fact that the public portions of Plaintiff's social networking sites contain material that is contrary to her claims and deposition testimony, there is a reasonable likelihood that the private portions of her sites may contain further evidence such as information with regard to her activities and enjoyment of life, all of which are material and relevant to the defense of this action" (Walder 2010). With social media, individuals understand that portions of their personal information may be observed by others, but most people do not contemplate a comprehensive mapping of their lives over a span of weeks or months. Yet this is exactly what happens with social media when we voluntarily submit the most personal details of our lives to public scrutiny.

In the Jones case, Supreme Court Justice Sotomayor suggested that the Court's rulings that a person "has no reasonable expectation of privacy in information voluntarily disclosed to third parties" were "ill suited to the digital age" (Liptak 2012). She wrote, "People disclose the phone numbers that they dial or text to their cellular providers; the URLs that they visit and the e-mail addresses with which they correspond to their Internet service providers; and the books, groceries, and medications they purchase to online retailers. I for one doubt that people would accept without complaint the warrantless
disclosure to the government of a list of every web site they had visited in the last week, month, or year" (Liptak 2012). Clearly a fine line of legality exists between information we voluntarily divulge to the public and parts of that information which we actively seek to protect. Regardless of the argument as to whether the information can "legally" be used, we must understand that in the current digital age there is a very small "reasonable expectation" of information privacy.

Electronic mail has completely transformed the way in which we communicate with each other in the digital age and provided a vast amount of big data for organizations that store e-mail communications. This presents an enormous challenge to laws that were written to govern and protect our use of paper documents. For example, electronic records can be stored indefinitely and retrieved from electronic storage in a variety of locations. Through the use of technologies such as "keystroke loggers," it is possible to read the contents of an e-mail regardless of whether it is ever sent. These technologies introduce a wide range of ethical issues regarding how they are used to protect or violate our personal privacy. Momentum has been building for Congress to amend the ECPA so that the law protects reasonable expectations of privacy in e-mail messages. There is a good probability that some techniques, such as keystroke loggers used by employers to monitor e-mail, violate the Wiretap Act. As employees increasingly communicate through instant messaging and social networks, the interception of these forms of communication also falls into question (Burgunder 2011, p. 465).

Privacy Laws

The fundamental purpose of the Fourth Amendment is to safeguard the privacy and security of individuals against arbitrary invasions by government officials. In the past, courts have affirmed that telephone calls and letters are protected media that should only be available to law enforcement with an appropriate warrant. In the same manner, e-mail should be protected accordingly. The SCA provides some level of protection for this type of data depending on the length of time the e-mail has been stored but allows subpoenas and court orders to be issued under much lower standards than those of the Fourth Amendment and provides less protection to electronic communications than to wire and oral communications. While e-mail should be afforded the same level of constitutional protection as traditional forms of communication, the expectation of privacy that an individual can have in e-mail messages depends greatly on the type of message sent and to whom the message was sent.

On December 14th, 2010, the Sixth Circuit Court of Appeals became the first and only federal appellate court to address the applicability of Fourth Amendment protection to stored e-mails in the landmark case of United States v. Warshak. The Sixth Circuit held that the "reasonable expectation" of privacy for communications via telephone and postal mail also extended to stored e-mails (Benedetti 2013, p. 77). While this was an important first step in determining the future of e-mail privacy, there still remain critical questions pertaining to the government's ability to search and seize stored electronic communications and the proper balance between law enforcement's need to investigate criminal activity and the individual's need to protect personal privacy. Modern use of e-mail is as important to Fourth Amendment protections as traditional telephone conversations. Although the medium of communication will certainly change as technology evolves, the "reasonable expectation" of privacy exists in the intention of private citizens to exchange ideas between themselves in a manner that seeks to preserve the privacy of those ideas. As with the previous discussion of social media, what an individual seeks to preserve as private, even in an area accessible to the public, may be constitutionally protected (Hutchins 2007, p. 453).

Big data promises significant advances in information technology and commerce but also opens the door to a host of privacy and data protection issues. As social networking sites continue to collect massive amounts of data and
computational methods evolve in terms of power and speed, the exploitation of big data becomes an increasingly important issue in terms of cyber espionage and privacy concerns. Social media organizations have added to their data mining capabilities by acquiring other technology companies, such as when Google acquired DoubleClick and YouTube, or by moving into new fields, as Facebook did when it created "Facebook Places" (McCarthy 2010). In big data terminology, the breadth of an account measures the number of types of online interaction for a given user. The depth of an account measures the amount of data that user processes (Oboler et al. 2012). Taken together, the breadth and depth of information across multiple aspects of a user's life can be pieced together to form a composite picture of unexpected accuracy. In 2012, when Alma Whitten, Google's Director of Privacy, Product and Engineering, announced that Google would begin to aggregate data and "treat you as a single user across all our products," the response from users and critics was alarming. In an article for the Washington Post, Jeffrey Chester, Executive Director of the Center for Digital Democracy, voiced the reality that "There is no way a user can comprehend the implication of Google collecting across platforms for information about your health, political opinions and financial concerns" (Kang 2012). In the same article, James Steyer, Common Sense Media Chief, stated that "Google's new privacy announcement is frustrating and a little frightening."

Conclusion

As the world moves toward a "big data" culture centered around mobile computing, social media, and the storage of massive amounts of personal information, the threat from cyber espionage is considerable. Because of the low cost of entry and anonymity afforded by the Internet, anyone with basic technical skills can steal private information off computer networks. Due to the vast number of network attack methods and cyber espionage techniques, it is difficult to determine one effective solution. However, from the standpoint of economics and national security, we must strive to develop a more comprehensive set of legislation and protection for the vast stores of private information readily accessible on the Internet today.

Further Reading

Benedetti, D. (2013). How far can the government's hand reach inside your personal inbox? The John Marshall Journal of Information Technology & Privacy Law, 30(1). Retrieved from: http://repository.jmls.edu/cgi/viewcontent.cgi?article=1730&context=jitpl.
Burgunder, L. B. (2011). Legal aspects of managing technology (5th ed.). Mason: South-Western Cengage Learning.
Hutchins, R. M. (2007). Tied up in knotts? GPS technology and the fourth amendment. UCLA Law Review. Retrieved from: http://www.uclalawreview.org/pdf/55-2-3.pdf.
Kang, C. (2012). Google announces privacy changes across products; users can't opt out. Washington Post (24 January). Retrieved from: http://www.washingtonpost.com/business/economy/google-tracks-consumers-across-products-users-cant-opt-out/2012/01/24/gIQArgJHOQ_story.html.
Liptak, A. (2012). Justices say GPS tracker violated privacy rights. New York Times. Retrieved from: http://www.nytimes.com/2012/01/24/us/police-use-of-gps-is-ruled-unconstitutional.html?pagewanted=all&_r=0.
McCarthy, C. (2010). Facebook granted geolocation patent. CNet News (6 October). Retrieved from: http://news.cnet.com/8301-13577_3-20018783-36.html.
Oboler, A., Welsh, K., & Cruz, L. (2012). The danger of big data: Social media as computational social science. First Monday Peer Reviewed Journal. Retrieved from: http://firstmonday.org/ojs/index.php/fm/article/view/3993/3269#p4.
Sigholm, J., & Bang, M. (2013). Towards offensive cyber counterintelligence. 2013 European intelligence and security informatics conference. Retrieved from: http://www.ida.liu.se/~g-johsi/docs/EISIC2013_Sigholm_Bang.pdf.
Swingle, H. (2012). Smartphone searches incident to arrest. Journal of the Missouri Bar. Retrieved from: https://www.mobar.org/uploadedFiles/Home/Publications/Journal/2012/01-02/smartphone.pdf.
Turley, J. (2011). Supreme court's GPS case asks: How much privacy do we expect? The Washington Post. Retrieved from: http://www.washingtonpost.com/opinions/supreme-courts-gps-case-asks-how-much-privacy-do-we-expect/2011/11/10/gIQAN0RzCN_story.html.
Walder, N. (2010). Judge grants discovery of postings on social media. New York Law Journal. Retrieved from: http://www.law.com/jsp/article.jsp?id=1202472483935&Judge_Grants_Discovery_of_Postings_on_Social_Media.
Cyberinfrastructure (U.S.)

Ernest L. McDuffie
The Global McDuffie Group, Longwood, FL, USA

Introduction and Background

As the first two decades of the twenty-first century come to a close, an ever accelerating pace of technological advances continues to reshape the world. In this entry, a number of ongoing cyberinfrastructure research projects, funded by various agencies of the United States federal government, are briefly examined. While there are many definitions for the term cyberinfrastructure, a widely accepted one is the combination of computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people, all linked together by software and high-performance networks. Over the years, related research and development funding has originated more and more from government and less and less from private sector sources. High tech companies focus mainly on applied research that can be used in new products for increased profits. This process is well suited for bringing the benefits of mature technology to the public, but if allowed to become the sole destination for research funding, non-applied research will be underfunded. Without basic research advances, applied research dries up and comes to an abrupt end.

A subcommittee under the Office of Science and Technology Policy (OSTP) called the National Coordinating Office (NCO) for Networking and Information Technology Research and Development (NITRD) has for almost three decades annually published a Supplement to the President's budget. (See https://www.nitrd.gov/ for all current and past supplements.) These Supplements highlight a number of technologies. Tremendous potential impact on individuals and even greater potential impact on the nature and capabilities of the global cyberinfrastructure are clear.

Beyond these mainstream technologies is a set of fields of technological interest with even greater potential for revolutionary change. The combination of these sets, with the ever-present unpredictability of technological and scientific advances, results in a significantly increased probability of world-changing positive or negative events. Society can mitigate disruption caused by ignorance and the inevitable need for continual workforce realignment by focusing on the technical education of the masses. For the twenty-first century and beyond, knowledge and operational skills in the areas of science, technology, engineering, and mathematics (STEM) are critical for individuals and society.

Current Technological State

Looking at the funding picture at the federal level, impactful areas of research are for the most part computer based. All of the following data comes from the 2018 NITRD Supplement, which presents information based on the fiscal year 2019 federal budget request. In the Program Component Area (PCA) called Computing-Enabled Human Interaction, Communication, and Augmentation (CHuman), with significant funding requests from seven federal agencies – Department of Defense (DoD), National Science Foundation (NSF), National Institutes of Health (NIH), Defense Advanced Research Projects Agency (DARPA), National Aeronautics and Space Administration (NASA), National Institute of Standards and Technology (NIST), and National Oceanic and Atmospheric Administration (NOAA) – one of the strategic priorities is human-automation interaction. This research area focuses on facilitating the interaction between humans and intelligent systems such as robots, intelligent agents, autonomous vehicles, and systems that utilize machine learning. Three of the key programs are (1) Robust Intelligence, where the focus is support and advancement of intelligent systems that operate in complex, realistic contexts; (2) Smart and Autonomous Systems, researching systems that robustly think, act, learn, and behave ethically; and (3) Smart and Connected
Communities, where techno-social dimensions and their interactions in smart community environments are addressed.

In the PCA for Computing-Enabled Networked Physical Systems (CNPS), the integration of the cyber/information, physical, and human worlds is accomplished using information technology-enabled systems. Twenty-one federal agencies are active in this space. They are managed by Interagency Working Groups (IWGs). One is called the Cyber-Physical Systems (CPS) IWG, which includes the Smart Cities and Communities Task Force, and another is named the High Confidence Software and Systems (HCSS) IWG. Research activities in these groups include investigations into cyber-physical systems, the Internet of Things (IoT), and related complex, high-confidence, networked, distributed computing systems. In the HCSS IWG, one of the key programs executing under the strategic priority of assured autonomous and artificial intelligence (AI) technologies is AI and machine learning (AI/ML) for safety- and mission-critical applications. Activity here is supported by NIST, NSA, and NSF. Their efforts cover the search for techniques for assuring and engineering trusted AI-based systems, including development of shared public datasets and environments for AI/ML training and testing, and development of standards and benchmarks for assessing AI technology performance.

The highest funded PCA request is High-Capability Computing Infrastructure and Applications (HCIA). Seven federal agencies – DoD, NSF, NIH, Department of Energy (DOE), DARPA, NASA, and NIST – participate in HCIA. Here the focus is on computation- and data-intensive systems and applications; directly associated software, communications, storage, and data management infrastructure; and other resources supporting high-capability computing. All activities are coordinated and reported through the High End Computing (HEC) IWG, which has eight participating federal agencies and strategic priorities that include, but are not limited to, High-Capability Computing Systems (HCS) infrastructure as well as productivity and broadening impact.

Some of the key programs ongoing in the HEC IWG include an effort designed for the advancement of HCS applications. This research ranges from more basic or pure forms, such as applied mathematics and algorithms and initial activity in machine learning to optimize output from data-intensive programs at DOE, to more applied research in support of multiscale modeling of biomedical processes for improved disease treatment at NIH and multi-physics software applications to maintain military superiority at DoD. At the same time, work continues on HCS infrastructure. For example, NIH provides a shared interoperable cloud computing environment, high-capacity infrastructure, and computational analysis tools for high-throughput biomedical research. A joint program demonstrating multiagency collaboration on the Remote Sensing Information Gateway is being operated by the Environmental Protection Agency (EPA), NASA, and NOAA.

AI continues to be an important area of focus for the Intelligent Robotics and Autonomous Systems (IRAS) PCA. Here funding is requested across 14 agencies to explore intelligent robotic systems R&D in robotics hardware, software design and application, machine perception, cognition and adaptation, mobility and manipulation, human-robot interaction, distributed and networked robotics, and increasingly autonomous systems. Two of the four strategic priorities for IRAS are advanced robotic and autonomous systems along with intelligent physical systems, where a complex set of activities is involved. These activities include the development of validated metrics, test methods, information models, protocols, and tools to advance robot system performance and safety, and the development of measurement science infrastructure to specify and evaluate the capabilities of remotely operated or autonomous aerial, ground/underground, and aquatic robotic systems. Smart and autonomous systems that robustly sense, plan, act, learn, and behave ethically in the face of complex and uncertain environments are the focus.

Key programs of this PCA include the Mind, Machine, and Motor Nexus, where NSF and DoD are looking at research that supports an integrated treatment of human intent, perception, and behavior in interaction with embodied and intelligent engineered systems and as mediated by motor manipulation. The Robotic Systems for Smart Manufacturing program looks to advance
measurement science to improve robotic system performance, collaboration, agility, and ease of integration into the enterprise to achieve dynamic production for assembly-centric manufacturing, being executed at NIST. NIH has a Surgical Tools, Techniques, and Systems program that does R&D on next-generation tools, technologies, and systems to improve the outcomes of surgical interventions. The U.S. Navy's Office of Naval Research has a program under this PCA that is called Visual Common Sense. Here machines are developed with the capabilities to represent visual knowledge in compositional models with contextual relations and advance understanding of scenes through reasoning about geometry, functions, physics, intents, and causality.

Two closely related PCAs are the Large-Scale Data Management and Analysis (LSDMA) and the Large-Scale Networking (LSN) PCAs. LSDMA reports all its activities through the Big Data IWG, which has 15 federal agencies. LSN forms its own IWG with some 19 federal agencies. Together these two PCAs cover several strategic priorities such as future network development, network security and resiliency, wireless networks, the effective use of large-scale data resources, and workforce development efforts to address the shortage of data science expertise necessary to move big data projects forward.

A select few of the key programs underway for LSN and LSDMA are the development of technology, standards, testbeds, and tools to improve wireless networks. Within this effort, NSF is supporting research on beyond-5G wireless technologies for scalable experimentation. 5G is the next generation of cellular networks. Currently most systems in the United States operate a mix of 3G and 4G, with 5G set to deliver much greater speed and bandwidth, enabled by millimeter-wave-capable infrastructure and software applications designed to take advantage of the greater speed and higher volume of data availability. 6G and 7G networks will build on this framework and deliver capabilities difficult to even imagine over the next decade.

NSF and DARPA are leading efforts in foundational research to discover new tools and methodologies to use the massive amount of data and information that is available to solve difficult problems. These include problems related to generating alternative hypotheses from multisource data, machine reading and automated knowledge extraction, low-resource language processing, media integrity, automated software generation and maintenance, scientific discovery and engineering design in complex application domains, and modeling of global-scale phenomena. Infrastructure and tool development will focus on enabling interoperability and usability of data to allow users to access diverse data sets that interoperate both within a particular domain and across domains.

Accelerating Technological Advance

Artificial intelligence, quantum computing, nanotechnology, and fusion power have the potential to be the biggest game-changers in terms of accelerating the pace of technological advance over the next few decades. In addition to the efforts in the United States to advance AI, other work around the world is also moving forward. AI "holds tremendous promise to benefit nearly all aspects of society, including the economy, healthcare, security, the law, transportation, even technology itself" (https://www.nitrd.gov/pubs/National-AI-RD-Strategy-2019.pdf). The American AI Initiative represented a whole-of-government strategy of collaboration and engagement, calling for federal agencies to prioritize R&D investments, provide education and training opportunities to prepare the American workforce, and enhance access to high-quality cyberinfrastructure and data in the new era of AI.

Meanwhile, other countries, such as China, are moving forward with large-scale projects. "Tech giants, startups, and education incumbents have all jumped in. Tens of millions of students now use some form of AI to learn. It's the world's biggest experiment on AI in education, and no one can predict the outcome" (https://www.technologyreview.com/s/614057/china-squirrel-has-started-a-grand-experiment-in-ai-education-it-could-reshape-how-the/).

IBM has recently produced a quantum computer that can be accessed by the public. "System
One: a dazzling, delicate, and chandelier-like machine that's now the first integrated universal quantum computing system for commercial use, available for anyone to play with" (https://singularityhub.com/2019/02/26/quantum-computing-now-and-in-the-not-too-distant-future/, https://www.research.ibm.com/ibm-q/).

There is a difference between evolutionary and incremental advances in any field. Nanotechnology provides some interesting possibilities, where "evolutionary nanotechnology involves more sophisticated tasks such as sensing and analysis of the environment by nano-structures, and a role for nanotechnology in signal processing, medical imaging, and energy conversion" (http://www.trynano.org/about/future-nanotechnology).

Fusion, the process that stars use to generate energy, is being pursued by many nations in an attempt to solve growing energy needs. "That's exactly what scientists across the globe plan to do with a mega-project called ITER. It's a nuclear fusion experiment and engineering effort to bridge the valley toward sustainable, clean, limitless energy-producing fusion power plants of the future" (https://www.insidescience.org/video/future-fusion-energy).

Conclusion

It may be possible to represent the interaction between technical areas and related scientific fields of interest with a bidirectional, weighted, fully connected graph, where the nodes represent the different scientific fields and the edge weights represent the amount of interaction between the connected nodes. Analysis of such a graph could provide insight into where new ideas and technologies may be emerging and where best to increase funding to produce even more acceleration of the overall process.

Global cyberinfrastructure is at the center of various emerging technologies that are in the process of making major advances. The speed and impact of these potential advances are being enabled and accelerated by the growth of the scale and capabilities of the cyberinfrastructure on which they depend. This interdependency will deepen and expand, becoming indispensable for the foreseeable future.

Further Reading

America's Energy Future: Technology and Transformation. (2009). http://nap.edu/12091.
Frontiers in Massive Data Analysis. (2013). http://nap.edu/18374.
Quantum Computing: Progress and Prospects. (2019). http://nap.edu/25196.
Implications of Artificial Intelligence for Cybersecurity: Proceedings of a Workshop. (2019). http://nap.edu/25488.

Cybersecurity

Joanna Kulesza
Department of International Law and International Relations, University of Lodz, Lodz, Poland

Definition and Outlook

Cybersecurity is a broad term referring to measures taken by public and private entities, aimed at ensuring the safety of online communications and resources. In the context of big data, it refers to the potential threats that any unauthorized disclosure of personal data or trade secrets might have on national, local, or global politics and economics. It also refers to the hardware, software, and individual skills deployed in order to mitigate risks originated by the online transfer and storage of data, such as encryption technology, antivirus software, and employee training. Threats giving rise to the need for enhanced cybersecurity measures include but are not limited to targeted attacks by organized groups (hackers), acting independently or employed by private entities or governments. Such attacks are usually directed at crucial state or company resources. Cybersecurity threats also include malware designed to damage hardware and/or other resources by, e.g., altering their functions or allowing for a data breach. According
to the "Internet Security Threat Report 2018" by the software company Symantec, other threats include cybercrime, such as phishing or spam. New threats for cybersecurity are originated by the increased popularity of social services and mobile applications, including the growing significance of GPS data and cloud architecture. They also include the "Internet of Things," with new devices granted IP addresses and providing new kinds of information, vulnerable to attack. All that data significantly fuels the big data business, offering new categories of information and new tools to process them. It significantly impacts the efficiency of customer profiling and the effectiveness of product targeting. Effectively, the need for enhanced cooperation between companies offering new services and law enforcement agencies results in a heated debate on the limits of individual freedom and privacy in the global fight for cybersecurity.

Cybersecurity and Cyberwar

The turn of the twenty-first century brought an increased number and impact of international hostilities effected online. Almost every international conflict has been accompanied by its online manifestation, taking the form of malicious software deployed in order to impair a rival's critical infrastructure or state-sponsored hacker groups attacking resources crucial to the opponent. The 2008 Georgia–Russia conflict resulted in attacks on Georgian authorities' websites, originated from Russian territory. The ongoing tension between North and South Korea led to Oplan 5027 – a plan for US aid in case of a North Korean attack – being stolen from Seoul in 2009, while the 2011 Stuxnet virus, designed to damage Iranian uranium enrichment facilities, was allegedly designed and deployed by Israeli and US government agencies, reflecting the long-lasting Near East conflict. Next to air, water, and ground, cyberspace has become the fourth battleground for international conflicts.

The objects of cybersecurity threats range from material resources, such as money stored in banks offering electronic access to their services as well as online currencies, stolen from individuals, through company secrets targeted by hackers hired by competition, up to critical state infrastructure, including power plants, water supplies, or railroad operating systems infected with malicious code altering their operation, bringing a direct threat to the lives and security of thousands.

Cybersecurity threats may be originated by individuals or groups acting either for their own benefit, usually a financial one, or upon an order or authorization of business or governments. While some of the groups conducting attacks on critical infrastructure claim to be only unofficial supporters of national politics, like the pro-Kremlin "Nashi" group behind the 2007 Estonia attacks, ever more states, despite officially denying confirmation in individual cases, employ hackers to enhance national security standards and, more significantly, to access or distort confidential information of others. State officials admit the growing need for increased national cybersecurity by raising their military resilience in cyberspace, training hackers, and deploying elaborate spying software designed at state demand. USCYBERCOM and Unit 61398 of the Chinese People's Liberation Army are subject to continuous, mutual, consequentially denied accusations of espionage. Similarly, the question of German authorities permitting the "Bundestrojaner" – state-sponsored malicious software used by German police for individual surveillance of Internet telephony – has been subject to heated debate over the limits of allowed compromise between privacy, state sovereignty, and cybersecurity.

Because hostile activities online often accompany offline interstate conflict, they are being referred to as acts of "cyberwar," although the legal qualification of international hostilities enacted online as acts of war or international aggression is disputable. Ever more online conflicts attributed to states do not reflect ones ongoing offline, just to mention the long-lasting US–China tension resulting in mass surveillance by both parties of one another's secret resources. The subsequent attacks aimed against US-based companies and government agencies, allegedly originated from Chinese territory, codenamed by
the US intelligence "Titan Rain" (2003), "electronic Pearl Harbour" (2007), and "Operation Aurora" (2011), resulted in a breach of terabytes of trade secrets and other data. They did not, however, reflect an ongoing armed conflict between those states, neither did they result in an impairment of critical state infrastructure. Following the discussion initiated in 1948 about the prohibition of force in international relations as per Article 2(4) of the United Nations Charter, in 1974 the international community declined to recognize, e.g., economic sanctions as acts of war in United Nations General Assembly Resolution 3314 (XXIX) on the Definition of Aggression, restricting the definition to direct military involvement within foreign territory. A similar discussion is ongoing with reference to cybersecurity, with one school of thought arguing that any activity causing damages or threats similar to those of a military attack ought to be considered an act of war under international law and another persisting with the narrowest possible definition of war, excluding any activity beyond a direct, military invasion of state territory from its scope, as crucial for maintaining international peace. Whether a cyberattack, threatening the life and wellbeing of many, yet effected without the deployment of military forces, tanks, or machine guns, ought to be considered an act of international aggression, or, consequentially, whether lines of computer code hold similar significance to tanks crossing national borders, is unclear, and the status of an international cyberattack as an act of international aggression remains an open question.

Cybersecurity and Privacy

As individual freedom ends where joint security begins, the question of the limits of permissible precautionary measures and countermeasures against cybersecurity threats is crucial for defining it. The information on secret US mass surveillance published in 2013, describing the operation of the PRISM, UPSTREAM, and XKeyscore programs, used for collecting and processing individual communications data, gathered by US governmental agencies following national law yet not in line with international privacy standards, incited the debate on individual privacy in the era of global insecurity. A clear line between individual rights and cybersecurity measures must be drawn. One can be found in international law documents and practice, with the guidelines and recommendations by the United Nations Human Rights Committee setting a global minimum standard for privacy. Privacy as a human right allows each individual to have their private life, including all information about them, their home, and correspondence, protected by law from unauthorized interference. State authorities are obliged as per international law to introduce legal measures effectively affording such protection and to act with due diligence to enable each individual under their jurisdiction the full enjoyment of their right. Any infraction of privacy may only be based on a particular provision of law, applied in individually justified circumstances. Such circumstances include the need to protect collective interests of others; that is, the right to privacy may be limited for the purpose of protecting the rights of others, including the need to guarantee state or company security. Should privacy be limited as per national law, the consent of the individual whom the restriction concerns must be explicitly granted or result from particular provisions of law applied by courts with reference to individual circumstances. Moreover, states are under an international obligation to take all appropriate measures to ensure privacy protection against infraction by third parties, including private companies and individuals. This obligation results in the need to introduce comprehensive privacy laws, applicable to all entities dealing with private information, that accompany any security measures. Such regulations, either included in civil codes or personal data acts, are at a relatively low level of harmonization, obliging companies, in particular ones operating in numerous jurisdictions, to take it upon themselves to introduce comprehensive privacy policies, reflecting those varying national standards and meeting their clients' demand, while at the same time ensuring company security. Effectively, the issue of corporate cybersecurity needs to be discussed.
Business Security Online

According to the latest Symantec report, electronic crime will continue to increase, resulting in the need for tighter cooperation between private business and law enforcement. It is private actors who in the era of big data hold the power to provide unique information on their users to authorities, be it in their fight against child pornography or international terrorism. States no longer gather intelligence through their exclusive channels alone, but rather resort to laws obliging private business to convey personal information or the contents of individual correspondence to law enforcement officials. The potential threat to individual rights generated by big data technology also results in increased users' awareness of the value their information, stored in the cloud and processed in bulk, holds. They require their service providers to grant them access to their personal information stored by the operator and to have the right to decide upon what happens to information so obtained. Even though international privacy standards might be relatively easy to identify, giving individuals the right to decide on what information about them may be processed and under what circumstances, national privacy laws differ thoroughly, as some states decline to recognize the right to privacy as a part of their internal legal order. A similar problem arises when freedom of speech is considered – national obscenity and state security laws differ thoroughly, even though they are based on a uniform international standard. Effectively, international companies operating in numerous jurisdictions and dealing with information generated in different states need to carefully shape their policies in order to meet national legal requirements and the needs of global customers. They also need to safeguard their own interest by ensuring the security and confidentiality of rendered services.

Reacting to the incoherent national laws and growing state demands, businesses have produced elaborate company policies, applicable worldwide. As some argue, business policies have in some areas taken over the role of national laws, with few companies in the world having more capital and effective influence on world politics than numerous smaller states. Effectively, company policies on cooperation with state authorities and international consumer care shape the global cybersecurity landscape. Projects such as the Global Network Initiative or the UN "Protect, Respect and Remedy" Framework are addressed directly to global business, transposing international human rights standards onto company obligations. While traditionally it is states who need to transpose international law onto national regulations binding to business, in the era of big data and global electronic communications, transnational companies need to identify their own standards of cybersecurity and consumer care, applicable worldwide.

Cybersecurity and Freedom of Speech

Cybersecurity measures undertaken by states result not only in certain limitations put on individual privacy, subject to state surveillance, but also influence the scope of freedom of speech allowed by national laws. An international standard for free speech and its limits is very hard to identify. While there are numerous international conventions in place that deal with hate speech or preventing genocide and prohibit, e.g., direct and public incitement to commit genocide, it is up to states to transpose this international consensus onto national law. Following different national policies, what one state considers to be protecting national interests another might see as inciting conflicts among ethnic or religious groups. Similarly, the flexible compromise on free speech present in international human rights law, well envisaged by Article 10 of the Universal Declaration on Human Rights (UDHR) and Article 19 of the International Covenant on Civil and Political Rights (ICCPR), grants everyone the right to freedom of expression, including the right to seek, receive, and impart information and ideas of all kinds, regardless of frontiers and through any media, yet puts two significant limitations on this right. It may be subject to restrictions provided by law and necessary for respect of the rights or reputations of others, or for the protection of national security, public order, or public health or morals. Those broad
limitative clauses allow national authorities to limit free speech exercised online for reasons of national cybersecurity. Wikileaks editors and contributors, as well as PRISM whistleblower Edward Snowden, who disclosed secret information on the US National Security Agency's practices, have tested the limits of allowed free speech when confronted with national security and cybersecurity agendas. The search for national cybersecurity brings ever more states to put limits on free speech exercised online, seeing secret state information or criticism of state practice as a legitimate danger to public order and state security, yet no effective international standard can be found, leaving all big data companies on their own in the search for a global free speech compromise.

Summary

Cybersecurity has become a significant issue on national agendas. It covers broad areas of state administration and public policy, stretching from military training to new interpretations of press law and the limits of free speech. As cyberthreats include both direct attacks on vulnerable telecommunication infrastructure and publications considered dangerous to public order, national authorities reach for new restraints on online communications, limiting individual privacy rights and free speech. Big data generates new methods for effective online surveillance conducted by states and private business alike. Hence any discussion on cybersecurity reflects the need to effectively protect human rights exercised online. According to the United Nations Human Rights Council, new tools granted by the use of big data for cybersecurity purposes need to be used within limits set by international human rights standards.

Cross-References

▶ Cyber Espionage
▶ Data Provenance
▶ Privacy

Further Reading

Brenner, J. (2013). Glass houses: Privacy, secrecy, and cyber insecurity in a transparent world. New York: Penguin Books.
Clarke, R. A., & Knake, R. (2010). Cyber war: The next threat to national security and what to do about it. New York: Ecco.
Deibert, R. J. (2013). Black code: Surveillance, privacy, and the dark side of the internet. Toronto: McClelland & Stewart.
DeNardis, L. (2014). The global war for internet governance. New Haven: Yale University Press.
Human Rights Council. Resolution on promotion, protection and enjoyment of human rights on the Internet. UN Doc. A/HRC/20/L.13.
D

Dark Web

▶ Surface Web vs Deep Web vs Dark Web

Darknet

▶ Surface Web vs Deep Web vs Dark Web

Dashboard

Christopher Pettit and Simone Z. Leao
City Futures Research Centre, Faculty of Built Environment, University of New South Wales, Sydney, NSW, Australia

Synonyms

Console; Control panel; Indicator panel; Instrument board

Definition/Introduction

Dashboards have been defined by the authors of this entry as "graphic user interfaces which comprise a combination of information and geographical visualization methods for creating metrics, benchmarks, and indicators to assist in monitoring and decision-making."

The volume, velocity, and variety of data being produced raise challenges in how to manipulate, organize, analyze, model, and visualize such big data in the context of technology and data-driven processes for high-performance smart cities (Thakuriah et al. 2017). The dashboard plays an important role both in supporting city policy and decision-making and in the democratization of digital data and citizen engagement.

Historical and Technological Evolution of Dashboards

Control Centers
The term "dashboard" has its origins in the vehicle dashboard, where the driver has critical information provided to them via a series of dials and indicators. The vehicle dashboard typically provides real-time information on key metrics which the driver needs to know in making timely decisions in navigating from A to B. Such metrics include speed, oil temperature, fuel usage, distance traveled, etc. As vehicles have advanced over the decades, so have their dashboards, which are now typically digital and linked to the car computer, which has a growing array of sensors. This is somewhat analogous to our cities, which have a growing array of sensors and information, some in real time, that can be used to formulate reports against their performance. As the concept of the
dashboard has matured over recent decades, it is important to trace its lineage and highlight key events in its evolution.

Cybersyn, developed by Stafford Beer in 1970, is one of the first examples in history attempting to bring the idea of a dashboard to the context of human organizations (Medina 2011). It was a control center for monitoring the economy of Chile, based on data about the performance of various industries. The experience was not successful due to several technological limitations associated with slow and out-of-sync analogue data collection, and insufficient theories and methods to combine the extensive quantities of varied data.

Benefiting from advances in computer technologies, the idea of control centers progressed in the 1980s and 1990s. Such examples include (i) the Bloomberg Terminals (1982), designed for finance professionals monitoring business key performance indicators like cash flow, stocks, and inventory; (ii) the CompStat platform of the New York City Police (1994) for aggregating and mapping crime statistics; and (iii) the Baltimore CitiStat Room (1999) for internal government accountability using metrics and benchmarks (Mattern 2015).

The Operational Center in Rio de Janeiro, Brazil, is a more recent dashboard approach in the form of a 24-hour control center developed by IBM making use of the Internet of things in a smart city context (Kitchin et al. 2015). Launched in December 2010 as a response by the government after significant death and damage due to landslides from severe rainstorms in April of the same year, it aimed at providing real-time data and analytics for emergency management. The Operational Center monitors the city in a single large control room with access to 560 cameras, multiple sensors, and a detailed weather forecasting computer model, also connected to installed sirens and an SMS system in 100 high-risk neighborhoods across the city (Goodspeed 2015).

The Operational Center in Rio is assessed as useful for dealing with emergency situations, although it does not address the root problem of landslides associated with informal urbanization. Rio's Operational Center provides a clear example of a dashboard that is used to tackle urban management issues with real-time data yet falls short of supporting long-term strategic planning and decision-making. Therefore, some dashboard initiatives which incorporate more aggregated and pre-analyzed information of value to longer-term city planning are included in the following section.

Dashboards for Planning and Decision-Making
In the context of cities, dials and indicators can be a measure of performance against a range of environmental, social, and economic criteria. For city planners, policy-makers, politicians, and the community at large, it is essential that these indicators can be measured and visualized using available data for a selected city, and a dashboard can provide a window into how the city is performing against such indicators. In 2014 the International Standards Organization (ISO) published ISO37120, which is a series of indicators to measure and compare the global performance of city services and quality of life. As of writing this entry, there were 38 cities from around the world that have published their performance against some citywide indicators, which are available via the World Council's City Data Open Portal (http://open.dataforcities.org/). This data portal is essentially a city dashboard where cities can be compared against one another and against indicators. Another similar dashboard, which operates at the country and regional levels of geography, is the OECD's Better Life Index (http://www.oecdbetterlifeindex.org). The Better Life Index reports the performance of countries and regions against 11 topics including housing, income, job, employment, education, environment, civic engagement, health, life satisfaction, safety, and work-life balance. Countries and regions can be compared using maps, graphs, and textual descriptions.

At a city level, Baltimore CitiStat is among the early initiatives of a dashboard approach for planning in the USA (http://citistat.baltimorecity.gov/), which was influential to similar developments in other cities across America. Launched in 1999
with a control center format for government internal planning, in 2003 it was complemented with an online presence through a website of city operational statistics for planners and citizens (Mattern 2015). CitiStat tracks the day-to-day operations of agencies against key performance indicators with a focus on collaboration, problem-solving, transparency, and accountability, and offers some interactivity for service delivery to citizens.

The Dublin Dashboard is another example of a city-level platform extending beyond real-time data (Kitchin et al. 2016). To characterize how Dublin is performing against selected indicators, and how its performance compares to other places, data are aggregated spatially and temporally, so trends and comparisons can be made. Only to characterize "what is happening right now in Dublin," which is a small part of the dashboard, some real-time data was harvested and presented.

The Sydney 30-Minute City Dashboard (https://cityfutures.be.unsw.edu.au/cityviz/30-min-city/) is an example of displaying outputs of pre-developed analysis based on real-time big data. Preprocessing of real-time data was necessary due to privacy issues associated with the smart card public transport data, the large volumes of data to be stored and processed, and the aim of the platform to display the data through a specific lens (travel time against the goal of a 30-minute accessibility city). Total counts and statistics are presented along with informative map and graph visualizations, allowing the characterization of specific employment centers, and comparisons among them, regarding how much they deviate from the goal of a city form that promotes travel times within 30 min.

These examples were built as multiple-layer websites open to any user with access to the Internet, significantly differing from control rooms. Indeed, with the ubiquity of smartphones and the growth of networked communication and social media, the engagement of citizens in city life has changed. The democratization of data, transparency of planning processes, and participation of citizens in the assessment or planning of the city are encouraged. Dashboards have evolved to respond to this demand and requirement for more robust citizen engagement.

Dashboards for Citizens
There is a group of dashboards that emphasizes the use of open data feeds to display real-time data, primarily focused on a single screen or web page and having individual citizens as the target audience. Examples include the London City Dashboard (Gray et al. 2016) and CityDash for Sydney (Pettit et al. 2017). For these, the design of the dashboards was guided by data availability, the frequency of update, and potential interest to citizens. As noted by Batty (2015), in this type of dashboard each widget may have its own interpretation by each user without needing a detailed analysis. The Sydney CityDash, for example, aggregates data feeds including air quality, weather, sun protection, market information, multimodal public transit, traffic cameras, news, and selected Twitter feeds, with frequent updates every few seconds, few minutes, or an hour, depending on the parameter.

Deviating from the single-screen approach, Singapore created a multilayered dashboard as its smart city platform to integrate government agencies and to engage industry, researchers, and citizens. The Singapore Smart Nation Platform (https://www.smartnation.sg/) was designed with the goal of improving service delivery with the use and testing of innovative technologies and of promoting monitoring by and feedback from the varied users, particularly intermediated by ubiquitous smartphones.

Private businesses and service providers also started to develop dashboard applications to support everyday life in cities via personal smartphones. One growing field is the monitoring of health performance based on personal movement data recorded through wearable activity tracker devices. The dashboard component of fitness tracking technology such as Fitbit and JawBone was found to be a significant motivation factor for users by providing metrics of "progress toward a goal" (Asimakopoulos et al. 2017). Another growing area is monitoring consumption of basic utilities such as water and energy using
smartphone applications, including some that can remotely control heating or cooling in the place of residence. AGL, for example, is a large energy and gas provider in Australia which launched a smartphone application that tracks energy consumption and also energy production for those with solar systems installed, sends alerts related to user targets, and makes future forecasts (https://www.agl.com.au/residential/why-choose-agl/agl-energy-app). These examples benefit from smartphones not only as a tracking and control device but also as the visualization platform.

Taxonomy of Dashboards

Using seven contemporary dashboards from across the globe mentioned in this text, a proposed taxonomy has been constructed to characterize dashboards in general. The taxonomy is presented in Table 1 and includes (1) Baltimore CitiStat, 2003; (2) Rio de Janeiro Operational Center, 2010; (3) London City Dashboard, 2012 (Fig. 1a); (4) Dublin Dashboard, 2015 (Fig. 1b); (5) Smart Nation Singapore, 2015; (6) Sydney CityDash, 2016 (Fig. 1c); and (7) Sydney 30-Min City Dashboard, 2016 (Fig. 1d). Figure 1 illustrates the interfaces of some of these dashboards.

[Dashboard, Fig. 1 Examples of urban dashboards: (a) London City Dashboard, (b) Dublin Dashboard, (c) Sydney CityDash, (d) Sydney 30-Min City Dashboard]

Dashboard, Table 1 Taxonomy of dashboards

Access to data
- Open: Uses data which is openly available to anyone, usually captured automatically by the dashboard through APIs. Examples: (1), (3), (4), (5), (6)
- Closed: Uses data which is available only through licenses for specific purposes. Examples: (1), (2), (4), (5), (7)

Frequency of data
- Real time: Uses data which is captured in real time by sensors and frequently updated in the dashboard through automated processing. Examples: (2), (3), (4), (5), (6)
- Preprocessed: Uses data (captured in real time or at other frequencies) which is processed and analyzed before being displayed on the dashboard. Examples: (1), (2), (4), (5), (7)

Size of data
- Big data: Uses data categorized as big data, with high volume, velocity, and variety, which raises challenges for computer systems regarding data storage, sharing, and fast analysis. Examples: (1), (2), (3), (4), (5), (6), (7)
- Other: Uses data with small sizes easily managed by ordinary computer systems to store, share, and analyze. Examples: (1), (2), (4), (5)

Dashboard audience
- Decision-makers: Aims to provide information with contents, spatial and temporal scales, and analytical tools suitable to respond to urban planning issues. Examples: (1), (2), (4), (5), (7)
- Citizens: Aims to provide information with contents, spatial and temporal scales, and analytical tools suitable to respond to individual citizen issues. Examples: (1), (3), (4), (5), (6)

Conclusions

Numerous urban and city dashboards exist nowadays, and many more are expected to be designed and developed over the coming years. This next generation of dashboards will likely include new features, and functionality will take advantage of accompanying advances in computer technologies and city governance. Kitchin and McArdle (2017) identified six key issues on how we come to know and manage cities through urban data and city dashboards: (1) How are insights and value
derived from city dashboards? (2) How comprehensive and open are city dashboards? (3) To what extent can we trust city dashboards? (4) How comprehensible and useable are city dashboards? (5) What are the uses and utility of city dashboards? And (6) how can we ensure that dashboards are used ethically? Aligned to these concerns, some recommendations for the structure and design of new dashboards are suggested by Pettit et al. (2017): (1) understand the specific purpose of a dashboard and design accordingly; (2) when developing a city dashboard, human-computer interaction guidelines – particularly around usability – should be considered; (3) dashboards should support the visualization of big data to support locational insights; (4) link dashboards to established online data repositories, commonly referred to as open data stores, clearinghouses, portals, or hubs; and (5) support a two-way exchange of information to empower citizens to engage with elements of the dashboard.

A key challenge is the ability to visualize big data in real time. Incremental changes to big datasets are possible when breaking big datasets down to individual "little" records such as tweets from a Twitter database or individual tap-on and tap-off records from a mass transit smart card system. However, as these systems are scaled over larger geographies, the ability to visualize big data in real time becomes more challenging.

Dashboards to improve the efficiency of our cities and decision-making are an ongoing endeavor. Dashboards have proven utility in traffic management and crisis management, but when it comes to
strategic long-term planning, the value proposition of the dashboard is yet to be determined. Also, dashboards that can be used to truly empower citizens in city planning are the next frontier.

Further Reading

Asimakopoulos, S., Asimakopoulos, G., & Spillers, F. (2017). Motivation and user engagement in fitness tracking: Heuristics for mobile healthcare wearables. Informatics, 2017(4), 5. https://doi.org/10.3390/informatics4010005.
Batty, M. (2013). Big data, smart cities and city planning. Dialogues in Human Geography, 3(3), 274–279.
Batty, M. (2015). A perspective on city dashboards. Regional Studies, Regional Science, 2(1), 29–32.
Goodspeed, R. (2015). Smart cities: Moving beyond urban cybernetics to tackle wicked problems. Cambridge Journal of Regions, Economy and Society, 8(1), 79–92.
Gray, S., O'Brien, O., & Hügel, S. (2016). Collecting and visualizing real-time urban data through city dashboards. Built Environment, 42(3), 498–509.
Kitchin, R., & McArdle, G. (2017). Urban data and city dashboards: Six key issues. In R. Kitchin, T. P. Lauriault, & G. McArdle (Eds.), Data and the city. London: Routledge.
Kitchin, R., Lauriault, T. P., & McArdle, G. (2015). Knowing and governing cities through urban indicators, city benchmarking and real-time dashboards. Regional Studies, Regional Science, 2(1), 6–28.
Kitchin, R., Maalsen, S., & McArdle, G. (2016). The praxis and politics of building urban dashboards. Geoforum, 77, 93–101.
Mattern, S. (2015). Mission control: A history of the urban dashboard. Places Journal, March 2015. https://placesjournal.org/article/mission-control-a-history-of-the-urban-dashboard/.
Medina, E. (2011). Cybernetic revolutionaries: Technology and politics in Allende's Chile. Cambridge: The MIT Press.
Pettit, C. J., Lieske, S., & Jamal, M. (2017). CityDash: Visualising a changing city using open data. In S. Geertman, J. Stillwell, A. Andrew, & C. J. Pettit (Eds.), Planning support systems and smart cities (Lecture notes in geoinformation and cartography, pp. 337–353). Basel: Springer International Publishing.
Thakuriah, P. (Vonu), Tilahun, N. Y., & Zellner, M. (2017). Seeing cities through big data. Springer Geography. https://doi.org/10.1007/978-3-319-40902-3_1.

Data

▶ "Small" Data

Data Aggregation

Tao Wen
Earth and Environmental Systems Institute, Pennsylvania State University, University Park, PA, USA

Definition

Data aggregation refers to the process by which raw data are gathered, reformatted, and presented in a summary form for subsequent data sharing and further analyses. In general, raw data can be aggregated in several ways, such as by time (e.g., monthly and quarterly), by location (e.g., city), or by data source. Aggregated data have long been used to delineate new and unusual data patterns (e.g., Wen et al. 2018). In the big data era, data are being generated at an unprecedentedly high speed and volume, which is a result of automated technologies for data acquisition. Aggregated data, rather than raw data, are often utilized to save storage space and reduce energy and bandwidth costs (Cai et al. 2019). Data aggregation is an essential component of data management, in particular during the "Analysis and Discovery" stage of the data life cycle (Ma et al. 2014).

Data Aggregation Processes and Major Issues

The processes of transforming raw data into aggregated data can be summarized as a three-step protocol (Fig. 1): (1) pre-aggregation; (2) aggregation; and (3) post-aggregation. These steps are further described below.

[Data Aggregation, Fig. 1 General processes of data aggregation: pre-aggregation (gathering and preparing raw data), aggregation (applying aggregate functions to the raw data), and post-aggregation (storing, publishing, and analyzing the aggregated data), which may feed the next round of data aggregation]

Pre-aggregation
This step starts with gathering data from one or more data sources. The selection of data sources is dependent on both the availability of raw data and the goal of the "Analysis and Discovery" stage. Many search tools are available to assist researchers in locating datasets and data repositories (e.g., Google Dataset Search and re3data by
DataCite). Some discipline-specific search tools are also available (e.g., DataONE for earth sciences). Data repositories generally refer to places hosting datasets. For example, Kaggle, an online repository, hosts processed datasets from a variety of disciplines. The National Water Information System (NWIS) by the United States Geological Survey (USGS) and the STOrage and RETrieval (STORET) database by the United States Environmental Protection Agency (USEPA) both provide access to water quality data for the entire United States. Incorporated Research Institutions for Seismology (IRIS) is a collection of seismology-related data (e.g., waveform and seismic event data). Data downloaded from different sources are often not in a consistent format. In particular, data from different sources might be reported in different units (e.g., Niu et al. 2018), with different accuracy, and/or in different file formats (e.g., JavaScript Object Notation vs. Comma Separated Values). In addition, missing data are also very common. Before data aggregation in the next step, data need to be cleaned and reformatted (noted as "preparing" in Fig. 1) into a unified format.

The most glaring issue in the pre-aggregation step might be related to data availability. Desired raw data might not be accessible to perform data aggregation. This situation is not uncommon, especially in business, since many of these raw data are considered proprietary. For example, the unique identifier of persons clicking an Internet advertisement is often not accessible (Hamberg 2018). To resolve this problem, many communities, especially academia, have started to advocate open data and the FAIR Principles (i.e., findable, accessible, interoperable, and reusable) when sharing data with data users.

Aggregation
A variety of aggregate functions are available to summarize and transform the raw data into aggregated data. These aggregate functions include (but are not limited to) minimum, mean, median, maximum, variance, standard deviation, range, sum, and count. In general, raw data can be divided into two types: numeric and categorical. Numerical data are often measurements of quantitative features (e.g., air temperature, sulfate concentration, stream discharge), and they often have mathematical meaning. Unlike numerical data, categorical data are qualitative representations (e.g., city name, mineral color, soil texture). The functions listed above might not be applicable to all types of raw data. For example, categorical data can be counted but cannot be averaged. Additionally, raw data can be aggregated over time or over space (e.g., counting the number of Fortune 500 companies in different cities). The best way to aggregate data (e.g., which aggregate function to use) should be determined by the overarching goal of the study. For example, if a researcher is interested in how housing prices fluctuate on a monthly basis for a few given cities, they should consider aggregating their raw data in two steps sequentially: (1) spatially by city, and (2) temporally aggregating the data of each city by month using the mean or median functions.
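As a minimal sketch of this two-step aggregation in the Python pandas package (one of the open-source tools discussed under Tools below) – the file name and column names (city, sale_date, price) are hypothetical and not part of this entry – the spatial and temporal grouping might look as follows:

```python
# Hypothetical sketch: aggregate raw house sales first spatially by city,
# then temporally by month, using a robust aggregate function (the median).
import pandas as pd

sales = pd.read_csv("house_sales.csv", parse_dates=["sale_date"])  # one row per sale

sales["month"] = sales["sale_date"].dt.to_period("M")
monthly_by_city = (
    sales.groupby(["city", "month"])["price"]
         .median()                       # median is robust to extreme sale prices
         .reset_index(name="median_price")
)

# The aggregated table can now be examined for monthly fluctuation per city.
monthly_by_city["pct_change"] = (
    monthly_by_city.groupby("city")["median_price"].pct_change() * 100
)
```

Grouping by city and month in one groupby call is equivalent to performing the two steps sequentially, since each (city, month) group is reduced independently.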
Data can be aggregated into groups (i.e., levels of segmentation) in many different ways; e.g., housing prices of the United States can be divided by state, by county, or by city. In the aggregation step, problems can arise if raw data are not aggregated to the proper level of segmentation (Hamberg 2018). Below, an example from a water quality study is provided to illustrate this problem.

Data Aggregation, Table 1 Sulfate concentration (in milligram/liter; raw data) collected from 01/01/1970 to 01/03/1970 at a hypothetical USGS site

Sampling date and time     Sulfate concentration (milligram/liter)
01/01/1970, 10 AM          15
01/01/1970, 1 PM           10
01/01/1970, 4 PM           5
01/02/1970, 10 AM          2
01/03/1970, 10 AM          3

In Table 1, a hypothetical dataset of sulfate concentration (on an hourly basis) from a USGS site is listed for 3 days: 01/01/1970–01/03/1970. To calculate the mean concentration over these 3 days, a researcher should first aggregate concentration by day (each of these 3 days will have a daily mean), and then aggregate these three daily means in order to get a more representative value. Using this approach, the calculated mean sulfate concentration over these 3 days is 5 milligram/liter. Because more sulfate measurements are available on 01/01/1970, the researcher should avoid directly aggregating the five measurements of these 3 days, since this approach gives more weight to a single day, i.e., 01/01/1970. In particular, direct aggregation of these five measurements yields a biased 3-day mean of 7 milligram/liter, which is higher than 5 milligram/liter by 40%.

Post-Aggregation
In this step, aggregated data might warrant further data aggregation, in which aggregated data from the last round of data aggregation will be used as the input "raw data" in the next round. Alternatively, aggregated data might be ready for data analysis, publication, and storage. For example, in the above dataset of aggregated sulfate concentration on a monthly basis, time series analysis can be performed to determine the temporal trend of sulfate concentration, i.e., decline, increase, or unchanged.

Tools

Many tools are available for data aggregation. These tools generally fall into two categories: proprietary software and open-source software.

Proprietary software: Proprietary software is not free to use and might have less flexibility compared to open-source software; however, technical support is often more readily available for users of proprietary software. Examples of popular proprietary software include Microsoft Excel, Trifacta (Data) Wrangler, Minitab, SPSS, MATLAB, and Stata, all of which are mostly designed for preparing data (i.e., part of step 1: data cleaning and data reformatting) and aggregation (i.e., step 2). Some of these pieces of software (e.g., Excel and MATLAB) provide functions to retrieve data from varying sources (e.g., databases and webpages).

Open-source software: Open-source software is free of cost to use, although it might have a steeper learning curve compared to proprietary software, since programming or coding skills are often required to use open-source software. Open-source software can be either a stand-alone program or a package (or library) of functions written in free programming languages (e.g., Python and R). One example of a stand-alone program is GNU Octave, which is basically an open-source alternative to MATLAB and can be used throughout all steps of data aggregation. Many programming packages are available for applications in the aggregation step (e.g., NumPy, SciPy, and Pandas in Python; dplyr and tidyr in R). These example packages can deal with data from a variety of disciplines. Some other packages, including Beautiful Soup and html.parser, help parse data from webpages. In certain disciplines, some packages are present to serve both steps 1 and 2, e.g., dataRetrieval in R allows users to gather and aggregate water-related data.
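The Table 1 calculation can be reproduced with one of the open-source packages listed above; the following pandas sketch (variable names are illustrative) makes the difference between the biased mean and the day-weighted mean explicit:

```python
# Reproduces the Table 1 example: naive mean of all samples vs. mean of daily means.
import pandas as pd

sulfate = pd.DataFrame({
    "sampled_at": pd.to_datetime([
        "1970-01-01 10:00", "1970-01-01 13:00", "1970-01-01 16:00",
        "1970-01-02 10:00", "1970-01-03 10:00",
    ]),
    "sulfate_mg_per_l": [15, 10, 5, 2, 3],
})

# Biased approach: averaging all five measurements gives 7 mg/L because
# 01/01/1970 contributes three of the five samples.
naive_mean = sulfate["sulfate_mg_per_l"].mean()

# Recommended approach: aggregate to daily means first, then average those.
daily_means = sulfate.groupby(sulfate["sampled_at"].dt.date)["sulfate_mg_per_l"].mean()
three_day_mean = daily_means.mean()  # (10 + 2 + 3) / 3 = 5 mg/L

print(naive_mean, three_day_mean)    # 7.0 5.0
```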
Conclusion

Data aggregation is the process where raw data are gathered, reformatted, and presented in a summary form. Data aggregation is an essential component of data management, especially nowadays when more and more data providers (e.g., Google, Facebook, National Aeronautics and Space Administration, and National Oceanic and Atmospheric Administration) are generating data at an extremely high speed. Data aggregation becomes particularly important in the era of big data because aggregated data can save storage space, and reduce energy and bandwidth costs.

Cross-References

▶ Data Cleansing
▶ Data Sharing
▶ Data Synthesis

Further Reading

Cai, S., Gallina, B., Nyström, D., & Seceleanu, C. (2019). Data aggregation processes: A survey, a taxonomy, and design guidelines. Computing, 101(10), 1397–1429.
Hamberg, S. (2018). Are you responsible for these common data aggregation mistakes? Retrieved 21st Aug 2019, from https://blog.funnel.io/data-aggregation-101.
Ma, X., Fox, P., Rozell, E., West, P., & Zednik, S. (2014). Ontology dynamics in a data life cycle: Challenges and recommendations from a geoscience perspective. Journal of Earth Science, 25(2), 407–412.
Niu, X., Wen, T., Li, Z., & Brantley, S. L. (2018). One step toward developing knowledge from numbers in regional analysis of water quality. Environmental Science & Technology, 52(6), 3342–3343.
Wen, T., Niu, X., Gonzales, M., Zheng, G., Li, Z., & Brantley, S. L. (2018). Big groundwater data sets reveal possible rare contamination amid otherwise improved water quality for some analytes in a region of Marcellus shale development. Environmental Science & Technology, 52(12), 7149–7159.

Data Aggregators

▶ Data Brokers and Data Services

Data Analyst

▶ Data Scientist

Data Analytics

▶ Business Intelligence Analytics
▶ Data Scientist

Data Anonymization

▶ Anonymization Techniques

Data Architecture and Design

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Introduction

The availability of Big Data sets has led many organizations to shift their emphases from supporting transaction-oriented data processing to supporting data-centric analytics and applications. The increasing rapidity of dynamic data flows, such as those generated by IoT applications and devices, the increasing sophistication of interoperability mechanisms, and the concomitant decreasing costs of data storage have transformed not only data acquisition and management paradigms but have also overloaded available ICT resources, thereby diminishing their capabilities to support organizational data and information requirements. Due to the difficulties of managing Big Data sets and increasingly more complex analytical models, transaction processing-focused ICT architectures that were sufficient to manage small data sets may require enhancements and re-purposing to support Big Data analytics.
A number of properties inform properly designed Big Data system architectures. Such architectures should be modular and scalable, able to adapt to support processing different quantities of data, and sustain real-time, high-volume, and high-performance computing, with high rates of availability. In addition, such architectures should support multitiered security and interoperability.

Conceptual Big Data System Architecture

The figure below limns a conceptual Big Data system architecture, comprising exogenous and indigenous services as well as system infrastructure services to support data analytics, information sharing, and data exchange.

[Figure: conceptual Big Data system architecture. Exogenous services: Access Control (security and privacy); Interoperability (data exchange and information sharing). Indigenous services: Metadata; Data Standards; Data Analytics; User Delivery (visualization and presentation). System infrastructure: Resource Administration; Data Storage; Orchestration; Messaging; Network; Platform.]

Exogenous Services

Access Control
The access control component manages the security and privacy of interactions with data providers and customers. Unlike the user delivery component, which focuses on "human" interfaces with the system, the access control component focuses on authorized access to Big Data resources via "machine-to-machine" and "human-to-machine" interfaces.

Interoperability
The interoperability component enables data exchange between different systems, regardless of data provider, recipient, or application vendor, by means of data exchange schemata and standards. Interoperability standards, implemented at inter-system service levels, establish thresholds for exchange timeliness, transaction completeness, and content quality.

Indigenous Services

Collectively, metadata and data standards ensure syntactic conformance and semantic congruence of the contents of Big Data sets. Data analytics are executed according to clearly defined lifecycles, from context delineation to presentation of findings.

Metadata
Metadata delineate the identity and provenance of data items, their transmission timeliness and security requirements, ontological properties, etc. Operational metadata reflect the management requirements for data security and safeguarding personal identifying information (PII); data ingestion, federation, and integration; data anonymization; data distribution; and data storage. Bibliographical metadata provide information about a data item's producer, applicable categories (keywords), etc., of the data item's contents. Data lineage metadata provide information about a data item's chain of custody with respect to its provenance – the chronology of data ownership, stewardship, and transformations. Syntactic metadata provide information about data structures. Semantic metadata provide information about the cultural and knowledge domain-specific contexts of a data item.

Data Standards
Data standards ensure managerial and operational consistency of data items by defining thresholds for data quality and facilitating communications among data providers and users. For example, the internationally recognized Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) provides a multilingual coded terminology that is extensively used in electronic health record (EHR) management. RxNorm, maintained by the National Institutes of Health's National Library of Medicine (NIH
NLM), provides a common ("normalized") nomenclature for clinical drugs with links to their equivalents in other drug vocabularies commonly used in pharmacology and drug interaction research. The Logical Observation Identifiers Names and Codes (LOINC), managed by the Regenstrief Institute, provides a standardized lexicon for reporting lab results. The International Classification of Diseases, ninth and tenth editions (ICD-9 and ICD-10), are also widely used.

Data Analytics
Notionally, data analytics lifecycles comprise a number of interdependent activities:

Context Delineation
Establishing the scope and context of the analysis defines the bounds and parameters of the research initiative. Problems do not occur in vacuums; rather, they and their solutions reflect the complex interplay between organizations and the larger world in which they operate, subject to institutional, legal, and cultural constraints.

Data Acquisition
Data should come from trusted data sources and be vetted to ensure their integrity and quality prior to their preparation for analytical use.

Data Preparation
Big Data are rarely useful without extensive preparation, including integration, anonymization, and validation, prior to their use.

Data Integration: Data sets frequently come from more than one provider and, thus, may reflect different cultural and semantic contexts and comply with different syntactic and symbolic conventions. Data provenance provides a basis for Big Data integration but is not sufficient by itself to ensure that the data are ready for use. It is not uncommon to give short shrift to the data integration effort, not address issues of semantic ambiguity and syntactic differences, and then attempt to address these problems much later in the data analytics lifecycle, at much greater cost.

Data Anonymization: The data set may contain personally identifiable information (PII) that must be addressed by implementing anonymization mechanisms that adhere to the appropriate protocols.

Data Validation: Notionally, data validation comprises two complementary activities: data profiling and data cleansing. Data profiling focuses on understanding the contents of the data set and the extent to which they comply with their quality specifications in terms of accuracy, completeness, nonduplication, and representational consistency. Data cleansing focuses on normalizing the data to the extent that they are of consistent quality. Common data cleansing methods include data exclusion, to remove noncompliant data; data acceptance, if the data are within tolerance limits; data consolidation of multiple occurrences of an item; and data value insertion, for example, using a default value for null fields in a data item.

Data Exploration
Data provide the building blocks with which analytical models may be constructed. Data items should be defined so that they can be used to formulate research questions and their attendant hypotheses, delineate ontological specifications and properties, and identify variables (including formats and value ranges) and parameters, in terms of their interdependencies and their provenance (including chains of custody and stewardship), to determine the data's trustworthiness, quality, timeliness, availability, and utility.

Data Staging
Because they may come from disparate sources, data may require alteration so that, for example, units of analysis are defined at the same levels of abstraction and variables use the same code sets and are within predetermined value ranges. For example, diagnostic data from one source aggregated at the county level and the same kind of data from another source aggregated at the institutional level should be transformed so that the data conform to the same units of analysis before consolidating the data sets prior to use.
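As an illustrative sketch of the cleansing operations just listed, in Python/pandas – the field names, tolerance limits, and default value below are hypothetical assumptions, not prescriptions from this entry – exclusion, acceptance within tolerance, consolidation, and default-value insertion might look as follows:

```python
# Hypothetical data-cleansing sketch: exclusion, acceptance within tolerance limits,
# consolidation of duplicate records, and default-value insertion for null fields.
import pandas as pd

records = pd.read_csv("diagnostic_records.csv")

# Data exclusion: remove records that violate a hard validity rule.
records = records[records["patient_age"].between(0, 120)]

# Data acceptance: flag values that fall outside predefined tolerance limits
# instead of silently altering them.
records["lab_value_in_tolerance"] = records["lab_value"].between(0.0, 500.0)

# Data consolidation: collapse multiple occurrences of the same item.
records = records.drop_duplicates(subset=["patient_id", "encounter_id"])

# Data value insertion: supply a default value for null fields.
records["facility_type"] = records["facility_type"].fillna("unknown")
```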
Model Development
Analytical models present interpretations of reality from particular perspectives, frequently in the form of quantitative formulas that reflect particular interpretations of data sets. Induction-based algorithms may be useful, for example, in unsupervised learning settings, where the focus may be on pattern recognition, for example, in text mining, content, and topic analyses. In contrast, deduction-based algorithms may be useful in supervised learning settings, where the emphasis is on proving, or disproving, hypotheses formulated prior to analyzing the data.

Presentation of Findings
Only fully tested models should be considered ready for presentation, distribution, and deployment. Models may be presented as, for example, Business Intelligence (BI) dashboards or peer-reviewed publications. Analytics models may also be used as likelihood predictors in programmatic contexts.

User Delivery
The user delivery component presents the results of the data analytics process to end-users, supporting the transformation of data into knowledge in formats understandable to human users.

Big Data System Infrastructure

The Big Data system's infrastructure provides the foundation for Big Data interoperability and analytics by providing the components and services that support the system's operations and management, from authorized, secure access to administration of system resources.

Resource Administration
The resource administration function monitors and manages the configuration, provisioning, and control of infrastructure and other components that collectively run on the ICT platform.

Data Storage
The data storage management function ensures reliable management, recording, storage of, and access to, persistent data. This function includes logical and physical data organization, distribution, and access methods. In addition, the data storage function collaborates with metadata services to support data discovery.

Orchestration
The orchestration function configures the various architectural components and coordinates their execution so that they function as a cohesive system.

Messaging
The messaging function is responsible for ensuring reliable queuing and transmission of data and control signals between system components. Messaging may pose special problems for Big Data systems because of the computational efficiency and velocity requirements of processing Big Data sets.

Network
The network management function coordinates the transfer of data (messages) among system infrastructure components.

Platform
The ICT platform comprises the hardware configuration, operating system, software framework, and any other element on which components or services run.

Conceptual Big Data System Design Framework

User Requirements
The first activity is to determine that the research initiative meets all international, national, epistemic, and organizational ethical standards. Once this has been done, the next activity is formally to define the users' data and information requirements, the data's security and privacy requirements, and anonymization requirements. Also, users' delivery requirements (visualization and presentation) should be defined. A research question is developed to reflect the users' requirements, followed by a formulation of how the
research question can be translated into a set of quantifiable, testable hypotheses.

System Infrastructure Requirements
In addition to defining user requirements, it should be determined that the system infrastructure can provide the necessary resources to undertake the research initiative.

Data Acquisition Requirements
Once the research initiative has been approved for execution, sources of the data, their security requirements, and their attendant metadata have to be identified.

Interoperability Requirements
Interoperability requirements should also be defined: for example, what data are to be shared; what standards and service level agreements (SLAs) are to be enforced; what APIs are to be used?

Project Planning and Execution
A project plan, comprising a work breakdown schedule (WBS), including time and materials allocations, should be prepared prior to project start-up. Metrics and monitoring regimens should be defined and operationalized, as should management (progress) reporting schedules and procedures.

Caveats and Future Trends

The growth and propagation of Big Data sets and their applications will continue, reflecting the impetus to develop greater ICT efficiencies and capacities to support data management and knowledge creation. To ensure the development and proper use of Big Data analytics and applications, there are, however, a number of issues that should be addressed. In the absence of an international juridical framework to govern the use of Big Data development and analytics, rules to safeguard the integrity, privacy, and security of personally identifiable information (PII) differ by country, frequently leading to confusion and the proliferation of legal ambiguities. Also, it is not uncommon for multiple, often very different and incompatible, syntaxes, lexica, and ontologies to be in use within knowledge communities, so that data may require extensive normalization prior to their use. There are also different, competing conveyance and transportation frameworks currently in use that hamper interoperability.

Deeply troubling is the absence of a clearly defined, internationally accepted, and rigorously enforced code of ethics, with formally specified norms, roles, and responsibilities that apply to the conduct of Big Data analytics and application development. The results produced by Big Data systems have already been misused; for example, pattern-based facial recognition software is currently used prescriptively to oppress minority populations. Big Data analytics may also be misused to support prescriptive medicine without considering the risks and consequences to individuals of misdiagnoses or adverse events. In a global economy predicated on Big Data exchanges and information sharing, developing such a code of ethics requires collaboration on epistemic, national, and international levels.

Further Reading

NIST Big Data Public Working Group Reference Architecture Subgroup. (2015). NIST big data interoperability framework: Volume 6: Reference architecture. Washington, DC: US Department of Commerce, National Institute of Standards and Technology. Downloaded from https://bigdatawg.nist.gov.
Santos, M. Y., Sá, J., Costa, C., Galvão, J., Andrade, C., Martinho, B., Lima, F. V., & Costa, E. (2017). A big data analytics architecture for industry 4.0. In A. Rocha, A. Correia, H. Adeli, L. Reis, & S. Costanzo (Eds.), Recent advances in information systems and technologies. WorldCIST 2017 (Advances in intelligent systems and computing, Vol. 570, pp. 175–184). Cham: Springer. (Porto Santo Island, Madeira, Portugal).
Viana, P., & Sato, L. (2015). A proposal for a reference architecture for long-term archiving, preservation, and retrieval of big data. In 13th international conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (pp. 622–629). Beijing: IEEE Computer Society.
Data Bank

▶ Data Repository

Data Brokers

▶ Data Mining

Data Brokers and Data Services

Abdullah Alowairdhi and Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Data aggregators; Data consolidators; Data resellers

Background

Expert data brokers have been around for a long time, gathering data from media subscriptions (e.g., newspapers and magazines), mail-order retailers, polls, surveys, travel agencies, symposiums, contests, product registration and warranties, payment handling companies, government records, and more (CIPPIC 2006). In recent years, particularly since the arrival of the Internet, the data brokers' industry has expanded swiftly with the diversification of data capture and consolidation methods. As a result, a variety of products and services are offered (Kitchin 2014).

Moreover, on a daily basis, individuals engage in a variety of online activities that disclose their personal information. Such online activities include using mobile applications, buying a home or a car, subscribing to a publication, [...] stores or over an online catalog, participating in surveys, surfing the web, chatting with friends on a social media platform, entering sweepstakes, or subscribing to news websites. These daily activities generate a variety of information about those individuals which, in turn, in many instances is delivered or sold to data brokers (Ramirez et al. 2014).

Data Brokers

Data brokers aggregate data from a diversity of sources. In addition to existing open data sources, they also buy or rent individuals' data from third-party companies. The data collected may contain web browsing activities, bankruptcy information, warranty registrations, voting information, consumer purchase data, and other everyday web interaction activities. Typically, data brokers do not acquire data directly from individuals; hence, most individuals are unaware that their data are collected and consumed by the data brokers. Consequently, it is possible that an individual's detailed life could be constructed and packaged as a final product by processing and analyzing data components supplied from different data brokers' sources (Anthes 2015).

Data brokers acquire and store individuals' data as products in a confidential data infrastructure, which stores, shares, and consumes data through networked technologies (Kitchin and Lauriault 2014). The data will be rented or sold for a profit. The data products contain lists of prospective individuals who meet certain conditions, including details like names, telephone numbers, addresses, and e-mail addresses, as well as data elements such as age, gender, income, presence of children, ethnicity, credit status, credit card ownership, home value, hobbies, purchasing habits, and background status. These derived data product collections, where data brokers have added value through data analysis and data integration methods, are used to target marketing and advertising promotions, socially classify individuals, evaluate credit ratings, and provide tracing services (CIPPIC 2006).

Data integration and resale, accompanied by correlated value-added services such as data analysis, are a multibillion-dollar industry. This industry trades massive amounts of data and derived information hourly across a range of markets specialized in finance, retail, logistics, tourism, real estate, health, political voting, business intelligence, private security, and more. These data cover almost all aspects of everyday life, including public administration, communications, consumption of goods and media, travel, leisure, crime, and social media interactions (Kitchin 2014).


Data Sources

Selling data to brokers has become a major revenue stream for many companies. For example, retailers regularly sell data regarding customers' transactions, such as credit card details, customers' purchase information and loyalty programs, customer relationship management data, and subscription information. Internet stores sell clickstream data concerning how a person navigated through a website and the time spent on different pages. Similarly, media companies, such as newspapers, radio, and television stations, gather the data contained within their content (e.g., news stories and advertisements). Likewise, social media companies aggregate the metadata and content of their users in order to build individuals' profiles and produce their own data products to be sold to data brokers. For example, Facebook uses the user networks, uploaded content, and user profiles of its millions of active users. The collected data, such as users' comments, videos, photos, and likes, are used to form a set of advertising products like "Lookalike Audiences, Partner Categories, and Managed Custom Audiences." Such advertising products also partner with well-known data brokers and data marketers such as Acxiom, Datalogix, Epsilon, and BlueKai in order to integrate non-Facebook purchasing and behavior data (Venkatadri et al. 2018).

In various ways, then, individuals are handing over their own data, knowingly or unknowingly, in great volumes as subscribers, buyers, registrants, credit card holders, members, contest entrants, donors, survey participants, and web inquirers (CIPPIC 2006). Moreover, because creating, managing, and analyzing data is a specialized task, many firms subcontract their data requirements to data processing and analytics companies. By offering the same types of data services across clients, such companies can create extensive datasets that can be packaged and utilized to produce newly constructed data which provide further insights than any single source of data. In addition to these privately obtained data, data brokers also collect and consolidate public datasets such as census records, aggregate spatial data such as property records, and rent or buy data from charities and non-governmental organizations.

Methods for Data Collection and Synthesis

Data brokers aggregate data from different sources using various methods. Firstly, data brokers use crawlers and scrapers (software that extracts values from websites and transfers these values to the data broker's data storage) to assemble publicly accessible web-based data. As an example, data brokers use software like Octoparse and import.io to decide what websites should be crawled and scraped, what data elements in each website to harvest, and how frequently. Secondly, data brokers obtain and process printed information from local government records and telephone book directories and then either process these documents with OCR (optical character recognition) scanners to produce digital records or employ data entry specialists to create digital records manually. Thirdly, using daily data feeds, data brokers coordinate batch collections of data from various sources. Lastly, data brokers regularly access data sources through an API (application programming interface), which allows data to stream into the data brokers' infrastructure. Whatever the method, data brokers may accumulate excessive data beyond their needs, often because they cannot acquire only the subset of data elements they demand. For example, some data sources sell a massive set of data elements as part of a fixed dataset deal even though the data broker does not request all of these elements. Consequently, the data broker will utilize those extra data elements in some other way, such as for matching or authentication purposes, or to build models for new topics (Ramirez et al. 2014).
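As a concrete illustration of the first method above, the following minimal Python sketch follows the same crawl-and-scrape pattern. It assumes the requests and BeautifulSoup libraries rather than the commercial tools named in this section, and the URL, page markup, and field names are hypothetical placeholders.

    import requests
    from bs4 import BeautifulSoup

    def scrape_directory_page(url):
        """Fetch one publicly accessible page and pull out simple contact fields."""
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        records = []
        for entry in soup.select("div.listing"):  # hypothetical page markup
            name = entry.select_one(".name")
            phone = entry.select_one(".phone")
            records.append({
                "name": name.get_text(strip=True) if name else None,
                "phone": phone.get_text(strip=True) if phone else None,
            })
        return records

    if __name__ == "__main__":
        rows = scrape_directory_page("https://example.com/public-directory")
        print(len(rows), "records harvested")

In practice, brokers run such collectors on a schedule and merge the harvested records into the storage infrastructure described above.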
Data Markets

Data brokers construct an immense relational data infrastructure by assembling data from a variety of sources. For example, Epsilon is claimed to own loyalty card membership data covering 300 million members of companies worldwide, with a database holding data related to 250 million individuals in the United States alone (Kitchin 2014). Acxiom is declared to have assembled a databank covering approximately 500 million active individuals globally (in the United States, around 126 million households and 190 million individuals), with around 1500 information elements per individual. Every year Acxiom handles more than 50 trillion data transactions, and as a result its income surpasses one billion dollars (Singer 2012). Moreover, it also administers separate client databases for 47 of the Fortune 100 enterprises (Singer 2012). In another example, Datalogix claims to collect data relating to offline purchases worth over a trillion dollars (Kitchin 2014). Other data brokers and analysis firms, including Alliance Data Systems, TransUnion, ID Analytics, Infogroup, CoreLogic, Equifax, Seisint, Innovis, ChoicePoint, Experian, Intelius, Recorded Future, and eBureau, all have their own unique data services and data products. For example, eBureau evaluates prospective clients on behalf of credit card companies, lenders, insurers, and educational institutions, and Intelius provides people-search services and background checks (Singer 2012).

In general, data broker companies want a wide variety of data, covering as large a segment of the population as possible, that are highly relational and indexical in nature. The more data a broker can retrieve and integrate, the more likely their products are to work optimally and effectively, and the greater the competitive advantage they obtain over their competitors. By gathering data together and analyzing and organizing them appropriately, data brokers can create derived data and individual and area profiles and undertake predictive modeling to analyze individuals' behavior under different situations and in different areas. This allows more effective identification of targeted individuals and provides an indication of an individual's behavior in order to reach a predetermined outcome, e.g., choosing and buying specific items. Acxiom, for example, seeks to merge and process mobile, offline, and online data so as to generate a complete view of an individual and to form comprehensive profiles and solid predictive models (Singer 2012). Such information and models are very beneficial to companies because they are empowered to focus their marketing and sales efforts. The risk mitigation data products increase the possibility of successful transactions and decrease expenses relating to wastage and loss. By utilizing such products, companies thus aim to be more effective and competent in their operations.


The Hidden Business

Curiously, little serious attention has been paid to data brokers' operations, given the size and variety of individual data that they hold and how their data products are utilized to socially sort and target individuals and households. Indeed, there is a lack of academic research and media coverage regarding the consequences of data brokers' work and products. This is partly because the data broker industry is somewhat opaque and concealed, not wanting to draw public attention and weaken public trust in its data assets and activities, which might trigger public awareness campaigns for accountability, regulation, and transparency.

Currently, the data broker industry is largely unregulated, and brokers are not obliged to give individuals access to the data held about them. In addition, data brokers are not compelled to correct errors in individuals' data (Singer 2012). Yet these data products could have profoundly harmful consequences for the services and opportunities provided to those individuals, such as whether a job will be offered, a credit application will be approved, an insurance policy will be issued, or a tenancy approved, and what price goods and services might cost based on recognized risk and value to companies (Kitchin and Lauriault 2014).


Benefits and Risks

Data brokers' products offer some benefits to individuals, such as improved and innovative product offerings, targeted advertisements, and help in avoiding fraud, just to name a few. The risk mitigation product, in particular, delivers substantial benefits to individuals by helping prevent fraudsters from impersonating innocent individuals. Targeted advertisements benefit individuals by enabling them to find and enjoy the commodities and services they want and prefer more easily. Rival small businesses utilize data brokers' products to be able to contact specific individuals and offer them innovative and improved products. However, there are a number of possible risks from data brokers' compilation and use of individuals' data. For instance, if an individual's transaction is rejected because of an error in the risk mitigation product, the individual could be affected without realizing the reason. In this case, the individual not only cannot take steps to stop the problem from recurring but is also deprived of the immediate benefit.

Likewise, the scoring methods used in marketing products are not clear to individuals. This means individuals are incapable of mitigating the destructive effects of lower scores. As a result, individuals may receive inferior levels of service from companies, for example, getting limited advertisements or offers of only subprime credit. Furthermore, marketers may use individuals' data to aid the distribution of commercial product advertisements concerning health, finances, or ethnicity, which some individuals might find disturbing and which could reduce their confidence in the marketplace.

Marketers could also use apparently harmless data inferences about individuals in ways that raise concerns. For example, a data broker could infer that an individual belongs in a "Speedy Drivers" data segment, which would lead a car dealership to offer that individual a discount on sports cars. However, an insurance company that uses the same data segment may deduce that the individual engages in unsafe behavior and thus increase his or her insurance premium. Lastly, the people-search product can be employed to facilitate harassment or stalking and might reveal information about victims of domestic violence, police officers, public officials, prosecutors, or other types of individuals, which might be used for revenge or other harm (Ramirez et al. 2014).


Choice as an Individual

Opt-outs are often invisible and imperfect. Data brokers may give individuals an opt-out choice for their data. Nevertheless, individuals probably do not know how to exercise this choice or do not even know that the choice is offered. Additionally, individuals may find the opt-outs confusing, because the data brokers' opt-out websites do not explicitly state whether the individual can opt out of all uses of his or her data. Even if individuals know of the data brokers' websites and take the time to discover and use the opt-outs, they might still not know their limitations. For risk mitigation products, various data brokers do not offer individuals access to their data or enable them to correct mistakes. For marketing products, the scope of individuals' opt-out choice is not made clear (Ramirez et al. 2014).

Conclusion

Generally, data brokers gather data regarding individuals from a broad range of publicly available sources such as commercial and government records. Data brokers not only use the raw data collected from these sources but also use derived data to develop and extend their products. The three main types of products that data brokers produce for a wide range of industries are (1) the people-search product, (2) the marketing product, and (3) the risk mitigation product. These products are offered (i.e., sold or rented) as data packages to data brokers' clients. Several data collection methods are used by data brokers, such as web crawlers and scrapers, printed information like telephone directories, batch processing through daily feeds, and integration through an API. There are both benefits and risks for the targeted individuals in the data brokers' business. Since the data broker market is opaque, the choices to opt out of data collection are also unclear. Individuals need to know their opt-out rights in order to protect sensitive personal information.


Further Reading

Anthes, G. (2015, January 1). Data brokers are watching you. Retrieved February 27, 2019, from https://dl.acm.org/citation.cfm?doid=2688498.2686740.
CIPPIC. (2006). On the data trail: How detailed information about you gets into the hands of organizations with whom you have no relationship. A report on the Canadian data brokerage industry. Retrieved from https://idtrail.org/files/DatabrokerReport.pdf.
Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures and their consequences. In R. Kitchin (Ed.), Small data, data infrastructures and data brokers (Rev. ed., pp. 27–47). London: Sage.
Kitchin, R., & Lauriault, T. (2014, January 8). Small data, data infrastructures and big data by Rob Kitchin, Tracey Lauriault: SSRN. Retrieved February 27, 2019, from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2376148.
Ramirez, E., Brill, J., Ohlhausen, M., Wright, J., & McSweeny, T. (2014). Data brokers: A call for transparency and accountability. Retrieved from https://www.ftc.gov/system/files/documents/reports/data-brokers-call-transparency-accountability-report-federal-trade-commission-may-2014/140527databrokerreport.pdf.
Singer, N. (2012, June 17). Mapping, and sharing, the consumer genome. Retrieved February 27, 2019, from https://www.nytimes.com/2012/06/17/technology/acxiom-the-quiet-giant-ofconsumer-database-marketing.htm.
Venkatadri, G., Andreou, A., Liu, Y., Mislove, A., Gummadi, K., Loiseau, P., . . . Goga, O. (2018, May 1). Privacy risks with Facebook's PII-based targeting: Auditing a data broker's advertising interface. In IEEE conference publication. Retrieved February 27, 2019, from https://ieeexplore.ieee.org/abstract/document/8418598.


Data Center

Mél Hogan
Department of Communication, Media and Film, University of Calgary, Calgary, AB, Canada


Synonyms

Data storage; Datacenter; Factory of the twenty-first century; Server farm; Cloud


Definition/Introduction

Big Data requires big infrastructure. A data center is largely defined by the industry as a facility with computing infrastructure, storage, and backup power. Its interior is usually designed as rows of racks containing stacked servers (a motherboard and hard drive). Most data centers are designed with symmetry in mind, alternate between warm and cool aisles, and are dimly lit and noisy. The data center functions as a combination of software and hardware designed to process data requests – to receive, store, and deliver – to "serve" data, such as games, music, emails, and apps, to clients over a network. It has redundant connections to the Internet and is powered from multiple local utilities, diesel generators, battery banks, and cooling systems. Our ever-growing desire to measure and automate our world has seen a surge in data production, as Big Data.

Today, the data center is considered the heart and brain of Big Data and the Internet's networked infrastructure. However, the data center would be defined differently across the last few decades, as it underwent many conceptual and material transformations since the general-purpose computer was first imagined, and instantiated, in the 1940s.

Knowledge of the modern-day data center's precursors is important because each advancement marks an important shift from elements internal to external to the apparatus, namely, in the conception of storage as memory. Computers, as we now use them, evolved from the mainframe computer as data center, which today supports and serves Big Data and our digital networked communications from afar. Where and how societal data are stored has always been an important social, historical, and political question, as well as one of science and engineering, because the uses and deployments of data can vary based on context, governmental control and motivations, and level of public access.

One of the most important early examples of large-scale data storage – but one which differs from today's data center in many ways – was ENIAC (Electronic Numerator, Integrator, Analyzer, and Computer), built in 1946 for the US Army Ballistic Research Laboratory to store artillery firing codes. The installation took up 1800 sq. ft. of floor space, weighed 30 t, was expensive to run, buggy, and very energy intensive. It was kept in use for nearly a decade.

In the 1960s, there was no longer a hard distinction between processing and storage – large mainframes were also data centers. The next two decades saw the beginning and evolution of microcomputers (now called "servers"), which would render the mainframe and data center ostensibly, and if only temporarily, obsolete. Up until that point, mainframe computers used punch cards and punch tape as computer memory, a technology pioneered by the textile industry for use in mechanized looms. Made possible by the advent of integrated circuits, the 1980s saw a widespread adoption of personal computers at home and in the office, relying on cassette tape recorders and, later, floppy disks as machine memory. The mainframe computer was too big and too expensive to run, and so the shift to personal computing seemed to offer mitigation of these issues, which would see significant growth once again in the 1990s due to the widespread implementation of a new client-server computing model.


Today's Data Center

Since the popularization of the public Internet in the 1990s, and especially the dot-com bubble from 1997 to 2000, data have exploded as a commodity. To put this commodity into perspective, each minute of every day, more than 200 million emails are sent, more than 2 million Google searches are performed, over 48 h of video is uploaded to YouTube, and more than 4 million posts appear on Facebook. Data are exploding also at the level of real-time data for services like Tinder, Uber, and AirBnB, as well as the budding self-driving car industry, smart city grids and transportation, mass surveillance and monitoring, e-commerce, insurance and healthcare transactions, and – perhaps most significantly today – the implementation of the Internet of Things (IoT), virtual and augmented reality, and gaming. All of these cloud-based services require huge amounts of data storage and energy to operate. However, despite the growing demand for storage – considering that 90% of data have been created in the last 2 years – data remain largely relegated to the realm of the ephemeral and immaterial in the public imaginary, a conception further upheld by the metaphor of "the cloud" and "cloud computing." Cloud servers are no different from other data centers in terms of their materiality. They differ simply in how they provide data to users. The cloud relies on virtualization and a cluster of computers as its source to break down requests into smaller component parts (to more quickly serve up the whole) without all data (as packets) necessarily following the same physical/geographical path.

For the most part, users cannot access the servers on which their data and content are stored, which means that questions of data sovereignty, access, and ownership are also important threads in the fabric of our modern sociotechnical communication system. By foisting a guarded distance between users and their data, users are disconnected also from a proper understanding of networked culture and the repercussions of mass digital circulation and consumption.

This distance serves companies' interests insofar as it maintains an illusion of fetching data on demand, in and from no apparent space at all, while also providing a material base that conjures up an efficient and secure system in which we can entrust our digital lives.

In reality, there are actual physical servers in data centers that contain the world's data (Neilson et al. 2016). The data center is part of a larger communications infrastructure that stores and serves data for ongoing access and retrieval. The success of the apparatus relies on uninterrupted and seamless transactions at increasingly rapid speeds. The data center can take on various forms, emplacements, and purposes; it can be imagined as a landing site (the structure that welcomes terrestrial and undersea fiber optic cables), or as a closet containing one or two locally maintained servers. But generally speaking, the data center we imagine (if we imagine one at all) is the one put on virtual display by Big Tech companies like Google, Microsoft, Facebook, Apple, Amazon, etc. (Vonderau and Holt 2015). These companies display and curate images of their data centers online and offer virtual tours to highlight their efficiency and design – and, increasingly, their sustainability goals and commitments to the environment. While these visual representations of data center interiors are vivid, rich, and often highly branded, the data center exteriors are for the most part boxy and nondescript. The sites are generally highly monitored, guarded, and built foremost as a kind of fortress to withstand attacks, intruders, and security breaches.

Because the scale of data centers has gotten so large, they are often referred to as server farms, churning over data day in and day out. Buildings housing data centers can be the size of a few football fields, require millions of gallons of water daily to cool servers, and use the same amount of electricity as a midsize US town. Smaller data centers are often housed in buildings left over and adapted from defunct industry – from underground bunkers to hotels to bakeries to printing houses to shopping malls. Data centers (in the USA) have been built along former trade routes or railroad tracks and are often developed in the confusing context of a new but temporary market stability, itself born of economic downturns in other local industries (Burrington 2015). Advances have been made in the last 5 years to reduce the environmental impacts of data centers, at the level of energy use in particular, and this is done in part by locating data centers in places with naturally cooler climates and stable power grids (such as in Nordic countries). The location of data centers is ultimately dependent on a confluence of societal factors, of which political stability, the risk of so-called natural disasters, and energy security remain at the top.


Conclusion

Due in part to the secretive nature of the industry and the highly skilled labor of the engineers and programmers involved, scholars interested in Big Data, new media, and networked communications have had to be creative in their interventions. This has been accomplished by drawing attention to the myth of the immaterial as a first step to engaging everyday users and politicizing the infrastructure by scrutinizing its economic, social, and environmental impacts (Starosielski 2015). The data center has become a site of inquiry for media scholars to explore and counter the widespread myths about the immateriality of "the digital" and cloud computing, its social and environmental impacts, and the political economy and ecology of communications technology more broadly.

Without denying them their technological complexities, data centers, as we now understand them, are crucial components of a physical, geographically located infrastructure that facilitates our daily online interactions on a global scale. Arguably, the initial interest in data centers by scholars was to shed light on the idea of data storage – the locality of files, on servers, in buildings, in nations – and to demonstrate the effects of a scale and speed of communication never before matched in human history. Given the rising importance of including the environment and climate change in academic and political discourse, data centers are also being assessed for their impacts on the environment and the increasing role of Big Tech in managing natural resources.

The consumption rates of water and electricity by the industry, for example, are considered a serious environmental impact because such resource use has, until recently, been unsustainable for the mass upscaling of its operations. Today, it is no longer unusual to see Big Tech manage forests (Facebook), partner with wastewater management plants (Google), use people as human Internet content moderators/filters (Microsoft), or own large swaths of the grid (Amazon) to power data centers. In many ways, the data industry is impacting both landscape and labor conditions in urban, suburban, rural, and northern contexts, each with its own set of values and infrastructural logics about innovation at the limits of the environment (Easterling 2014).


Cross-References

▶ Big Data
▶ Cloud Services
▶ Data Repository
▶ Data Storage
▶ Data Virtualization


Further Reading

Burrington, I. (2015). How railroad history shaped internet history. The Atlantic, November 24. http://www.theatlantic.com/technology/archive/2015/11/how-railroad-history-shaped-internet-history/417414.
Easterling, K. (2014). Extrastatecraft: The power of infrastructure space. London: Verso.
Neilson, B., Rossiter, N., & Notley, T. (2016). Where's your data? It's not actually in the cloud, it's sitting in a data centre. August 30, 2016. Retrieved 20 Oct 2016, from http://theconversation.com/wheres-your-data-its-not-actually-in-the-cloud-its-sitting-in-a-data-centre-64168.
Starosielski, N. (2015). The undersea network. Durham: Duke University Press Books.
Vonderau, P., & Holt, J. (2015). Where the internet lives: Data centers as cloud infrastructure. In L. Parks & N. Starosielski (Eds.), Signal traffic: Critical studies of media infrastructures. Champaign: University of Illinois Press.


Data Cleaning

▶ Data Cleansing


Data Cleansing

Fang Huang
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA


Synonyms

Data cleaning; Data pre-processing; Data tidying; Data wrangling


Introduction

Data cleansing, also known as data cleaning, is the process of identifying and addressing problems in raw data to improve data quality (Fox 2018). Data quality is broadly defined as the precision and accuracy of data, which can significantly influence the information interpreted from the data (Broeck et al. 2005). Data quality issues usually involve inaccurate, imprecise, and/or incomplete data. Additionally, large amounts of data are being produced every day, and the intrinsic complexity and diversity of the data result in many quality issues. To extract useful information, data cleansing is an essential step in a data life cycle.

Data Life Cycle
"A data life cycle represents the whole procedure of data management" (Ma et al. 2014), and data cleansing is one of the early stages in the cycle. The cycle consists of six main stages (modified from Ma et al. 2014):

1. Conceptual model: Data science problems often require a conceptual model to define target questions, research objects, and applicable methods, which helps define the type of data to be collected. Any changes to the conceptual model will influence the entire data life cycle. This step is essential, yet often ignored.

2. Collection: Data can be collected via various sources – survey (part of a group), census (whole group), observation, experimentation, simulation, modeling, scraping (automated online data collection), and data retrieval (data storage and provider). Data checking is needed to reduce simple errors and missing and duplicated values.
3. Cleansing: Raw data are examined, edited, and transformed into the desired form. This stage will solve some of the existing data quality issues (see below). Data cleansing is an iterative task. During stages 4–6, if any data problems are discovered, data cleansing must be performed again.
4. Curation and sharing: The cleaned data should be saved, curated, and updated in local and/or cloud storage for future use. The data can also be published or distributed between devices for sharing. This step dramatically reduces the likelihood of duplicated efforts. Moreover, in scientific research, open data is required by many journals and organizations for study integrity and reproducibility.
5. Analysis and discovery: This is the main step for using data to gain insights. By applying appropriate algorithms and models, trends and patterns can be recognized from the data and used for guiding decision-making processes.
6. Repurposing: The analysis results will be evaluated, and, based on the discovered information, the whole process could be performed again for the same or a different target.

Data cleansing plays an essential role in the data life cycle. Data quality issues can cause extracted information to be distorted or unusable – a problem that can be mitigated or eliminated through data cleansing. Some issues can be prevented during data collection, but many have to be dealt with in the data cleansing stage. Data quality issues include errors, missing values, duplications, inconsistent units, inaccurate data, and so on. Methods for tackling those issues will be discussed in the next sections.


Data Cleansing Process

Data cleansing deals with data quality issues after data collection is complete. The data cleansing process can be generalized into "3E" steps: examine, explore, and edit. Finding data issues through planning and examining is the most effective approach. Some simple issues like inconsistent numbers and missing values can be easily detected. However, exploratory analysis is needed for more complicated cases. Exploratory analyses, such as scatter plots, boxplots, distribution tests, and others, can help identify patterns within a dataset, thereby making errors more detectable. Once detected, the data can be edited to address the errors.

Examine
It is always helpful to define questionable features in advance, which include data type problems, missing or duplicate values, and inconsistency and conflicts. A simple reorganization and indexing of the dataset may help discover some of those data quality issues.

– Data type problems: In data science, the two major types of data are categorical and numeric. Categorical values are normally representations of qualitative features, such as job titles, names, nationalities, and so on. Occasionally, categorical values need to be encoded with numbers to run certain algorithms, but these remain distinct from numeric values. Numeric values are usually quantitative features, which can be further divided into discrete or continuous types. Discrete numeric values are separate and distinct, such as the population of a country or the number of daily transactions in a stock market; continuous numeric values usually come with decimals, such as the index of a stock market or the height of a person. For example, the age column of a census contains discrete numeric values, and the name column contains categorical data.

– Missing or duplicate values: These two issues are easily detected through reorganizing and indexing the dataset but can be hard to repair. For duplicate values, simply removing the duplicated ones can solve the problem. Missing data can be filled by checking the original data records or metadata, when available. Metadata are the supporting information of data, such as the methods of measurement, environmental conditions, location, or spatial relationships of samples. However, if the required information is not in the metadata, some exploratory analysis algorithms may help fill in the missing values.
– Inconsistency and conflicts: Inconsistency and conflicts happen frequently when merging two datasets. Merging data representing the same samples or entities in different formats can easily cause duplication and conflicts. Occasionally, the inconsistency may not be solvable at this stage. It is acceptable to flag the problematic data and address them after the "analysis and discovery" stage, once a better overview of the data is achieved.
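As a brief illustration of the Examine step, the following Python sketch uses Pandas (one of the packages discussed under Tools below) to surface data type problems, missing values, and duplicates; the toy table and column names are invented for the example.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "name": ["Ana", "Ben", "Ben", "Cleo"],          # categorical
        "age": [34, 51, 51, np.nan],                     # discrete numeric, one missing
        "income": [52000.0, 61000.0, 61000.0, 48000.0],  # continuous numeric
    })

    print(df.dtypes)              # check which columns are categorical vs. numeric
    print(df.isna().sum())        # count missing values per column
    print(df.duplicated().sum())  # count exact duplicate rows
    print(df.describe())          # quick summary to spot implausible values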
Explore
This stage uses exploratory analysis methods to identify problems that are hard to find by simple examination. One of the most widely used exploratory tools is visualization, which improves our understanding of the dataset in a more direct way. There are several common methods for doing exploratory analysis on one-dimensional, two-dimensional, and multidimensional data. The dimension here refers to the number of features of the data. One- and two-dimensional data can be analyzed more easily, but multidimensional data are usually reduced to lower dimensions for easier analysis and visualization. Below is a partial list of the methods that can be used in the Explore step.

One-Dimensional
– Boxplot: A distribution of one numeric data series summarized with five numbers (Fox 2018). The box shows the minimum, first quartile, median, third quartile, and maximum values, allowing any outlier data points to be easily identified.
– Histogram: A distribution representation of one numeric data series. The numeric values are divided into bins (x-axis), and the number of points in each bin is counted (y-axis). The x and y axes are interchangeable. Its shape will change with the bin size, offering more freedom than a boxplot.

Two-Dimensional
– Scatter plot: A graph of the relationship between two numeric data series, whether linear or nonlinear.
– Bar graph: A chart to present the characteristics of categorical data. One axis represents the categories, and the other axis shows the values associated with each category. There are also grouped and stacked bar graphs to show more complex information.

Multidimensional
– Principal component analysis (PCA): A statistical algorithm to analyze the correlations among multivariate numeric values (Fox 2018). The multidimensional data will be reduced to two orthogonal components because it is much easier to explore the relationships among data features on a two-axis plane.

There are many other visualization and non-visualization methods in addition to the above. Data visualization techniques are among the most popular ways to identify data quality problems – they allow recognition of outliers as well as relationships within the data, including unusual patterns and trends for further analysis.
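A minimal sketch of these Explore methods, assuming Matplotlib and Scikit-learn (both mentioned under Tools below) and synthetic data, might look as follows.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x = rng.normal(50, 10, 200)            # one numeric series
    y = 2 * x + rng.normal(0, 5, 200)      # a second, correlated series
    multi = rng.normal(size=(200, 6))      # a multidimensional dataset

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].boxplot(x)                     # one-dimensional: boxplot
    axes[1].hist(x, bins=20)               # one-dimensional: histogram
    axes[2].scatter(x, y, s=10)            # two-dimensional: scatter plot
    plt.tight_layout()
    plt.show()

    pca = PCA(n_components=2)              # multidimensional: reduce to two components
    reduced = pca.fit_transform(multi)
    print("explained variance ratio:", pca.explained_variance_ratio_)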

Edit
After identification of the problems, researchers need to decide how to tackle them, and there are multiple methods to edit the dataset. (1) Data types need to be adjusted for consistency. For instance, revise wrongly input numeric data to the correct values, or convert numeric or categorical data to meet the requirements of the selected algorithm. (2) Fill in the missing values and replace or delete duplicated values using the information in metadata (see the Examine section). For example, a scientific project called "Census of Deep Life" collected microbial life samples below the seafloor along with environmental condition parameters, but some pressure values were missing. In this case, the missing pressure values were calculated using depth information recorded in metadata. (3) For inconsistency and conflicts, data conversion is needed. For example, when two datasets have different units, they should be converted before merging. (4) Some problems cannot be solved with the previous techniques and should be flagged within the dataset. In future analyses, those points can be noted and dealt with accordingly. For example, random forest, a type of machine learning algorithm, has the ability to impute missing values using existing data and relationships.
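The following short Pandas sketch illustrates editing operations (1)–(4) on a toy table; the column names, the depth-to-pressure fill rule, and the flagging threshold are illustrative assumptions, not details of the Census of Deep Life project.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "depth_m": [10.0, 250.0, 250.0, 1300.0],
        "pressure_kpa": [np.nan, 2600.0, 2600.0, np.nan],
        "temp_f": ["50", "39", "39", "35"],   # numeric values stored as text
    })

    df["temp_f"] = pd.to_numeric(df["temp_f"])            # (1) fix data types
    df = df.drop_duplicates()                              # (2) remove duplicate rows
    df["pressure_kpa"] = df["pressure_kpa"].fillna(        # (2) fill missing values from
        df["depth_m"] * 10.0)                              #     related depth information
    df["temp_c"] = (df["temp_f"] - 32) * 5 / 9             # (3) convert units before merging
    df["flagged"] = df["pressure_kpa"] > 20000             # (4) flag unresolved points
    print(df)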
Overview
Before and during the data cleansing process, some principles should be kept in mind for best results: (1) Planning and pre-defining are critical – they give targets for the data cleansing process. (2) Use proper data structures to keep data organized and improve efficiency. (3) Prevent data problems in the collection stage. (4) Use unique IDs to avoid duplication. (5) Keep a good record of metadata. (6) Always keep copies before and after cleansing. (7) Document all changes.


Tools

Many tools exist for data cleansing. There are two primary types: data cleansing software and programming packages. Software is normally easier to use, but with less flexibility; the programming packages have a steeper learning curve, but they are free of cost and can be extremely powerful.

– Software: Examples of well-known software include OpenRefine, Trifacta (Data) Wrangler, Drake, TIBCO Clarity, and many others (Deoras 2018). They often have built-in workflows and can do some statistical analysis.
– Programming packages: Packages written in free programming languages, such as Python and R, are becoming more and more popular in the data science industry. Python is powerful, easy to use, and runs on many different systems. The Python development community is very active and has created numerous data science libraries, including Numpy, Scipy, Scikit-learn, Pandas, Matplotlib, and so on. Pandas and Matplotlib have very powerful and easy-to-use functions to analyze and visualize different data formats. Numpy, Scipy, and Scikit-learn are used for statistical analysis and machine learning training. R is another programming language, similar to Python, which also has a variety of statistical packages. Some widely used R packages include dplyr, foreign, ggplot2, and tidyr, all of which are useful in data manipulation and visualization.


Conclusion

Data cleansing is essential to ensure the quality of data input for analytics and discovery, which will in turn extract the appropriate and accurate information for future plans and decisions. This is particularly important when large tech companies, like Facebook and Twitter, 23andMe, Amazon, and Uber, and international collaborative scientific projects are producing huge amounts of social media, genetic, ecommerce, travel, and scientific data, respectively. Such data from various sources may have very different formats and quality, making data cleansing an essential step in many areas of science and technology.

Further Reading

Deoras, S. (2018). 10 best data cleaning tools to get the most out of your data. Retrieved 8 Mar 2019, from https://www.analyticsindiamag.com/10-best-data-cleaning-tools-get-data/.
Fox, P. (2018). Data analytics course. Retrieved 8 Mar 2019, from https://tw.rpi.edu/web/courses/DataAnalytics/2018.
Kim, W., Choi, B. J., Hong, E. K., Kim, S. K., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1), 81–99.
Ma, X., Fox, P., Rozell, E., West, P., & Zednik, S. (2014). Ontology dynamics in a data life cycle: Challenges and recommendations from a Geoscience Perspective. Journal of Earth Science, 25(2), 407–412.
Van den Broeck, J., Cunningham, S. A., Eeckels, R., & Herbst, K. (2005). Data cleaning: Detecting, diagnosing, and editing data abnormalities. PLoS Medicine, 2(10), e267.


Data Consolidators

▶ Data Brokers and Data Services


Data Discovery

Anirudh Prabhu
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA


Synonyms

Data-driven discovery; Information discovery; KDD; KDDM; Knowledge discovery


Introduction/Definition

Broadly defined, data discovery is the process of finding patterns and trends in processed, analyzed, or visualized data. The reason data discovery must be defined "broadly" is that this process is popular across domains. These patterns and trends can be "discovered" from the data using different methods depending on the context and domain of the work.


History

Recently, the term "data discovery" has been popularized as a process in Business Intelligence, with many software applications and tools aiding the user in discovering trends, patterns, outliers, clusters, etc. The data discovery process itself has a longer history that dates back to the beginning of data mining. Data mining started as a trend in the 1980s and was a process of extracting information by examining databases (under human control). Other names for data mining include knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing. In 1989, Gregory Piatetsky-Shapiro introduced the notion of knowledge discovery in databases (KDD) in the first KDD workshop. The main driving factor in defining the model was acknowledging the fact that knowledge is the end product of the data-driven discovery process. Another outcome of that workshop was the acknowledgement of the need to develop interactive systems that would provide visual and perceptual tools for data analysis (Kurgan and Musilek 2006). Since then, this idea has been worked on and improved upon to evolve into the "data discovery" process we know of today.


Usage in Different Contexts

Depending on the context and domain of application, the process of "discovery" changes, though the end goal is identifying patterns and trends and gaining knowledge from them.


In Business Intelligence

In Business Intelligence, "data discovery" relies more on front-end analytics. The process in this domain is to have a dashboard of some kind where descriptive statistics and visualizations are represented to the user.

The user then employs this interactive dashboard to view different datasets in order to address pertinent questions. Thus, in Business Intelligence, data discovery can be defined as "a way to let people get the facts (from data) they need to do their jobs confidently in a format that's intuitive and available" (Haan 2016).

The five principles of data analytics in business intelligence are (Haan 2016):

Fast: Data discovery is designed to answer "immediate, spur of the moment" questions. An ideal discovery solution allows access to information from many sources whenever needed. Features supporting this include quick connections to many data sources, faceting and sub-setting data as required, and updating visualizations and summary statistics accordingly.

Usable: Usability and representation of the data go hand in hand. In the business intelligence domain, the data discovery process needs to remain code-free, and the interface needs to be as intuitive as possible, with drag-and-drop features that make analysis steps clear as well as many prestructured templates, visualizations, and workflows.

Targeted: "Data discovery isn't meant to be a monolithic practice which is the same throughout the enterprise" (Haan 2016). It needs to be customized and optimized depending on the user's needs.

Flexible: The data discovery tool should be flexible enough to provide quick initial results from a single dataset as well as to answer complex questions that require subsets, views, and a combination of multiple datasets. "Data discovery can and should be applied to any department or function the tool can access data for" (Haan 2016).

Collaborative: "Data discovery is not a stand-alone process to yield results." It is when combined with analytics processes like predictive models and interactive visualizations that its usefulness is seen. "These tools should also be considered as a gateway to improving more formal reporting, data science, and information management activities" (Haan 2016).


In Analytical Disciplines

The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data is known as knowledge discovery. Thus, data discovery and knowledge discovery are essentially used interchangeably. "The discovery process usually consists of a set of sequential steps, which often includes multiple loops and iterations of a single step." Kurgan and Musilek (2006) survey the major knowledge discovery process models, but the authors specify a slightly different terminology. In their view, KDDM (Knowledge Discovery and Data Mining) is the process of knowledge discovery applied to any data source. Fayyad et al. (1996) describe one of the most widely cited discovery process models. The steps for this process are as follows:

• Develop an understanding of the domain and gain the relevant prior knowledge required
• Creating a target dataset (creating a data sample to perform discovery on)
• Cleaning and Preprocessing Data
• Data Reduction and Projection (reducing the data by selecting only the most relevant variables)
• Selecting the appropriate data mining method
• Performing exploratory analysis
• Data Mining
• Interpreting mined patterns
• Documenting or using the discovered knowledge
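To make the sequence above concrete, the following compressed Python sketch walks through several of the steps (target dataset, cleaning, reduction and projection, data mining, and interpretation) using Scikit-learn; the dataset and the choice of PCA and k-means are illustrative assumptions rather than part of the Fayyad et al. model.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    X, _ = load_iris(return_X_y=True)          # create a target dataset
    X = X[~np.isnan(X).any(axis=1)]            # cleaning: drop incomplete rows

    reducer = PCA(n_components=2)              # data reduction and projection
    X_reduced = reducer.fit_transform(X)

    miner = KMeans(n_clusters=3, n_init=10, random_state=0)   # data mining method
    labels = miner.fit_predict(X_reduced)

    sizes = dict(zip(*np.unique(labels, return_counts=True)))  # interpret mined patterns
    print("cluster sizes:", sizes)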
User Involvement and Automation

Data/knowledge discovery is considered a field where user involvement is extremely important, since the user judges whether the patterns found are useful or not. The level of user involvement and the steps at which the user controls the process change depending on the field of application.

A fully automated discovery system is one in which the user does not need to be involved until interpretation of the returned patterns is required. McGarry (2005) mentions that the characteristics necessary for an automated discovery system are to:

• Select the most relevant parameters and variables
• Guide the data mining algorithms in selecting the most salient parameters and in searching for an optimum solution
• Identify and filter the results most meaningful to the users
• Identify useful target concepts

An example of an automated discovery system that incorporates domain knowledge from the data in order to assess the novelty of patterns can be seen in the system proposed by the Ludwig process model (McGarry 2005). The system creates a prior model that can be revised every time new information is obtained. The Ludwig definition of novelty is: "a hypothesis 'H' is novel, with respect to a set of beliefs 'B', if and only if 'H' is not also derivable from 'B'." This means that if a pattern contradicts a known set of beliefs, then that pattern is considered novel.

An example of a semi-automated discovery system is the Data Monitoring and Discovery Triggering (DMDT) system (McGarry 2005). This system limits the user's involvement to providing feedback to guide the system as it searches. Pattern templates of "interesting" patterns will be selected and provided to the system. The DMDT system is intended to scale up on large datasets (which in turn may be composed of multiple datasets). Over a period of time, the pattern templates defining the interesting rules will change as the data changes and will trigger new discoveries (McGarry 2005).


Discovery in the Big Data Age

As the amount of data in the world grows exponentially, the algorithms, models, and systems proposed for specific tasks need to be updated to accommodate massive datasets. Begoli and Horey (2012) describe some design principles that inform organizations on effective analyses and data collection processes, system organization, and data dissemination practices.

Principle 1: Support a Variety of Analysis Methods
"Most modern discovery systems employ distributed programming, data mining, machine learning, statistical analysis, and visualizations." Distributed computing is primarily performed with Hadoop, a software product commonly coded in Java. Machine learning and statistics are generally coded in R, Python, or SAS. SQL is often employed for data mining tasks. Therefore, it is important for the discovery architecture to support a variety of analysis environments. Work is currently being done to enable this. For example, in R and Python environments, there are packages and libraries being written to run R/Python code on Hadoop environments and to use SQL queries to mine the available data. Similarly, R and Python also have packages that are wrappers for interactive visualization libraries like D3js (written in JavaScript), which can visualize massive datasets and interactively modify views of these visualizations for the purpose of visual analysis.
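A small illustration of mixing analysis environments in this spirit is sketched below: SQL expresses the data mining-style aggregation while Python/pandas handles the statistics. The in-memory SQLite database and the table and column names are stand-ins invented for the example; a production discovery system would more likely query Hive or another warehouse.

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (user_id INTEGER, amount REAL);
        INSERT INTO events VALUES (1, 9.5), (1, 20.0), (2, 3.25), (3, 14.0);
    """)

    # SQL step: aggregate the raw events inside the data store
    per_user = pd.read_sql_query(
        "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id", conn)

    # Python step: statistical summary of the SQL result
    print(per_user["total"].describe())
    conn.close()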
Principle 2: One Size Does Not Fit All
The discovery architecture must be able to store and process the data at all stages of the discovery process. This becomes very difficult with large datasets. Begoli and Horey (2012) proposed that instead of storing the data in one large relational database (as has been a common practice in the past), a specialized data management system is required. According to the authors, different types of analysis techniques should be able to use intermediate data structures to expedite the process.

The source data is often in an unusable format and may contain errors and missing values. Thus, the first step would be to clean the dataset and prepare it for analysis. According to Begoli and Horey, Hadoop is an ideal tool for this step. The Hadoop framework includes MapReduce for distributed computing and scalable storage. Hive and HBase offer data management solutions for storing structured and semistructured datasets. Once the structured and semistructured datasets are stored in the format required for analysis, they can be accessed directly by the user for machine learning/data mining tasks.

Principle 3: Make Data Accessible
This principle focuses on the representation of data and the results of the data analysis. It is important to make the results (i.e., the patterns and trends) available and easy to understand. Some of the "best practices" for presenting results are:

• Use open and popular standards: Using popular standards and frameworks means that there is extensive support and documentation for the required analysis. For example, if custom data visualizations were created using the D3js framework, it would be easy to produce similar visualizations for different datasets by linking the front end to a constantly updating data store.
• Use lightweight architectures: The term lightweight architecture is used when the software application has fewer and simpler working parts than commonly known applications of the same kind. Using lightweight architectures can simplify the creation of rich applications. When combined with the open source tools mentioned earlier, they ensure that a robust application can run on a variety of platforms.
• Provide interactive and flexible applications: Users now demand rich web-enabled APIs (Application Programming Interfaces) to download, visualize, and interact with the data. So, it is important to expose part of or all the data to the users while presenting the results of the knowledge discovery process, so that they can perform additional analysis if needed.


Research and Application Challenges

This section outlines some of the obstacles and challenges faced in the discovery process. The list is by no means exhaustive; it is simply meant to give the reader an idea of the problems faced while working in this field (Fayyad et al. 1996).

Big data: Despite most of the recent research focusing on big data, this remains one of the application challenges. Using massive datasets means that the discovery system requires large storage and powerful processing capabilities. The discovery process also needs to use efficient mining and machine learning algorithms.

High dimensionality: "Datasets with a large number of dimensions increases the size of the search space for model introduction in a 'combinatorially explosive' manner" (Fayyad et al. 1996). This results in the data mining algorithm finding patterns that are not useful. Dimension reduction methods combined with the use of domain knowledge can be used to effectively identify the irrelevant variables.

Overfitting: Overfitting implies that the algorithm has modeled the training dataset so perfectly that it also models the noise specific to that dataset. In this case, the model cannot be used on any other test dataset. Cross validation, regularization, and other statistical methods may be used to solve this issue.
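The following Scikit-learn sketch shows how cross validation exposes overfitting in practice: a fully grown decision tree fits its training data almost perfectly but scores noticeably lower under five-fold cross validation. The synthetic dataset and model choice are illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)

    tree = DecisionTreeClassifier(max_depth=None, random_state=0)  # fully grown tree
    print("training accuracy:", tree.fit(X, y).score(X, y))        # close to 1.0
    print("5-fold CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())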
Assessing statistical significance: "This problem occurs when the system searches for patterns over many possible models. For example, if a system tests models at the 0.001 significance level, then on average (with purely random data), N/1000 of these models will be accepted as significant. This can be fixed by adjusting the test statistic as a function of the pattern search" (Fayyad et al. 1996).

Changing data: Constantly changing data can make previously discovered patterns invalid. Sometimes, certain variables in the dataset can also be modified or deleted. This can drastically damage the discovery process. Possible solutions include incremental methods for updating patterns and using change as a trigger for a new discovery process.

Missing and noisy data: This is one of the oldest challenges in data science. Missing or noisy data can lead to biased models and thus inaccurate patterns. There are many known solutions to identify missing variables and dependencies.

Complex relationships between attributes: In some cases, the attributes in a dataset may have a complex relationship with each other (for example, a hierarchical structure). Older machine learning/data mining algorithms might not take these relationships into account. It is important to use algorithms that derive relations between the variables and create patterns based on these relations.

Understandability of patterns: The results of the discovery process need to be easy to understand and interpret. Well-made interactive visualizations, combined with summarizations in natural language, are a good starting step to address this problem.

Integration: Discovery systems typically need to integrate with multiple data stores and visualization tools. These integrations are not possible if the tools being integrated are not interoperable. Use of open source tools and frameworks helps address this problem.


Cross-References

▶ Data Processing


Further Reading

Begoli, E., & Horey, J. (2012). Design principles for effective knowledge discovery from big data. In Software Architecture (WICSA) and European Conference on Software Architecture (ECSA), 2012 joint working IEEE/IFIP conference (pp. 215–218). IEEE.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37.
Haan, K. (2016). So what is data discovery anyway? 5 key facts for BI. Retrieved Sept 24, 2017, from https://www.ironsidegroup.com/2016/03/21/data-discovery-5-facts-bi/.
Kurgan, L. A., & Musilek, P. (2006). A survey of knowledge discovery and data mining process models. The Knowledge Engineering Review, 21(1), 1–24.
McGarry, K. (2005). A survey of interestingness measures for knowledge discovery. The Knowledge Engineering Review, 20(1), 39–61.


Data Exhaust

Daniel E. O'Leary (Marshall School of Business, University of Southern California, Los Angeles, CA, USA) and Veda C. Storey (J. Mack Robinson College of Business, Georgia State University, Atlanta, GA, USA)


Overview

Data exhaust is a type of big data that is often generated unintentionally by users from normal Internet interaction. It is generated in large quantities and appears in many forms, such as the results from web searches, cookies, and temporary files. Initially, data exhaust has limited, or no, direct value to the original data collector. However, when combined with other data for analysis, data exhaust can sometimes yield valuable insights.


Description

Data exhaust is passively collected and consists of random online searches or location data that is generated, for example, from using smart phones with location-dependent services or applications (Gupta and George 2016). It is considered to be "noncore" data that may be generated when individuals use technologies that passively emit information in daily life (e.g., making an online purchase, accessing healthcare information, or interacting in a social network). Data exhaust can also come from information-seeking behavior that is used to make inferences about an individual's needs, desires, or intentions, such as Internet searches or telephone hotlines (George et al. 2014).

Additional Terminology
Data exhaust is also known as ambient data, remnant data, left over data, or even digital exhaust (Mcfedries 2013). A digital footprint or a digital dossier is the data generated from online activities that can be traced back to an individual.
that can be traced back to an individual. The passive traces of data from such activities are considered to be data exhaust. The big data that interests many companies is called “found data.” Typically data is extracted from random Internet searches and location data is generated from smart or mobile phone usage. Data exhaust should not be confused with community data that is generated by users in online social communities, such as Facebook and Twitter.
In the age of big data, one can, thus, view data as a messy collage of data points, which includes found data, as well as the data exhaust extracted from web searches, credit card payments, and mobile devices. These data points are collected for disparate purposes (Harford 2014).

Generation of Data Exhaust
Data exhaust is normally generated autonomously from transactional, locational, positional, text, voice, and other data signatures. It typically is gathered in real time. Data exhaust might not be purposefully collected, or is collected for other purposes and then used to derive insights.

Example of Data Exhaust
An example of data exhaust is backend data. Davidson (2016) provides an example from a real-time information transit application called Transit App (Davidson 2016). The Transit App provides a travel service to users. The App shows the coming departures of nearby transit services. It also has information on bike share, car share, and other ride services, which appear when the user simply opens the app. The app is intended to be useful for individuals who know exactly where they are going and how to get there, but want real-time information on schedules. The server, however, retains data on the origin, destination, and device data for every search result. The usefulness of this backend data was assessed by comparing the results obtained from using the backend data to predict trips to survey data of actual trips, which revealed a very similar origin-destination pattern.

Sources of Data Exhaust
The origin of data exhaust may be passive, digital, or transactional. Specifically, data exhaust can be passively collected as transactional data from people's use of digital services such as mobile phones, purchases, web searches, etc. These digital services are then used to create networked sensors of human behavior.

Potential Value
Data exhaust is accessed either directly in an unstructured format or indirectly as backend data. The value of data exhaust often is in its use to improve online experiences and to make predictions about consumer behavior. However, the value of the data exhaust can depend on the particular application and context.

Challenges
There are practical and research challenges to deriving value from data exhaust (technical, privacy and security, and managerial). A major technical challenge is the acquisition of data exhaust. Because it is often generated without the user's knowledge, this can lead to issues of privacy and security. Data exhaust is often unstructured data for which there is, technically, no known, proven way to consistently extract its potential value from a managerial perspective. Furthermore, data mining and other tools that deal with unstructured data are still at a relatively early stage of development.
From a research perspective, traditionally, research studies of humans have focused on data collected explicitly for a specific purpose. Computational social science increasingly uses data that is collected for other purposes. This can result in the following (Altman 2014):

1. Access to “data exhaust” cannot easily be controlled by a researcher. Although a researcher may limit access to their own data, data exhaust may be available from commercial sources or from other data exhaust sources. This increases the risk that any sensitive information linked with a source of data exhaust can be reassociated with an individual.
2. Data exhaust often produces fine-grained observations of individuals over time. Because of regularities in human behavior, patterns in data exhaust can be used to “fingerprint” an individual, thereby enabling potential reidentification, even in the absence of explicit identifiers or quasi-identifiers.

Evolution
As ubiquitous computing continues to evolve, there will be a continuous generation of data exhaust from sensors, social media, and other sources (Nadella and Woodie 2014). Therefore, the amount of unstructured data will continue to grow and, no doubt, attempts to extract value from data exhaust will grow as well.

Conclusion

As the demand for capture and use of real-time data continues to grow and evolve, data exhaust may play an increasing role in providing value to organizations. Much communication, leisure, and commerce occur on the Internet, which is now accessible from smartphones, cars, and a multitude of devices (Harford 2014). As a result, activities of individuals can be captured, recorded, and represented in a variety of ways, most likely leading to an increase in efforts to capture and use data exhaust.

Further Reading

Altman, M. (2014). Navigating the changing landscape of information privacy. http://informatics.mit.edu/blog/2014/10/examples-big-data-and-privacy-problems.
Bhushan, A. (2013). “Big data” is a big deal for development. In Higgins, K. (Ed.), International development in a changing world, 34. The North-South Institute, Ottawa, Canada.
Davidson, A. (2016). Big data exhaust for origin-destination surveys: Using mobile trip-planning data for simple surveying. Proceedings of the 95th Annual Meeting of the Transportation Research Board.
George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of Management Journal, 57(2), 321–326.
Gupta, M., & George, J. F. (2016). Toward the development of a big data analytics capability. Information Management, 53(8), 1049–1064.
Harford, T. (2014). Big data: A big mistake? Significance, 11(5), 14–19.
Mcfedries, P. (2013). Tracking the quantified self [Technically speaking]. IEEE Spectrum, 50(8), 24–24.
Nadella, A., & Woodie, A. (2014). Data ‘exhaust’ leads to ambient intelligence, Microsoft CEO says. https://www.datanami.com/2014/04/15/data_exhaust_leads_to_ambient_intelligence_microsoft_ceo_says/.

Data Fusion

Carolynne Hultquist
Geoinformatics and Earth Observation Laboratory, Department of Geography and Institute for CyberScience, The Pennsylvania State University, University Park, PA, USA

Definition/Introduction

Data fusion is a process that joins together different sources of data. The main concept of using a data fusion methodology is to synthesize data from multiple sources in order to create collective information that is more meaningful than if only using one form or type of data. Data from many sources can corroborate information, and, in the era of big data, there is an increasing need to ensure data quality and accuracy. Data fusion involves managing this uncertainty and conflicting data at a large scale. The goal of data fusion is to create useful representations of reality that are more complete and reliable than a single source of data.

Integration of Data

Data fusion is a process that integrates data from many sources in order to generate more meaningful information. Data fusion is very domain-dependent, and therefore, tasks and the development of methodologies are dependent on the field for diverse purposes (Bleiholder
and Naumann 2008). In general, the intention is to fuse data from many sources in order to increase value. Data from different sources can support each other, which decreases uncertainty in the assessment, or conflict, which raises questions of validity. Castanedo (2013) groups the data fusion field into three major methodological categories of data association, state estimation, and decision fusion. Analyzing the relationships between multiple data sources can help to provide an understanding of the quality of the data as well as identify potential inconsistencies.
Modern technologies have made data easier to collect and more accessible. The development of sensor technologies and the interconnectedness of the Internet of things (IoT) have linked together an ever-increasing number of sensors and devices which can be used to monitor phenomena. Data is accessible in large quantities, and multiple sources of data are sometimes available for an area of interest. Fusing data from a variety of forms of sensing technologies can open new doors for research and address issues of data quality and uncertainty.
Multisensor data fusion can be done for data collected for the same type of phenomena. For example, environmental monitoring data such as air quality, water quality, and radiation measurements can be compared to other sources and models to test the validity of the measurements that were collected. Geospatial data is fused with data collected in different forms and is sometimes also known in this domain as data integration. Geographical information from such sources as satellite remote sensing, UAVs (unmanned aerial vehicles), geolocated social media, and citizen science data can be fused to give a picture that any one source cannot provide. Assessment of hazards is an application area in which data fusion is used to corroborate the validity of data from many sources. The data fusion process is often able to fill some of the information gaps that exist and could assist decision-makers by providing an assessment of real-world events.

Conclusion

The process of data fusion directly seeks to address challenges of big data. The methodologies are directed at considering the veracity of large volumes and many varieties of data. The goal of data fusion is to create useful representations of reality that are more complete and reliable than trusting data that is only from a single source.

Cross-References

▶ Big Data Quality
▶ Big Variety Data
▶ Data Integration
▶ Disaster Planning
▶ Internet of Things (IoT)
▶ Sensor Technologies

Further Reading

Bleiholder, J., & Naumann, F. (2008). Data fusion. ACM Computing Surveys, 41, 1:1–1:41.
Castanedo, F. (2013). A review of data fusion techniques. The Scientific World Journal, 2013, 1–19, Article ID 704504.

Data Governance

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Introduction

Big Data governance is the exercise of decision-making for, and authority over, Big Data-related matters. Big Data governance comprises a set of decision rights and accountabilities for Big Data and information-related processes, executed according to agreed-to processes, standards, and models that collectively describe who can take what actions with what information and when, in accordance with predetermined methods and authorized access rights.
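As a purely illustrative sketch (the roles, actions, and dataset names below are hypothetical and are not drawn from this entry), such decision rights can be encoded in machine-readable form so that "who can take what actions with what information" is checked consistently:

DECISION_RIGHTS = {
    ("data_steward", "curate", "customer_records"): True,
    ("analyst", "read", "customer_records"): True,
    ("analyst", "export", "customer_records"): False,
}

def is_authorized(role, action, dataset):
    # Deny by default; allow only combinations granted in the agreed-to policy.
    return DECISION_RIGHTS.get((role, action, dataset), False)

print(is_authorized("analyst", "read", "customer_records"))    # True
print(is_authorized("analyst", "export", "customer_records"))  # False

In practice, such rules would be maintained through the governance structures described in the following sections and enforced by the platform's access management components rather than hard-coded.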
Distinctions Between Big Data Governance, Big Data Management, Big Data Operations, and Data Analytics

Big Data governance is a shared responsibility and depends on stakeholder collaboration so that shared decision-making becomes the norm, rather than the exception, of responsible Big Data governance.
As a component of an overall ICT governance framework, Big Data governance focuses on the decisions that must be made to ensure effective management and use of Big Data and decision accountability.
Big Data management focuses on the execution of Big Data governance decisions. The Big Data management function administers, coordinates, preserves, and protects Big Data resources. In addition, this organization is responsible for developing Big Data management procedures, guidelines, and templates in accordance with the direction provided by the Big Data governance board. The Big Data management function executes Big Data management processes and procedures, monitors their compliance with Big Data governance policies and decisions, and measures the effectiveness of Big Data operations. In addition, the Big Data management function is responsible for managing the technical Big Data architecture.
The Big Data operations function focuses on the execution of the activities stipulated by Big Data management and on capturing metrics of its activities. In this context, the focus of Big Data operations is on the effective execution of the Big Data management life cycle, broadly defined as Big Data acquisition and ingestion; Big Data integration, federation, and consolidation; Big Data normalization; Big Data storage; Big Data distribution; and Big Data archival. The Big Data operations function directly supports the various Big Data user groups, including informaticists, who focus on developing Big Data analysis models (descriptive, predictive, and, where feasible, prescriptive analytics), business intelligence models, and Big Data visualization products.

Big Data Governance Conceptual Framework

The figure below depicts a conceptual framework of Big Data governance, comprising a set of integrated components.

Data Governance, Fig. 1 Data integration – different sources

Big Data Governance Foundations
Big Data governance is an ongoing process that, when properly implemented, ensures the alignment of decision makers, stakeholders, and users with the objectives of the authorized, consistent, and transparent use of data assets. In a Big Data-dependent environment, change is inevitable, and achieving collaboration among various stakeholders, such as strategic and operational managers, operational staff, customers, researchers, and analysts, cannot be managed as one-time events.

Guiding Principles, Policies, and Processes
Effective Big Data governance depends on the formulation of guiding principles and their transformation into implementable policies that are operationalized as monitored processes. Examples are: maintaining the ontological integrity of persons; a code of ethics to guide Big Data operations, applications, and analytics; a single version of the “truth” (the “golden record”);
recognition of clearly identified Big Data sources and data recipients (lineage, provenance, and information exchange); transparency; unambiguous transformation of Big Data; Big Data quality, integrity, and security; alignment of Big Data management approaches; integrity and repeatability of Big Data processes and analytics methods; Big Data analytics design reviews; systemic and continuous Big Data governance and management processes, focusing on, for example, configuration and change management, access management, and data life cycle management. Processes should be repeatable and include decision points (stage gates), escalation rules, and remediation processes.

Big Data Analytics, Ethics, and Legal Considerations
The ability to process and analyze Big Data sets has caused an epistemic change in the approach to data analytics. Rather than treating data as if they are a bounded resource, current ICT-supported algorithm design and development capabilities enable data to now operate as nodes in ever-expanding global networks of ontological perspectives, each of which comprises its own set of shareable relationships.
The legal frameworks are not yet in place fully to address complexities that accompany the availability of Big Data. Currently, self-regulation sustains the regulatory paradigm of Big Data governance, frequently predicated on little more than industry standards, augmented by technical guidelines, and national security standards. Governance of such a complex environment requires the formulation of policies at the national and international level that address ethical use of Big Data applications that go beyond technical issues of identity and authentication, access control, communications protocols and network security, and fault tolerance. The ethical use of Big Data depends on establishing and sustaining a complex of trust; for example, from a technological perspective, trust in the completeness of a transaction; from a human perspective, trust that an agent in the Internet of things (IoT) will not compromise a person's ontological integrity; from a polity perspective, that Big Data applications, such as Artificial Intelligence (AI), will not be used prescriptively to mark individuals as undesirable or unwelcome, based, for instance, on nothing more than cultural prejudices, gender biases, or political ideologies.

Big Data Privacy and Security
Data privacy mechanisms have traditionally focused on safeguarding Personally Identifiable Information (PII) against unauthorized access. The availability of Big Data, especially IoT-produced data, complicates matters. The advent of IoT provides new opportunities for Big Data-sustained processes and analytics. For example, medical devices in IoT environments may act as virtual agents and operators, each with its own ontological aspects of identity (beyond radio frequency identification tags) and evolutionary properties, effectively blurring the distinctions between the digital and the physical spheres of this domain and raising not only ethical questions but also questions of the sufficiency and efficacy of the governance framework to address the privacy and security requirements of managing the lifecycles of citizens' data from their capture to archival.

Lexica, Ontologies, and Business Rules
A lexicon provides a controlled vocabulary that contains the terms and their definitions that collectively constitute a knowledge domain. A lexicon enforces rule-based specificity of meaning, enabling semantic consistency and reducing ambiguity, and supports the correlation of synonyms and the semantic (cultural) contexts in which they occur. Furthermore, to support data interoperability, a lexicon provides mechanisms that enable cross-lexicon mapping and reconciliation. The terms and their definitions that constitute the lexicon provide the basis for developing ontologies, which delineate the interdependencies among categories and their properties, usually in the form of similes, meronymies, and metonymies. Ontologies encapsulate the intellectual
histories of epistemic communities and reflect social constructions of reality, defined in the context of specific cultural norms and symbols, human interactions, and processes that collectively facilitate the transformation of data into knowledge. Ontologies support the development of dynamic heuristic instruments that sustain Big Data analytics. Business rules operationalize ontologies by establishing domain boundaries and specifying the requirements that Big Data must meet to be ontologically useful rather than to be excluded as “noise.”

Metadata
Metadata are generally considered to be information about data and are usually formulated and managed to comply with predetermined standards. Operational metadata reflect the requirements for data security; data anonymizing, including personally identifying information (PII); data ingestion, federation, and integration; data distribution; and analytical data storage. Structural (syntactic) metadata provide information about data structures. Bibliographical metadata provide information about data set producers, such as the author, title, table of contents, and applicable keywords of a document; data lineage metadata provide information about the chain of custody of a data item with respect to its provenance – the chronology of data ownership, stewardship, and transformations. Metadata also provide information on the data storage locations, usually as either local, external, or cloud-based data stores.

Big Data Quality
Both little and Big Data of acceptable quality are critical to the effective operations of an organization and to the reliability of its business intelligence and analytics. Data quality is a socio-cultural construct, defined by an organization in the context of its mission and purpose. Big Data quality management is built on the fundamental premise that data quality is meaningful only to the extent that it relates to the intended use of the data. Juran (1999, 34.9) notes, “Data are of high quality if they are fit for their intended uses in operations, decision making and planning. Data quality means that data are relevant to their intended uses and are of sufficient detail and quantity, with a high degree of accuracy and completeness, consistent with other sources, and presented in appropriate ways.”

Big Data Interoperability
Big Data interoperability relies on the secure and reliable transmission of data that conform to predetermined standards and conventions that are encapsulated in the operations of lexicon and ontology, metadata, and access and security components of the data governance framework. The application of Big Data interoperability capabilities may introduce unforeseen biases or instances of polysemy that may compromise, albeit unintentionally, the integrity of the research and the validity of its results. The growth of IoT-produced Big Data may exacerbate transparency issues and ethical concerns and may also raise additional legal issues. For example, an IoT agent may act preemptively and prescriptively on an individual's behalf without his or her knowledge. In effect, individuals may have abrogated, unknowingly, their control over their decisions and data. Moreover, IoT agents can share information so that data lineage and provenance chains become confused or lost, culminating in the compromise of data quality and reliability.

Big Data Analytics
Drawing liberally from statistics, microeconomics, operations research, and computer science, Big Data analytics constitute an integrative discipline to extract and extrapolate information from very large data sets. Increasingly, organizations use data analytics to support data-driven program execution, monitoring, planning, and decision-making. Data analytics provide the means to meet diverse information requirements, regardless of how they may be used to present or manage information. Big Data analytics lifecycle stages comprise a core set of data analytics capabilities, methods, and techniques that can be
adapted to comply with any organization's data governance standards, conventions, and procedures.

Challenges and Future Trends

Melvin Kranzberg (1986) observes, “Technology is neither good nor bad; nor is it neutral,” as a reminder that we must constantly compare short-term with long-term results, “the utopian hopes versus the spotted reality, what might have been against what actually happened, and the trade-offs among various ‘goods’ and possible ‘bads’” (pp. 547–548). The proliferation of Big Data and the proliferation of IoT environments will continue, requiring the development and implementation of flexible Big Data governance regimes. However, the velocity of change engendered by Big Data and IoT expansion has exacerbated the difficulties of defining and implementing effective Big Data governance programs without compromising those standards that, à priori, define ethical use of Big Data in cloud-based, global IoT environments.

Further Reading

Jacobs, A. (2009). The pathologies of big data. Communications of the ACM, 52(8), 36–44.
Juran, J., & Godfrey, B. (1999). Juran's quality handbook (5th ed.). New York: McGraw-Hill.
Kranzberg, M. (1986). Technology and history: “Kranzberg's Laws”. Technology and Culture, 27(3), 544–560.
National Research Council. (2013). Frontiers in massive data analysis. Washington, DC: The National Academies Press.
Roman, R., Zhou, J., & Lopez, J. (2013). On the features and challenges of security and privacy in distributed network of things. Computer Networks, 57(10), 2266–2279.
Sinaeepourfard, A., Garcia, J., Masip-Bruin, X., & Marín-Torder, E. (2016). Towards a comprehensive data lifecycle model for big data environments. In 2016 IEEE/ACM 3rd international conference on big data computing, applications and technologies (pp. 100–106).
Weber, R. H. (2010). Internet of things – new security and privacy challenges. Computer Law and Security Review, 26, 23–30.

Data Hacker

▶ Data Scientist

Data Integration

Anirudh Kadadi1 and Rajeev Agrawal2
1Department of Computer Systems Technology, North Carolina A&T State University, Greensboro, NC, USA
2Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, USA

Synonyms

Big data; Big data integration tools; Semi-structured data; Structured data; Unstructured data

Introduction

Big data integration can be classified as a crucial part of integrating enormous datasets from multiple sources. Big data integration is a combination of data management and business intelligence operations which covers multiple sources of data within the business and other sources. This data can be integrated into a single subsystem and utilized by organizations for business growth. Big data integration also involves the development and governance of data from different sources, which could impact an organization's ability to handle this data in real time.
The data integration in big data projects can be critical as it involves:

1. Discovering the sources of data, analyzing the sources to gain bigger insights of data, and profiling the data.
2. Understanding the value of data and analyzing the organizational gains through this data. This can be achieved by improving the quality of data.
3. Finally transforming the data as per the big data environment (Fig. 1).

Data Integration, Fig. 1 Data integration – different sources (source types shown: XML, flat file, NoSQL, DB2, SQL, feeding an algorithm and a data repository)

The five Vs of big data can influence the data integration in many ways. The five Vs can be classified as volume, velocity, variety, veracity, and value. The enormous volume of data is generated every second in huge organizations like Facebook and Google. In earlier times, the same amount of data was generated every minute. This variation in the data-generating capacity of organizations has been increasing rapidly, and this could motivate organizations to find alternatives for integrating the data generated in larger volumes every second. The speed at which the data is transmitted from the source to the destination can be termed as velocity. Data generated by different jobs at each time is transmitted on a timely basis and stored for further processing. In this case, the data integration can be performed only after a successful data transmission to the database. The data comes from numerous sources, which categorizes it into structured and unstructured. The data from social media can be the best example of unstructured data, which includes logs, texts, html tags, videos, photographs, etc. The data integration in this scenario can be performed only on the relational data which is already structured, and the unstructured data has to be optimized to structured data before the data integration is performed (Vassiliadis et al. 2002).
The trustworthiness and accuracy of the data from the sources can be termed as the veracity. The data from different sources comes in the form of tags and codes, where organizations lacked the technologies to understand and interpret this data. But technology today provides the flexibility to work with these forms of data and use it for business decisions. The data integration jobs can be created on this data depending on the flexibility and trust of this data and its source. The value can be termed as the business advantage and profits the data can bring to the organization. The value depends solely on the data and its source. Organizations target their profits using this data, and this data remains at a higher stake for different business decisions across the organization. Data integration jobs can be easily implemented on this data, but most organizations tend to keep this data as a backup for their future business decisions. Overall, the five Vs of big data play a major role in determining the efficiency of organizations to perform the data integration jobs at each level (Lenzerini 2002).
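The integration of different source types shown in Fig. 1 can be sketched, for example, with Apache Spark; the file paths and column names below are hypothetical, and Spark is used here only as one convenient engine rather than as the approach prescribed by this entry.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-integration-sketch").getOrCreate()

# Structured source: a flat file exported from a relational system.
orders = spark.read.option("header", "true").csv("/data/raw/orders.csv")

# Semi-structured source: JSON events whose schema is inferred on read.
events = spark.read.json("/data/raw/web_events.json")

# Integrate the two sources on a shared key and keep selected fields.
integrated = (
    orders.join(events, on="customer_id", how="inner")
          .select("customer_id", "order_total", "event_type")
)

# Load the integrated result into a single repository (here, Parquet files).
integrated.write.mode("overwrite").parquet("/data/integrated/orders_events")

A sketch of this kind corresponds to the "single subsystem" idea described in the Introduction: each source keeps its native format until it is read, and only the integrated result is written to the shared repository.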
Traditional ETL Methods with Hadoop as a Solution

Organizations tend to implement the big data methodologies into their work systems, creating information management barriers which include accessing, transforming, extracting, and loading the information using traditional methodologies for big data. Big data creates potential opportunities for organizations. To gain the advantage over the opportunities, organizations tend to develop an effective way of processing and transforming the information, which involves data integration at each level of data management. Traditionally, data integration involves integration of flat files, in-memory computing, relational databases, and moving data from relational to non-relational environments.
Hadoop is the new big data framework which enables the processing of huge datasets from different sources. Some of the market leaders are working on integrating Hadoop with legacy systems to process their data for business use in the current market trend. One of the oldest contributors to the IT industry, "the mainframe," has been in existence for a long time, and currently IBM is working on the development of new techniques to integrate large datasets through Hadoop and the mainframe.

The Challenges of Data Integration

In a big data environment, data integration can lead to many challenges in real-time implementation, which has a direct impact on projects. Organizations tend to implement new ways to integrate this data to derive meaningful insights at a bigger picture. Some of the challenges posed in data integration are discussed as:

(i) Accommodate scope of data:
Accommodating the sheer scope of data and creating newer domains in the organization are a challenge, and this can be addressed by implementing a high-performance computing environment and advanced data storage devices like a hybrid storage device which features hard disk drives (HDD) and solid-state drives (SSD); possesses better performance levels with reduced latency, high reliability, and quick access to the data; and therefore helps accumulate large datasets from all the sources. Another way of addressing this challenge can be through discovery of common operational methodologies between the domains for integrating the query operations, which stands as a better environment to address the challenges for large data entities.
(ii) Data inconsistency:
Data inconsistency refers to the imbalances in data types, structures, and levels. Although the structured data provides the scope for query operations through a relational approach so that the data can be analyzed and used by the organization, unstructured data always takes a lead in larger data entities, and this comes as a challenge for organizations. Addressing the data inconsistency can be achieved using the tag and sort methods which allow searching the data using keywords. The new big data tool Hadoop provides the solution for modulating and converting the data through MapReduce and Yarn. Although Hive in Hadoop doesn't support online transactions, it can be implemented for file conversions and batch processing.
(iii) Query optimization:
In real-time data integration, the large data entities require query optimization at microlevels, which could involve mapping components to the existing or a new schema, which impacts the existing structures. To address this challenge, the number of queries can be reduced by implementing the joins, strings, and grouping functions. Also, the query operations are performed on individual data threads, which can reduce the latency and improve responsiveness. Using the distributed joins like merge, hash, and sort can be an alternative in this scenario but requires more resources. Implementing the grouping, aggregation, and joins can be the best approach to address this challenge.
(iv) Inadequate resources and implementing support system:
Lack of resources haunts every organization at a certain point, and this has a direct impact on the project. Limited or inadequate resources for creating data integration jobs, lack of skilled labor that don't specialize in data integration, and costs incurred during the implementation of data integration tools can be some of the challenges faced by organizations in real time. This challenge can be addressed by constant resource monitoring within the organization, and limiting the standards to an extent can save the organizations from bankruptcy. Human resources play a major role in every organization, and this could pick the right professionals for the right task in a timely manner for the projects and tasks at hand.
There is a need to establish a support system for updating requirements and error handling, and reporting is required when organizations perform various data integration jobs within the domains and externally. This can be an additional cost for the organizations, as setting up a training module to train the professionals and direct them toward understanding the business expectations and deploy them in a fully equipped environment. This can be termed as a good investment, as every organization would implement advancements in a timely manner to stick with the growing market trends. A support system for handling errors could fetch them the reviews to analyze the negative feedback and modify the architecture as per the reviews and update the newer versions with better functionalities.
(v) Scalability:
Organizations could face a big challenge in maintaining the data accumulated from a number of years of their service. This data is stored and maintained using the traditional file systems or other methodologies as per their environment. In this scenario, often the scalability issues arise when the new data from multiple resources is integrated with data from legacy systems. Changes made by the data scientists and architects could impact the functioning of legacy systems, as they have to go through many updates to match the standards and requirements of new technologies to perform a successful data integration. In recent times, the mainframe stands as one of the best examples of a legacy system. For a better data operation environment and rapid access to the data, Hadoop has been implemented by organizations to handle the batch processing unit. This follows a typical ETL (Extract, Transform, and Load) approach to extract the data from a number of resources and load them into the Hadoop environment for batch processing.

Some of the common data integration tools which have been in use are Talend, CloverETL, and KARMA. Each of these data integration tools has its own significance in providing the best data integration solutions for the business.

Real-Time Scenarios for Data Integration

In recent times, Talend was used as the main base for data integration by Groupon, one of the leading deal-of-the-day websites which offers discounted gift certificates to be used at local shopping stores. For integrating the data from sources, Groupon relied on "Talend." Talend is an open-source data integration tool which is used to integrate data from numerous resources. When Groupon was a startup, they relied on an open source for more gains rather than using a licensed tool which involves more cost for licensing. Since Groupon is a publicly traded company now, they would have to process 1 TB of data per day, which comes from various sources.
There is another case study where a telephone company was facing issues with phone invoices in different formats which were not suitable for electronic processing and therefore involved the manual evaluation of phone bills. This consumed a lot of time and resources for the company. The CloverETL data integration tool was the solution
for the issue, and the inputs given were itemized phone bills, the company's employee database, and the customer contact database. The data integration process involved consolidated call data records, reporting phone expenses in hierarchy, and analysis of phone calls and their patterns. This helped the organization cut down the costs incurred by 37% yearly.

Conclusion

On the whole, data integration in the current IT world is in demand with the increasing amount of data and covers complete aspects of data solutions with the usage of data integration tools. Data scientists are still finding solutions for a simplified data integration with efficient automated storage systems and visualization methods, which could turn out complex in terms of big data. Development of newer data integration solutions in the near future could help address the big data integration challenges. An efficient data integration tool is yet to conquer the market, and evolution of these tools can help organizations handle the data integration in a much more simplified way.

Further Reading

Big Data Integration. http://www.ibmbigdatahub.com/video/big-data-integration-what-it-and-why-you-need-it.
Clover ETL. http://www.cloveretl.com/resources/case-studies/data-integration.
Data Integration tool. http://blog.pentaho.com/2011/07/15/facebook-and-pentaho-data-integration/.
IBM. How does data integration help your organization? http://www-01.ibm.com/software/data/integration/.
Lenzerini, M. (2002). Data integration: A theoretical perspective. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 233–246). New York, NY, USA.
Talend. https://www.talend.com/customers/customer-reference/groupon-builds-on-talend-enterprise-data-integration.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual modeling for ETL processes. In Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP (DOLAP '02) (pp. 14–21). New York: ACM.

Data Integrity

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Data integrity, along with confidentiality and availability, is one of the three fundamental aspects of data security. Integrity is about ensuring that data is and remains reliable, and that it has not been tampered with or altered erroneously. Hardware or software failures, human mistakes, and malicious actors can all be threats to integrity. Data integrity refers specifically to the integrity of the data stored in a system. This can be a particularly critical issue when dealing with big data due to the volume and variety of data stored and processed.
Data integrity deals with questions such as “trust” and “fitness for use” (Lagoze 2014). Even when data has been correctly gathered, stored, and processed, issues of representativeness and data quality can render conclusions unreliable (Lazer et al. 2014).
Data integrity can also be affected by archival considerations. Many big data projects rely on third-party data collection and storage. Text collections such as Wikipedia, Project Gutenberg, and HathiTrust have been used for many language-based big data projects. However, the data in these projects changes over time, as the collections are edited, expanded, corrected, and generally curated. Even relatively harmless fixes such as correcting optical character recognition or optical character reader (OCR) errors can have an effect farther down the processing pipeline; major changes (such as adding documents newly entered into the public domain every year) will cause correspondingly large changes downstream. However, it may not be practical for an organization to archive its own copy of a large database to freeze and preserve it.
Big data technology can create its own data integrity issues. Many databases are too large to
store on a single machine, and even when single-point storage is possible, considerations such as accessibility and performance can lead engineers to use distributed storage solutions such as HBase or Apache Cassandra (Prasad and Agarwal 2016). More machines, in turn, mean more chances for hardware failure, and duplicating data blocks means more chances for copies to become out of sync with each other (which one of the conflicting entries is correct?).
When using a cloud environment (Zhou et al. 2018), these issues are magnified because the data consumer no longer has control of the storage hardware or environment. Cloud storage systems may abuse data management rights and, more importantly, may not provide the desired level of protection against security threats generally (including threats to integrity).
In general, any computing system, even a small, single-system app, should provide integrity, confidentiality, and availability. However, the challenge of providing and confirming data integrity is much harder with projects on the scale of big data.

Further Reading

Lagoze, C. (2014). Big data, data integrity, and the fracturing of the control zone. Big Data & Society, 1(2), 2053951714558281. https://journals.sagepub.com/doi/abs/10.1177/2053951714558281.
Lazer, D., Kennedy, R., & King, G. (2014). The parable of Google flu: Traps in big data analysis. Science, 343(6176), 1203–1205.
Prasad, B. R., & Agarwal, S. (2016). Comparative study of big data computing and storage tools. International Journal of Database Theory and Application, 9(1), 45–66.
Zhou, L., Fu, A., Yu, S., Su, M., & Kuang, B. (2018). Data integrity verification of the outsourced big data in the cloud environment: A survey. Journal of Network and Computer Applications, 122, 1–15.

Data Journalism

▶ Media

Data Lake

Christoph Quix1,2, Sandra Geisler1 and Rihan Hai3
1Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
2Hochschule Niederrhein University of Applied Sciences, Krefeld, Germany
3RWTH Aachen University, Aachen, Germany

Overview

Data lakes (DL) have been proposed as a new concept for centralized data repositories. In contrast to data warehouses (DW), which usually require a complex and fine-tuned Extract-Transform-Load (ETL) process, DLs use a simpler model which just aims at loading the complete source data in its raw format into the DL. While a more complex ETL process with data transformation and aggregation increases the data quality, it might also come with some information loss as irregular or unstructured data not fitting into the integrated DW schema will not be loaded into the DW. Moreover, some data silos might not get connected to integrated data repositories at all due to the complexity of the data integration process. DLs address these problems: they should provide access to the source data in its original format without requiring an elaborated ETL process to ingest the data into the lake.

Key Research Findings

Architecture
Since the idea of a DL has been described first in a blog post by James Dixon (https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/), a few DL architectures have been proposed (e.g., Terrizzano et al. 2015; Nargesian et al. 2019). As Hadoop is also able to handle any kind of data in its distributed file system, many people think that “Hadoop” is the complete answer to the question how a DL
should be implemented. Of course, Hadoop is good at managing the huge amount of data with its distributed and scalable file system, but it does not provide detailed metadata management which is required for a DL. For example, the DL architecture presented in Boci and Thistlethwaite (2015) shows that a DL system is a complex eco-system of several components and that Hadoop provides only a part of the required functionality.
More recent articles about data lakes (e.g., Mathis (2017)) mention common functional components of a data lake architecture for ingesting, storing, transforming, and using data. These components are sketched in the architecture of Fig. 1 (Jarke and Quix 2017).
The architecture is separated into four layers: the Ingestion Layer, the Storage Layer, the Transformation Layer, and the Interaction Layer.

Data Lake, Fig. 1 Data lake architecture (layers and components shown: Interaction Layer – Metadata Manager, Data Exploration; Transformation Layer – Data Cleaning, Data Transformation Engine, Integration Workflows, Application-specific Data Marts; Storage Layer – Data Access Interface, Metadata & DQ Store, Raw Data Stores; Ingestion Layer – Data Quality, Metadata Extraction, Data Ingestion; Heterogeneous Data Sources)

Ingestion Layer
One of the key features of the DL concept is the minimal effort to ingest and load data into the DL. The components for data ingestion and metadata extraction should be able to extract data and metadata from the data sources automatically as far as possible, for example, by using methods to extract a schema from JSON or XML. In addition to the metadata, the raw data needs to be also ingested into the DL. According to the idea that the raw data is kept in its original format, this is more like a “copy” operation and thereby certainly less complex than an ETL process in DWs. Nevertheless, the data needs to be put into the storage layer of the DL, which might imply some syntactical transformation.
Data governance and data quality (DQ) management are important in DLs to avoid data swamps. The Data Quality component should make sure that
the ingested data fulfills minimum data quality requirements. For example, if a source with information about genes is considered, it should also provide an identifier of the genes in one of the common formats (e.g., from the Gene Ontology, http://geneontology.org) instead of using proprietary IDs that cannot be mapped to other sources.

Storage Layer
The main components in the storage layer are the metadata repository and the repositories for raw data. The Metadata Repository stores all the metadata of the DL which has been partially collected automatically in the ingestion layer or will be later added manually during the curation or usage of the DL.
The raw data repositories are the core of the DL in terms of data volume. As the ingestion layer provides the data in its original format, different storage systems for relational, graph, XML, or JSON data have to be provided. Moreover, the storage of files using proprietary formats should be supported. Hadoop seems to be a good candidate as a basic platform for the storage layer, but it needs to be complemented with components to support the data fidelity, such as Apache Spark.
In order to provide a uniform way for the user to query and access the data, the hybrid data storage infrastructure should be hidden by a uniform data access interface. This data access interface should provide a query language and a data model which have sufficient expressive power to enable complex queries and represent the complex data structures that are managed by the DL. Current systems such as Apache Spark and HBase offer this kind of functionality using a variant of SQL as query language and data model.

Transformation Layer
To transform the raw data into a desired target structure, the DL needs to offer a data transformation engine in which operations for data cleaning, data transformation, and data integration can be realized in a scalable way. In contrast to a data warehouse, which aims at providing one integrated schema for all data sources, a DL should support the ability to create application-specific data marts, which integrate a subset of the raw data in the storage layer for a concrete application. From a logical point of view, these data marts are rather part of the interaction layer, as the data marts will be created by the users during the interaction with the DL. On the other hand, their data will be stored in one of the systems of the storage layer. In addition, data marts can be more application-independent if they contain a general-purpose dataset which has been defined by a data scientist. Such a dataset might be useful in many information requests of the users.

Interaction Layer
The top layer focuses on the interaction of the users with the DL. The users will have to access the metadata to see what kind of data is available, and can then explore the data. Thus, there needs to be a close relationship between the data exploration and the metadata manager components. On the other hand, the metadata generated during data exploration (e.g., semantic annotations, discovered relationships) should be inserted into the metadata store. The interaction layer should also provide to the user functionalities to work with the data, including visualization, annotation, selection, and filtering of data, and basic analytical methods. More complex analytics involving machine learning and data mining is in our view not part of the core of a DL system, but certainly a very useful scenario for the data in the lake.

Data Lake Implementations
Usually, a DL is not an out-of-the-box ready-to-use system. The above described layers have to be assembled and configured one by one according to the organization's needs and business use cases. This is tedious and time consuming, but also offers a lot of flexibility to use preferred tools. Slowly, implementations which offer some of the previously described features of DLs in a bundle are evolving. The Microsoft Azure Data Lake (https://azure.microsoft.com/solutions/data-lake) offers a cloud implementation especially for the storage layer building on HDFS enabling a hierarchical file system structure. Another implementation covering a wide range of the above mentioned features is offered by the open-source data lake management platform Kylo (https://kylo.io/).
Kylo is built on Hadoop, Spark, and Hive and offers out-of-the-box wrappers for streams as well as for batch data source ingestion. It provides metadata management (e.g., schema extraction), data governance, and data quality features, such as data profiling, on the ingestion and storage layer. Data transformation based on Spark and a common search interface are also integrated into the platform. Another advanced but commercial platform is the Zaloni Data Platform (https://www.zaloni.com/platform).

Future Directions for Research

Lazy and Pay-as-You-Go Concepts
A basic idea of the DL concept is to consume as little upfront effort as possible and to spend additional work during the interaction with users; for example, schemas, mappings, and indexes are created while the users are working with the DL. This has been referred to as lazy (e.g., in the context of loading databases (Karæz et al. 2013)) and pay-as-you-go techniques (e.g., in data integration (Sarma et al. 2008)). All workflows in a DL system have to be verified, whether the deferred or incremental computation of the results is applicable in that context. For example, metadata extraction can be done first in a shallow manner by extracting only the basic metadata; only if detailed data of the source is required, a more detailed extraction method will be applied. A challenge for the application of these “pay-as-you-go” methods is to make them really “incremental”; for example, the system must be able to detect the changes and avoid a complete recomputation of the derived elements.

Schema-on-Read and Evolution
DLs also provide access to un- or semi-structured data for which a schema was not explicitly given during the ingestion phase. Schema-on-read means that schemas are only created when the data is accessed, which is in line with the “lazy” concept described in the previous section. The shallow extraction of metadata might also lead to changes at the metadata level, as schemas are being refined if more details of the data sources are known. Also, if a new data source is added to the DL system, or an existing one is updated, some integrated schemas might have to be updated as well, which leads to the problem of schema evolution (Curino et al. 2013).
Another challenge to be addressed for schema evolution is the heterogeneity of the schemas and the frequency of the changes. While data warehouses have a relational schema which is usually not updated very often, DLs are more agile systems in which data and metadata can be updated very frequently. The existing methods for schema evolution have to be adapted to deal with the frequency and heterogeneity of schema changes in a big data environment (Hartung et al. 2011).

Mapping Management
Mapping management is closely related to the schema evolution challenge. Mappings state how data should be processed on the transformation layer, that is, to transform the data from its raw format as provided by the storage layer to a target data structure for a specific information requirement. Although heterogeneity of data and models has been considered and generic languages for models and mappings have been proposed (Kensche et al. 2009), the definition and creation of mappings in a schema-less world has not received much attention yet. The raw data in DLs is less structured and schema information is not explicitly available. Thus, in this context methods for data profiling or data wrangling have to be combined with schema extraction, schema matching, and relatable dataset discovery (Alserafi et al. 2017; Hai et al. 2019).

Query Rewriting and Optimization
In the data access interface of the storage layer, there is a trade-off between expressive power and complexity of the rewriting procedure, as the complexity of query rewriting depends to a large degree on the choice for the mapping and query language. However, query rewriting should not only consider completeness and correctness, but also the costs for executing the rewritten query should be taken into account.
Thus, the methods for query rewriting and query optimization require a tighter integration (Gottlob et al. 2014). It is also an open question whether there is a need for an intermediate language in which data and queries are translated to do the integrated query evaluation over the heterogeneous storage system, or whether it is more efficient to use some of the existing data representations. Furthermore, given the growing adoption of declarative languages in big data systems, query processing and optimization techniques from the classical database systems could be applied as well in DL systems.

Data Governance and Data Quality
Since the output of a DL should be useful knowledge for the users, it is important to prevent a DL becoming a data swamp. There are conflicting goals: on the one hand, any kind of data source should be accepted for the DL, and no data cleaning and transformation should be necessary before the source is ingested into the lake. On the other hand, the data of the lake should have sufficient quality to be useful for some applications. Therefore, it is often mentioned that data governance is required for a DL. First of all, data governance is an organizational challenge, that is, roles have to be identified, stakeholders have to be assigned to roles and responsibilities, and business processes need to be established to organize various aspects around data governance (Otto 2011). Still, data governance needs to be also supported by appropriate techniques and tools. For data quality, as one aspect of data governance, a similar evolution has taken place; the initially abstract methodologies have been complemented in the meantime by specific techniques and tools. Preventive, process-oriented data quality management (in contrast to data cleaning, which is a reactive data quality management) also addresses responsibilities and processes in which data is created in order to achieve a long-term improvement of data quality.

Data Models and Semantics in Data Lakes
While it has been acknowledged that metadata management is an important aspect in DLs, there are only few works on the modeling of data and metadata in a DL. Data vault is a dimensional modeling technique frequently applied in DW projects; in Giebler et al. (2019), this modeling technique is applied to DLs and compared with other techniques. Because of the fragmentation of the data in many different tables, querying is expensive due to many join operations. Also, the mapping of DLs to semantic models has been considered in Endris et al. (2019). They propose a framework that maps heterogeneous sources to a unified RDF graph and thereby allows federated query processing. Still, there is a need for more sophisticated metadata models and data modeling techniques for DLs to provide more guidance in managing a DL.

Cross-References

▶ Big Data Quality
▶ Data Fusion
▶ Data Integration
▶ Data Quality Management
▶ Data Repository
▶ Metadata

Further Reading

Alserafi, A., Calders, T., Abelló, A., & Romero, O. (2017). Ds-prox: Dataset proximity mining for governing the data lake. In C. Beecks, F. Borutta, P. Kröger, & T. Seidl (Eds.), Similarity search and applications – 10th international conference, SISAP 2017, Munich, Germany, October 4–6, 2017, proceedings (Vol. 10609, pp. 284–299). Springer. https://doi.org/10.1007/978-3-319-68474-120.
Boci, E., & Thistlethwaite, S. (2015). A novel big data architecture in support of ads-b data analytic. In Proceedings of the integrated communication, navigation, and surveillance conference (icns) (pp. C1-1–C1-8). https://doi.org/10.1109/ICNSURV.2015.7121218.
Curino, C., Moon, H. J., Deutsch, A., & Zaniolo, C. (2013). Automating the database schema evolution process. VLDB Journal, 22(1), 73–98.
Endris, K. M., Rohde, P. D., Vidal, M., & Auer, S. (2019). Ontario: Federated query processing against a semantic data lake. In Proceedings of 30th international
300 Data Management and Artificial Intelligence (AI)

conference on database and expert systems applica-


tions (dexa) (Vol. 11706, pp. 379–395). Springer. Data Management and
Retrieved from https://doi.org/10.1007/978-3-030-
27615-7\_29. Artificial Intelligence (AI)
Giebler, C., Gröger, C., Hoos, E., Schwarz, H., &
Mitschang, B. (2019). Modeling data lakes with data Alan R. Shark
vault: Practical experiences, assessment, and lessons Public Technology Institute, Washington, DC,
learned. In Proceedings of the international conference
on conceptual modeling (er). (to appear).
USA
Gottlob, G., Orsi, G., & Pieris, A. (2014). Query rewriting Schar School of Policy and Government, George
and optimization for ontological databases. ACM Mason University, Fairfax, VA, USA
Transations on Database Systems, 39(3), 25:1–25:46.
Retrieved from https://doi.org/10.1145/2638546.
Hai, R., Quix, C., & Wang, D. (2019). Relaxed functional
dependency discovery in heterogeneous data lakes. In The numbers are staggering as big data keeps
Proceeding of the international conference on concep- getting bigger. We know that over 300 hours of
tual modeling (er). (to appear). YouTube videos are downloaded every minute of
Hartung, M., Terwilliger, J. F., & Rahm, E. (2011). Recent every day. Google alone processes more than
advances in schema and ontology evolution. In Z.
Bellahsene, A. Bonifati, & E. Rahm (Eds.), Schema
40,000 searches every second, and when other
matching and mapping (pp. 149–190). Springer Ber- search engines are added-combined, they account
lin/Heidelberg. Retrieved from https://doi.org/10.1007/ for some 5 billion searches a day worldwide
978-3-642-16518-4. (https://merchdope.com/youtube-stats/). When
Jarke, M., & Quix, C. (2017). On warehouses, lakes, and
we look at all the text messages and posted and
spaces: The changing role of conceptual modeling for
data integration. In J. Cabot, C. Gómez, O. Pastor, M. shared pictures, let alone emails and other forms
Sancho, & E. Teniente (Eds.), Conceptual modeling of digital communications, we create no less than
perspectives (pp. 231–245). Springer. https://doi.org/ 2.5 quintillion bytes of data each day (https://
10.1007/978-3-319-67271-716.
www.forbes.com/sites/bernardmarr/2018/05/21/
Karæz, Y., Ivanova, M., Zhang, Y., Manegold, S., &
Kersten, M. L. (2013). Lazy ETL in action: ETL tech- how-much-data-do-we-create-every-day-the-mind-
nology dates scientific data. PVLDB, 6(12), 1286–1289. blowing-stats-everyone-should-read/#3bdc42
Retrieved from http://www.vldb.org/pvldb/vol6/p1286- 5b60ba). These statistics were published prior to
kargin.pdf. the COVID-19 pandemic where the growth of
Kensche, D., Quix, C., Li, X., Li, Y., & Jarke, M. (2009).
Generic schema mappings for composition and query
video feeds and storage have grown hundreds of
answering. Data & Knowledge Engineering, 68(7), percent over a short period of time.
599–621. https://doi.org/10.1016/j.datak.2009.02.006. Data curation and data mining have become a
Mathis, C. (2017). Data lakes. Datenbank-Spektrum, 17 growing specialty. Data has been used to help
(3), 289–293. https://doi.org/10.1007/s13222-017-
understand the opioid crisis with visualized map
0272-7.
Nargesian, F., Zhu, E., Miller, R. J., Pu, K. Q., & Arocena, P. planning and has been used to better understand
C. (2019). Data lake management: Challenges and oppor- the outbreak of the COVID-19 pandemic. As data
tunities. PVLDB, 12(12), 1986–1989. Retrieved from continues to accumulate, locating and analyzing
http://www.vldb.org/pvldb/vol12/p1986-nargesian.pdf. data in a timely manner become a never-ending
Otto, B. (2011). Data governance. Business & Information
Systems Engineering, 3(4), 241–244. https://doi.org/
challenge. Many are turning to artificial intelli-
10.1007/s12599-011-0162-8. gence (AI) to assist. The potential to harness
Sarma, A. D., Dong, X., & Halevy, A. Y. (2008). data with AI holds enormous promise. But there
Bootstrapping pay-as-you-go data integration systems. are some roadblocks to navigate around, and one
In J. T.-L. Wang (Ed.), Proceedings of ACM SIGMOD
international conference on management of data (pp.
must take a deeper dive to better understand the
861–874). Vancouver: ACM Press. relationship between the two.
Terrizzano, I., Schwarz, P. M., Roth, M., & Colino, J. E. AI, at least how it is applied today, is not as
(2015). Data wrangling: The challenging yourney from new as some would believe. Machine learning
the wild to the lake. In 7th biennial conference on
innovative data systems (cidr). Retrieved from http://
(ML) has been around for some 50+ years and
www.cidrdb.org/cidr2015/Papers/CIDR15_Paper2. can be defined as the scientific study of algo-
pdf. rithms and statistical models that computer
Data Management and Artificial Intelligence (AI) 301

systems use to carry out tasks without explicit 4. Ability to store and retrieve massive amounts
instructions, such as by using pattern recognition of data
and inference (https://en.wikipedia.org/wiki/ 5. Ability to “self-learn”
Machine_learning). Spam and e-mail filters are 6. Advancements in artificial speech and
good examples of ML where algorithms are con- recognition
stantly being updated to detect either which
emails to be placed in which folders or which This author, through studying the practical
emails should be considered SPAM. Note, even applications of AI, has reached the conclusion
the best filters still allow for humans to check to (at least at this writing) that a more accurate def- D
make sure that any email considered SPAM is inition would be “the theory and development of
just that – and there are times when things are computer systems able to supplement human
judged incorrectly. decision making, planning and forecasting based
ML is a large departure from the beginnings of on abundant sources of quality data.” Thus, what
machine programing where computers relied we have today is AI as augmented intelligence,
totally on programs and instructions. Today, assisting humans in searching for meaningful
through ML, machines can be programed to seek answers to contemporary problems and helping
out patterns and are the cornerstone for predictive to make decisions based on data.
analytics (https://en.wikipedia.org/wiki/Machi On February 11, 2019, President Trump signed
ne_learning). Machine learning can be viewed as Executive Order 13859 announcing the American
a unique subfield of artificial intelligence in which AI Initiative – the United States’ National Strategy
algorithms learn to fulfill tasks. AI in practice is on Artificial Intelligence (https://www.whitehouse.
being developed to mimic human behavior. AI gov/ai/). Aside from promoting AI in the govern-
can best be described as a system’s ability to ment workspace through collaboration with indus-
correctly interpret external data, to learn from try and academia, there was a clear recognition that
such data, and to use those learnings to achieve data bias and ethics need to be addressed as AI
specific goals and tasks through flexible adapta- applications advance (https://www.whitehouse.
tion (https://botanalytics.co/blog/2017/08/18/ gov/ai/ai-american-innovation/). Many have warned
machine-learning-artificial-intelligence/). of the potential dangers of AI if ethics and bias are
So, to harness the power of data, AI can be not adequately tackled. Can AI in collecting data
used to search trillions upon trillions of pieces of through articles and papers distinguish between
data in seconds or less in search of a pattern, peer-reviewed studies versus opinions that may
anomaly, statistical probability, and simply any- reflect poor or lack of any scientific proof – or
thing a human might perform in seconds versus worse – ignorant summations, conspiracy theories,
years or more. Several futurists believe in time or racist leanings? (https://www.napawash.org/
that AI will move from machine intelligence to studies/academy-studies/ai-and-its-impact-on-pub
machine consciousness. A popular example of lic-administration).
machine learning would be our growing reliance But the USA was not the first among the most
on such devices such as Alexa, Siri, and talking developed nations to develop its AI initiative. The
into our TV remotes. An example of machine European Union developed its trustworthy AI ini-
consciousness might be talking to a robot who tiative in early 2018 and articulated seven basic
expresses human emotions and can appear to principles which are (https://ec.europa.eu/digital-
both think, feel, and reason. single-market/en/artificial-intelligence):
AI has advanced in the past 10 years because of
six key factors; they are: 1. Human agency and oversight
2. Technical robustness and safety
1. Advancements in complex algorithms 3. Privacy and data governance
2. Dramatic increase in speed and computing power 4. Transparency
3. Ability to digest data from various sources 5. Diversity, nondiscrimination, and fairness
302 Data Management and Artificial Intelligence (AI)

6. Societal and environmental well-being Data management and data ingestion are essen-
7. Accountability tial components of AI. Data is often collected
without any idea of how it might be used at a
And most observers believe China has taken later time; therefore it is imperative that data be
the lead in AI development, and the rest of the tagged in ways that make it easier to comprehend
Western worlds is trying to catch up. (human or machine) any limitations or bias. Data
Regardless of which nation is developing AI quality and data management have never been
strategies and applications, there is one universal more important.
truth – when it comes to data, we have always Data management requires a renewed focus
been “taught garbage in equals garbage out.” If on policies and procedures. While data manage-
not properly trained or applied, AI can produce ment is applied in many ways depending on any
harmful results especially if there is no mecha- institution, one may look to the US federal gov-
nism in place to validate data and to audit what ernment for some examples and guidance of how
data and streams entered a recommendation or they plan to modernize and standardize data
decision. management. In March 2018, as part of the Pres-
Often, we learn more from our mistakes than ident’s Management Agenda (PMA), the admin-
from our successes. A case in point involves istration established a cross-agency priority
IBM’s Watson. Watson has been IBM’s flagship (CAP) goal focused on leveraging data as a stra-
platform for AI and was popularized in winning tegic asset to establish best practices for how
games of jeopardy on TV as well as beating a agencies manage and use data. As part of this
leading international chess player at chess. But CAP goal, the first-ever enterprise-wide Federal
one would learn that Watson’s medical applica- Data Strategy (FDS) was developed to establish
tions did not live up to its promise. IBM stated its standards, interoperability, and skills consistency
goal to the public that it hoped to create AI across agencies (https://strategy.data.gov/action-
doctor. But that never materialized, so they fur- plan/).
ther refined Watson. However, while medical Of the 20 action steps highlighted, there are 2
Watson learned rather quickly how to scan arti- that directly apply to this discussion, action steps
cles about clinical studies and determine basic 8 and 19. Step 8 to aims “Improve Data and
outcomes, it proved impossible to teach Watson Model Resources for AI Research and Develop-
to read the articles the way a doctor would. Later ment” is by its title directly related to AI. Step #8
they would learn more on how doctors absorb directly ties back to the 2019 Executive Order on
information they read from journal articles. They Maintaining American Leadership in Artificial
found that doctors often would look for other Intelligence (https://www.whitehouse.gov/presi
pieces of information that may not have been dential-actions/executive-order-maintaining-
the main point of the article itself. This was not american-leadership-artificial-intelligence/).
an anticipated outcome (https://spectrum.ieee. The order includes an objective to “Enhance
org/biomedical/diagnostics/how-ibm-watson- access to high-quality and fully traceable federal
overpromised-and-underdelivered-on-ai-health- data, models, and computing resources to
care). increase the value of such resources for AI
For years IBM programmers worked on devel- R&D, while maintaining safety, security, pri-
oping a system that could assist doctors in diag- vacy, and confidentiality protections consistent
nosing cancer as an augmented decision-maker, with applicable laws and policies.” The imple-
but real-life experience taught them something mentation guidance provides support to agen-
else. Due to weaknesses in quality control later cies in:
found in the way data was collected and
cataloged, results were often misleading and • Prioritizing the data assets and models under
could have ultimately led to a patient’s death. It their purview for discovery, access, and
had to be pulled from the market – at least for now. enhancement
Data Mining 303

• Assessing the level of effort needed to make encourages national data protection authorities to
necessary improvements in data sets and work together and share information and best
models, against available resources practices with one another.
• Developing justifications for additional The US government has thus far resisted a
resources national approach to data privacy which prompted
the State of California to pass its own privacy act
Action step 19 sets out to “Develop Data Qual- called The California Consumer Privacy Act
ity Measuring and Reporting Guidance.” As (CCPA) (https://www.oag.ca.gov/privacy/ccpa).
pointed out earlier, data quality is essential to As data management becomes more sophisticated D
advance AI. One of the key objectives is to iden- through technology and governance, the public
tify best practices for measuring and reporting on has demonstrated growing concern over privacy
the quality of data outputs created from multiple rights. When you add AI as the great multiplier of
sources or from secondary use of data assets data, a growing number of citizen advocates are
(https://strategy.data.gov/action-plan/#action-19- seeking ways to protect their personal data and
develop-data-quality-measuring-and-reporting- reputations and seek remedies and procedures to
guidance). Clearly data management is essential correct mistakes or to limit their exposure.
to the success and meaningful application of AI. As Western nations struggle to protect privacy
There is no doubt that AI will revolutionize the and maintain a healthy balance between personal
way we collect and interpret data into useful and information and the need to provide better eco-
actionable information. Despite some early disap- nomic, public health, and safety outcomes, AI
pointments, AI continues to hold great promise if continues to advance in ways that continue to
ingesting trillions of pieces of data can lead to defy our imaginations. Any sound data manage-
incredible medical and social science outcomes. ment plan must certainly address the issue of
It can be used in understanding and solving com- privacy.
plex issues regarding public policy and public For AI to reach its potential in the years ahead,
health and safety. AI holds the promise of making data management will always play a significant
rational determinations based on big data – and supporting role. Conversely, poor data manage-
pulling from an array of sources from among ment will yield disappointing results that could
structured and unstructured data sets. ultimately lead to less than optimal outcomes, and
With all the attention paid to AI’s potential worse could lead to the loss of lives. Sound data
regarding date management and AI, there are management is what feeds AI, and AI requires
also many related concerns beyond bias and nothing less than quality and verified intake.
ethics, and that is privacy. The European Union
gained much attention in 2016 when it passed its
landmark GDPR or General Data Privacy Act.
This act articulated seven basic principles – all Data Mining
aimed at protecting and providing redress for
inaccuracies and violations and how data is Gordon Alley-Young
viewed and how long it is stored. To help clarify Department of Communications and Performing
and manage the GDPR, the EU established The Arts, Kingsborough Community College, City
European Data Protection Board (EDPB) in 2018 University of New York, New York, NY, USA
(https://edpb.europa.eu/about-edpb/about-edpb_
en). It provides general guidance (including
guidelines, recommendations, and best practice) Synonyms
to clarify the GDPR and advises the European
Commission on data protection issues and any Anomaly detection; Association analysis; Cell
proposed new EU legislation of particular impor- phone data; Cluster analysis; Data brokers; Data
tance for the protection of personal data and mining algorithms; Data warehouse; Education;
304 Data Mining

Facebook; National Security Administration DM to manage all aspects of their customer


(NSA); Online analytical processing; Regression relations.
The 1990s saw the emergence of online ana-
lytical processing (OLAP) or computer pro-
Introduction cessing that quickly and easily chooses and
analyzes multidimensional data from various per-
Data mining (DM), also called knowledge discov- spectives. OLAP and DM are both similar and
ery in data (KDD), examines vast/intricate stores different, and thus can be used together. OLAP
of citizen, consumer, and user data to find pat- can summarize data, distribute costs, compile and
terns, correlations, connections and/or variations analyze data over time periods (time series analy-
in order to benefit organizations. DM serves to sis), and do what-if analysis (i.e., what if we
either describe what is happening and/or to predict change this value, how will it change our profit
what will happen in the future based on current or loss?). Unlike DM, OLAP systems are not CL
data. Information discovered from DM falls into systems in that they cannot find patterns that were
several categories including: associations, anom- not identified previously. DM can do this indepen-
alies, regressions, classifications, and clusters. dent pattern recognition, also called machine
DM uses algorithms and software programs to learning (ML), which is an outgrowth of 1950s
analyze data collections. The size and complexity AI inquiry. DM can inductively learn or make a
of a data collection also called a data warehouse general rule once it observes several instances.
(DWH) will determine how sophisticated the DM Once DM discovers a new pattern and identifies
system will need to be. The DM industry was the specific transaction data for this pattern, then
estimated to be worth 50 billion dollars in 2017, OLAP can be used to track this new pattern over
and aspects of the DM industry that help compa- time. In this way OLAP would not require a huge
nies eliminate waste and capitalize on future busi- data warehouse (DWH) like DM does because
ness trends are said to potentially increase an OLAP has been configured to only examine cer-
organization’s profitability threefold. Social tain parts of the transaction data.
media organizations profit from DM services
that coordinate advertising and marketing to its
users. DM has raised the concerns of those who Types of DM
fear that DM violates their privacy.
DM’s capacity for intuitive learning falls into five
categories: associations, anomalies, regressions,
History of DM classifications, and clusters. Association learning
(AL) or associations is DM that recommends
DM was coined during the 1960s but its roots products, advertisements, coupons and/or promo-
begin in the 1950s with innovations in the areas tional e-mails/mailings based on an online cus-
of artificial intelligence (AI) and computer learn- tomer’s purchasing profile or a point of sale (POS)
ing (CL). Initially DM was two separate func- customer’s data from scanned customer loyalty
tions: one that focused on retrieving information cards, customer surveys, and warranty pur-
and another that dealt with database processing. chases/registrations. AL allows retailers to create
From the 1970s to the 1990s, computer storage new products, design stores/websites, and stock
capacity and the development of computer pro- their products accordingly. AL uses association
gramming languages and various algorithms learning algorithms (ALA) to do what some
advanced the field, and by the 1990s Knowledge have called market basket analysis. Instead of
Discovery in Databases (KDD) was actively finding associations, Anomaly detection (AD),
being used. Falling data storage costs and rising also called anomalies, outlier detection (OD), or
computer processing rates during this decade novelty detection (ND) uses anomaly detection
meant that many businesses now used KDD or algorithms (ADA) to find phenomena outside of
Data Mining 305

a pattern. Credit card companies use AD to iden- could employ a junk mail/spam filter to prevent
tify possible fraudulent charges and governmental their customers from being bombarded with junk
tax agencies like the Internal Revenue Service e-mails so that the filter uses CA over time to learn
(IRS) in the USA, Canada Revenue Agency to recognize (e.g., from word patterns) what is
(CRA) in Canada, and Her Majesty’s Revenue spam and what is not. Hypothetically, companies
and Customs (HMRC) in the UK use it to find could also use CA to help them to design market-
tax fraud. ing e-mails to avoid junk mail/spam filters. This
In 2014 Facebook bought messaging service form of DM uses classification analysis algo-
WhatsApp in order, analysts argue, to expand rithms (CAA). D
their DM capabilities and increase profits. The final type of DM is called cluster detection
Founded by two former Yahoo employees, (CD) or clusters and is used for finding clusters of
WhatsApp allows users to send text, photo, and data (e.g., individuals) that are homogenous and
voice messages to and from any smartphone with- distinct from other clusters. Segmentation algo-
out a special setup for a small annual fee. Since rithms (SA) create clusters of data that share prop-
its release in 2009, WhatsApp has gained 500 mil- erties. CD has been used by the retail industry to
lion users and by spring 2014 the company find different clusters of consumers among their
announced that its users set a new record by send- customers. For example, customers could be clus-
ing 64 billion messages in a 24 h period. Owning tered based on their purchasing activities where a
WhatsApp gives Facebook access to users’ pri- cluster of one time impulse buyers will be distinct
vate messaging data for purposes of AL and the from the cluster of consumers who will continu-
better marketing of products to consumers. This is ally upgrade their purchases or the cluster of con-
because consumers’ private data may be more sumers who are likely to return their purchases.
accurate, for marketing purposes, than the public The National Security Administration (NSA) in
information they might share on open platforms the USA and Europol in the European Union
like Facebook. This is especially important in (EU) uses CD to locate clusters of possible terror-
countries like India, with an estimated 50 million ists and may use CD on telephone data. For exam-
WhatsApp users (100 million Facebook users) ple, if a known terrorist calls a citizen in the USA,
whose primary communication device is mobile. France, Belgium, or Germany, the contacts of that
Regression analysis (RA) makes predictions contacted person creates a data cluster of personal
for future activity based on current data. For contacts that could be mined to find people who
example, a computer application company might may be engaged in terrorist activity. CD data can
anticipate what new features and applications be used as evidence to justify potentially
would interest their users based on their browsing obtaining a warrant for wiretapping/lawful inter-
histories; this can shape the principles they use ception or for access to other personal records.
when designing future products. RA is done using
regression analysis algorithms (RAA). Much of
DM is predictive and examines consumers’ demo- Government, Commercial, and Other
graphics and characteristics to predict what they Applications of DM
are likely to buy in the future. Predictions have
probability (i.e., Is this prediction likely to be Governmental use of DM has proven controver-
true?) and confidence (i.e., How confident is this sial. For example, in the USA, in June 2013,
prediction?). former government contractor Edward Snowden
Classification analysis (CA) is DM that released a classified 41-slide PowerPoint presen-
operates by breaking data down into classes tation found while working for the NSA to two
(categories) and then breaking down new exam- journalists. Snowden charged that a NSA DM
ples into these classes (categories). CA works by program, codenamed PRISM, alleged to cost
creating and applying rules that solve categoriza- 20 million dollars a year, collected Internet
tion problems. For example, an Internet provider user’s photos, stored data, file transfers, e-mails,
306 Data Mining

chats, videos, and video conferences with the of basketball games. Advanced Scout analyzes
technology industry’s participation. These mate- player movements and outcomes to predict suc-
rials allege that Google and Facebook starting in cessful playing strategies for the future (i.e., RA)
2009, YouTube starting in 2010, AOL starting in and also to find unusual game outcomes (i.e., like
2011, and Apple starting in 2012 were providing a player’s scoring average differing considerably
the government with access to users’ information. from usual patterns (i.e., AD). The Union of
Subsequent to the release of this information, the European Football Associations (UEFA) and its
USA cancelled Snowden’s passport and Snowden affiliated leagues/clubs similarly collaborate
sought temporary asylum in Russia to escape with private DM firms for results and player
prosecution for his actions. tracking as well as for developing effective
At an open House Intelligence Committee team rosters.
Meeting in June of 2013, FBI Deputy Director Every transaction or interaction yields data that
Sean Joyce claimed that the PRISM program is captured and stored making the amounts of data
allowed the FBI in 2009 to identify and arrest unmanageable by human means alone. The DM
Mr. Najibullah Zazi, an airport shuttle driver industry has spawned a lucrative data broker
from outside Denver, who was subsequently (DB) industry (e.g., selling consumer data).
convicted in 2010 of conspiring to suicide Three of the largest DB’s are companies Acxiom,
bomb the New York City Subway system. Experian, and Epsilon who do not widely publi-
Snowden has alleged that low-level government cize the exact origins of their data or the names of
analysts have access to software that searches their corporate customers. DB company Acxiom
hundreds of databases without proper oversight. is reportedly the second largest of the three DBs
The White House denied abusing the informa- with approximately 23,000 computer servers that
tion it collects and countered that analyzing big process over 50 trillion yearly data transactions.
data (BD) (i.e., extremely large and complex Acxiom is reported to have records on hundreds
data sets) was helping to make the US govern- of millions US citizens with 1,500 pieces of data
ment work better by eliminating $115 million in per consumer (e.g., Internet browser cookies,
fraudulent medical payments and protecting mobile user profiles). Acxiom partnered with
national security all while preserving privacy company DataXpand to study consumers in
and civil rights. While not claiming that the Latin America and Spain as well as
USA has abused privacy or civil rights, civil US-Hispanics and Acxiom also has operations in
libertarians have spoken out against the poten- Asia and the EU.
tial for misuse that are represented by programs
like PRISM.
In addition to governmental use, retail and Conclusion
social media industries DM is used in education
and sports organizations. Online education uses It was estimated in 2012 that the world created 2.5
DM on data collected during computer instruc- quintillion bytes of data or 2.5  1018 bytes (i.e.,
tion, in concert with other data sources, to assess 2.5 exabytes (EB)) per day. To put this in perspec-
and improve instruction, courses, programs, and tive, in order to store 2.5 quintillion bytes of data,
educational outcomes by determining when stu- one would need to use over 36 million
dents are most/least engaged, on/off task or 64-gigabyte (GB) iPhones as this number of
experiencing certain emotional states (e.g., frus- devices would only provide an estimated
tration). Data collected include students’ demo- 2.30400 EB of storage. Of the data produced, it
graphics, learning logs/journals, surveys, and is estimated that over 30% is nonconsumer data
grades. The National Basketball Association (e.g., patient medical records) and just less than
(NBA) has used a DM program called Advanced 70% is consumer data (e.g., POS data, social
Scout to analyze data including image recordings media). By 2019, it is estimated that the emerging
Data Monetization 307

markets of Brazil, Russia, India, and China (BRIC


economies) will produce over 60% of the world’s Data Monetization
data. Increases in global data production will
increase the demand for DM services and Rhonda Wrzenski1, R. Bruce Anderson2,3 and
technology. Corey Koch3
1
Indiana University Southeast, New Albany, IN,
USA
Cross-References 2
Earth & Environment, Boston University,
Boston, MA, USA D
▶ Business Intelligence Analytics 3
Florida Southern College, Lakeland, FL, USA

References Given advances in technology, companies all


around the globe are now collecting data in order
Executive Office of the President. (2014). Big data: Seizing
to better serve their customers, limit the infringe-
opportunities, preserving values. Retrieved from http://
www.whitehouse.gov/sites/default/files/docs/big_data_ ment of company competitors, maximize profits,
privacy_report_may_1_2014.pdf. limit expenditures, reduce corporate risk-taking,
Frand, J. (n.d.). Data mining: What is data mining? and maintain productive relationships with busi-
Retrieved from http://www.anderson.ucla.edu/faculty/
jason.frand/teacher/technologies/palace/datamining.
ness partners. This utilization and monetization of
htmfrand/teacher/technologies/palace/datamining.htm. big data has the potential to reshape corporate
Furnas, A. (2012). Everything you wanted to know about data practices and to create new revenue streams either
mining but were afraid to ask. Retrieved from http:// through the creation of platforms that consumers
www.theatlantic.com/technology/archive/2012/04/every
or other third parties can interface with or through
thing-you-wanted-to-know-about-data-mining-but-were-
afraid-to-ask/255388/. revamped business practices that are guided by
Jackson, J. (2002). Data mining: A conceptual overview. data.
Communications of the Association for Information At its simplest, data monetization is the process
Systems, 8, 267–296.
of generating revenue from some source of data.
Oracle. (2016). What is data mining? Retrieved from
https://docs.oracle.com/cd/B28359_01/datamine.111/ Though recent technological developments have
b28129/process.htm#CHDFGCIJ. triggered exponential growth of the field, data
Pansare, P. (2014). You use Facebook: Or Facebook is monetization dates back to the 1800s with the
using you? Retrieved from http://epaper.dnaindia.
first mail order catalogs.
com/story.aspx?id¼66809&boxid¼30580&ed_
date¼2014-07-01&ed_code¼820040&ed_page¼5. In order to monetize data, a data supply chain
Pappalardo, J. (2013). NSA data mining: How it works. must exist and be utilized for the benefit of corpo-
Popular Mechanics, 190(9), 59. rations or consumers. The data supply chain is
US Senate Committee on Commerce, Science and Transpor-
composed of three main components: a data cre-
tation. (2013). A review of the data broker industry:
Collection, use, and sale of consumer data for marketing ator, a data aggregator, and a data consumer. In the
purposes: Staff report for chairman Rockefeller. process of data monetization, data must first be
Retrieved from http://www.commerce.senate.gov/pub created. That created data must then be available
lic/?a¼Files.Serve&File_id¼0d2b3642-6221-4888-
for use. The data can then be recognized, linked,
a631-08f2f255b577.
Yettick, H. (2014). Data mining opens window on student amassed, qualified, authenticated, and, finally,
engagement. Education Week, 33, 23. traded. In the process, the data must be stored
securely to prevent cyber attacks.
An individual data creator must input data.
This could be a person utilizing some type of
Data Mining Algorithms application programming interface or a sensor,
database, log, web server, or software algorithm
▶ Data Mining of some sort that routinely collects or generates
308 Data Monetization

the data. The data source can be compiled over After data has been aggregated, it can be sold.
time or instantaneously streamed. However, raw data is much like crude oil: it
Institutions also generate data on a large scale requires refining. Data is typically processed,
to include their transactions and computer inter- abridged, and stored. At this point, the data is
actions. This data generation also puts institutions known as processed data. From processed data
at risk of unauthorized extraction of their gener- derives data insights, which encompass the fields
ated data. of data mining, predictive modeling, analytics,
The value of this data comes from its ability to and hard data science, which are defined in more
be researched and analyzed for metadata and out- detail below. Through these specialized fields,
come information. Outcome information can be data has its true value; these fields bring about
simple knowledge from the data. This information the benefits data has to offer. Once data has been
can be used to streamline production processes, processed in these fields, outcome information is
enhance development, or formulate predictions of available. Outcome information is the final prod-
future activity. The entire logistics field has come uct of the data refining process and can be used as
to be based on self-data analysis for the sake of business intelligence to improve commerce.
some type of corporate improvement. When this Data mining is the term used to describe the
improvement comes to cause an increase in prof- practice of analyzing large amounts of data to
itability, the data was effectively monetized. discern patterns and relationships in the data.
When individuals create data, they are not typ- From this, decisions can be made that are aided
ically compensated. Legally, the individual has by the information. The compilation of a profile
ownership of the data he or she creates. However, based on this information is known as data profil-
many Internet interfaces, such as Google, require ing. Data mining and profiling produced finished
a data ownership waiver for users to access their products from the data supply and refinement
site. This means that Google owns all user data chain that can be used for a variety of practical
from its site by default. However, without such a applications. These products are not only valuable
waiver, the information belongs to its creator, for traditional commercial applications but also
the individual user. Alas, many users do not rec- for political and defense applications. For
ognize that they waive their personal data rights. instance, political candidates can use big data to
Moreover, some users simply access the site with- target segments of the voting population with
out an official account, meaning that they have not direct mail or advertisements that appeal to their
specifically waived their personal data ownership interests or policy preferences.
rights. Nonetheless, Google still aggregates this Because of the volatility and power of data in
user data for profit. the modern age, government agencies like the
Once data is created, it must be aggregated. Federal Bureau of Investigation (FBI) and the
Data aggregation is the process of compiling National Security Agency (NSA) in the United
information from varying sources. The aggrega- States are regularly involved in data aggregation
tion of data can be used for many things, from and consumption itself. For example, the NSA
scientific research to commercial advancements. maintains data on call detail records and social
For instance, by aggregating Covid-19 data from media announcements. This information is pro-
various sources like the World Health Organiza- cessed and used for counterterrorism initiatives
tion (WHO), the European Center for Disease and to monitor or predict criminal activity. The
Control and Prevention (ECDC), and the Centers NSA contracts with various companies, such as
for Disease Control and Prevention (CDC), one Dataminr and Venntel, Inc., to compile data from
can better track the cases, death rates, hospitaliza- a multitude of sources.
tion status, and recovery of patients around the Predictive modeling is the creation of models
globe. based on probability. Most often used to predict
Data Monetization 309

future outcomes, predictive modeling is a product In the retail industry, billions of records are
of data refinement that can be applied to nearly generated daily. Many retailers have long been
any kind of unknown event. One example of how using sales logistics as a means of data analysis.
this technique can be used by corporations would The data generated in sales was formerly reserved
be a business using data on consumer purchasing to advance the interests of only that retailer to
behavior to target those most likely to shop with whom the data belongs. In the modern commer-
special digital or in-store coupons. cial world, retailers exchange data to simulta-
Data analytics is a broader field that encom- neously track competitors’ sales, which allows
passes the examination, refining, transforming, for a higher degree of data analysis. D
and modeling of data. Data analysis refers to An example of retail data monetization can be
the process of scrutinizing data for useful found in Target, a retailer in the United States
information. with websites that can be used by domestic and
Data science is the even broader field that international consumers. This retailer built a
involves using statistical methods and algorithms marketing strategy that used predictive data anal-
to extract and derive general knowledge from ysis as a means of trying to ascertain whether a
data. For instance, a data scientist can create a customer was pregnant. By using consumer-cre-
dashboard to allow consumers or clients to inter- ated Target baby shower registry records, they
face with the data, to track trends, and to visualize were able to track purchases that many pregnant
the information. women had in common and use that to predict
Some industries operate under federal regula- other customers who might be pregnant along
tion in the United States that prohibits the free with their anticipated due dates. This allowed
sharing of consumer personal information. The the corporation to send coupons to non-regis-
Health Information Technology for Economic tered expectant mothers. Target reasons that if
and Clinical Health Act (HITECH) and the Health they can secure new parents as customers in the
Insurance Portability and Accountability Act second or third trimester of pregnancy, they can
(HIPAA) are two laws in the healthcare industry count on them as being loyal to Target’s brand for
within the United States that limit the ready at least a few years. In addition, the simple use of
transfer of personal information. Nonetheless, a credit card authenticates a Guest ID at Target.
healthcare corporations recognize the value of Target can then gather information on you
data and operate within these federal confines to through their records, or they can purchase data
offer somewhat synchronized care systems. This from other data creators. This can enable a
data sharing contains the potential for better care – retailer to use your purchasing history or real-
if your doctor can access your full medical history, time purchasing behavior to guide you to other
he or she can likely improve his or her products you might like or to entice you to make
effectiveness. repeat purchases through targeted promotions or
The Personal Data Ecosystem Consortium coupons. This is highly beneficial to corporations
(PDEC) was also created in 2010 with the stated given the pace of commerce in the twenty-first
purpose of connecting start-up business to share century.
personal data for reciprocal benefit among mem- The financial services industry is another prime
bers. The PDEC also advocates for individuals to example of data monetization in action. Credit
be empowered to access and utilize their own card companies have long been in the business
data. Such cooperatives are also appearing on a of using transaction data for profit maximization
much larger scale. Industry lines are disappearing and also selling such data.
and corporations are teaming up, sharing data, and This data is also a liability. Increasingly, finan-
mutually improving their business models with cial institutions have become targets of cyber
more data analysis. attacks, and cyber security has become an
310 Data Monitoring

increasing concern for businesses of all types.


Cyber attacks can potentially steal credit card Data Munging and Wrangling
data for a multitude of customers, which can
undercut consumer confidence or reduce future Scott N. Romaniuk
business from wary customers once the breach is University of South Wales, Pontypridd, UK
made known to the consumer and broader public.
In August 2014, the previously referenced Federal
Bureau of Investigation (FBI) opened an investi- The terms “Data Munging” and “Data Wrangling”
gation into hacking attacks on seven of the top 15 (also refers to “data cleaning”) are common terms
banks, most notably JPMorgan Chase. The FBI is in the world of programmers and researchers.
still unsure as to the nature or origin of the attack, They are interchangeable and refer to the manual
and it is also unclear whether the hackers accessed conversion of raw data into another form that
consumer banking or investment accounts. makes it easier for the programmer, researcher,
Around Thanksgiving the year prior, mega- and others to understand and to work with. This
retailer Target was hit with a similarly unprece- process also involves what is referred to as the
dented cyber attack. In this instance, hackers may “mapping” of raw forms of data or data files (e.g.,
have accessed information from as many as 110 txt, csv, xml, and json), and applying it to another
million customers. The origin of that cyber attack format.
remains unknown. What is known of the attack is During the course of performing data analysis
that over 40 million credit card numbers were and visualization, the performance of which are
stolen. referred to as “data science,” researchers often
face the creation of messy data sets, and this is
especially the case with larger and more complex
Further Reading data sets. Data munging, therefore, describes the
process of sorting through either small or large
About Us. Personal Data Ecosystem Consortium. http:// data sets, which can become messy and disor-
pde.cc/aboutus/. derly, and “cleaning” it up or manipulating
Data Mining and Profiling. SAGE encyclopedia on Big
Data.
it. This process is often accomplished with the
Data Monetization in the Age of Big Data. Accenture. aim of creating a final or conclusive form of
http://www.accenture.com/SiteCollectionDocuments/ data, and data presentation or recognition. After
PDF/Accenture-Data-Monetization-in-the-Age-of- cleaning the data, it can then be used more effi-
Big-Data.pdf.
FBI Expands Ability to Collect Cellphone Location Data,
ciently and lend itself to other uses. It can also
Monitor Social Media, Recent Contracts Show. The involve the manipulation of multiple data sets.
Intercept. https://theintercept.com/2020/06/24/fbi-sur Simplified, the primary steps involved in data
veillance-social-media-cellphone-dataminr-venntel/. munging are:
FBI investigating hacking attack on JPMorgan. CNN
Money. http://money.cnn.com/2014/08/27/investing/
jpmorgan-hack-russia-putin/. • Addressing variable and observation names
How Companies Learn Your Secrets. The New York Times. (rows and columns) including the creation of
http://www.nytimes.com/2012/02/19/magazine/shop new variables that might be required
ping-habits.html?pagewanted¼all&_r¼0.
Target: Hacking hit up to 110 Million Customers. CNN
• Consolidate data into a single unified data set
Money. http://money.cnn.com/2014/01/10/news/com • Molding or shaping data (address missing data/
panies/target-hacking/. values, dropping data, dealing with outliers,
balance the data/ensure consistency)

The creation of “tidy data” is important for


Data Monitoring handling data and moving them between pro-
grams and sharing them with others and can
▶ Data Profiling often be a painstaking process, but one that is
Data Munging and Wrangling 311

critical for working efficiently and effectively important issue to address in work using regres-
with data to be analyzed. The idea of “tidy data” sion. Missing data results in difficulty in deter-
refers to the ease with which data and data sets can mining the impact on regression coefficients.
be navigated. Some of the core features include Proceeding with data analysis using data sets
the placement of data such as variables in rows with missing values can also lead to a distortion
and columns, the removal of errors in the data, within the analysis, or bias, and an overall less-
ensuring internal consistency, and ensuring that desirable quality of work.
data has been converted into complementary For example, during the course of field
formats. Language harmonization falls under the research involving the distribution of question- D
category of “tidy data,” including the synchroni- naires, more than half of the questionnaires
zation of communication elements so that vari- returned with unchecked boxes can result in the
ables of the same “type” can be grouped misrepresentation of a given condition being stud-
together. For example, “men” and “males,” and ied. Values can be missing for various reasons,
“women” and “females,” representing the same including responses improperly recorded and
two populations can be grouped using a single even a lack of interest on the part of the individual
label for both. This can be applied to a range of filling out the questionnaire. It is, therefore, nec-
items that share similar enough features or condi- essary to look at the type of missing values that
tions that they can be grouped accordingly. exist in a given data set.
Data munging can be performed through the In order to address this problem, the program-
use of a variety of software and tools, including mer or researcher will undertake what is called a
but not limited to Pandas, R, Stata, and SPSS, all “filtering” process. “Filtering” in this context refers
of which present the programmer or researcher simply to the deliberate removal of data, resulting
with useful data manipulation tools or capabili- in a tidier dataset. The process can also involve
ties. Python is one of the most popular Python sorting data to present or highlight different aspects
packages for creating and managing data struc- of the dataset. For example, if a researcher exam-
tures. While data sets can be “messy,” the term ines a set number of organizations in a particular
does not necessarily imply that a given data set is city or country, they may want to organize the
not useful or has not been created properly. responses by a specific organization type or cate-
A “messy” data set might be difficult to work gory. Categorization in this sense can be made
with and therefore requires some preprocessing along the lines of gender, type of organization, or
technique in preparation for further work or pre- area(s) of operation. Organizing the data in this
sentation, depending on what purpose is intended way can also be referred to as an isolation process.
for the data. This process is useful for exploratory and
A programmer or researcher, for example, can: descriptive data analysis and investigation. If a
(a)correspond different data values so they are researcher is interested in examining country sup-
formatted the same way; (b) delete or harmonize port for political issues, such as democracy, coun-
data vocabulary so that data are easier to locate, tries can be categorized along, for example, region
and address the issue of missing values in data or regime type. In doing so, the programmer or
sets; (c) transfer data sets from one programming researcher can illustrate which regime type
tool or package to another, removing certain (s) is/are more likely to be interested in politics
values that are either irrelevant to the research or or political issues. Another example can involve
information that the programmer or researcher the acquisition of data on the number of people
wants to present, and separate or merge rows and who smoke or consume alcohol excessively in a
columns. society. Data could then be sorted according to
Compiling data often results in the creation of gender, age, or location, and so on.
data sets with missing values. Missing values can Data munging may involve the categorical
lead to problems of representation or internal and arrangement or grouping of information. It is pos-
external validity in research, which is an sible to determine if there is a relationship
312 Data Pre-processing

between two or more variables or conditions. 111–112. https://www.sciencedirect.com/science/arti


Using the previous examples, a researcher may cle/pii/S2405896315001986.
Foxwell, H. J. (2020). Cleaning your data. In Creating
want to determine if there exists any correlation good data. Berkeley: Apress. https://doi.org/10.1007/
between gender and smoking, or country type and 978-1-4842-6103-3_8.
interest in political issues or involvement in polit- https://www.elderresearch.com/blog/what-is-data-
ical processes. Various groupings can also be wrangling-and-why-does-it-take-so-long/.
Skiena, S. S. (2017). Data munging. In The data science
performed to determine the existence of further design manual. Texts in computer science. Cham:
correlations. Doing so can lead to the creation of a Springer. https://doi.org/10.1007/978-3-319-55444-0_3.
more robust data structure. Thurber, M. (2018, April 6). What is data wrangling and
Data munging as a traditional method, has why does it take so long? Elder Research. https://www.
elderresearch.com/blog/what-is-datawrangling-and-
been referred to as an “old(er)” and “outdated” why-does-it-take-so-long/.
process, given that it was invented decades ago. Wähner, K. (2017, March 5). Data preprocessing vs. Data
Recently, more streamlined and integrated wrangling in machine learning projects. InfoQ. infoq.
methods of data cleaning or arrangement have com/articles/ml-data-processing/.
Wiley, M., & Wiley, J. F. (2016). Data munging with Data.
been formulated and are now available, such as Table. In Advanced R. Berkeley: Apress. https://doi.
Power Query. Data munging can take a great deal org/10.1007/978-1-4842-2077-1_8.
of time and since the process is a manual one, the
outcome of the process can still contain errors.
A single mistake at any stage of the process can
result in subsequent and unintended errors pro- Data Pre-processing
duced due to the initial error. Nonetheless, data
munging constitutes an important part of the data ▶ Data Cleansing
handling and analysis process and is one that
requires careful attention to detail as is the case
with other types of research. The practice of
treating data in ways discussed is akin to filling Data Preservation
holes in a wall, smoothing the surface and apply-
ing primer before painting. It is a critical step in ▶ Data Provenance
preparing data for further use and will ultimately
aid the researcher working with data in various
quantities.
Data Privacy
Cross-References ▶ Anonymization Techniques

▶ Data Aggregation
▶ Data Cleansing
▶ Data Integrity Data Processing
▶ Data Processing
▶ Data Quality Management Fang Huang
Tetherless World Constellation, Rensselaer
Polytechnic Institute, Troy, NY, USA
Further Reading

Clark, D. (2020). Data munging with power query. In Synonyms


Beginning Microsoft Power BI. Berkeley: Apress.
https://doi.org/10.1007/978-1-4842-5620-6_3.
Endel, F., & Piringer, H. (2015). Data wrangling: Making Data discovery; DP; Information discovery; Infor-
data useful again. IFAC-PapersOnLine, 48(1), mation extraction
Data Processing 313

Introduction

Data processing (DP) refers to the extraction of information through organizing, indexing, and manipulating data. Information here means valuable relationships and patterns that can help solve problems of interest. Historically, the capability and efficiency of DP have improved with the advancement of technology, and processing involving intensive human labor has been gradually replaced by machines and computers. The methods of DP refer to the techniques and algorithms used for extracting information from data. For example, processing of facial recognition data needs classification, and processing of climate data requires time series analysis. The results of DP, i.e., the information extracted, also depend largely on data quality. Data quality problems like missing values and duplications can be solved through various methods, but some systematic issues like equipment design error and data collection bias are harder to overcome at this stage. All these aspects influencing DP will be covered in later sections, but let’s look at the history of DP first.

History

With the advancement of technology, the history of DP can be divided into three stages: Manual DP, Mechanical DP, and Electronic DP. The goal is to finally use Electronic DP to replace the other two to reduce error and improve efficiency.

• Manual DP is to process data with little or no aid from machines. Before the stage of Mechanical DP, only small-scale DP could be done, and it was slow and easily brought in errors. Having said that, Manual DP still exists at present, usually because the data are hard to digitize or not machine readable, like in the case of retrieving sophisticated information from old books or records.
• Mechanical DP is to process data with help from mechanical devices (not modern computers). This stage started in 1890 (Bohme et al. 1991). During that year, the US Census Bureau installed a system consisting of complicated punch card machines to help tabulate the results of a recent national census of population. All data were organized, indexed, and classified, and searching and computing were made easier and faster than manual work with the punch card system.
• Electronic DP is to process data using computers and other advanced electronic devices. Nowadays, Electronic DP is the most common method and is still quickly evolving. It is widely seen in online banking, ecommerce, scientific computing, and other activities. Electronic DP provides the best accuracy and speed. Without specification, all DP discussed in later sections is Electronic DP.

Methods

The methods of DP here refer to the techniques and algorithms used for extracting information from data, which vary a lot with the information of interest and data types. One definition for data is “Data are encodings that represent the qualitative or quantitative attributes of a variable or set of variables” (Fox 2018). The categorical type of data can represent qualitative attributes, like the eye color, gender, and jobs of a person; and the quantitative attributes can be represented by the numerical type of data, like the weight, height, and salary of a person. Based on data types and patterns of interest, we can choose from several DP methods (Fox 2018), such as:

• Classification is a method that uses a classifier to put unclassified data into existing categories. The classifier is trained using categorized data labeled by experts, so it is one type of supervised learning in machine learning terminology. Classification works well with categorical data.
• Regression is a method to study the relationship between a dependent variable and other independent variables. The relationship can be used for predicting future results. Regression usually uses numerical data.
• Clustering is a method to find distinct groups of data based on their characteristics. Social media companies usually use it to identify people with similar interests. It is a type of unsupervised learning and works with both qualitative and quantitative data.
• Association Rule Mining is a method to find relationships between variables such as “which things or events occur together frequently in the dataset?”. This method was initially developed to do market basket analysis. Researchers in mineralogy apply association rule mining to use one mineral as the indicator of other minerals.
• Outlier Analysis, also called anomaly detection, is a method to find data items that are different from the majority of the data.
• Time Series Analysis is a set of methods to detect trends and patterns from time series data. Time series data are a set of data indexed with time, and this type of data is used in many different domains.

DP methods are a key part of generating information, and the abovementioned ones are just a partial list. Data quality is another thing that can influence the results of DP.
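As a minimal illustration of two of the methods listed above, the following Python sketch — assuming the scikit-learn library and a small, made-up set of measurements — trains a classifier on labeled examples (supervised learning) and then groups the same examples with a clustering algorithm (unsupervised learning).

# Hypothetical toy data: each row is [height_cm, weight_kg]
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

measurements = [[150, 50], [160, 58], [175, 80], [182, 90], [168, 62], [190, 95]]
labels = ["child", "child", "adult", "adult", "adult", "adult"]  # expert-provided labels

# Classification: learn from labeled data, then categorize a new item
classifier = DecisionTreeClassifier().fit(measurements, labels)
print(classifier.predict([[158, 55]]))  # e.g., ['child']

# Clustering: find two groups without using any labels
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(measurements)
print(clusters)  # cluster index assigned to each row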
Data Quality

Thota (2018) defined data quality as “the extent to which some data successfully serve the purposes of the user.” In the case of DP, this purpose is to get correct information, which involves two levels of data quality. Level 1 is the issues in the data themselves, such as missing values, duplications, inconsistency among data, and so on. Level 2 is accuracy, which is the distance between the data and the real values. Usually level 1 quality problems are easier to solve compared to level 2.

Level 1 issues can be discovered through exploratory inspection and visualization tools, and the issues can be solved accordingly:

• Missing data: the direct solution is to go back and gather more information. If that is not possible, a common solution is to use domain knowledge or existing algorithms to impute them based on correlations between different attributes.
• Duplications: we can index the data based on certain attributes to find out duplications and remove them accordingly.
• Inconsistency: check the metadata to see the reason for inconsistency. Metadata are supplementary descriptive information about the data. Common issues are inconsistent data units and formats, which can be converted accordingly.
• Outliers: if they are proven to be errors, we can make corrections. In other cases, outlier data should still be included.

Level 2 issues usually originate from equipment design error and/or data collection bias, in which case data are uniformly off the true values. This makes level 2 issues harder to see in data validation. They can possibly be seen when domain experts check the results.
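A brief sketch of how the level 1 issues above might be handled in practice, using the pandas library on a hypothetical sensor table (the column names and the unit-conversion rule are illustrative assumptions, not part of the original entry):

import pandas as pd

df = pd.DataFrame({
    "station": ["A", "A", "B", "B", "B"],
    "temp_c":  [21.5, None, 19.0, 19.0, 70.2],   # a missing value and a suspicious reading
    "unit":    ["C", "C", "C", "C", "F"],         # inconsistent units
})

# Duplications: drop rows that are exact repeats
df = df.drop_duplicates()

# Inconsistency: convert Fahrenheit rows to Celsius so all rows share one unit
fahrenheit = df["unit"] == "F"
df.loc[fahrenheit, "temp_c"] = (df.loc[fahrenheit, "temp_c"] - 32) * 5 / 9
df["unit"] = "C"

# Missing data: impute from related records (here, the mean of the same station)
df["temp_c"] = df["temp_c"].fillna(df.groupby("station")["temp_c"].transform("mean"))
print(df)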
Data Processing for Big Data

Big data are big in three aspects: volume, velocity, and variety (Hurwitz et al. 2013). Variety has been discussed in the data quality section. Now we need to have a model capable of storing and processing a high volume of data at relatively fast speed and/or dealing with fast and continuous high-speed incoming data in a short response time.

Batch processing works with extremely large static data. In this case, there is no new data coming in during processing, and the data are usually stored on more than one device. The large dataset is grouped into smaller batches and processed, respectively, with results combined later. The processing time is relatively long, so this model is not suitable for real-time tasks.

Stream processing, in contrast to batch processing, applies to dynamic and continuous data (new data keeps coming in). This model usually works for tasks that require short response time. For example, hotel and airline reservation systems need to provide instant feedback to customers. Theoretically, this model can handle an unlimited amount of data as long as the system has enough capacity.

Mixed processing could process both batch and stream data.

Big Data Processing Frameworks
A framework in computer science is a special library of generic programs, which can perform specific tasks after adding some code for actual functionality. Widely used frameworks are usually well written and tested. Simple Python scripts can take care of small datasets, but for big data systems, “. . . processing frameworks and processing engines are responsible for computing over data in a data system” (Ellingwood 2016). This section will cover five popular open-source big data processing frameworks from Apache.

1. Apache Hadoop: As a well-tested batch-processing framework, Apache Hadoop was first developed by Yahoo to build a “search engine” to compete with Google. Later Yahoo found its great potential in DP. Hadoop’s cross-system compatibility, distributed system architecture, and open-source policy made it popular with developers. Hadoop can easily scale up from one individual machine (for testing and debugging) to data servers with a large number of nodes (for large-scale computing). Due to its distributed architecture, a high-workload system can be extended by adding new nodes, and batch processing can be made more efficient through parallel computing (Ellingwood 2016). Major components of Hadoop are:
   Hadoop Distributed File System (HDFS): HDFS is the underlying file managing and coordinating system that ensures efficient data file communication across all nodes.
   Yet Another Resource Negotiator (YARN): YARN is the manager that schedules tasks and manages system resources.
   MapReduce: MapReduce is the programming model taking advantage of the “divide and conquer” algorithm to speed up DP.
2. Apache Storm: Apache Storm is a stream processing framework suitable for highly responsive DP. In a stream processing scenario, data coming into the system are continuous and unbounded. To achieve the goal of delivering results in nearly real time, Storm will divide the incoming data stream into small and discrete units (smaller than batches) for processing. These discrete steps are called bolts. Native Storm does not keep operations on bolts in order, which has to be solved by adding extra modules like Trident (Ellingwood 2016). The bottom line is, Storm is highly efficient and supports multiple programming languages, so it is suitable for low-latency stream processing tasks.
3. Apache Samza: “Apache Samza is a stream processing framework that is tightly tied to the Apache Kafka messaging system” (Ellingwood 2016). Apache Kafka is a distributed streaming platform that processes streams in the order of occurrence time and keeps an immutable log. Hence, Samza can natively overcome the ordering problem in Storm and enable real-time collaboration on DP between multiple teams and applications in big organizations. However, compared to Storm, Samza has higher latency and less flexibility in programming language (it only supports Java and Scala).
4. Apache Spark: Apache Spark is the next-generation framework that combines batch and stream processing capabilities. Compared to Hadoop MapReduce, Spark processes data much faster due to its optimization on in-memory processing and task scheduling. Furthermore, the deployment of Spark is more flexible – it can run individually on a single system or replace the MapReduce engine and be incorporated into a Hadoop system. Beyond that, Spark programs are easier to write because of an ecosystem of existing libraries. It generally does a better job in batch processing than Hadoop; however, it might not fit extremely low-latency stream processing tasks. Also, devices using Spark need to install larger RAM, which increases costs. Spark is a versatile framework that fits diverse processing workloads.
5. Apache Flink: The Apache Flink framework handles both stream and batch processing workloads. It simply treats batch tasks as bounded stream data. This stream-only approach offers Flink fast processing speed and real in-order processing. It is probably the best choice for organizations with strong needs for stream processing and some needs for batch processing. Flink is relatively flexible because it is compatible with both Storm and Hadoop. However, its scaling capability is somewhat limited due to its short history.

Many open-source big data processing systems are available on the market, and each has its own strengths and drawbacks. There is no “best” framework and no single framework that can address all user needs. We need to choose the right one or combine multiple frameworks based on the needs of real-world projects.
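To make the “divide and conquer” idea behind MapReduce concrete, the following self-contained Python sketch counts words by mapping each chunk of text to (word, 1) pairs and then reducing the pairs by key; a real Hadoop or Spark job distributes these same two phases across many nodes (the sample text and function names here are illustrative only).

from collections import Counter
from itertools import chain

documents = ["big data needs processing", "stream processing and batch processing"]

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in one chunk of input
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word (the "key")
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

mapped = chain.from_iterable(map_phase(d) for d in documents)  # run on each batch/node
print(reduce_phase(mapped))  # {'big': 1, 'data': 1, ..., 'processing': 3}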
Further Reading

Bohme, F., Wyatt, J. P., & Curry, J. P. (1991). 100 years of data processing: The punchcard century (Vol. 3). US Department of Commerce, Bureau of the Census, Data User Services Division.
Ellingwood, J. (2016). Hadoop, storm, samza, spark, and flink: Big data frameworks compared. Retrieved 25 Feb 2019, from https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared.
Fox, P. (2018). Data analytics course. Retrieved 25 Feb 2019, from https://tw.rpi.edu/web/courses/DataAnalytics/2018.
Hurwitz, J. S., Nugent, A., Halper, F., & Kaufman, M. (2013). Big data for dummies. Hoboken: Wiley.
Thota, S. (2018). Big data quality. In Encyclopedia of Big Data. Springer.

Data Profiling

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Synonyms

Data monitoring

Data profiling is the systematic analysis of a data source, typically prior to any specific use, to determine how useful it is and how best to work with it. This analysis will typically address matters such as what information the source contains, the metadata about the information, the quality of data, whether or not there are issues such as missing or erroneous elements, and patterns present within the data that may influence its use. Data profiling helps identify and improve data quality, which in turn improves the information systems built upon them (Azeroual et al. 2018).

Errors are inevitable in any large human project. Particularly in collecting big data, some typical types of errors include (Azeroual et al. 2018):

• Missing data
• Incorrect information
• Duplicate data
• Inconsistently represented data

For example, the “telephone number” column of a large customer database should be expected to contain telephone numbers (Kimball 2004). In the United States and Canada, such a number is defined by ten numeric digits but might be stored as a string. Empty cells may or may not represent genuine data (where the customer refused to provide a number) but may also represent data entry errors. Entries with alphanumeric characters may be marketing schemes but are likely to be errors. Even correct data may be duplicated or represented inconsistently (212-555-1234, (212) 555-1234, and +1 212 555 1234 are the same number, written differently; similarly, Scranton, PA, Scranton, Penna., and Scranton, Pennsylvania are the same city).
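A small profiling pass over such a column can be sketched in Python with pandas and a regular expression; the sample data, the ten-digit pattern, and the normalization rule below are illustrative assumptions rather than a prescribed procedure.

import pandas as pd

phones = pd.Series(["212-555-1234", "(212) 555-1234", "+1 212 555 1234", "", "CALL-NOW", None])

# Normalize every entry to bare digits, dropping a leading US country code
digits = phones.fillna("").str.replace(r"\D", "", regex=True).str.lstrip("1")

profile = {
    "rows":              len(phones),
    "empty_or_no_digits": int((digits == "").sum()),
    "valid_10_digit":    int(digits.str.fullmatch(r"\d{10}").sum()),
    "distinct_numbers":  digits[digits != ""].nunique(),  # inconsistent forms collapse here
}
print(profile)  # {'rows': 6, 'empty_or_no_digits': 3, 'valid_10_digit': 3, 'distinct_numbers': 1}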
In addition to actual errors, data can be profiled for bias and representativeness. Depending upon how the data is collected, not all elements of the relevant universe will be equally sampled, which in turn will bias inferences drawn from the database and reduce their accuracy.

The data profiling process will typically involve reviewing both the structure and content of the data. It will confirm that the data actually describe what the metadata say they should (e.g., that the “state” column contains valid states and not telephone numbers). It should further identify relationships between columns and should label problematic outliers or anomalies in the data. It should confirm that any required dependencies hold (for example, the “date-of-birth” should not be later than the “date-of-death”). Ideally, it may be possible to fix some of the issues identified or to enhance the data by using additional information. If nothing else, data profiling will provide a basis for a simple “Go–No Go” decision about whether the proposed project can go forward or about whether the proposed database is useful (Kimball 2004). If the needed data are not in the database, or the data quality is too low, it is better to learn (early) that the project cannot continue.
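Structural checks of this kind are straightforward to automate; the sketch below, assuming pandas and a hypothetical table of people, flags rows whose “state” value is not a known code and rows that violate the date-of-birth/date-of-death dependency.

import pandas as pd

people = pd.DataFrame({
    "state":         ["PA", "NY", "212-555-1234"],
    "date_of_birth": pd.to_datetime(["1950-01-01", "1980-06-15", "1990-03-02"]),
    "date_of_death": pd.to_datetime(["2010-05-20", "1979-01-01", pd.NaT]),
})

VALID_STATES = {"PA", "NY", "CA"}  # abbreviated list for illustration

# Content check: does the column contain what the metadata claim it should?
bad_state = ~people["state"].isin(VALID_STATES)

# Dependency check: date of birth must not be later than date of death
bad_dates = people["date_of_death"].notna() & (people["date_of_birth"] > people["date_of_death"])

print(people[bad_state | bad_dates])  # rows needing review before the project proceeds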
In addition to improving data quality, data profiling also can be used for data exploration, by providing easily understandable summaries of new datasets. Abedjan et al. (2015) provide as examples “files downloaded from the Web, old database dumps, or newly gained access to some [databases]” with “no known schema, no or old documentation, etc.” Learning what is actually stored in these databases can enhance their usefulness. Similarly, data profiling can be used to help optimize a database or even to reverse-engineer it. Perhaps most importantly, data profiling can ease the task of integrating several independent databases to help solve a larger problem to which they are all relevant. Data profiling is, therefore, as argued by Kimball (2004), a highly valuable step in any data-warehousing project.

Cross-References

▶ Data Quality Management
▶ Metadata

Further Reading

Abedjan, Z., Golab, L., & Naumann, F. (2015). Profiling relational data: A survey. The VLDB Journal, 24, 557–581. https://doi.org/10.1007/s00778-015-0389-y.
Azeroual, O., Saake, G., & Schallehn, E. (2018). Analyzing data quality issues in research information systems via data profiling. International Journal of Information Management, 41, 50–56. ISSN 0268-4012. https://doi.org/10.1016/j.ijinfomgt.2018.02.007.
Kimball, R. (2004). Kimball design tip #59: Surprising value of data profiling. Number 59, September 14, 2004. http://www.kimballgroup.com/wpcontent/uploads/2012/05/DT59SurprisingValue.pdf.

Data Provenance

Ashiq Imran1 and Rajeev Agrawal2
1Department of Computer Science & Engineering, University of Texas at Arlington, Arlington, TX, USA
2Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, USA

Synonyms

Big data; Cybersecurity; Data preservation; Data provenance; Data security; Data security management; Metadata; Privacy

Introduction

Data provenance refers to the description of the origin, creation, and propagation process of data. Provenance is the lineage and the derivation of the data, the documented history of an object – in other words, how the object was created, modified, propagated, and disseminated to its current location/status. By observing the provenance of an object, we can infer the trustworthiness of the object.
It stores ownership and process history about data objects. Provenance has been studied extensively in the past, and people usually use provenance to validate physical objects in arts, literary works, manuscripts, etc. Recently, the domain of provenance has gained significant attention in the digital world and e-science. The provenance of data is crucial for validating, debugging, auditing, and evaluating the quality of data and determining reliability in data. In today’s period of the Internet, complex ecosystems of data are even more ubiquitous. Provenance has typically been considered by the database, workflow, and distributed system communities. Capturing provenance can be a burdensome and labor-intensive task.

With the growing inundation of scientific data, a detailed description of metadata is important for scientists to share data and find data and scientific results. Scientific workflows assist scientists and programmers with tracking their data through all transformations, analyses, and interpretations. Data sets become trustworthy when the process used to create them is reproducible and analyzable for defects. Current initiatives to effectively manage, share, and reuse ecological data are indicative of the increasing importance of data provenance. Examples of these initiatives are the National Science Foundation DataNet projects, Data Conservancy, and DataONE.

The big data concept refers to a database which is continuously expanding and slowly becomes difficult to control and manage. The difficulty can be related to data capture, storage, search, sharing, analytics, visualization, etc. Provenance in big data has been identified by a recent community whitepaper on the challenges and opportunities of big data.

Provenance has found applications in debugging data, trust, probabilistic data, and security (Hasan et al. 2007; Agrawal et al. 2014). Data provenance may be critical for applications with typical big data features (volume, velocity, variety, value, and veracity). A usual approach to handle the velocity aspect of big data is to apply data cleaning and integration steps in a pay-as-you-go fashion. This has the advantage of increasing the timeliness of data but, in comparison with the traditional approach of data warehousing, comes at the cost of less precise and less well-documented metadata and data transformations. Without provenance information, it is difficult for a user to understand the relevance of data, to estimate or judge its quality, and to investigate unexpected or erroneous results. Big data systems that automatically and transparently keep track of provenance information would introduce pay-as-you-go analytics that do not suffer from this loss of important metadata. Moreover, provenance can be used to define meaningful access control policies for heavily processed and heterogeneous data. For instance, a user can be granted access to analysis results if they are based on data that person owns.

Big Data

Big Data is a buzzword used to describe the rapid growth of both structured and unstructured data. With the rapid development of social networking, data collection capacity, and data storage, big data are growing swiftly in all science and engineering domains including the social, biological, and biomedical sciences (Wang et al. 2015; Glavic 2014). Examples are Facebook data, Twitter data, LinkedIn data, and health care data.

In simple words, big data can be defined as data that is too big, too quick, or too hard for existing tools to process and analyze. Here, “too big” means that organizations increasingly must deal with terabyte-scale or petabyte-scale collections of data. For example, Facebook generates and stores four images of different sizes, which translates to a total of 60 billion images and 1.5 PB of storage. “Too quick” means data is not only huge but also must be processed and analyzed quickly – for example, to identify fraud at a point of sale or transaction. Lastly, “too hard” means data may not follow any particular structure. As a result, no existing tool can process and analyze it properly. For example, data that is created in media, such as MP3 audio files, JPEG images, and Flash video files, etc.

According to Weatherhead University Professor Gary King, “There is a big data revolution.” But the revolution is not about the quantity of data; rather, it is about using the data and doing a lot of things with the data. To understand the phenomenon that is big data, it is often described using five Vs: Volume, Velocity, Variety, Veracity, and Value.

Provenance in Big Data

Big data provenance is a type of provenance to serve scientific computation and workflows that process big data. Recently, an interesting example has come up. Who is the most popular footballer in the world? From social media data, all the fans around the world select their favorite footballer. This generates a huge volume of data. This data carries the vote of the fans. Such a massive amount of data must be able to provide the desired result.

Let’s consider a scenario. Whenever we find some data, do we think about what the source of the data is? It is quite possible that the data were copied from somewhere else. It is also possible that the data are incorrect. Consider the data we usually see on the web, such as the rating of a movie or smart phone, or a news story: do we think about how legitimate it is? Scientists need to have confidence in the accuracy and timeliness of the data that they are using. Some of the common challenges of big data are as follows:

1. It is too difficult to access all the data.
2. It is difficult to analyze the data.
3. It is difficult to share information and insights with others.
4. Queries and reports take a long time to run.
5. Expertise is needed to run the analysis legitimately.

Without provenance it is nearly impossible for a user to know the relevance of data, assess the quality of its data, and explore an unexpected or erroneous result.

Application of Provenance

Provenance systems may be created to support a number of uses and, according to Goble, various applications of provenance are as follows:

• Data Quality: Lineage can be used to estimate data quality and data reliability based on the source data and transformations. It can also provide proof statements on data derivation.
• Audit Trail: Provenance can be used to trace the audit trail of data, evaluate resource usage, and identify errors in data generation.
• Replication Recipes: Thorough provenance information can allow repetition of data derivation, help maintain its currency, and be a recipe for replication.
• Informational: A generic use of lineage is to query based on lineage metadata for data discovery. It can also be browsed to provide a context to interpret data.

Provenance in Security

The fundamental parts of security are confidentiality, integrity, and availability. Confidentiality indicates protection of data against disclosure. Sensitive information such as commercial or personal information is necessary to keep confidential. Provenance information covers the access control mechanism. With the progress of advanced software applications, more complex security mechanisms must be used. Traditional access control mechanisms are built for specific purposes and are not easily configured to address complex demands. If we are able to trace the dependency of access, then it will provide essential information for security.

Challenges

Information recorded about the data at origin does not come into play unless this information can be interpreted and carried through data analysis. But there are a lot of issues of data provenance such as query inversion, uncertainty of sources, data citation, and archives management. To acquire provenance in big data is a challenging task. Some of the challenges (Wang et al. 2015; Glavic 2014; Agrawal et al. 2014) are:

• Uncommon Structure: It is hard to define a common structure to model the provenance of data sets. Data sets can be structured or unstructured. Traditional databases and workflows may follow a structured way, but for big data it may not necessarily be true. We cannot reference separate entries in the file for provenance without knowing the way the data is organized in a file.
• Track Data of Distributed Storage: Big data systems often distribute data in different storages to keep track of data. This may not be applicable for traditional databases. For provenance, we need to trace and record data and process location.
• Check Authenticity: Data provenance needs to be checked in a timely manner to verify the authenticity of data. With increasing varieties and velocities of data, data flows can be irregular. Periodic or event-triggered data loads can be challenging to manage.
• Variety of Data: There is a variety of data, such as unstructured text documents, email, video, audio, stock ticker data, and financial transactions, which comes from multiple sources. It is necessary to connect and correlate relationships, hierarchies, and multiple data linkages; otherwise data can quickly get out of control.
• Velocity of Data: We may need to deal with streaming data that comes at unprecedented speed. We need to react quickly enough to manage such data.
• Lack of Expertise: Since big data is a fairly new technology, there are not enough experts who know how to deal with big data.
• Secure Provenance: There is a huge volume of data and information. It is important to maintain the privacy and security of provenance information. But it will be challenging to maintain the privacy and integrity of provenance information with big data.

Opportunities

The big secret of big data is not about the size of the data; it is about the relevancy of data, which is rather small in comparison. Timely access to appropriate analytic insights will replace the need for data warehouses full of irrelevant data, most of which could not be managed or analyzed anyway. There are a lot of opportunities for provenance (Wang et al. 2015; Glavic 2014; Agrawal et al. 2014), which are listed below:

• Less Overhead: We need to process a huge volume of data, so high performance is critical. It is necessary that provenance collection has minor impact on the application’s performance.
• Accessibility: A suitable coordination between data and computer systems is required to access different types of big data for provenance and distributed systems.
• User Annotations Support: It is important to capture user notes or metadata. This is applicable for databases and workflows as well as for big data. Thus, we need an interface that allows users to add their notes about the experiment.
• Scalability: Scalability comes into play for big data. The volume of big data is growing exponentially. Provenance data are also rapidly increasing, which is making it necessary to scale up provenance collection.
• Various Data Models Support: So far, data models for provenance systems have been structured, such as databases. It is important to support unstructured and semi-structured provenance data models from users and systems because big data may not follow a particular structure.
• Provenance Benchmark: If we can manage to set up a benchmark for provenance, then we can analyze performance blockages and compute performance metrics. Provenance information can be used to support data-centric monitoring.
• Flexibility: A typical approach to deal with the velocity of the data is to introduce data cleaning and integration in a pay-per-use fashion. This may reduce the cost and consume less time.
Conclusion

Data provenance and reproducibility of computations play a vital role in improving the quality of research. Some studies have shown in the past that it is really hard to reproduce computational experiments with certainty. Recently, the phenomenon of big data has made this even harder than before. Some challenges and opportunities of provenance in big data are discussed in this article.

Further Reading

Agrawal, R., Imran, A., Seay, C., & Walker, J. (2014, October). A layer based architecture for provenance in big data. In Big Data (Big Data), 2014 IEEE international conference on (pp. 1–7). IEEE.
Glavic, B. (2014). Big data provenance: Challenges and implications for benchmarking. In Specifying big data benchmarks (pp. 72–80). Berlin/Heidelberg: Springer.
Hasan, R., Sion, R., & Winslett, M. (2007, October). Introducing secure provenance: Problems and challenges. In Proceedings of the 2007 ACM workshop on storage security and survivability (pp. 13–18). ACM.
Wang, J., Crawl, D., Purawat, S., Nguyen, M., & Altintas, I. (2015, October). Big data provenance: Challenges, state of the art and opportunities. In Big Data (Big Data), 2015 IEEE international conference on (pp. 2509–2516). IEEE.

Data Quality Management

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Introduction

With the increasing availability of Big Data and their attendant analytics, the importance of data quality management has increased. Poor data quality represents one of the greatest hurdles to effective data analytics, computational linguistics, machine learning, and artificial intelligence. If the data are inaccurate, incomprehensible, or unusable, it does not matter how sophisticated our algorithms and paradigms are, or how intelligent our “machines.”

J. M. Juran provides a definition of data quality that is applicable to current Big Data environments: “Data are of high quality if they are fit for their intended use in operations, decision making, and planning” (Juran and Godfrey 1999, p. 34.9). In this context, quality means that Big Data are relevant to their intended uses and are of sufficient detail and quantity, with a high degree of accuracy and completeness, of known provenance, consistent with their metadata, and presented in appropriate ways.

Big Data provide complex contexts for determining data quality and establishing data quality management. The Internet of Things (IoT) has complicated Big Data quality management by expanding the dynamic dimensions of scale, diversity, and rapidity that collectively characterize Big Data. From intelligent traffic systems to smart healthcare, IoT has inundated organizations with ever-increasing quantities of structured and unstructured Big Data sets that may include social media, public and private data sets, sensor logs, web logs, digitized records, etc., produced by different vendors, applications, devices, microservices, and automated processes.

Conceptual Framework

Big Data taken out of their contexts are meaningless. As social constructs, Big Data, like “little data,” can only be conceptualized in the context of market institutions, societal norms, juridical constraints, and technological capabilities. Big Data-based assertions do not have greater claims to truth (however one chooses to define this), objectivity, or accuracy than “small data-based” assertions.

Moreover, it should be remembered that just because Big Data are readily accessible does not mean that their uses are necessarily ethical. Big Data applications reflect the intersection of different vectors: technology, maximizing the use of computational power and algorithmic sophistication and complexity; and analytics, the exploration of very large data to formulate hypotheses and social, economic, and moral assertions.
Poor data quality presents a major hurdle to data analytics. Consequently, data quality management has taken on an increasingly important role within the overall framework of Big Data governance. It is not uncommon for a data analyst working with Big Data sets to spend approximately half of his or her time cleansing and normalizing data. Data of acceptable quality are critical to the effective operations of an organization and the reliability of its analytics and business intelligence. The application of Big Data quality dimensions and their attendant metrics and standards, as well as the use of knowledge domain-specific lexica and ontologies, facilitates the tasks of Big Data quality management.

Dimensions of Big Data Quality
Big Data quality reflects the application of specific dimensions and their attendant metrics to data items to assess their acceptance for use. Commonly used dimensions of Big Data quality include:

Accessibility – a data item is consistently available and replicable
Accuracy – the degree to which a data item reflects a “real world” truth
Completeness – the proportion of ingested and stored instances of a data item matches the expected input quantity
Consistency – there are no differences between multiple representations of a data item and its stated definition
Comparability – instances of a data item are consistent over time
Correctness – a data item is error-free
Privacy – a data item does not provide personally identifiable information (PII)
Relevance – the degree to which a data item can meet current and future needs of its users
Security – access to a data item is controlled to ensure that only authorized access can take place
Semiotic consistency – data items consistently use the same alphabet, signs, symbols, and orthographic conventions
Timeliness – a data item represents a view of reality at a specific point in time
Trustworthiness – a data item has come from a trusted source and is managed reliably and securely
Uniqueness – a data item is uniquely identifiable so that it can be managed to ensure no duplication
Understandability – a data item is easily comprehended
Usability – a data item is useful to the extent that it may be readily understood and accessed
Validity – a data item is syntactically valid if it conforms to its stipulated syntax; a data item is semantically valid if it reflects its intended meaning

Data Quality Metrics
The function of measurements is to collect, calculate, and compare actual data with expected data. Big Data quality metrics must exhibit three properties: they must be important to data users, they must be computationally sound, and they must be feasible. At a minimum, data quality management should reflect functional requirements and provide metrics of timeliness, syntactic conformance, semiotic consistency, and semantic congruence. Useful starting points for Big Data quality metrics include pattern recognition and identification of deviations and exceptions (useful for certain kinds of unstructured data) and statistics-based profiling (useful for structured data; for example, descriptive statistics, inferential statistics, and univariate analyses of actual data compared to expected data).
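As a simple illustration of statistics-based profiling against such dimensions, the following pandas sketch computes completeness and uniqueness ratios for each column of a hypothetical table and compares the count of ingested records with an expected quantity; the column names and expected total are assumptions made for illustration.

import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email":       ["a@x.org", None, "b@x.org", "c@x.org"],
})
EXPECTED_ROWS = 5  # quantity promised by the data provider

metrics = {
    "completeness": records.notna().mean().round(2).to_dict(),     # share of non-null cells per column
    "uniqueness":   (records.nunique() / len(records)).round(2).to_dict(),
    "volume_ratio": len(records) / EXPECTED_ROWS,                   # actual vs. expected record count
}
print(metrics)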
Importance of Metadata Quality
Metadata describe the container as well as the contents of a data collection. Metadata support data interoperability and the transformation of Big Data sets into useable information resources. Because Big Data sets frequently come from widely distributed sources, the completeness and quality of the metadata have direct effects on Big Data quality. Of particular importance to Big Data quality, metadata ensure that, in addition to delineating the identity, lineage, and provenance of the data, the transmission and management of Big Data conform to predetermined standards, conventions, and practices that are encapsulated in the metadata, and that access to, and manipulation of, the data items will comply with the privacy and security stipulations defined in the metadata. Operational metadata reflect the management requirements for data security and safeguarding personal identifying information (PII); data ingestion, federation, and integration; data anonymization; data distribution; and data storage. Bibliographical metadata provide information about the data item’s producer, such as the author, title, table of contents, and applicable keywords of a document. Data lineage metadata provide information about the chain of custody of a data item with respect to its provenance – the chronology of data ownership, stewardship, and transformations. Syntactic metadata provide information about data structures. Semantic metadata provide information about the cultural and knowledge domain-specific contexts of a data item.

Methodological Framework

Big Data quality management presents a number of complex, but not intractable, problems. A notional process for addressing these problems may include the following activities: data profiling, data cleansing, data integration, data augmentation, and addressing issues of missing data.

Data Profiling
Data profiling provides the basis for addressing Big Data quality problems. Data profiling is the process of gaining an understanding of the data and the extent to which they comply with their quality specifications: are the data complete? are they accurate? There are many different techniques and processes for data profiling; however, they can usually be classified in three categories:

Pattern analysis – expected patterns, pattern distribution, pattern frequency, and drill down analysis
Attribute analysis (e.g., data element/column metadata consistency) – cardinality, null values, ranges, minimum/maximum values, frequency distribution, and various statistics
Domain analysis – expected or accepted data values and ranges

Data Cleansing
Data profiling provides essential information for solving Big Data quality problems. For example, the data profiling activities could reveal that the data set contains duplicate data items or that there are different representations of the same data item. This problem occurs frequently when merging data from different sources. Common approaches to solving such problems include:

Data exclusion – if the problem with the data is deemed to be severe, the best approach may be to remove the data
Data acceptance – if the error is within the tolerance limits for the data item, the best approach sometimes is to accept the data with the error
Data correction – if, for example, different variations of a data item occur, the best approach may be to select one version as the master and consolidate the different versions with the master version
Data value insertion – if a value for a field is not known and the data item is specified as NOT NULL, this problem may be addressed by creating a default value (e.g., unknown) and inserting that value in the field
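Two of the approaches above – correction toward a master version and insertion of a default value – can be sketched in a few lines of Python with pandas; the variant spellings and the “unknown” default below are illustrative assumptions.

import pandas as pd

cities = pd.DataFrame({"city": ["Scranton, PA", "Scranton, Penna.", "Scranton, Pennsylvania"],
                       "zip":  ["18503", None, "18503"]})

# Data correction: map known variants onto a single master representation
master = {"Scranton, Penna.": "Scranton, PA", "Scranton, Pennsylvania": "Scranton, PA"}
cities["city"] = cities["city"].replace(master)

# Data value insertion: the zip column is required (NOT NULL), so fill a default
cities["zip"] = cities["zip"].fillna("unknown")
print(cities.drop_duplicates())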
Data Integration
In the integration of Big Data sets, it is not unusual to encounter problems that reflect the diverse provenance of the data items, in terms of metadata specifications and cultural contexts.
Thus, to ensure syntactic conformance and semantic congruence, the process of Big Data integration may require parsing the metadata and analyzing the contents of the data set to accomplish the following:

Metadata reconciliation – identity, categories, properties, syntactic and semantic conventions and norms
Semiotic reconciliation – alphabet, signs, symbols, and orthographic conventions
Version reconciliation – standardization of the multiple versions cross-referenced with an authoritative version

Data Augmentation
To enhance the utility of Big Data items, it may be necessary to augment a Big Data item with the incorporation of additional external data to gain greater insight into the contents of the data set.

Missing Data
The topic of missing data has led to a lively discourse on imputation in predictive analytics. The goal is to conduct the most accurate analysis of the data to make efficient and valid inferences about a population or sample. Commonly used approaches to address missing data include:

Listwise deletion – delete any case that has missing data for any bivariate or multivariate analysis
Mean substitution – substitute the mean of the total sample of the variable for the missing values of that variable
Hotdecking – identify a data item in the data set with complete data that is similar to the data item with missing data based on a correlated characteristic and use that value to replace the missing value in the other data item
Conditional mean imputation (regression imputation) – use the regression equation to predict the values of the incomplete cases
Multiple-imputation analysis – based on the Bayesian paradigm; multiple imputation analysis has proven to be statistically valid from the frequentist (randomization-based) perspective
K-nearest neighbor (KNN) – adapted from data mining paradigms: the mode of the nearest neighbor is used for discrete data; the mean is substituted for quantitative data
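A minimal sketch of three of the options listed above – listwise deletion, mean substitution, and KNN imputation – assuming the pandas and scikit-learn libraries and a small made-up numeric table:

import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

data = pd.DataFrame({"age": [25, 32, None, 51], "income": [38000, None, 45000, 80000]})

# Listwise deletion: drop any case with a missing value
complete_cases = data.dropna()

# Mean substitution: replace missing values with the column mean
mean_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(data), columns=data.columns)

# KNN imputation: borrow values from the most similar complete neighbors
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(data), columns=data.columns)
print(complete_cases, mean_filled, knn_filled, sep="\n\n")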
Challenges and Future Trends

The decreasing costs of data storage and the increasing rapidity of data creation and transportation assure the future growth of Big Data-based applications in both the private and public sectors. However, Big Data do not necessarily mean better information than that provided by little data. Big Data cannot overcome the obstacles presented by poorly conceived research designs or indifferently executed analytics. There remain Big Data quality issues that should be addressed. For example, many knowledge communities support multiple, frequently proprietary, standards, ontologies, and lexica, each with its own sect of devotees, so that, rather than leading to uniform data quality, these proliferations tend to have deleterious effects on Big Data quality and interoperability in global, cloud-based, IoT environments.

Further Reading

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67, 1012–1028.
Allison, P. A. (2002). Missing data. Thousand Oaks: Sage Publications.
Juran, J. M., & Godfrey, A. B. (1999). Juran’s quality handbook (Fifth ed.). New York: McGraw-Hill.
Labouseur, A. G., & Matheus, C. (2017). An introduction to dynamic data quality challenges. ACM Journal of Data and Information Quality, 8(2), 1–3.
Little, R. J. A., & Rubin, D. B. (1997). Statistical analysis with missing data. New York: Wiley.
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218.
Saunders, J. A., Morrow-Howell, N., Spitznagel, E., Dore, P., Proctor, E. K., & Pescarino, R. (2006). Imputing missing data: A comparison of methods for social workers. Social Work Research, 30(1), 19–30.
Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103–110.
Truong, H.-L., Murguzur, A., & Yang, E. (2018). Challenges in enabling quality of analytics in the cloud. Journal of Data and Information Quality, 9(2), 1–4.
Data Reduction

▶ Collaborative Filtering

Data Repository

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Data bank; Data center; Data service; Data store

Introduction

Data repositories store datasets and provide access to users. Content stored and served by data repositories includes digitized data legacy, born digital datasets, and data catalogues. Standards of metadata schemas and identifiers enable the long-term preservation of research data as well as the machine accessible interfaces among various data repositories. Data management is a broad topic underpinned by data repositories, and further efforts are needed to extend data management to research provenance documentation. The value of datasets is extended through connections from datasets to literature as well as inter-citations among datasets and literature.

Repository Types

A data repository is a place where datasets can be stored and accessed. Normally, datasets in a repository are marked up with metadata, which provide essential context information about the datasets and enable efficient data search. The architecture of a data repository is comparable to a conventional library, and a number of parallel comparisons can be found, such as datasets to publication hardcopies and metadata to publication catalogues. While a conventional library needs certain physical space to store the hardcopies of publications, a data repository requires fewer physical resources, and it is able to organize various types of contents. In general, data repositories can be categorized into a few types based on the contents they organize. Among those types the most often seen ones are digitized data legacy, born digital data, and data catalogue service, as well as hybrids of them.

Massive amounts of data have been recorded on media that are not machine readable, such as hardcopies of spreadsheets, books, journal papers, maps, photos, etc. Through the use of certain devices and technologies, such as image scanning and optical character recognition, those data can be transformed into machine-readable formats. The resulting datasets are part of the so-called digitized data legacy. Communities of certain scientific disciplines have organized activities for rescuing datasets from the literature legacy and improving their reusability and have developed data repositories to store the rescued datasets. For example, EarthChem works on the preservation, discovery, access, and visualization of geoscience data, especially those in the fields of geochemistry, geochronology, and petrology (EarthChem 2016). A typical content type in EarthChem is spreadsheets which were originally published as tables in journal papers.

Compared with digitized data legacy, more and more datasets are born digital since computers are increasingly used in data collection. A trend in academia is to use open-source formats for datasets stored in a repository and thus to improve the interoperability of the datasets. For example, comma-separated values (CSV) are recommended for spreadsheets. EarthChem is also open for born digital datasets. Recently, EarthChem has collaborated with publishers such as Elsevier to invite journal paper authors to upload the datasets used in their papers to EarthChem. Moreover, interlinks will be set up between the papers and the datasets. A unique type of born digital data is crowdsourcing datasets. The contributors of crowdsourcing datasets are a large community of people rather than a few individuals or organizations.
The OpenStreetMap is such a crowdsourcing data repository for worldwide geospatial data.

The number of data repositories has significantly increased in recent years, as well as the subjects covered in those repositories. To benefit data search and discovery, another type of repository has been developed to provide data catalogue service. For example, the data portal of the World Data System (WDS) (2016) allows retrieval of datasets from a wide coverage of WDS members through well-curated metadata catalogues encoded in common standards. Many organizations such as the Natural Environment Research Council in the United Kingdom and the National Aeronautics and Space Administration in the United States also provide data catalogue services to their data resources.

Data Publication and Citation

Many data repositories already mint unique identifiers such as digital object identifiers (DOIs) to registered datasets, which partially reflects people’s efforts to make data a kind of formal publication. The word data publication is derived from paper publication. If a journal paper is comparable to a registered dataset in a repository, then the repository is comparable to a journal. A paper has metadata such as authors, publication date, title, journal name, volume number, issue number, and page numbers. Most papers also have DOIs which resolve to the landing web pages of those papers on their publisher websites. By documenting metadata and minting DOIs to registered datasets in a repository, the datasets are made similar to published papers. The procedure of data publication is already technically established in many data repositories.

However, data publication is not just a technical issue. There are also social issues to be considered, because data is not conventionally regarded as a “first-class” product of scientific research. Many datasets were previously published as supplemental materials of papers. Although data repositories make it possible to publish data as stand-alone products, the science community still needs more time to give data an equal position as papers. A few publishers recently released so-called data journals to publish short description papers for datasets published in repositories, which can be regarded as a way to promote data publication. Funding organizations have also taken actions to promote data as formal products of scientific research. For example, the National Science Foundation in the United States now allows funding applicants to list data and software programs as products in their bio-sketches.

A data repository has both data providers and data users. Accordingly, there are issues to be considered for both data publication and data citation. If a registered dataset is tagged with metadata such as contributor, date, title, source, publisher, etc. and is minted a DOI, then it is intuitively citable just like a published journal paper. To promote common and machine-readable metadata items among data repositories, a global initiative, DataCite, has been working on standards of metadata schema and identifiers for datasets since 2009. For example, DataCite suggests five mandatory metadata items for a registered dataset: identifier, creator, title, publisher, and publication year. It also suggests a list of additional metadata items such as subject, resource type, size, version, geographical location, etc. The methodology and technology developed by DataCite are increasingly endorsed by leading data repositories across the world, which makes possible a common technological infrastructure for data citation.
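For illustration, a dataset record carrying the five mandatory DataCite items (plus a couple of the optional ones mentioned above) might be assembled and serialized as follows; this is a hedged sketch of the idea with hypothetical values, not the official DataCite XML or JSON schema.

import json

record = {
    # Five mandatory DataCite metadata items
    "identifier": "10.1234/example-dataset",   # hypothetical DOI
    "creator": "Doe, Jane",
    "title": "Example geochemical measurements",
    "publisher": "Example Data Repository",
    "publicationYear": 2016,
    # A few of the recommended optional items
    "subject": "geochemistry",
    "resourceType": "Dataset",
    "version": "1.0",
}
print(json.dumps(record, indent=2))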
Various communities have also taken efforts to promote best practices in data citation, especially the guidelines for data users. The FORCE11 published the Joint Declaration of Data Citation Principles in 2013 to promote good research practice of citing datasets. Earlier than that, in 2012, the Federation of Earth Science Information Partners (2012) published Data Citation Guidelines for Data Providers and Archives, which offers more practical details on how a published dataset should be cited. For example, it suggests seven required elements to be included in a data citation: authors, release date, title, version, archive and/or distributor, locator/identifier, and access date/time.
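A tiny helper along these lines can assemble a citation string from those seven elements; the ordering and punctuation below are one plausible rendering, not the exact format prescribed by the guidelines.

def format_data_citation(authors, release_date, title, version, archive, locator, access_datetime):
    # Seven required elements, joined in a simple citation-like sentence
    return (f"{authors} ({release_date}). {title}, Version {version}. {archive}. "
            f"{locator}. Accessed {access_datetime}.")

print(format_data_citation("Doe, J.", "2016", "Example geochemical measurements",
                           "1.0", "Example Data Repository",
                           "https://doi.org/10.1234/example-dataset", "29 Apr 2016"))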
Data Management and Provenance

Works on data repositories underpin another broader topic, data management, which in general is about what one will do with the data generated during and after a research project. Academia is now facing a cultural change on data management. Many funding agencies such as the National Science Foundation in the United States now require researchers to include a data management plan in funding proposals. From the perspective of researchers, good data management increases efficiency in their daily work. The data publication, reuse, and citation enabled by the infrastructure of data repositories increase the visibility of individual works. Good practices on data management and publication drive the culture of open and transparent science and can lead to new collaborations and unanticipated discoveries.

Though data repositories provide essential facilities for data management, developing a data management plan can still be time-consuming as it is conventionally not included in a research workflow. However, it is now regarded as a necessary step to ensure the research data are safe and useful for both the present and future. In general, a data management plan includes elements such as project context, data types and formats, plans for short- and long-term management, data sharing and update plans, etc. A number of organizations provide online tools to help researchers draft such data management plans, such as the DMPTool developed by the California Digital Library, the tool developed by the Integrated Earth Data Applications project at Columbia University, and the DMPonline developed by the Digital Curation Centre in the United Kingdom.

Efforts on standards of metadata schema and persistent identifiers for datasets in data repositories are enabling the preservation of data as research products. Recently, academia has taken a further step to extend the topic of data management to context management or, in a short word, provenance. Provenance is about the origin of something. In scientific works, documenting provenance includes linking a range of observations and model output, research activities, people and organizations involved in the production of scientific findings with the supporting datasets, and methods used to generate them (Ma et al. 2014; Mayernik et al. 2013). Provenance involves works on categorization, annotation, identification, and linking among various entities, agents, and activities. To reduce duplicated efforts, a number of communities of practice have been undertaken, such as CrossRef for publications, DataCite for datasets, ORCID for researchers, and IGSN for physical samples.

The Global Change Information System is such a data repository that is enabled with the functionality of provenance tracking. The system is led by the United States Global Change Research Program and records information about people, organizations, publications, datasets, research findings, instruments, platforms, methods, software programs, etc., as well as the interrelationships among them. If a user is interested in the origin of a scientific finding, then he or she can use the system to track all the supporting resources. In this way, the provenance information improves the reproducibility and credibility of scientific results.

Value-Added Service

The value of datasets is reflected in the information and knowledge extracted from them and their applications to tackle scientific, social, and business issues. Data repositories, data publication and citation standards, data management plans, and provenance information form a framework enabling the storage and preservation of data. To facilitate data reuse, more efforts are needed for data curation, such as data catalogue service, cross-disciplinary discovery, and innovative approaches for pattern extraction.

Thomson Reuters released the Data Citation Index recently, which indexes the world’s leading data repositories and connects datasets to related refereed publications indexed in the Web of Science. The Data Citation Index provides access to an array of data across subjects and regions, which enables users to understand data in a comprehensive context through linked content and summary information.
328 Data Resellers

information. The linked information is beneficial Mayernik, M. S., DiLauro, T., Duerr, R., Metsger, E.,
because it enables users to gain insights which are Thessen, A. E., & Choudhury, G. S. (2013). Data
conservancy provenance, context, and lineage services:
lost when datasets or repositories are viewed in Key components for data preservation and curation.
isolation. The quality and importance of a dataset Data Science Journal, 12, 158–171.
are reflected in the number of citations it receives, World Data System. (2016). Trusteed data services for
which is recorded by the Data Citation Index. global science. https://www.icsu-wds.org/organiza
tion/intro-to-wds. Accessed 29 Apr 2016.
Such citations, on the other hand, enrich the con-
nections among research outputs and can be used
for further knowledge discovery.
In 2011, leading web search engines Google,
Bing, Yahoo!, and Yandex started an initiative Data Resellers
called Schema.org. Its aim is to create and support
a common set of schemas for structured data ▶ Data Brokers and Data Services
markup on web pages. Schema.org adopts a hier-
archy to organize schemas and vocabularies of
terms, which are to be used as tags to mark up
web pages. Search engine spiders and other Data Science
parsers can recognize those tags and record the
topic and content of a web page. This makes it Lourdes S. Martinez
easier for users to find the right web pages through School of Communication, San Diego State
a search engine. A few data repositories such as University, San Diego, CA, USA
the National Snow and Ice Data Center in the
United States already carried out studies to use
Schema.org to tag web pages of registered Data science has been defined as the structured
datasets. If this mechanism is broadly adopted, a study of data for the purpose of producing
desirable result is a data search engine similar to knowledge. Going beyond simply using data,
the publication search engine Google Scholar. data science revolves around extracting actionable
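As a rough illustration of what such markup can look like, the short Python sketch below assembles a hypothetical Schema.org "Dataset" description and serializes it as JSON-LD, the form typically embedded in a dataset landing page; the dataset name, identifier, and URLs are placeholders rather than records from any real repository.

```python
import json

# A minimal, illustrative Schema.org "Dataset" record; all values are hypothetical.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Snow Cover Observations",
    "description": "Illustrative dataset description used to show Schema.org markup.",
    "identifier": "https://doi.org/10.xxxx/example",  # placeholder identifier
    "keywords": ["snow cover", "remote sensing"],
    "creator": {"@type": "Organization", "name": "Example Data Center"},
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/snow_cover.csv",
    },
}

# A repository could embed this JSON-LD string in a landing page so that
# search engine parsers can recognize the page as describing a dataset.
print(json.dumps(dataset_markup, indent=2))
```

A parser that recognizes the Dataset type can then index the page by its declared name, keywords, and distribution format.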
Cross-References

▶ Data Discovery
▶ Data Provenance
▶ Data Sharing
▶ Data Storage
▶ Metadata

Further Reading

EarthChem. (2016). About EarthChem. http://www.earthchem.org/overview. Accessed 29 Apr 2016.
Federation of Earth Science Information Partners. (2012). Data citation guidelines for data providers and archives. http://commons.esipfed.org/node/308. Accessed 29 Apr 2016.
Ma, X., Fox, P., Tilmes, C., Jacobs, K., & Waple, A. (2014). Capturing provenance of global change information. Nature Climate Change, 4(6), 409–413.
Mayernik, M. S., DiLauro, T., Duerr, R., Metsger, E., Thessen, A. E., & Choudhury, G. S. (2013). Data conservancy provenance, context, and lineage services: Key components for data preservation and curation. Data Science Journal, 12, 158–171.
World Data System. (2016). Trusted data services for global science. https://www.icsu-wds.org/organization/intro-to-wds. Accessed 29 Apr 2016.


Data Resellers

▶ Data Brokers and Data Services


Data Science

Lourdes S. Martinez
School of Communication, San Diego State University, San Diego, CA, USA

Data science has been defined as the structured study of data for the purpose of producing knowledge. Going beyond simply using data, data science revolves around extracting actionable knowledge from said data. Despite this definition, confusion exists surrounding the conceptual boundaries of data science, in large part due to its intersection with other concepts, including big data and data-driven decision making. Given that increasingly unprecedented amounts of data are generated and collected every day, the growing importance of the data science field is undeniable. As an emerging area of research, data science holds promise for optimizing the performance of companies and organizations. The implications of advances in data science are relevant for fields and industries spanning an array of domains.
Defining Data Science

The basis of data science centers on established guiding principles and techniques that help organize the process of drawing out information and insights from data. Conceptually, data science closely resembles data mining, a process relying on technologies that implement these techniques in order to extract insights from data. According to Dhar, Jarke, and Laartz, data science seeks to move beyond simply explaining a phenomenon. Rather, its main purpose is to answer questions that explore and uncover actionable knowledge that informs decision making or predicts outcomes of interest. As such, most of the challenges currently facing data science emanate from properties of big data and the size of its datasets, which are so massive that they require the use of alternative technologies for data processing.

Given these characteristics, data science as a field is charged with navigating the abundance of data generated on a daily basis, while supporting machine and human efforts in using big data to answer the most pressing questions facing industry and society. These aims point toward the interdisciplinary nature of data science. According to Loukides, the field itself falls inside the area where computer programming and statistical analysis converge within the context of a particular area of expertise. However, data science differs from statistics in its holistic approach to gathering, amassing, and examining user data to generate data products. Although several areas across industry and society are beginning to explore the possibilities offered by data science, the idea of what constitutes data science remains nebulous.

Controversy in Defining the Field

According to Provost and Fawcett, one reason why data science is difficult to define relates to its conceptual overlap with big data and data-driven decision making. Data-driven decision making represents an approach characterized by the use of insights gleaned through data analysis for deciding on a course of action. This form of decision making may also incorporate varying amounts of intuition, but does not rely solely on it for moving forward. For example, a marketing manager faced with a decision about how much promotional effort should be invested in a particular product has the option of relying solely on intuition and past experience, or of using a combination of intuition and knowledge gained from data analysis. The latter represents the basis for data-driven decision making. At times, however, in addition to enabling data-driven decision making, data science may also overlap with it. The case of automated online recommendations of products based on user ratings, preferences, and past consumer behavior is an example of where the distinction between data science and data-driven decision making is less clear.
Similarly, differentiating between the concepts of big data and data science becomes murky when considering that approaches used for processing big data overlap with the techniques and principles used to extract knowledge and espoused by data science. This conceptual intersection exists where big data technologies meet data mining techniques. For example, technologies such as Apache Hadoop, which are designed to store and process large-scale data, can also be used to support a variety of data science efforts related to solving business problems, such as fraud detection, and social problems, such as unemployment reduction. As the technologies associated with big data are also often used to apply and bolster approaches to data mining, the boundary between where big data ends and data science begins continues to be imprecise.

Another source of confusion in defining data science stems from the absence of formalized academic programs in higher education. The lack of these programs exists in part due to challenges in launching novel programs that cross disciplines and the natural pace at which such programs are implemented within the academic environment. Although several institutions within higher education now recognize the importance of this emerging field and the need to develop programs that fulfill industry's need for practitioners of data science, the result up to now has been to leave the task of defining the field to data scientists.

Data scientists currently occupy an enviable position as among the most coveted employees for twenty-first-century hiring, according to Davenport and Patil. They describe data scientists as professionals, usually of senior-level status, who are driven by curiosity and guided by creativity and training to prepare and process big data. Their efforts are geared toward uncovering findings that solve problems in both private and public sectors. As businesses and organizations accumulate greater volumes of data at faster speeds, Davenport and Patil predict that the need for data scientists will continue on a very steep upward trajectory.

Opportunities in Data Science

Several sectors stand to gain from the explosion in big data and the acquisition of data scientists to analyze and extract insights from it. Chen, Chiang, and Storey note the opportunities inherent in data science for various areas. Beginning with e-commerce and the collection of market intelligence, Chen and colleagues focus on the development of product recommendation systems by e-commerce vendors such as Amazon, systems built from consumer-generated data. These product recommendation systems allow for real-time access to consumer opinion and behavior data in record quantities. New data analytic techniques to harness consumer opinions and sentiments have accompanied these systems, which can help businesses become better able to adjust and adapt quickly to the needs of consumers. Similarly, in the realm of e-government and politics, a multitude of data science opportunities exist for increasing the likelihood of achieving a range of desirable outcomes, including political campaign effectiveness, political participation among voters, and support for government transparency and accountability. Data science methods used to achieve these goals include opinion mining, social network analysis, and social media analytics.

Public safety and security represents another area that Chen and colleagues observe has prospects for implementing data science. Security remains an important issue for businesses and organizations in a post-September 11, 2001 era. Data science offers unique opportunities to provide additional protections in the form of security informatics against terrorist threats to transportation and key pieces of infrastructure (including cyberspace). Security informatics uses a three-pronged approach coordinating organizational, technological, and policy-related efforts to develop data techniques designed to promote international and domestic security. The use of data science techniques such as crime data mining, criminal network analysis, and advanced multilingual social media analytics can be instrumental in preventing attacks as well as pinpointing the whereabouts of suspected terrorists.

Another sector flourishing with the rise of data science is science and technology (S&T). Chen and colleagues note that several areas within S&T, such as astrophysics, oceanography, and genomics, regularly collect data through sensor systems and instruments. The result has been an abundance of data in need of analysis, and the recognition that information sharing and data analytics must be supported. In response, the National Science Foundation (NSF) now requires the submission of a data management plan with every funded project. Data-sharing initiatives such as the 2012 NSF Big Data program are examples of government endeavors to advance big data analytics for science and technology research. The iPlant Collaborative represents another NSF-funded initiative that relies on cyberinfrastructure to instill skills related to computational techniques that address evolving complexities within the field of plant biology among emerging biologists.
The health field is also flush with opportunities for advances using data science. According to Chen and colleagues, opportunities for this field are rising in the form of massive amounts of health- and healthcare-related data. In addition to data collected from patients, data are also generated through advanced medical tools and instrumentation, as well as online communities formed around health-related topics and issues. Big data within the health field consists primarily of genomics-based data and payer-provider data. Genomics-based data encompasses genetic-related information such as DNA sequencing. Payer-provider data comprises information collected as part of encounters or exchanges between patients and the healthcare system, and includes electronic health records and patient feedback. Despite these opportunities, Miller notes that the application of data science techniques to health data remains behind that of other sectors, in part due to a lack of initiatives that leverage scalable analytical methods and computational platforms. In addition, research and ethical considerations surrounding privacy and the protection of patients' rights in the use of big data present some challenges to full utilization of existing health data.

Challenges to Data Science

Despite the enthusiasm for data science and the potential application of its techniques for solving important real-world problems, there are some challenges to full implementation of tools from this emerging field. Finding individuals with the right training and combination of skills to become data scientists represents one challenge. Davenport and Patil discuss the shortage of data scientists as a case in which demand has grossly exceeded supply, resulting in intense competition among organizations to attract highly sought-after talent.

Concerns related to privacy represent another challenge to data science analysis of big data. Errors, mismanagement, or misuse of data (specifically data that by its nature is traceable to individuals) can lead to potential problems. One famous incident involved Target correctly predicting the pregnancy status of a teenaged girl before her father was aware of the situation, resulting in wide media coverage over issues equating big data with "Big Brother." This perception of big data may cause individuals to become reluctant to provide their information, or to alter their behavior when they suspect they are being tracked, potentially undermining the integrity of the data collected.

Data science has been characterized as a field concerned with the study of data for the purpose of gleaning insight and knowledge. The primary goal of data science is to produce knowledge through the use of data. Although this definition provides clarity to the conceptualization of data science as a field, there persists confusion as to how data science differs from related concepts such as big data and data-driven decision making. The future of data science appears very bright, and as the amount and speed with which data is collected continue to increase, so too will the need for data scientists to harness the power of big data. The opportunities for using data science to maximize corporate and organizational performance cut across several sectors and areas.

Cross-References

▶ Big Data
▶ Big Data Research and Development Initiative (Federal, U.S.)
▶ Business Intelligence Analytics
▶ Data Mining
▶ Data Scientist
▶ Data Storage
▶ Data Streaming

Further Reading

Chen, H. (2006). Intelligence and security informatics for international security: Information sharing and data mining. New York: Springer Publishers.
Chen, H. (2009). AI, E-government, and politics 2.0. IEEE Intelligent Systems, 24(5), 64–86.
Chen, H. (2011). Smart health and wellbeing. IEEE Intelligent Systems, 26(5), 78–79.
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–1188.
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90, 70–76.
Dhar, V., Jarke, M., & Laartz, J. (2014). Big data. Business & Information Systems Engineering, 6(5), 257–259.
Hill, K. (2012). How Target figured out a teen girl was pregnant before her father did. Forbes Magazine.
Loukides, M. (2011). What is data science? The future belongs to the companies and people that turn data into products. Sebastopol: O'Reilly Media.
Miller, K. (2012). Big data analytics in biomedical research. Biomedical Computation Review, 2, 14–21.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51–59.
Wactlar, H., Pavel, M., & Barkis, W. (2011). Can computer science save healthcare? IEEE Intelligent Systems, 26(5), 79–83.
Data Scientist

Derek Doran
Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA

Synonyms

Data analyst; Data analytics; Data hacker; Statistician

Definition/Introduction

A "data scientist" is broadly defined as a professional who systematically performs operations over data to acquire knowledge or discover non-obvious trends and insights. They are employed by organizations to acquire such knowledge and trends from data using sophisticated computational systems, algorithms, and statistical techniques. Given the ambiguity of how a data scientist extracts knowledge and how the data she operates on may be defined, the term does not have a narrow but universally applicable definition. Data scientists use computational tools and computer programming skills, their intellectual foundation in mathematics and statistics, and at-hand domain knowledge to collect, deconstruct, and fuse data from (possibly) many sources, compare models, visualize, and report in non-technical terms new insights that routine data analysis methods cannot reveal. The demand for data scientists is high, and the field is projected to see continued growth over time. Leading universities now offer undergraduate degrees, graduate degrees, and graduate certificates in data science.

Data Scientist, Fig. 1 Interest in the term "data scientist" as reported by Google Trends, 2007–2013

Defining a "Data Scientist"

If a data scientist is one who transforms and applies operations over data to acquire knowledge, nearly any individual processing, analyzing, and interpreting data may be considered to be one. For example, a zoologist who records the eating habits of animals, a doctor who reviews a patient's history of blood pressure, a business analyst who summarizes data in an Excel spreadsheet, and a school teacher who computes final grades for a class are data scientists in the broadest sense. It is for this reason that the precise definition of what a data scientist is, and the skills necessary to fulfill the position, may be a controversial topic. Spirited public debate about what qualifies one to hold the job title "data scientist" can be seen on professional discussion boards and social networks across the Web and in professional societies. The meteoric rise in popularity of this term, as identified by Google Trends in Fig. 1, leads some to suggest that the popularity of the title is but a trend powered by excitement surrounding the term "Big Data."

Although the specific definition of the title data scientist varies among organizations, there is agreement about the skills required to fulfill the role. Drew Conway's "data science Venn diagram," published in 2010, identifies these agreed-upon characteristics of a data scientist. It includes: (i) hacking skills, i.e., the ability to use general-purpose computational tools, programming languages, and system administration commands to collect, organize, divide, process, and run data analysis across modern computing platforms; (ii) mathematics and statistics knowledge, which encompasses the theoretical knowledge to understand, choose, and even devise mathematical and statistical models that can extract complex information from data; and (iii) substantive expertise about the domain that a data set has come from and/or about the type of the data being analyzed (e.g., network, natural language, streaming, social, etc.) so that the information extracted can be meaningfully interpreted.

Data Hacking and Processing

Data scientists are often charged with the process of collecting, cleaning, dividing, and combining raw data from a number of sources prior to analysis. The raw data may come in a highly structured form, such as the result of relational database queries, comma- or tab-delimited files, and files formatted by a data interchange language such as XML or JSON. The raw data may also carry a semistructured format through a document markup language such as HTML, where markup tags and their attributes suggest a document structure but the content of each tag is unstructured. Datasets may even be fully unstructured, formatted in ways such as audio recordings, chat transcripts, product reviews written in natural language, books, medical records, or analog data. Data wrangling is thus an important data hacking process where datasets of any form are collected and then transformed or mapped into a common, structured format for analysis. Leading open-source data wrangling tools include OpenRefine and the pandas package for the Python programming language. Data scientists will also turn to Linux shell commands and scripts to maximize their ability to collect (e.g., sed, jq, scrape) and transform raw data into alternative formats (cut, awk, grep, cat, join). In a second process called data fusion, data from different, possibly heterogeneous, sources are melded into a common format and then joined together to form a single set of data for analysis.
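A small sketch of wrangling and fusion in this spirit, assuming the pandas package is installed and using two tiny in-memory sources with invented values in place of real files: both sources are mapped to a shared key and joined into one table.

```python
import io
import pandas as pd

# Structured source: comma-delimited records, as might come from a database export.
csv_source = io.StringIO("sensor_id,reading\nA1,0.42\nB7,0.39\n")
readings = pd.read_csv(csv_source)

# Semi-structured source: JSON-style records using a different field name for the key.
json_records = [{"id": "A1", "site": "north"}, {"id": "B7", "site": "south"}]
sites = pd.DataFrame(json_records).rename(columns={"id": "sensor_id"})

# Data fusion: join the two sources on the common key into a single analysis table.
fused = readings.merge(sites, on="sensor_id", how="inner")
print(fused)
```

The same pattern scales to real exports: normalize each source to a common schema first, then merge on the shared identifiers.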
Data scientists perform their analysis using advanced computational frameworks built on high-performance, distributed computing clusters that may be offered by cloud services. They thus have working knowledge of popular frameworks deployed in industry, such as Hadoop or Spark for large-scale batch processing of data and Storm for real-time data analysis. They also may know how to build, store, and query data in SQL relational database systems like MySQL, MSSQL, and Oracle, as well as less traditional NoSQL database management systems, including HBase, MongoDB, Oracle NoSQL, CouchDB, and Neo4j, which emphasize speed and flexibility of data representation over data consistency and transaction management. In both "small" and "big" data settings, data scientists often utilize statistical programs and packages to build and run A/B testing, machine learning algorithms, deep learning systems, genetic algorithms, natural language processing, signal processing, image processing, manifold learning, data visualization, time series analysis, and simulations. Toward this end, they often have working knowledge of a statistical computing software environment and programming language. R or Python are often selected because of their support for a number of freely available, powerful packages for data analytics.

Mathematical and Statistical Background

Data scientists may be asked to analyze high-dimensional data or data representing processes that change over time. They may also be charged with making predictions about the future by fitting complex models to data and executing data transformations. Techniques achieving these tasks are rooted in the mathematics of calculus, linear algebra, and probability theory, as well as statistical methods. Calculus is used by data scientists in a variety of contexts, but most often to solve model optimization problems. For example, data scientists often devise models that relate data attributes to a desired outcome and include parameters whose values cannot be estimated from data. In these scenarios, analytical or computational methods for identifying the value of a model parameter "best explaining" the observed data take derivatives to find the direction and magnitude of parameter updates that reduce a "loss" or "cost" function. Advanced machine learning models involving neural networks or deep learning systems require a background in calculus to evaluate the backpropagation learning algorithm.
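A toy illustration of this derivative-driven search (not taken from the entry itself) is the gradient descent sketch below, which fits a single slope parameter by repeatedly stepping against the gradient of a squared-error loss; the data values are invented.

```python
# Fit y ≈ w * x by gradient descent on a squared-error loss.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]  # roughly y = 2x, with a little noise

w = 0.0              # initial parameter value
learning_rate = 0.01

for _ in range(500):
    # Derivative of L(w) = sum((w*x - y)^2) with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= learning_rate * gradient  # step in the direction that reduces the loss

print(round(w, 3))  # converges to the best-fitting slope (about 2)
```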
Data scientists imagine data that carry n attributes as an n-dimensional vector oriented in an n-dimensional vector space. Linear algebraic methods are thus used to project, simplify, combine, and analyze data through geometric transformations and manipulations of such vectors. For example, data scientists use linear algebraic methods to simplify or eliminate irrelevant data attributes by projecting vectors representing data into a lower dimensional space. Many statistical techniques and machine learning algorithms also rely on the spectrum, or the collection of eigenvalues, of matrices whose rows are data vectors. Finally, data scientists exploit linear algebraic representations of data in order to build computationally efficient algorithms that operate over data.
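A compact sketch of such a projection, assuming NumPy is available and using made-up records: the centered data matrix is projected onto the eigenvector of its covariance with the largest eigenvalue, which is the core step of principal component analysis.

```python
import numpy as np

# Four records with two highly correlated attributes (n = 2 dimensions).
X = np.array([[2.0, 1.9], [4.1, 4.0], [6.0, 6.2], [8.2, 7.9]])

# Center the data, then take the spectrum (eigenvalues/eigenvectors) of its covariance.
Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc.T))

# Project onto the eigenvector with the largest eigenvalue: a 1-D representation.
top_direction = eigenvectors[:, np.argmax(eigenvalues)]
projected = Xc @ top_direction
print(projected)  # one coordinate per record, preserving most of the variation
```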
Data scientists often use simulation in order to explore the effect of different parameter values on a desired outcome and to test whether or not a complex effect observed in data may have arisen simply by chance. Building simulations that accurately reflect the nature of the system from which data have come requires a data scientist to codify its qualities or characteristics probabilistically. They will thus fit single or multivariate discrete and continuous probability distributions to the data and may build Bayesian models that reflect the conditionalities latent within the system being simulated. Data scientists thus have a sound understanding of probability theory and probabilistic methods to create accurate models and simulations from which they draw important conclusions.
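One everyday version of the "could this have arisen by chance?" question is a permutation simulation; the sketch below, with invented numbers, shuffles group membership many times and counts how often a difference at least as large as the observed one appears.

```python
import random

group_a = [12.1, 13.4, 11.8, 14.0, 12.9]  # e.g., measurements under condition A
group_b = [10.2, 11.1, 10.8, 11.5, 10.9]  # e.g., measurements under condition B
observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

pooled = group_a + group_b
extreme = 0
trials = 10_000
random.seed(0)
for _ in range(trials):
    random.shuffle(pooled)                       # simulate "no real difference"
    diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
    if diff >= observed:
        extreme += 1

print(extreme / trials)  # a tiny fraction suggests the effect is unlikely to be chance
```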
Statistical analyses that summarize data, identify its important factors, and yield predictive analytics are routinely used by data scientists. Data summarizations are performed not only with summary statistics (e.g., the mean, median, and mode of data attributes) but also by studying characteristics of the distribution of the data (variance, skew, and heavy-tailed properties). Depending on the nature of the data being studied, relevant factors may be identified through regression, mixed effect, or unsupervised machine learning methods. Predictive analytics are also powered by machine learning algorithms, which are chosen or may even be developed by a data scientist based on the statistical qualities of the data.

Domain Expertise

Data scientists are also equipped with domain- and organization-specific knowledge in order to translate their analysis results into actionable insights. For example, data scientists evaluating biomedical or biological data have some training in the biological sciences, and a data scientist who studies interactions among individuals has training in social network analysis and sociological theory. Once employed by an organization, data scientists immediately begin to accrue organization-specific knowledge about the company, the important questions they need their analysis to answer, and the best method for presenting their results in a nontechnical fashion to the organization. Data scientists are unable to create insights from an analysis without sufficient domain-specific expertise and cannot generate value or communicate their insights to a company without organization-specific knowledge.

The Demand for Data Scientists

For all but the largest and most influential organizations, identifying and recruiting an individual with strong theoretical foundations, familiarity with state-of-the-art data processing systems, the ability to hack at unstructured data files, an intrinsic sense of curiosity, keen investigative skills, and an ability to quickly acquire domain knowledge is a tremendous challenge. The well-known consulting and market research group McKinsey Global Institute projects deep gaps between the supply of and demand for analytic talent across the world, with such talent defined as having knowledge of probability, statistics, and machine learning. For example, McKinsey projects that by the end of 2017 the United States will face a country-wide shortage of over 190,000 data scientists, and of over 1.5 million managers and analysts able to lead an analytics team and to interpret and act on the insights data scientists discover.

Data Science Training

To address market demand for data scientists, universities and institutes around the world now offer undergraduate and graduate degrees as well as professional certifications in data science. These degrees are available from leading institutions including Harvard, the University of Washington, the University of California Irvine, Stanford, Columbia, Indiana University, Northwestern, and Northeastern University. Courses within these certificate and Master's programs are often available online and in innovative formats, including through massive open online courses.

Conclusion

Despite the challenge of finding individuals with deep mathematical, computational, and domain-specific backgrounds, the importance for organizations of identifying and hiring well-trained data scientists has never been so high. Data scientists will only continue to rise in value and in demand as our global society marches forward towards an ever more data-driven world.

Cross-References

▶ Big Variety Data
▶ Computer Science
▶ Computational Social Sciences
▶ Digital Storytelling, Big Data Storytelling
▶ Mathematics
▶ R-Programming
▶ Statistics

Further Reading

Davenport, T. H., & Patil, D. J. (2012). Data scientist. Harvard Business Review, 90, 70–76.
Granville, V. (2013). Data science programs and training currently available. Data Science Central. Web. Accessed 04 Dec 2014.
Kandel, S., et al. (2011). Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization, 10(4), 271–288.
Lund, S. (2013). Game changers: Five opportunities for US growth and renewal. McKinsey Global Institute.
Walker, D., & Fung, K. (2013). Big data and big business: Should statisticians join in? Significance, 10(4), 20–25.
Data Security

▶ Data Provenance


Data Security Management

▶ Data Provenance


Data Service

▶ Data Repository


Data Sharing

Tao Wen
Earth and Environmental Systems Institute, Pennsylvania State University, University Park, PA, USA

Definition

In general, data sharing refers to the process of making data accessible to data users. It often happens through community-specific or general data repositories, personal and institutional websites, and/or data publications. A data repository is a place that stores data and provides access to users. Data sharing is particularly encouraged in research communities, although the extent to which data are being shared varies across scientific disciplines. Data sharing links data providers and users, and it benefits both parties by improving the reproducibility and visibility of research as well as by promoting collaboration and fostering new science ideas. In the big data era, data sharing is particularly important as it makes big data research feasible by providing the essential constituent – data. To ensure effective data sharing, data providers should follow the findability, accessibility, interoperability, and reusability (FAIR) principles (Wilkinson et al. 2016) throughout all stages of data management, a broader topic underpinned by data sharing.

FAIR Principles

Wilkinson et al. (2016) provide guidelines to help the research community improve the findability, accessibility, interoperability, and reusability of scientific data. Based on the FAIR principles, scientific data should be transformed into a machine-readable format, which becomes particularly important given that an enormous volume of data is being produced at an extremely high velocity. Among these four characteristics of FAIR data, reusability is the ultimate goal and the most rewarding step.

Findability

Data sharing starts with making the data findable to users. Both data and metadata should be made available. Metadata are used to provide information about one or more aspects of the data, e.g., who collected the data, the date/time of data collection, and the topics of the collected data. Each dataset should be registered and assigned a unique identifier such as a digital object identifier (DOI). Each DOI is a link redirecting data users to a webpage including the description of and access to the associated dataset. Both data and metadata should be formatted following a formal, open-access, and widely endorsed data reporting standard (e.g., schema.org: https://schema.org/Dataset). Datasets fulfilling these standards can be cataloged by emerging tools for searching datasets (e.g., Google Dataset Search: https://toolbox.google.com/datasetsearch). Currently, it is more common that data users will search for desired datasets through discipline-specific data repositories (e.g., EarthChem: https://www.earthchem.org/ in the earth sciences).

Accessibility

Both data and metadata should be provided and can be transferred to data users through a data repository. Broadly speaking, a data repository can be a personal- or institutional-level website (e.g., Data Commons at Pennsylvania State University: http://www.datacommons.psu.edu) or a discipline-specific or general database (e.g., EarthChem). Data users should be able to use the unique identifier (e.g., DOI) to locate and access a dataset.

Interoperability

As more interdisciplinary projects are proposed and funded, shared data from two or more disciplines often need to be integrated for data visualization and analysis. To achieve interoperability, data and metadata should not only follow broadly adopted reporting standards but also use vocabularies to further formalize the reported data. These vocabularies should also follow FAIR principles. The other way to improve interoperability is for data repositories to be designed to provide shared data in multiple formats, e.g., CSV and JavaScript Object Notation (JSON).
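A minimal sketch of serving the same records in both of those formats, using only the Python standard library and invented values:

```python
import csv
import io
import json

# A tiny table as it might be shared in CSV form.
csv_text = "sample_id,ph,temperature_c\nS-001,7.2,18.5\nS-002,6.9,19.1\n"

# Read the CSV rows and re-serialize them as JSON for users who prefer that format.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows, indent=2)

print(json_text)
```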
Reusability

Enabling data users to reuse shared data is the ultimate goal. Reusability is the natural outcome if the data (and metadata) to be shared meet the rules mentioned above. Shared data can be reused for testing new science ideas or for reproducing published results.

The Rise of Data Sharing

Before the computer age, it was not uncommon for research data to be published and deposited as paper copies. Transferring data to users often required an individual request sent to the data provider. The development of the Internet connects everyone and allows data sharing almost in real time (Popkin 2019).

Nowadays more data are shared through a variety of data repositories providing access to data users. The scientific community, including funders, publishers, and research institutions, has started to promote the culture of data sharing and making data open access. For example, the National Science Foundation requires data management plans in which awardees need to describe how research data will be stored, published, and disseminated. Many publishers, like Springer Nature, also require authors to deposit their data in general or discipline-specific data repositories. In addition to sharing data in larger data repositories funded by national or international agencies, many research institutions have started to format and share their data in university-sponsored data repositories for the purpose of long-term data access.

In some disciplines, for example, astronomy and meteorology, where data collection often relies on large and expensive facilities (e.g., satellites, telescopes, or a network of monitoring stations) and the size of the dataset is often larger than what one research group can analyze, data sharing is a common practice (Popkin 2019). In some other disciplines, some researchers might be reluctant to share data for varying reasons, often relating to the processes of data publication and data citation. Some of these reasons include:

(1) Researchers are concerned that they might get scooped if they share data too early.
(2) Researchers might lack the essential expertise to format their data to a certain standard.
(3) Funding that supports data sharing might not be available to these researchers to pay for their time to make data FAIR.
(4) The support for building data repositories is insufficient in some disciplines.
(5) The research community fails to treat data sharing as being as important as publishing a journal article.
(6) Insufficient credit has been given to data providers, as data citation might not be done appropriately by data users.

To address some of these problems, all stakeholders of data sharing are working collaboratively. For example, the European Union projects FOSTER Plus and OpenAIRE provide training opportunities to researchers on open data and data sharing. The emerging data journals, e.g., Nature Scientific Data, provide a platform for researchers to publish and share their data along with descriptions. Many funders, including the National Science Foundation, have allowed repository fees on grants (Popkin 2019).

Best Practices

The United States National Academies of Sciences, Engineering, and Medicine published a report in 2018 (United States National Academies of Sciences, Engineering, and Medicine 2018) to introduce the concept of Open Science by Design, in which a series of improvements were recommended to be implemented throughout the entire research life cycle to ensure open science and open data. To facilitate data sharing and to promote open science, some of the recommended initiatives are listed below.

Data Generation

During data generation, researchers should consider collecting data in a digital form rather than noting down data on a paper copy, e.g., a laboratory notebook. Many researchers are now collecting data in electronic forms (e.g., comma-separated values or CSV files). In addition, researchers should use tools compatible with open data and adopt automated workflows to format and curate the generated data. These actions taken at the early stage of the research life cycle can help avoid many problems in data sharing later on.
Data Sharing

After finishing preparing the data, researchers should pick one or more data repositories to share their data. What is shared should include not only the data but also metadata and more. For example, the World Data System (2015) recommended that data, metadata, products, and information produced from research should all be shared, although national or international jurisdictional laws and policies might apply. Researchers should consult with funders or publishers about recommended data repositories into which they can deposit data. One example list of widely used data repositories (both general and discipline-specific) can be found here: https://www.nature.com/sdata/policies/repositories.

Conclusion

Data sharing acts as a bridge linking data providers and users, and it is particularly encouraged in the research community. Data sharing can benefit the research community in many ways, including (1) improving the reproducibility and visibility of research, (2) promoting collaboration and inspiring new science ideas, and (3) serving as a vehicle to foster communication between academia, industry, and the general public (e.g., Brantley et al. 2018). To facilitate effective data sharing, researchers should follow the FAIR principles (findability, accessibility, interoperability, and reusability) when they generate, format, curate, and share data.

Cross-References

▶ Data Repository

Further Reading

Brantley, S. L., Vidic, R. D., Brasier, K., Yoxtheimer, D., Pollak, J., Wilderman, C., & Wen, T. (2018). Engaging over data on fracking and water quality. Science, 359(6374), 395–397.
Popkin, G. (2019). Data sharing and how it can benefit your scientific career. Nature, 569(7756), 445.
United States National Academies of Sciences, Engineering, and Medicine. (2018). Open science by design: Realizing a vision for twenty-first century research. National Academies Press.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018.
World Data System. (2015). World Data System (WDS) data sharing principles. Retrieved 22 Aug 2019, from https://www.icsu-wds.org/services/data-sharing-principles.


Data Storage

Omar Alghushairy and Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Data science; FAIR data; Storage media; Storage system

Introduction

Data storage means storing and archiving data in electronic storage devices that are dedicated to preservation, where data can be accessed and used at any time. Storage devices are hardware used for reading and writing data through a storage medium. Storage media are the physical materials for storing and retrieving data. Popular data storage devices are hard drives, flash drives, and cloud storage. The term big data reflects not only the massive volume of data but also the increased velocity and variety in data generation and collection, for example, the massive amounts of digital photos shared on the Web, social media networks, and even Web search records. Many conventional documents such as books, newspapers, and blogs can also be data sources in the digital world. Storing big data in appropriate ways will greatly support data discovery, access, and analytics. Accordingly, various data storage devices and technologies have been developed to increase the efficiency of data management and enable information extraction and knowledge discovery from data.
In domain-specific fields, a great deal of scientific research has been conducted to tackle specific requirements on data collection, data formatting, and data storage, which has also generated beneficial feedback to computer science.

Data storage is a key step in the data science life cycle. At the early stage of the cycle, well-organized data will provide strong support to the research program in which the data are collected. At the late stage, the data can be shared and made persistently reusable for other users. The FAIR data principles (findable, accessible, interoperable, and reusable) provide a set of guidelines for data storage.

Data Storage Devices

There are many different types of devices that store data in digital form. The fundamental unit of capacity measurement is the bit, and every eight bits are equal to one byte. Often, the capacity of a data storage device is measured in megabytes (MB), gigabytes (GB), terabytes (TB), and other bigger units. Data storage devices are categorized into two types based on their characteristics: primary storage and secondary storage.

Primary storage devices such as cache memory, random-access memory (RAM), and read-only memory (ROM) are connected to a central processing unit (CPU) that reads and executes instructions and data stored on them. Cache memory is very fast memory, which is used as the buffer between the CPU and RAM. RAM is temporary memory, which means the content of stored data is lost once the power is turned off. ROM is nonvolatile memory, so the data stored on it cannot be changed because it has become permanent data (Savage and Vogel 2014). In general, these memories have limited capacity, which makes it difficult to handle big data streaming.

Secondary storage such as hard disk drives (HDD), solid-state drives (SSD), servers, CDs, and DVDs are external data storage that are not connected to the central processing unit (Savage and Vogel 2014). This type of data storage device is usually used to increase computer capacity. Secondary storage is nonvolatile, and the data can be retained. An HDD stores the data on magnetic platters and uses a mechanical spindle to read and write data. The operating system identifies the paths and sectors of data stored on the platters. An SSD is faster than an HDD because it is a flash drive, which stores data in microchips and has no mechanical parts. Also, an SSD is smaller in size, weighs less, and is more energy-efficient in comparison with an HDD.

Technologies

In recent years, data has grown fast and has become massive. With so many data-generating sources, there is an urgent need for technologies that can deal with the storage of big data. This section provides an overview of well-known data storage technologies that are able to manipulate large volumes of data, such as relational databases, NoSQL databases, distributed file systems, and cloud storage.

Relational Database: The relational system that emerged in the 1980s is described as a cluster of relationships, each relationship having a unique single name. These relationships interconnect a number of tables. Each table contains a set of rows (records) and columns (attributes). The set of columns in each table is fixed, and each column has a specific pattern of values that is allowed to be used. In each row, the record represents a relationship that links a set of values together. The relational database is functional in data storage, but it also has some limitations that make it less efficient in dealing with big data. For example, a relational database cannot tackle unstructured data. For datasets with network or graph patterns, it is difficult to use a relational database to find the shortest route between two data points.
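A minimal illustration of this row-and-column model, using Python's built-in sqlite3 module and a few invented records:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory relational database

# A table: a fixed set of columns, with one record (relationship) per row.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO students (id, name, major) VALUES (?, ?, ?)",
    [(1, "Ada", "Statistics"), (2, "Ben", "Geology"), (3, "Chen", "Statistics")],
)

# Relational queries express conditions over the structured columns.
for (name,) in conn.execute("SELECT name FROM students WHERE major = ?", ("Statistics",)):
    print(name)
conn.close()
```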
NoSQL Database: The "Not only SQL" (NoSQL) database is considered the most important big data storage technology in database management systems. It is a method that depends on doing away with restrictions. NoSQL databases aim to eliminate complex relationships and provide many ways to preserve and work on data for specific use cases, such as storing full-text documents. In a NoSQL database, it is not necessary for data elements to have the same structure, because it is able to deal with structured, unstructured, and semi-structured data (Strohbach et al. 2016).
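The sketch below imitates that schema flexibility with plain Python dictionaries standing in for documents in a document store; note that the two invented records deliberately do not share the same fields.

```python
# Two "documents" with different structures, as a document-oriented NoSQL store allows.
documents = [
    {"user": "u1", "posts": 42, "tags": ["hiking", "maps"]},
    {"user": "u2", "bio": "occasional poster"},  # no 'posts' or 'tags' field at all
]

# Queries must tolerate missing fields instead of relying on a fixed schema.
active_users = [d["user"] for d in documents if d.get("posts", 0) > 10]
print(active_users)
```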
Distributed File Systems (DFS): A DFS manages datasets that are stored on different servers. Moreover, a DFS accesses the datasets and processes them as if they were stored on one server or device. The Hadoop Distributed File System (HDFS) is the most popular method in the field. HDFS separates the data across multiple servers. Thus, it supports big data storage and high-efficiency parallel processing (Minelli et al. 2013).

Cloud Storage: Cloud storage can be defined as servers that contain large storage space where users can manage their files. In general, this service is provided by companies known in the field of cloud storage. Cloud storage led to the term cloud computing, which means using applications over a virtual interface by connecting to the Internet. For example, Microsoft installs Microsoft Office on its cloud servers. If a user has an account in the Microsoft cloud storage service and an available Internet connection through a computer or smartphone, the user is allowed to use the cloud system by logging into the account from anywhere. Besides cloud computing, cloud storage also has many other features, such as file synchronization, file sharing, and collaborative file editing.

Impacts of Big Data Storage

Based on a McKinsey Global Institute study, the information that has been captured by organizations about their customers, operations, and suppliers through digital systems has been estimated at trillions of bytes. That means data volume grows at a great rate, so it needs advanced tools and technologies for storing and processing. Data storage has played a major role in the big data revolution (Minelli et al. 2013).

Many companies are using emotion and behavior analysis of their data or social media to identify their audiences and customers and to predict marketing and sales results. Smart decisions reduce costs and improve productivity. Data is the basis for informed big business decision-making, and analyzing the data offers more information options to make the right choice.

There are many techniques for managing big data, but Hadoop is currently the best technology for this purpose. Hadoop offers data scientists and data analysts the flexibility to deal with data and extract information from it whether the data is structured or unstructured, and it offers many other convenient services. Hadoop is designed to follow up on any system failures. It constantly monitors the stored data on the server. As a result, Hadoop provides reliable, fault-tolerant, and scalable servers to store and analyze data at a low cost.

The development of cloud storage with the widespread use of Internet services, as well as the development of mobile devices such as smartphones and tablets, has enhanced the spread of cloud storage services. Many people carry their laptops when they are not in their offices, and they can easily access their files through their own cloud storage over the Internet. They can use cloud storage services like Google Docs, Dropbox, and many more to access their files wherever they are and whenever they want.

Companies are increasingly using cloud storage for several reasons, most notably because cloud services are becoming cheaper, faster, and easier for maintaining and retrieving data. In fact, cloud storage is the better option for a lot of companies to address challenges caused by the lack of office space, the inability to host servers, and the expensive cost of using servers in the company, in terms of maintenance and cost of purchase. By using cloud storage, companies can save the servers' space and cost for other things. Google, Amazon, and Microsoft are the most popular companies in cloud storage services, just to name a few.
Structured, Unstructured, and Semi-structured Data

There are various forms of data that are stored, such as texts, numbers, videos, etc. These data can be divided into the following three categories: structured, unstructured, and semi-structured data. Structured data is considered high-level data that is in an organized form, such as data in an Excel sheet. For example, a university database may hold around half a million pieces of information for about 20 thousand students, containing names, phone numbers, addresses, majors, and other data. Unstructured data is random and disorganized data, for example, data that is presented on a social network, such as text and multimedia data. Various unstructured data are posted to social media platforms like Twitter and YouTube every day. Semi-structured data combines several types of data to represent the data in a specific pattern or structure. For example, information about a user's call contains an entity of information based on the logs of the call center. However, not all of the data is structured: a complaint recorded in audio format, for instance, is unstructured, so it is hard to synthesize in data storage (Minelli et al. 2013).

FAIR Data

FAIR data is a new point of view on data management, which follows the guidelines of findability, accessibility, interoperability, and reusability (Wilkinson et al. 2016). FAIR data focuses on two principles: enhancing machines' ability to find and use data automatically, and supporting data reuse by humans.

Findability is based on placing the data, with its metadata, under searchable and global identifiers; searching for the data through links on the World Wide Web should then be possible. Accessibility is based on ensuring easy access to the data and its metadata (Ma 2019) through the Internet by an authorized person or machine. Metadata should be made accessible even if the data themselves are not accessible. Interoperability is based on containing qualified references for both data and metadata and on representing the records in a formal, shareable, and machine-readable language. Reusability is based on detailed metadata information with an accessible license allowing suitable citation of the data. In addition, software tools and other related provenance information should also be accessible to support data reuse.

Cross-References

▶ Data Center

Further Reading

Ma, X. (2019). Metadata. In L. A. Schintler & C. L. McNeely (Eds.), Encyclopedia of Big Data. Cham: Springer. https://doi.org/10.1007/978-3-319-32001-4_135-1.
Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big data, big analytics: Emerging business intelligence and analytic trends for today's businesses. Hoboken: Wiley.
Savage, T. M., & Vogel, K. E. (2014). An introduction to digital multimedia (2nd ed.). Burlington: Jones & Bartlett Learning.
Strohbach, M., Daubert, J., Ravkin, H., & Lischka, M. (2016). Big data storage. In J. Cavanillas, E. Curry, & W. Wahlster (Eds.), New horizons for a data-driven economy. Cham: Springer.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018.


Data Store

▶ Data Repository


Data Stream

▶ Data Streaming
Data Streaming

Raed Alsini and Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Data science; Data stream; Internet of Things (IoT); Stream reasoning

Introduction

Data has become an essential component not only of research but also of our daily life. In the digital world, people are able to use various types of technology to collect and transmit big data, which has the features of overwhelming volume, velocity, variety, value, and veracity. More importantly, big data represents a vast amount of information and knowledge to be discovered. The Internet of Things (IoT) is interconnected with big data. IoT applications use the data stream as a primary way for data transmission, which makes the data stream a unique type of big data. A data stream is a sequence of data blocks being transmitted. The real-time feature of the data stream requires corresponding technologies for efficient data processing. Streaming the data is built upon resources that are commonly used for communication, web activity, e-commerce, and social media. How the data is processed determines how information can be extracted from the data stream. Analyzing the data stream through queries ensures and improves the efficiency of working with the data from a data science perspective. Many techniques can be used in data stream processing, among which data mining is the most common approach, used for detecting data latency, frequent patterns, and anomalous values, as well as for classification and clustering. The computer science community has created many open-source libraries for the data stream and has built various best practices to facilitate the applications of data streams in different disciplines.

The Internet of Things (IoT)

IoT is a crucial source of big data. It consists of connected devices in a network system. Within IoT, as a part of data stream applications, many tools are already widely used, such as radio-frequency identification (RFID). For example, a production company can use RFID in its system to track products such as automobiles, toys, and clothes. By doing so, the workers in that company can monitor and resolve issues during the production process. Other examples can be seen in the data resulting from chip devices. A smartphone, with many sensor chips embedded, is able to measure and record its user's activities, such as location, footsteps, heart rate, and calories, just to name a few.

Sensors are essential components of IoT. They generate the data stream and transmit it into many applications, such as those estimating weather conditions. For instance, historical and live weather records of rain, snow, wind, and other variables are input into a weather analysis system to generate hourly, daily, and weekly predictions. Automotive industries also equip vehicles with many sensors to help reduce potential traffic accidents. For example, distance detection radar is a common component of many automobiles nowadays. It can detect the distance and space between the automobile and a pedestrian or a building to prevent injury when the distance approaches a certain minimum value.

Data Science Aspect in Data Stream

In recent years, data has been the "crude oil" driving technological and economic development. There is an extremely high demand, and almost everyone uses it. We need to refine crude oil before using it, and it is the same with data. We can benefit from the data only when data processing, mining, analysis, and extraction are able to provide useful results.

Using the data stream in data science involves understanding the data life cycle.
Usually, the cycle begins with collecting the data from the sources. Nowadays, data stream collection can be seen in search engines, social media, IoT, and marketing. For instance, Google Trends generates a massive amount of data from searches on certain topics on the Web. After that, it can provide results based on what a user is looking for within a specific range of time. The benefit of processing the data stream is getting the right information immediately. The processing needs methods and models. Two common standard models are batch processing and stream processing.

Batch processing can handle a large amount of data by first collecting the data over time and then doing the processing. For example, the operating system on a computer can optimize the sequencing of jobs to make efficient usage of the system. Micro-batch is a modified model of batch processing. It groups data and tasks into small batches. Completing the processing of a batch in this model is based on how the next batch is received.

Stream processing is a model used for processing the data without waiting for the next data to arrive. The benefit of stream processing is that the system can receive the data quickly. For example, an online banking application runs stream processing when a customer buys a product. Then the bank transaction is verified and executed without fail. Stream processing can handle a huge amount of data without suffering any issues related to data latency. A sensor network that generates massive data can be organized easily under this method.
Many technologies can be used to store the and stored, it requires data management in the
data stream in a data life cycle. Amazon Web data life cycle. Managing the data stream can be
Services (AWS) provide several types of tools to done using queries as a primary method, such as
support the various needs in storing and analyzing the Structured Query Language (SQL). SQL is a
data. Apache Spark is an open-source use for common language use for managing the database.
cluster computing framework. Spark Streaming The data stream management system (DSMS)
uses the fast scheduling capability of Apache uses an extended version of SQL known as the
Spark to implement streaming analytics. It groups Continuous Query Language (CQL). The reason
data in micro-batches and makes transformations behind CQL is to ensure any continuous data over
on them. Hadoop tool, which is another platform time can be used on the system.
under Apache, uses the batch processing to store The operation of CQL can be categorized into
massive amounts of data. It sets up a framework three groups: relation-to relation, stream-to-rela-
for distributed data processing by using the tion, and relation-to stream (Garofalakis et al.
MapReduce model. Both Spark and Hadoop 2007). Relation-to-relation is usually done under
For instance, the relationship between two query results can be expressed using operators such as equal to, greater than, or less than. Stream-to-relation is done using the sliding window method: a sliding window holds the historical points of the data as it is streamed. Specifically, when there are two window sizes, the second window does not begin until the difference between them is removed. Relation-to-stream usually involves a tree method to deal with the continuous query; detailed operations include insertion, deletion, and relation.

Stream Reasoning

Stream reasoning is about processing the data stream to reach a conclusion or decision on continuous information. Stream reasoning handles continuous information by defining factors along the velocity, volume, and variety of big data. For example, a production company may use several sensors to estimate and predict the types and amounts of raw materials needed for each day. Another example is the detection of fake news on social media: each social media platform has many users across the world, and stream reasoning can be used to analyze features of the language patterns in how messages spread.
The semantic web community has proposed several tools that can be used in stream reasoning. It introduced RDF for the modeling and encoding of data schemas and ontologies at the fundamental level, and linked open data is an example of how databases can be linked in the semantic web. SPARQL is a query language developed by the W3C; SPARQL queries use the triple pattern of RDF to represent patterns in the data and its graph. Recently, the RDF Stream Processing (RSP) working group proposed an extension of both RDF and the SPARQL query language to support stream reasoning. For instance, Continuous SPARQL (C-SPARQL) extends the SPARQL language to support continuous queries.

Practical Approach

In real-world practice, the application of data streams is tied to big data. Current approaches to data stream usage can be grouped into these categories: scaling data infrastructure, mining heterogeneous information networks, graph mining and discovery, and recommender systems (Fan and Bifet 2013).
Scaling data infrastructure is about analyzing the data from social media, such as Twitter, which carries various types of data such as video, images, text, or even hashtag trends. The data generated is based on how users communicate on a certain topic, which leads to various analytics for understanding human behavior and emotion in the communication between users. Snapchat is another popular social media application that generates and analyzes live data streams based on location and the events occurring there.
Mining heterogeneous information networks is about discovering the connections between multiple components such as people, organizations, activities, communication, and system infrastructure. The information network here also includes the relations that can be seen in social networks, sensor networks, graphs, and the Web.
Graphs are used to represent nodes and their relations, and graph mining is an efficient method to discover knowledge in big data. For example, Twitter can represent graph information by visualizing each data type and its relations, and much other graph information can be obtained from the Web; Google, for example, has constructed knowledge graphs for various objects and relations.
The recommender system is another approach for analyzing data streams in big data. Through collaborative filtering (CF), the queries in a DSMS can be improved by adding a new CF statement, such as a rating, which can extend the functionality of the DSMS for finding optimizations, query sharing, fragmentation, and distribution. Another strategy is using the content-based model; several platforms, like Amazon, eBay, YouTube, and Netflix, have already used this in their systems.
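As a concrete, simplified illustration of the collaborative filtering idea sketched above, the following Python snippet computes item-to-item cosine similarities from a small user-item rating matrix and scores unrated items for one user. The ratings are invented for illustration only, and the sketch is not tied to any particular DSMS or platform.

import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

def cosine(a, b):
    # Cosine similarity between two item rating vectors.
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm else 0.0

n_items = ratings.shape[1]
similarity = np.array([[cosine(ratings[:, i], ratings[:, j])
                        for j in range(n_items)] for i in range(n_items)])

def recommend(user_index, top_n=2):
    # Score unrated items by similarity-weighted ratings of the items the user did rate.
    user = ratings[user_index]
    scores = similarity @ user
    scores[user > 0] = -np.inf  # exclude items the user has already rated
    return np.argsort(scores)[::-1][:top_n]

print(recommend(1))

In a streaming setting, the rating matrix would be updated incrementally as new ratings arrive, and the similarity computation would be refreshed over a window rather than recomputed from scratch.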
Further Reading

Aggarwal, C. C. (2007). An introduction to data streams. In C. C. Aggarwal (Ed.), Data streams. Advances in database systems (Vol. 31). Boston: Springer.
Dell’Aglio, D., Valle, E. D., Harmelen, F. V., & Bernstein, A. (2017). Stream reasoning: A survey and outlook. Data Science, 1–25. https://doi.org/10.3233/ds-170006.
Fan, W., & Bifet, A. (2013). Mining big data. ACM SIGKDD Explorations Newsletter, 14(2), 1. https://doi.org/10.1145/2481244.2481246.
Garofalakis, M., Gehrke, J., & Rastogi, R. (2007). Data stream management: Processing high-speed data streams (Data-centric systems and applications). Berlin/Heidelberg: Springer.
Ma, X. (2017). Visualization. In L. Schintler & C. McNeely (Eds.), Encyclopedia of Big Data. Cham: Springer. https://doi.org/10.1007/978-3-319-32001-4_202-1.


Data Synthesis

Ting Zhang
Department of Accounting, Finance and Economics, Merrick School of Business, University of Baltimore, Baltimore, MD, USA


Definition/Introduction

While data synthesis traditionally refers to descriptive or interpretative narrative and tabulation in studies such as meta-analyses, in the big data context data synthesis refers to the process of creating synthetic data. In the big data context, digital technology provides an unprecedented wealth of data, and rich data across various fields can jointly offer extensive information about individual persons or organizations for finance, economics, health, other research, evaluation, policy making, and more. At the same time, our laws necessarily protect privacy and data confidentiality, and this protection becomes increasingly important in a big data world where thefts and data breaches of various magnitudes become much easier. Synthetic data has the same or highly similar attributes as the real data for many analytic purposes but masks the original data for more privacy and confidentiality. Synthetic data was first proposed by Rubin (1993). Data synthesis therefore is a process of replacing identifying, sensitive, or missing values according to multiple imputation techniques based on regression models; the created synthetic data has many of the same statistical properties as the original data (Abowd and Woodcock 2004). Data synthesis includes a full synthesis for all variables and all records or a partial synthesis for a subset of variables and records.


The Emergence of Data Synthesis

While many statistical agencies disseminate samples of census microdata, the masked public use data sample can be difficult to analyze, either due to limited or even distorted information after masking or due to the limited sample size when multiple data sources are merged together. To disguise identifying or sensitive values, agencies sometimes add random noise or use swapping for easy-to-identify at-risk records (Dalenius and Reiss 1982). This introduces measurement error (Yancey et al. 2002). Winkler (2007) showed that synthetic data has the potential to avoid the problems of standard statistical disclosure control methods and has better data utility and lower disclosure risk. Drechsler and Reiter (2010) demonstrate that sampling with synthesis can improve the quality of public use data relative to sampling followed by standard statistical disclosure limitation.


How Data Synthesis Is Conducted

For synthetic data, sequential regression imputation is used: a regression model is typically fitted for a given variable to impute and replace its values, and the process is then repeated for the other variables. Specifically, according to Drechsler and Reiter (2010), the data agency typically follows these four steps:

(i) Selects the set of values to replace with imputations
(ii) Determines the synthesis models for the entire dataset to make use of all available information
(iii) Repeatedly simulates replacement values for the selected data to create multiple, disclosure-protected populations
(iv) Releases samples from the populations

The newly created samples are a mixture of genuine and simulated data.


Main Challenges

For synthetic data, the challenges include using appropriate inferential methods for different processes of data generation. The algorithm could produce different synthetic datasets for different orderings of the variables, or possibly different orderings of the records; however, this is not anticipated to affect data utility (Drechsler and Reiter 2010). Essentially, synthetic data only replicates certain specific attributes or simulates general trends of the data; it is not exactly the same as the original dataset for all purposes.


Application of Synthetic Data

Synthetic data is typically used among national statistics agencies because of the potential advantages of synthetic data over data with standard disclosure limitation; examples include the American Community Survey, the Survey of Income and Program Participation, the Survey of Consumer Finances, and the Longitudinal Employer-Household Dynamics Program. Experience with synthetic data at national agencies outside the United States includes the German Institute for Employment Research and Statistics New Zealand.
Abowd and Lane (2004) describe an ongoing effort, called “Virtual Research Data Centers” or “Virtual RDC,” to benefit both the research community and statistical agencies. In the “Virtual RDC,” multiple public use synthetic data sets can be created from a single underlying confidential file and customized for different uses. It is called a “Virtual RDC” because the synthetic data is maintained at a remote site outside the agency, with the same computer environment as at the agency’s restricted-access Research Data Centers. Researchers can access the synthetic files at the Virtual RDC, which is now operational at Cornell University.
In addition to government agencies, synthetic data can also be used to create research samples of confidential administrative data or, as a technique, to impute survey data. For example, the national longitudinal Health and Retirement Study collects survey data on American older adults biannually, and the Ewing Marion Kauffman Foundation has been collecting annual national Kauffman Survey data longitudinally for years; both survey data sets have certain components of data synthesis used to impute the data.


Conclusion

In the context of big data, data synthesis refers to the process of creating synthetic data. The limitations of masked public use data make data synthesis particularly valuable. Data synthesis offers the rich data needed for numerous research and analysis purposes without sacrificing data privacy and confidentiality in the big data world. Synthetic data adopts regression-based multiple imputation to replace identifying, sensitive, or missing values, which helps to avoid the shortcomings of standard statistical disclosure control methods in public use government data samples. Despite some challenges, synthetic data is now widely used across government and other data agencies and can be used for other data purposes, including imputing survey data, as well.


References

Abowd, J. M., & Lane, J. I. (2004). New approaches to confidentiality protection: Synthetic data, remote access and research data centers. In Domingo-Ferrer, J. & Torra, V. (Eds.), Privacy in Statistical Databases: CASC Project International Workshop, PSD 2004, Barcelona, Spain, June 9-11, 2004, Proceedings (pp. 282–289). Berlin: Springer.
Abowd, J. M., & Woodcock, S. D. (2004). Multiply-imputing confidential characteristics and file links in longitudinal linked data. In Domingo-Ferrer, J. & Torra, V.
(Eds.), Privacy in Statistical Databases: CASC Project International Workshop, PSD 2004, Barcelona, Spain, June 9-11, 2004, Proceedings (pp. 290–297). Berlin: Springer.
Dalenius, T., & Reiss, S. P. (1982). Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference, 6, 73–85.
Drechsler, J., & Reiter, J. P. (2010). Sampling with synthesis: A new approach for releasing public use census microdata. Journal of the American Statistical Association, 105(492), 1347–1357.
Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics, 9, 462–468.
Winkler, W. E. (2007). Examples of easy-to-implement, widely used methods of masking for which analytic properties are not justified. Tech. Rep., U.S. Census Bureau Research Report Series, No. 2007–21.
Yancey, W. E., Winkler, W. E., & Creecy, R. H. (2002). Disclosure risk assessment in perturbative microdata protection. In J. Domingo-Ferrer (Ed.), Inference control in statistical databases (pp. 135–152). Berlin: Springer.


Data Tidying

▶ Data Cleansing


Data Virtualization

Gagan Agrawal
School of Computer and Cyber Sciences, Augusta University, Augusta, GA, USA

Data Virtualization is the ability to support a virtual (more abstract) view of a complex dataset. Involving several systems built to support relational views on complex array datasets, automatic data virtualization was introduced in 2004 (Weng et al. 2004). The motivation was that scientific datasets are typically stored as binary or character flat-files. Such low-level layouts enable compact storage and efficient processing, but they make the specification of processing much harder.
In view of this, there has recently been increasing interest in data virtualization, and in data services to support such virtualization. Based on the virtualization, low-level, compact, and/or specialized data formats can be hidden from the applications analyzing the datasets. However, supporting it can require significant effort. For each dataset layout and abstract view that is desired, a set of data services needs to be implemented. An additional difficulty arises from the fact that the design and implementation of efficient data virtualization and data services oftentimes require the interaction of two complementary players. The first player is the scientist, who possesses a good understanding of the application, the datasets, and their format, but is less knowledgeable about database and data services implementation. The second player is the database developer, who is proficient in the tools and techniques for efficient database and data services implementation, but has little knowledge of the specific application.
The two key aspects of the automatic data virtualization approach are as follows.
Designing a Meta-Data Description Language: This description language is expressive enough to present a relational table view for complex multidimensional scientific datasets and to describe the low-level data layout. In particular, it can describe:

• The dataset's physical layout within the file system of a node.
• The dataset's distribution across the nodes of one or more clusters.
• The relationship of the dataset to the logical or virtual schema that is desired.
• The index that can be used to make subsetting more efficient.

By using it, the scientist and the database developer together can describe the format of the datasets generated and used by the application.
Generating Efficient Data Subsetting and Access Services Automatically: Using a compiler that can parse the meta-data description and generate function code to navigate the datasets, the database developer (or the scientist) can conveniently generate data services that will navigate the datasets. These functions take the user query as input and help create relational tables.
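The following Python sketch conveys the general idea in miniature; it is an illustration under invented names (the file name, field names, and layout are hypothetical), not the system described above. A small metadata description records the layout of a binary flat-file together with its virtual relational schema, and a generic subsetting service uses that description to answer a selection query against the file.

import numpy as np

# Hypothetical metadata description of a binary flat-file.
metadata = {
    "file": "temperature_grid.bin",
    "layout": np.dtype([("lat", "f4"), ("lon", "f4"), ("temp", "f4")]),
    "virtual_table": ("lat", "lon", "temp"),
}

def write_example_file(meta):
    # Create a tiny example dataset so the sketch runs end to end.
    rows = [(10.0, 20.0, 288.1), (10.5, 20.0, 290.4), (11.0, 20.5, 285.7)]
    np.array(rows, dtype=meta["layout"]).tofile(meta["file"])

def select(meta, column, predicate):
    # A generic "data service": read the low-level layout described by the
    # metadata and expose matching rows of the virtual relational table.
    records = np.fromfile(meta["file"], dtype=meta["layout"])
    return records[predicate(records[column])]

write_example_file(metadata)
# Roughly: SELECT * FROM virtual_table WHERE temp > 287
print(select(metadata, "temp", lambda t: t > 287.0))

A real implementation would, as described above, generate such subsetting functions automatically from the metadata description rather than writing them by hand.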
Since this initial work in 2004 (Weng et al. 2004), several other implementations of this approach have been created. The most important ones have involved abstractions on top of NetCDF and HDF5 datasets (Wang et al. 2013; Su and Agrawal 2012).
These implementations addressed some of the key challenges in dealing with scientific data. On one hand, this approach does not require data to be loaded into a specific system or to be reformatted. At the same time, it allows the use of a high-level language for the specification of processing, which is also independent of the data format. The tool supported SQL select and aggregation queries specified over the virtual relational table view of the data. Besides supporting selection over dimensions, which is directly supported by the HDF5 API as well, it also supports queries involving dimension scales and those involving data values. For this, code for a hyperslab selector and a content-based filter was generated in the system. Selection and aggregation queries using novel algorithms are also effectively parallelized.
The implementation has been extensively evaluated with queries of different types, and performance and functionality have been compared against OPeNDAP. Even for subsetting queries that are directly supported in OPeNDAP, the sequential performance of the system is better by at least a factor of 3.9. For other types of queries, where OPeNDAP requires hyperslab selector and/or content-based filter code to be written manually, the performance difference is even larger. In addition, the system is capable of scaling performance by parallelizing the queries and of reducing wide-area data transfers through server-side data aggregation. In terms of functionality, the system also supported certain state-of-the-art HDF5 features, including dimension scales and compound datatypes. A similar implementation was also carried out in the context of another very popular format for scientific data, NetCDF (Su and Agrawal 2012).
Since the initial work in this area, and also concurrently with the more recent implementations in the context of HDF5 and NetCDF, a popular development has been the NoDB approach (Alagiannis et al. 2012). The idea of NoDB is that datasets continue to be stored in raw files but are queried using a high-level language. This work is indeed an example of the NoDB approach, but distinct in its focus on multidimensional array data.


Further Reading

Alagiannis, I., Borovica, R., Branco, M., Idreos, S., & Ailamaki, A. (2012, May). NoDB: Efficient query execution on raw data files. In Proceedings of the 2012 ACM SIGMOD international conference on management of data (pp. 241–252).
Su, Y., & Agrawal, G. (2012, May). Supporting user-defined subsetting and aggregation over parallel netcdf datasets. In 2012 12th IEEE/ACM international symposium on cluster, cloud and grid computing (ccgrid 2012) (pp. 212–219). IEEE.
Wang, Y., Su, Y., & Agrawal, G. (2013, May). Supporting a light-weight data management layer over hdf5. In 2013 13th IEEE/ACM international symposium on cluster, cloud, and grid computing (pp. 335–342). IEEE.
Weng, L., Agrawal, G., Catalyurek, U., Kur, T., Narayanan, S., & Saltz, J. (2004, June). An approach for automatic data virtualization. In Proceedings. 13th IEEE international symposium on high performance distributed computing, 2004 (pp. 24–33). IEEE.


Data Visualisation

▶ Data Visualization


Data Visualization

Jan Lauren Boyles
Greenlee School of Journalism and Communication, Iowa State University, Ames, IA, USA


Synonyms

Data visualisation; Dataviz, Datavis; Information visualization; Information visualisation
Definition/Introduction

Data visualization encompasses the planning, production, and circulation of interactive, “graphical representations” that emerge from big data analyses (Ward et al. 2010, 1). Given the volume and complexity of big data, data visualization is employed as a tool to artfully demonstrate underlying patterns and trends – especially for lay audiences who may lack expertise to directly engage with large-scale datasets. Visually depicting such insights thereby broadens the use of big data beyond computational experts, making big data analyses more approachable for a wider segment of society. More specifically, data visualization helps translate big data into the sphere of decision-making, where leaders can more easily integrate insights gleaned from large-scale datasets to help guide their judgments.
Firstly, it is important to fully distinguish data visualization from the manufacture of information graphics (also known as infographics). The most prominent difference rests in the fact that data visualizations (in the vast majority of cases) are constructed with the assistance of computational structures that manage the statistical complexity of the dataset (Lankow et al. 2012; Mauldin 2015). On the other hand, while computer-assisted products are often used to construct infographics, the design is often not directly dependent on the dataset itself (Mauldin 2015). Rather, infographic designers often highlight selected statistics that emerge from the dataset, rather than using the entire dataset to fuel the visualization in the aggregate (Lankow et al. 2012; Mauldin 2015). The corpus of data for infographics, in fact, tends to be smaller in scope than the big data outputs of data visualization projects, which typically encompass millions, if not billions, of data points. Additionally, data visualizations are highly interactive to the user (Yau 2012). Such data visualizations also often tether the data to its spatial qualities, particularly emphasizing the interplay of the geographic landscape. A subfield of data visualization – interactive mapping – capitalizes on this feature of large-scale datasets that contain geolocation (primarily GPS coordinates). Data visualizations are also more likely than infographics to render the output of real-time data streams, rather than providing a snapshot of a dataset as it appeared at one time in the past. Data visualizations, broadly speaking, tend to evolve as the dataset changes over time. To this end, once the data visualization is created, it can quickly incorporate new data points that may emerge in the dynamic dataset.


Data Visualization as Knowledge Translation

Humans have long been naturally inclined toward processing information visually, which helps anchor individual-level decision-making (Cairo 2012). In this context, visualizing complex phenomena predates our current digital era of data visualization (Friendly 2008). The geographic mapping of celestial bodies or the rudimentary sketches of cave wall paintings, for instance, were among the earliest attempts to marry the complexity of the physical world to the abstraction of visual representation (Friendly 2008; Mauldin 2015).
To today’s users, data visualizations can help unpack knowledge so that nonexperts can better understand the topic at hand. As a result, the information becomes more accessible to a wider audience. In relating this knowledge to the general public, data visualization should, ideally, serve two primary purposes. In its optimal form, data visualizations should: (1) contextualize an existing problem by illustrating possible answers and/or (2) highlight facets of a problem that may not be readily visible to a nonspecialist audience (Telea 2015). In this light, data visualization may be a useful tool for simulation or brainstorming. This class of data visualization, or scientific visualization, is generally used to envision phenomena in 3D – such as weather processes or biological informatics (Telea 2015). These data visualization products typically depict the phenomena realistically – where the object or interaction occurs in space.
Informational visualization, on the other hand, does not prioritize the relationship between the object and space (Telea 2015). Instead, the data visualizations focus upon how elements operate within the large-scale dataset, regardless of their placement or dimension in space. Network maps, designed to characterize relationships between various actors within social confines, would serve as one example of this type of data visualization.
To disseminate information obtained from the visualization publicly, however, the dataviz product must be carefully constructed. First, the data must be carefully and systematically “cleaned” to eliminate any anomalies or errors in the data corpus. This time-consuming process often requires reviewing code, ensuring that the ultimate outputs of the visualization are accurate. Once the cleaning process is complete, the designer/developer of the data visualization must be able to fully understand the given large-scale dataset in its entirety and decide which visualization format would fit best with the information needs of the intended audience (Yau 2012). In making trade-offs in constructing the visualization, the large-scale dataset must also be filtered through the worldview of the dataviz’s creator. The choices of the human designer/developer, however, must actively integrate notions of objectivity into the design process, making sure the dataset accurately reflects the entirety of the data corpus. At the same time, the designer/developer must carefully consider any privacy or security issues associated with constructing a public visualization of the findings (Simon 2014).
In the mid-2010s, data visualization has been used selectively, not systematically, within organizations (Simon 2014). In fact, data visualization is not a flawless tool for presenting big data in all cases. In some cases, data visualization may not be the correct tool altogether for conveying the dataset’s findings, if the work can be presented more clearly in narrative or if little variability exists in the data itself (Gray et al. 2012). Taken together, practitioners using data visualization must carefully contemplate every step of the production and consumption process.
When applying data visualization to a shared problem or issue, the ultimate goal is to fully articulate a given problem within the end user’s mind (Simon 2014). Data visualizations may also illustrate the predictive qualities embedded in large-scale datasets. Ideally, data visualization outputs can highlight cyclical behaviors that emerge from the data corpus, helping to better explicate cause and effect for complex societal issues. The data visualization may also help in identifying outliers contained in the dataset, which may be easier to locate graphically than numerically. Through the data visualization, nonspecialists can quickly and efficiently look at the output of a big data project, analyze vast amounts of information, and interpret the results. Broadly speaking, data visualizations can provide general audiences a simpler guide to understanding the world’s complexity (Steele and Iliinsky 2010).


The Rise of Visual Culture

Several developments of the digital age have contributed to the broad use of data visualization as part of everyday practice in analyzing big data across numerous industries. Primarily, the rapid expansion of computing power – particularly the emergence of cloud-based systems – has greatly accelerated the ability to process large-scale datasets in real time (Yau 2012). The massive volume of data generated daily in the social space – particularly on social networking sites – has led both academicians and software developers to create new tools to visualize big data created via social streams (Yau 2012). Some of the largest industry players in the social space, such as Facebook and Twitter, have created Application Programming Interfaces (APIs) that provide public access to large-scale datasets produced from user-generated content (UGC) on social networking sites (Yau 2012). Because APIs have proven so popular for developers working in the big data environment, industry-standard file formats have emerged, which enable data visualizations to be more easily created and shared (Simon 2014).
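As a small, hedged illustration of this kind of tooling, the following Python sketch aggregates a stream of timestamped events with pandas and renders the result with matplotlib. The events are randomly generated stand-ins for, say, posts collected through a platform API; no particular service or dataset is implied.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for timestamped events pulled from an API.
rng = np.random.default_rng(seed=42)
events = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10_000, freq="min"),
    "engagement": rng.poisson(lam=3, size=10_000),
})

# Aggregate the event stream into hourly totals.
hourly = events.set_index("timestamp")["engagement"].resample("1h").sum()

# Render a simple time-series view of the aggregated stream.
fig, ax = plt.subplots(figsize=(8, 3))
hourly.plot(ax=ax)
ax.set_xlabel("time")
ax.set_ylabel("engagement per hour")
ax.set_title("Hourly engagement (synthetic data)")
fig.tight_layout()
fig.savefig("engagement.png")

The same aggregated series could just as easily be handed to an interactive charting library or an off-the-shelf dashboard tool; the preparatory steps of cleaning, aggregating, and choosing an encoding are the part that carries over.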
Technical advances in the tools required to physically create visualizations have also made the production of data visualization more open to the lay user. Data visualization does not rely upon a singular tool or computational method; instead, practitioners typically use a wide spectrum of approaches. At one end of the spectrum rest simple spreadsheet programs that can be used by nonspecialists. When paired with off-the-shelf (and often open source) programs – which incorporate a battery of tools to visualize data in scatterplots, tree diagrams, and maps – users can create visualizations in hours without skill sets in coding or design. For those with more proficiency in programming, however, tailored and sophisticated data visualizations can be created using Python, D3, or R. Developers are currently pouring significant energies into making these coding languages and scripts even easier for the general public to manipulate. Such advances will further empower nonspecialist audiences in the creation of data visualizations.
In the mid-2010s, the movement toward open data has also led to greater public engagement in the production of data visualization. With a nod toward heightened transparency, organizational leaders (particularly those in government) have unlocked access to big data, providing these large-scale datasets to the public for free. Expanded access to big data produced by Western democratic governments, such as open government initiatives in the United States and the United Kingdom, has precipitated the use of data visualization by civic activists (Yau 2012). The creation of data visualizations following from open government initiatives may, in the long term, foster stronger impact measures of public policy, ultimately bolstering the availability and efficiency of governmental services (Simon 2014).
Beyond use by governments, the application of data visualization in industry can help inform the process of corporate decision-making, which can further spur innovation (Steele and Iliinsky 2010). To the busy executive, data visualizations provide an efficient encapsulation of complex datasets that are too difficult and laborious to understand on their own. A typical use case in the field of business, for instance, centers upon visually depicting return on investment – translating financial data for decision makers who may not fully grasp the technical facets of big data (Steele and Iliinsky 2010). The visualization can also unite decision makers by creating a shared orientation toward a problem, from which leaders can take direct action (Simon 2014). Within the practice of journalism, data visualization is also increasingly accepted as a viable mechanism for storytelling – sharing insights gleaned from big data with the general public (Cairo 2012).


Conclusion

Future cycles of technological change (particularly the entry of augmented and virtual reality) will likely create a stronger climate for visual data, researchers concur. Within the last 20 years alone, the internet has become a more visual medium. And with the rise of the semantic web and the expansion of data streams available, the interconnectivity of data will require more sophisticated tools – such as data visualization – to convey deeper meaning to users (Simon 2014). The entry of visual analytics, for instance, uses data visualizations to drive real-time decision-making (Myatt and Johnson 2011). Despite the expanded use of data visualizations in a variety of use cases, a current knowledge gap in data literacy exists between those creating visualizations and the end users of the products (Ryan 2016). As this gap begins to close, the continued growth of data visualization as a tool to understand the complexity of large-scale datasets will likely broaden public use of big data in future decades.


Cross-References

▶ Open Data
▶ Social Network Analysis
▶ Visualization
Further Reading

Cairo, A. (2012). The functional art: An introduction to information graphics and visualization. Berkeley: New Riders.
Friendly, M. (2008). A brief history of data visualization. In C. Chen, W. Hardle, & A. Unwin (Eds.), Handbook of data visualization (pp. 15–56). Berlin: Springer.
Gray, J., Chambers, L., & Bounegru, L. (2012). The data journalism handbook: How journalists can use data to improve the news. Beijing: O'Reilly Media.
Lankow, J., Ritchie, J., & Crooks, R. (2012). Infographics: The power of visual storytelling. Hoboken: Wiley.
Mauldin, S. K. (2015). Data visualizations and infographics. Lanham: Rowman & Littlefield.
Myatt, G. J., & Johnson, W. P. (2011). Making sense of data iii: A practical guide to designing interactive data visualizations. Hoboken: Wiley.
Ryan, L. (2016). The visual imperative: Creating a visual culture of data discovery. Cambridge: Morgan Kaufmann.
Simon, P. (2014). The visual organization: Data visualization, big data, and the quest for better decisions. Hoboken: Wiley.
Steele, J., & Iliinsky, N. (2010). Beautiful visualization: Looking at data through the eyes of experts. Sebastopol: O'Reilly Media.
Telea, A. (2015). Data visualization: Principles and practice. Boca Raton: Taylor & Francis.
Ward, M. O., Grinstein, G., & Keim, D. (2010). Interactive data visualization: Foundations, techniques, and applications. Boca Raton: CRC Press.
Yau, N. (2012). Visualize this: The flowing data guide to design, visualization and statistics. Indianapolis: Wiley.


Data Visualizations

▶ Business Intelligence Analytics


Data Warehouse

▶ Data Mining


Data Wrangling

▶ Data Cleansing


Database Management Systems (DBMS)

Sandra Geisler and Christoph Quix
Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
Hochschule Niederrhein University of Applied Sciences, Krefeld, Germany


Overview

DBMS have a long history reaching back to the late 1960s, starting with navigational or network systems (CODASYL). These were the first to enable managing a set of related data records. In the 1970s, relational systems were defined by Codd (1970), and they remain the most important DBMS today. Relational database systems had the advantage that they could be accessed in a declarative way, while navigational systems used a procedural language. As object-oriented programming became more popular in the 1990s, there was also a demand for object-oriented database systems, especially to meet the requirement of storing more complex objects for engineering, architectural, or geographic applications. However, the idea of object-oriented DBMS as a backend for object-oriented applications was not completely realized, as relational DBMS dominated the market and provided object-relational extensions in the late 1990s.
With the growing popularity of Internet applications around the year 2000, DBMS had to face new challenges again. The semi-structured data model addressed the problems of data heterogeneity, i.e., datasets with objects that have an irregular, hierarchical structure (Abiteboul et al. 1999). The XML data format was used as a representation of semi-structured data, which resulted in a demand for XML DBMS. Again, as in the object-oriented case, relational DBMS were equipped with additional functionalities (e.g., an XML data type and XPath queries), rather than XML DBMS becoming popular.
On the other hand, data was no longer only viewed as a structured container whose interpretation was left to a human user.
More interoperability was required; systems should be able to exchange data not only on a syntactical level (as with XML) but also on a semantical level. This led to the need for attaching context and meaning to the data and for creating more intelligent applications, which was addressed by the idea of the Semantic Web (Berners-Lee et al. 2001). Linked data aims at providing more semantics by linking information with semantic ontologies or other data items, leading to complex knowledge graphs that need to be managed by graph databases (Heath and Bizer 2011).
With the advent of mobile devices, widely available mobile Internet, and high bandwidths, data began to be produced with higher volume, velocity, and variety, and challenging requirements for higher scalability, distributed processing, and distributed storage of data came up. Relational systems were no longer able to fulfill these needs appropriately. Data-intensive web applications, such as search engines, had the problem that many users accessed their services at the same time, posing the same simple queries to the system. Hence, the notion of NoSQL systems came up, which provided very simple but flexible structures to store and retrieve the data while also relaxing the consistency constraints of relational systems. Also, shorter software development cycles required more flexible data formats and corresponding storage solutions, i.e., without a mandatory schema (re-)design step before data can be managed by a DBMS.
NoSQL in most cases means non-relational, but many NoSQL systems also provide a relational view of their data and a query language similar to SQL, as many tools for data analysis, data visualization, or reporting rely on a tabular data representation. In the succeeding notions of Not-only-SQL and NewSQL systems, the concepts of NoSQL and relational systems are approaching each other again to combine the advantages of the ACID capabilities of relational systems (ACID is an acronym for desirable properties in transaction management: atomicity, consistency, isolation, and durability (“ACID Transaction” 2009)) with the flexibility, scalability, and distributability of NoSQL systems.
Furthermore, due to the availability of cheap sensors, IoT devices, and other data sources producing data at a very high scale and frequency, the need for processing and analyzing data on the fly, without the overhead and capability of storing it in its entirety, became pressing. Hence, specific systems able to cope with such data evolved, such as data stream management systems and complex event processing systems.
In principle, DBMS can be categorized along various criteria (Elmasri and Navathe 2017). They are distinguished according to the data model they implement (relational, document-oriented, etc.), the number of parallel users (single- or multiple-user systems), distributability (one node, multiple equal nodes, multiple autonomous nodes), system architecture (client-server, peer-to-peer, standalone, etc.), internal storage structures, purpose (graph-oriented, OLTP, OLAP, data streams), cloud-based or not, and many more. Due to space constraints, we are only able to explain some of these aspects in this article.


Architectures

A DBMS architecture is usually described in several layers to obtain a modular separation of the key functionalities. There are architectures that focus on the internal physical organization of the DBMS. For example, a five-layer architecture for relational DBMS is presented in Härder (2005). It describes in detail the mapping from a logical layer with tables, tuples, and views, down to files and blocks. Transaction management is considered as a function that needs to be managed across several layers.
In contrast, the ANSI/SPARC architecture (or three-schema architecture) is an abstract model focusing on the data models related to a database system (Elmasri and Navathe 2017). It defines an internal layer, a conceptual layer, and an external or view layer. The internal layer defines a physical model describing the physical storage and access structures of the database system. The conceptual layer describes the concepts and their attributes, relationships, and constraints which are stored in the database.
The external layer, or view layer, defines user- or application-specific views on the data, thus presenting only a subset of the database. Each layer hides the details of the layers below, which realizes the important principle of data independence. For example, the conceptual layer is independent of the organization of the data at the physical layer.
There are several architecture models according to which DBMS can be implemented. Usually, a DBMS is developed as a client/server architecture, where the client and server software are completely separated and communicate via a network or inter-process communication. Another possibility is a centralized system where the DBMS is embedded into the application program. For processing big data, distributed systems with various nodes managing a shard or replica of the database are used. These can be distinguished according to the degree of autonomy of the nodes, e.g., whether or not a master node organizes the querying and storage of data between the nodes (peer-to-peer system). Finally, as systems can now have main memory of more than one TB, in-memory DBMS have become popular. An in-memory system manages the data in main memory only and thereby achieves better performance than disk-based DBMS, which always guarantee persistence of committed data. Persistence of data on disk can also be achieved in in-memory DBMS but requires an explicit operation. Examples are SAP HANA, Redis, Apache Derby, and H2.

Transaction Management and ACID Properties
Transaction management is a fundamental feature of DBMS. In the read-write model, a transaction is represented as a sequence of the following abstract operations: read, write, begin (of transaction), commit, and abort. Transaction management controls the execution of transactions to ensure the consistency of the database (i.e., that the data satisfies the constraints) and to enable an efficient execution. If the DBMS allows multiple users to access the same data at the same time, several consistency problems may arise (e.g., Dirty Read or Lost Update). To avoid these problems, a transaction manager strives to fulfill the ACID properties, which can be guaranteed by different approaches. Two-phase locking uses locks to schedule parallel transactions, with the risk of deadlocks; optimistic concurrency control allows all operations to be executed but aborts a transaction if a critical situation is identified; snapshot isolation provides a consistent view (snapshot) to a transaction and checks at commit time whether there was a conflict with another update in the meantime.

Distributed Systems and Transaction Management
NoSQL systems often have relaxed guarantees for transaction management and do not strictly follow the ACID properties. As they focus on availability, their transaction model is abbreviated as BASE: Basically Available, Soft state, Eventually consistent. This model is based on the CAP theorem (Brewer 2000), which states that a distributed (database management) system can only guarantee two out of the three properties: consistency, availability, and partition tolerance. To guarantee availability, the system has to be robust against failures of nodes (either simple nodes or coordinator nodes). For partition tolerance, the system must be able to compensate for network failures. Consistency in distributed DBMS can be viewed from different angles. For a single node, the data should be consistent, but in a distributed system, the data of different nodes might be inconsistent. Eventual consistency, as implemented in many NoSQL systems, assures that changes are distributed to the nodes, but it is not known when. This provides better performance, as delays due to synchronization between different nodes are not necessary while the transaction is committed.
In addition to the relaxed transaction management, distributed NoSQL systems provide additional performance by using sharding and replication. Sharding is partitioning the database into several subsets and distributing these shards across several nodes in a cluster. Thus, complex queries processing a huge amount of data (as often required in Big Data applications) can be distributed to multiple nodes, thereby multiplying the compute resources which are available for query processing.
Of course, distributed query processing can only be beneficial for certain types of queries, but the Map-Reduce programming model fits very well to this type of distributed data management. Replication is also an important aspect for a distributed DBMS, as it allows for the compensation of network failures. As shards are replicated to several nodes, in the case of failure of a single node, the work can be re-assigned to another node holding the same shard.


DBMS Categories

Relational DBMS
Relational DBMS are by far the most popular, mature, and successful DBMS. Major players in the market are Microsoft SQL Server, Oracle, IBM DB2, MySQL, and PostgreSQL. Their implementations are based on a mathematically well-founded relational data model and the corresponding languages for querying the data (relational algebra and relational calculus) (Codd 1970). A relational database system requires the definition of a schema before data can be inserted into a database. Main principles in the design of relational schemata are integrity, consistency, and the avoidance of redundant data. Normalization theory has been developed as a strict methodology for the development of relational schemata which guarantees these principles.
Furthermore, this methodology ensures that the resulting schema is independent of a particular set of queries, i.e., all types of queries or applications can be supported equally well. Relational query languages, mainly SQL and its dialects, are based on the set-oriented relational algebra (procedural) and the relational calculus (declarative). However, one critique of the NoSQL community is that normalized schemata require multiple costly join operations in order to recombine the data that has been split across multiple tables during normalization. Nevertheless, the strong mathematical foundation of the relational query languages allows for manifold query optimization methods on different levels in the DBMS.

NoSQL Database Management Systems
As stated above, NoSQL DBMS have been developed to address several limitations of relational DBMS. The development of NoSQL DBMS started in the early 2000s, with an increasing demand for simple, distributed data management in Internet applications. Limited functions for horizontal scalability and high costs for mature distributed relational DBMS were main drivers of the development of a new class of DBMS. In addition, the data model and query language were considered too complex for many Internet applications. While SQL is very expressive, the implementation of join queries is costly for an application developer, as updates cannot be performed directly on the query result. This led to the development of new data models that fit better to the needs of Internet applications. The main idea of these data models is to store the data in an aggregated data object (e.g., a JSON object or XML document) rather than in several tables as in the relational model.
The simplest data model is the key-value data model. Any kind of data object can be stored with a key in the database. Retrieval of the object is done by the key, possibly in combination with some kind of path expression to retrieve only a part of the data object. The data object is often represented in the JSON format, as this is the data model for web applications. This model is well suited for applications which retrieve objects only by a key, such as session objects or user profiles. On the other hand, more complex queries to filter or aggregate the objects based on their content are not directly supported. The most popular key-value DBMS is Redis (https://redis.io/).
The document-oriented data model is a natural extension of the key-value model, as it stores JSON documents (as a collection of single JSON objects). The query language of document-oriented DBMS is more expressive than for key-value systems, as filtering, aggregation, and restructuring operations are also supported. The most prominent system in this class is MongoDB (https://www.mongodb.com/), which also supports a kind of join operation between JSON objects.
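As a brief, hedged illustration of the document-oriented model, the following Python sketch uses the PyMongo driver with an invented collection of user profiles; it assumes a MongoDB server is running locally on the default port.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumes a local MongoDB instance
profiles = client["example_db"]["profiles"]

# Store whole aggregated objects (documents) instead of splitting them across tables.
profiles.insert_one({
    "_id": "user-42",
    "name": "Ada",
    "interests": ["databases", "streams"],
    "address": {"city": "Krefeld", "country": "DE"},
})

# Content-based queries (filtering) are supported directly.
for doc in profiles.find({"interests": "databases"}):
    print(doc["name"])

# Simple aggregation: count profiles per country.
pipeline = [{"$group": {"_id": "$address.country", "n": {"$sum": 1}}}]
print(list(profiles.aggregate(pipeline)))

A key-value store such as Redis would be used in a similar way, but retrieval would essentially be limited to look-ups by key.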
Wide column stores can be considered a combination of relational and key-value DBMS. The logical model of a wide column store also has tables and columns, as in a relational DBMS. However, the physical structure is more like a two-level key-value store. In the first level, a row key is used that links the row key with several column families (i.e., groups of semantically related columns). The second level relates a column family with values for each column. Physically, the data within a column family is stored row-by-row as in relational systems. The advantage is that a column family is only a subset of the complete table; thus, only a small part of the table has to be read if access to only one column family is necessary. Apache Cassandra (http://cassandra.apache.org/) is the most popular system in this category.
Finally, graph-oriented DBMS store the data in graphs. As described above, graph DBMS can be used in cases where more complex, semantically rich data has to be managed, e.g., the data of knowledge graphs. Although it is often stated that the graph data model fits well to social networking applications, Facebook uses a custom, distributed variant of MySQL to manage its data. Graph-oriented DBMS should be applied only if specific graph analytics is required in the applications (e.g., shortest path queries, neighborhood queries, clustering of the network). Neo4j (https://neo4j.com/) is frequently used as a graph database.

Streaming
A specific type of DBMS has evolved as data sources producing data at high frequencies became more and more prevalent. Low-cost sensors and high-speed (mobile) Internet available to the wide mass of people opened up new possibilities for applications, such as high-speed trading or near real-time prediction. Common DBMS were no longer able to handle or store these amounts and frequencies of data, and so the new paradigm of data stream management systems developed (Arasu et al. 2003). A data stream is usually unbounded, and it is not known if and when a stream will end. DSMS are human-passive, machine-active systems, i.e., queries are registered once at the system (hence also termed standing queries) and are executed over and over again, producing results while data streams into the system. Usually a DSMS also follows a certain data model, and usually this is the relational model. Here a stream is equivalent to a table, for which a schema with attributes of a certain domain is defined. Timestamps are crucial for the processing and are an integral part of the schema. Several principles which are valid for common DBMS cannot be applied to DSMS. For example, specific query operators, such as joins, block the production of a result, as they wait indefinitely for the stream (the data set) to end in order to produce a result. Hence, either operator implementations suitable for streams are defined, or only a part of the stream (a window) is used for the query to operate on. Furthermore, there are operators or system components which are responsible for keeping up the throughput and performance of the system. If measured QoS parameters indicate that the system is getting slower, and if completeness of data is not an issue, tuples may be dropped by sampling. Parallel to the DSMS paradigm, the concept of complex event processing (CEP) developed. In CEP, each tuple is regarded as an event, and simple and complex events (describing a situation or a context) are distinguished. CEP systems are specifically designed to detect complex events based on these simple event streams using patterns. CEP systems can be built using DSMS, and both share similar or the same technologies to enable stream processing.
Further possibilities to work with streaming data are time series databases (TSDB). TSDB are optimized to work on time series and offer functionality for the usual operations on time series, such as aggregations over large time periods. In contrast to DSMS, TSDB keep a temporary history of data points for analysis. Here also different data models are used, but more balanced between NoSQL systems and relational systems. They can be operated as in-memory systems or persistent storage systems. Important features are distributability and clusterability to ensure high performance and high availability.
the granularity of data storage they offer. Elasticsearch (https://www.elastic.co/prod


A prominent example is InfluxDB included in ucts/elasticsearch) is also not a DBMS in the first
the TICK stack offering a suite of products for place, but a search engine. However, it can be
the complete chain for processing, analysis, and used to manage any kind of documents, including
alerting. Another example system is Druid, which JSON documents, thereby, enabling the manage-
offers the possibility of OLAP on time series data ment of semi-structured data. In combination with
used by big companies, such as Airbnb or eBay. other tools to transform (Logstash) and visualize
Other Data Management Systems in the Context of Big Data

Apache Hadoop (https://hadoop.apache.org/) is a set of tools which are focused on the processing of massive amounts of data in a parallel way. The main components of Hadoop are the Map-Reduce programming framework and the Hadoop Distributed File System (HDFS). HDFS, as the name suggests, is basically a file system and can store any kind of data, including simple formats such as CSV and unstructured texts. As a distributed file system, it also implements the features of sharding and replication described above. Thereby, it fits well to the Map-Reduce programming model to support massively parallel, distributed computing tasks.

As HDFS provides the basic functionality of a distributed file system, many NoSQL systems can work well with HDFS as the underlying file system, rather than a common local file system. In addition, specific file formats such as Parquet or ORC have been developed to fit better to the physical structure of data in HDFS. On top of HDFS, systems like HBase and Hive can be used to provide a query interface to the data in HDFS which is similar to SQL.
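The Map-Reduce programming model mentioned above can be sketched with a minimal, single-process Python word count; in Hadoop the same map and reduce logic would run in parallel over blocks of files stored in HDFS, with the framework performing the shuffle step. The sample lines are invented for illustration.

from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for one word.
    return word, sum(counts)

lines = ["big data needs parallel processing", "big data lives in hdfs"]

# Shuffle: group mapped pairs by key, as the framework would between map and reduce.
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in groups.items())
print(word_counts)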
Apache Spark (https://spark.apache.org/) is not a DBMS, although it also provides data management and query functionalities. At its core, Apache Spark is an analytics engine which can efficiently retrieve data from various backend DBMS, including classical relational systems, NoSQL systems, and HDFS. Spark supports its own dialect of SQL, called SparkSQL, which is translated to the query language of the underlying database system. As Spark has an efficient distributed computing system for the transformation or analysis of Big Data, it has become very popular for Big Data applications.
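A brief, hedged sketch of how such an engine is typically used from Python (PySpark), reading an illustrative CSV file from HDFS and querying it with SparkSQL; the file path and column names are assumptions made for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

# Read an assumed CSV file stored in HDFS into a DataFrame.
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view and query it with SparkSQL.
df.createOrReplaceTempView("sales")
top_products = spark.sql(
    "SELECT product, SUM(amount) AS total "
    "FROM sales GROUP BY product ORDER BY total DESC LIMIT 10"
)
top_products.show()
spark.stop()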

Elasticsearch (https://www.elastic.co/products/elasticsearch) is also not a DBMS in the first place, but a search engine. However, it can be used to manage any kind of documents, including JSON documents, thereby enabling the management of semi-structured data. In combination with other tools to transform (Logstash) and visualize (Kibana), it is often used as a platform for analyzing and visualizing semi-structured data.

We have only been able to mention here the most important Big Data management systems. As the field of Big Data management is very large and developing very quickly, such an enumeration can only be incomplete.

Cross-References

▶ Big Data Theory
▶ Complex Event Processing (CEP)
▶ Data Processing
▶ Data Storage
▶ Graph-Theoretic Computations/Graph Databases
▶ NoSQL (Not Structured Query Language)
▶ Semi-structured Data
▶ Spatial Data

References

Abiteboul, S., Buneman, P., & Suciu, D. (1999). Data on the web: From relations to semistructured data and XML. San Francisco: Morgan Kaufmann.
ACID Transaction. (2009). In L. Liu & M. T. Özsu (Eds.), Encyclopedia of database systems (pp. 21–26). Springer US. Retrieved from https://doi.org/10.1007/978-0-387-39940-9_2006.
Arasu, A., Babu, S., & Widom, J. (2003). An abstract semantics and concrete language for continuous queries over streams and relations. In Proceedings of the international conference on data base programming languages.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 34–43.
Brewer, E. A. (2000). Towards robust distributed systems (abstract). In G. Neiger (Ed.), Proceedings of the nineteenth annual ACM symposium on principles of distributed computing, 16–19 July 2000, Portland. ACM, p. 7.
Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377–387.
Elmasri, R. A., & Navathe, S. B. (2017). Fundamentals of database systems (7th ed.). Harlow: Pearson.
Härder, T. (2005). DBMS architecture – still an open problem. In G. Vossen, F. Leymann, P. C. Lockemann, & W. Stucky (Eds.), Datenbanksysteme in Business, Technologie und Web, 11. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), Karlsruhe, 2.–4. März 2005 (Vol. 65, pp. 2–28). GI. Retrieved from http://subs.emis.de/LNI/Proceedings/Proceedings65/article3661.html.
Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. San Rafael: Morgan & Claypool Publishers.

Datacenter

▶ Data Center

Data-Driven Discovery

▶ Data Discovery

Datafication

Clare Southerton
Centre for Social Research in Health and Social Policy Research Centre, UNSW Sydney, Sydney, NSW, Australia

Definition

Datafication refers to the process by which subjects, objects, and practices are transformed into digital data. Associated with the rise of digital technologies, digitization, and big data, many scholars argue that datafication is intensifying as more dimensions of social life play out in digital spaces. Datafication renders a diverse range of information as machine-readable, quantifiable data for the purpose of aggregation and analysis. Datafication is also used as a term to describe a logic that sees things in the world as sources of data to be "mined" for correlations or sold, and from which insights can be gained about human behavior and social issues. This term is often employed by scholars seeking to critique such logics and processes.

Overview

The concept of datafication was initially employed by scholars seeking to examine how the digital world is changing with the rise of big data and data economies. However, as datafication itself becomes more widespread, scholarship in a range of disciplines and subdisciplines has drawn on the concept to understand broader shifts towards rendering information as data for pattern analysis, beyond online platforms. The concept of datafication, in its present form, emerged in the last 5 years with the growth of data analytics, being popularized by Viktor Mayer-Schönberger and Kenneth Cukier's (2013) book Big Data: A Revolution That Will Transform How We Live, Work, and Think, which describes its capacity as a new approach for social research. While datafication is distinct from digitization, as Mayer-Schönberger and Cukier (2013) point out, digitization is often part of the process of datafication. So too is quantification: when information is "datafied," it is reduced to those elements of the information that can be counted, aggregated, calculated, and rendered machine-readable. As such, there are significant complexities that are lost in this process, which renders qualitative detail invisible, and indeed this critique has been substantially developed in the literature examining the logics of big data (see, e.g., Boyd and Crawford 2011; Kitchin 2014; van Dijck 2014). Furthermore, a range of methodological and epistemological issues are raised about the insights of data drawn from new data economies, in which there are a range of existing inequalities, as well as huge value to be found in encouraging participation in digitally mediated social interaction and practices of sharing personal information online (see, e.g., Birchall 2017; van Dijck 2013; Zuboff 2015).
The Datalogical Turn

As more and more aspects of social life have begun to generate digital data, the possibility of analyzing this data in order to produce opportunities for profit has substantially changed the nature of how digital infrastructures are oriented. In particular, the capacity that exists now to analyze large data sets, what we call "big data," and the ability to draw together insights from multiple data sets (e.g., search engine data, social media demographic information, viewing history on YouTube, etc.) has significantly changed how online platforms operate. Data scientists seek to produce findings on a wide range of issues by examining the data traces that individuals leave behind. Big data analytics has been used in the private sector for a range of purposes, including for filtering digital content or products in the online marketplace in the form of recommendations and, most prominently, through targeted advertisements. In addition, datafication has been identified by some as an opportunity to gain unprecedented access to data for social research. Sometimes called "the datalogical turn" or "the computational turn," recently greater attention has been paid to the sociological insights offered by these large datasets. There has also been unease in the social sciences surrounding the use of big data, particularly social media data, to analyze culture and address social problems. Media scholar José van Dijck (2014) argues that datafication has become a pervasive ideology – "dataism" – in which the value and insights of aggregated data are seen as implicit. This ideology also places significant trust and legitimacy in the institutions that collect this data, despite their financial interests. Indeed, such an ideology is clear in the claims made early in the big data revolution about the exponential capacity of big data to explain social life, with some proponents of big data analysis proclaiming the "end" of social theory. These claims were made on the basis that theorizing why people act in certain ways, indeed the very questions that formed the basis of much social scientific inquiry, was rendered irrelevant by big data's ability to see patterns of actions on a mass scale. In essence, datafication was seen as a way to bypass the unnecessary complexity of social life and identify correlations, without the need for meaningful explanation. Many of these early claims have been tempered, especially as the predictive power of big data has failed to deliver on many of its utopian promises.

The Datafication of Social Life

The data that is aggregated by data scientists is predominantly made possible by the data traces that online interactions generate, which can now be collected and analyzed, by what we might term the "datafication of social life." As social media platforms have come to host more of our daily interactions, these interactions have become parcels of data in the form of comments, likes, shares, and clicks. Alongside these interactions, our digital lives are also comprised of a range of data-gathering activities: web browsing and using search engines, interactions with advertisements, online shopping, digital content streaming, and a vast array of other digital practices are rendered as pieces of data that can be collated for analysis to identify trends and, likely if commercialized, further opportunities for profit (Lupton 2019). Even beyond the activities users undertake online, the geo-locative capacities of digital devices now allow the collection of detailed location data about their user. This datafication of social life has significantly changed the organization of digital platforms, as profit can now be drawn from the collection of data, and as such dataveillance has become embedded into almost all aspects of digital interaction.

Beyond online interactions and social media use, recent years have seen datafication and data analytics spread to a range of fields. Scholars have identified the datafication of health, both in the trend of individual self-tracking technologies, such as fitness trackers and smart watches, and in the ways in which clinical practice has become increasingly data-driven, especially when it comes to how governments deal with medical information (Ruckenstein and Schüll 2017). So too has education been impacted by datafication,
as children are increasingly monitored in schools by RFID in uniforms, facial recognition-enabled CCTV, and online monitoring of classwork (Taylor 2013). Scholars have also drawn attention to the forms of dataveillance impacting childhood beyond education, through parenting apps and child-tracking technologies. The spread of datafication points to the power of the pervasive ideology of datafication van Dijck (2014) described, whereby objective truth is to be found by rendering any social problem as digital data for computational analysis.

Critiques of Datafication

The logics of datafication have been substantially critiqued by social scientists. Privacy and surveillance scholars have highlighted widespread issues surrounding the way datafication facilitates the passive collection of personal information, in ways that platform users may not be aware of, and the way data is stored for a wide range of future uses that users cannot meaningfully consent to. As datafication spreads into more areas of social life, notions of consent become less helpful, as users of digital platforms and datafied services often feel they do not have the option to opt out. Furthermore, large-scale data leaks and hacks of social media platforms demonstrate the fragility of even high-standard data protection systems.

In addition to privacy concerns, datafication can reproduce and even exacerbate existing social inequalities. Data-driven risk evaluation systems such as those now routinely employed by financial service providers and insurance companies can perpetuate discrimination against already marginalized communities (Leurs and Shepherd 2017). Furthermore, such discrimination is masked by the mythology of objectivity, insight, and accuracy surrounding these systems, despite their often opaque workings. While discrimination is certainly not new, and does not arise solely as a product of datafication and systems driven by big data, these systems facilitate discrimination in a manner that eludes observation and dangerously legitimizes inequalities as natural and evidenced by data, rather than as a product of implicit bias.

Scholars have also raised concerns about the datafication of social science and the ways computational methods have impacted social research. Computational social science has been accused of presenting big data as speaking for itself, as a kind of capture of social relations, rather than as constituted by commercial forces and indeed by the new forms of digital sociality this data emerges from. For example, using social media data as a way to gauge public opinion often inappropriately represents such data as representative of society as a whole, neglecting important differences between the demographics of different platforms and the specific affordances of digital spaces, which may give rise to different responses. Similarly, these large datasets can establish correlations between seemingly disparate variables that, when presented as proof of a causal relationship, can prove misleading (Kitchin 2014). This is not to suggest that scholars disregard this data, but rather that caution must be employed to ensure that such data is appropriately contextualized with theoretically informed social scientific analysis. Qualitative differences must be examined through attention rather than smoothed out.

Conclusion

The process of datafication serves to transform a wide range of phenomena into digitized, quantifiable units of information for analysis. With the mass infiltration of smart technologies into everyday life, and as more social interaction is filtered through social media platforms and other online services, data is now generated and collected from a diverse array of practices. Consequently, datafication and computational social science can offer significant insights into digitally embedded lives. However, as many scholars in the social sciences have argued, this process is inevitably reductive of the complexity of the original object and the rich social context to which it belongs.

Further Reading

Birchall, C. (2017). Shareveillance: The dangers of openly sharing and covertly collecting data. Minneapolis: University of Minnesota Press.
boyd, d., & Crawford, K. (2011). Six provocations for big data. Presented at A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, Oxford. https://doi.org/10.2139/ssrn.1926431.
Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data & Society, 1(1), 1–12.
Leurs, K., & Shepherd, T. (2017). Datafication & discrimination. In M. T. Schäfer & K. van Es (Eds.), The datafied society: Studying culture through data (pp. 211–234). Amsterdam: Amsterdam University Press.
Lupton, D. (2019). Data selves: More-than-human perspectives. Cambridge: Polity.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. London: John Murray Publishers.
Ruckenstein, M., & Schüll, N. D. (2017). The datafication of health. Annual Review of Anthropology, 46(1), 261–278.
Taylor, E. (2013). Surveillance schools: A new era in education. In E. Taylor (Ed.), Surveillance schools: Security, discipline and control in contemporary education (pp. 15–39). London: Palgrave Macmillan UK.
van Dijck, J. (2013). The culture of connectivity: A critical history of social media. Oxford: Oxford University Press.
van Dijck, J. (2014). Datafication, dataism and dataveillance: Big data between scientific paradigm and ideology. Surveillance & Society, 12(2), 197–208.
Zuboff, S. (2015). Big other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30(1), 75–89.

Data-Information-Knowledge-Action Model

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

DIKW pyramid; Information hierarchy; Knowledge hierarchy; Knowledge pyramid

Introduction

Facing the massive amounts and various subjects of datasets in the Big Data era, it is impossible for humans to handle the datasets alone. Machines are needed in data manipulation, and a model of Data-Information-Knowledge-Action will help guide us through the process of applying big data to tackle scientific and societal issues. Knowledge is one's expertise or familiarity with the subject at hand. Knowledge is necessary in the process of generating information from data about a certain issue and then taking actions. New knowledge can be generated on both the individual level and the community level, and certain explicit knowledge can be encoded as machine-readable knowledge bases and be used as tools to facilitate the process of data management and analysis.

Understand the Concepts

The four concepts data, information, knowledge and action are often seen in the language people use in problem tackling and decision-making for various scientific and societal issues. Data are the representation of some facts. We can see data of various topics, types, and dimensions in the real world, such as a geologic map of the United Kingdom, records of sulfur dioxide concentration in the plume of Poás Volcano, Costa Rica, the weekly records of the sales of cereal in a Wal-Mart store located at Albany, NY, and all the Twitter tweets with hashtag #storm in January 2015. Data can be recorded on different media. The computer hard disks, thumb drives, and CD-ROMs that are popular nowadays are just part of the media, and the use of computer-readable media significantly promotes and speeds up the transmission of data.

Information is the meaning of data as interpreted by human beings. For example, a geologist may find some initial clues for iron mine exploration by using a geologic map, a volcanologist may detect a few abnormal sulfur dioxide concentration values in the plume of a volcano, a business manager may find that the sales of cereal of a certain brand have been rising in the past three weeks, and a social media analyst may find spatio-temporal correlations between tweets with hashtag #storm and the actual storm that happened in the northeast United States in January 2015. In the context of Big Data, there are massive amounts of data available, but most of them could be just noise, irrelevant to the subject at hand. In this situation, a step of
data cleansing can be deployed to validate the records, remove errors, and even collect new records to enrich the data. People then need to discover, from the data, the clues, news, and updates that make sense for the subject.

Knowledge is people's expertise or familiarity with one or more subjects. People use their knowledge to discover information from data, and then make decisions based on the information and their knowledge. The knowledge of a subject is multifaceted and often evolves with new discoveries. The process from data to information in turn may make new contributions to people's knowledge. A large part of such expertise and familiarity is often described as tacit knowledge in a person's brain, which is hard to write down or share. In contrast, it is now possible to record and encode a part of people's knowledge into machine-readable format, which is called explicit or formal knowledge, such as the controlled vocabulary of a subject domain. Such recorded formal knowledge can be organized as a knowledge base, which, in conjunction with an inference engine, can be used to build an expert system to deduce new facts or detect inconsistencies.
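As a hedged illustration of the knowledge base plus inference engine combination described above, the following minimal Python sketch applies simple if-then rules (forward chaining) to a set of known facts in order to deduce new facts; the facts and rules are invented for illustration and are not part of the original entry.

# A tiny knowledge base: each rule is (set of premises, conclusion).
rules = [
    ({"contains_magnetite", "high_density"}, "iron_bearing_rock"),
    ({"iron_bearing_rock", "large_ore_body"}, "exploration_target"),
]
facts = {"contains_magnetite", "high_density", "large_ore_body"}

# Forward-chaining inference engine: keep firing rules until nothing new is deduced.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(facts)  # now also contains the deduced facts "iron_bearing_rock" and "exploration_target"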
Action is the deeds or decisions made based on knowledge and information. In real-world practices of the Data-Information-Knowledge-Action model, data are collected surrounding a problem to be addressed, then the data are interpreted to identify competing explanations for the problem, as well as the uncertainties of those explanations. Based on that work, one or more decisions can be made, and the decisions will be reflected in the following actions. The phrase "informed decision" is often used nowadays to describe such a situation, in which people make decisions and take actions after knowing all the pros, cons, and uncertainties, based on their knowledge as well as the data and information at hand.

Manage the Relationships

In practice, data users such as scientific researchers and business managers need to take care of the relationships among the four concepts in the Data-Information-Knowledge-Action model, as well as of the loops between two or more steps in the usage of this model. The first is the data-information relationship. The saying "Garbage in, garbage out" describes the relationship in an intuitive way. Users need to recognize the inherent flaws of all data and information. For example, the administrative boundary areas on geological maps often confuse people, because the geological survey and mapping of the areas on each side of a boundary are undertaken by different teams, and there could be discrepancies in their descriptions of the same region divided by the boundary. For complex datasets, visualization is a useful approach, which can be used not only for summarizing the data to be processed but also for presenting the information generated from the data (Ma et al. 2012).

The second relationship is between information and knowledge. People may interpret the same data in different ways based on their knowledge, and generate different information. Even for the same piece of information, they may use it differently in decision-making. To use data and information in a better way, it is necessary for people to keep in mind what they know best and what they know less, and to collaborate with colleagues where necessary. If information is generated from data through a workflow, well-documented metadata about the workflow will provide provenance for both data and information, which improves their credibility. Through reading the metadata, users know about the data collectors, collection date, instruments used, data format, subject domains, as well as the experiment operator, software program used, methods used, etc., and thus obtain insight into the reliability, validity, and utility of the data and information.

The third relationship is between knowledge and action. For a real-world workflow, the focus in addressing the problems faced is the action items. Analyzing and solving problems needs knowledge on many subjects. Knowledge sharing is necessary within a working team that runs the workflow. Team members need to come together
to discuss essential issues and focus on the major goals. For example, a mineral exploration project may need background knowledge in geology, geophysics, geochemistry, remote sensing image processing, etc., because data of those subjects are collected and to be used. People need to contribute their knowledge and skills relevant to the problem and collaborate with others to take actions. In certain circumstances, they need to speed up the decision and action process at the expense of optimization and accuracy.

An ideal usage of the Data-Information-Knowledge-Action model is that relevant information is discovered, appropriate actions are taken, and people's knowledge is enhanced. However, in actual work there could be closed loops that weaken the usefulness of the model. For example, there could be loops among data, information and knowledge without actions taken. There could also be loops between knowledge and action, in which no information from the latest data is used in decision-making. Another kind of loop is between information and action, in which no experience is saved to enhance the knowledge, and people continue to respond to the latest information without learning from previous work.

Reverse Thinking

Besides relationships and loops between two or more steps in the usage of the Data-Information-Knowledge-Action model, the model itself can also be considered in a reversed way. The so-called knowledge-based system or expert system is the implementation of the reverse thinking of this model. There are two typical components in a knowledge-based system: one is a knowledge base that encodes explicit knowledge of relevant subject domains, and the other is an inference engine. While people's knowledge is always necessary in the process of discovering information from data, knowledge-based systems are a powerful tool to facilitate the process. In the field of the semantic web, people are working on ontologies as a kind of knowledge base. An ontology is a specification of a shared conceptualization of a domain. There are various types of ontologies based on the level of detail in the specification of the meaning of concepts and the assertion of relationships among concepts. For instance, a controlled vocabulary of a subject domain can be regarded as a simple ontology. Ontologies are widely used in knowledge-based systems to handle datasets on the Web, including the generation of new information. In recent research, an ontology was built for the geologic time scale. Geologic time is reflected in the rock age descriptions on various geologic map services. The built ontology was used to retrieve the spatial features with records of certain rock ages and to generalize the spatial features according to user commands. To realize the map generalization function, programs were developed to query the relationships among geologic time concepts in the ontology. Besides those functions, the ontology was also used to annotate rock age concepts retrieved from a geologic map service, based on the concept specifications encoded in the ontology and further information retrieved from external resources such as Wikipedia.

Another kind of reverse thinking is that, after a whole procedure of Data-Information-Knowledge-Action, changes in knowledge may take place inside an individual who took part in the workflow. The changes could be the discovery of new concepts, the recognition of updated relationships between concepts, or the modification of previous beliefs, etc. The term "action learning" is used to describe the situation in which an individual learns when he or she takes part in an activity. When the individual has learned to do a different action, he or she has obtained new knowledge. This can be regarded as a reversed step of the Data-Information-Knowledge-Action model and can also be regarded as an extension to it. That is, two other steps, Learning and New Knowledge, can be added next to the Action step in the model. All learning is context dependent (Jensen 2005). Similar to the aforementioned collaboration among individuals to transform data into information, learning takes place as a negotiation of
meaning between the collaborators in a community of practice. Individuals learn and obtain new knowledge, and in turn the new knowledge can be used to discover new information from data, which provides the foundation for the creation of new knowledge in other individuals' minds.

Communities of Practice

The generation of new knowledge involves human thinking with information. The significant progress of artificial intelligence and knowledge-based systems in recent years has inspired a knowledge revolution in various subject domains (Petrides 2002). Yet, the knowledge revolution itself still needs human systems to realize it. To facilitate the knowledge revolution, issues relevant to thinking and to information should both be addressed. An intuitive and natural way to do this is to build communities and promote communities of practice. As mentioned above, communities of practice not only allow members to work together on data analysis to generate new information, they also help community members think together to generate new knowledge, on both the individual level and the community level (Clampitt 2012). Modern information technologies have already provided efficient facilities for such collaborations, and more challenges come from the social or cultural side. That is, individuals in a community should be willing to share their findings and be open to new ideas. The community as a whole should maintain diversity while trying to achieve consensus on the commonly shared and agreed ideas (McDermott 1999). A typical example of such communities is the World Wide Web Consortium (W3C), which develops standards for the Web. A large part of W3C's work is coordinating the development of ontologies for various domains, which can be regarded as machine-readable knowledge bases. An ontology is normally worked on by a group of individuals and organizations across the world and should go through several stages of review, testing and revision before it can become a W3C Recommendation. The construction and implementation of ontologies significantly improve the interoperability of multi-source datasets made open on the web, and facilitate the development of intelligent functions to aid the procedure of Data-Information-Knowledge-Action in Big Data manipulation.

Cross-References

▶ Data Provenance
▶ Data-Information-Knowledge-Wisdom (DIKW) Pyramid, Framework, Continuum
▶ Decision Theory
▶ Knowledge Management
▶ Pattern Recognition

Further Reading

Clampitt, P. G. (2012). Communicating for managerial effectiveness: Problems, strategies, solutions. Thousand Oaks: SAGE Publications.
Jensen, P. E. (2005). A contextual theory of learning and the learning organization. Knowledge and Process Management, 12(1), 53–64.
Ma, X., Carranza, E. J. M., Wu, C., & van der Meer, F. D. (2012). Ontology-aided annotation, visualization and generalization of geological time scale information from online geological map services. Computers & Geosciences, 40, 107–119.
McDermott, R. (1999). Why information technology inspired but cannot deliver knowledge management. California Management Review, 41(4), 103–117.
Petrides, L. A. (2002). Organizational learning and the case for knowledge-based systems. New Directions for Institutional Research, 2002(113), 69–84.

Data-Information-Knowledge-Wisdom (DIKW) Pyramid, Framework, Continuum

Martin H. Frické
University of Arizona, Tucson, AZ, USA

The Data-Information-Knowledge-Wisdom (DIKW) hierarchy, or pyramid, relates data, information, knowledge, and wisdom as four layers in a
pyramid. Data is the foundation of the pyramid, information is the next layer, then knowledge, and, finally, wisdom is the apex. DIKW is a model or construct that has been used widely within Information Science and Knowledge Management. Some theoreticians in library and information science have used DIKW to offer an account of logico-conceptual constructions of interest to them, particularly concepts relating to knowledge and epistemology. In a separate realm, managers of information in business process settings have seen the DIKW model as having a role in the task of meeting real-world practical challenges involving information (Fig. 1).

[Data-Information-Knowledge-Wisdom (DIKW) Pyramid, Framework, Continuum, Fig. 1 The knowledge pyramid: a pyramid with Data at the base, then Information, then Knowledge, and Wisdom at the apex.]

Historically, the strands leading to DIKW come from a mention by the poet T.S. Eliot and, separately, from research by Harland Cleveland and the systems theorists Mortimer Adler, Russell Ackoff, and Milan Zeleny. The main views are perhaps best expressed in the traditional sources of Adler, Ackoff, and Zeleny. Russell Ackoff, in his seminal paper, describes the pyramid from the top down:

Wisdom is located at the top of a hierarchy of types.... Descending from wisdom there are understanding, knowledge, information, and, at the bottom, data. Each of these includes the categories that fall below it... (Ackoff 1989, 3)

In fact, the way the pyramid works as a method is from the bottom up, not top down. The process starts with data and ascends to wisdom. Data are the symbolic representations of observable properties:

Data are symbols that represent properties of objects, events and their environments. They are products of observation. To observe is to sense. The technology of sensing, instrumentation, is, of course, highly developed. (Ackoff 1989, 3)

In turn, information is relevant, or usable, or significant, or meaningful, or processed, data (Rowley 2007, Section 5.3 Defining Information). The vision is that of a human asking a question beginning with, perhaps, "who," "what," "where," "when," or "how many" (Ackoff 1989, 3); and the data is processed into an answer to an enquiry. When this happens, the data becomes "information." Data itself is of no value until it is transformed into a relevant form.

Information can also be inferred from data – it does not have to be immediately available. For example, were an enquiry to be "what is the average temperature for July?", there may be individual daily temperatures explicitly recorded as data, but perhaps not the average temperature; however, the average temperature can be calculated or inferred from the data about individual temperatures. The processing of data to produce information often reduces that data (because, typically, only some of the data is relevant). Ackoff writes:

Information systems generate, store, retrieve, and process data. In many cases their processing is statistical or arithmetical. In either case, information is inferred from data. (Ackoff 1989, 3)

Information is relevant data, together with, on occasions, the results of inferences from that relevant data. Information is thus a subset of the data, or a subset of the data augmented by additional items inferred or calculated or refined from that subset.
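The average-temperature example above can be made concrete with a short, illustrative Python sketch (the daily readings are invented): the recorded data are reduced to the one derived item that answers the enquiry.

# Data: daily temperature readings for July (illustrative values only).
july_daily_temperatures = [21.4, 23.1, 22.8, 25.0, 24.2, 23.7, 22.5]

# Information: the answer to the enquiry "what is the average temperature for July?",
# inferred from the data rather than recorded directly.
average_july_temperature = sum(july_daily_temperatures) / len(july_daily_temperatures)

print(round(average_july_temperature, 1))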

Knowledge, in the setting of DIKW, is often construed as know-how or skill. Ackoff suggests that know-how allows an agent to promote information to a controlling role, that is, to transform information into instructions:

Knowledge is know-how, for example, how a system works. It is what makes possible the transformation of information into instructions. It makes control of a system possible. To control a system is to make it work efficiently. (Ackoff 1989, 4)

Next up the hierarchy are understanding and wisdom. The concept of understanding is almost always omitted from DIKW by everyone (except Ackoff) and, in turn, wisdom is given only limited discussion by researchers and theorists. Ackoff sees wisdom as being the point at which humans inject ethics or morality into systems. He explains:

Wisdom adds value, which requires the mental function we call judgement.... The value of an act is never independent of the actor... [ethical and aesthetic values] are unique and personal.

...wisdom-generating systems are ones that man will never be able to assign to automata. It may well be that wisdom, which is essential to the effective pursuit of ideals, and the pursuit of ideals itself, are the characteristics that differentiate man from machines. (Ackoff 1989, 9)

Ackoff concludes with some numbers, apparently produced out of thin air without any evidence:

...on average about forty percent of the human mind consists of data, thirty percent information, twenty percent knowledge, ten percent understanding, and virtually no wisdom. (Ackoff 1989, 3)

Modern Developments and Variations

There are publications that argue that DIKW should be top down in process, not bottom up. The suggestion is that there is no such thing as "raw data"; rather, all data must have theory in it, and thus theory (i.e., knowledge, information) must illuminate data, top down, rather than the other way around, bottom up. There are publications that add or subtract layers from DIKW; most omit understanding, some omit wisdom, some add messages and learning, and there are other variations on the theme. There are publications that draw DIKW more into management practices, into business process theory, and into organizational learning. Finally, there are publications that take DIKW into other cultures, such as the Maori culture.

Drawing It All Together and Appraising DIKW

DIKW seems not to work. Ackoff (1989) urges us to gather data with measuring instruments and sensors. But instruments are constructed in the light of theories, and theories are essential to inform us of what the surface indications of the instruments are telling us about a reality beyond the instruments themselves. Data is "theory-laden." Data itself can be more than the merely "observable," and it can be more than the pronouncements of "instruments." There are contexts, conventions, and pragmatics at work. In particular circumstances, researchers might regard some recordings as data which report matters that are neither observable nor determinable by instrument.

All data is information. However, there is information that is not data. Information can range much more widely than data; it can be much more extensive than the given. For example, consider the universal statements "All rattlesnakes are dangerous" or "Most rattlesnakes are dangerous." These statements presumably are, or might be, information, yet they cannot be inferred from data. The problem is with the universality, with the "All" or "Most." Any data, or conjunctions of data, are singular. For example, "Rattlesnake A is dangerous," "Rattlesnake B is dangerous," "Rattlesnake C is dangerous," etc., are singular in form. Trying to make the inference from "some" to "all," or to "most," are inductive inferences, and inductive inferences are invalid.

The step from information to knowledge is also not the easiest. In epistemology, philosophers distinguish knowledge that from knowledge how. A person might know that the Eiffel Tower is in France, and she might also know how to ride a bicycle. If knowledge is construed as "know-that," then, under some views of information and knowledge, information and knowledge are much the same. In which case, moving from information to knowledge might not be so hard. However, in the context of DIKW, knowledge is usually taken to be "know-how," and that makes the step difficult. Consider a young person learning how to ride
a bike. What information in particular is required? It is hard to say, and maybe no specific information in particular is required. However, like many skills, riding a bike is definitely coachable, and information can improve performance. Know-how can benefit from information. The problem is in the details.

Wisdom is in an entirely different category to data, information, and know-how. Wisdom certainly uses or needs data, information, and know-how, but it uses and needs more besides. Wisdom is not a distillation of data, information, and know-how. Wisdom does not belong at the top of a DIKW pyramid. Basically, this is acknowledged implicitly by all writers on the topic, from Plato, through Ackoff, to modern researchers.

What about DIKW in the setting of work processes? The DIKW theory seems to encourage uninspired methodology. The DIKW view is that data, existing data that has been collected, is promoted to information and that information answers questions. This encourages the mindless and meaningless collection of data in the hope that one day it will ascend to information, i.e., preemptive acquisition. It also leads to the desire for "data warehouses," with contents that are to be analyzed by "data mining." Collecting data is also very much in harmony with the modern "big data" approach to solving problems. Big data and data mining are somewhat controversial. The worry is that collecting data blind is methodologically suspect.

Know-how in management is simply more involved than DIKW depicts it. As Weinberger (2010) writes:

... knowledge is not a result merely of filtering or algorithms. It results from a far more complex process that is social, goal-driven, contextual, and culturally-bound. We get to knowledge — especially "actionable" knowledge — by having desires and curiosity, through plotting and play, by being wrong more often than right, by talking with others and forming social bonds, by applying methods and then backing away from them, by calculation and serendipity, by rationality and intuition, by institutional processes and social roles. Most important in this regard, where the decisions are tough and knowledge is hard to come by, knowledge is not determined by information, for it is the knowing process that first decides which information is relevant, and how it is to be used.

Wisdom is important in management and decision-making, and there is a literature on this. But, seemingly, no one wants to relate wisdom in management to the DIKW pyramid. The literature on wisdom in management is largely independent of DIKW.

In sum, DIKW does not sit well in modern business process theory. To quote Weinberger (2010) again:

The real problem with the DIKW pyramid is that it's a pyramid. The image that knowledge (much less wisdom) results from applying finer-grained filters at each level, paints the wrong picture. That view is natural to the Information Age which has been all about filtering noise, reducing the flow to what is clean, clear and manageable. Knowledge is more creative, messier, harder won, and far more discontinuous.

Further Reading

Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16, 3–9.
KM4DEV. (2012). DIKW model. Retrieved from http://wiki.km4dev.org/DIKW_model.
Rowley, J. (2007). The wisdom hierarchy: Representations of the DIKW hierarchy. Journal of Information Science, 33(2), 163–180.
Weinberger, D. (2010). The problem with the data-information-knowledge-wisdom hierarchy. Harvard Business Review. Retrieved from https://hbr.org/2010/02/data-is-to-info-as-info-is-not.
Zins, C. (2007). Conceptual approaches for defining data, information, and knowledge. Journal of the American Society for Information Science and Technology, 58(4), 479–493. https://doi.org/10.1002/asi.20508.

Datavis

▶ Data Visualization

Dataviz

▶ Data Visualization
Decision Theory

Magdalena Bielenia-Grajewska
Division of Maritime Economy, Department of Maritime Transport and Seaborne Trade, University of Gdansk, Gdansk, Poland
Intercultural Communication and Neurolinguistics Laboratory, Department of Translation Studies, University of Gdansk, Gdansk, Poland

Decision Theory can be defined as the set of approaches used in various disciplines, such as economics, psychology and statistics, directed at the notions and processes connected with making decisions. Decision theory focuses on how people decide, what determines their choices and what the results of their decisions are. Taking into account the fact that decisions are connected with various spheres of one's life, they are the topic of investigation among researchers representing different disciplines. The popularity of decision theory is not only related to its multidisciplinary nature but also to the features of the twenty-first century. Modern times are characterized by multitasking as well as by a diversity of products and services that demand making decisions more often than before. In addition, the appearance and growth of big data have led to intensified research on decision processes themselves.

Decision Theory and Disciplines

Decision theory is a topic of interest taken up by the representatives of various disciplines. The main questions interesting for linguists focus on how linguistic choices are made and what the consequences of linguistic selections are. The decisions can be studied by taking different dimensions into account. Analyzing the micro perspective, researchers investigate how words or phrases are selected. This sphere of observation includes studies on the role of literal and nonliteral language in communication. As Bielenia-Grajewska (2014, 2016) highlights, research may include, e.g., the role of figurative language in making decisions. Thus, it may be studied how metaphors facilitate or hinder the process of choosing among alternatives. Another scope of investigation is the link between metaphors and emotions in the process of decision-making. Staying within the field of linguistics, education also benefits from decision theories. Understanding how people select among offered alternatives facilitates the cognition of how foreign languages are studied. Decision theory in, e.g., foreign language learning should focus on both learners and teachers. Taking the teacher perspective into account, decision is related to how methods and course materials for students are selected. Looking at the same notion from the learner point of view, investigation concentrates on the way students select language courses and how they learn. Apart from education, linguistic decisions can be observed within organizational settings. For example, managers decide to opt for a linguistic policy that will facilitate effective corporate communication. It may include, among others, the decisions made on the corporate lingo and the usage of regional dialects in corporate settings. Decisions are also made in the sphere of education. Taking into account the studied domain of linguistics, individuals decide which foreign languages they want to study and whether they are interested in languages for general purposes or languages for specific purposes. Another domain relying on decision theory is economics. Discussing economic subdomains, behavioral economics concentrates on how various determinants shape decisions and the market as such. Both qualitative and quantitative methods are used to elicit the needed data. Focusing more on neuroscientific dimensions, neuroeconomics focuses on how economic decisions are made by observing the brain and the nervous system. Management also involves studies on decision making. The way managers make their decisions is investigated by taking into account, among others, their leadership styles, their personal features, the type of industry they work in and the type of situations they have to face. For example, researchers are interested in how managers decide in standard and crisis situations, and how their communication with employees
determines the way they decide. Decision theory is also grounded in politics, since politics influences not only the options available but also the type of decisions that can be made. For example, the type of government, such as democracy, autocracy or dictatorship, determines the influence of individuals on decision-making processes taking place in a given country. As Bielenia-Grajewska (2013, 2015) stresses in her publications, another important field for studying decisions is neuroscience. It should be underlined that the recent advancement in scientific tools has led to a better understanding of how decisions are made. First of all, neuroscience offers a complex and detailed method of studying one's choices. Secondly, neuroscientific tools enable the study of cognitive aspects that cannot be researched by using other forms of investigation. A similar domain focusing on decision theory is cognitive science, with its interest in how people perceive, store information and make decisions. Decision theory is a topic of investigation in psychology; psychologists study how decisions are made or what makes decision-making processes difficult. One of the phenomena studied is procrastination, being the inability to plan decision-making processes, concentrating on less important tasks and often being late with submitting work on time. Psychologists also focus on heuristics, studying how people find solutions to their problems by trying to reduce the cognitive load connected with decision processes and, consequently, supporting their choices with, e.g., guesses, stereotyping, or common sense. Psychologists, psychiatrists, and neurologists are also interested in how decisions are made by people suffering from mental illnesses or having experienced a brain injury. Another domain connected with decision theory is ethics. The ethical aspect of decision-making processes focuses on a number of notions. One of them is providing correct information on offered possibilities. The ethical perspective is also connected with the research itself. It involves not only the proper handling of data gathered in experiments but also using tools and methods that are not invasive for the subjects. If the methods used in experiments carry any type of risk for the participants, they should be informed about such issues.

Factors Determining Decisions

Another factor shaping decisions is motivation, often classified in science by taking the division into intrinsic and extrinsic features into account. For example, motivational factors may be related to one's personal needs as well as to the expectation to adjust to the social environment. Another factor determining decisions is technology; its role can be investigated in various ways. One of them is technology as the driver of advancements; thus, technologically advanced products may be purchased more than other merchandise since they limit the amount of time spent on some activities. In addition, technology may influence decision-making processes indirectly. For example, commercials shown on TV stimulate purchasing behaviors. Another application of technology is the possibility of making decisions online. The next factor determining decision-related processes is language. Language in this case can be understood in different ways. One approach is to treat language as the tool of communication used by a given group of people. It may be a national language, a dialect or a professional sublanguage. Taking into account the issue of a language understood in a broad way, being proficient in the language a piece of information is written in facilitates decision processes. On the other hand, the lack of linguistic skills in a given language may lead to the inability to make decisions or to making the wrong ones. Apart from the macro level of linguistic issues, the notion of professional sublanguage or specialized communication determines the process of decision-making. One of the most visible examples of this link between professional discourses and decisions can be studied in organizational settings. Taking into account the specialized terminology used in professional discourse, the incomprehension or misunderstanding of specialized discourses determines decision processes. Decisions also vary by the source of knowledge. Decisions can be made by using one's own knowledge or expertise gained through the processes of social interaction, schooling or professional training. In addition, decisions can be divided into individual decisions, made by one person, and cooperative decisions, made by groups of individuals.
Decisions may also be forced or voluntary, depending on the issue of free will in decision-making processes. There are certain concepts that are examined through the prism of decision-making. One of them is rationality, studied by analyzing the concept of reason and its influence on making choices. Thus, decisions can be classified as rational or irrational, depending on whether they are in accordance with reason, that is, whether they reflect the conditions of reality. Decisions are also studied through the prism of compromise. Researchers take into account the complexity of issues to be decided upon and the steps taken towards reaching a compromise between the offered choices. Decision theory also concentrates on the influence of other impulses on decision-making processes. This notion is studied in, e.g., marketing to investigate how auditory (songs or jingles) or olfactory (smells) stimuli determine purchasing behaviors. Apart from these dimensions, marketing specialists are interested in the role of verbal and nonverbal factors in the selection of merchandise. Decision theory also focuses on the understanding of priorities. As has been mentioned before, psychologists are interested in why some individuals procrastinate. Decision theory is focused on different stages of decision-making processes; decisions do not only have to be made, by selecting the best (in one's opinion) alternative, but they should later be implemented and their results should be monitored.

Decision Theory and Methodology

Methodologies used in decision theories can be classified by taking into account different approaches. One of the ways decision theories can be studied is by investigating various disciplines and their influence on understanding how decisions are made. As far as decisions are concerned, the approaches discussed in different theories turn out to be useful. Thus, such domains as, among others, psychology, economics, and management use theories and approaches that facilitate the understanding of decision-making. For example, psychology relies on cognitive models and behavioral studies. Taking into account linguistics, Critical Discourse Analysis facilitates the creation and understanding of texts, since this type of analysis offers a deep understanding of the verbal and nonverbal tools that facilitate decision-making. In addition, intercultural communication, together with its classifications of national or professional cultures, is important in understanding how decisions are made by studying differences across cultures and occupations. Taking into account cross-disciplinary approaches, network and systemic theories offer an understanding of the complexities determining decision-making processes. For example, Actor-Network-Theory stresses the role of living and non-living entities in decisional activities. ANT draws one's attention to the fact that technological advancements, such as the Internet or social media tools, also influence decisions, by, e.g., offering the possibility of making choices online. Social Network Analysis (SNA) focuses on the role of relations in individual and group decisions. Thus, given the multifactorial and multidisciplinary nature of decisions, the theory of decision-making processes should take a holistic approach, underlining the role of different elements and processes in decision-making activities. As Krogerus and Tschäppeler (2011) state, there are different tools and methods that may support the decision-making process. One of them is the Eisenhower Matrix, also known as the Urgent-Important Matrix, facilitating decisions about when tasks should be done, according to their importance. Thus, issues to be done are divided into the ones that are urgent and have to be done or delegated, and the ones that are not urgent, for which one has to decide when they will be done or whether to delete them.
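The urgent/important classification just described can be sketched as a simple rule in Python; the task list and the quadrant labels, which follow the usual reading of the matrix, are illustrative assumptions.

def eisenhower_quadrant(urgent: bool, important: bool) -> str:
    """Classify a task into one of the four Eisenhower Matrix quadrants."""
    if urgent and important:
        return "do it now"
    if not urgent and important:
        return "decide when to do it"
    if urgent and not important:
        return "delegate it"
    return "delete it"

tasks = [("submit report", True, True), ("answer newsletter", True, False),
         ("plan next quarter", False, True), ("sort old files", False, False)]
for name, urgent, important in tasks:
    print(name, "->", eisenhower_quadrant(urgent, important))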

Another technique used in management is SWOT analysis, which is applied to evaluate a project's strengths, weaknesses, opportunities, and threats. In addition, costs and benefits can be identified by using the BCG Box: the Boston Consulting Group developed a method to estimate investments by using the concepts of cash cows, stars, question marks, and dogs. Moreover, Maslow's theory of human needs offers information on priorities in making decisions. Decisions differ when one takes into account the type of data; the ones made on the basis of a relatively small amount of data are different from the ones that have to be made in the face of big data. Consequently, decisions involving big data may involve the participation of machines that facilitate data gathering and comprehension. Moreover, decisions involving big data are often supported by statistical or econometric tools. In making decisions, the characteristics of big data are crucial. As Firican (2017) mentions, there are ten Vs of big data: volume, velocity, variety, variability, veracity, validity, vulnerability, volatility, visualization, and value. Big data can, among others, help understand the decisions made by customers, target their needs and expectations, as well as optimize business processes. In addition, Podnar (2019) draws our attention to the issue of new data privacy regulations. Thus, the company or the individual gathering information should identify sensitive data and treat it in the proper way, e.g., by using special encryption. Moreover, all processes connected with data, that is, gathering, storing, modifying, and erasing, should be done according to the legal regulations, e.g., the EU's General Data Protection Regulation (GDPR).

Decision Theory and Game Theory

Although the first traces of game theory can be noticed in the works of economists in previous centuries, the origin of game theory is associated with the book by John von Neumann and Oskar Morgenstern, Theory of Games and Economic Behavior, printed in 1944. In the 1950s, John Forbes Nash Jr. published papers on non-cooperative games and the Nash equilibrium. Since the publication of the seminal works by von Neumann, Morgenstern, and Nash, the findings of game theory started to be applied in disciplines related to mathematics as well as the ones not connected with it. The application of game theory includes different spheres of study, such as economics, linguistics, translation, biology, and anthropology, to mention a few of them. Professor Nash was the Nobel Prize winner in economics in 1994 for his work on game theory. Apart from the mentioned scientific recognition of this theory among researchers, the interest in game theory is also connected with the general curiosity shared by both researchers and laymen about how people make choices and what drives their selection. The growing role of decision-making in both professional and private life has led to the increasing popularity of game theory in various scientific disciplines as well as in studies representing how individuals behave in everyday situations. In addition, games as such have experienced a renaissance in the twenty-first century, being present in different spheres of life, not exclusively the ones related to entertainment. The development in the sphere of games is also connected with new types of technologically advanced games appearing on the market. Moreover, the growing role of technology and the Internet has resulted in novel forms of competition and cooperation. Apart from the vivid interest among researchers representing different domains in the possibilities of applying game theory to study the nuances of a given discipline, game theory is known to a wider public because of the biographical drama film entitled A Beautiful Mind, directed by Ron Howard, showing the life of Professor John Nash.

Game Theory – Definition and Basic Concepts

As far as the definition of game theory is concerned, Philip D. Straffin states in his book that game theory examines the logical analysis of conflict and cooperation. Thus, the concept of a game is used if there are at least two players (human beings and non-human entities, such as communities, companies, or countries) involved in cooperative and conflicting activities. Although games concern mainly human beings or companies, they can also be studied during the observation of plants and animals. Moreover, every player has some strategies at his or her disposal that can be used to play a game. The combination of strategies selected by players determines the result of a game. The outcome is connected with the payoff for players that can be exemplified in numbers. Game theory investigates how players should play in a rational way leading to the highest possible payoffs. The results of a game are determined by the choices made by a player and other players. The decisions of other players can be studied through the perspective of cooperation and conflict.

Conflict is connected with the different needs of players, who often have contradictory aims. Cooperation reflects the situation when the coordination of interests takes place. Although game theory can be applied to many situations and players, Straffin draws one's attention to the potential limitations of game theory. First, games played in the real world are complicated, with the total number of players and the outcomes of their strategies difficult to estimate. The second challenge is connected with the assumption that a player behaves in a rational way. In reality, not all players perform rationally, and some behaviors cannot be easily explained. The third problem is connected with the fact that game theory cannot predict how the game evolves if the aims of players are not contradictory in a distinguishable way or when more than two players take part in a game. For such games, partial solutions, cases, and examples exist. Thus, some games defy easy categorization, and additional research has to be carried out to understand the complex picture of a given phenomenon. Games can be classified taking into account the number of players, types of payoffs, and potential outcomes. Starting with the last feature, zero-sum games encompass situations with completely antagonistic aims, when one wins and the other loses. On the other hand, in non-zero-sum games the winning of one player does not necessarily entail the losing of the other one. The taxonomy of games related to players includes the division of games according to the number of players (from one to many). Games can also be classified through the prism of signaling. John von Neumann and Oskar Morgenstern highlighted that inverted signaling, aimed at misleading the other player, can be observed in most games. Direct signaling, on the other hand, takes place very rarely in games. The payoffs in games depend on, among others, the type of game and the discipline it is applied in. They can take the form of money or utility (e.g., in economics) as well as fitness from the genetic perspective (in biology).

Game Theory Strategies, Decision Theory and Big Data

Elvis Picardo describes basic game strategies. One of them is the Prisoner's Dilemma, which shows how acting in one's own interests leads to worse outcomes than when cooperation is chosen. In the Prisoner's Dilemma, two suspects of a crime are detained in separate rooms, without the possibility of communicating with each other. Each of them is informed that if he or she cooperates with the prosecution and testifies against the second detainee, he or she will go free. When he or she decides not to cooperate but the other prisoner opts for cooperation, he or she will have to spend 3 years in prison. When both prisoners decide to confess, they will be imprisoned for 2 years. If none of them cooperates, they will spend 1 year in prison. Although staying silent is the best joint outcome for both prisoners, the most often chosen option is confessing against the other participant. Picardo in his contribution also shows more advanced game-theoretic scenarios that build on the Prisoner's Dilemma. One of them is Matching Pennies, in which two players place a penny simultaneously on the table, with payoffs depending on how often heads or tails appear. If both coins are heads or both are tails, the first player wins and can take the second player's coin. When one penny turns up heads and the other tails, the second player is the winner. A social choice similar to that of the Prisoner's Dilemma is represented in Deadlock, with the dominant strategy being the selection of the greatest benefit for both sides. Another type of advanced game is Cournot Competition, used in, e.g., depicting such economic phenomena as duopoly. An example of a sequential game is the Centipede Game, with players making moves one after another.
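The Prisoner's Dilemma payoffs described above can be tabulated to show why confessing dominates. The following is a minimal sketch that encodes the sentences mentioned in the text (years in prison, so smaller is better) and checks each player's best response; it is an illustration written for this entry, not part of Picardo's analysis.

    # Prisoner's Dilemma payoffs from the description above, in years of prison
    # (lower is better). Strategies: "confess" (testify) or "silent".
    SENTENCES = {
        ("confess", "confess"): (2, 2),
        ("confess", "silent"):  (0, 3),
        ("silent",  "confess"): (3, 0),
        ("silent",  "silent"):  (1, 1),
    }

    def best_response(opponent_strategy):
        """Return the strategy that minimizes player 1's sentence against a fixed opponent."""
        return min(["confess", "silent"],
                   key=lambda s: SENTENCES[(s, opponent_strategy)][0])

    for opp in ["confess", "silent"]:
        print(f"If the other prisoner chooses '{opp}', the best response is '{best_response(opp)}'")

    # Both best responses are "confess", so (confess, confess) is the equilibrium,
    # even though (silent, silent) would give both prisoners a shorter sentence.

Running the loop shows that confessing is the best response whatever the other prisoner does, which is exactly the tension between individual rationality and joint benefit that the text describes.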

Cross-References

▶ Economics
▶ Knowledge Management
▶ Online Advertising
▶ Social Network Analysis

Further Reading

Bielenia-Grajewska, M. (2013). International neuromanagement. In D. Tsang, H. H. Kazeroony, & G. Ellis (Eds.), The Routledge companion to international management education. Abingdon: Routledge.
Bielenia-Grajewska, M. (2014). CSR online communication: The metaphorical dimension of CSR discourse in the food industry. In R. Tench, W. Sun, & B. Jones (Eds.), Communicating corporate social responsibility: Perspectives and practice (Critical studies on corporate responsibility, governance and sustainability, Vol. 6). Bingley: Emerald Group Publishing Limited.
Bielenia-Grajewska, M. (2015). Neuroscience and learning. In R. Gunstone (Ed.), Encyclopedia of science education. New York: Springer.
Bielenia-Grajewska, M. (2016). Good health is above wealth: Eurozone as a patient in eurocrisis discourse. In N. Chitty, L. Ji, G. D. Rawnsley, & C. Hayden (Eds.), Routledge handbook on soft power. Abingdon: Routledge.
Firican, G. (2017). The 10 Vs of big data. Available online: https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx. Accessed June 2019.
Krogerus, M., & Tschäppeler, R. (2011). The decision book: Fifty models for strategic thinking. London: Profile Books Ltd.
Picardo, E. (2016). Advanced game theory strategies for decision-making. Investopedia. http://www.investopedia.com/articles/investing/111113/advanced-game-theory-strategies-decisionmaking.asp. Accessed 10 Sept 2016.
Podnar, K. (2019). How to survive the coming data privacy tsunami. Available at https://tdwi.org/Articles/2019/06/17/DWT-ALL-How-to-Survive-Data-Privacy-Tsunami.aspx. Accessed 20 June 2019.
Straffin, P. G. (2004). Teoria gier (Game theory and strategy). Warszawa: Wydawnictwo Naukowe Scholar.
Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior. Woodstock: Princeton University Press.

Deep Learning

Rayan Alshamrani and Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Introduction

Artificial intelligence (AI) is a growing and well-known discipline in computer science with essentially specialized fields. Deep learning is a major part of AI that is associated with the notion of machine learning. As a branch of machine learning, deep learning focuses on approaches to improve AI systems through learning and training from experience and observation. Deep learning relies on a variety of studies such as applied mathematics, statistics, neuroscience, and human brain knowledge. The prime principles of applied mathematics, such as linear algebra and probability theories, are the major disciplines that inspire the fundamentals of modern deep learning. The main idea behind deep learning is to represent real-world entities as related concepts of a nested hierarchy, where each concept is defined by its relation to a simpler concept. Therefore, deep learning aims to empower computers to learn from experience and to understand different domains regarding a specific hierarchy of concepts. This allows computers to build and learn complex concepts and relationships from gathering the related simpler concepts.

Deep learning began to gain its popularity in the middle 2000s. At that time, the initial intention was to make more generalized deep learning models with small datasets. Today, deep learning models have achieved great accomplishments by leveraging large datasets. Moreover, a continuous accomplishment in deep learning is the increase of model size and performance due to the advances in general-purpose CPUs, software infrastructure, and network connectivity. Another successful achievement in today's deep learning is its enhanced ability to make predictions and recognitions with a high level of accuracy, unambiguity, and reliability. Deep learning's ability to perform tasks with high-level complexity is escalating. Thus, many modern applications have successfully applied deep learning from different aspects. Furthermore, deep learning provides useful tools to process massive amounts of data and makes practical contributions to many other scientific domains.

The History of Deep Learning

Through history, computer science scholars have defined deep learning by different terms that reflect their different points of view. Although many people have assumed that deep learning is a new discipline, it has existed since the 1940s, but it was not quite popular back then.

The evolution of deep learning started between the 1940s and the 1960s, when researchers knew it as cybernetics. Between the 1980s and 1990s, deep learning scientists acknowledged it as connectionism. The current name of this discipline, deep learning, took its shape around the first decade of the 2000s and has remained up until today. Hence, the previous terminologies illustrate the evolution of deep learning through three different waves (Goodfellow et al. 2017).

Deep Learning as a Machine Learning Paradigm

The main idea behind machine learning is to consider generalization when representing the input data and to train the machine learning models with these sets of generalized input data. By doing so, a trained model is able to deal with new sets of input data in future uses. Hence, the efficient generalization of the data representation has a huge impact on the performance of the machine learners. However, what will happen if these models generate unwanted, undesired, or incorrect results? The answer to this question is to feed these models with more input data. This process forms a key limitation in machine learning. Besides, machine learning algorithms are limited in their ability to perform on and extract raw forms from natural data. Because of this, machine learning systems require considerable domain expertise and a high level of engineering in order to design models that extract raw data and transform it into useful data representations.

As mentioned earlier, deep learning relies on a hierarchical architecture where lower-level features define higher-level features. Because of their nature, deep learning algorithms aid agents in overcoming the machine learning algorithms' limitations. Deep learning algorithms support machine learners by extracting data representations with a high level of complexity. This data extraction mechanism feeds machine learners with raw data and enables these learners to automatically discover the suitable data representations.

Deep learning is beneficial for machine learning because it enables machine learners to handle a large amount of input data, especially unsupervised datasets. Consequently, deep learning algorithms yield better and promising results in different machine learning applications such as computer vision, natural language processing, and speech recognition. Deep learning is an important achievement in AI and machine learning. It enhances the agents' abilities to handle complex data representations and to perform AI tasks independently from human knowledge. In summary, deep learning introduces several benefits, which are: (1) enabling simple models to work with knowledge acquired from huge data with complex representations, (2) automating the extraction of data representations, which makes agents work with different data types, and (3) obtaining semantic and relational knowledge from the raw data at a higher level of representation.

Deep Learning Applications and Challenges

Since the 1990s, many commercial applications have been using deep learning more as a state-of-the-art concept than applied technology, because the comprehensive application of deep learning algorithms needs expertise from several disciplines, and only a few people were able to do that. However, the number of skills required to cope with today's deep learning algorithms has decreased due to the availability of a huge amount of training datasets. Now, deep learning algorithms and models can solve more complicated tasks and reach high-level human performance.

Deep learning can perform accurate and valid tasks, such as prediction and recognition, with a high level of complexity. For instance, nowadays deep learning models can recognize objects in photographs without the need to crop or resize the photograph. Likewise, these models can recognize a diversity of objects and classify them into corresponding categories. Besides object recognition, deep learning also has some sort of influence on speech recognition. As deep learning models drop the error rate, they can recognize voices more accurately.

Traffic sign categorization, pedestrian detection, drug discovery, and image segmentation are examples of deep learning's recent successful case studies. Accordingly, many companies such as Apple, Amazon, Microsoft, Google, IBM, Netflix, Adobe, and Facebook have increased their attention towards deep learning, as it is positively profitable in business applications.

In contrast, with these successful achievements come drawbacks and limitations. There are major challenges associated with deep learning that remain unsettled and unresolved, especially when it comes to big data analytics. First, there are specific characteristics that cause the drawbacks and limitations of adopting deep learning algorithms in big data analytics, such as models' scalability, learning with streaming data, distributed computing, and handling high-dimensional data (Najafabadi et al. 2015). Second, the nature of deep learning algorithms, which briefly map objects through a chain of related concepts, inhibits them from performing commonsense reasoning exercises regardless of the amount of data being used. Third, deep learning has some limitations when performing classification on unclear images due to the imperfection of the model training phase. This imperfection makes the resulting deep learning model more vulnerable to unrecognizable data. Nevertheless, several research contributions are validating and embracing techniques to improve deep learning algorithms against these major limitations and challenges.

Concepts Related to Deep Learning

There are several important concepts related to deep learning, such as reinforcement learning, artificial neural networks, multilayer perceptrons, deep neural networks, deep belief networks, and backpropagation.

A key success in deep learning is its extension to the reinforcement learning field. Reinforcement learning helps an agent to learn and observe through trial and error without any human intervention. The existence of deep learning has empowered reinforcement learning in robotics. Moreover, reinforcement learning systems that apply deep learning are performing tasks at human level. For example, these systems can learn to play Atari video games just like professional gamers.

In order to fully understand deep learning basics, the concept of artificial neural networks (ANN) must be illustrated. The main idea of the ANN is to demonstrate the learning process of the human brain. The structure of the ANN consists of interconnected nodes called neurons and a set of edges that connect these neurons all together. The main functionality of an ANN is to receive a set of inputs, perform several procedures (complex calculations) on the input sets, and use the resulting output to solve specific real-world problems. An ANN is highly structured with multiple layers, namely the input layer, the output layer, and the hidden layers in between.

The multilayer perceptron (MLP), also called a feedforward neural network or deep feedforward network, is a workhorse of deep learning models. An MLP is a mathematical function (call it f) that maps input to output. This function f is formulated by composing several simple functions. Each of these simple functions provides a new way to represent the input data. Deep learning models that adopt the MLP are known as feedforward because the information flows from the input to the output through the model's function without any feedback.
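The idea of an MLP as a composition of simple functions can be sketched with a small NumPy example. This is a generic illustration written for this entry; the layer sizes and the ReLU activation are arbitrary choices, not a reference implementation of any particular model.

    import numpy as np

    # A tiny feedforward network: f(x) = f3(f2(f1(x))).
    # Each "simple function" is an affine map followed by a ReLU activation.
    rng = np.random.default_rng(0)

    def layer(in_dim, out_dim):
        W = rng.normal(size=(out_dim, in_dim))
        b = np.zeros(out_dim)
        return lambda x: np.maximum(0.0, W @ x + b)   # one simple function

    f1 = layer(4, 8)   # input layer -> first hidden layer
    f2 = layer(8, 8)   # first hidden layer -> second hidden layer
    f3 = layer(8, 1)   # second hidden layer -> output layer

    def f(x):
        # Information flows only forward, from input to output: no feedback loops.
        return f3(f2(f1(x)))

    x = rng.normal(size=4)   # an example input vector
    print(f(x))              # the network's output for that input

Stacking more such layers is what makes the composed function "deep," as the following discussion of deep neural networks explains.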

Broadly speaking, a deep neural network (DNN) learns from multiple and hierarchical layers of sensory data representation, which enables it to perform tasks at a level close to human ability. This makes DNNs more powerful than shallow neural networks. DNN layers are divided into early layers, which are dedicated to identifying simple concepts of the input data, and later layers, which are dedicated to complex and abstract concepts. The DNN differs from the shallow neural network in the number of hidden layers: a DNN has more than two hidden layers. With that said, a network is deep if there are many hidden layers.

A deep belief network (DBN) is a type of DNN. Specifically, a DBN is a generative graphical model with multiple layers of stochastic latent variables consisting of both directed and undirected edges. The multiple layers of a DBN are hidden units. DBN layers are connected with each other, but units within each layer are not. Namely, a DBN is a stack of Restricted Boltzmann Machines, and it uses a greedy layer-wise algorithm for model training.

Backpropagation is a quintessential supervised learning algorithm for various neural networks. The backpropagation algorithm is a mathematical tool which computes the gradients of the weights, used in gradient descent for improving prediction accuracy. The aim of the backpropagation algorithm is to train neural networks by comparing the initial output with the desired output and then adjusting the system until the comparison difference is minimized.
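The compare-and-adjust loop described above can be illustrated with a one-weight example. The sketch below uses plain gradient descent on a squared-error loss; the training example, learning rate, and number of steps are arbitrary illustrative choices and stand in for the full backpropagation machinery of a multi-layer network.

    # Minimal illustration of the backpropagation idea on a single weight:
    # compare the network's output with the desired output, compute the gradient
    # of the squared error with respect to the weight, and adjust the weight
    # until the difference is minimized.
    x, desired = 2.0, 6.0      # one training example: input and desired output
    w = 0.5                    # initial weight; the "network" is just y = w * x
    learning_rate = 0.1

    for step in range(50):
        y = w * x                       # forward pass
        error = y - desired             # comparison with the desired output
        gradient = 2 * error * x        # d(error**2)/dw, propagated back to the weight
        w -= learning_rate * gradient   # gradient-descent weight update

    print(w)   # close to 3.0, since 3.0 * 2.0 = 6.0

In a real deep network the same gradient computation is repeated layer by layer, from the output back towards the input, which is where the name backpropagation comes from.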

Future of Deep Learning

Deep learning will have a very bright future with many successes as the world is moving toward the big data era with the rapid growth in the amount and type of data. The expectation is on the rise because this valuable discipline requires very little engineering by manual work, and it leans heavily on the automation of data extraction, computation, and representation. Ongoing research has outlined the need to combine different concepts of deep learning together to enhance applications such as recognition, gaming, detection, health monitoring, and natural language processing. Thus, the evolution of deep learning should focus on working more with unsupervised learning, so agents can mimic human brain behavior further and start thinking on behalf of human beings. Researchers will continue introducing new deep learning algorithms and forms of learning in order to develop general purpose models with high levels of abstraction and reasoning.

The big data bang poses several challenges to deep learning, such as those listed above. Indeed, most big data objects consist of more than one modality, which requires advanced deep learning models that can extract, analyze, and represent different modalities of input datasets. It is true that big data offers a sufficient amount of data to train deep learning models and improve their performance. Yet, the process of training deep learning models using huge datasets depends significantly on high-performance computing infrastructure, which is sometimes challenging due to the fact that the growth rate of big data is faster compared to the gain in computational performance (Zhang et al. 2018). However, many recent studies have outlined deep learning models that are suitable for big data purposes. It is noticeable that deep learning algorithms have made great progress in the big data era, and the challenges that face deep learning in the big data era are under current and prospective research consideration.

Emerging semantic technologies with deep learning is a cutting-edge research topic. This emergence has created the notion of semantic deep learning. For instance, the key successes of both semantic data mining and deep learning have inspired researchers to potentially assist deep learning by using formal knowledge representations (Wang 2015). Similarly, deep learning approaches would be beneficial for evaluating the semantic similarity of two sentences, with 16–70% improvement compared to baseline models (Sanborn and Skryzalin 2015). In addition to semantic technologies, Decision Support Systems (DSS) are another aspect that will gain more advantages from the adoption of deep learning. The current studies that relate DSS to deep learning focus more on applying deep learning concepts to clinical DSS in healthcare. It is possible to see more DSSs in different domains that use deep learning methods in the near future. Deep learning is an important part of machine learning. As most researchers are looking for ways to simulate the biological brain, deep learning will be powerfully presented in machine learning studies and applications.

Cross-References

▶ Artificial Intelligence
▶ Deep Learning

Further Reading

Bengio, Y., Goodfellow, I., & Courville, A. (2017). Deep learning (Vol. 1). MIT Press.
Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
Sanborn, A., & Skryzalin, J. (2015). Deep learning for semantic similarity. CS224d: Deep Learning for Natural Language Processing. Stanford, CA: Stanford University.
Wang, H. (2015). Semantic deep learning. University of Oregon, 1–42.
Zhang, Q., Yang, L. T., Chen, Z., & Li, P. (2018). A survey on deep learning for big data. Information Fusion, 42, 146–157.

Deep Web

▶ Surface Web vs Deep Web vs Dark Web

Defect Detection

▶ Anomaly Detection

De-identification

▶ Anonymization Techniques

De-identification/Re-identification

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Introduction

Big data often carries the risk of exposing important personal information about individuals in the database. To reduce the risk of such privacy violations, databases often mask or anonymize individual identifiers, a process known as "de-identification." Unfortunately, it is often possible to infer the identities of individuals from the information remaining in the database and thus to "re-identify" them. The development of robust and reliable methods of de-identification is an important public policy question in the handling of big data.

Privacy and Personally Identifiable Information

Databases often contain information about individuals that can be embarrassing or even harmful if widely known. For example, university records might reveal that a student had failed a class several times before passing it; hospital records might show that a person had been treated for an embarrassing disease; and juvenile criminal records might show arrests for long-forgotten misdeeds, any of which can unjustly harm an individual in the present. However, this information is still useful for researchers in education, medicine, and criminology provided that no individual person is harmed. De-identifying this data protects privacy-sensitive information while allowing other useful information to remain in the database, to be studied, and to be published.

The United States (US) Department of Health and Human Services (HHS 2012), for example, references 19 types of information (Personally Identifiable Information, or PII) that should be removed prior to publication. This information includes elements such as names, telephone and fax numbers, and biometric identifiers such as fingerprints. This list is explicitly not exhaustive, as there may be another "unique identifying number, characteristic, or code" that must also be removed, and determining the risk of someone being able to identify individuals may be a matter of expert judgment (HHS 2012).

De-identifying Data

By removing PII from the database, the presumption is that the remaining de-identified information no longer contains sensitive information and therefore can be safely distributed. In addition to simply removing fields, it is also possible to adjust the data delivery method so that the data delivered to analysts does not allow them to identify individuals. One such method, "statistical disclosure limitation," masks the data by generating synthetic (meaning "fake") data with similar properties to the real data (Rubin 1993). Another method ("differential privacy") is to add or subtract small random values to the actual data, enough to break the link between any individual data point and the person it represents (Garfinkel 2015).
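One common way to implement the "small random values" idea is to add Laplace-distributed noise to numeric values or query results. The sketch below is a simplified illustration of that mechanism only; the ages, sensitivity, and epsilon value are hypothetical, and a production differential-privacy system would require careful calibration and accounting.

    import numpy as np

    # Simplified illustration of the "differential privacy" idea described above:
    # add small random (Laplace) noise so that no single person's value can be
    # pinned down exactly. Values and parameters are hypothetical.
    rng = np.random.default_rng(42)

    true_ages = np.array([34, 51, 29, 62, 45])   # sensitive attribute for 5 people
    sensitivity = 1.0                            # how much one person can change a result
    epsilon = 0.5                                # privacy budget: smaller = more noise

    noisy_ages = true_ages + rng.laplace(loc=0.0, scale=sensitivity / epsilon,
                                         size=true_ages.shape)
    print(noisy_ages)   # released values: close to the originals, but each is perturbed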

However, de-identification brings its own problems. First, as HHS recognizes, "de-identification leads to information loss which may limit the usefulness of the resulting health information in certain circumstances" (HHS 2012). Other researchers agree. For example, Fredrikson (2014) showed that using differential privacy worsened clinical outcomes in a study of genomics and warfarin dosage. More seriously, it may be possible that the data can be re-identified, defeating the purpose of de-identification.

Re-identification

Even when PII is removed, it may be possible to infer it from information that remains. For example, if it is known that all patients tested in a given month for a specific condition were positive, then if a person knows that a particular patient was tested during that month, the person knows that patient tested positive.

A common way to do this is by using one set of data and linking it to another set of data. In one study (Sweeney 2000), a researcher spent $20 on a set of voter registration records and obtained the ZIP code, birth date, and gender of the Governor of Massachusetts. She was then able to use this to identify the Governor's medical records via a public Federal database. She estimated that more than 85% of the US population can be identified uniquely from these three datapoints. Even using counties, instead of the more informative ZIP codes, she was able to identify 18.1% of the US population uniquely. It is clear, then, that re-identification is an issue even when all the obvious unique links to individuals have been purged (Sweeney 2000; Garfinkel 2015).
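The linkage attack in Sweeney's study can be illustrated with two toy tables joined on the quasi-identifiers (ZIP code, birth date, gender). The records below are fabricated purely for illustration and do not reproduce Sweeney's actual data.

    # Toy illustration of re-identification by linking two datasets on
    # quasi-identifiers (ZIP code, birth date, gender). All records are fabricated.
    deidentified_medical = [
        {"zip": "02138", "dob": "1950-01-02", "sex": "M", "diagnosis": "hypertension"},
        {"zip": "02139", "dob": "1982-03-14", "sex": "F", "diagnosis": "asthma"},
    ]

    voter_registration = [
        {"name": "A. Example", "zip": "02138", "dob": "1950-01-02", "sex": "M"},
        {"name": "B. Sample",  "zip": "02139", "dob": "1982-03-14", "sex": "F"},
    ]

    # Join the two tables on the shared quasi-identifiers.
    for voter in voter_registration:
        for record in deidentified_medical:
            if ((voter["zip"], voter["dob"], voter["sex"]) ==
                    (record["zip"], record["dob"], record["sex"])):
                print(f'{voter["name"]} -> {record["diagnosis"]}')

Even though the medical table contains no names, every record that matches a unique combination of ZIP code, birth date, and gender in the public voter file is re-identified by the join.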

For this reason, de-identification and re-identification remain active research areas and should be treated with concern and caution by any analyst dealing with individual human data elements or other sensitive data.

Cross-References

▶ Profiling

Further Reading

Fredrikson, M., et al. (2014). Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. 23rd Usenix Security Symposium, August 20–22, 2014, San Diego, CA.
Garfinkel, S. L. (2015). De-identification of personal information. NISTIR 8053. National Institute of Standards and Technology. https://doi.org/10.6028/NIST.IR.8053.
Health and Human Services (HHS), US. (2012). Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) privacy rule. https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf.
Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics, 9(2), 461–468.
Sweeney, L. (2000). Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3, Pittsburgh. http://dataprivacylab.org/projects/identifiability/paper1.pdf.

Demographic Data

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

The generation of demographic data is a key element of many big data sets. It is the knowledge about people which can be gleaned from big data which has the potential to make these data sets even more useful, not only for researchers but also for policy makers and commercial enterprises.

Demography, formed from two Greek words, broadly means "description of the people" and in general refers to the study of populations, processes, and characteristics including population growth, fertility, mortality, migration, and population aging, while the characteristics examined are as varied as age, sex, birthplace, family structure, health, education, and occupation. Demographic data refers to the information that is gained about these characteristics, which can be used to examine changes and behaviors of populations and in turn be used to generate population predictions and models. Demographic data can be used to explore population dynamics, analytical approaches to population change, the demographic transition, demographic models, and spatial patterns, as well as planning, policy making, and commercial applications, especially projecting and estimating population composition and behavior. Applied demography seeks to emphasize the potential for the practical application of demographic data to examine present and future demographic characteristics, across both time and space. Where demographic data is available over time, this allows for historical changes to populations to be examined in order to make predictions and develop models about how populations may behave in the future. Traditionally the principal sources for the study of population are censuses and population surveys, which are often infrequent and not always comprehensive.

Understanding demographic data and population patterns has useful applications in many areas, including planning, policy making, and commercial enterprises. In planning, estimates and projections are important in terms of ensuring accurate allocation of resources according to population size, determining the level of investment needed for particular places. Planning requires reliable demographic data in order to make decisions about future requirements, and so will impact on many major planning decisions, and therefore many large financial decisions are also made on the basis of demographic data in combination with other information. Population statistics reveal much about the nature of society, the changes that take place within it, and the issues that are relevant for government and policy. Therefore, demographic projections and models can also stimulate actions in policy making, in terms of how to meet the needs of present and future populations. A key example of this relates to aging populations in some developed countries, where projections of how this is likely to continue have informed policies around funding pensions and healthcare provision for the elderly. For businesses, demographic data is a vital source of information about their consumer base (or potential consumer base); understanding how consumers behave can inform their activities or how they target particular cohorts or segments of the population. The wide relevance of demographic data for policy, planning, research, and commerce means that this particular aspect of big data has experienced much attention.

Many of the traditional data sets which generate demographic data, such as the census, are not conducted frequently (every 10 years in the UK) and are slow to release the results, so any analysis conducted on a particular population will often be significantly out of date given the constantly changing dynamics of human populations. Furthermore, other population surveys draw on a relatively small sample of the population and so may not be truly representative of the range of situations experienced in a population.

Demographic data refers to data which relates to a particular population and which is used to identify particular characteristics or features. There is a wide range of demographic variables which could be included in this category, although those commonly used include age, gender, ethnicity, health, income, and employment status. Demographic data has been a key focus for big data research, primarily because much of the data generated relates to individuals and therefore has the potential to provide insights into the characteristics and behaviors of populations beyond what is possible from traditional demographic data sources. Demographic data is also vital for many commercial enterprises as they seek to explore the demographic profile of their customer base or the customer base whom they wish to target for their products.

This vast new trove of big data generated by new technological advancements (mobile phones, computers, satellites, and other electronic devices) has the potential to transform spatial-temporal analyses of demographic behavior, particularly related to economic activity. As a result of the technological advancements which have led to the generation of many big data sets, the quantity of demographic data available for population research is increasing exponentially. Where there are consistent large-scale data sets that now extend over many years, sometimes crossing national boundaries with fine geographic detail, this collectively creates a unique laboratory for studying demographic processes and for examining social and economic scenarios. These models are then used in order to explore population changes including fertility, mortality, and depopulation.

The growth in the use of big data in demographic research is reflected in its growing presence in discussions at academic conferences. At the Population Association of America annual meeting in 2014, a session entitled "Big Data for Demographic Research" demonstrated some of the ways big data has been used.

Mobile phone data in Estonia was used to examine ethnic segregation. This data set included information about the ethnicity of individuals (Russian/Estonian), the history of locations visited by the individuals, and their phone-based interactions. This study found evidence to suggest that the ethnic composition of an individual's geographic neighborhood influenced the structure of an individual's geographic network. It also found that patterns of segregation were evident where migrants were more likely to interact with other individuals of their ethnicity.

A further study also used mobile phone data to explore human mobility. This project highlighted the potential that large-scale data sets like this have for studying human behavior on a scale not previously possible. It argued that some measures of mobility using mobile data are contaminated by infrastructure and by demographic and social characteristics of a population. The authors also highlight problems with using mobile phone data to explore mobility and outline potential new methods to measure mobility in ways which respond to these concerns. The measures developed were designed to address the spatial and social nature of human mobility, to remain independent of social, economic, political, or demographic characteristics of context, and to be comparable across geographic regions and time.

The generation of big data via social media has also led to the development of new research methods. Pablo Mateos and Jorge Durand explore the potential value of netnographic methods in social media for migration studies. To do this the researchers explore data obtained from Internet discussion forums on migration and citizenship. This research uses a combination of classification methods to analyze discussion themes of migration and citizenship. The study revealed results which identified key migrating practices which were absent from the migration and citizenship literature, suggesting that analyses of big data may provide new avenues for research, with the potential for this to revolutionize traditional population research methods.

Capturing demographic patterns from big data is a key activity for researchers, political teams, and marketing teams. Being able to examine the behavior of populations, and in particular specific segments of a population, is a key activity. While many of the techniques and technologies which harness big data may have been developed by commercial enterprises or for commercial gain, there are an increasing number of examples of where this data is being used for public benefit, as seen in Chicago.

In Chicago, health officials employed a data analytics firm to conduct data mining to explore ethnic minority women who were not getting breast screenings even though they were offered them for free at a hospital in a particular area. The analytics firm helped the Chicago health department to refine its city outreach for the breast cancer screening program by using big data to identify uninsured women aged 40 and older living in the south side of the city. This project indicates the potential impact that big data could have on public services.

There are of course challenges associated with using demographic data collected in big data sets. The use of Twitter data, for example, to examine population patterns or trends is problematic in that the population of Twitter users is not necessarily representative of the wider population or the population being studied. The massive popularity of social media, and the ability to extract data about communication behaviors, has made them a valuable data source. However, a study which compared the Twitter population to the US population along three axes (geography, gender, and race/ethnicity) found that the Twitter population is a highly non-uniform sample of the population. Ideally, when comparing the Twitter population to society as a whole, we would compare properties including socioeconomic status, education level, and type of employment. However, it is only possible to obtain characteristics which are self-reported and made visible by the user in the Twitter profile (usually the name, location, and text included in the tweet). Research has indicated that Twitter users are more likely to live within densely populated areas and that sparsely populated regions are underrepresented. Furthermore, research suggests that there is a male bias in Twitter users, again making the results gained from this big data source unrepresentative of the wider population.

Despite the challenges and limitations associated with big data, the study of populations and their characteristics and behaviors is a growing area for big data researchers. The applications for demographic data analysis are being adopted and explored by data scientists in both the public and private sector in an effort to explore the past, present, and future patterns of populations across the world.

Cross-References

▶ Education
▶ Epidemiology
▶ Geography

Further Reading

Blumenstock, J., & Toomet, O. (2014). Segregation and 'Silent Separation': Using large-scale network data to model the determinants of ethnic segregation. Paper presented at the Population Association of America 2014 annual meeting, Boston, 1–3 May.
Girosi, F., & King, G. (2008). Demographic forecasting. Princeton/Oxford: Princeton University Press.
Mateos, P., & Durand, J. (2014). Netnography and demography: Mining internet discussion forums on migration and citizenship. Paper presented at the Population Association of America 2014 annual meeting, Boston, 1–3 May.
Mislove, A., Lehmann, S., Ahn, Y.-Y., Onnela, J.-P., & Rosenquist, N. Understanding the demographics of Twitter users. In Proceedings of the fifth international AAAI conference on weblogs and social media. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/viewFile/2816/3234.
Rowland, D. (2003). Demographic methods and concepts. Oxford: Oxford University Press.
Ruggles, S. (2014). Big microdata for population research. Demography, 51(1), 287–297.
Sobek, M., Cleveland, L., Flood, S., Hall, P., King, M., Ruggles, S., & Shroeder, M. (2011). Big data: Large historical infrastructure from the Minnesota population center. Historical Methods, 44(2), 61–68.
Williams, N., Thomas, T., Dunbar, M., Eagle, N., & Dobra, A. (2014). Measurement of human mobility using cell phone data: Developing big data for demographic science. Paper presented at the Population Association of America 2014 annual meeting, Boston, 1–3 May.

Digital Advertising Alliance

Siona Listokin
Schar School of Policy and Government, George Mason University, Fairfax, VA, USA

The Digital Advertising Alliance (DAA) is a non-profit organization in the United States (US) made up of marketing and advertising industry associations that seeks to provide self-regulatory consumer privacy principles for internet-based advertising. The DAA is one of the most prominent self-regulation associations in consumer data privacy and security but has been criticized for promoting weak data privacy programs and enforcement.

The DAA was established in 2009 by several US advertising associations, following the release of a Federal Trade Commission (FTC) report on "Self-Regulatory Principles for Online Behavioral Advertising." It is led by the Association of National Advertisers, the American Advertising Federation, the 4A's, the Network Advertising Initiative, Better Business Bureau National Programs, and the Interactive Advertising Bureau. The DAA represents thousands of advertising and marketing companies and includes hundreds of participating companies and organizations across a range of industries. Originally, participating companies consisted of advertisers and third party analytics companies, but starting in 2011, the DAA expanded its efforts to include social networks and non-advertising firms.

The Alliance's major self-regulatory guidelines stem from its "Principles for Internet Based Advertising" released in mid-2009 and form the basis for the DAA AdChoices icon and the consumer opt-out program for customized ads. The alliance has issued applications of its principles to digital advertising areas including political ads, cross-device data use, mobile, multisite data, and online behavioral advertising. The self-regulatory principles, which participating companies can highlight with the DAA's blue icon, are administered by the Advertising Self-Regulatory Council (ASRC) of the Council of Better Business Bureaus and the Association of National Advertisers. The principles are explicitly meant to correspond with the FTC's report and focus on consumer education, transparency, control, data security, consent, and sensitive data like health, financial, and child-directed data. The DAA icon, launched in October 2010, is meant to serve as a signaling device that informs users of tracking activities.

The DAA's consumer opt-out page, known as "Your AdChoices," is an element of the icon program that allows users to click in and choose to opt out of specific interest-based advertising. The page formed in November 2010 as the Alliance participated in, and subsequently withdrew from, the World Wide Web Consortium's working group on "Do Not Track" standardization. Consumers can visit the opt-out page and select to opt out of participating third parties' browser-enabled personalized advertising. The DAA created a separate App Choices program for consumer control of mobile app data collection. While the opt-out option applies to behavioral advertising, data collection and third party tracking are not blocked.

Enforcement and Criticism

Enforcement is handled by the Association of National Advertisers (ANA) and the Better Business Bureau National Programs (BBBNP); the DAA refers to their independent enforcement component, though it is worth noting that these organizations are participating and founding associations. In cases of potential noncompliance, the BBBNP's Digital Advertising Accountability Program (DAAP) process begins. DAAP sends an inquiry letter to the company and may begin a formal review with a subsequent decision. Since 2011 (through the first half of 2020), the BBBNP has ruled on a total of 80 cases through DAAP. In the first half of 2020, the ANA received about 4000 consumer inquiries about online advertising, repeating 2019's jump in consumer complaints in this area from a previous average of about 500 a year. These inquiries range from concerns about ads blocking web content, indecent advertisements, and incorrectly targeted ads. On rare occasions, the alliance refers a case to the FTC, and a number of FTC Commissioners have supported DAA oversight in speeches and reports.

The DAA has been criticized by advocacy groups and policymakers for failing to provide meaningful privacy protection and transparency to consumers. Although the DAA had agreed to the Do Not Track effort in principle after the FTC recommended it, it disagreed with the extent of tracking restrictions proposed by the working group and declared that it would not penalize advertisers that ignore the standards. The DAA's own AdChoices opt-out relies on cookies that must be manually updated to prevent new third party tracking and that can negatively impact user experience.

In 2013, Senator John D. Rockefeller IV criticized the DAA's opt-out program for having too many exceptions that allow for consumer tracking for market research. A 2018 study noted major usability flaws in the mobile app opt-out program.

Further Reading

Federal Trade Commission. (2009). FTC staff report: Self-regulatory principles for online behavioral advertising, 2009. Washington, DC: Federal Trade Commission.
Garlach, S., & Suthers, D. (2018). I'm supposed to see that? AdChoices usability in the mobile environment. Proceedings of the 51st Hawaii International Conference on System Sciences.
Mayer, J. R., & Mitchell, J. C. (2012). Third-party web tracking: Policy and technology. Security and Privacy (SP), 2012 IEEE Symposium on. IEEE.
Villafranco, J., & Riley, K. (2013). So you want to self-regulate? The National Advertising Division as standard bearer. Antitrust, 27(2), 79–84.

Digital Agriculture

▶ Agriculture

Digital Divide

Lázaro M. Bacallao-Pino
University of Zaragoza, Zaragoza, Spain
National Autonomous University of Mexico, Mexico City, Mexico

Synonyms

Digital inequality

The notion of the "digital divide" came into broad use during the mid-1990s, beginning with reports about access to and usage of the Internet published by the United States (US) Department of Commerce, National Telecommunications and Information Administration in 1995, 1998, and 1999. The term was defined as the divide between those with access to information and communication technologies (ICTs) and those without it and was considered one of the leading economic and civil rights issues in contemporary societies. Since then, the digital divide has rapidly and widely become a topic of research for both policymakers and scholars, calling attention to the problem of unequal access to ICTs. That unequal access also raises questions of big data in relation to concerns about tracking and tracing usage and privacy and of the implications of accelerated data flows to already data-rich as opposed to data-poor contexts for widening digital divides.

From the beginning of debates on the digital divide, there have been different positions regarding the possibility of overcoming gaps in access between different countries, groups, and individuals. On the one hand, related digital inequalities have been framed as a temporary problem that will gradually fade over time due to two factors: steadily decreasing costs of use of the Internet and its continuously increasing ease of use. Based on these assumptions, some views have it that, instead of a source of divide, the Internet provided a technological opportunity for information freedom and, above all, a tool for illiterate people to learn and read, abridging what was considered the "real" divide – the gap between those who can read well and those who cannot – giving the latter opportunities to take advantage of easily accessible information resources. On the other hand, the digital divide has been considered a long-term pattern, generating a persistent division between "info-haves" and "info-have-nots." In that sense, perspectives on the digital divide have distinguished among cyber-pessimists underlining deep structures and trends in social stratification that result in the emergence of unskilled groups without technological access; cyber-skeptics proposing a one-way interrelationship between society and technology in which the latter adapts to the former, not vice versa; and cyber-optimists proposing a positive scenario considering that, at least in developed countries, the digital divide will be bridged as a result of the combined action of technological innovations, markets, and state (Norris 2001).

As mentioned, from its original description, the digital divide was commonly defined as the perceived gap between those who have access to ICTs and those who do not, summarized in terms of a division between information "haves" and "have-nots."
rights issues in contemporary societies. Since “have-nots.” From this perspective, it initially
384 Digital Divide

was measured in terms of existing numbers of perspective on the digital divide, there has been a
subscriptions and digital devices, but, as these shift from an approach centered on physical
numbers constantly increased and there was a access towards a focus on skills and usage, i.e., a
transition from narrow-band Internet towards second-level digital divide (Hargittai 2002).
broadband DSL and cable modems in the early The multidimensional nature of digital divide
2000s, more recently from an access perspective arguably refers to at least three levels: global,
the digital divide has been measured in terms of social, and democratic (Norris 2001). While the
the existing bandwidth per individual. It also is in global divide is focused on the different levels of
this regard that increasingly massive data flows access between industrialized and developing
and collection from various devices have implica- countries, the social one refers to gaps among
tions for tracking, monitoring, and measuring dif- individuals who are considered as information-
ferent dimensions or aspects of the digital divide. rich and poor in each country. At the democratic
Since new kinds of connectivity are never intro- level, the digital divide is associated with the
duced simultaneously and uniformly to society as quality of the use of ICTs by individuals,
a whole, level and quality of connectivity are distinguishing between those ones who use its
associated with another degree of digital divide resources for their engagement, mobilization,
in terms of access. and participation in public life, and others who
Although access is important, at the same do not.
time, it has been noted that a binary notion of a From access-focused trends, public policies to
“yes” or “no” proposition regarding physical help bridge the digital divide have mainly focused
access to computers or the Internet does not on the development of infrastructures for provid-
offer an adequate understanding of the complex- ing Internet access to some groups. However,
ity and multidimensionality of the digital divide. some researchers have judged the results of
In this sense, technological gaps are related to those actions as insufficient. On the one hand,
other socioeconomic, cultural, ethnic, and racial and regarding physical access, many groups with
differences, such that there is a necessity to low digital opportunities have been making sub-
rethink the digital divide and related stantial gains in connectivity and computer own-
rhetoric. This divide is relevant particularly in ership. However, on the other hand, significant
the context of what has been defined as the Infor- divides in Internet penetration persist between
mation Age, such that not having access to those individuals, in close relationship to different
technologies and information is considered an levels of income and education, as well as other
economic and social handicap. Consequently, dimensions such as race and ethnicity, age, gen-
different approaches have aimed to broaden the der, disabilities, type of family, and/or geographic
notion of the digital divide to provide a more location (urban-rural). At the same time, while
complex understanding of access, usage, mean- there may be trends in the digital divide closing
ing, participation, and production of digital in terms of physical access – mainly in the most
media technology. developed countries – the digital divide persists or
Besides transition towards a more complex and even widens in the case of digital skills and the use
multidimensional point of view on the particular- of applications associated with ICTs.
ities of the inequalities created by the digital Differences in trends between physical access
divide when comparing it to other scarce material and digital skills show the complex interrelation-
and immaterial resources, one can also find efforts ships among the different levels at which the
for understanding different types of access asso- digital divide exists. Technical, social, and types
ciated with the digital divide: from motivational of uses and abilities for effective and efficient
and physical ones to others related to skills and use of ICTs are articulated in a multidimensional
usage. As part of the tendency towards this new phenomenon in which the ways that people use

Technical and social factors, and the types of uses and abilities for effective and efficient use of ICTs, are articulated in a multidimensional phenomenon in which the ways that people use the Internet have rising importance for understanding the digital divide. Leaders in the corporate sector, governments and policymakers, nongovernmental organizations, and other civil society actors and social movements have been concerned about the digital divide, given the increasing centrality of the Internet to socialization, work, education, culture, and entertainment, as a source of training and educational advancement, information, job opportunities, community networks, etc.

In summary, from this point of view, especially in relation to civil society and commitments to social change, moving beyond a usage- and skills-centered approach to the digital divide towards a perspective on the appropriation of digital technologies by socially and digitally marginalized groups involves the articulation of both the uses of ICTs and the meanings associated with them. Closing the digital divide is considered, from this perspective, part of a more general process of social inclusion, particularly in contemporary societies where the access to and creation of knowledge through ICTs are seen as a core aspect of social inclusion, given the rising importance of dimensions such as identity, culture, language, participation, and sense of community. More than overcoming some vision of the digital divide marked by physical access to computers and connectivity, considering it from an approach focused on technology for social inclusion and change, the digital divide is reoriented towards a more complex understanding of the effective articulation of ICTs into communities, institutions, and societies. It is in this regard that big data has been particularly engaged for measuring and analyzing the digital divide relative to processes of social development, taking into account all the dimensions – economic, political, cultural, educational, institutional, and symbolic – of meaningful access to ICTs.

Cross-References

▶ Cyberinfrastructure (U.S.)
▶ Digital Ecosystem
▶ Digital Literacy
▶ Information Society

Further Reading

Compaine, B. M. (2001). The digital divide: Facing a crisis or creating a myth? Cambridge, MA: MIT Press.
Hargittai, E. (2002). Second-level digital divide: Differences in people's online skills. First Monday, 7(4). https://doi.org/10.5210/fm.v7i4.942.
Norris, P. (2001). Digital divide: Civic engagement, information poverty, and the internet worldwide. Cambridge: Cambridge University Press.
Van Dijk, J. A. G. M. (2006). Digital divide research, achievements and shortcomings. Poetics, 34(4–5), 221–235.
Warschauer, M. (2004). Technology and social inclusion: Rethinking the digital divide. Cambridge, MA: MIT Press.

Digital Ecosystem

Wendy Chen
George Mason University, Arlington, VA, USA

The Definition of Digital Ecosystem

"Digital ecosystem" is a concept building upon "ecosystem," a term coined by the British botanists Arthur Roy Clapham and Arthur George Tansley during the 1930s, who argued that in nature, living organisms and the environment surrounding them interact with each other, which constitutes an "ecosystem" (Tansley 1935). Since then, the ecosystem concept has been applied to various domains and studies, including education and entrepreneurship (Sussan and Acs 2017). Over recent decades, with the rapid development of technology and the internet, the "digital ecosystem" idea was born. It can be thought of as an extension of a biological ecosystem in the digital context, which relies on technical knowledge, and as "robust, self-organizing, and scalable architectures that can automatically solve complex, dynamic problems" (Briscoe and Wilde 2009).
The Applications of Digital Ecosystem

The digital ecosystem has been applied to an array of perspectives and areas in which it is studied, including business, education, and computer science.

Business and Entrepreneurship
In business, digital ecosystem describes the relationship between a business and the end consumers in the digital world (Weill and Woerner 2015). New technology creates disruption for traditional business models. This process is referred to as digital disruption, such as books being read on eReaders rather than as paperbacks or Uber's disruption of the taxi industry. A central theme in business for understanding digital ecosystems is for businesses to fully understand their end consumers, leverage strong customer relationships, and increase cross-selling opportunities (Weill and Woerner 2015).
In entrepreneurship, the term Digital Entrepreneurship Ecosystem (DEE) refers to "an ecosystem where digital entrepreneurship emerges and develops" (Li et al. 2017). DEE reflects a group of entities that integrates resources to help facilitate and transform digital entrepreneurship (Li et al. 2017). In the entrepreneurship literature, one of the fundamental differences between the digital entrepreneurial ecosystem and a traditional business ecosystem is that the ventures in the digital entrepreneurial ecosystem focus on the interaction between digital technologies and users via digital infrastructure (Sussan and Acs 2017).

Education
In the education domain, teachers seek to expand the use of technology within the classroom to create a classroom digital ecosystem composed of "general school support, infrastructure, professional development, teacher attitude, and teacher personal use" (Besnoy et al. 2012). In such an ecosystem, the teachers use technology to aid them in preparing coursework, grading assignments, and communicating with students, while students can interact digitally and also explore many different technology applications (Palak and Walls 2009).

Computer Science
In computer science, digital ecosystem refers to a two-level system of interconnected machines (Briscoe and Wilde 2009). At the first level, optimization and computing services take place over a decentralized network, which feeds into a second level that operates locally and seeks to work within local constraints. Local searches and computations can be performed more efficiently as a result of this process because requests are first handled by other peers with similar constraints. This scalable architecture is referred to as a digital ecosystem that builds upon the concept of "service-oriented architecture with distributed evolutionary computing." Different from the ecosystems of other domains, in this model of ecosystem the actors within the ecosystem are applications or groups of services (Briscoe and Wilde 2009).

Artificial Life and Intelligence
The study of artificial life also has the concept of a digital ecosystem, which came to fruition in late 1996 with the creation of an artificial life entertainment software product called Creatures. In this example, the digital ecosystem was comprised of users who all interact with one another, essentially creating an online persona separate from their own (Cliff and Grand 1999).
Additionally, smart homes, homes in which basic elements such as air conditioning or security systems can be controlled via wireless technology, are considered to be digital ecosystems as well due to their interconnected nature and reliance upon one another to make decisions using artificial intelligence (Harper 2003). In this instance, the actors that make up the digital ecosystem are the independent components that interact with one another, supported by a collection of knowledge that can then be disseminated amongst the ecosystem (Reinisch et al. 2010).

The Future Research on Digital Ecosystem

From a general digital ecosystem perspective, future research could focus on additional potential applications for studying digital ecosystems, such as the healthcare industry, manufacturing, retail, or even politics, especially as they pertain to big data.
As in all contexts defined, digital ecosystems are comprised of highly interconnected people and/or machines and, as such, of their specific relations to one another. Big data focuses on finding overarching trends by bridging together disparate data sources and creating profiles of sorts in different contexts. That said, digital ecosystem research could play a pivotal role in unlocking new avenues to discover new trends, which could aid future data science research and companies as well.

Additionally, as digital ecosystems can bridge the gap of space and distance, new research could be conducted to understand digital ecosystems in an international context. Most of the studies covered here did not really consider the impact that interaction between countries via a digital ecosystem could have on how that ecosystem performs in each country's environment. Therefore, examining digital ecosystems' impact in an international context could help shed light on these digital ecosystems.

Conclusion

Dependent on technology, a digital ecosystem connects people and machines. The concept has been applied to various domains including business, education, artificial intelligence, etc. Digital ecosystems provide platforms for big data to be produced and exchanged.

Further Reading

Besnoy, K. D., Dantzler, J. A., & Siders, J. A. (2012). Creating a digital ecosystem for the gifted education classroom. Journal of Advanced Academics, 23(4), 305–325.
Briscoe, G., & De Wilde, P. (2009). Digital ecosystems: Evolving service-oriented architectures. arXiv.org.
Cliff, D., & Grand, S. (1999). The creatures global digital ecosystem. Artificial Life, 5(1), 77–93.
Harper, R. (2003). Inside the smart home. Bristol: Springer.
Li, W., Du, W., & Yin, J. (2017). Digital entrepreneurship ecosystem as a new form of organizing: The case of Zhongguancun. Frontiers of Business Research in China, 11(1), 69–100.
Palak, D., & Walls, R. T. (2009). Teachers' beliefs and technology practices. Journal of Research on Technology in Education, 41(4), 417–441.
Reinisch, C., Kofler, M. J., & Kastner, W. (2010). ThinkHome: A smart home as digital ecosystem. In Conference proceedings in 4th IEEE international conference on digital ecosystems and technologies, Dubai, United Arab Emirates (pp. 12–15). https://books.google.com/books/about/4th_IEEE_International_Conference_on_Dig.html?id=2AgunQAACAAJ.
Sussan, F., & Acs, Z. (2017). The digital entrepreneurial ecosystem. Small Business Economics, 49, 55–73.
Tansley, A. G. (1935). The use and abuse of vegetational concepts and terms. Ecology, 16, 284–307.
Weill, P., & Woerner, S. L. (2015). Thriving in an increasingly digital ecosystem. MIT Sloan Management Review, 56(4), 27–34.

Digital Inequality

▶ Digital Divide

Digital Knowledge Network Divide (DKND)

Connie L. McNeely and Laurie A. Schintler
George Mason University, Fairfax, VA, USA

In general, the "digital divide," as a term, has referred to differential access to digital means and content and, as a phenomenon, increasingly affects the ways in which information is engaged at a most basic level. However, the digital divide is becoming, more fundamentally, a "knowledge divide." Knowledge implies meaning, appropriation, and participation, such that access to knowledge is a means to achieve social and economic goals (UNESCO 2005). In this sense, the knowledge divide indicates a growing situation of relative deprivation in which, as in other societal domains, some individuals and groups reflect lesser capacities relative to others to access knowledge for social benefit and contribution. Furthermore, today's knowledge society encompasses a system of highly complex and interconnected networks.
These are digital networks marked by growing diversification in information and communication technology (ICT) capacities by which data generation and diffusion translate into differences in overall access and participation in the knowledge society. Related conceptions of this networked knowledge society rest on visions of a world in which ICTs contribute to organizational and social structures by which access and participation are differentially available to various members of society (cf. Castells 2000). To more fully capture the effective dimensions of these differentiating and asymmetric relations, the more explicit notion of the Digital Knowledge Network Divide (DKND) has been posited to better describe and understand related structures, dynamics, and relationships (Schintler et al. 2011).

Increasingly characterized by big data derived from social actors and their interactions within and across levels of analysis, the DKND culminates in a situation that reflects real-world asymmetries among privileges and limitations associated with stratified societal relations. Referring to the explosion in the amounts of data available for research, governance, and decision making, big data is one of the most prominent technology and information trends of the day and, more to the point, is a key engine for social, political, and economic power and relations, creating new challenges and vulnerabilities in the expanding knowledge society. In fact, the collection, analysis, and visualization of massive amounts of data on politics, institutions and culture, the economy, and society more generally have important consequences for issues such as social justice and the well-being of individuals and groups, and also for the stability and prosperity of countries and regions, as found in global North and South (or developed and developing) country divides. Accordingly, a comprehensive understanding of the DKND involves consideration of various contexts for and approaches to bridging digital, knowledge, and North-South divides.

Such divides present critical expressions of relative inequalities, inequities, and disparities in information (which may include misinformation and disinformation) and knowledge creation, access, opportunities, usage, and benefits between and among individuals, groups, and geographic areas. Thus, the DKND must be understood in keeping with its many guises and dimensions, which means considering various perspectives on related capacities to explore broader implications and impacts across global, national, and regional contexts and levels of analysis. Framed relative to capacities to not only access and engage data, but also to transform it into knowledge and actionable insights, the DKND is determined by such issues as inequalities in digital literacy and access to education. These issues affect, for example, scientific mobility, smart technologies and automation, labor market structures and relations, and diversity within and across different types and levels of socio-technological engagement and impact. In particular, digital literacy and digital access are fundamental to conceptualizing and understanding the basic dimensions of the DKND.

Digital Literacy

Undergirded by the growth of big data, knowledge intensification and expansion are raising concerns about building digital literacy, especially as a key DKND determinant. Indeed, the "online/not online" and technology "have/have not" focus of many digital divide discussions obscures a larger digital equity problem: disparities in levels of digital readiness, that is, of digital skills and capacities (Horrigan 2019). The knowledge divide also is a skills divide and, while demands are being issued for a more highly educated and digitally literate population and workforce, opportunities for participation and mobility are at the same time highly circumscribed for some groups. In developing knowledge, big data analytics, or the techniques and technologies needed for harnessing value from the use of data, increasingly define digital literacy in this regard. On the one hand, digital literacy is about enabling and putting technology to use. However, on the other hand, questions of digital literacy move the issue beyond base access to hardware and technology.
Digital literacy is intimately about the capabilities and skills needed to generate knowledge, to use relevant hardware and technology to help extract information and meaning out of data while coping with its volume, velocity, and variety, and also its variability, veracity, vulnerability, and value (the "7 Vs" of big data).

More than access to the basic digital infrastructure needed to benefit from big data, digital literacy is "the ability to use information and communication technologies to find, evaluate, create, and communicate information, requiring both cognitive and technical skills" (ALA 2020). Access to and use of relevant technologies arguably can provide an array of opportunities for digital literacy and knowledge-enhancing options. Digital literacy is about making meaning and can be considered a key part of the path to knowledge (Reedy and Parker 2018), with emphasis on digital capacities for finding, creating, managing, processing, and disseminating knowledge. In the ideal, a basic requirement for digital literacy is high-quality education for all, and the growth of digital networks, which are at the core of the knowledge society, opens opportunities to facilitate education and learning. However, even if there is technical access to digital information and networks, those features may not be meaningful or commensurate in people's everyday lives with education and learning opportunities (Mansell and Tremblay 2013), which are the means for broader digital access.

Digital Access

Access is the primary allocating factor determining network entry and participation and can be cast in terms of at least four interrelated issues relative to big data generation and use: 1) physical and technical means; 2) learning and cognitive means; 3) utilization preferences and styles as means to different types of data and knowledge; and 4) the extent and nature of those means. These issues speak to how knowledge networks are engaged, referencing differences in access to data and knowledge, with implications for societal participation and contributions and for benefit as opposed to disadvantage. Differences in data access, use, and impact are found within and across levels of analysis. Moreover, crossover and networked big data challenge notions of consent and privacy, with affected individuals and groups having little recourse in disputing it. Those with access in this regard have capacities to intervene to mitigate the gaps and disparities linked to big data and, in turn, the information and power imbalances dictated by those who have control and are the major users of big data and the intelligence produced from it and related technologies and analytics. This point again leads to questions concerning who has access to the data and, more generally, who benefits.

Related inequalities and biases are reflected in gatekeeping processes in the broader societal context, such that there exists not only a digital divide but, more specifically, knowledge brokers and a DKND in which asymmetry is the defining feature (Schintler et al. 2011). As mentioned, the digital divide generally has been characterized as a critical expression of relative inequalities and differences in ICT access and usage between and among individuals, groups, and geographic areas. However, it more broadly references gaps within and among digital "haves" and "have-nots" and the digitally literate and illiterate in different socio-spatial contexts. In this sense, the DKND is constituted by various modalities of inequality and points to the notion that it encompasses multiple interactive relational structures enacted in diverse institutional frames in which access to education, skills, and ICT capabilities and opportunities are variably distributed throughout social, political, and economic systems.

Some depictions of knowledge societies posit that they ideally should be concerned not only with technological innovation and impacts, but that they are defined in concert with human development, resting particularly on universal access to information and knowledge, quality education for all, and respect for diversity – that is, on a vision of promoting knowledge societies that are inclusive and equitable (Mansell and Tremblay 2013). In a digitally enabled big data world, many human activities depend on how and the degree to which information and knowledge are accessed, generated, and processed.
Moreover, the capacity, motivation, education, and quality of knowledge acquired online have consequences for life opportunities in the social realm (Ragnedda 2019). Different capacities for digital access and use can translate to different roles of big data for individuals, communities, and countries, strongly influencing inequalities and related divides. Also, open, re-used, and re-combined data can bring both opportunities and challenges for society relative to social, economic, and international dimensions, with implications for equity and social cohesion. Big data pervasiveness is linked to the rise and persistence of digital divides and inequalities that reflect impacts of and on structural relations determining the access and engagement of related knowledge and network resources.

In keeping with varying capacities and possibilities to transform digitally valuable resources and knowledge into social and tangible benefits (Ragnedda 2019), different access and different abilities and skills for exploiting ICT-related benefits are strongly connected with societal inequalities understood relative to physical, financial, cognitive, production, design, content, institutional, social, and political access – all of which can operate to create or reinforce divides in digital experiences and related outcomes (Ragnedda 2019; DiMaggio et al. 2004). Although this situation need not be framed as static reproduction, understanding societal dynamics means that, despite a relatively open internet, everyone is not in the same position to access and use opportunities offered in the digital arena (Ragnedda 2019). Even with better skills and qualifications – the acquisition of which is affected by social dynamics and structures – previous societal positions and relationships influence capacities to access related opportunities in the social realm.

Conceptual Scope

Not only technological but also social, economic, and political developments mark the parameters of the DKND. These aspects are highly interrelated, operating interdependently relative to various divides in regard to impact on society and the world. Over the last several years, digital content has been growing at an astronomical rate, and it is in this sense that big data diversification and complexity feed into the creation and diffusion of knowledge as a networked process. User-generated data networks, and the types of information or knowledge to which they may or may not have access, constitute complex digital divides. Networks are indeed an integral feature of the knowledge divide, constituting complex systems in which some individuals or groups are more central or have more influence and control over the creation and flow of knowledge and information. The flow and manipulation of data and information to create knowledge are dynamic processes, influencing access to and participation in digital knowledge networks. Participation in these digital networks can have variable effects. For example, the use of algorithms that determine information access based on past activities can result in filter bubbles that cause limited access and intellectual isolation (Pariser 2012). Another example is akin to the "Matthew effect," in which "the rich get richer and the poor get poorer," referencing cumulative advantage and gaps between haves and have-nots (Merton 1988). This situation reflects inequalities embedded in knowledge networks (epistemic networks) where the status of some members relative to others is elevated, contributing to inequalities in terms of digital capabilities to access knowledge and to disseminate and receive recognition for it (Schintler and McNeely 2012).

As discussed, networks can offer opportunities for empowerment of marginalized and excluded groups, but possibilities for those opportunities must be understood relative to discrimination, privacy, and ethical issues. With increasing digitization of everything from retail and services to cities and healthcare, and the growth of the Internet of Things (IoT), the emergence of network haves and have-nots is more specifically tied to current digital divides.
Networks, as collaborative structures, are central to stimulating the production of knowledge beneficially relevant for those who can access and apply it; they can offer opportunities or can block knowledge sharing. The knowledge divide represents inequalities and gaps in knowledge among individuals and groups. The digital divide extends this idea, distinguishing among those with and those without access to the internet. Accordingly, the DKND is conceived via structural relations and dynamics that look beyond technological applications to consider institutional, regulatory, economic, political, and social conditions that frame the generation of digital, knowledge, and network relationships. That is, the DKND reflects networked disparities and gaps in knowledge and associated value that also operate to differentially restrict or enhance access and participation of certain segments of the population.

However, understanding the DKND also requires a state-of-the-art perspective on other aspects of digital networks, pointing to how they appear today and can be expected to do so increasingly in the future. Reference here is to humans interacting with humans mediated by machines (social machines) and, importantly, machines interacting with machines. Indeed, the IoT is all about machine-to-machine (M2M) interactions, exchanging data and information and producing knowledge by advanced computational modeling and deep learning. Note that there is a profound global divide in terms of who has access to the machine hardware and software needed to plug into the IoT. Moreover, there are information/knowledge asymmetries between machines and humans, and even machines and machines (Schintler 2017).

Conclusion

Inequality marks big data, reflected in hierarchically differentiated structures defined and privileged according to those who create the data, those who have the means to collect it, and those who have the expertise to analyze and use it (Manovich 2011; Schintler et al. 2011). The extent to which massive amounts of data translate into information, and that into expanded knowledge, speaks to ways in which big data is being used and framed as sources of discovery and knowledge creation. Big data, as a broad domain of practice and application, can be understood relative to advancing knowledge and productivity (Schintler and McNeely 2012). It is in this regard that networks reflect defining processes that address both formal and informal relationships among different actors ranging from individuals to countries. This situation does not occur in isolation and, as such, necessitates a comprehensive view on the promises and challenges attached to big data and the diffusion of knowledge and establishment of networks determining digital relations.

The DKND reflects a complex and adaptive system bound by socio-technological structures and dynamics that largely depend on access, cognitively, normatively, and physically determined across different levels and units of analysis. Critical dimensions of this perspective include relational and territorial digital knowledge network formation, characteristics, and effects; digital knowledge network opportunity structures and differentiation; and overall vertical and horizontal trends and patterns in the DKND. As such, it has broad implications for ways of thinking about data, their sources, uses, and purposes. The DKND brings particular attention to the extent to which big data might exacerbate already rampant disparities, pointing to how big data are used by different actors, for what purposes, and with what effects. In fact, this is the big data divide and related developments have been at the center of critical public and intellectual debates and controversies.

By definition, the DKND operates in accordance with the social contexts in which big data analytics are engaged and applied, and these relations can be considered in terms of institutional frameworks, governance conditions, and system dynamics. Big data analytics are enabled by technical advances in data storage capacities, computational speeds, and the near real-time availability of massive datasets.
The ability to integrate and analyze datasets from disparate sources and to generate new kinds of knowledge can be beneficial, but also can constitute legal, ethical, and social dilemmas leading to hierarchical asymmetries. Considered relative to questions of social and structural dynamics and material capacities, different perspectives on digital asymmetries and related effects can be framed in terms of the evolving socio-technological landscape and of disparities grounded in broader societal and historical dynamics, relationships, and structures that constrain equal and equitable outcomes.

More to the point, big data has led to profound changes in the way that knowledge is generated and utilized, underlying the increasingly deep penetration and systems nature of related developments in human activities. Accordingly, the idealized vision of the knowledge society is one in which the full potential of digital networks is achieved in an equitable and balanced knowledge environment – one in which knowledge is integrated in ways that maximize benefits and minimize harms, taking into account goals of social, economic, and environmental wellbeing (Mansell and Tremblay 2013). However, broad socioeconomic characteristics are basic factors affecting capacities for realizing digital literacy, access, and engagement, differentially positioning and enabling individuals, groups, and countries to capture knowledge benefits.

Big data capacities and possibilities for digital access, broadly defined, are affected by the DKND, which determines and is determined by the character, types, and consequences of differentiated digital access and opportunities. The digital divide in general is a multifaceted phenomenon, interwoven with existing processes of social differentiation and, in fact, may accentuate existing inequalities (Ragnedda 2019). While, at a fundamental level, the digital divide is based on technological and physical access to the internet and related hardware, knowledge and networks further affect participation and consequences also tied to already existing social inequalities and gaps. In the face of digital, knowledge, and network asymmetries, the DKND stands in contrast to broader visions of societal equity and well-being, reflecting sensitivity to the complexities and realities of life in the big data knowledge society.

Further Reading

American Library Association (ALA). (2020). Digital literacy. https://literacy.ala.org/digital-literacy.
Castells, M. (2000). Rise of the network society. Malden: Blackwell.
DiMaggio, P., Hargittai, E., Celeste, C., & Shafer, S. (2004). Digital inequality, from unequal access to differentiated use. In K. Neckerman (Ed.), Social inequality (pp. 355–400). New York: Russell Sage Foundation.
Horrigan, J. B. (2019, August 14). Analysis: Digital divide isn't just a rural problem. Daily Yonder. https://dailyyonder.com/analysis-digital-divide-isnt-just-a-rural-problem/2019/08/14.
Manovich, L. (2011). Trending: The promises and the challenges of big social data. http://manovich.net/index.php/projects/trending-the-promises-and-the-challenges-of-big-social-data.
Mansell, R., & Tremblay, G. (2013). Renewing the knowledge societies vision: Towards knowledge societies for peace and sustainable development. Paris: UNESCO. http://eprints.lse.ac.uk/id/eprint/48981.
Merton, R. K. (1988). The Matthew effect in science, II: Cumulative advantage and the symbolism of intellectual property. Isis, 79, 606–623.
Pariser, E. (2012). The filter bubble: How the new personalized web is changing what we read and how we think. New York: Penguin.
Ragnedda, M. (2019). Reconceptualizing the digital divide. In B. Mutsvairo & M. Ragnedda (Eds.), Mapping the digital divide in Africa: A mediated analysis (pp. 27–43). Amsterdam: Amsterdam University Press.
Reedy, K., & Parker, J. (Eds.). (2018). Digital literacy unpacked. Cambridge, UK: Facet. https://doi.org/10.29085/9781783301997.
Sagasti, A. (2013). The knowledge explosion and the knowledge divide. http://hdr.undp.org/sites/default/files/sagasti-1-1.pdf.
Schintler, L. A. (2017). The constantly shifting face of the digital divide. Big Data for Regional Science, 28, 336.
Schintler, L., & McNeely, C. L. (2012). Gendered science in the 21st century: The productivity puzzle 2.0? International Journal of Gender, Science and Technology, 4(1), 123–128.
Schintler, L., McNeely, C. L., & Kulkarni, R. (2011). Hierarchical knowledge relations and dynamics in the "Tower of Babel." In Rebuilding the mosaic: Fostering research in the social, behavioral, and economic sciences at the National Science Foundation in the next decade (SBE 2020), NSF 11–086. Arlington: National Science Foundation. http://www.nsf.gov/sbe/sbe_2020.
United Nations Educational, Scientific, and Cultural Organization (UNESCO). (2005). UNESCO world report: Towards knowledge societies. Paris: UNESCO.
Digital Literacy

Dimitra Dimitrakopoulou
School of Journalism and Mass Communication, Aristotle University of Thessaloniki, Thessaloniki, Greece

Digital literacy means having the knowledge and the skills to use a wide range of technological tools in order to read and interpret various media messages across different digital platforms. Digitally literate people possess critical thinking skills and are able to use technology in a strategic way to search, locate, filter, and evaluate information; to connect and collaborate with others in online communities and social networks; and to produce and share original content on social media platforms. In the era of big data, digital literacy becomes extremely important, as internet users need to be able to identify when and where personal data is being passively collected on their actions and interactions and used to form patterns of their online behavior, as well as to contemplate the ethical dilemmas of data-driven decisions for both individuals and society as a whole.

The interactive platforms that the web has introduced to the fields of communication, content producing and sharing, as well as networking, offer great opportunities for the learning and educational procedure for both educators and students. The expanding literature on "the Facebook generation" indicates a global trend in the incorporation of social networking tools for connectivity and collaboration purposes among educators, students, and between these two groups. The use of social software tools holds particular promise for the creation of learning settings that can interest and motivate learners and support their engagement, while at the same time addressing the social elements of effective learning. At the same time, it is widely suggested that today's students require a whole new set of literacy skills in the twenty-first century.

The current generation of learners, namely, young people born after 1982, have been and are being raised in an environment that presupposes that new technologies are a usual part of their daily lives. For them the Internet is part of the pattern of their day and integrated into their sense of place and time.

The social web presents new possibilities as well as challenges. On the one hand, the main risks of using the Internet can be classified into four levels: (a) commercial interests, (b) aggression, (c) sexuality, and (d) values/ideology. On the other hand, the web opens a whole new world of opportunities for education and learning, participation and civic engagement, creativity, as well as identity and social connection. Wikis, Weblogs, and other social web tools and platforms raise possibilities for project-based learning and facilitate collaborative learning and participation among students and educators. Moreover, project-based learning offers many advantages and enhances skills and competencies.

The changes in the access and management of information as well as in possibilities for interactivity, interaction, and networking signal a new learning paradigm that is created due to the need to select and manage information from a vast variety of available sources, while at the same time learning in the digital era is collaborative in nature and the learner is no longer a passive recipient of information but an active author, co-creator, evaluator, and critical commentator.

The abovementioned changes signify the foundations for Learning 2.0, resulting from the combination of the use of social computing to directly enhance learning processes and outcomes with its networking potential. The changes that we are experiencing through the development and innovation that the interactive web introduces are framed by the participatory culture that we live in.

Participatory culture requires new literacies that involve social skills which are developed through collaboration and networking. In this environment with new opportunities and new challenges, it is inevitable that new skills are also required, namely, play, performance, simulation, appropriation, multitasking, distributed cognition, collective intelligence, judgment, transmedia navigation, networking, and negotiation.
Nevertheless, the participatory culture is prospectively participatory for all, e.g., providing and enabling for all open and equal access as well as a democratized and regulated environment. Three core problems are identified as the main concerns in the digital era, which are addressed by Jenkins et al.:

(a) Participation gap: Fundamental inequalities in young people's access to new media technologies and the opportunities for participation they represent
(b) Transparency problem: Children are not necessarily reflecting actively on their media experiences and cannot always articulate what they learn from their participation.
(c) Ethics challenge: Children cannot develop on their own the ethical norms needed to cope with a complex and diverse social environment online (Jenkins et al. 2006: 12, see more on pp. 12–18).

The necessity to deal with these challenges calls for a twenty-first century media literacy, which can be described as the set of abilities and skills where aural, visual, and digital literacy overlap. These include, as the New Media Consortium indicates, the ability to understand the power of images and sounds, to recognize and use that power, to manipulate and transform digital media, to distribute them pervasively, and to easily adapt them to new forms.

Pupils that still attend school are growing up in a technology-dominated world. Youth born after 1990 are currently the largest generation in the last 50 years and live in a technology-saturated world with tools such as mobile phones and instant access to information. Moreover, they have become avid adopters of Web 2.0 and beyond technologies such as podcasting, social networking, instant messaging, mobile video/gaming, and IPTV. Being the first generation to grow up surrounded by digital media, their expectations of connectivity are high, with technology everywhere in their daily life. The characteristics of the new generation of students include, among others, multi-tasking, an information age mindset, eagerness for connectivity, "fast-track" accomplishments, a preference towards doing rather than knowing, an approach to "reality" as no longer real, blurred lines between the consumer and the creator, and expectations for ubiquitous access to the Internet.

These characteristics should definitely be taken into account when designing or evaluating a digital literacy program. Children use the Internet mainly as an educational resource; for entertainment, games, and fun; for information seeking; and for social networking and sharing experiences with others. Communication with friends and peers, especially, is a key activity. They use different tools such as chats, instant messaging, or e-mail to stay in contact with each other or to search for new friends. They also participate in discussion forums or use the Internet to search for information, to download music or videos, and to play online games. Communication and staying in touch with friends and colleagues is ranked highly for them.

Learning 2.0 is an emergent phenomenon, fostered by the bottom-up take-up of Web 2.0 in educational contexts. Although social computing originated outside educational institutions, it has huge potential in formal Education and Training (E&T) for enhancing learning processes and outcomes and supporting the modernization of European E&T institutions. Learning 2.0 approaches promote technological, pedagogical, and organizational innovation in formal Education and Training schemes. As Redecker et al. indicate, the interactive web builds up the prospects for (a) enhancing innovation and creativity, (b) improving the quality and efficiency of provision and outcomes, (c) making lifelong learning and learner mobility a reality, and (d) promoting equity and active citizenship.

On the other hand, there are major challenges that should be dealt with. While there are currently vast numbers of experimental Learning 2.0 projects under way all over the world, on the whole, Learning 2.0 has not entered formal education yet. The following technical, pedagogical, and organizational bottlenecks have been identified by Redecker et al. which may hinder the full deployment of Learning 2.0 in E&T institutions in Europe: (a) access to ICT and basic digital skills, (b) advanced digital competence, (c) special needs, (d) pedagogical skills, (e) uncertainty, (f) safety and privacy concerns, and (g) requirements on institutional change.
Cross-References

▶ Curriculum, Higher Education, Humanities
▶ Digital Knowledge Network Divide (DKND)
▶ Education and Training
▶ Information Society

Further Reading

Hasebrink, U., Livingstone, S., & Haddon, L. (2008). Comparing children's online opportunities and risks across Europe: Cross-national comparisons for EU Kids Online. Deliverable D3.2. EU Kids Online, London. Retrieved from http://eprints.lse.ac.uk/21656/1/D3.2_Report-Cross_national_comparisons.pdf.
Jenkins, H., et al. (2006). Confronting the challenges of participatory culture: Media education for the 21st century. Chicago: The MacArthur Foundation. Retrieved from http://digitallearning.macfound.org/atf/cf/%7B7E45C7E0-A3E0-4B89-AC9C-E807E1B0AE4E%7D/JENKINS_WHITE_PAPER.PDF.
Literacy Summit. (2005). NMC: The New Media Consortium. Retrieved from http://www.nmc.org/pdf/Global_Imperative.pdf.
Redecker, C., et al. (2009). Learning 2.0: The impact of web 2.0 innovations on education and training in Europe. Final Report. European Commission: Joint Research Centre & Institute for Prospective Technological Studies. Luxembourg: Office for Official Publications of the European Communities. Retrieved from http://ftp.jrc.es/EURdoc/JRC55629.pdf.

Digital Storytelling, Big Data Storytelling

Magdalena Bielenia-Grajewska
Division of Maritime Economy, Department of Maritime Transport and Seaborne Trade, University of Gdansk, Gdansk, Poland
Intercultural Communication and Neurolinguistics Laboratory, Department of Translation Studies, University of Gdansk, Gdansk, Poland

Storytelling dates back to ancient times, when stories were told among the members of communities using oral, pictorial, and, later, also writing systems. Remnants of the first examples of storytelling from past centuries can still be observed on the walls of buildings coming from antiquity or on parchments originating from the dim and distant times. Nowadays, people also tell stories in private and professional life. Shell and Moussa (2007) stress that there are certain aspects that make stories interesting and effective. What differentiates a story from an example is dynamism. When one listens to a story, he/she starts to follow the plot and to think about what happens next. Allan et al. (2001) state that stories stimulate imagination. The narrative approach points to the items that are often neglected, such as the ones that are simpler or less precise. As Denning claims, "a knowledge-sharing story describes the setting in enough detail that the solution is linked to the problem by the best available explanation" (2004: 91).

An important caesura for storytelling was the invention of print, which facilitated the distribution of narrations among a relatively large group of people. The next crucial stage was the rapid development in the technological sphere, represented by, e.g., the introduction and proliferation of online technologies. Digital Storytelling can be defined as the application of technological advancements in telling stories, visible in the usage of computer-related technologies in documenting events or narrating one's personal experience. Pioneers in Digital Storytelling include Joe Lambert, who co-founded the Center for Digital Storytelling (CDS) in Berkeley, and Daniel Meadows, a British photographer, author, and specialist in education. Also, alternative names for this phenomenon include digital documentaries, computer-based narratives, digital essays, electronic memoirs, or interactive storytelling. The plethora of names shows how many different functions digital storytelling may have; it may be used to document a story, narrate an event, or act as a diary in the computer-related reality.

Moreover, technology has influenced storytelling in another way as well: by creating and distributing big data. Thus, digital storytelling focuses not only on presenting a story but also on displaying big data in an efficient way. However, it should be mentioned that since this method of data creation and dissemination can be used by individuals regardless of their knowledge, technical capabilities, and attentiveness to proper online expression, digital storytelling offered in the open access mode can be of different quality. The most common failures that can be observed as far as the production of digital storytelling is concerned are the lack of presentation skills and technical abilities to create an effective piece of storytelling. Another factor that may influence the perception and cognition of storytelling is the absence of adequate linguistic skills. Digital Storytelling created by a person who makes language mistakes and has incomprehensible pronunciation is not likely to have many followers, and its educational application is also limited. In addition, the effective presentation of big data in digital storytelling is a difficult task for individuals not familiar with big data analytics and management. There are certain elements that are connected with a success or failure of this mode of expression. At the website called "Educational Uses of Digital Storytelling," seven important elements of digital storytelling are mentioned. The first one is Point of View, and it is connected with presenting the main point of the story and the approach adopted by the author. The second issue – A Dramatic Question – concerns the main question of the story to be answered in the last part of storytelling. The third notion, Emotional Content, reflects the emotional and personal way of presenting a story that makes target viewers involved in the plot. The fourth element, named The Gift of Your Voice, encompasses the strategies aimed at personalizing the storytelling that facilitate the understanding of the story. The fifth notion is connected with the audio dimension of digital storytelling; The Power of the Soundtrack is related to the usage of songs, rhythms, and jingles to make the story more interesting and informative. The next one, called Economy, is related to the amount of material presented for the user; digital storytelling should not bore viewers because of its length and the immense amount of content. The last element, Pacing, is connected with adjusting the rhythm to the presented content.

It also should be mentioned that a good example of digital storytelling should offer a combination of audio, pictorial, and verbal representation that is coherent and adapted to the target audience. As far as the requirements and possibilities of the viewers are concerned, the content itself and the method of presenting the content should be suited to their age, linguistic skills, education, and occupation. For example, a piece of digital storytelling recorded to master English should be produced by taking into account the age of learners, their level of linguistic competence, as well as their professional background (especially in the case of teaching English for Specific Purposes). Another important feature of digital storytelling, in relation to the immensity of information that should be covered in a film lasting from 2 to 10 min, is linked with using and presenting big data in an efficient way. Apart from storytelling, there are also other names used to denote telling stories in different settings. Looking at the organizational environment, Henderson and Boje (2016) discuss the phenomenon of fractal patterns in quantum storytelling and Big Story by Mike Bonifer and his colleagues.

Tools in Digital Storytelling

Tools used in digital storytelling can be classified into verbal and nonverbal ones. Verbal tools encompass all types of linguistic representation used in telling stories. They include the selective use of words, phrases, and sentences to make the piece of information more interesting for the target audience. The linguistic dimension can be further investigated by applying the micro, meso, or macro perspectives. The micro approach is connected with analyzing the role of a single word in the processes of creation and cognition of information. For example, adjectives are very powerful in creating an image of a thing or a person. Such adjectives as prestigious, unique, or reliable stress the high quality and effectiveness of a given offer. When repeated throughout the piece of digital storytelling, they strengthen the identity of a company offering such products or services. Numerals constitute another effective tool of digital storytelling. The same number presented in different numerical representations may have a different effect on the target audience.
For example, the danger connected with the rising death toll of a contagious disease is perceived in a different way when presented as a percentage (e.g., 0.06% infected) and when described by numbers (e.g., 100,000 infected). Another example may include organizational discourse and the usage of numerals to stress the number of customers served every month, the amount of yearly income, etc. In the mentioned case, big numbers are used to create the image of a company as a leader in its industry, being a reliable and an efficient player on the market. The meso dimension of investigation focuses on structures used to decode and encode messages in digital storytelling. Taking the grammar perspective into account, active voice is used to stress the personal involvement of speakers or the described individuals in the presented topic (e.g., we have made instead of it was made). Active voice is often used to stress responsibility and devotion to the discussed phenomenon. Moreover, questions are used in digital storytelling to draw the viewers' attention to the topic. The macro approach, on the other hand, is related to the selection of texts used in digital storytelling. They include, among others, stories, interviews, descriptions, presentations of websites, and other online textual forms.

Verbal tools can also be subcategorized into literal and nonliteral methods of linguistic expression. Literal tools encompass the types of meanings that can be deduced directly, whereas nonliteral communication makes use of speakers' intuition and knowledge. Nonliteral (or figurative) discourse relies, e.g., on metaphors in presenting information. Applying the micro perspective, metaphorical names are used to tell a story. Relying on a well-known domain in presenting a novel concept turns out to have better explanatory characteristics than the literal way of describing a phenomenon. Apart from the informational function, metaphors, often having some idea of mystery embedded in them, attract the viewers' attention more than literal expressions. Taking into account the sphere of mergers and acquisitions, such names as white knight, lobster trap, or poison pill describe the complicated strategies of takeovers in just a few words. In the case of specialized data, metaphors offer an explanation for laymen and facilitate the communication between specialists representing different domains. Using the macro approach, metaphors are used to create organizational identity. For example, such metaphors as organization as a teacher or organization as a family can be constructed after analyzing the content presented in computer-related sources. Discussing the individual level of metaphorical storytelling, metaphors can be used to create the identity of speakers. Forming new metaphors makes the creator of digital storytelling characteristic and outstanding. At the same time, it should be noticed that digital storytelling may facilitate the creation of novel symbolic representations. Since metaphors originate when people observe reality, digital storytelling, like other types of texts and communication channels, may be a potential source of new metaphors. Thus, digital storytelling may stimulate one's creativity in terms of writing, speaking, or painting.

The sphere of nonverbal tools encompasses mainly the auditory and pictorial ways of communicating information. As far as the auditory dimension is concerned, such issues as soundtracks, jingles, and sounds, as well as the voice of the speaker, are taken into account. The pictorial dimension is represented by the use of any type of picture-related representation, such as drawings and pictures. It should be stated that all the mentioned tools should not be discussed in isolation; the power of digital storytelling lies in the effective combination of different representations. It should be stressed that the presentation of audio, verbal, and pictorial representations often relies on advanced technology.

The tools used in digital storytelling may also be studied by taking into account the stage of creating and disseminating digital storytelling. For example, the preparation stage may include different qualitative and quantitative methods of data gathering in order to construct a viable representation of the discussed topic. The stage of creating digital storytelling encompasses the application of computer-based technologies to incorporate verbal, audio, and pictorial representations into storytelling.
The discussion of tools used in digital storytelling should also encompass the online and offline methods of disseminating the produced material. As far as the online dimension is concerned, it includes mainly the application of social networking tools, discussion forums, websites, and newsletters that may provide the piece of digital storytelling itself or the link to it. The offline channels, such as books, newspaper articles, or corporate leaflets, include mainly the link to the piece of digital storytelling. As Ryan (2018) mentions, since modern storytelling focuses much on numbers, we can talk about visual data storytelling.

Methodologies of Studying Digital Storytelling

The investigation of digital storytelling should focus on all dimensions related to this form of communication, namely, the verbal, audio, and pictorial ones. Since digital storytelling relies on pictures, the method called video ethnography, used to record people in their natural settings, facilitates the understanding of how films are made. To research the verbal sphere of digital storytelling, one of the approaches used in text studies may be applied. These include ethnographic studies, narrative analysis, narrative semiotics, or functional pragmatics. At the same time, an attempt should be made to focus on an approach that may investigate at least two dimensions simultaneously as well as the interrelation between them.

One of the methods to research digital storytelling from more than one perspective is to adopt Critical Discourse Analysis, which focuses not only on the verbal layer of a studied text but also on the relation between nonverbal and verbal representation and how their coexistence determines the way a given text is perceived. This approach provides information for the authors of digital storytelling on how to make their works reach a relatively large group of people. The complexity and multifactorial character of digital storytelling can be researched by using network theories. For example, Social Network Analysis studies the relations between individuals. In the case of digital storytelling, it may be applied to investigate the relations between the individuals speaking in the piece of digital storytelling as well as between the interlocutors and the audience.

Another network approach, Actor-Network Theory, is used to stress the importance of living and non-living entities in the creation and performance of a given phenomenon. In the case of digital storytelling, it may be used to study the role of technological advancements and human creativity in designing a piece of digital storytelling. Since digital storytelling is aimed at creating certain emotions and responses among the target audience, both researchers and creators of digital storytelling are interested in what linguistic, audio, and pictorial tools make the piece of digital storytelling more effective. In the mentioned case, researchers representing cognitive studies and neuroscience may offer an in-depth analysis of its constituting elements. For example, by observing the brain or the nervous system, scientists may check the reactions of individuals to the presented stimuli. Using such neuroscientific equipment as magnetoencephalography (MEG), transcranial magnetic stimulation (TMS), functional magnetic resonance imaging (fMRI), eye-tracking, or galvanic skin response offers the study of one's reactions without the fear of facing the fake answers that may sometimes happen in standard interviews or surveys.

Applications and Functions of Digital Storytelling

The functions of digital storytelling can be divided into individual and social ones. The individual role of digital storytelling is connected with informing others about one's personal issues as well as giving vent to emotions, opinions, and feelings. It is often used by people to master their own speaking or presentation skills as well as to exercise a technical and computer passion. The social dimension of digital storytelling encompasses the role of digital storytelling in serving functions other than the purely personal ones. For example, digital storytelling is used in education. It draws the attention of students to important issues by using diversified (e.g., verbal and nonverbal) ways to tell a story.
teaching may also result in the active participation of students: they are not only passive recipients of the displayed material but are also capable of constructing their own stories. As far as publication outlets are concerned, there are online platforms devoted to the presentation of digital stories that can be used to enhance one's knowledge of the application of digital storytelling.

The usage of digital storytelling in education can be understood in two ways. One approach is to use digital storytelling to inform viewers about new issues and concepts. Explaining a novel product or a technologically difficult matter by using digital storytelling proves to be more efficient than standard methods of providing information. The second function of digital storytelling is more socially oriented; digital storytelling may facilitate the understanding of intercultural differences or social issues and make people more sensitive to other people's needs, expectations, and problems. It should also be stated that digital storytelling facilitates offering education to those who cannot access the same type of educational package in offline environments. For example, digital storytelling offers education to handicapped people who cannot attend regular schooling, as well as to those who, because of geographical or economic distance, cannot participate in regular classes. Moreover, digital storytelling may provide knowledge for those who are not interested in participating in standard courses but who want to learn just for pleasure. An example of courses that meet the different needs of users is the idea of MOOCs, massive open online courses, offered on the Internet in open access mode. Often created by top universities and supported by international organizations, MOOCs are often accessible for free, or for a charge if an individual is interested in gaining formal proof of taking the course, a certificate, or ECTS points. By publishing courses on specialized online platforms, MOOCs reach diversified users in different geographical locations who can study the presented content at their own pace.

Another application of digital storytelling concerns the sphere of marketing. Digital storytelling is used by companies to create their identity, promote their products and services, and communicate with broadly understood stakeholders. It should also be mentioned that digital storytelling, being a complex tool itself, may combine different functions at the same time. For example, corporate materials presented by using digital storytelling may serve both marketing and educational functions. A case in point is the policy of adopting Corporate Social Responsibility (CSR), which stresses the active involvement of companies in creating and sustaining harmony with the broadly understood environment. Presenting CSR policies through digital storytelling not only creates the positive image of a company as an active and supportive member of the community but also shows how viewers may take care of the environment themselves. Another social function of digital storytelling is the formation of communities. Those who use digital storytelling may comment on the content and express opinions on the topics presented in the recording. Thus, the application of digital storytelling serves many functions for both individuals and organizations.

Big Data Storytelling

Digital storytelling undergoes constant changes and finds new applications due to the rapid development in the sphere of technology. One of the most visible changes can be observed in the sphere of data, represented by large and complex datasets as well as the handling of them (creating, storing, applying, using, updating). Big data may take different forms, such as written, visual, audio, or video, or more than one form at the same time. Moreover, modern technology allows for changing the form of data into a different one, meeting the needs and expectations of the target audience. The application of big data is connected with a given area of life and type of profession. For example, demographic information, data on business transactions (e.g., purchases), and data on the use of mobile technology are studied in marketing. The attitude to data nowadays also differs from the one that could be observed in the past. Nowadays companies do not only accumulate information after sales but also
monitor and gather data during operations. For example, logistics companies track the route of their products to optimize services and reduce transportation costs. The attitude to data is also connected with the profile of business entities. For example, companies offering online services deal with data on an everyday basis by managing users' data and monitoring interest in the offered merchandise, etc. The main functions of gathering and researching big data encompass the opportunity to profile customers and their needs, analyze how often and what they purchase, and estimate general business trends.

After information is gathered and stored, specialists must take care of presenting it to the audience. Data can be analyzed and models can be created with such programs as, e.g., MATLAB, which offers signal, image, and video processing and data visualization for engineers and scientists. The visualization of big data may be supported by tools (e.g., the Google Maps API) that offer maps and data layers. Such tools provide the visual presentation of data such as, among others, geographical and geospatial data, traffic conditions, public transport data, and weather forecasts. These tools facilitate the creation and cognition of big data used in digital storytelling by making immense information compact and comprehensible. Madhavan et al. (2012) discuss Google Fusion Tables (GFT), which offers collaborative data management in the cloud. This type of tool allows existing data gathered from different producers to be reused and applied in different contexts. It is used by, e.g., journalists to visualize some aspects presented in their articles. Maps created by using GFT can be found by readers of various periodicals (e.g., the UK Guardian, Los Angeles Times, Chicago Tribune, and Texas Tribune).
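The kind of map-based presentation described above can be sketched in a few lines of code. The snippet below is a minimal illustration only: it uses the open-source folium library as a stand-in for the Google Maps API and Fusion Tables workflows named in the entry, and the station names, coordinates, and passenger counts are invented for the example.

```python
# Minimal sketch of presenting geospatial data on an interactive map,
# in the spirit of the map-layer tools discussed above.
# The locations and passenger counts below are invented for illustration.
import folium

readings = [
    ("Station A", 51.5074, -0.1278, 12000),
    ("Station B", 51.5155, -0.0922, 8500),
    ("Station C", 51.5033, -0.1195, 15300),
]

m = folium.Map(location=[51.51, -0.11], zoom_start=13)
for name, lat, lon, passengers in readings:
    folium.CircleMarker(
        location=[lat, lon],
        radius=passengers / 2000,            # scale the marker by the value
        popup=f"{name}: {passengers} passengers",
        fill=True,
    ).add_to(m)

m.save("passenger_map.html")                 # an interactive HTML map
```

A reader or journalist could open the resulting HTML file in a browser and explore the data points interactively, which is the kind of compact, comprehensible presentation of large datasets the entry describes.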
Cross-References

▶ Content Management System (CMS)
▶ Digital Literacy
▶ Humanities (Digital Humanities)
▶ Knowledge Management
▶ Online Identity

Further Reading

Allan, J., Gerard, F., & Barbara, H. (2001). The power of tale. Using narratives for organisational success. Chichester: Wiley.
Bielenia-Grajewska, M. (2014a). The role of figurative language in knowledge management. Knowledge encoding and decoding from the metaphorical perspective. In M. Khosrow-Pour (Ed.), Encyclopedia of information science and technology. Hershey: IGI Publishing.
Bielenia-Grajewska, M. (2014b). CSR online communication: The metaphorical dimension of CSR discourse in the food industry. In R. Tench, W. Sun, & B. Jones (Eds.), Communicating corporate social responsibility: Perspectives and practice (Critical studies on corporate responsibility, governance and sustainability) (Vol. 6). Bingley: Emerald Group Publishing Limited.
Bielenia-Grajewska, M. (2014c). Corporate online social networks and company identity. In R. Alhajj & J. Rokne (Eds.), Encyclopedia of social network analysis and mining. Berlin: Springer.
Denning, S. (2004). Squirrel Inc. A fable of leadership through storytelling. San Francisco: Jossey-Bass.
Henderson, T., & Boje, D. M. (2016). Organizational development and change theory: Managing fractal organizing processes. Abingdon: Routledge.
Madhavan, J., et al. (2012). Big data storytelling through interactive maps. IEEE Data Engineering Bulletin, 35(2), 46–54.
Ryan, L. (2018). Visual data storytelling with Tableau: Story points, telling compelling data narratives. Boston: Addison-Wesley Professional.
Shell, R., & Moussa, M. (2007). The art of woo: Using strategic persuasion to sell your ideas. London: Penguin Books Ltd.

Websites
Educational Uses of Digital Storytelling. http://digitalstorytelling.coe.uh.edu/. Accessed 10 Nov 2014.
Google Fusion Tables. https://developers.google.com/fusiontables/. Accessed 10 Nov 2014.
Matlab. http://www.mathworks.com/products/matlab/.

DIKW Pyramid

▶ Data-Information-Knowledge-Action Model

Disaster Management

▶ Natural Hazards
Disaster Planning

Carolynne Hultquist
Geoinformatics and Earth Observation Laboratory, Department of Geography and Institute for CyberScience, The Pennsylvania State University, University Park, PA, USA

Definition/Introduction

Disaster planning is important for all stages in the disaster management cycle, and it occurs at many levels, from individuals to communities and governments. Planning for disasters at a large scale requires information on physical and human attributes, which can be derived from data collection and analysis of specific areas. This general collection of spatial data can have a large volume and come from a variety of sources. Hazards stemming from different types of natural, man-made, and technological processes require unique planning considerations but can often use a common basis of information to understand the location. A common structure of organization can be adopted despite the need to plan for providing varying resources and implementing the procedures required during unique disaster events.

Planning for Disasters

Disaster planning is a crucial part of the disaster management cycle, as the planning stage supports all the stages in the process. A hazard is the event itself, such as an earthquake or hurricane, but a disaster is when there is loss of human life and livelihood. Part of the planning is preparation to be less vulnerable to hazards so that an event does not become a devastating disaster. Information on who is located where, and how vulnerable that area is to certain hazards, is important for planning for a disaster. It involves considering data on physical processes and human interests and developing disaster response plans, procedures, and processes to make decisions on how to respond during the event and to recover effectively. Planning for a disaster occurs at many levels, from individuals to national and international levels.

Preparation for disasters can be initiated from the ground up as grassroots movements and from the top down as government policy. Community and individual planning normally stems from personal initiatives to establish a procedure and store basic human needs such as food, water, and medical supplies, as disasters can disrupt access to these essentials. Government planning occurs at federal, state, and local levels with the goals of integrating disaster preparation and responses at each level so as to use resources efficiently and not duplicate efforts. Long-term government planning for hazards can include steps such as analyzing data in order to make strategic decisions that mitigate the hazard and do not contribute to costlier disasters. Disaster planning should include data that are location specific and that will provide relevant information to responders. Geographical Information Systems (GIS) can be used to store data as layers of features representing physical and social attributes. Data collection and analysis should not start only when the event occurs but should be used to help with the prediction of risks and assessments of impact. During the event, data received in real time can help direct efforts and be fused with pre-existing data to provide context.

One of the goals of planning for disasters is to take steps to be more resilient to hazards. The majority of the time invested in planning for hazards is for those that are likely to occur in the specific area in which people live or that would be the most devastating. Hazards are spatial processes, as physical and meteorological conditions occur in relation to the features and properties of the Earth. Disasters are inherently linked to human location and involve human-environment interactions; it is clear that a major earthquake and tsunami in Alaska will likely cause much less loss of life than if it occurs off the coast of Japan. A spatial relationship exists both for the occurrence of a hazardous event and for the event to have the human impact that makes it a disaster. Impact is primarily measured in human loss, but other considerations such as financial and environmental losses are often considered
(Hultquist et al. 2015). Geospatial data are often used to recognize and analyze changes to better understand this relationship through the use of technologies. Monitoring networks are often put in place to identify that events are taking place, in order to have data on the specific hazards that are of interest to the region.

Varying considerations are necessary to plan for natural physical processes and for man-made and technological hazards. Even an individual hazard such as a flood can involve many physical processes that could be considered from the views of hydrology, soil science, climate, meteorology, etc. Planning the resources and procedures needed to respond to specific aspects of hazards differs greatly; however, the structure for how operations are to occur can be consistent. A common structure for how operations are to occur can be planned by adopting an "all hazards approach" (FEMA 1996). It is important to have an all-hazards policy to provide a unified approach to handling disasters; having a structure of operations is necessary for decision-makers and responders to proceed in light of compounding events.

Often hazards have multiple types of impacts. A hurricane, for example, is a meteorological event often associated with the high winds involved, but the resulting flooding can also be significantly impactful from both storm surge and rainfall. The March 2011 Japanese disaster was a compounding disaster case, as it started with an earthquake which caused a tsunami, and both of these contributed to a nuclear accident. Earthquakes are primarily caused by geologic shifts, and tsunamis are generated by geological and bathymetric configurations which can cause flooding. In this case in Japan, flooding compounded the event by making the backup diesel generators nonfunctional. Part of being resilient is having an adaptive system to face such compounding events, as many failures are made worse by responses that are not flexible, and previously good plans can cause unforeseen failures (Woods 2010).

Consistent experience with a hazard can lead to general knowledge of what to plan for given an event. Tornados in Oklahoma or earthquakes in Japan are examples of hazards that are probable to occur in specific areas, so that the population gains familiarity. However, after many years without a major hazard of a specific type, the collective memory of knowing how to recognize such phenomena as a tsunami is lost. Unfortunately, many people went out on the beach when the waters receded during the 2004 Indian Ocean tsunami without realizing that this is an indicator to flee. Likewise, there is a risk in having too much experience, as people can become complacent, for example, not seeking safety when they hear a tornado warning, or not feeling the need to evacuate because the tsunami walls have handled smaller previous events and "the big one" is not foreseeable.

Conclusion

Disaster planning is essential to the success of the further stages in the disaster management cycle. It is necessary to analyze where disasters are most likely to occur in order to be prepared for the event by having a data-driven understanding of human interests, physical attributes, and available resources. However, when the perceived risk is low, less data collection planning is often implemented for disasters that are not foreseen to occur in an area or to be so severe, which leads to unexpected challenges. Organized planning is needed at all levels of society for the many different types of physical processes and man-made and technological hazards, which require unique planning considerations.

Cross-References

▶ Big Geo-Data
▶ Big Variety Data
▶ Data Fusion

Further Reading

FEMA. (1996). Guide for all-hazard emergency operations planning. Washington, DC: The Federal Emergency Management Agency. https://www.fema.gov/pdf/plan/slg101.pdf.
Hultquist, C., Simpson, M., Cervone, G., & Huang, Q. (2015). Using nightlight remote sensing imagery and Twitter data to study power outages. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on the Use of GIS in Emergency Management (EM-GIS '15). ACM, New York, NY, Article 6, 6 pages. https://doi.org/10.1145/2835596.2835601.
Woods, D. (2010). How do systems manage their adaptive capacity to successfully handle disruptions? A resilience engineering perspective. Complex adaptive systems – Resilience, robustness, and evolvability: Papers from the Association for the Advancement of Artificial Intelligence (AAAI): Fall symposium (FS-10-03).

Discovery Analytics, Discovery Informatics

Connie L. McNeely
George Mason University, Fairfax, VA, USA

While big data is a defining feature of today's information and knowledge society, a huge and widening gap exists between the ability to accumulate the data and the ability to make effective use of it to advance discovery (Honavar 2014). This issue is particularly prominent in scientific and business arenas and, while the growth and collection of massive and complex data have been made possible by technological development, significant challenges remain in terms of its actual usefulness for productive and analytical purposes. Realizing the potential of big data to both accelerate and transform knowledge creation, discovery requires a deeper understanding of related processes that are central to its use (Honavar 2014). Advances in computing, storage, and communication technologies make it possible to organize, annotate, link, share, discuss, and analyze increasingly large and diverse data. Accordingly, aimed at understanding the role of information and intelligent systems in improving and innovating scientific and technological processes in ways that will accelerate discoveries, discovery analytics and discovery informatics are focused on identifying processes that require knowledge assimilation and reasoning.

Discovery analytics – involving the analysis and exploration of the data to determine trends and patterns – and discovery informatics – referring to the application and use of related findings – are based on the engagement of principles of intelligent computing and information systems to understand, automate, improve, and innovate various aspects of those processes (Gil and Hirsh 2012). Unstructured data is of particular note in this regard. Generated from various sources (e.g., the tracking of website clicks, capturing user sentiments from online sources or documents such as social media platforms, bulletin boards, telephone calls, blogs, or fora) and stored in nonrelational data repositories, or the "data lake," vast amounts of unstructured data are analyzed to determine patterns that might provide knowledge insights and advantages in various arenas, such as business intelligence and scientific discovery. Discovery analytics are used to mine vast portions of the data lake for randomly occurring patterns; the bigger the data in the lake, the better the odds of finding random patterns that, depending on interpretation, could be useful for knowledge creation and application (Sommer 2019).

In reference to big data, discovery analytics has been delineated according to four types of discovery: visual, data, information, and event (Smith 2013; Cosentino 2013). Visual discovery has been linked, for example, to big data profiling and capacities for visualizing data. Combined with data mining and other techniques, visual discovery attends to enhanced predictive capability and usability. Data discovery points to the ability to combine and relate data from various sources, with the idea of expanding what it is possible to know. Data-centric discovery is interactive and based on massive volumes of source data for analysis or modeling. Information discovery rests on search technologies, especially among widely distributed systems and big data. Different types and levels of search are core to information discovery based on a variety of sources from which big data are derived, from documents to social media to machine data. Event discovery – also called "big data in motion" – represents operational intelligence, involving the data collected on and observation of various phenomena,
actions, or events, providing rationales to explain the relationships among them.

In practical terms, given their critical roles in realizing the transformative potential of big data, discovery analytics and informatics can benefit multiple areas of societal priority and well-being (e.g., education, food, health, environment, energy, and security) (Honavar 2014). However, also in practical terms, discovering meaning and utility in big data requires advances in representations and models for describing and predicting underlying phenomena. Automation dictates the translation of those representations and models into forms that can be queried and processed. In this regard, computing – the science of information processing – offers tools for studying the processes that underlie discovery, concerned primarily with acquiring, organizing, verifying, validating, integrating, analyzing, and communicating information. Automating aspects of discovery and developing related tools are central to advancing discovery analytics and informatics and to realizing the full potential of big data. Doing so means meeting challenges to understand and formalize the representations, processes, and organizational structures that are crucial to discovery; to design, develop, and assess related information artifacts; and to apply those artifacts and systems to facilitate discovery (Honavar 2014; Gil and Hirsh 2012; Dzeroski and Todorovski 2007).

Cross-References

▶ Data Lake
▶ Data Mining
▶ Informatics
▶ Unstructured Data

Further Reading

Cosentino, T. (2013, August 19). Three major trends in new discovery analytics. Smart Data Collective. https://www.smartdatacollective.com/three-major-trends-new-discovery-analytics/.
Dzeroski, S., & Todorovski, L. (Eds.). (2007). Computational discovery of communicable scientific knowledge. Berlin: Springer.
Gil, Y., & Hirsh, Y. (2012). Discovery informatics: AI opportunities in scientific discovery. AAAI Fall Symposium Technical Report FS-12-03.
Honavar, V. G. (2014). The promise and potential of big data: A case for discovery informatics. Review of Policy Research, 31(4), 326–330.
Smith, M. (2013, May 7). Four types of discovery technology for using big data intelligently. https://marksmith.ventanaresearch.com/marksmith-blog/2013/05/07/four-types-of-discovery-technology-for-using-big-data-intelligently.
Sommer, R. (2019). Data management and the efficacy of big data: An overview. International Journal of Business and Management, 7(3), 82–86.

Diversity

Adele Weiner¹ and Kim Lorber²
¹Audrey Cohen School For Human Services and Education, Metropolitan College of New York, New York, NY, USA
²Social Work Convening Group, Ramapo College of New Jersey, Mahwah, NJ, USA

Diversity and Big Data

Diversity reflects a number of different sociocultural demographic variables including, but not limited to, race, ethnicity, religion, gender, national origin, disability, sexual orientation, age, education, and socioeconomic class. Big data refers to extremely large amounts of information that is collected and can be analyzed to identify trends, patterns, and relationships. The data itself is not as important as how it is used. Census data is an example of big data that provides information about characteristics across nations and populations. Other big data is used by multinational organizations, such as the World Bank and the United Nations, to help document and understand how policies and programs differentially affect diverse populations. In the USA, analysis of big data on voting, housing, and employment patterns led to the development of affirmative action and anti-discrimination policies
and laws that identify and redress discrimination based on diversity characteristics.

Self-reported Diversity Information

Many of the mechanisms used to create big datasets depend on self-reports, as with the US Census and public school records. When self-reporting, individuals may present themselves inaccurately because of concerns about the use of the data or their perceived status within society. For example, self-descriptions of race or ethnicity may not be reported by respondents because of political or philosophical perceptions of inadequate categories, which do not meet an individual's self-definition. Some data, such as age, may appear to be fairly objective, but the person completing the form may have inaccurate information or other reasons for being imprecise. Options for identifying sex may only be male and female, which requires transgender and other individuals to select one or the other when, perhaps, they self-identify differently.

The data collection process or forms may introduce inaccuracies in the information. On the 2010 US Census short form, race and ethnicity seem to be merged: Korean, Chinese, and Japanese are listed as races. In the 2010 Census, questions ask about the relationship of each household member to person #1 (potentially the head of the household), and rather than spouse the form offers the choice of husband or wife. Many gay and lesbian individuals, even if married, may not use these terms to self-identify, and hopefully spouse will be provided as an answer option in the next cycle. The identification of each person's sex on the form may currently allow the federal government to identify same-sex marriages. On the other hand, individuals who have concerns about privacy and potential discrimination may not disclose their marital status. Heterosexual couples, living as if they are married, may self-identify as such even if not legally wed. Census data is primarily descriptive and can be used by both municipalities and merchants to identify certain populations.

In 2010, the US Census eliminated the long form, which was administered to only a sample of the population, and replaced it with the American Community Survey (ACS). The ACS is a continuous survey designed to provide reliable and timely demographic, housing, social, and economic data every year. This large dataset is much more extensive than that collected by the Census and offers the opportunity to determine the relationship of some diversity variables, such as gender, race, and ethnicity, to economic, housing, employment, and educational variables. Again, this data is self-reported, and persons completing the form may interpret questions differently. For example, one question asks the respondent how well they speak English (very well, well, not well, not at all). It is easy to see how a native speaker of English and a person for whom it is a second language may have different fluency self-perceptions. In addition, it is possible to identify individuals with functional disabilities from this survey, but not specifics.

Private and Public Records and Diversity

Both public and private information is aggregated into large datasets. Although individual health information is private, it is collected and analyzed by insurance networks and governmental entities. Health datasets may be used to make inferences about the health needs of diverse populations and their utilization of services. Demographic data may demonstrate certain health conditions that are more prevalent among specific populations. For example, the Centers for Disease Control uses collected data on a variety of health indicators for African-American, Hispanic, or Latino populations, and for men's and women's health. This data is grouped and provides information about the health needs of various populations, which can focus prevention, education, and treatment efforts.

Such large databases have been developed by health networks and insurance companies to facilitate health care. In the USA, the Health Insurance Portability and Accountability Act (HIPAA) established rules regarding the use of this data and to protect the privacy of individuals' medical information. The
Centers for Medicare and Medicaid Services has developed large datasets of information collected by health care providers that can be analyzed for research and for policy and programming decisions.

Other records can be used to collect population information when paired with demographic diversity variables. School records can be used to highlight the needs of children in a given community. Library book borrowing is recorded and can provide information about the needs and interests of book borrowers. Data collected by the Internal Revenue Service for tax purposes can also be used to identify low-income neighborhoods. And certainly information collected by government programs such as Social Security, Medicare, Medicaid, Temporary Assistance for Needy Families (TANF), and the Supplemental Nutrition Assistance Program (Food Stamps) can link diversity to incomes, housing, and other community needs.

Social Media and Retail Data

Information about diversity can also be gleaned from a variety of indirect data collection methods used by social media and retail sources. The hair products a person buys may provide clues as to their race, while the types of books they purchase may indicate religious beliefs. Retailers use this kind of information for targeted advertising campaigns and special offers for potential sales, while also selling customer lists of relevant consumers to other vendors. This occurs in both online and brick-and-mortar stores when a person uses their credit cards. Medical equipment, cosmetics, foods and spices, books, and vitamin supplements all may give clues to a person's race, age, religion, ethnicity, sexuality, or disability. When one uses a credit or store discount card, the information is aggregated to create a profile of the consumer, even though they have not provided this information directly. Analysis of such large amounts of consumer information allows retailers to adapt their inventory to meet the specific needs of their customers and to market to them individually through electronic or mail promotions. For example, a person who buys cosmetics primarily used by African-Americans might receive additional communications about a new line of ethnically diverse dolls.

Big data is generated when an individual searches the Internet, even if they do not purchase an item. Many free services, such as Facebook and Google, use analysis of page views to place targeted advertisements on members' pages. Not only do they earn income from these ads, but when a person "Likes" an item, the advertisement is then shown to all the others in the person's network. Social media companies also use this data to show other products they estimate the user may be interested in. This can easily be demonstrated by a simple experiment: go online and look at a variety of items not normally of interest to you and see how long it takes for advertisements of these products to appear on the webpages you visit. Imagine what happens if a person looks for sensitive information online, perhaps about sexuality, abortion, or a serious medical condition, and another family member uses the same device and sees advertisements linked to these private searches. This is even more challenging when people use work computers, where there is no assurance of privacy.

Conclusion

People are becoming increasingly aware that current big data mining and analytics will provide private information to be used without their permission. There are concerns about the ways diversity identification information can be used by retailers, governments, and insurers. Such information can positively redress discrimination and inequities experienced by individuals who are members of diverse, minority groups. On the other hand, accessing this information may violate individual privacy. In the age of big data and the Internet, new data collection methods are being created and used in ways not covered by current legislation. Regulations to maintain privacy and prevent data from being used to discriminate against diverse groups need to be adjusted to deal with the rapidly changing access to data.
Driver Behavior Analytics 407

Cross-References

▶ Biomedical Data
▶ Census Bureau (U.S.)
▶ Facebook
▶ Gender and Sexuality
▶ Google
▶ Religion

Further Reading

American Community Survey – http://www.census.gov/acs/www/.
Centers for Disease Control and Prevention – Health Data Interactive – http://www.cdc.gov/nchs/hdi.htm.
Population Reference Bureau – http://www.prb.org/.
The World Bank – Data – http://data.worldbank.org/.
The United Nations – UNdata – http://data.un.org/.
U.S. Census – http://www.census.gov/.

Document-Oriented Database

▶ NoSQL (Not Structured Query Language)

DP

▶ Data Processing

Driver Behavior Analytics

Seref Sagiroglu
Department of Computer Engineering, Gazi University, Ankara, Turkey

Driver or Driving Behavior Analytics

Internet technologies support many new fields of research, innovation, and technology, as well as the development of new applications and implementations. Recently, big data analytics and technologies have been helping to improve quality, systems, production, processes, progress, and productivity in many fields, institutions, sectors, applications, and implementations. They also help organizations make better plans and decisions, give better service, gain advantage, establish new companies, and arrive at new discoveries, outputs, findings, perceptions, thoughts, and even judgments with the support of big data techniques, technologies, and analytics. In order to analyze, model, establish, forecast, or predict driver/driving behavior, the driver on duty (man, woman, or machine), the driving media (the in-vehicle system), and the driving environment (inside or outside the vehicle) are considered. In order to understand these elements clearly, the first thing to do is to understand the data, data types, data volume, data structure, and the methods and methodology used in data analysis and analytics. Figures 1 and 2 briefly show the revolution and change of the data journey. If these data are collected properly, better analysis or analytics can be achieved, and more benefits, outcomes, findings, and profits might be acquired by industry, sector, university, or institution.

In order to understand big data analytics, it is important to understand the concept of driving/driver behavior so as to develop better and faster systems. As shown in Fig. 3, to explain and analyze driver/driving behavior, there are three major issues to be considered, as given below:

1. Driver (man, woman, machine, etc.)
2. Driving media (car, lorry, truck, motorbike, cycle, ship, aircraft, etc.)
3. Driving environment (roads, motorways, highways, inner city, intersections, heavy traffic, highway design standards, crowds, weather conditions, etc.)

Driving behavior or driver behavior is a cyclic process covering media, environment, and driver, as illustrated in Fig. 3. Value can be achieved by analyzing the data acquired from drivers, driving media, and driving environments, especially using big data analytics. Even if driving behavioral data were categorized into three different groups in the literature (Miyajima et al. 2006; Wakita et al. 2005), it can be categorized into five groups when big data analytics is considered, for better understanding.
Driver Behavior Analytics, Fig. 1 Data revolution (data, large data, big data, smart data)

Driver Behavior Analytics, Fig. 2 Data types (static data, relative data, time series data, streaming data)

Big data features play a crucial role here, given the availability of different types of vehicles, sensors, environments, and supporting technology. The categorization, covering new suggestions, is given below:

1. Vehicle operational data:
   – Steering angle, velocity, acceleration, engine speed, engine on-off, etc.
2. Vehicle data:
   – Gas pedal position, various sensor data, maintenance records, etc.
3. Driver psychiatric/psychological data:
   – Driver record, driver mood, chronic illness, hospital records, etc.
4. Driver physical data:
   – Usage or driving records, following distance, drowsiness, face parts, eye movement, etc.
5. Vehicle outside data:
   – Outside temperature, humidity, distance to other vehicles, speeds of other vehicles, road signs, radio alerts, heavy traffic records, coordinate info, etc.

Driver Behavior Analytics, Fig. 3 Basic dynamics of driving behavior (a cycle linking driver, media, and environment)

As can be seen above, there have been many data types available in the literature. It can be clearly pointed out that most of the studies in the literature have focused on a subset of these data types. It is expected that all five data types, or even more, might be used for
modeling driver behavior considering big data analytics.

Driver/driving behavior can be estimated or predicted with the help of available models, formulas, or theories. When the literature is reviewed, there have been many methods based on machine learning, SVM, Random Forest, Naive Bayes, KNN, K-means, statistical methods, MLP, fuzzy neural networks (FNN), Gaussian mixture models, and HMM models applied to modeling, predicting, and estimating driver behaviors (Enev et al. 2016; Kwak et al. 2017; Wakita et al. 2006; Meng et al. 2006; Miyajima et al. 2007; Nishiwaki et al. 2007; Choi et al. 2007; Wahab et al. 2009; Dongarkar and Das 2012; Van Ly et al. 2013; Zhang et al. 2014). It should be emphasized that there are no big data analytics solutions yet.

In order to understand the mathematics behind this topic, some of the articles and important models available in the literature are reviewed and summarized below.

– A model for the car-following task (Wakita et al. 2005):
This is a car-following (stimulus-response) model. The task involves following a vehicle in front at a constant distance; the headway and relative velocity act as stimuli, and the model calculates the response of the driver as an accelerating or decelerating action, as in Wakita et al. (2005):

\dot{v}(t + T) = C_1\,\dot{h}(t) + C_2\,\{h(t) - D\}

where
C_1 and C_2 are the response sensitivities to the stimuli,
D is the optimum distance to the vehicle in front, and
T is the response delay. These values may be constants or functions of other variables.
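A minimal numerical sketch of this stimulus-response rule is given below. The gain values C_1 and C_2, the desired headway D, and the example inputs are illustrative assumptions only, not values reported by Wakita et al. (2005).

```python
# Minimal sketch of the stimulus-response car-following rule above.
# C1, C2 (gains), D (desired headway, m), and the inputs are illustrative.

def following_response(h, h_dot, C1=0.5, C2=0.1, D=30.0):
    """Acceleration command v_dot(t + T) given headway h(t) [m]
    and relative velocity h_dot(t) [m/s]."""
    return C1 * h_dot + C2 * (h - D)

# Example: the lead vehicle is 40 m ahead and pulling away at 2 m/s,
# so the model asks the following driver to accelerate.
print(following_response(h=40.0, h_dot=2.0))   # 0.5*2 + 0.1*10 = 2.0 m/s^2
```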
k ¼ argmaxfAlogPðGPjlG,k Þk
– Cepstral Analysis for Gas Pedal (GP) and þ ð1  AÞlogPðBPjlB,k Þg
Brake Pedal (BP) Pressure (Öztürk and Erzin
2012; Campo et al. 2014): where
Cepstral feature extraction is used for driv-
ing behaviour signals as reported in (Öztürk 0  A  1.
410 Driver Behavior Analytics

GP and BP are the cepstral sequences of gas and 2011; Salemi 2015; Öztürk and Erzin 2012;
brake pedals. Campo et al. 2014; Wakita et al. 2006; Meng
lG,k and lB,k are the k-th driver models of GP and et al. 2006; Miyajima et al. 2007; Nishiwaki
BP, respectively. et al. 2007; Choi et al. 2007; Wahab et al. 2009;
A is the linear combination weight for the likeli- Dongarkar and Das 2012; Van Ly et al. 2013;
hood of gas pedal signals. Zhang et al. 2014) and combined for
representation.
In order to have data for driving analysis, sim- Today technology supports to acquire those
ulators or data generators were used for data col- parameters and data given above from many inter-
lection (Wakita et al. 2006; Meng et al. 2006; nal and external sensors. These sensors might be
Miyajima et al. 2007; Zhang et al. 2014), but multi-sensors, audio, video, picture or text. In
today most data have been picked up from real some cases, questionnaires are also used for this
environments mounted on/in vehicles, carried analysis. The literature on driver behavior also
mobile devices, or wearable devices on drivers. covers available techniques and technologies
Especially, driving behavioral signals are col- (Miyajima et al. 2006; Hallac et al. 2016; Wakita
lected using data collection vehicles designed et al. 2005; Enev et al. 2016; Kwak et al. 2017;
and supported by companies, projects, and Hartley 2000; Colten and Altevogt 2006; Jensen
research groups (Miyajima et al. 2006; Hallac et al. 2011; Salemi 2015; Öztürk and Erzin 2012;
et al. 2016; Enev et al. 2016; Kwak et al. 2017; Campo et al. 2014; Wakita et al. 2006; Meng et al.
Nishiwaki et al. 2007; Choi et al. 2007; Wahab 2006; Miyajima et al. 2007; Nishiwaki et al. 2007;
et al. 2009; Dongarkar and Das 2012; Van Ly et al. Choi et al. 2007; Wahab et al. 2009; Dongarkar
2013; Zhang et al. 2014). It should be emphasized and Das 2012; Terzi et al, 2018; Van Ly et al.
that recent researches focus on data collection 2013; Zhang et al. 2014). When the analysis
from vehicles via CAN or other protocols for based on perception of big data is considered,
further complex analysis in many other fields. more parameters and data types might be used to
When the literature is reviewed, there have analyze driver behavior more accurately and com-
been a number of approaches to analyze driver pactly. For doing that, the data not only collected
behaviors. The features or parameters of data col- in Table 1 but also the data such as weather con-
lected from vehicles are given in Table 1. These dition, road safety info, previous health or driving
data are obtained from the literature (Miyajima records of drivers, accident records, road condi-
et al. 2006; Hallac et al. 2016; Wakita et al. tion, real-time alerts for traffic, traffic jam, speed,
2005; Enev et al. 2016; Kwak et al. 2017; Hartley etc. can be also used for big data analytics for
2000; Colten and Altevogt 2006; Jensen et al. achieving better models, evaluations, results,

Driver Behavior Analytics, Table 1 Parameters used for estimating/predicting/modeling driver behavior
- Vehicle speed, - Brake pedal position, pressure - Face mood
acceleration, and - Gas (accelerator) pedal position, pressure - Head movement
deceleration - Transmission oil temperature, activation of air - Sleepy face, sleepiness, tiredness,
- Steering, steering wheel compressor, torque converter speed, wheel - Gyro
- Gear shift velocity, rear, front, left hand, right hand - Stress, drowsiness
- Engine condition, - Retrader - Lane deviation
torque, rpm, speed, - Throttle position - Long-term fuel trim bank, intake
coolant temperature - Start-stop air pressure, friction torque,
- Vehicle air conditioning - Turning signal calculated load value
- Yaw rate - Following distance from vehicle
- Shaft angular velocity ahead
- Fuel consumption
- Mass air flow rate
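The cepstral-feature extraction and GMM-based identification scheme summarized earlier can be sketched as follows. This is an illustrative outline only, assuming scikit-learn's GaussianMixture; the frame length, the number of mixture components, the number of cepstral coefficients kept, and the weight A are placeholder choices rather than settings from the cited studies.

```python
# Sketch of cepstral features per pedal frame plus per-driver GMM scoring,
# in the spirit of the identification rule k* = argmax_k {A logP(GP) + (1-A) logP(BP)}.
import numpy as np
from sklearn.mixture import GaussianMixture

def cepstral_features(signal, frame_len=128, hop=64, n_coeffs=16):
    """Real-cepstrum feature vectors, one per frame (low-order bins kept)."""
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * window
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)   # log |F(x)|
        cepstrum = np.fft.irfft(log_mag)                        # F^-1{ log F(x) }
        feats.append(cepstrum[:n_coeffs])
    return np.array(feats)

def train_driver_models(pedal_signals, n_components=8):
    """pedal_signals maps driver id -> 1-D pedal-pressure array."""
    models = {}
    for driver, signal in pedal_signals.items():
        gmm = GaussianMixture(n_components, covariance_type="diag")
        models[driver] = gmm.fit(cepstral_features(signal))
    return models

def identify(gas, brake, gas_models, brake_models, A=0.6):
    """Return the driver id maximizing the weighted GMM log-likelihood
    over the gas-pedal (GP) and brake-pedal (BP) cepstral sequences."""
    gp, bp = cepstral_features(gas), cepstral_features(brake)
    return max(gas_models,
               key=lambda d: A * gas_models[d].score(gp)
                             + (1 - A) * brake_models[d].score(bp))
```

In practice the scores would be computed over full driving sessions, and the model order and the weight A would be tuned on held-out data rather than fixed as above.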
Today, technology supports the acquisition of the parameters and data given above from many internal and external sensors. These sensors might be multi-sensors or capture audio, video, pictures, or text. In some cases, questionnaires are also used for this analysis. The literature on driver behavior also covers the available techniques and technologies (Miyajima et al. 2006; Hallac et al. 2016; Wakita et al. 2005; Enev et al. 2016; Kwak et al. 2017; Hartley 2000; Colten and Altevogt 2006; Jensen et al. 2011; Salemi 2015; Öztürk and Erzin 2012; Campo et al. 2014; Wakita et al. 2006; Meng et al. 2006; Miyajima et al. 2007; Nishiwaki et al. 2007; Choi et al. 2007; Wahab et al. 2009; Dongarkar and Das 2012; Terzi et al. 2018; Van Ly et al. 2013; Zhang et al. 2014). When analysis from a big data perspective is considered, more parameters and data types might be used to analyze driver behavior more accurately and compactly. For doing that, not only the data collected in Table 1 but also data such as weather conditions, road safety info, previous health or driving records of drivers, accident records, road conditions, real-time traffic alerts, traffic jams, speed, etc. can also be used in big data analytics for achieving better models, evaluations, results, outcomes, or values. In particular, a recent and comprehensive survey introduced by Terzi et al. (2018) provides a big data perspective on driver/driving behavior and discusses the contribution of big data analytics to the automotive industry and research field.

Conclusions

This entry concludes that driver behavior analytics is a challenging problem. Even though there have been many studies available in the literature, studies involving big data analytics are very rare. In order to achieve this task, the relevant points are discussed and given below.

Developing a study on driver behavior based on big data analytics:

– Requires suitable infrastructure, algorithms, platforms, and enough data for better and faster analytics
– Enables more or better models for driver behavior
– Provides solutions not only for one driver but also for a large number of drivers belonging to a company, institution, etc.
– Requires not only data but also smart data for further analytics
– Provides new solutions, gains, or perceptions for problems
– Needs experts and expertise for getting the expected solutions
– Costs more than classical approaches

There are other issues that might affect the success or failure of big data analytics for modeling/predicting driving behavior, due to:

– The availability of not enough publications and big data sources for research.
– The difficulty of collecting proper data having different data sets, time intervals, sizes, formats, or parameters.
– Limitations of the bandwidth of the mobile technologies or operators used in transferring data from vehicles to storage for collection or analysis of the data for further progress.
– Benefit-cost relations. It should be considered that in some cases the cost would be high in comparison with the achieved value. It should be emphasized that having big data does not always guarantee getting value from analytics.
– Facing lost connections in some places when transferring the data from the vehicle to the system.
– The lack of algorithms needed for real-time applications.

As final words, the solutions and suggestions provided in this entry might help to reduce traffic accidents, injuries, losses, traffic jams, etc., and also increase productivity, quality, and safety, not only for safer, better, and more comfortable driving but also for designing, developing, and manufacturing better vehicles, establishing better roads, and providing comfortable driving, etc.

Further Reading

Campo, I., Finker, R., Martinez, M. V., Echanobe, J., & Doctor, F. (2014). A real-time driver identification system based on artificial neural networks and cepstral analysis. 2014 IEEE International Joint Conference on Neural Networks (IJCNN), 6–11 July 2014, Beijing, pp. 1848–1855.
Choi, S., Kim, J., Kwak, D., Angkititrakul, P., & Hansen, J. H. (2007). Analysis and classification of driver behavior using in-vehicle CAN-bus information. In Biennial workshop on DSP for in-vehicle and mobile systems, pp. 17–19.
Colten, H. R., & Altevogt, B. M. (Eds.). (2006). Sleep disorders and sleep deprivation: An unmet public health problem. Institute of Medicine (US) Committee on Sleep Medicine and Research. Washington, DC: National Academies Press (US). ISBN: 0-309-10111-5.
Dongarkar, G. K., & Das, M. (2012). Driver classification for optimization of energy usage in a vehicle. Procedia Computer Science, 8, 388–393.
Enev, M., Takakuwa, A., Koscher, K., & Kohno, T. (2016). Automobile driver fingerprinting. Proceedings on Privacy Enhancing Technologies, 2016(1), 34–50.
Hallac, D., Sharang, A., Stahlmann, R., Lamprecht, A., Huber, M., Roehder, M., Sosic, R., & Leskovec, J. (2016). Driver identification using automobile sensor data from a single turn. 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Windsor Oceanico Hotel, Rio de Janeiro, 1–4 Nov 2016.
Hartley, L. (2000). Review of fatigue detection and prediction technologies. Melbourne: National Road Transport Commission.
Jensen, M., Wagner, J., & Alexander, K. (2011). Analysis of in-vehicle driver behaviour data for improved safety. International Journal of Vehicle Safety, 5(3), 197–212.
Kwak, B. I., Woo, J. Y., & Kim, H. K. (2017). Know your master: Driver profiling-based anti-theft method. arXiv:1704.05223v1 [cs.CR] 18 April.
Meng, X., Lee, K. K., & Xu, Y. (2006). Human driving behavior recognition based on hidden Markov models. In IEEE International Conference on Robotics and Biomimetics 2006 (ROBIO'06) (pp. 274–279). Kunming: China.
Miyajima, C., Nishiwaki, Y., Ozawa, K., Wakita, T., Itou, K., & Takeda, K. (2006). Cepstral analysis of driving behavioral signals for driver identification. In IEEE International Conference on ICASSP, 14–19 May 2006 (pp. 921–924). Toulouse: France. https://doi.org/10.1109/ICASSP.2006.1661427.
Miyajima, C., Nishiwaki, Y., Ozawa, K., Wakita, T., Itou, K., Takeda, K., & Itakura, F. (2007). Driver modeling based on driving behavior and its evaluation in driver identification. Proceedings of the IEEE, 95(2), 427–437.
Nishiwaki, Y., Ozawa, K., Wakita, T., Miyajima, C., Itou, K., & Takeda, K. (2007). Driver identification based on spectral analysis of driving behavioral signals. In J. H. L. Hansen & K. Takeda (Eds.), Advances for in-vehicle and mobile systems – Challenges for international standards (pp. 25–34). Boston: Springer.
Öztürk, E., & Erzin, E. (2012). Driver status identification from driving behavior signals. In J. H. L. Hansen, P. Boyraz, K. Takeda, & H. Abut (Eds.), Digital signal processing for in-vehicle systems and safety (pp. 31–55). New York: Springer.
Salemi, M. (2015). Authenticating drivers based on driving behavior. Ph.D. dissertation. Rutgers University, Graduate School, New Brunswick.
Terzi, R., Sagiroglu, S., & Demirezen, M. U. (2018). Big data perspective for driver/driving behavior. IEEE Intelligent Transportation Systems Magazine, accepted for publication.
Van Ly, M., Martin, S., & Trivedi, M. M. (2013). Driver classification and driving style recognition using inertial sensors. In IEEE Intelligent Vehicles Symposium (IV), 23–26 June 2013 (pp. 1040–1045). Gold Coast: Australia.
Wahab, A., Quek, C., Tan, C. K., & Takeda, K. (2009). Driving profile modeling and recognition based on soft computing approach. IEEE Transactions on Neural Networks, 20(4), 563–582.
Wakita, T., Ozawa, K., Miyajima, C., Igarashi, K., Itou, K., Takeda, K., & Itakura, F. (2005). Driver identification using driving behavior signals. In Proceedings of the 8th International IEEE Conference on Intelligent Transportation Systems, Vienna, 13–16 Sept 2005.
Wakita, T., Ozawa, K., Miyajima, C., Garashi, K. I., Katunobu, I., Takeda, K., & Itakura, F. (2006). Driver identification using driving behavior signals. IEICE Transactions on Information and Systems, 89(3), 1188–1194.
Zhang, X., Zhao, X., & Rong, J. (2014). A study of individual characteristics of driving behavior based on hidden Markov model. Sensors & Transducers, 167(3), 194.

Drones

R. Bruce Anderson¹,² and Alexander Sessums²
¹Earth & Environment, Boston University, Boston, MA, USA
²Florida Southern College, Lakeland, FL, USA

In the presence of the "information explosion," Big Data means big possibilities. Nowhere is the ultimacy of this statement more absolute than in the rising sector of Big Data collection via unmanned aerial vehicles, or drones. The definition of Big Data, much like its purpose, is all-encompassing, overarching, and umbrella-like in function. More fundamentally, the term refers to the category of data sets which are so large that they prove standard methods of data interrogation to no longer be helpful or applicable. In other words, it is "a large volume unstructured data which cannot be handled by standard data base management systems like DBMS, RDBMS, or ORDBMS." Instead, new management software tools must be employed to "capture, curate, manage and process the data within a tolerable elapsed time." The trend toward Big Data in the past few decades has been in large measure due to analytical tools which allow experts to spot correlations in large data volumes where they would previously have been indistinguishable and therefore to identify trends in solutions more accurately.

One technique used to grow data sets is remote sensing and the use of aerial drones. A drone is an unmanned aerial vehicle which can be operated in two distinct ways: either it can be controlled autonomously by onboard software or wirelessly by grounded personnel. What was once a closely guarded military venture has now been
popularized by civilian specialists and is now producing a sci-fi-like sector of the Big Data economy. Over the past two decades, a dramatic rise in the number of civilian drone applications has empowered a breathtaking number of civilian professionals to utilize drones for specialist tasks such as nonmilitary security, firefighting, photography, and agricultural and wildlife conservation. While the stereotypical drone is usually employed on tasks that are too "dull, dirty or dangerous" for human workers to attempt, drones lend themselves to seemingly endless applications. What has been coined "the rise of the drones" has produced an amount of data so vast it may never be evaluated; indeed, critics claim it was never intended to be evaluated but rather stored and garnered as a prized information set with future monetary benefit. For example, on top of the mountains of photographs, statistics, and surveillance the United States military's Unmanned Aerial Vehicles (UAVs) have amassed over the past few years, drone data is phenomenally huge and growing ever larger. The age of big drone data is here, and its presence is pressing itself into our lives, leaving a glut of information for analysts to decipher and some questioning its ethical character. What was once a fairy-tale concept imagined by eccentric data managers is now an accessible and comprehensible mass of data.

Perhaps the greatest beneficiary of the drone data revolution is the discipline of agriculture. Experts predict that 80% of the drone market will be dedicated to agriculture over the next 10 years, resulting in what could potentially be a $100 billion industry by the year 2025. The technique of drones feeding data sets on the farm is referred to as Precision Agriculture and is quickly becoming a favorite tool of farmers. Precision Agriculture is now allowing farmers greater access to statistical information as well as the ability to more accurately interpret natural variables throughout the life cycle of their crop ecosystems. In turn, Precision Agriculture enables farmers increased control over day-to-day farm management and increases farmers' agility in reacting to market circumstances. For example, instead of hiring an aviation company to perform daily pest application, a farmer can now enlist the capabilities of a preprogrammed drone to apply precise amounts of pesticide in precise locations without wasting chemicals. In the American mid-western states, where the economic value of farming is tremendous, tractor farming has now become intelligent. GPS-laden planting tools now enable farmers to "monitor in real-time – while they're planting – where every seed is placed." Drones have immense potential for the farming community and provide intelligent capabilities for smart data collection. Perhaps the chief capability is the ability to identify farm deficiencies and then compute data sets into statistical solutions for farmers. Precision data sets function "by using weather reports, soil conditions, GPS mapping, water resources, commodity market conditions and market demand" and allow "predictive analytics" to then be applied to "improve crop yield" and therefore encourage farmers to "make statistical-based decisions regarding planting, fertilizing and harvesting crops." This anywhere, anytime accessibility of drones is incredibly attractive to modern farmers and is offering tremendous improvement to under-producing farms. As small, cost-effective drones begin to flood American farms, drones and data sets will offer a boon of advantages that traditional farming methods simply cannot compete against.
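To make the predictive analytics workflow described above concrete, the following minimal sketch (in Python, using the scikit-learn library) shows how hypothetical field-level features – weather, soil, planting density, and a drone-derived canopy index – might feed a simple yield-prediction model. The variable names and synthetic data are illustrative assumptions only, not drawn from any actual precision agriculture product.

```python
# Illustrative sketch only: a minimal "precision agriculture" style model that
# predicts crop yield from weather, soil, and planting features. The feature
# names and the synthetic data below are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_fields = 500

# Hypothetical field-level records assembled from weather reports, soil
# sensors, and drone imagery.
rainfall_mm = rng.uniform(200, 800, n_fields)
soil_nitrogen = rng.uniform(10, 60, n_fields)
seed_density = rng.uniform(50, 90, n_fields)      # thousand seeds per acre
canopy_index = rng.uniform(0.2, 0.9, n_fields)    # e.g., NDVI from drone images

X = np.column_stack([rainfall_mm, soil_nitrogen, seed_density, canopy_index])
# Synthetic yield (bushels/acre) with noise, standing in for historical records.
y = (0.05 * rainfall_mm + 0.8 * soil_nitrogen + 0.3 * seed_density
     + 40 * canopy_index + rng.normal(0, 5, n_fields))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("R^2 on held-out fields:", round(model.score(X_test, y_test), 3))
# A grower could then score new fields to prioritize fertilizing or replanting.
```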
Drones and Big Data are also offering themselves to other civilian uses such as meteorology and forestry. Many are expecting that drones and "Big data will soon be used to tame big winds." That is because the American science agency, the National Oceanic and Atmospheric Administration, recently released a pack of "Coyote" drones for use in capturing data on hurricanes. The three-foot drone will enable data to be collected above the surface of the water and below the storm, a place previously inaccessible to aviators. This new access will enable forecasters to better analyze the direction, intensity, and pressure of incoming hurricanes well before they reach land, allowing for better preparations to be formed. Drones have also found employment in forestry, fighting wildfires. Traditional methods of firefighting were often based on "paper maps and gut feelings." With drones, uncertainty is reduced, allowing for
"more information for less cost and it doesn't put anyone in harm's way."

In the military world, especially within Western superpowers, Big Data is now supporting military ventures and enabling better combat decisions to be made during battle. However, big military data is now posing big problems that the civilian world has yet to encounter – the military has a "too much data problem." Over the past decade, thousands of terabytes of information have been captured by orbiting surveillance drones from thousands of locations all across the world. Given that the use of drones in conventional warfare coincided with the height of American activity in the Middle East, the American military complex is drowning in information. Recently, the White House announced it would be investing "more than $200 million" in six separate agencies to develop systems that could "extract knowledge and insights from large and complex collections of digital data." It is thought that this investment will have big advantages for present and future military operations.

Working in tandem, drones and Big Data offer tremendous advantages for both the civilian and military worlds. As drones continue to become invaluable information gatherers, the challenge will be to make sense of the information they collect. As data sets continue to grow, organization and application will be key. Analysts and interpretation software will have to become increasingly creative in order to separate the good data from the menial. However, there is no doubt that drones and Big Data will play a big part in the future of ordinary lives.

Further Reading

Ackerman, S. (2013, April 25). Welcome to the age of big drone data. Retrieved September 1, 2014.
Bell, M. (n.d.). US drone strikes are controversial: Are they war crimes? Retrieved September 1, 2014.
Big Data: A driving force in precision agriculture. (2014, January 1). Retrieved September 1, 2014.
CBS News – Breaking news, U.S., world, business. (n.d.). Retrieved September 1, 2014.
Lobosco, K. (2013, August 19). Drones can change the fight against wildfires. Retrieved September 1, 2014.
Noyes, K. (2014, May 30). Cropping up on every farm: Big Data technology. Retrieved September 1, 2014.
Press, G. (2015, May 9). A very short history of Big Data. Retrieved September 1, 2014.
Wirthman, L. (2014, July 28). How drones and Big Data are creating winds of change for hurricane forecasts. Retrieved September 1, 2014.

Drug Enforcement Administration (DEA)

Damien Van Puyvelde
University of Glasgow, Glasgow, UK

Introduction

The Drug Enforcement Administration (DEA) is the lead US government agency in drug law enforcement. Its employees investigate, identify, disrupt, and dismantle major drug trafficking organizations (DTOs) and their accomplices, interdict illegal drugs before they reach their users, arrest criminals, and fight the diversion of licit drugs in the United States and abroad. Formally a part of the Department of Justice, the DEA is one of the largest federal law enforcement agencies, with close to 10,000 employees working domestically and abroad. In recent years, the DEA has embraced the movement towards greater use of big data to support its missions.

Origins and Evolution

The US government's efforts in the area of drug control date back to the early twentieth century, when the Internal Revenue Service (IRS) actively sought to restrict the sale of opium following the passage of the Harrison Narcotics Act of 1914. The emergence of a drug culture in the United States and the expansion of the international drug market led to further institutionalization of drug law enforcement in the second half of the twentieth century. The US global war on the manufacture, distribution, and use of narcotics started when Congress passed the Controlled Substances
Act of 1970, and President Richard Nixon established the DEA. Presidential Reorganization Plan No. 2 of 1973 merged pre-existing agencies – the Bureau of Narcotics and Dangerous Drugs (BNDD), the Office for Drug Abuse Law Enforcement (ODALE), and the Office of National Narcotics Intelligence (ONNI) – into a single agency. This consolidation of drug law enforcement sought to provide momentum in the "war on drugs," better coordinate the government's drug enforcement strategy, and make drug enforcement more accountable.

Less than a decade after its inception, in 1982, Attorney General William French Smith decided to reorganize drug law enforcement in an effort to centralize drug control and to increase the resources available for the "war on drugs." Smith gave concurrent jurisdiction to the Federal Bureau of Investigation (FBI) and the DEA, and while the DEA remained the principal drug enforcement agency, its administrator was required to report to the FBI director instead of the associate attorney general. This arrangement brought together two of the most important law enforcement agencies in the United States and inevitably generated tensions between them. Such tensions have historically complicated the implementation of the US government's drug control policies.

Over the past 40 years, the DEA has evolved from a small, domestic-oriented agency to one primarily concerned with global law enforcement. In its early days, the DEA employed some 1470 special agents for an annual budget of $74 million. Since then, DEA resources have grown steadily, and, in 2014, the agency employed 4700 special agents, 600 diversion investigators, over 800 intelligence research specialists, and nearly 300 chemists, for a budget of 2.7 billion USD.

Missions

Although the use of drugs has varied over the last four decades, the issues faced by the DEA and the missions of drug law enforcement have remained consistent. Put simply, the DEA's key mission is to put drug traffickers in jail and to dismantle their conspiracy networks. The DEA enforces drug laws targeting both illegal drugs, such as cocaine and heroin, and legally produced but diverted drugs, including stimulants and barbiturates. One of its major responsibilities is to investigate and prepare for the prosecution of major violators of controlled substance laws who are involved in the growing, manufacturing, and distribution of controlled substances appearing in or destined for illicit traffic in the United States. When doing so, the DEA primarily targets the highest echelons of drug trafficking by means of the so-called kingpin strategy. It is also responsible for the seizure and forfeiture of assets related to illicit drug trafficking. From 2005 to 2013, the DEA stripped drug trafficking organizations of more than $25 billion in revenues through seizures. A central aspect of the DEA's work is the management of a national drug intelligence program in cooperation with relevant agencies at the federal, state, local, and international levels. The DEA also supports enforcement-related programs aimed at reducing the availability and demand of illicit controlled substances. This includes the provision of specialized training for state and local law enforcement. It is also responsible for all programs associated with drug law enforcement counterparts in foreign countries and liaison with relevant international organizations on matters relating to international drug control.

The agency has four areas of strategic focus related to its key responsibilities. First, international enforcement includes all interactions with foreign counterparts and host nations to target the leadership, production, transportation, communications, finance, and distribution of major international drug trafficking organizations. Second, domestic enforcement focuses on disrupting and dismantling priority target organizations, that is to say, the most significant domestic and international drug trafficking and money laundering organizations threatening the United States. Third, the DEA advises, assists, and trains state and local law enforcement and local community groups. Finally, the agency prevents, detects, and eliminates the diversion of controlled substances to the black market and away from
their intended (medical) use.

With the advent of the Global War on Terrorism in 2001, the DEA has increasingly sought to prevent, disrupt, and defeat terrorist organizations. In this context, narcoterrorism – which allows hostile organizations to finance their activities through drug trafficking – has been a key concern. This nexus between terrorism and drug trafficking has effectively brought the DEA closer to the US Intelligence Community.

Organization

The DEA headquarters are located in Alexandria, VA, and the agency has 221 domestic offices organized in 21 divisions throughout the United States. The DEA currently has 86 offices in 67 countries around the world. Among these foreign offices, a distinction is made between the more important country offices and smaller resident and regional offices. The DEA is currently headed by an administrator and a deputy administrator who are both appointed by the president and confirmed by the Senate. At lower levels, DEA divisions are led by special agents in charge, or SACs.

The agency is divided into six main divisions: human resources, intelligence, operations, operational support, inspection, and financial management. The DEA is heavily dependent on intelligence to support its missions. Its intelligence division collects information from a variety of human and technical sources. Since the DEA is primarily a law enforcement agency, intelligence is primarily aimed at supporting its operations at the strategic, operational, and tactical levels. DEA analysts work in conjunction with and support special agents working in the field; they search through police files and financial and other records and strive to demonstrate connections and uncover networks. Throughout the years, DEA intelligence capabilities have shifted in and out of the US Intelligence Community. Since 2006, its Office of National Security Intelligence (ONSI) is formally part of the Intelligence Community. ONSI facilitates intelligence coordination and information sharing with other members of the US Intelligence Community. This rapprochement can be seen as an outgrowth of the link between drug trafficking and terrorism. The operations division conducts the field missions organized by the DEA. Special agents plan and implement drug interdiction missions and undercover operations and develop networks of criminal informants (CIs). The cases put together by analysts and special agents are often made against low-level criminals and bargained away in return for information about their suppliers, in a bottom-up process. Within the operations division, the special operations division coordinates multijurisdictional investigations against major drug trafficking organizations. The majority of DEA special agents are assigned to this branch, which forms a significant part of the agency as a whole. A vast majority of DEA employees are special agents, while analysts form a minority of less than a thousand employees. The prominence of special agents in the agency reflects its emphasis on action and enforcement. The operational support division provides some of the key resources necessary for the success of the DEA's mission, including its information infrastructure and laboratory services. Within DEA laboratories, scientific experts analyze seized drugs and look for signatures, purity, and information on their manufacturing and the routes they may have followed.

The government's effort to reduce the supply and demand of drugs is a broad undertaking that involves a host of agencies. Many of the crimes and priority organizations targeted by the DEA transcend standard drug trafficking, and this requires effective interagency coordination. Within the federal government, the DEA and the FBI have primary responsibility for interior enforcement, which concerns those organizations and individuals who distribute and use drugs within the United States. The shared drug law enforcement responsibilities between these two agencies were originally supposed to fuse DEA
street knowledge and FBI money laundering investigation skills, leading to the establishment of joint task forces. Other agencies, such as the Internal Revenue Service (IRS), Immigration and Customs Enforcement (ICE), and Customs and Border Protection (CBP), are also involved in drug law enforcement. For example, the IRS assists the DEA with the financial aspects of drug investigations, and CBP intercepts illegal drugs and traffickers at entry points to the United States. Since the DEA has sole authority over drug investigations conducted abroad, it cooperates with numerous foreign law enforcement agencies, providing them with assistance and training to further US drug policies. Coordination between government agencies at the national and international levels continues to be one of the main challenges faced by DEA employees as they seek to implement US drug laws and policies.

Criticisms

The DEA has been criticized for focusing excessively on the number of arrests and seizures it conducts each year. The agency denied approximately $25.7 billion in drug trafficking revenues through the seizure of drugs and assets from 2005 to 2013, and its arrests rose from 19,884 in 1986 to 31,027 in 2015. Judging by these numbers, the agency has fared well in the last decades. However, drug trafficking and consumption have risen consistently in the United States, and no matter how much the agency is arresting and seizing, the market always provides more drugs. Targeting networks and traffickers has not deterred the widespread use of drugs. Some commentators argue that this focus is counterproductive to the extent that it leads to further crimes, raises the cost of illicit drugs, and augments the profits of drug traffickers. Other criticisms have focused on the ways in which the DEA targets suspects and have accused the agency of engaging in racial profiling. Critiques hold that drug control should focus more on the demand than the supply side and on the reasons behind the consumption of illegal drugs rather than on drug traffickers. From this perspective, the agency's resources would be better spent on drug treatment and education. Although the DEA has made some efforts to tackle demand, the latter have remained limited. The DEA's "kingpin strategy" and its focus on hard drugs like heroin and cocaine have also been criticized for their ineffectiveness because they overlook large parts of the drug trafficking business.

From Databases to Big Data

The ability to access, intercept, collect, and process data is essential to combating crime and protecting public safety. To fight the "war on drugs," the DEA developed its intelligence capabilities early on, making use of a centralized computer database to disseminate and share intelligence. The DEA keeps computer files on millions of persons of interest in a series of databases such as the Narcotics and Dangerous Drug Information System (NADDIS). In the last few decades, DEA investigations have become increasingly complex and now frequently require sophisticated investigative techniques, including electronic surveillance and more extensive document and media exploitation, in order to glean information related to a variety of law enforcement investigations. The increasing volumes and complexity of communications and related technologies have been particularly challenging and have forced the law enforcement agency to explore new ways to manage, process, store, and disseminate big databases. When doing so, the DEA has been able to rely on private sector capabilities and partner agencies such as the National Security Agency (NSA). Recent revelations (Gallagher 2014) suggest that the DEA has been able to access some 850 billion metadata records about phone calls, emails, cellphone locations, and Internet chats, thanks to a search engine developed by the NSA. Such tools allow security agencies to identify investigatory overlaps, track suspects' movements, map out their social
networks, and make predictions in order to develop new leads and support ongoing cases.
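As a purely illustrative sketch of the kind of link analysis described above – not a depiction of any actual DEA or NSA system – the following Python example (using the networkx library, with invented identifiers and records) shows how synthetic call-metadata from two hypothetical cases can be combined into a graph to surface shared contacts and candidate leads.

```python
# Illustrative sketch only: link analysis over synthetic call-metadata records.
# This is a generic social-network-analysis example; all identifiers are invented.
import networkx as nx

# Each record is a (caller, callee) pair drawn from two separate investigations.
case_a_calls = [("A1", "X9"), ("A1", "A2"), ("A2", "X9"), ("A3", "A1")]
case_b_calls = [("B1", "B2"), ("B2", "X9"), ("B3", "B1")]

g = nx.Graph()
g.add_edges_from(case_a_calls, case="A")
g.add_edges_from(case_b_calls, case="B")

# "Investigatory overlap": numbers appearing in the records of both cases.
case_a_nodes = {n for edge in case_a_calls for n in edge}
case_b_nodes = {n for edge in case_b_calls for n in edge}
print("Shared contacts across cases:", case_a_nodes & case_b_nodes)  # {'X9'}

# Simple centrality can suggest which number to prioritize as a new lead.
centrality = nx.degree_centrality(g)
print(sorted(centrality.items(), key=lambda kv: -kv[1])[:3])
```

Real systems operate at vastly larger scale and with far richer data, which is precisely what raises the privacy and oversight questions discussed next.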
The existence of large databases used by domestic law enforcement agencies poses important questions about the right to privacy, government surveillance, and the possible misuse of data. Journalists reported that telecommunication companies' employees have worked alongside DEA agents to supply them with phone data based on records of decades of American phone calls. These large databases are reportedly maintained by telecommunication providers, and the DEA uses administrative subpoenas, which do not require the involvement of the judicial branch, to access them. This has fostered concerns that the DEA may be infringing upon the privacy of US citizens. On the whole, the collection and processing of big data is only one aspect of the DEA's activities.

Cross-References

▶ Data Mining
▶ National Security Agency (NSA)
▶ Social Network Analysis

Further Reading

Gallagher, R. (2014). The surveillance engine: How the NSA built its own secret Google. The Intercept. https://firstlook.org/theintercept/2014/08/25/icreach-nsa-cia-secret-google-crisscross-proton/. Accessed 5 Apr 2016.
Lyman, M. (2006). Practical drug enforcement. Boca Raton: CRC Press.
Van Puyvelde, D. (2015 online). Fusing drug enforcement: The El Paso intelligence center. Intelligence and National Security, 1–15.
U.S. Drug Enforcement Administration. (2009). Drug enforcement administration: A tradition of excellence, 1973–2008. San Bernardino: University of Michigan Library.
E-agriculture

▶ AgInformatics

Earth Science

Christopher Round
George Mason University, Fairfax, VA, USA
Booz Allen Hamilton, Inc., McLean, VA, USA

Earth science (also known as geoscience) is the field of sciences dedicated to understanding the planet Earth and the processes that impact it, including the geologic, hydrologic, and atmospheric sciences (Albritton and Windley 2019). Geologic science concerns the features and composition of the solid Earth, hydrologic science refers to the study of Earth's water, and atmospheric science is the study of the Earth's atmosphere. Earth science aims to describe the planet's processes and features to understand its present state and how it may have appeared in the past, and will appear in the future.

Much of related research relies on Earth analytics, referring to the branch of data science used for Earth science. Big data in earth science is generated by satellites, models, networks of sensors (which are often part of the Internet of Things), and other sources (Baumann et al. 2016; Yang et al. 2019), and data science is critical for developing models of complex Earth phenomena. Time-series and spatial elements are common attributes for Earth science data. In reference to big data, technologies such as cloud computing and artificial intelligence are now being used to address challenges in using earth science data for projects that historically were difficult to conduct (Yang et al. 2019). While big data is used in traditional statistical analysis and model development, increasingly machine learning and deep learning are being utilized with big data in Earth analytics for understanding nonlinear relationships (Yang et al. 2019).
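As a brief, hedged illustration of that last point, the sketch below (Python with scikit-learn) fits a machine-learning regressor to synthetic Earth-observation-style variables whose relationship to the target is deliberately nonlinear; the variables, values, and thresholds are invented for demonstration only.

```python
# Illustrative sketch only: fitting a machine-learning model to synthetic
# Earth-observation-style data with a nonlinear (threshold-like) response.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_obs = 2000

sea_surface_temp = rng.uniform(18, 32, n_obs)   # degrees Celsius
humidity = rng.uniform(40, 100, n_obs)          # percent
elevation = rng.uniform(0, 2000, n_obs)         # meters

# Synthetic precipitation: a step change above a temperature threshold plus
# linear effects and noise, standing in for a nonlinear Earth process.
precip = (np.where(sea_surface_temp > 27, 80, 20)
          + 0.5 * humidity - 0.01 * elevation
          + rng.normal(0, 10, n_obs))

X = np.column_stack([sea_surface_temp, humidity, elevation])
model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, precip, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.round(2))
```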
Earth science contributes to and interacts with other scientific fields such as environmental science and astronomy. As an interdisciplinary field, environmental science incorporates the earth sciences to study the environment and provide solutions to environmental problems. Earth science also contributes to astronomy, focused on celestial objects, by contributing information that could be valuable for the study of other planets. In general, research and knowledge from earth science contribute to other sciences and play significant roles in understanding and acting on global issues (including social and political issues). For example, the race to access resources in the Arctic is a result of the understanding of changes in the
cryosphere (the portion of the earth that is solid ice), which is receding in response to rises in global temperatures (IPCC 2014; Moran et al. 2020).

In this regard, much of our understanding of climate change and future projections comes from complex computer models reliant on big data synthesized from a wide variety of data inputs (Farmer and Cook 2013; Schnase et al. 2016; Stock et al. 2011). These models' granularity has been tied to available computer processing power (Castro 2005; Dowlatabadi 1995; Farmer and Cook 2013). Considerable time must be placed into justifying the assumptions in these models and into what they mean for decision makers in the international community. On a more local level, weather reports, earthquake predictions, mineral exploration, etc. are all supported by the use of big data and computer modeling (Dastagir 2015; Hewage et al. 2020; Kagan 1997; Sun et al. 2019).

Further Reading

Albritton, C. C., & Windley, B. F. (2019, November 26). Earth sciences. Encyclopedia Britannica. https://www.britannica.com/science/Earth-sciences
Baumann, P., Mazzetti, P., Ungar, J., Barbera, R., Barboni, D., Beccati, A., Bigagli, L., Boldrini, E., Bruno, R., Calanducci, A., Campalani, P., Clements, O., Dumitru, A., Grant, M., Herzig, P., Kakaletris, G., Laxton, J., Koltsida, P., Lipskoch, K., et al. (2016). Big data analytics for earth sciences: The EarthServer approach. International Journal of Digital Earth, 9(1), 3–29. https://doi.org/10.1080/17538947.2014.1003106.
Castro, C. L. (2005). Dynamical downscaling: Assessment of value retained and added using the regional atmospheric modeling system (RAMS). Journal of Geophysical Research, 110(D5). https://doi.org/10.1029/2004JD004721.
Dastagir, M. R. (2015). Modeling recent climate change induced extreme events in Bangladesh: A review. Weather and Climate Extremes, 7, 49–60. https://doi.org/10.1016/j.wace.2014.10.003.
Dowlatabadi, H. (1995). Integrated assessment models of climate change: An incomplete overview. Energy Policy, 23(4–5), 289–296.
Farmer, G. T., & Cook, J. (2013). Types of models. In G. T. Farmer & J. Cook (Eds.), Climate change science: A modern synthesis: Volume 1 – The physical climate (pp. 355–371). Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-94-007-5757-8_18.
Hewage, P., Trovati, M., Pereira, E., & Behera, A. (2020). Deep learning-based effective fine-grained weather forecasting model. Pattern Analysis and Applications. https://doi.org/10.1007/s10044-020-00898-1.
IPCC. (2014). Summary for policymakers. In Climate change 2014: Impacts, adaptation, and vulnerability. Part A: Global and sectoral aspects. Contribution of Working Group II to the fifth assessment report of the Intergovernmental Panel on Climate Change. Cambridge: Cambridge University Press.
Kagan, Y. Y. (1997). Are earthquakes predictable? Geophysical Journal International, 131(3), 505–525. https://doi.org/10.1111/j.1365-246X.1997.tb06595.x.
Moran, B., Samso, J., & Feliciano, I. (2020, December 12). Warming arctic with less ice heats up Cold War tensions. PBS NewsHour. https://www.pbs.org/newshour/show/warming-arctic-with-less-ice-heats-up-cold-war-tensions
Schnase, J. L., Lee, T. J., Mattmann, C. A., Lynnes, C. S., Cinquini, L., Ramirez, P. M., Hart, A. F., Williams, D. N., Waliser, D., Rinsland, P., Webster, W. P., Duffy, D. Q., McInerney, M. A., Tamkin, G. S., Potter, G. L., & Carriere, L. (2016). Big data challenges in climate science: Improving the next-generation cyberinfrastructure. IEEE Geoscience and Remote Sensing Magazine, 4(3), 10–22. https://doi.org/10.1109/MGRS.2015.2514192.
Stock, C. A., Alexander, M. A., Bond, N. A., Brander, K. M., Cheung, W. W. L., Curchitser, E. N., Delworth, T. L., Dunne, J. P., Griffies, S. M., Haltuch, M. A., Hare, J. A., Hollowed, A. B., Lehodey, P., Levin, S. A., Link, J. S., Rose, K. A., Rykaczewski, R. R., Sarmiento, J. L., Stouffer, R. J., et al. (2011). On the use of IPCC-class models to assess the impact of climate on living marine resources. Progress in Oceanography, 88(1), 1–27. https://doi.org/10.1016/j.pocean.2010.09.001.
Sun, T., Chen, F., Zhong, L., Liu, W., & Wang, Y. (2019). GIS-based mineral prospectivity mapping using machine learning methods: A case study from Tongling ore district, eastern China. Ore Geology Reviews, 109, 26–49. https://doi.org/10.1016/j.oregeorev.2019.04.003.
Yang, C., Yu, M., Li, Y., Hu, F., Jiang, Y., Liu, Q., Sha, D., Xu, M., & Gu, J. (2019). Big earth data analytics: A survey. Big Earth Data, 3(2), 83–107. https://doi.org/10.1080/20964471.2019.1611175.

Eco-development

▶ Sustainability
E-Commerce

Lázaro M. Bacallao-Pino
University of Zaragoza, Zaragoza, Spain
National Autonomous University of Mexico, Mexico City, Mexico

Synonyms

Electronic commerce; Online commerce

Electronic commerce, commonly known as e-commerce, has been defined in several ways, but, in general, it is the process of trading – both buying and selling – products or services using computer networks, such as the Internet. Although a timeline for the development of e-commerce usually includes some experiences during the 1970s and the 1980s, analyses agree in considering that it has been from the 1990s onward that there has been a significant shift in methods of doing business with the emergence of e-commerce. It draws on a diverse repertory of technologies, from automated data collection systems, online transaction processing, or electronic data interchange to Internet marketing, electronic funds transfer, and mobile commerce, usually using the WWW for at least one process of the transaction's life cycle, although it may also use other technologies such as e-mail.

As a new way of conducting business online, e-commerce has become a topic analyzed by academics and businesses, given its rapid growth – some estimates consider that global e-commerce will reach almost $1.4 trillion in 2015, while Internet retail sales for 2000 were $25.8 billion – and the increasing trend for consumers' purchasing decisions to be made in an online environment, as well as the rising number of people engaging in e-commerce activities. The world's largest e-commerce firms are the Chinese Alibaba Group Holding Ltd., with sales for 2014 estimated at $420 billion; Amazon, with reported sales of $74.4 billion for 2013; and eBay, with reported sales of $16 billion for 2013. Other major online providers with a strong presence in their home and adjacent regional markets are Rakuten in Japan, Kobo in India, Wuaki in Spain, or Zalando in Europe.

Among the far-reaching ramifications of the emergence of the Internet as a tool for the business-to-consumer (B2C) aspect of e-commerce, an aspect underlined by many analyses is the necessity to understand how and why people participate in e-commerce activities, in a context where businesses have more opportunities to reach out to consumers in a very direct way. Regarding this aspect, the novelty of the online environment and, consequently, of e-commerce as a phenomenon has produced a diversity of criteria for understanding online shopping behavior, from positions that consider actual purchases as a measure of shopping to others that employ self-reports of time online and frequency of use as a criterion.

When analyzing consumer behavior in e-commerce activities, two dimensions highlighted by many studies are customer loyalty and website design. On the one hand, changes in customer loyalty in e-commerce have been a topic of particular concern among researchers and businessmen, since the instantaneous availability of information on the Internet has been seen as a circumstance that erodes brand loyalty, as potential buyers can compare the offerings of sellers worldwide, reducing the information asymmetries among them and modifying the bases of customer loyalty in digital scenarios. On the other hand, website design is considered one of the most important factors for successful user experiences in e-commerce. Besides the website's usability – recognized as a key to e-commerce success – several researchers have also argued that a successful e-commerce website should be designed in a way that inspires customers' trust and engagement, persuading them to buy products. It is assumed that its elements affect online consumers' intentions to the extent of even influencing their beliefs related to e-commerce and, consequently, their
attitudes as e-buyers, as well as their sense of confidence or perceived behavioral controls.

In that scenario of an excess of information, e-commerce websites have developed some technological resources to facilitate consumers' online decisions by giving them information about product quality and some assistance with product search and selection. One of those technological tools is the so-called recommendation agent (RA), software that – based on individual consumers' interests and preferences about products – provides certain advice on products that match these interests and predilections. RAs thus become technological resources that can potentially improve the quality of consumers' decisions by helping to reduce the amount of information they have about products as well as the complexity of the online search on e-commerce websites.

As specifically e-commerce technological resources, RAs raise a number of debates associated with some critical social issues, including privacy and trust. Those applications are part of the efforts of commercial websites to provide certain information about a product to attract potential online shoppers, but since those suggestions are based on information collected from users – their preferences, shopping history or browsing pattern, or the pattern of choices by other consumers with similar profiles – consumers are concerned about what information is collected, whether it is stored, and how it is used, mainly because it can be obtained explicitly but also in an implicit way. Seen by some authors as an example of mass customization in e-commerce, RAs include different models or forms, from general recommendation lists or specific and personalized suggestions of products to customer comments and ratings, community opinions or critiques, notification services, and other deeper kinds of personalization.
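A minimal sketch of how such a recommendation agent can work is given below (in Python, using a tiny invented user-item ratings matrix and item-based collaborative filtering); the product names, ratings, and similarity measure are illustrative assumptions rather than a description of any particular commercial system.

```python
# Illustrative sketch only: a tiny recommendation agent using item-based
# collaborative filtering over a hypothetical user-item ratings matrix.
import numpy as np

# Rows = users, columns = products; 0 means "not yet rated/purchased".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
products = ["laptop", "tablet", "novel", "cookbook"]

# Cosine similarity between product columns.
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user_idx, top_n=2):
    """Score unrated products by similarity-weighted ratings of rated ones."""
    user = ratings[user_idx]
    scores = sim @ user                  # aggregate similarity-weighted signal
    scores[user > 0] = -np.inf           # do not re-recommend items already rated
    best = [i for i in np.argsort(scores)[::-1] if np.isfinite(scores[i])][:top_n]
    return [products[i] for i in best]

print(recommend(0))   # suggestions for the first (hypothetical) shopper
```

In practice, commercial RAs combine many more signals – browsing patterns, purchase histories, and the behavior of similar consumers – which is precisely what raises the privacy questions noted above.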
Debates on a Multidimensional Phenomenon

Debates on e-commerce have highlighted both its opportunities and challenges for businesses. For many authors, advantages of e-commerce include aspects such as the millions of products available to consumers at the largest e-commerce sites, the access to narrow market segments that are widely distributed, greater flexibility, lower cost structures, faster transactions, broader product lines, greater convenience, as well as better customization. The improvement of the quality of products and the creation of new methods of selling have also been considered benefits of e-commerce. But, at the same time, it has been noted that e-commerce also sets a number of challenges for both consumers and businesses, such as choosing among the many available options, the consequences of the new online environment for the processes of buying and selling – for instance, the particularities of consumers' online decision making – consumers' confidence, and privacy or security issues.

E-commerce is a market strategy where enterprises may or may not have a physical presence. Precisely, the online/offline interrelationships have been a particular point at issue in the analyses of e-commerce. For instance, some authors have analyzed its impact on shopping malls, which includes changes in shopping space, rental contracts, the shopping mall visit experience, service, image, multichannel strategy, or lower in-shop prices. Instead of regarding e-commerce as a threat, these analyses suggest that shopping malls should examine and put into practice integration strategies with e-commerce, for instance, by virtual shopping malls, portals, click and collect, check and reserve, showrooms, or virtual shopping walls.

In the same line, another dimension of analyses has been the comparison between returns obtained by conventional firms from e-commerce initiatives and returns to the net firms that only have an online presence. Other relevant issues have been the analysis of how returns from business-to-business (B2B) e-commerce compare with returns from B2C e-commerce and how the returns to e-commerce initiatives involving digital goods compare to initiatives involving tangible goods. In this sense, there are opposite opinions; while some authors have considered that the opportunities in the B2B e-commerce field far exceed the opportunities in the B2C one, others have suggested that the increasing role of ICTs in everyday life
creates rising opportunities for B2C e-commerce.

Tendencies on E-Commerce

Some general assumptions on e-commerce have suggested that, as a consequence of the reduced search costs associated with the Internet, it would encourage consumers to abandon traditional marketplaces in order to find lower prices, and online sellers would be more efficient than offline competitors, forcing traditional offline stores out of business. Other authors have considered that, since there are increasing possibilities of direct relationships between manufacturers and consumers, some industries would become disintermediated. However, contrary to those tendencies, previous research has noted that few of those assumptions proved to be correct, since the structure of the retail marketplace in countries with high levels of B2C e-commerce – such as the United States – has not followed those trends, because consumers also give importance to other aspects besides prices, such as brand name, trust, reliability, and delivery time.

The main trends in e-commerce, observed by different studies, include a tendency towards social shopping users, who share their opinions and recommendations with other buyers through online viral networks, in line with the increasing use of interactive multimedia marketing – blogs, user-generated content, video, etc. – and the Web 2.0. At the same time, analyses agree in noting the increasing profits of e-commerce, as well as the rising diversity of goods and services available online and the average annual amount of purchases. The development of customized goods and services, the emphasis on improved online shopping experiences by focusing on easy navigation or offering online inventory updates, and the effective integration of multiple channels by sellers – including alternatives such as online order and in-store pickup, "bricks-and-clicks," "click and drive," online web catalogs, gift cards, or in-store web kiosk ordering – are also some relevant trends.

E-commerce innovation and changes are, to some extent, inherently associated with the permanently changing nature of ICTs and the continuous development of new technologies and applications, following a process through which sellers' strategies and consumer behavior evolve with the technology. In that sense, two current tendencies are the emergence of what has been called social commerce and increasing mobile e-commerce or m-commerce.

Social commerce, a new trend with no stable and agreed-upon definition that has been a topic of research for few analyses, refers to the evolution of e-commerce as a consequence of the adoption of Web 2.0 resources to enhance customer participation and achieve greater economic value. Debates on it analyze, for instance, specific design matters of social commerce platforms – such as Amazon and Starbucks on Facebook – and their relations to e-commerce and Web 2.0 and propose a multidimensional model of it that includes individual, conversation, community, and commerce levels.

Mobile e-commerce, for its part, although originally proposed in the late 1990s to refer to the delivery of e-commerce capabilities into the consumer's hand via wireless technologies, has recently become a subject of debate and research, as the number of smartphones has risen and this is now the primary way of going online for one-third of smartphone users. M-commerce has moved away from SMS systems and into current applications, in this way avoiding security vulnerabilities and congestion problems. Many payment methods are available for m-commerce consumers, including premium-rate phone numbers, charges added to the consumer's mobile phone bill, and credit cards – allowing, in some cases, credit cards to be linked to a phone's SIM card – as well as micropayment services or stored-value cards, frequently used with mobile device application stores or music stores. Some authors argue that this transition to m-commerce will have similar effects as the Internet had for traditional retailing in the late 1990s.

Although early m-commerce consumers appear to be previous heavy e-commerce users, research on the social dimension of the phenomenon has concluded, e.g., that there are differences in user behavior across mobile applications and regular Internet sites and, besides
this, mobile shopping applications appear to be also associated with an immediate and sustained increase in total purchasing. These findings on m-commerce corroborate the articulation between the technological, business, and sociocultural and behavioral dimensions of e-commerce.

Cross-References

▶ Information Society
▶ Online Advertising

Further Reading

Einav, L., Levin, J., Popov, I., & Sundaresan, N. (2014). Adoption and use of mobile e-commerce. American Economic Review: Papers & Proceedings, 104(5), 489–494. https://doi.org/10.1257/aer.104.5.489.
Huang, Z., & Benyoucef, M. (2013). From e-commerce to social commerce: A close look at design features. Electronic Commerce Research and Applications, 12(4), 246–259.
Laudon, K. C., & Guercio Traver, C. (2008). E-commerce: Business, technology, society. Upper Saddle River: Pearson Prentice Hall.
Schafer, J. B., Konstan, J. A., & Riedl, J. (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1), 115–153.
Turban, E., Lee, J. K., King, D., Peng Liang, T., & Turban, D. (2009). Electronic commerce 2010. Upper Saddle River: Prentice Hall Press.

Economics

Magdalena Bielenia-Grajewska and Magdalena Bielenia-Grajewska
Division of Maritime Economy, Department of Maritime Transport and Seaborne Trade, University of Gdansk, Gdansk, Poland
Intercultural Communication and Neurolinguistics Laboratory, Department of Translation Studies, University of Gdansk, Gdansk, Poland

Economics can be briefly defined as the discipline that focuses on the relation between resources, demand and supply of individuals and organizations, as well as the processes that are connected with the life cycle of products. Walter Wessels (2000) in his definition highlights that economics shows people how to allocate their scarce resources. For centuries, people have been making economic choices about the most advantageous process of allocating relatively scarce resources and choosing the needs to be met. From this perspective, economics is the science of how people use the resources at their disposal to meet various material and nonmaterial needs. However, big data have brought dramatic changes to economics as a field. In particular, beyond traditional econometric methods, new analytic skills and approaches – especially those associated with machine learning – are required to engage big data for economics research and applications (Harding and Hersh 2018).
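As a hedged illustration of that point, the short Python sketch below contrasts an ordinary least squares specification with a regularized machine-learning estimator (LASSO) on synthetic data with many candidate predictors; the data, variable counts, and coefficients are invented purely for demonstration.

```python
# Illustrative sketch only: contrasting a traditional linear specification with
# a machine-learning estimator on synthetic "big data" with many candidate
# predictors. All variables and values are invented.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 400, 60                      # many observations, many candidate features
X = rng.normal(size=(n, p))
# Only a handful of features actually drive the outcome (e.g., spending).
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=1.0, size=n)

ols = LinearRegression()
lasso = LassoCV(cv=5)               # data-driven variable selection

print("OLS   R^2:", cross_val_score(ols, X, y, cv=5).mean().round(3))
print("LASSO R^2:", cross_val_score(lasso, X, y, cv=5).mean().round(3))
# With many candidate predictors, the regularized learner typically generalizes
# better and highlights which predictors matter.
```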
Processes of Rational Management (Change from Homo Oeconomicus to the Machine of Economics)

According to D.N. Wagner (2020), changing economics practice includes the process of discovering how economic patterns change under the influence of technological innovations. He claims that one of the specific economic patterns influenced by artificial intelligence (AI) is the so-called machina economica (its predecessor was homo oeconomicus) entering the world economy. What is more, Wagner (2020) shows that disciplines like economics and computer science use an analytical perspective rooted in institutional economics. In more detail, the author presents the economic model of a world with AI using an analytical angle grounded in institutional economics; in the context of artificial intelligence, it is no surprise that artificial intelligence agents were also created as economic actors. Moreover, he claims that homo economicus has long served as a desirable role model for artificial intelligence (AI). The first researcher interested in the economic rationality of man was A. Smith, who introduced the model of man rationally operating in the
sphere of economy. According to the ideology of man as an individual seeking to maximize profit, it is worth noting that an entrepreneur can be treated as homo oeconomicus. The paradigm of mainstream economics postulates that the economic entity (homo oeconomicus) is guided by its own interest when making decisions. This perspective on the decisions of economic man is called methodological individualism. This position assumes that individual egoism (selfish motives; internally defined interest) is of great importance because the decisions of countless free individuals create social welfare. Moreover, mainstream economics holds that the economic system is the sum of all the economic units (homo oeconomicus). Epistemology based on methodological individualism ignores the very important fact that an individual making free choices acts in a specific social context; it does not remain in isolation from the surrounding world. At the beginning of the institutional changes of the years 1980–1990, the paradigm of economism was used as an effect of looking for an answer to the question about the proper paradigm of socio-economic development. The characteristics of the economics paradigm in economic sciences are discussed based on the homo oeconomicus model, with economic decisions based on the economic value of results. Mainstream economists perceive many successes, including economic development, in the economic unit. Taking into account the recent economic crisis, it is worth remembering that the mainstream economic paradigm should be enriched with contextual analysis (social, cultural, etc.). The new institutional economics (NIE) criticizes mainstream economics for reducing human existence to homo oeconomicus and excluding its social rooting. The criticism of the achievements of mainstream economics is caused by some questionable assumptions attached to the concept of homo oeconomicus, seen as a model of rational choice in its extreme version, and by the assumption that the rules of the game result from the interaction of individuals. As D.N. Wagner claims, the institutional economic perspective and the influence of neoclassical economics (with its model of man – homo oeconomicus – the so-called welcome role model for AI) establish suitable notions and analytical frameworks for a world with artificial intelligence.

The observation of economic reality shows that complementing the economic analysis typical of mainstream economics with the theme of methodological holism becomes almost essential. Institutional economists (the advocates of the new institutional economy) see this issue in a similar way as sociologists, who reject the assumption of perfect rationality of individual subjects (individuals) making decisions. A broader and more multi-faceted concept is needed. The new institutional economy examines socio-economic phenomena much better than neoclassical economics, mainly by assuming limited individual rationality. To be predictive, the set of fundamental assumptions (paradigm) of modern economics should be based on the assumption of the emergence of levels of integration of social phenomena (the assumption of the new institutional economy). In this sense, the new institutional economy combines two perspectives: methodological individualism and methodological holism. Methodological individualism has been described as the position of mainstream economics that preaches the primacy of homo oeconomicus, while the economic theory called the new institutional economy refers to the need for both methodological individualism and methodological holism, in which the starting point is the sociological man (homo sociologicus) or the socio-economic man. The theoretical orientation of methodological holism can be briefly and generally presented as the framework within which the host entity should be perceived as an interconnected environment, thus showing that its decisions are influenced by historical, cultural, and social context (the primacy of a holistic approach to phenomena). Epistemology based on methodological holism takes as its starting point the behavior of society (of a certain community and not the behavior of an individual) for understanding socio-economic mechanisms. The methodology called holism is undoubtedly functional in the process of analyzing the network of dependencies of a given community, because it allows showing the individual with the so-called blessing of the inventory and taking into account
the network of human relations with the institutional biospheres.

Economics: Different Typologies and Subtypes

As Bielenia-Grajewska (2015a) discusses, economics focuses on the problem of how scarce resources with different uses are distributed (allocated) in a society. The purpose of these activities is to produce goods. Since resources are limited, the possibilities of their use are numerous and diverse; therefore, it is examined how the produced goods are divided among members of society. Taking into account different typologies of economics discussed by researchers, one of the most well-known classifications of economics is the taxonomy of microeconomics and macroeconomics. Macroeconomic analysis makes it possible to analyze the problems of allocation at the level of the whole society, the whole national economy. As far as macroeconomics is concerned, such issues as inflation, unemployment, demand and supply, business cycles, and exchange rates, as well as fiscal and monetary policy, are examined by researchers. Microeconomic analysis helps consider allocation processes at the level of individual economic entities (enterprise, consumer). Microeconomics focuses on, among others, price elasticity, competition, monopoly, and game theory.

Another division of economics takes into account the scope of research: national economics and international economics. The difference in focus is also visible in subcategorizing economics by taking into account the sphere of life it governs; consequently, sports economics or the economics of leisure or tourism can be distinguished. Economics is also studied through the type of economy, with such subtypes analyzed as the market economy and the planned economy. Although economics may seem to some a traditional and fixed domain as far as the scope of research is concerned, it actively responds to the changes taking place in modern research. For example, the growing interest in neuroscience and cognition has led to the creation and development of such disciplines as behavioral economics and neuroeconomics, studying how the brain and the nervous system may provide data on economic decisions. It should be stated that economics does not exist in a vacuum; economic conditions are determined by culture, politics, natural resources, etc. In addition, economics is not only a domain that is shaped by other disciplines as well as different internal and external factors, but it is also the type of discipline that influences other areas of life. For example, economics affects linguistics, since it leads to the creation and dissemination of new terms denoting the economic reality.

Goods: Products and Services

In the process of management, man produces goods to satisfy his needs. In economic theory, goods are divided into products and services. Products are those goods whose purchase and use make people their legal owners (e.g., food, clothing, furniture, a car). Services, on the other hand, are economic goods giving purchasers the right to use them only temporarily (e.g., the service of an airplane flight does not make a human being the owner of an airplane). In the era of the developing world economy, there is a huge number of various products and services aimed at satisfying more and more exorbitant human needs. Together with technological developments, new products and services are constantly appearing, the manufacturers of which aim at satisfying more and more diverse and sophisticated needs of buyers.

In the area of big data, diverse data related to buyers and products are created, stored, and analyzed. It is assumed that the concept of Big Data was first discussed in 1997 by Cox and Ellsworth in Managing Big Data for Scientific Visualization. The above-mentioned authors noted that there are two different meanings of this concept: big data collections and big data objects. The use of Big Data's and Data Science's knowledge and techniques is becoming increasingly common from the perspective of these products and services as well. Big Data, including the analysis of large data sets generated by various types of IT
systems, is widely used in various areas of business and science, including economics. The data streams generated by smartphones, computers, game consoles, household appliances, installations and software in apartments and homes, and even clothes are of great importance for modern business. Emails, the location of mobile devices, social media such as Facebook, LinkedIn, or Twitter, and blogs lead to the growth of data. Every activity taken on the Internet generates new information. From the perspective of goods and services, Data Science makes it possible, on the basis of data sent from the above-mentioned devices, to determine users' preferences and even to forecast their future behavior. Big Data is a potential flywheel for the development of global IT, which is very important for the economy. Another issue concerning the use of the possibilities of Big Data is related to the demand for goods and services. Demand depends on the purchasing power (derived from the economic opportunity) of citizens. The concept of purchasing power is closely linked to the income of people and the price of goods. The demand for goods is treated as the amount of goods that people want to have if there is no limit to purchasing power. Technological development has enabled people to communicate more effectively; Big Data systems processing unimaginable amounts of data provide information for more efficient distribution of resources and products. Big Data (a large amount of data and its processing) over the next few years can become a tool to precisely target the needs of customers. Nowadays, so-called wearable technology or wearable devices, i.e., intelligent electronic equipment that individuals wear, are very popular among users. The history of wearable technology began with a watch tracking human activity. Wearable devices are an example of the Internet of Things; the huge potential of the Internet of Things is also demonstrated by the current and forecast number of devices connected to the network. Due to the very dynamic development of the concept and application of Big Data in various areas of human activity, there is more and more talk about the possibility of using related methods of data analysis in activities aimed at improving competitiveness. Big Data is considered a tool that will allow targeting of customer needs and forecasting investments, company portfolios, etc., with high accuracy.

Economy and Technological Development

As Bielenia (2020) states, the global expansion of online technologies has changed global business and the work environment for organizations. The impact of Big Data in the economic field (data analysis and predictive modeling) is tremendous. Big Data encompasses a series of concepts and activities related to the acquisition, maintenance, and operation of data. It is worth mentioning that the findings of the EMC Digital Universe study, with research and analysis by IDC, The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, showed that digital bits are doubling in size every 2 years; in 2020 the digital universe reached the value of 44 zettabytes, or 44 trillion gigabytes. Analyzing and utilizing Big Data leads to improved predictions. New databases and statistical techniques open up many opportunities. Big Data has become an important part of economists' work and requires new skills in computer science and databases. Thanks to the use of Big Data, analysts gain access to huge amounts of reference and comparative data that allow them to simulate social and economic processes. The increasing time range of the available data allows generating more and more reliable information about trends in the economy. Big Data is a tool that helps entities to better understand their own environment and the consumers who use their products or services. Big Data applied to the economy corresponds to the use of scarce resources. Doug Laney of META Group (now Gartner), in the publication 3D Data Management: Controlling Data Volume, Velocity, and Variety, defined the concept of Big Data in the "3 V" model: volume (the amount of data), velocity (the speed at which data is generated and processed), and variety (the type and nature of the data). Over the years, the 3 V model has been expanded with an additional dimension, veracity, creating a 4 V model. Extracting value from
Extracting volume from data is characterized by "4 Vs": occurrence in large amounts (volume), huge variety, high variability (velocity), and a significant value. According to Balar and Chaabita, Balar and Naji, and Hilbert, Big Data is characterized by 5 Vs: volume, variety, velocity, value, and veracity. Volume refers to the sheer volume of generated and stored data. Variety means that the data comes from various sources; Big Data can use structured as well as unstructured data. Velocity corresponds to the speed at which the data arrives and the time in which it is analyzed. Value is related to the selection of the data analyzed, i.e., which data will be relevant and valuable and which will be useless. Finally, veracity relates to data reliability. In other words, data credibility relates to the truthfulness of the data as well as the data regularity.

Economics and Big Data

Taylor, Schroeder, and Meyer (2014) state that providing a definition of big data in economics is not an easy task. First of all, they stress that the discussion on big versus not big data is still going on in social science. Secondly, economics as a discipline has been using databases and searching for tools that make it possible to deal with considerably large amounts of data for years. As Bielenia and Podolska (2020) state, an innovative economy based on knowledge and modern technological solutions could not function without the Internet. Technological development has meant the generation of more and more computer data. There is a lot of data provided over the Internet, and it comes from various sources (market data, social networks, own sales systems, or partner systems). The amount of data collected is enormous and grows with each new action performed via the Internet by users. The concept of Big Data is therefore characterized by a few dimensions like volume, velocity, variety, value, and veracity. Most of the studies in the field of computer science that deal with the 3 V, 4 V, or 5 V problem in the context of managing and drawing knowledge from big data sets have one goal: how to tame Big Data and give it a structure. Thus, instead of providing a clear-cut definition, the focus should rather be placed on some tendencies that determine the link between economics and big data.

Although economists have been facing studies of extensive amounts of data for years, modern economics has to deal with big data connected with rapid technological advancements in the sphere of economic activities. For example, nowadays many individuals purchase goods and services online and, consequently, e-commerce is connected with generating and serving big data related to sales, purchases, customers, and the role of promotion in customer decisions. Big data is also visible in the sphere of online trading, with many investors purchasing and selling financial instruments on the web. The birth of such financial opportunities as forex, continuous trading, or computerized stock exchanges has also influenced the amount of big data handled on an everyday basis. In addition, the relation between economics and big data can be observed from the discipline perspective. Thus, the role of big data may be studied by taking into account economics subdisciplines and their connection with big data. In microeconomics, big data is related to labor economics, whereas macroeconomists focus on big data related to monetary policies. Big data has also led to growing research interest in econometrics and statistics that may facilitate the process of gathering and analyzing data. Taking into account the relatively new subdomains of economics, such as behavioral economics and neuroeconomics, neuroscientific tools facilitate the process of acquiring information (Bielenia-Grajewska 2013, 2015b). These tools prove to be useful especially if obtaining data from other sources is connected with a high risk of, e.g., respondents providing fake answers or leaving questions unanswered. In addition, neuroscientific tools offer data on different issues simultaneously by observing the brain in a complex way. Apart from the scientific perspective, big data is also connected with studying economics. The learning dimension encompasses how data is managed by students and how the presented data influences the perception and cognition of economic notions.
The profit dimension of economics being connected with big data is the focus on the economical character of gathering and storing data. Thus, specialists should rely on methods that do not generate excessive costs. Moreover, some companies try to sell the big data they possess in order to gain profit.

Taking into account available methods, the performance of companies in relation to big data is studied by applying, e.g., the Big Data Business Model Maturity Index by Bill Schmarzo. Schmarzo's model consists of the following phases: Business Monitoring, Business Insights, Business Optimization, Data Monetization, and Business Metamorphosis. Business Monitoring encompasses the use of Business Intelligence and traditional methods to observe business performance; this stage concentrates on trends, comparisons, benchmarks, and indices. The second phase, Business Insights, is connected with using statistical methods and data mining tools to deal with unstructured data. In the third phase, called Business Optimization, the focus is on the automatic optimization of business operations; an example includes the use of algorithms in trading by financial companies. The fourth stage, named Data Monetization, is devoted to taking advantage of big data to generate revenue; an example is the creation of "intelligent products" that follow customer behaviors and needs. The last phase, called Business Metamorphosis, is connected with changing the company into a business entity operating in new markets or offering new services able to meet the complex needs of customers.

The size and changing (unstructured) parameters of such data make traditional management and analysis impossible. As Racka (2016) claims, for very large datasets reaching huge sizes, technological solutions such as vertical scaling (the purchase of better and better machines for Big Data purposes) or horizontal scaling (expansion by adding more machines) are used. The advantage of Big Data technology is that it allows one to analyze fast incoming and changing data in real time, without having to enter it into databases. Referring to the author, the most commonly used Big Data technology solutions currently include NoSQL, MapReduce, and Apache Hadoop. Also, Blazquez and Domenech (2018) define the data lifecycle within a Big Data paradigm. The steps to manage data are as follows: (1) Study and planning, (2) Data collection, (3) Data documentation and quality assurance, (4) Data integration, (5) Data preparation, (6) Data analysis, (7) Publishing and sharing, and (8) Data storage and maintenance.

Cross-References

▶ Behavioral Analytics
▶ Business
▶ Decision Theory
▶ E-commerce

Further Reading

Balar, K., & Chaabita, R. (2019). Big Data in economic analysis: Advantages and challenges. International Journal of Social Science and Economic Research, 04(07).
Balar, K., & Naji, A. (2015). A model for predicting ischemic stroke using data mining algorithms IJISET. International Journal of Innovative Science, Engineering & Technology, 2(11).
Bielenia, M. (2020). Different approaches of leadership in multicultural teams through the perspective of actor-network theory. In I. Williams (Ed.), Contemporary applications of actor network theory. Singapore: Palgrave Macmillan.
Bielenia, M., & Podolska, A. (2020). Powszechny dostęp do Internetu jako prawo człowieka i warunek rozwoju gospodarczego. In B. Daria, K. Ryszard, & M. Prawnicze (Eds.), Prawa człowieka i zrównoważony rozwój: konwergencja czy dywergencja idei i polityki. Warszawa: Wydawnictwo C.H. Beck.
Bielenia-Grajewska, M. (2013). International neuromanagement. In D. Tsang, H. H. Kazeroony, & G. Ellis (Eds.), The Routledge companion to international management education. Abingdon: Routledge.
Bielenia-Grajewska, M. (2015a). Economic growth and technology. In M. Odekon (Ed.), The SAGE encyclopedia of world poverty. Thousand Oaks: SAGE Publications.
Bielenia-Grajewska, M. (2015b). Neuroscience and learning. In R. Gunstone (Ed.), Encyclopedia of science education. Dordrecht: Springer.
Blazquez, D., & Domenech, J. (2018). Big Data sources and methods for social and economic analyses. Technological Forecasting and Social Change, 130.
Cox, M., & Ellsworth, D. (1997). Managing Big Data for scientific visualization. Siggraph. www.dcs.ed.ac.uk/teaching/cs4/www/visualisation/SIGGRAPH/gigabyte_datasets2.pdf.
EMC. https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
Harding, M., & Hersh, J. (2018). Big Data in economics. IZA World of Labor, 451. https://doi.org/10.15185/izawol.451.
Hilbert, M. Big Data for development: A review of promises and challenges. Development Policy Review. martinhilbert.net. Retrieved 2015-10-07.
Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. META Group (now Gartner) [online http://blogs.gartner.com].
Racka, K. (2016). Big Data – znaczenie, zastosowania i rozwiązania technologiczne. Zeszyty Naukowe PWSZ w Płocku Nauki Ekonomiczne, t. XXIII.
Schmarzo, B. (2013). Big Data: Understanding how data powers big business. Indianapolis: John Wiley & Sons, Inc.
Taylor, L., Schroeder, R., & Meyer, E. (2014, July–December). Emerging practices and perspectives on Big Data analysis in economics: Bigger and better or more of the same? Big Data & Society.
Wagner, D. N. (2020). Economic patterns in a world with artificial intelligence. Evolutionary and Institutional Economics Review, 17.
Wessels, W. J. (2000). Economics. Hauppauge: Barron's Educational Series, Inc.

Education

▶ Data Mining

Education and Training

Stephen T. Schroth
Department of Early Childhood Education, Towson University, Baltimore, MD, USA

The use of big data, which involves the capture, collection, storage, collation, search, sharing, analysis, and visualization of enormous data sets so that this information may be used to spot trends, prevent problems, and to proactively engage in activities that make success more likely, has become increasingly popular and common. As the trend toward using big data has coincided with large-scale school reform efforts, which have provided increased data regarding student and teacher performance, operations, and the needs of educational organizations, more and more school districts have turned to using big data to solve some of the problems they face. While certain leaders of schools and other organizations responsible for training students have rushed to embrace the use of big data, those concerned with student privacy have sometimes been critical of these attempts. The economic demands of setting up systems that permit the use of big data have also hindered some efforts by schools and training organizations to use this, as these bodies lack the infrastructure necessary to proceed with such efforts. As equipment and privacy concerns are overcome, however, the use of big data by schools, colleges, universities, and other training organizations seems likely to increase.

Background

Government agencies, businesses, colleges, universities, schools, hospitals, research centers, and a variety of other organizations have long collected data regarding their operations, clients, students, patients, and findings. With the emergence of computers and other electronic forms of data storage, more data than ever before began to be collected during the last two decades of the twentieth century. Because this data was often kept in separate databases, however, and was inaccessible to most users, much of the information that could be gleaned from it was not used. As technologies developed, however, many businesses became increasingly interested in making use of this information. Big data became seen as a way of organizing and using the numerous sources of information in ways that could benefit organizations and individuals.

By the late 1990s, interest in the field that became known as infonomics surged as companies and organizations wanted to make better use of the information they possessed, and to utilize it in ways that increased profitability. A variety of consulting firms and other organizations began working with large corporations and organizations in an effort to accomplish this. They defined big data as consisting of three "v"s: volume, variety, and velocity. Volume, as used in this context,
refers to the increase in data volume caused by technological innovation. This includes transaction-based data that has been gathered by corporations and organizations over time, but also includes unstructured data that derives from social media and other sources, as well as increasing amounts of sensor and machine-to-machine data. For years, excessive data volume was a storage issue, as the cost of keeping much of this information was prohibitive. As storage costs have decreased, however, cost has diminished as a concern. Today, how best to determine relevance within large volumes of data, and how best to analyze data to create value, have emerged as the primary issues facing those wishing to use it. Velocity refers to the amount of data streaming in at great speed, which raises the issue of how best to deal with this in an appropriate way. Technological developments, such as sensors and smart meters, and client and patient needs emphasize the necessity of overseeing and handling inundations of data in near-real time. Responding to data velocity in a timely manner represents an ongoing struggle for most corporations and other organizations. Variety in the types of formats in which data today comes to organizations presents a problem for many. Data today includes that in structured numeric forms which is stored in traditional databases, but has grown to include information created from business applications, e-mails, text documents, audio, video, financial transactions, and a host of others. Many corporations and organizations struggle with governing, managing, and merging different forms of data.

Some have added two additional criteria to these: variability and complexity. Variability concerns the potential inconsistency that data can demonstrate at times, which can be problematic for those who analyze the data. Variability can hamper the process of managing and handling the data. Complexity refers to the intricate process that data management involves, in particular when large volumes of data come from multiple and disparate sources. For analysts and other users to fully understand the information that is contained in these data, they must first be connected, correlated, and linked in a way that helps users make sense of them.

Schools, colleges, universities, and other training centers have long had access to tremendous amounts of data concerning their students, teachers, and other operations. Demographic information concerning age, gender, ethnicity, race, home language, addresses, parents' occupations, and other such data are collected as a matter of course. Evidence of students' academic performance also exists from a variety of sources, including teacher gradebooks, achievement tests, standardized tests, IQ tests, interest inventories, and a variety of other sources of information. As technological innovations such as computers, tablets, and other devices have become common in educational settings, it has become possible to gather an enormous amount of data related to how students think and perform, as well as how they make errors. As interest in school reform and improvement grew, so too did notice that a vast amount of data existed in education and training programs that was going unused. As a result, a great deal of effort has been put into attempts to create ways to harness this information through the use of big data analysis to offer solutions that might improve student performance.

Educational Applications

Educational and training programs have long collected data regarding students. Traditionally, however, much of this data remained in individual classrooms and schools, and was inaccessible by administrators and policy makers concerned with student learning. Although many local education authorities in the United States traditionally collected certain data regarding student performance, the federal No Child Left Behind legislation, passed in 2001, commenced a period when data regarding student performance in literacy and mathematics was collected to a greater degree than ever before. This practice was duplicated in most other nations, which resulted in an influx of data related to schools, teachers, and students. While much of this data was collected and transferred using traditional methods, over the past decade schools began using cloud storage that permitted easier access for district leaders.
Schools also started sending more data to state education agencies, which allowed it to be collected and analyzed in more sophisticated ways than ever before. As schools have increasingly used more programs, apps, tablets, and other electronic devices in an attempt to improve student performance, the amount of data has also grown. Schools and other organizations can now collect information that reflects not just student performance, but that indicates how a student thought about a problem when answering. Data can include individual keystrokes and deletions, eye movement, or how long a student held a mouse pointer above a certain answer.

Big data has been touted as providing many potential benefits for educational institutions and students. By providing the tools to collect and analyze data that schools, colleges, universities, and training programs already collect, big data will allow these educational institutions access to a series of predictive tools. These predictive tools will identify individual students' strengths and areas of need. As a result, educational and training programs will be able to improve learning outcomes for individual students by tailoring educational programs to these strengths and needs. A curriculum that collects data at each step of a student's learning process will permit schools, colleges, universities, and other training programs to meet student need on a daily basis. Educational and training programs will be able to offer differentiated assignments, feedback, units, and educational experiences that will promote optimal and more efficient learning experiences.

Despite this tremendous promise, big data's implementation and use is hindered by the need for highly sophisticated hardware and software to permit real-time analysis of data. Using big data for the improvement of education and training programs requires massively parallel-processing (MPP) databases, which also require the ability to store and manage huge amounts of data. Search-based applications, data-mining processes, distributed file systems, the Internet, and cloud-based computing and storage resources and applications are also necessary. As most schools, colleges, universities, and other training institutions lack a unified system, it has proven impossible for institutions to share such data on an internal basis, let alone across institutions. Unless and until these issues are resolved, big data will not have the capacity to permit all students to reach their full potential.

Privacy Issues and Other Concerns

Although using big data to help students in schools, colleges, universities, and training programs has been trumpeted by many, including the United States Department of Education, many have objected to the process as endangering student privacy rights. Indeed, many schools, colleges, and universities lack rules, procedures, or policies that guide teachers and administrators regarding how much data to collect, how long to keep it, and who to permit access to it. Further, many schools, colleges, universities, and training programs have found themselves to be inundated by data, with little idea how best to respond. In an effort to best deal with this problem, many educational and training programs have sought to establish systems that would permit them to effectively deal with this.

In order to effectively use big data practices, schools, colleges, universities, and training programs must set up systems that permit them to store, process, and provide access to the data they collect. The data has grown to include not just student grades, but also attendance records, disciplinary actions, participation in sports, special education services provided, medical records, test performance, and the like. The data needs to be stored in a single database, in compatible formats, and accessible with a single password for the data to be used effectively. This infrastructure requires funding, and often the use of consultants or collaboration with other organizations.

As systems to accumulate and analyze data were established, many critics expressed fears that doing this might invade students' privacy rights, harm those who struggle, and allow data to fall into the hands of others. Many parents are
concerned, for example, that their child's early struggles with reading or mathematics could imperil their chances to be admitted to college, be bullied by peers, or looked at negatively by future employers. Fears have also been expressed that student data will be sold to commercial concerns. As the data held by schools becomes more comprehensive and varied, student disabilities, infractions, and other information that individuals might not want released is increasingly protected by those whom it concerns. This attitude has imperiled many attempts to use big data in educational and training settings.

Efforts to establish state-of-the-art systems to use big data procedures with students have met with opposition. In the United States, for example, the Bill & Melinda Gates Foundation and the not-for-profit Carnegie Corporation provided over $100 million in funding for inBloom, a nonprofit organization that could provide the necessary technological support to permit K-12 schools to glean the benefits of big data. Although the states of Illinois, Massachusetts, and New York joined the process, the project was shut down after 2 years, largely because of opposition from parents and other privacy advocates. Despite this failure, other for-profit enterprises have been able to accumulate data from large numbers of students through programs that are sold to schools, who in turn receive information about student learning. Renaissance Learning, for example, sells the popular Accelerated Reader program that monitors students' reading comprehension to a global system of schools. As a result, it has accumulated data on over ten million students, and provides this to teachers and administrators who can use it to improve student performance.

Cross-References

▶ Big Data Quality
▶ Correlation Versus Causation
▶ Curriculum, Higher Education, and Social Sciences
▶ Education

Further Reading

Foreman, J. W. (2014). Data smart: Using data science to transform information into insight. Hoboken: Wiley.
Lane, J. E., & Zimpher, N. L. (2014). Building a smarter university: Big data, innovation, and analytics. Albany: The State University of New York Press.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data. New York: Mariner Books.
Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley.

Electronic Commerce

▶ E-Commerce

Electronic Health Records (EHR)

Barbara Cook Overton
Communication Studies, Louisiana State University, Baton Rouge, LA, USA
Communication Studies, Southeastern Louisiana University, Hammond, LA, USA

Federal legislation required healthcare providers in the United States to adopt electronic health records (EHR) by 2015; however, transitioning from paper-based to electronic health records has been challenging. Obstacles include difficult-to-use systems, interoperability concerns, and the potential for EHRs negatively impacting provider-patient relationships. EHRs do offer some advantages, such as the ability to leverage data for insights into disease distribution and prevention, but those capabilities are underutilized. EHRs generate big data, but how to convert unstructured derivatives of patient care into useful and searchable information remains problematic.

EHRs were not widely used in American hospitals before the Health Information Technology for Economic and Clinical Health Act (HITECH) was passed by congress in 2009. HITECH required
hospitals receiving Medicaid and Medicare reimbursement to adopt and meaningfully use EHRs by 2015. The legislation was partly a response to reports released by the National Academy of Medicine (then called the Institute of Medicine) and the World Health Organization which, collectively, painted an abysmal picture of the American healthcare system. Respectively, the reports noted that medical mistakes were the eighth leading cause of patient deaths in the United States and that poor utilization of health information technologies contributed significantly to the US health system's low ranking in overall performance (the United States was ranked 37th in the world). Public health agencies argued that medical errors could be reduced with the development and widespread use of health information technologies, such as EHRs. Studies suggested EHRs could both reduce medication errors and cut healthcare costs. It was predicted that improved access to patients' complete medical histories would help healthcare providers avoid duplicating treatment and over-prescribing medications, thereby reducing medical errors, curtailing patient deaths, and saving billions of dollars. Despite the potential for improved patient safety and operational efficiency, pre-HITECH adoption rates were low because EHRs were expensive, difficult to use, and negatively affected provider-patient relationships. Evidence that EHRs would improve the quality of health care was neither conclusive nor straightforward. Nonetheless, the HITECH Act required hospitals to start using EHRs by 2015.

HITECH's major goals include reducing healthcare costs, improving quality of care, reducing medical errors, improving health information technology infrastructure through incentives and grant programs, and creating a national electronic health information exchange. Before HITECH was passed, only 10% of US hospitals used EHRs. By 2017, about 80% had some form of electronic charting. The increase is attributed to HITECH's meaningful use (MU) initiative, which is overseen by the Centers for Medicaid and Medicare Services. MU facilitated EHR adoption in two ways. First, it offered financial incentives for hospitals adopting and meaningfully using EHRs before the 2015 deadline ("meaningful use" is defined as use that improves patient care, reduces disparities, and advances public health). Second, it imposed financial penalties for hospitals that failed to meet certain MU objectives by 2015 (penalties included withheld and/or delayed Medicare and Medicaid reimbursement).

Many MU requirements, however, are difficult to meet, costly to implement, and negatively impact provider productivity. Nearly 20% of early MU participants dropped out of the program, despite financial incentives and looming penalties. A majority of early MU participants, namely physicians, concluded that EHRs were not worth the cost, did not improve patient care, and did not facilitate coordination among providers. A survey of 1,000 physicians administered in 2013 revealed that nearly half believed EHRs made patient care worse and two-thirds reported significant financial losses following their EHR adoptions. Five years later, a Stanford Medicine poll found that 71% of physicians surveyed believed EHRs contributed to burnout and 59% thought EHRs needed a complete overhaul.

According to many healthcare providers, there are two main reasons EHRs need to be overhauled. The first has to do with ease of use. Most EHR systems were designed with billing departments in mind, not end users. Thus, the typical EHR interface resembles an accounting spreadsheet, not a medical chart. Moreover, the medical community's consensus that EHRs are hard to use has been widely documented. Providers contend that EHRs will not be fully functional or user-friendly until providers themselves are part of the design process.

The second reason providers believe EHRs need an overhaul centers on interoperability. Following HITECH passage in 2009, EHR makers rushed to meet the newly legislated demand. The result was dozens of proprietary software packages that did not talk to one another. This is especially problematic given HITECH's goals include standardized and interoperable EHRs. This means providers should be able to access and update health records even if patients seek treatment at multiple locations, but, as of 2020, most EHR systems were not interoperable. Consequently, the most difficult MU objective to meet,
according to several reports, is data exchange between providers.

Another factor complicating widespread EHR adoption is the widely held belief that EHRs negatively impact provider-patient relationships. Several studies show EHRs decrease the amount of interpersonal contact between providers and patients. For example, computers in exam rooms hinder communication between primary care providers and their patients: a third of the average physician's time is spent looking at the computer screen instead of the patient, and the physician, as a result, misses many of the patient's nonverbal cues. Other studies note that physicians' exam room use of diagnostic support tools, a common EHR feature, erodes patient confidence. For this reason, the American Medical Association urges physicians to complete as much data entry outside the exam room as possible. Studies also find many nurses, even when portable EHR workstations are available, opt to leave them outside of patients' rooms because of perceptions that EHRs interfere with nurse-patient relationships.

When compared with physicians, nurses have generally been more accepting of and enthusiastic about EHRs. Studies find more nurses than physicians claim EHRs are easy to use and help them complete documentation tasks more quickly. Nurses, compared with physicians, are considerably more likely to conclude that EHRs make their jobs easier. Despite concerns that EHRs can dehumanize healthcare delivery, nurses' positive attitudes are often rooted in their belief that EHRs improve patient safety.

EHRs are supposed to help keep patients safe by reducing the likelihood of medical mistakes occurring, but research finds EHRs have introduced new types of clinical errors. For example, "wrong-patient errors," which were infrequent when physicians used paper-based medical charts, are increasingly commonplace: physicians using EMRs regularly "misclick," thereby ordering, erroneously, medications and/or medical tests for the wrong patients. During an experiment, researchers observed that 77% of physicians did not confirm patients' identities before ordering laboratory tests. The study's authors attributed many of the errors to poorly designed and hard-to-use EHRs.

In addition to increasing the likelihood of wrong-patient errors occurring, EHRs can affect patients in other ways as well. For example, EHRs can alter patients' perceptions of their healthcare providers. This is important because patient satisfaction is associated positively with healthy outcomes. Difficult-to-use EHRs have been shown to decrease providers' productivity and, thereby, increase patients' wait times and lengths of stay, two factors tied directly to patients feeling dissatisfied.

Patient satisfaction hinges on several factors, but one important determinant is whether and how patients tell their stories. Storytelling is a way for patients to make sense of uncertain circumstances, and patients who are allowed to tell their stories are generally more satisfied with their providers and typically have better outcomes. Patient narratives can also mean fewer diagnostic tests and lower healthcare costs. However, EHRs limit the amount of free text available for capturing patients' stories. Providers, instead, reduce narratives to actionable lists by checking boxes that correlate with patients' complaints and symptoms. Check boxes spread across multiple screens remove spontaneity from discourse, forcing patients to recite their medical histories, ailments, and medications in a prescribed fashion. Such medical records, comprised largely of numbers and test results, lack context.

EHRs generate and store tremendous amounts of data, which, like most big data, are text-heavy and unstructured. Unstructured data are not organized in meaningful ways, thereby restricting easy access and/or analysis. Structured data, by contrast, are well organized, searchable, and interoperable. Some EHR systems do code patient data so as to make the data partially structured, but increasing volumes of unstructured patient data handicap many healthcare systems. This is due, in large part, to hybrid paper-electronic systems. Before and during an EHR adoption, patients' health data are recorded in paper charts. Along with printouts of lab reports and digital imaging results, the medical chart is neither unified nor searchable. Scanning paper-
based items digitizes the bulk of the medical record, but most scanned items are not searchable via text recognition software. Handwritten notes, frequently copied and then scanned, are often illegible, further limiting access and usability. As documentation shifts from paper-based to electronic charting, more of patients' records become searchable. Although EHRs are not maximally optimized yet, they do present a clear advantage over paper-based systems: reams of patient data culled from medical and pharmaceutical records are no more searchable in paper form than unstructured big data. EHRs do offer a solution; however, leveraging that data requires skill and analytical tools.

Although an obvious benefit of using an EHR is readily accessible medical records, providers who lack the expertise and time necessary for searching and reviewing patients' histories often underutilize this feature. Evidence suggests some providers avoid searching EHRs for patients' histories. Instead, providers rely on their own memories or ask patients about previous visits. One study found that although physicians believed reviewing patients' medical records before examining them was important, less than a third did so. Thirty-five percent of physicians admitted that asking patients about their past visits was easier than using the EHR, and among those who tried using the EHR, 37% gave up because the task was too time-consuming.

A burgeoning field of health data analytics is poised to facilitate access and usability of EHR data. Healthcare analytics can help reduce costs while enhancing data exchange, care coordination, and overall health outcomes. This merger of medicine, statistics, and computer science can facilitate creating longitudinal records for patients seeking care in numerous venues and settings. This sets the stage for improved patient-centered health care and predictive medicine. Analytic tools can identify patients at risk for developing chronic conditions like diabetes, high blood pressure, or heart disease. Combining health records with behavioral data can enable population-wide predictions of disease occurrence and facilitate better prevention programs and improve public health.

EHRs, like paper medical charts, must be safeguarded against privacy and security threats. Patient privacy laws require encrypted data be stored on servers which are firewall- and password-protected. These measures afford improved control and protection of electronic health data, considering paper charts can be accessed, copied, or stolen by anyone entering a room where records are kept. Controlling access to electronic health data is accomplished by requiring usernames and passwords. Most EHRs also restrict access to certain portions of patients' data depending on users' level of authorization. For instance, while nurses may view physicians' progress notes and medication orders, they may not change them. Likewise, physicians cannot change nurses' notes. Techs may see which tests have been ordered, but not the results. This is important given EHRs are accessible by many providers, all of whom can contribute to the patient record.

EHRs, while promising, are not widely utilized nor efficiently leveraged for maximum productivity. Many are calling for a "next-generation" EHR prioritizing interoperability, information sharing, usability, and easily accessible/searchable data. Nonetheless, EHRs in their current form ensure data are largely protected from security breaches and are backed up regularly; these are clear advantages over paper charts susceptible to violation, theft, or damage (i.e., consider the tens of thousands of paper-based medical records destroyed by Hurricane Katrina). How EHRs affect provider productivity and provider-patient relationships are highly contested subjects, but evidence suggests enhanced data-mining capabilities can improve disease prevention and intervention efforts, thereby improving health outcomes.

Cross-References

▶ Biomedical Data
▶ Epidemiology
▶ Health Care Delivery
▶ Health Informatics
▶ Patient-Centered (Personalized) Health
▶ Patient Records
Further Reading

Adler-Milstein, J., et al. (2017). Electronic health record adoption in US hospitals: The emergence of a digital 'advanced use' divide. Journal of the American Medical Informatics Association, 24(6), 1142–1148.
Christensen, T., & Grimsmo, A. (2008). Instant availability of patient records, but diminished availability of patient information: A multi-method study of GP's use of electronic health records. BMC Medical Informatics and Decision Making, 8(12), 1–9.
DesRoches, C. (2013). Meeting meaningful use criteria and managing patient populations: A National Survey of practicing physicians. Annals of Internal Medicine, 158, 791–799.
Henneman, P., et al. (2008). Providers do not verify patient identity during computer order entry. Academic Emergency Medicine, 15(7), 641–648.
Institute of Medicine. (1999). To err is human: Building a safer health system. Washington DC: National Academies Press.
Montague, E., & Asan, O. (2014). Dynamic modeling of patient and physician eye gaze to understand the effects of electronic health records on doctor-patient communication and attention. International Journal of Medical Informatics, 83, 225–234.
Nambisan, P., et al. (2013). Understanding electronic medical record adoption in the United States: Communication and sociocultural perspectives. Interactive Journal of Medical Research, 2, e5.
Overton, B. (2020). Unintended consequences of electronic medical records: An emergency room ethnography. Lanham: Lexington Books.
Stanford Medicine. (2018). How doctors feel about electronic health records: National physician poll by the Harris Poll. Retrieved from https://med.stanford.edu/content/dam/sm/ehr/documents/EHR-Poll-Presentation.pdf.
Stark, P. (2010). Congressional intent for the HITECH act. The American Journal of Managed Care, 16, SP24–SP28.

Ensemble Methods

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Synonyms

Consensus methods; Mixture-of-experts

Ensemble methods are defined as "learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions" (Dietterich 2000). Assuming reasonable performance and diversity on the part of each of the component classifiers (Dietterich 2000), the collective answer should be more accurate than any individual member of the ensemble. For example, if the first classifier makes an error, the second and third classifiers, if correct, can "outvote" the first classifier and lead to a correct analysis of the overall system. Ensemble methods thus provide a simple and commonly used method of boosting performance in big data analytics.

Ensemble methods provide many advantages in big data classification. First, because the overall performance is generally better than that of any individual classifier in the ensemble, "you can often get away with using much simpler learners and still achieve great performers" (Gutierrez and Alton 2014). Second, these classifiers can often be trained in parallel on smaller subsets of data, requiring less time and data access than a more sophisticated system that requires access to the entire dataset at once.

There are several different methods that can be used to construct ensembles. One of the easiest and most common methods is "bagging" (Breiman 1996; Dietterich 2000), where each classifier is trained on a randomly chosen subset of the original data. Each classifier is then given one vote and the overall prediction of the ensemble is the answer that receives the most votes. Other methods used include weighting votes by the measured accuracy of each classifier (a more accurate classifier receives greater weight), separating the training set into disjoint sets and cross-validating, or calculating probabilities and using Bayesian statistics to directly assess the probability of each answer. More esoteric methods may involve learning classifiers as well as learning additional selection algorithms to choose the best classifier or classifiers for any specific data point.
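The bagging and voting schemes described above can be illustrated with a short sketch. The following Python fragment is only a minimal, hypothetical illustration, not code from the cited sources: it assumes the training data X and labels y are NumPy arrays, it arbitrarily uses decision trees as the base learners (almost any learner could be substituted), and the function names are invented for this example. The optional weights argument corresponds to the accuracy-weighted voting mentioned above.

```python
# Illustrative sketch of bagging with plurality (optionally weighted) voting.
# Names and parameter choices are for demonstration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_ensemble(X, y, n_members=25, seed=None):
    """Train each member on a bootstrap sample (a random subset drawn with replacement)."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample of the training data
        model = DecisionTreeClassifier()            # any base learner could be used here
        model.fit(X[idx], y[idx])
        members.append(model)
    return members

def predict_by_vote(members, X, weights=None):
    """Combine member predictions: equal votes by default, or weights such as validation accuracy."""
    preds = np.array([m.predict(X) for m in members])   # shape: (n_members, n_samples)
    weights = np.ones(len(members)) if weights is None else np.asarray(weights)
    classes = np.unique(preds)
    # Tally the (weighted) votes each class receives for every sample and return the winner.
    votes = np.array([[weights[preds[:, i] == c].sum() for c in classes]
                      for i in range(preds.shape[1])])
    return classes[votes.argmax(axis=1)]
```

In this sketch the ensemble's answer for each data point is simply the class with the largest (weighted) vote total, which is the behavior the definitions above describe.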
Another method of constructing ensembles is to use adaptive training sets in a procedure called "boosting." Any specific classifier is likely to perform better on some types of input than on others. If these areas of high and low performance can be identified, the boosting algorithm constructs a new training set that focuses on the mistakes of that classifier and trains a second classifier to deal with them. "Briefly, boosting works by training a set of learners sequentially and combining them for prediction, where the later learners focus more on the mistakes of the earlier learners" (Zhou 2012).

Within this framework, almost any learning or classification algorithm can be used to construct the individual classifiers. Some commonly used methods include linear discriminant analysis, decision trees, neural networks (including deep learning), naïve Bayes classifiers, k-nearest neighbor classifiers, and support vector machines. Other applications of ensemble methods include not only classification into categories, but also prediction of numeric values or discovering the structure of the data space via clustering.

Applications of ensemble methods include network intrusion detection, molecular bioactivity and protein locale prediction, pulmonary embolisms detection, customer relationship management, educational data mining, music and movie recommendations, object detection, and face recognition (Zhou 2012). Ensemble methods provide a powerful and easy to understand method of analyzing data that is too complicated for manual analysis.

Further Reading

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple classifier systems. MCS 2000 (Lecture notes in computer science) (Vol. 1857). Berlin/Heidelberg: Springer. https://doi.org/10.1007/3-540-45014-9_1.
Gutierrez, D., & Alton, M. (2014). Ask a data scientist: Ensemble methods. InsideBigData.com. https://insidebigdata.com/2014/12/18/ask-data-scientist-ensemble-methods/.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms. Boca Raton: CRC Press.

Entertainment

Matthew Pittman and Kim Sheehan
School of Journalism & Communication, University of Oregon, Eugene, OR, USA

Advances in digital technology have given most mobile devices the capability to not only stream but actually recognize (Shazam, VideoSurf, etc.) entertaining content. Streaming data is replacing rentals (for video) and hard disc ownership (for video and audio). Consumers have more platforms to watch entertaining content, more devices on which to watch it, and more ways to seek out new content. The flip side is that content producers have more ways to monitor and monetize who is consuming it. To complicate matters, user-generated content (YouTube videos, remixes, and social media activity) and metadata (data about data, or the tracking information attached to most files) are changing the need for, and enforcement of, copyright laws.

The traditional Hollywood distribution model (theatrical release, pay-per-view, rental, premium cable, commercial cable) has changed dramatically in the wake of smartphones, tablets, and similarly mobile devices on which people can now view movies and television shows. With DVD and Blu-ray sales declining, production studios are constantly experimenting with how soon to allow consumers to buy, rent, or stream a film after its theatrical release. Third-party platforms like Apple TV, Netflix, or Chromecast are competing with cable providers (Comcast, Time Warner, etc.) to be consumers' method of choice for entertainment in and out of the home.

Andrew Wallenstein has said that, for content producers, the advent of big data will be like going from sipping through a straw to sucking on a fire hose: once they figure out how to wrangle all this new information, they will understand their customers with an unprecedented level of sophistication. Already, companies have learned to track everything users click on, watch, or scroll past in
order to target them with specific advertising, programming, and products. Netflix in particular is very proud of the algorithms behind their recommendation systems: they estimate that 75% of viewer activity is driven by their recommendations. Even when customers claim to watch foreign films or documentaries, Netflix knows more than enough to recommend programs based on what people are actually watching.

When it comes to entertainment, Netflix and Hulu also developed such successful models for distributing content that they have begun to create their own. Creating original content (in the case of Netflix, award-winning content) has solidified the status of Netflix, Amazon, Hulu, and others as the new source for entertainment content distribution.

On the consumer end, the migration of entertainment from traditional cable televisions to digital databases makes it difficult to track a program's ratings. In the past, ratings systems (like Nielsen) used viewer diaries and television set meters to calculate audience size and demographic composition for various television shows. They knew which TV sets were tuned to what channels at what time slot, and this data was enormously useful for networks in figuring out programming schedules.

Online viewing, however, presents new challenges. There are enormous amounts of data that corporations have access to when someone watches their show online. Depending on the customer's privacy settings and browser, he or she might yield the following information about himself or herself to the site from which they are streaming audio or video: what other pages the customer has recently visited and in what order, his or her social media profile and thus demographic information, purchase history and items for which he or she might currently be in the market, or what other media programs he or she watches. New algorithms are constantly threatening the delicate balance between privacy and convenience.

With traditional ratings, the measurements occurred in real time. However, most online viewing (including DVR- or TiVo-mediated viewing) occurs whenever it is convenient for the customer, which may be hours, days, weeks, or even years after the content originally aired. The issue then becomes how long after entertainment is posted to count a single viewing toward its ratings. Shows like Community might fail to earn high enough ratings to stay with NBC, the network that originally produced it. NBC would initially air an episode via traditional network broadcast and then host it online the next day. However, thanks to Community's loyal following online, it found a new digital home: the show is now produced and hosted online by Yahoo TV. As long as there is a fan base for a kind of entertainment, its producers and consumers should always be able to find each other amid the sea of digital data.

Musical entertainment is undergoing a similar shift in the age of big data. Record companies now license their music to various platforms: the iTunes store came out in 2003, Pandora in 2004, and Spotify in 2006. These digital music services let users buy songs, listen to radio, and stream songs, respectively. Like with Netflix and its videos, music algorithms have been developed to help consumers find new artists with a similar sound to a familiar one. Also like with video, the digital data stream of consumption lets companies know who is listening to their product, when they listen, and through what device.

Analytics are increasingly important for producers and consumers. The band Iron Maiden found that lots of people in South America were illegally downloading their music, so they put on a single concert in São Paulo and made $2.58 million. Netflix initially paid users to watch and metatag videos and came up with over 76,000 unique ways to describe types of movies. Combining these tags with customer viewing habits led to shows people actually wanted: House of Cards and Orange Is the New Black. Hulu experimented with a feature that let users search for words in captions on a show's page. So while looking at, say, Parks and Recreation, if users frequently searched for "bloopers" or "Andy naked prank," Hulu could prioritize that content. The age of big data has wrought an almost limitless number of
ways to entertain, be entertained, and keep track of that entertainment.

Cross-References

▶ Netflix

Further Reading

Barnes, S. B. (2006). A privacy paradox: Social networking in the United States. First Monday, 11(9), 0–14. http://firstmonday.org/article/view/1394/1312_2
Breen, C. Why the iTunes store succeeded. http://www.macworld.com/article/2036361/why-the-itunes-store-succeeded.html. Accessed Sept 2014.
Schlieski, T., & Johnson, B. D. (2012). Entertainment in the age of big data. Proceedings of the IEEE, 100 (Special Centennial Issue), 1404–1408.
Vanderbilt, T. The science behind the Netflix algorithms that decide what you'll watch next. http://www.wired.com/2013/08/qq_netflix-algorithm/. Accessed Sept 2014.

Environment

Zerrin Savasan
Department of International Relations, Sub-Department of International Law, Selçuk University, Konya, Turkey

The environment phenomenon includes both "that environs" and "what is environed" and the relationship between the "environing" and the "environed." It can be understood as all the physical and biological surroundings involving linkages/interrelationships at different scales between different elements. It can also be defined as all natural elements from ecosystem to biosphere plus human-based elements and their interactions. If a clear grasp of the term cannot be rendered as a first step, it cannot be understood correctly what is meant by the term in related subjects. Therefore, it can be applied wrongly or incompletely in the processes of studying these subjects, due to the substantial divergences in the understanding of the term. So, here, it is firstly required to clarify what is really said by the term environment.

The Term Environment

The term environment in fact can be defined in several different ways and can be used in various forms and contexts. To illustrate, its definition can be categorized into four basic categories and several subcategories: 1. building blocks, 1.1. architectural (built environment-natural environment), 1.2. geographical (terrestrial environment-aquatic environment), 1.3. institutional (home environment-work environment-social environment); 2. economic uses, 2.1. inputs (natural resources-system services), 2.2. outputs (contamination-products), 2.3. others (occupational health-environmental engineering); 3. spatial uses, 3.1. ecosystems (forest-rangeland-planet), 3.2. comprehensive (watershed-landscape); 4. ethical/spiritual uses, 4.1. home (nature-place-planet-earth), 4.2. spiritual (deep ecology-culture-wilderness-GAIA).

However, given the definitions of some dictionaries, it is generally defined as the external and internal conditions affecting the development and survival of an organism and ultimately giving its form; or the sum of social and cultural conditions influencing the existence and growth of an individual or community's life. As commonly used, the term environment is usually understood as the surrounding which an organism finds itself immersed in. It is actually a more complex term involving more than that, because it includes all aspects or elements that affect that organism in distinct ways, and each organism in turn affects all those which affect itself. That is, each organism is surrounded by all those influencing each other through a causal relationship.

Related Terms: Nature

In its narrow sense the environment implies all in nature from ecosystem to biosphere, on which there is no human impact or the human impact is kept under a limited level. Most probably because of that, for many, the terms environment and
nature have been used interchangeably, and it is so often thought that the term environment is synonymous with nature. Yet, the term nature consists of all on the earth, but not the human-made elements. So, while the word environment is used, it means more than nature, so it actually should not be substituted for nature. In its broader usage, it refers to all in nature in which all human beings and all other living organisms, plants, animals, etc., have their being, and to the interrelationships among all in nature and with the living organisms. That means it covers natural aspects as well as human-made aspects (represented by the built environment). Therefore, it can be classified into two primary dimensions.

1. Natural (or Physical) Dimension: It encompasses all living and nonliving things occurring naturally on earth, so, two components, an abiotic (or nonliving) and a biotic (or living) component.

• Abiotic (or nonliving) component involves physical factors including sunlight (essential for photosynthesis), precipitation, temperature, and types of soil present (sandy or clay, dry or wet, fertile or infertile to ensure base and nutrients); and chemical factors containing proteins, carbohydrates, fats, and minerals. These elements establish the base for further studies on living components.
• Biotic (or living) component comprises plants, animals, and microorganisms in complex communities. It can be distinguished by three types.

(a) Producers/autotrophs: The producers absorb some of the solar energy from the sun and transform it into nutritive energy through photosynthesis, i.e., they are self-nourishing organisms preparing organic compounds from inorganic raw materials through the processes of photosynthesis, e.g., all green plants, both terrestrial and aquatic ones such as phytoplankton.
(b) Consumers/heterotrophs: The consumers depend on the producers for energy directly (herbivores such as rabbits) or indirectly (carnivores such as tigers). When they consume the plants, they absorb their chemical energy into their bodies, and thus make use of this energy in their bodies to maintain their livelihood, e.g., animals of all sizes ranging from large predators to small parasites, e.g., herbivores, carnivores, omnivores, mosquitoes, flies, etc.
(c) Decomposers: When plants and animals die, the rest of the chemical energy staying in the consumers' bodies is used by the decomposers. The decomposers convert the complex organic compounds of these dead plants and animals to simpler ones by the processes of decomposition and disintegration, e.g., microorganisms such as fungi, bacteria, yeast, etc., as well as a diversity of worms, insects, and many other small animals.

2. Human-Based (or Cultural) Dimension: It basically includes all human-driven characteristics of the environment, so all its components that are strongly influenced by human beings. While living in the natural environment, human beings change it to their needs: they accept norms and values, make regulations, manage economic relations, find new technologies, establish institutions and administrative procedures, and form policies to conduct them, so, in brief, they create a new environment to meet their survival requirements by modifying the natural environment.

Related Terms: Ecosystem/Ecology

Another term which is used often interchangeably with the environment is ecosystem. Like the term nature, ecosystem is also used synonymously with the environment. This is particularly because the research subjects of all sciences related to the environment are interconnected and interrelated to each other. To illustrate, natural science is
concerned with the understanding of natural phenomena on the basis of observation and empirical evidence. In addition, earth science, which is one of the branches of natural science, provides the studies of the atmosphere, hydrosphere, lithosphere, and biosphere. Ecology, on the other hand, as the scientific study of ecosystems, is defined as a discipline studying the interactions between some type of organism and its nonliving environment, and thus how the natural world works. In other words, its research area is basically restricted to the living (biotic) elements in nature, i.e., the individual species of plants and animals or community patterns of interdependent organisms, together with their nonliving environment, including the atmosphere, geosphere, and hydrosphere. Thus, ecology arises as a science working like the biological science of environmental studies.

Nevertheless, particularly after the increasing role of the human component in both disciplines, i.e., in both environmental studies and ecology, the difference in the research subjects of the two sciences has almost been eliminated. Hence, currently, both environmental scientists and ecologists examine the impacts of linkages in nature and also the interactions and interrelationships of living (biotic) and nonliving (abiotic) elements with each other. Their investigations thus mostly rest on similar methods and approaches.

Based on these facts, the question arises what the concept of ecosystem means and what should be understood by that concept as distinct from the concept of environment. An ecosystem can be simply identified as an interacting system in which the total array of plant and animal species (biological component) inhabiting a common area and their nonliving environment (physical component) have effects on one another interdependently. So, it constitutes a significant unit of the environment, and of environmental studies. Accordingly, it should be underlined that even if the two terms – environment and ecosystem – are deeply interrelated, concerned with nature, and studied scientifically using similar perspectives, they should not be substituted for each other. This is particularly because the two terms differ dramatically in their definitions. An environment is established by the surroundings (involving both the natural environment and the human-based environment) in which we live; an ecosystem is a community of organisms (biotic) functioning within an environment. In order to make it easier to understand, it is usually studied as divided into two major categories: the aquatic (or water) ecosystem, such as lakes, seas, streams, rivers, ponds, etc., and the terrestrial (or land) ecosystem, such as deserts, forests, grasslands, etc. However, in fact, while a lake or a forest can be considered as an ecosystem, the whole structure of the earth, involving an interrelated set of smaller systems, also forms an ecosystem, referred to as the ecosphere (or global ecosystem), the ecosystem of earth.

Human Impact on Environment

Humankind is dependent on the environment for its survival, well-being, continued growth, and thus its evolution, and the environment is dependent on humankind for its conservation and evolution. So, there is an obvious interdependent relationship between human development and the environment. Humans should be aware of the fact that while they are degrading the environment, they are actually harming themselves and preparing their own end. Yet, unfortunately, to date, the general attitude of human beings has been to focus on their development rather than on the protection and development of the environment. Indeed, the enormous increase in human population has generated new needs in greater numbers and so raised the demand for constantly increasing development. This has resulted in growing excesses of industrialization and technology development to facilitate the rapid transformation of resources into the needs of humans, and so increasing consumption of various natural resources. Various sources of environmental pollution (air, land, water), deforestation, and climate change, which are among the most threatening environmental problems today, have also been generated by human activities. Therefore, it is generally admitted that human intervention has been a very crucial factor changing the
environment, although sometimes positively, unfortunately often negatively, causing large-scale environmental degradation.

There are various environments, as understood from the above-mentioned explanations, ranging from those at very small scales to the entire environment itself. They are all closely linked to each other, so the presence of adverse effects in even a small-scale environment may ultimately be followed by environmental degradation across the entire world. Human beings have come to realize this fact, and environmental problems started to be seen as a major cause of global concern in the late 1960s. Since then, a great number of massive efforts – agreements, organizations, and mechanisms – have been created and developed to create awareness about environmental pollution and its related adverse effects, and about humans' responsibilities towards the environment, and also to form meaningful support towards environmental protection. All of them, working in fields concerning global environmental protection, have been initiated to reduce these problems, which need to be addressed through a globally concerted environmental policy.

In particular, the United Nations (UN) system, developing and improving international environmental law and environmental policy, with its crucial organs, significant global conferences like the Stockholm Conference establishing the United Nations Environment Programme (UNEP) and the Rio Conference establishing the Commission on Sustainable Development (CSD), numerous specialized agencies such as the International Labour Organization (ILO), World Health Organization (WHO), International Monetary Fund (IMF), and United Nations Educational, Scientific and Cultural Organization (UNESCO), and semi-autonomous bodies like the UN Development Programme (UNDP), the UN Institute for Training and Research (UNITAR), the UN Conference on Trade and Development (UNCTAD), and the UN Industrial Development Organization (UNIDO), has greatly contributed to the struggle with the challenges of global environmental issues. Moreover, since the first multilateral environmental agreement (MEA), the Convention on the Rhine, adopted in 1868, the number of MEAs has gone up rapidly, particularly in the period from Stockholm to Rio.

The Goal of Sustainable Development

Recently, at the Rio+20 United Nations Conference on Sustainable Development (UNCSD), held in Rio de Janeiro, Brazil, from 20 to 22 June 2012, while the seriousness of global environmental deterioration was acknowledged, at the same time the importance of the goal of sustainable development as a priority was re-emphasized. Indeed, its basic themes were building a green economy for sustainable development, including support for developing countries, and also building an institutional framework for sustainable development to improve international coordination. The renewed political commitment to sustainable development – implying simply the integration of environment and development, and, more elaborately, development meeting the needs of the present without compromising the ability of future generations to meet their own needs, as defined in the Brundtland Report (1987) prepared by the World Commission on Environment and Development – is also reaffirmed by the document created by the Conference, namely "The Future We Want." This document also supports the development of 17 measurable goals aimed at promoting sustainable development globally, namely, the Sustainable Development Goals (SDGs) of the 2030 Agenda for Sustainable Development. These goals, adopted at a UN summit held at UN Headquarters in September 2015, include the following: ending poverty and hunger, ensuring healthy lives, inclusive and equitable quality education, gender equality, clean water and sanitation, affordable and clean energy, decent work and economic growth, sustainable industrialization, sustainable cities and communities, responsible consumption and production, climate action, peace, justice and strong institutions, partnerships for the goals, and reducing inequalities. They build on the eight Millennium Development Goals (MDGs), adopted in September 2000 at the UN Millennium Summit, setting out a series of targets such as eradicating extreme poverty/hunger, improving universal
primary education, gender equality, maternal health, environmental sustainability, global partnership for development, reducing child mortality, and coping with HIV/AIDS, malaria, and other diseases, with a deadline of 2015.

The foundation for the concept of sustainable development was first laid through the Founex report (Report on Development and Environment), which was prepared by a panel of experts meeting in Founex, Switzerland, in June 1971. Indeed, according to the Founex report, while the degradation of the environment in wealthy countries is mainly a result of their development model, in developing countries it is a consequence of underdevelopment and poverty. Then, the official title of the UN Conference on Environment and Development (UNCED), held in Rio de Janeiro in 1992, in itself summarizes, in fact, the efforts of the UN Conference on the Human Environment (UNCHE), held in Stockholm in 1972, or rather, those of the Founex Report. The concept has been specifically popularized by the official titles of the two last UN conferences, namely, the World Summit on Sustainable Development, held in Johannesburg in 2002, and the United Nations Conference on Sustainable Development (UNCSD), held in Rio de Janeiro in 2012.

Intelligent Management of the Environment: Big Data

As mentioned above, humankind has made remarkable progress in learning ways of reconciling environmental and developmental needs, and thus in achieving a sustainable relationship with the environment. Yet, despite all those efforts, it seems that comprehension and intelligent management of the environment are still inadequate and incomplete. This remains one of the most important challenges humankind has to face. This is particularly because of two fundamental reasons.

1. Environment is a multidisciplinary subject encompassing diverse fields that should be examined from many different aspects. It should include the studies of different sciences such as economics, geology, geography, hydrology, history, physics, physiology, etc.
2. Environmental science, which basically examines the environment, has direct relevance to different sides of the life of all living beings. It is a multidisciplinary science involving many different research topics like protection of nature and natural resources, biological diversity, prevention and reduction of environmental pollution, stabilization of human population, the relation between development and environment, improvement of modern technologies supporting renewable energy systems, etc.

This extraordinarily broad field, stemming from the multiplicity of both the environment and environmental science, results in an explosion in data types/amounts/methods of storage/models of usage, etc., at an unprecedented rate. To capture this complexity and diversity, to better understand the field of environment, and to address the challenges associated with environmental problems and sustainable development (Keeso (2014) argues that while Big Data can become an integral element of environmental sustainability, environmental sustainability can, vice versa, become an essential part of Big Data analysis, through emerging new tools such as collaborative partnerships and business model innovation), it has very recently been suggested to build Big Data sets having a multidisciplinary dimension, encompassing diverse fields in themselves and having influence within multiple disciplines.

Although there is no common understanding/identification of Big Data (ELI 2014; Keeso 2014; Simon 2013; Sowe and Zettsu 2014), it generally refers to large-scale and technology-driven/computer-based data collection/storage/analysis, in which data obtaining/monitoring/estimating is quick and easy by means of satellites, sensor technology, and models. Therefore, it is often argued that, by means of Big Data, it becomes easier to identify vulnerabilities requiring further environmental protection and to make qualified estimations on future prospects, thus to take preventive measures urgently to respond to environmental problems and, in the final analysis, to reduce hazardous exposures to environmental issues, ranging from climate change to air-land-water pollution.
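To make this kind of analysis concrete, the following is a minimal, illustrative sketch rather than a description of any specific program mentioned in this entry. It assumes a hypothetical CSV file of monthly sensor readings (file name and column names are invented), and uses the Python pandas and numpy libraries to flag unusually high values (candidate "vulnerabilities") and to extrapolate a simple per-station trend as a rough "estimation on future prospects"; the two-standard-deviation rule and the linear trend are illustrative choices only.

import numpy as np
import pandas as pd

# Hypothetical input: monthly mean PM2.5 readings per monitoring station.
# Assumed columns: station, month (YYYY-MM), pm25.
readings = pd.read_csv("air_quality_monthly.csv", parse_dates=["month"])

# Flag station-months far above that station's typical level.
stats = readings.groupby("station")["pm25"].agg(["mean", "std"]).reset_index()
merged = readings.merge(stats, on="station")
merged["anomaly"] = merged["pm25"] > merged["mean"] + 2 * merged["std"]
print(merged[merged["anomaly"]][["station", "month", "pm25"]])

# Crude "future prospect": a linear trend per station.
for station, grp in readings.groupby("station"):
    grp = grp.sort_values("month")
    x = np.arange(len(grp))
    slope, intercept = np.polyfit(x, grp["pm25"].to_numpy(), 1)
    projected = slope * len(grp) + intercept
    print(f"{station}: projected next-month PM2.5 is roughly {projected:.1f}")

In practice, agencies and observatory networks use far richer models, but the basic pattern – aggregate, flag, and project – is the same.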
In exploring how these large-scale and complex data sets are being used to cope with environmental problems in a more predictive/preventive and responsive manner, the following cases can be shown as examples (ELI 2014):

• Environmental maps generated from US Environmental Protection Agency (EPA) databases, including information on environmental activities, in the context of EnviroMapper
• Online access to the state Departments of Natural Resources (DNRs) and other agencies for Geographic Information Systems (GIS) data on environmental concerns
• Usage of Big Data sets in many states' and localities' environmental programs and in the administration of their federally delegated programs
• The Green Initiatives Tracking Tool (GITT) developed by the US Postal Service to collect information on employee-led sustainability projects – related to energy, water, and fuel consumption and waste generation – taking place across its individual facilities
• Collection of site-based data by the National Ecological Observatory Network (NEON) related to the effects of climate change, land use change, and invasive species from several sites throughout the USA
• The Tropical Ecology Assessment and Monitoring Network (TEAM) of publicly shared datasets developed by Conservation International (CI) to serve as an early warning system to alert about environmental concerns and to monitor the effects of climate or land use changes on natural resources and ecosystems
• Country/issue ranking of countries' management of environmental issues and investigation of global data comparing environmental performance with GDP, population, land area, or other variables by a Data Explorer under the context of the Environmental Performance Index (EPI); a simple version of this kind of comparison is sketched below.
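The following sketch is only an illustration of the last item, not a reproduction of the EPI Data Explorer itself: it assumes two hypothetical country-level tables (file and column names are invented) and uses pandas to relate an environmental performance score to GDP per capita and population.

import pandas as pd

# Hypothetical inputs, one row per country.
# epi.csv assumed columns: country, epi_score
# economy.csv assumed columns: country, gdp_per_capita, population
epi = pd.read_csv("epi.csv")
econ = pd.read_csv("economy.csv")
merged = epi.merge(econ, on="country", how="inner")

# How strongly does environmental performance track national wealth and size?
print(merged[["epi_score", "gdp_per_capita", "population"]].corr())

# Countries performing better environmentally than their income rank alone would suggest.
merged["epi_rank"] = merged["epi_score"].rank(ascending=False)
merged["gdp_rank"] = merged["gdp_per_capita"].rank(ascending=False)
better = merged[merged["epi_rank"] < merged["gdp_rank"]]
print(better.sort_values("gdp_rank")[["country", "epi_score", "gdp_per_capita"]].head())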
As shown by the example cases above, the use of Big Data technologies on environment-related issues gradually increases; yet there is still a need for further research to tackle the challenges raised about the use of Big Data (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares, n.d.; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014).

Cross-References

▶ Earth Science
▶ Pollution, Air
▶ Pollution, Land
▶ Pollution, Water

Further Reading

Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh, 29 Apr 2010. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878. Accessed 3 Feb 2017.
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Retrieved from https://www.researchgate.net/publication/299379163_A_formal_definition_of_Big_Data_based_on_its_essential_features. Accessed 3 Feb 2017.
Environmental Law Institute (ELI). (2014). Big data and environmental protection: An initial survey of public and private initiatives. Washington, DC: Environmental Law Institute. Retrieved from https://www.eli.org/sites/default/files/eli-pubs/big-data-and-environmental-protection.pdf. Accessed 3 Feb 2017.
Environmental Performance Index (EPI). (n.d.). Available at: http://epi.yale.edu/. Accessed 3 Feb 2017.
Forte Wares. (n.d.). Failure to launch: From big data to big decisions: Why velocity, variety and volume is not improving decision making and how to fix it. White Paper. A Forte Consultancy Group Company. Retrieved from http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares–pro-active-reporting_EN.pdf. Accessed 3 Feb 2017.
Keeso, A. (2014). Big data and environmental sustainability: A conversation starter. Smith School Working Paper Series, Dec 2014, Working paper 14-04. Retrieved from http://www.smithschool.ox.ac.uk/library/working-papers/workingpaper%2014-04.pdf. Accessed 3 Feb 2017.
Kemp, D. D. (2004). Exploring environmental issues. London/New York: Taylor and Francis.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
Patten, B. C. (1978). Systems approach to the concept of environment. The Ohio Journal of Science, 78(4), 206–222.
Raven, P. H., & Berg, L. R. (2006). Environment. Danvers: John Wiley & Sons.
Saunier, R. E., & Meganck, R. A. (2007). Dictionary and introduction to global environmental governance. London: Earthscan.
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
Withgott, J., & Brennan, S. (2011). Environment. New York: Pearson.

Epidemiology

David Brown1,2 and Stephen W. Brown3
1Southern New Hampshire University, University of Central Florida College of Medicine, Huntington Beach, CA, USA
2University of Wyoming, Laramie, WY, USA
3Alliant International University, San Diego, CA, USA

Epidemiology is the scientific discipline concerned with the causes, the effects, the description of, and the quantification of health phenomena in specific identifiable populations. Epidemiologists, the public health professionals who study and apply epidemiology, investigate the geographic, the behavioral, the economic, the hereditary, and the lifestyle patterns that increase or decrease the likelihood of disease or injury in specific populations. The art and science of epidemiology investigates the outbreak of diseases and injury in different populations throughout the world. Epidemiological data is used to understand the distribution of disease and injury in an attempt to improve peoples' health and prevent future negative health consequences.

The primary goals of epidemiology are:

To describe the health status of populations and population subgroups. This information is used to develop statistical models showing how different groups of people are affected by different diseases and other health consequences. Big data in the form of demographic information is essential for the descriptive process of epidemiology.
To explain the etiological and causative factors that lead to or protect against disease or injury. Explanatory epidemiological data is also used to determine the ways in which disease and other health phenomena are transmitted. Big data are essential for the accurate identification of both causative factors and patterns of disease transmission.
To predict the occurrence of disease and the probability of outbreaks and epidemics. Predictive data are also used to estimate the positive effects of scientific and social changes such as the development of new vaccines and lifestyle changes such as increasing the amount of time people spend exercising. Big data allows for expanded data collection, improved data analysis, and increased information dissemination; these factors clearly improve prediction accuracy and timeliness.
To control the distribution and transmission of disease and other negative health events and to promote factors that improve health. The activities of describing, explaining, and predicting come together in implementing public health's most important function, which improves national and world health. Epidemiological big data is used to identify potential curative factors for people who have contracted a disease. Big data identify factors that have the potential of preventing future outbreaks of disease and epidemics. Big data can also help identify areas where health education and health promotion activities are most needed and have the potential of having a positive impact.

Epidemiological research is divided into two broad and interrelated types of studies: descriptive epidemiological research and analytic
epidemiological research. Both of these types of studies have been greatly influenced by big data.

Descriptive epidemiology addresses such questions as: Who contracts and who does not contract some specific disease? What are the people factors (e.g., age, ethnicity, occupation, lifestyle, substance use and abuse) that affect the likelihood of contracting or not contracting some specific health problem? What are the place factors (e.g., continent, country, state, province, residence, work space, places visited) that affect the probability of contracting or not contracting some specific health problem? What are the "time factors" (e.g., time before diagnosis, time after exposure before symptoms occur) that affect the course of a health problem? Clearly, big data provides much-needed information for the performance of all descriptive epidemiological tasks.

Analytic epidemiological studies typically test hypotheses about the relationships between specific behaviors (e.g., smoking cigarettes, eating a balanced diet) and exposures to specific events (e.g., experiencing a trauma, receiving an inheritance, being exposed to a disease) and mortality (e.g., people in a population who die in a specific time period) and morbidity (e.g., people in a population who are ill in a specific time period). Analytic epidemiological studies require the use of a comparison group.

There are two types of analytic studies: prospective and retrospective. A prospective epidemiological study looks at the consequences of specific behaviors and specific exposures. As an example, a prospective analytic study might compare the future lung cancer rate among people who currently smoke cigarettes with the future lung cancer rate of people who do not smoke cigarettes.

In a retrospective epidemiological study, the researcher identifies people who currently have a certain illness or other health condition, and then he or she identifies a comparable group of people who do not have the illness or health condition. Then, the retrospective researcher uses many investigative techniques to look back in time to identify events or situations that occurred in the past that differentiate between the two groups.
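Although the entry does not name them, comparisons of this kind are conventionally summarized with a rate (risk) ratio in prospective designs and an odds ratio in retrospective (case-control) designs. The sketch below uses invented counts purely for illustration; it is not drawn from any study cited here.

# Hypothetical prospective cohort: follow smokers and non-smokers forward
# and compare how often the outcome (e.g., lung cancer) occurs in each group.
smokers_total, smokers_cases = 1000, 30        # invented counts
nonsmokers_total, nonsmokers_cases = 1000, 5   # invented counts

smoker_rate = smokers_cases / smokers_total
nonsmoker_rate = nonsmokers_cases / nonsmokers_total
print(f"Rate among smokers:     {smoker_rate:.3f}")
print(f"Rate among non-smokers: {nonsmoker_rate:.3f}")
print(f"Rate ratio (smokers vs. non-smokers): {smoker_rate / nonsmoker_rate:.1f}")

# Hypothetical retrospective (case-control) design: start from cases and a
# comparison group, then look back at past exposure; because the outcome
# proportions are fixed by the design, exposure is compared with an odds ratio.
cases_exposed, cases_unexposed = 40, 60        # invented counts
controls_exposed, controls_unexposed = 20, 80  # invented counts
odds_ratio = (cases_exposed / cases_unexposed) / (controls_exposed / controls_unexposed)
print(f"Odds ratio (exposed vs. unexposed): {odds_ratio:.1f}")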
It is well documented that the four Vs of big data – volume, velocity, variety, and veracity – are well applied in the discipline of epidemiology. In examining the methods and the applications of epidemiology, it is apparent that the amount of information (volume) that is available via big data is a resource that will continue to help descriptive epidemiologists identify the people, place, and time factors of health and disease. The speed (velocity) at which big data can be collected, analyzed, disseminated, and accessed will continue to provide epidemiologists with improved methods for conducting analytic epidemiological studies. The different types of big data (variety) that are available for epidemiological analysis offer epidemiologists opportunities for research and application that were not even thought of only a few years ago. The truth (veracity) that big data provides the discipline of epidemiology can only lead to improvements in processes of health promotion and disease prevention.

Early Uses of Big Data in Epidemiology

The discipline of epidemiology can be traced back to Hippocrates, who speculated about the relationship between environmental factors and the incidence of disease. Later, in 1663, an amateur statistician, John Graunt, published a study concerning the effects of the bubonic plague on the mortality rates in London. Using these data, Graunt was able to produce the first life table that estimated the probabilities of survival for different age groups. However, the beginning of modern epidemiology, and a precursor to the use of what we now call big data, can be traced back to Dr. John Snow and his work with the cholera epidemic that broke out in London, England, in the mid-1800s.

The father of modern epidemiology, John Snow, was a physician who practiced in London, England. As a physician, Dr. Snow was very aware of the London epidemic of cholera that broke out in the mid-1800s. Cholera is a potentially deadly bacterial intestinal infection that is caused by, and transmitted through, the ingestion of contaminated water or foods. Snow used what for his day were big data techniques and technologies. In doing so, Dr. Snow had a map of London on which he plotted the geographical location of those people who contracted cholera. By studying his disease distribution map, Snow was able to identify geographical areas that had the highest cluster of people who were infected with the disease. Through further investigation of the high outbreak area, Dr. Snow determined and demonstrated that the population in the affected area received their water from a specific source known as the Broad Street pump. Snow persuaded public officials to close the well, which in turn led to a significant decrease in the incidence of cholera in that community. This mapping and plotting of the incidence of disease and the application of his discovery for the improvement of public health were the beginning of the discipline of epidemiology, and for its day, it was a clear and practical application of big data and big data techniques.

Contemporary and Future Uses of Big Data in Epidemiology

The value of epidemiological science is only as strong as the data it has at its disposal. In the early years and up until fairly recently, epidemiological data were collected from self-reports from infected patients and reports from the practitioners who diagnosed the patients. These individuals had the responsibility for reporting the occurrence and characteristics of the problem to a designated reporting agency (e.g., the Centers for Disease Control and Prevention, a local health department, a state health department). The reporting agency then entered the data into a database that may have been difficult to share with other reporting agencies. As much as possible, the available data were shared with epidemiologists, statisticians, medical professionals, public health professionals, and health educators who would use the data to facilitate positive health outcomes.

Now, in the age of big data, data collection, analysis, retrieval, and dissemination have greatly improved the previous process. Some of the ways big data is being used in epidemiology involve the use of geographic information systems (GIS). Information from this technology is being used by health program planners and epidemiologists to identify and target specific public health interventions that will meet the needs of specific populations. GPS data has enabled expanded and improved methods for disease tracking and mapping. In many cases, smartphones enable patients with chronic diseases to transmit both location and symptom data that enables epidemiologists to find correlates between environmental factors and symptom exacerbation.
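A minimal sketch of the mapping-and-clustering step described above (and pioneered by Snow's cholera map) follows. The case coordinates are invented, and the grid-binning approach is only one simple way to locate a dense cluster; real GIS workflows use proper spatial statistics.

from collections import Counter

# Hypothetical case locations as (latitude, longitude) pairs, e.g., geocoded
# case reports or smartphone-transmitted locations.
cases = [
    (51.5134, -0.1366), (51.5136, -0.1370), (51.5133, -0.1362),
    (51.5200, -0.1000), (51.5135, -0.1368), (51.5050, -0.1500),
]

def cell(lat, lon, size=0.005):
    """Assign a point to a coarse grid cell of roughly `size` degrees."""
    return (round(lat / size), round(lon / size))

counts = Counter(cell(lat, lon) for lat, lon in cases)
(hot_lat_idx, hot_lon_idx), n = counts.most_common(1)[0]
print(f"Densest cell centred near ({hot_lat_idx * 0.005:.4f}, {hot_lon_idx * 0.005:.4f}) "
      f"with {n} of {len(cases)} cases")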
Big data applications in public health and health care are currently being used in many other areas, and it is only reasonable to expect that over time these uses will expand and become more sophisticated. Some of the currently functioning big data applications include the development of a sophisticated international database to trace the clinical outcomes and cost of cancer and related disorders. In addition, medical scientists and public health specialists have developed a large international big data system to share information about spinal cord injury and its rehabilitation. Another big data application has been in the area of matching donors and recipients for organ transplantation.

Several major scientific and research groups have been convened to discuss big data and epidemiology. The consensus of opinion is that a major value of big data is dependent upon the willingness of scientists to share their data, their methodologies, and their findings in open settings, and that researchers need to work collaboratively on the same and similar problems. With such cooperation and team efforts, big data is seen as having great potential to improve the nation's and the world's health.

As more sophisticated data sets, statistical models, and software programs are developed, it is logical to predict that the epidemiological applications of big data will expand and become more sophisticated. As a corollary to this prediction, it is also reasonable to predict that big data will make many significant contributions to world health and safety.

Conclusion

Epidemiology is a public health discipline that studies the causes, the effects, the description of, and the quantification of health phenomena in specific identifiable populations. Since the discipline concerns world populations, the field is a natural area for the application of big data. It should be noted that several scientific conferences have been conducted on this topic. The consensus from these conferences is that big data has the potential of making great strides in the improvement of world health epidemiological studies. However, they also note that in order to achieve this potential, data will need to be shared openly and researchers will need to work cooperatively in their use of big data.

Cross-References

▶ Biomedical Data
▶ Data Quality Management
▶ Prevention

Further Reading

Andrejevic, M., & Gates, K. (2014). Big data surveillance: Introduction. Surveillance & Society, 12(2), 185–196.
Kao, R. R., Haydon, D. T., Lycett, S. J., & Murcia, P. R. (2014). Supersize me: How whole-genome sequencing and big data are transforming epidemiology. Trends in Microbiology, 22(5), 282–291.
Marathe, M. V., & Ramakrishnan, N. (2013). Recent advances in computational epidemiology. IEEE Intelligent Systems, 28(4), 96–101.
Massie, A. B., Kuricka, L. M., & Segev, D. L. (2014). Big data in organ transplantation: Registries and administrative claims. American Journal of Transplantation, 14(8), 1723–1730.
Michael, K., & Miller, K. W. (2013). Big data: New opportunities and new challenges. Computer, 46(6), 22–24.
Naimi, A. I., & Westreich, D. J. (2014). Big data: A revolution that will transform how we live, work, and think. American Journal of Epidemiology, 179(9), 1143.
Nielson, J. L., et al. (2014). Development of a database for translational spinal cord injury research. Journal of Neurotrauma, 31(21), 1789–1799.
Vachon, D. (2005). Doctor John Snow blames water pollution for cholera epidemic. Old News, 16(8), 8–10.
Webster, M., & Kumar, V. S. (2014). Big data diagnostics. Clinical Chemistry, 60(8), 1130–1132.

Error Tracing

▶ Anomaly Detection

Ethical and Legal Issues

Rochelle E. Tractenberg
Collaborative for Research on Outcomes and Metrics, Washington, DC, USA
Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics; and Rehabilitation Medicine, Georgetown University, Washington, DC, USA

Definition

"Ethical and legal issues" are a subset of "ethical, legal, and social issues" – or ELSI – where the "I" sometimes also refers to "implications." This construct was introduced in 1989 by the National Program Advisory Committee on the Human Genome in the United States, with the intention of supporting exploration, discussion, and eventually the development of policies that anticipate and address the ethical, legal, and social implications of/issues arising from the advanced and speedily advancing technology associated with genome mapping. Since this time, and with ever-increasing technological advances that have the potential to adversely affect individuals and groups (much as genome research has) – including both research and non-research work that involves big data – ELSI relating to these domains are active areas of research and policy development. Since social implications of/issues arising from big data are discussed elsewhere in this
encyclopedia, this entry focuses on just ethical and legal implications and issues.

Introduction

There are ELSI in both research and nonscience analysis involving big data. However, whatever they are known to be right now, one of the principal issues in both research and other analysis of big data is actually that we cannot know or even anticipate what they may be in the future. Therefore, training in the identification of, and appropriate reaction to, ELSI is a universally acknowledged need; but this training must be comprehensive without being overly burdensome, and everyone involved should reject the lingering notion that either the identification of or the response to an ethical or legal problem "is just common sense." The idea that ethical professional practice is simply a matter of common sense is an outdated – and never correct – one, representing the perspective that scientists work within their own disciplinary silo in which all participants follow the same set of cultural norms. Modern work – with big data – involves multiple disciplines and is not uniquely/always scientific. The fact that these "cultural norms" have always been ideals, rather than standards to which all members of the scientific community are, or could be, held, is itself an ethical issue for research and analysis in big data. The effective communication of these ideals to all who will engage with big data, whether as researchers or nonscientific data analysts, is essential.

Legal Implications/Issues

Legal issues have traditionally been focused on liability (i.e., limiting it), e.g., for engineering and other disciplines where legal action can result from mistakes, errors, or malfeasance. In the scientific domains, legal issues tend to focus only on plagiarism, falsification of data or results, and fraud, e.g., making false claims in order to secure funding for research and knowingly misinterpreting or reusing unrelated data to trick readers into accepting an argument or claim. These definitions are all quite specific to the use of data and the scientific focus on the generation and review of publications and grants, and therefore must be reconsidered when non-research analyses or interpretations/inferences are the focus of work. By contrast, when considering data itself, including its access, use, and management, the legal issues have focused on protecting the privacy of the individuals from whom data are obtained (sometimes without the individuals' knowledge or permission) and the ownership of this data (however, see, e.g., Dwork and Mulligan 2013; Steinmann et al. 2016). Since individuals are typically unable to "benefit" in any way from their own data alone, those who collect data from large numbers of individuals (e.g., through healthcare systems, through purchasing, browsing, or through national data collection efforts) expend both the resources to collect the data and those required to analyze them in the aggregate. This is the basis of some claims that, although the individual may contribute their own data to a "big" data set, the owner of that data is the person or agency that collects and houses/manages that data for analysis and use.

Additional legal issues may arise when conclusions or inferences are made based on aggregated data that result in biased decisions, policies, and/or resource allocation against any group (Dwork and Mulligan 2013). Moreover, those who collect, house, and manage data from individuals incur the legal (and ethical) obligation to maintain the security and the integrity of that data – to protect the source of the data (the individuals about whom it is ultimately descriptive) and to ensure that decisions and inferences based on that data are unbiased and do not adversely affect these individuals.
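One minimal technical safeguard often used to protect the individuals who are the source of the data is to replace direct identifiers with keyed pseudonyms before data are shared or aggregated. The sketch below is illustrative only and is not a complete privacy solution (it does not, by itself, prevent re-identification from the remaining attributes); the secret key, record fields, and values are all invented.

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # illustrative; real systems use key-management services

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed, repeatable pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "A-12345", "zip": "20007", "diagnosis": "I10"}  # invented record
shared = {**record, "patient_id": pseudonymize(record["patient_id"])}
print(shared)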
Claims based on big data include formally derived risk estimates, e.g., from health systems data about risk factors that can be modified or otherwise targeted to improve health outcomes in the aggregate (i.e., not specifically for any single individual) and from epidemic or pandemic trends such as influenza, Zika, and Ebola. However, they also include informal risk descriptions, such as claims made in support of the Brexit vote (2016) or climate change denial (2010–2017). False claims based on formal analyses may arise from faulty assumptions rather than the intent to defraud or mislead and so may be difficult to label as "illegal." Fraud relating to big data may become a legal issue in business and industry contexts where shareholders or investors are misled intentionally by inappropriate analyses; national and international commercial speech representing false claims – whether arising from very large or typical-size data sets – is already subject to laws protecting consumers. Governments or government agents falsifying big data, or committing fraud, may be subject to sanctions by external bodies (e.g., see consumer price index estimation/falsification in Argentina, 2009; gross domestic product estimation in Greece, 2016) that can lead foreign investors to distrust their data or analyses. These recent examples used extant law (i.e., not new or data-specific regulations) to improperly prosecute competent analysts whose results did not match the governments' self-image. Thus, much of the current law (nationally and internationally) relating to big data can be extrapolated for new cases, although future legal protections may be needed as more data become more widely available for the confirmation of results that some government bodies wish to conceal.

Ethical Implications/Issues

By contrast to the legal issues, ethical issues relating to big data research and practice are much less straightforward. Two major challenges to understanding, or even anticipating, ethical implications of/issues arising from the collection, use, or interpretation of big data are that (1) most training on ethics for those who will eventually work in this domain is focused on research and not on practice that will not result in peer review/publication; and (2) this training is typically considered to be "discipline specific" – based on norms for specific scientific domains.

People who work with, but do not conduct research in or with, big data may feel or be told – falsely – that there are no ethical issues with which they should be concerned. Because much of the training in "ethical conduct in research" relates to interactions with research subjects and other researchers, publications, and professional conduct in research or laboratory settings, it can appear – erroneously – that these are the only ethical implications of interacting or working with big data (or any data).

The fact that training in ethical research practices is typically considered to be discipline specific is another worrisome challenge. Individuals from many different backgrounds are engaging in both research and non-research work with big data, suggesting that no single discipline can assert its norms or code of conduct as "the best" approach to training big data workers in ethical professional behavior. Two extremely damaging attitudes impeding addressing this growing challenge are the attitudes that (a) "ethical behavior is just common sense" and (b) whatever is not illegal is actually acceptable. In 2016 alone, two edited volumes outlined ethical considerations around research and practice with big data. Both the American Statistical Association (http://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx) and the Association for Computing Machinery (https://www.acm.org/about-acm/acm-code-of-ethics-and-professional-conduct) have codes of ethical practice which accommodate the analysis, management, collection, and interpretation of data (big or "small"), and the Markkula Center for Applied Ethics at Santa Clara University maintains (as of January 2017) a listing of ethical uses of, and considerations about, big data: https://www.scu.edu/ethics/focus-areas/internet-ethics/articles/articles-about-ethics-and-big-data/. A casual reading of any of these articles underscores that there is very little to be gleaned from "common sense" relating to ethical behavior when it comes to big data, and the major concern of groups from trade and professional associations to governments around the world is that technology may change so quickly that anything which is rendered illegal today may become obsolete tomorrow. However, it is widely argued (and, some might argue, just as widely ignored) that what is "illegal" is only a subset of what is "unethical" in every context except the ones where these are explicitly linked (e.g., medicine, nursing, architecture, engineering). Ethical implications arise in both research and nonscience
analysis involving big data, and we cannot know or even anticipate what they may be in the future. These facts do not remove the ethical challenges or prevent their emergence. Thus, all those who are being trained to work with (big) data should receive comprehensive training in ethical reasoning to promote the identification of, and appropriate response to, ethical implications and issues arising from work in the domain.

Conclusion

Ethical and legal implications of the collection, management, analysis, and interpretation of big data exist and evolve as rapidly as the technology and methodologies themselves. Because it is unwieldy – and essentially not possible – to consider training all big data workers and researchers in the ELSI (or just ethical, or just legal, implications) of the domain, training in reasoning and practicing with big data ethically needs to be comprehensive and integrated throughout the preparation to engage in this work. Modern work – with big data – involves multiple disciplines and is not uniquely research oriented. The norms for professionalism, integrity, and transparency arising from two key professions aligned with big data – statistics and computing – are concrete, current, consensus-based codes of conduct, and their transmission to all who will engage with big data, whether as researchers or workers, is essential.

Further Reading

Collmann, J., & Matei, S. A. (Eds.). (2016). Ethical reasoning in big data: An exploratory analysis. Cham, CH: Springer International Publishing.
Dwork, C., & Mulligan, D. K. (2013). It's not privacy, and it's not fair. Stanford Law Review Online, 66(35), 35–40.
Mittelstadt, B. D., & Floridi, L. (Eds.). (2016). The ethics of biomedical big data. Cham, CH: Springer International Publishing.
Steinmann, M., Shuster, J., Collmann, J., Matei, S., Tractenberg, R. E., FitzGerald, K., Morgan, G., & Richardson, D. (2016). Embedding privacy and ethical values in big data technology. In S. A. Matei, M. Russell, & E. Bertino (Eds.), Transparency on social media – tools, methods and algorithms for mediating online interactions (pp. 277–301). New York, NY: Springer.

Ethics

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Big data ethics focus on the conduct of individuals and organizations, both public and private, engaged in the application of information technologies (IT) to the construction, acquisition, manipulation, analytics, dissemination, and management of very large datasets. In application, the purpose of big data codes of ethics is to delineate the moral dimensions of the systematic computational analyses of structured and unstructured data and their attendant outcomes and to guide the conduct of actors engaged in those activities.

The expanding domains of big data collection and analytics have introduced the potential for pervasive algorithm-driven power asymmetries that facilitate corruption and the commodification of human rights within and across such spheres as health care, education, and access to the workplace. By prejudging human beings, algorithm-based predictive big data analytics and practices may be used to perpetuate or increase de jure and de facto inequalities regarding access to opportunities for well-being based on, for example, gender, ethnicity, race, country of origin, caste, language, and also political ideology or religion. Related asymmetries are understood in terms of ethical obligations versus violations and are framed in terms of what should or should not occur, such that they should be eliminated. To that end, big data ethics typically are discussed along broadly practical dimensions related to methodological integrity, bias mitigation, and security and data privacy.

Method Integrity

To ensure heuristic integrity, big data analytics must meet specific ethical obligations and provide the appropriate documentation. Disclosure of research methods requires specific kinds of information, such as a statement of the problem, a clear
definition of a research problem, and the research goals and objectives. A data collection and management plan should provide a statement of the data analyzed, or to be analyzed, and methods of collection, storage, and safekeeping. Data privacy and security assurance enforcement oversight and processes should be explicitly stated, with a description of the mechanisms and processes for ensuring data privacy and security. For example, in the USA, the Health Insurance Portability and Accountability Act (HIPAA) information and Personally Identifiable Information (PII) require special considerations to ensure that data privacy and security requirements are met. In addition, a statement of data currency must specify when the data were, or are to be, collected. Where appropriate and applicable, hypothesis formulation should be clearly explained, describing the hypotheses used, or to be used, in the analysis and their formulation. Similarly, hypothesis testing should be explained, describing hypothesis testing paradigms, including, for example, algorithms, units of analysis, units of measure, etc. applied, or to be applied, in the research. Likewise, results dissemination should be explained in terms of dissemination mechanisms and processes. A statement of replicability should also be included, indicating how methods and data can be acquired and how the research can be duplicated by other analysts.

Bias Mitigation

Big datasets, by their very size, make it difficult to identify and mitigate different biases. For example, algorithmic bias includes systematic, repeatable errors introduced (intentionally or unintentionally) by formulae and paradigms that produce predisposed outcomes that arbitrarily assign greater value or privileges to some groups over others. Sampling bias refers to systematic, repeatable errors introduced by data that reflect historical inequalities or asymmetries. Cultural bias can be based on systematic, repeatable errors introduced by analytical paradigms that reflect personal or community mores and values, whereas ideological bias is systematic, repeatable errors introduced by analytical designs and algorithms that reflect specific perspectives or dogmas. Epistemic bias reflects stove-piping: systematic, repeatable errors introduced by the adherence to professional or academic points of view shared within specific disciplines without exploring other perspectives external to those disciplines and the perpetuation of intellectual silos.

Data Privacy and Security Assurance

Big data, especially in cloud-based environments, require special care to assure that data security and privacy regimens are specified and maintained. Data privacy processes ensure that sensitive personal or organizational data are not acquired, manipulated, disseminated, or stored without the consent of the subjects, providers, or owners. Data security processes protect data from unauthorized access, and include data encryption, tokenization, hashing, and key management, among others.

Summary

Big data ethics guide the professional conduct of individuals engaged in the acquisition, manipulation, analytics, dissemination, and management of very large datasets. The proper application of big data ethics ensures method integrity and transparency, bias mitigation, data privacy assurance, and data security.

Further Reading

American Statistical Association. American Statistical Association ethical guidelines for statistical practice. Available from: https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx
Association for Computing Machinery. ACM code of ethics and professional conduct. Available from: https://www.acm.org/code-of-ethics
Data Science Association. Data Science Association code of professional conduct. Available from: https://www.datascienceassn.org/code-of-conduct.html
Institute of Electrical and Electronics Engineers. IEEE ethics and member conduct. Available from: https://www.ieee.org/about/corporate/governance/p7-8.html
Richterich, A. (2018). The big data agenda: Data ethics and critical data studies. London: University of Westminster Press.
United States Senate. The Data Accountability and Transparency Act of 2020 draft. Available from https://www.banking.senate.gov/download/brown_2020-data-discussion-draft
Zwitter, A. (2014). Big data ethics. Big Data & Society, 1(2), 1–6.

Ethnographic Observation

▶ Contexts

European Commission

Chiara Valentini
Department of Management, Aarhus University, School of Business and Social Sciences, Aarhus, Denmark

Introduction

The phenomenon of big data and how organizations collect and handle personal information are often discussed in relation to human rights and data protection. Recent developments in legislation about human rights, privacy matters, and data protection are taking place more and more at the European Union level. The institution that proposes and drafts legislation is the European Commission. The European Commission is one of three EU institutions in charge of policy making. It has proposal and executive functions and represents the interests of the citizens of the European Union. Specifically, it is in charge of setting objectives and political priorities for action. It proposes legislation that is approved by the European Parliament and the Council of the European Union. It oversees the management and implementation of EU policies and the EU budget. Together with the European Court of Justice, it enforces European law, and it represents the EU outside the European Union zone, for example, in negotiating trade agreements between the EU and other countries (Nugent 2010). It has its headquarters in Brussels, Belgium, but also has offices in Luxembourg. The European Commission is also present with its own representative offices in each EU member state. The representations of the European Commission can be considered the "eyes" and "ears" of the Commission at the local level, providing the headquarters with updated information on major issues of importance occurring in each member state.

Election Procedure and Organizational Structure

The European Commission is formally a college of commissioners. Today, it comprises 28 commissioners, including the President and the Vice-Presidents (European Commission 2017a). The commissioners are in charge of one or more portfolios, that is, they are responsible for specific policy areas (Nugent 2010).

Until 1993, the European Commission was appointed every 4 years by common accord of the governments of member states, and initially the number of commissioners reflected the number of states in the European Community. After the introduction of the Treaty of Maastricht in 1993, the length of the mandate and the election procedures were revised. The European Commission mandate was changed to 5 years, with the college of commissioners appointed 6 months after the European Parliament elections. Furthermore, the composition of the European Commission has to be negotiated with the European Parliament. The candidate for President of the European Commission is also chosen by the governments of EU member states in consultation with the European Parliament. The commissioners, who were in the past nominated by the governments of member states, are chosen by the President of the European Commission. Once the college of commissioners is formed, it needs to get its approval from the Council of the European Union and the European Parliament (Nugent 2010). The position of the European Parliament in influencing the
composition of the college of commissioners and the election of the president of the college of commissioners, that is, the President of the European Commission, was further strengthened with subsequent treaties. The latest treaty, the Lisbon Treaty, also stipulated that one of the commissioners should be the person holding the post of High Representative of the Union for Foreign Affairs and Security Policy. This position somewhat resembles that of a Minister of Foreign Affairs, yet with more limited powers.

The main representative of the European Commission with other EU institutions and with external institutions is the President. While all decisions made in the European Commission are collective, the President's main role is to give a sense of direction to the commissioners. He or she allocates commissioners' portfolios, has the power to remove commissioners from their post, and is directly responsible for the management of the Secretariat General, which is in charge of all activities in the Commission. The President also maintains relations with the other two decision-making institutions, that is, the European Parliament and the Council of the European Union, and can assume specific policy responsibilities on his/her own initiative (Nugent 2010).

The college of commissioners represents the interests of the entire union. Commissioners are, thus, asked to be impartial and independent from the interests of their country of origin in performing their duties. Commissioners are generally national politicians of high rank, often former national ministers. They hold one or more portfolios. Prior to the implementation of the Amsterdam Treaty, when the President of the European Commission gained more power to decide which commissioners should hold which portfolio, the distribution of portfolios among commissioners was largely a matter of negotiation between national governments and of political balance among the member states (Nugent 2010).

Each commissioner has his/her own cabinet that helps to perform different duties. While originally the civil servants working in a commissioner's cabinet came from the same country as the commissioner, from the late 1990s each cabinet was required to have civil servants of at least three different nationalities. This decision was made to prevent specific national interests from dominating the discussion on policy developments. The cabinets perform research and policy analyses that are essential in keeping the commissioners informed about developments in their assigned policy areas, but they also help the commissioners stay updated on other cabinets' and commissioners' activities.

The European Commission's Legislative Work

Administratively speaking, the European Commission is divided into Directorate-Generals (DGs) and other services, which are organizational units specializing in specific policy areas and services. According to the European Commission, over 32,500 civil servants worked for the European Commission in one of these units in summer 2017 (European Commission 2017b). The European Commission's proposals are prepared by one of these DGs. Drafts of proposals are crafted by middle-ranking civil servants in each DG. These officers often rely on outside assistance, for instance, from consultants, academics, national experts, officials, and interest groups, too. Draft proposals are scrutinized by the Secretariat General to meet the existing legal requirements and procedures. The approved draft is then inspected by senior civil servants in the DGs and the cabinet personnel, and finally reaches the commissioners. The draft proposal is shaped and revised continuously during this process. Once the college of commissioners meets to discuss and approve the draft, they may accept it in the submitted form, reject it, or ask for revisions. If revisions are requested, the draft goes back to the responsible DG (Nugent 2010).

The European Commission's proposals become official only once the college of commissioners adopts them. The decisions are taken by consensus, but majority voting is possible. Typically, the leadership for making proposals pertaining to specific policy areas lies with the commissioner holding the portfolio in question.
Proposals related to fundamental rights, data protection, and citizens' justice are typically carried out by the DG for Justice. Because political, social, and economic issues may affect several policy areas, informal and ad hoc consultations occur between the different commissioners who may be particularly affected by a proposal. There are also groups of commissioners in related and overlapping policy areas that facilitate cooperation across the DGs and enable the discussion when the college of commissioners meets.

The European Commission's Position Toward Big Data

The European Commission released a position document in summer 2014 in response to the European Council's call for action in autumn 2013 to develop a single market for big data and cloud computing. The European Commission is overall positive toward big data and sees data at the center of the future knowledge economy and society. At the same time, it stresses that an unregulated use of data can undermine fundamental rights. To develop a digital economy in Europe as recommended by the European Council, the Commission has proposed a framework to boost data-driven innovation through the creation of infrastructures that allow for quality, reliable, and interoperable datasets. Among other things, the Commission seeks to improve the framework conditions that facilitate value generation from datasets, for example, by supporting the creation of collaborations among different players such as universities and public research institutes, private entities, and businesses (European Commission 2014b). The European Commission's actions to achieve these goals revolve around five main initiatives. First, it intends to create a European public-private partnership on data and to develop digital entrepreneurship. Second, the Commission seeks to develop an open data incubator, that is, programs designed to support the successful development of entrepreneurial companies. Third, it aims at increasing the development of specific competences necessary to have more skilled data professionals, who are considered important for developing a knowledge economy. Fourth, it plans to establish data market monitoring tools to measure the size and trends of the European data market. Finally, it intends to consult and engage stakeholders and research communities across industries and fields to identify major priorities for research and innovation in relation to the digital economy (European Commission 2014b).

Since 2014, the European Commission has worked on developing frameworks that promote open data policy, open standards, and data interoperability. To ensure that the development of big data initiatives does not undermine fundamental rights to personal data protection, the European Commission has worked on revising EU legislation. A revised version of the 2012 proposal for a data protection regulation was approved by the European Parliament and by the Council of the European Union in 2014. To guarantee security and data protection, as well as to support organizations in implementing big data initiatives, the Commission intends to work with each member state and relevant stakeholders to inform and guide private organizations on issues related to data collection and processing, such as data anonymization and pseudonymization, data minimization, and personal data risk analysis. The Commission is also interested in enhancing consumer awareness of big data and of data protection rights (European Commission 2014b).

In relation to cloud computing, the Commission initiated in 2012 a strategy to achieve a common agreement on standardization guidelines for cloud computing services. Relevant stakeholders and industry leaders were invited to discuss and propose guidelines in 2013. A common agreement was reached, and guidelines were published in June 2014. Despite the diversity of technologies, businesses, and national and local policies, the guidelines aim at facilitating the comparability of service level agreements in cloud computing, providing clarity for cloud service customers, and generating trust in cloud computing services (Watts et al. 2014, p. 596). They are thus considered a step forward in regulating data mining and processing in Europe and around the world.
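Two of the safeguards mentioned in this section, pseudonymization and data minimization, can be made concrete with a short sketch. The following Python fragment is only an illustration of the general idea; the field names, the keyed-hash approach, and the secret key are assumptions introduced here for the example and are not techniques prescribed by the European Commission.

```python
# Minimal illustration (not an official or prescribed method) of two safeguards
# discussed above: pseudonymization and data minimization. All names are hypothetical.
import hashlib
import hmac

SECRET_KEY = b"keep-this-key-separate-from-the-data"  # held only by the data controller

def pseudonymize(record, needed_fields):
    """Replace the direct identifier with a keyed hash and keep only the fields needed."""
    token = hmac.new(SECRET_KEY, record["name"].encode(), hashlib.sha256).hexdigest()
    minimized = {k: v for k, v in record.items() if k in needed_fields}  # data minimization
    minimized["pseudonym"] = token  # stable pseudonym, re-linkable only with the key
    return minimized

raw = {"name": "Jane Doe", "postcode": "8000", "diagnosis": "flu", "phone": "+45 0000 0000"}
print(pseudonymize(raw, needed_fields={"postcode", "diagnosis"}))
```

Because re-linking a pseudonym to a person requires the separately held key, keeping that key apart from the dataset is what distinguishes pseudonymized data from data that has merely been relabeled.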
The European Commission's Concerns on Big Data

The European Commission has recognized that the big data phenomenon offers several opportunities for growth and innovation, yet sufficient measures to safeguard the data privacy and security of users and consumers are still lacking. A number of challenges have been identified, including the anonymization of data, that is, data which is deprived of direct or indirect personal identifiers, and the problem of creating personal profiles based on patterns of individuals' behaviors obtained through data mining algorithms. The creation of such personal profiles could be misused and affect how individuals are treated. The European Commission is particularly concerned with maintaining the fundamental rights of citizens and with the possible misuse of EU citizens' personal information by data collectors and through data processing. There are also a number of security risks, particularly when data collected by a device are transferred elsewhere.

The lack of transparency about when and how data is collected and for which purposes is among the major concerns. Countless data from European citizens are collected and processed by non-EU companies and/or outside the EU zone. Therefore, various types of data are handled in many different manners, and existing EU legislation falls short in regulating possible abuses and issues occurring beyond its jurisdiction. The European Commission has already initiated negotiations with well-established trade partners such as the United States. Since March 2011, the European Commission has been negotiating procedures for transferring personal data from the EU to the US for law enforcement purposes. The European Commission does not consider the US-EU Safe Harbor Framework a sufficient measure to protect fundamental rights such as the data privacy and security of Europeans (European Commission 2012). The US-EU Safe Harbor Framework, approved in 2000, is a framework that allows registered US organizations to have access to data from EU citizens upon declaring that they have adequate privacy protection mechanisms in place (Export.gov 2013). The concern about data security and privacy has increased after the revelations of the National Security Agency (NSA) data collection practices on Europeans and subsequent statements by the NSA that data on European citizens was supplied by European intelligence services according to existing agreements (Brown 2015). The European Commission has initiated a process for regulating data processing. With the United States, an agreement called the data protection umbrella agreement has been reached in certain areas. The agreement deals with personal data such as names, addresses, and criminal records transferred from the EU to the US for reasons of prevention, detection, investigation, and prosecution of criminal offences, including terrorism. Yet, the EU and US had for quite some time different opinions on the right of effective judicial redress that should be granted by the US to EU citizens not resident in the United States (European Commission 2014a).

On 2 June 2016, the EU and the US agreed on a wide-ranging, high-level data protection framework for criminal law enforcement cooperation, called the "umbrella agreement." The agreement should improve, in particular, EU citizens' rights by providing equal treatment with US citizens when it comes to judicial redress rights before US courts (CEU 2016).

Critics further point out that the European Commission's regulatory proposal for data protection lacks the capacity to fully safeguard European citizens' rights. Specifically, the problem of anonymization of data is considered by some not to be well addressed by the European Commission's proposal, since data that is stripped of names and direct identifiers can still be associated with specific individuals using a limited amount of publicly available additional data (Aldhouse 2014). This could leave space for commercial entities and third-party vendors to exploit the potential of big data while legally abiding by EU legislation.
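The re-identification concern raised by Aldhouse (2014) can be illustrated with a toy linkage example. Everything below (the records, field names, and matching rule) is invented for illustration and is not drawn from the Commission's proposal or from any real dataset.

```python
# Toy illustration of re-identification: "anonymized" rows are joined to a public,
# named register on a few quasi-identifiers. All data here is invented.
deidentified_health = [
    {"postcode": "8000", "birth_year": 1971, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "8000", "birth_year": 1985, "sex": "M", "diagnosis": "diabetes"},
]
public_register = [
    {"name": "Jane Doe", "postcode": "8000", "birth_year": 1971, "sex": "F"},
]

def link(anon_rows, named_rows, keys=("postcode", "birth_year", "sex")):
    """Match records that agree on every quasi-identifier."""
    matches = []
    for a in anon_rows:
        for p in named_rows:
            if all(a[k] == p[k] for k in keys):
                matches.append({"name": p["name"], "diagnosis": a["diagnosis"]})
    return matches

print(link(deidentified_health, public_register))
# [{'name': 'Jane Doe', 'diagnosis': 'asthma'}]  -- the record is re-identified
```

Although the health records contain no names, a handful of publicly available attributes is enough to put a name back on a record, which is precisely the gap critics argue the proposal leaves open.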
Cross-References

▶ Big Data Theory
▶ Cloud Computing
▶ Cloud Services
▶ Data Mining
▶ Data Profiling
▶ European Commission: Directorate-General for Justice (Data Protection Division)
▶ European Union
▶ National Security Administration (NSA)
▶ Open Data
▶ Privacy

Further Reading

Aldhouse, F. (2014). Anonymisation of personal data – A missed opportunity for the European Commission. Computer Law and Security Review, 30(4), 403–418.
Brown, I. (2015). The feasibility of transatlantic privacy-protective standards for surveillance. International Journal of Law and Information Technology, 23(1), 23–40.
CEU. (2016, June 2). Enhanced data protection rights for EU citizens in law enforcement cooperation: EU and US sign "Umbrella agreement". Press release of the Council of the European Union. http://www.consilium.europa.eu/en/press/press-releases/2016/06/02-umbrella-agreement/. Accessed 7 July 2016.
European Commission. (2012, October 9). How will the 'safe harbor' arrangement for personal data transfers to the US work? http://ec.europa.eu/justice/policies/privacy/thridcountries/adequacy-faq1_en.htm. Accessed 21 Oct 2014.
European Commission. (2014a, June). Factsheet EU-USA negotiations on data protection. http://ec.europa.eu/justice/data-protection/files/factsheets/umbrella_factsheet_en.pdf. Accessed 21 Oct 2014.
European Commission. (2014b, July 2). Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of Regions: Towards a thriving data-driven economy. COM(2014) 442 final. http://ec.europa.eu/information_society/newsroom/cf/dae/document.cfm?action=display&doc_id=6210. Accessed 21 Oct 2014.
European Commission. (2017a). The Commissioners: The European Commission's political leadership. https://ec.europa.eu/commission/commissioners/2014-2019_en. Accessed 5 Sept 2017.
European Commission. (2017b). Staff members. http://ec.europa.eu/civil_service/docs/hr_key_figures_en.pdf. Accessed 5 Sept 2017.
Export.gov. (2013, December 18). U.S.-EU safe harbor overview. http://export.gov/safeharbor/eu/. Accessed 19 Oct 2014.
Nugent, N. (2010). The government and politics of the European Union (7th ed.). New York: Palgrave Macmillan.
Watts, M., Ohta, T., Collis, P., Tohala, A., Willis, S., Brooks, F., Zafar, O., Cross, N., & Bon, E. (2014). EU update. Computer Law and Security Review, 30(5), 593–598.

European Commission: Directorate-General for Justice (Data Protection Division)

Chiara Valentini
Department of Management, Aarhus University, School of Business and Social Sciences, Aarhus, Denmark

Introduction

Debates on big data, data-related situations, and international developments have increased in recent years. At the European level, the European Commission has created a subunit within its Directorate-General for Justice to study and monitor the big data phenomenon and its possible impact on legislative matters. The Directorate-General for Justice is one of the main departments of the European Commission, specializing in promoting justice and citizenship policies and protecting fundamental rights. The Directorate is in charge of the justice portfolio and proposes legislative documents in relation to four areas: civil justice, criminal justice, fundamental rights and union citizenship, and equality. The justice portfolio is relatively new; it was only created in 2010 under the leadership of President José Manuel Barroso. Previously, it was part of the former Directorate-General for Justice, Freedom and Security, which was split into two departments, the Directorate-General Home Affairs and the Directorate-General for Justice (European Commission 2014).

The Data Protection Division

The Data Protection Division is a subunit of the Directorate-General for Justice specializing in all aspects concerning the protection of individual data.
It provides the Directorate-General for Justice and the European Commission with independent advice on data protection matters and helps with the development of harmonized policies on data protection in the EU countries (CEC 1995). Data protection has become an important new policy area for the Directorate-General for Justice following the approval and implementation of the 1995 EU Data Protection Directive.

The 1995 directive, which came into force in 1998, has been considered the first and most successful instrument for protecting personal data (Bennett and Raab 2006; Birnhack 2008). The directive binds EU member states and three members of the European Economic Area (Iceland, Liechtenstein, Norway) to establish mechanisms to monitor how personal data flows across countries within the EU zone and also in and out of third countries. It requires authorities and organizations collecting data to have in place adequate protection mechanisms to prevent the misuse of sensitive data (Birnhack 2008). The directive restricts the capacity of organizations to collect any type of data. The processing of special categories of data on racial background, political beliefs, health conditions, or sexual orientation is, for example, prohibited. The directive has also increased the overall transparency of data collection procedures by expanding people's rights to know who gathers their personal information and when (De Hert and Papakonstantinou 2012). Authorities and organizations that intend to collect personal data have to notify individuals of their collection procedures and their data use. Individuals have the right to access the data collected and can deny certain processing. They also have the right not to be subjected to an automated decision, that is, a decision relegated to computers which gather and process data as well as suggest or make decisions silently and with little supervision (CEC 1995).

Due to rapid technological developments and the increased globalization of many activities, the European Commission started a process of modernization of the principles constituting the 1995 directive. First, in 2001, Regulation 45/2001 on data processing and its free movement in the EU institutions was introduced. This regulation aims at protecting individuals' personal data when the processing takes place in the EU institutions and bodies. Then, in 2002, a new directive on privacy and electronic communications was set to ensure the protection of privacy in the electronic communications sector. The 2002 directive was amended in 2006 to also include aspects related to the retention of data generated or processed in connection with the provision of publicly available electronic communications services, which are services provided by means of electronic signals over, for example, telecommunications or broadcasting networks, or of public communication networks (CEC 2006). In 2008 the protection was extended to include data collection and sharing within police and judicial cooperation in criminal matters.

Latest Developments in the Data Protection Regulation

Due to the increased development, use, and accessibility of the Internet and other digital technologies by individuals, companies, and authorities around the world, new concerns about privacy rights have attracted the attention of the European Commission. The 1995 directive and the directives that followed were considered insufficient to provide the legal framework for protecting Europeans' fundamental rights (Birnhack 2008; De Hert and Papakonstantinou 2012). Furthermore, the 1995 directive allowed member states a certain level of freedom in the methods and instruments for implementing EU legislation. As a result, the Data Protection Directive was often transposed into national legislation in very different manners, giving rise to enforcement divergences (De Hert and Papakonstantinou 2012). The fragmentation of data protection legislation and the administrative burden of handling all member states' different rules motivated the EU Commissioner responsible for the Directorate-General for Justice to propose a unified, EU-wide solution in the form of a regulation in 2012 (European Commission 2012).

The 2012 proposal includes a draft EU legislative framework comprising a regulation on general data protection directly applicable to all member states and a directive specifically for personal data protection that leaves discretion to member states to decide the form and method of application.
It also proposes the establishment of specific bodies that oversee the implementation of and respect for data protection rules. These are a data protection officer (DPO), to be located in every EU institution, and a European Data Protection Supervisor (EDPS). The DPO is in charge of monitoring the application of the regulation within the EU institutions, whereas the EDPS has the duty of controlling the implementation of data protection rules across member states (European Commission 2013).

The initial draft was revised several times to meet the demands of the European Parliament and the Council of the European Union, the two EU decision-making institutions. In spring 2014, the European Parliament supported and pushed for a vote on the Data Protection Regulation, which is an updated version of the regulation first proposed by the European Commission in 2012. The final approval required, however, the support of the other two institutions.

On 15 December 2015, the European Parliament, the Council, and the Commission reached an agreement on the new data protection rules, and on 24 May 2016 the regulation entered into force, but its application was not expected before 25 May 2018 to allow each member state to transpose it into national legislation (European Commission 2016). The approved version extends the territorial scope of its application, which means that the regulation will apply to all data processing activities concerning EU citizens even if the data processing does not take place in the European Union. The regulation includes an additional provision concerning the processing of children's personal data and moves the responsibility to the data controller, that is, the organization or authority collecting the data, to prove that consent for gathering and handling personal data was given. Another amendment that the European Parliament requested and obtained is the introduction of a new article about transfers and disclosures that are not authorized by European Union law. This article allows organizations and authorities that collect personal data to deny releasing information to non-European law enforcement bodies for reasons that are considered to be contrary to the European data protection principles. The "right to be forgotten," which is the right of individuals to have irrelevant or excessive personal information removed from search engine results, is also included (Rees and Heywood 2014).

Possible Impact of the Data Protection Regulation

Critics noted that the introduction of the data protection regulation may substantially affect third-party vendors and those organizations that use third-party data, for example, for online marketing and advertising purposes. The regulation will demand that data collectors who track users on the web, the pages they visit, the amount of time spent on each page, and any other online movement prove that they have obtained individuals' consent to use and sell personal data; otherwise they will have to pay high infringement fines. This regulation may impact the activities of multinational companies and international authorities, since it is expected that the new EU data protection standards apply to any data collected on EU citizens, no matter where the data is processed. Google, Facebook, and other Internet companies have lobbied against the introduction of this data protection regulation, but with little success (Chen 2014). The EU debate on data protection regulation seems to have sparked international debates on data protection in other non-EU countries and on the fitness of their national regulations. For instance, not long after the European Parliament voted for the General Data Protection Regulation, the state of California, in the U.S., passed a state law that requires technology companies to remove material posted by a minor, if the user requests it (Chen 2014).
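The accountability idea described above, that a data collector must be able to prove purpose-specific consent was obtained before using or selling personal data, can be sketched in a few lines. This is a hypothetical illustration only; the regulation does not mandate any particular data structure, and the identifiers and purposes below are invented.

```python
# Hypothetical consent ledger: processing proceeds only if documented,
# purpose-specific consent can be produced for the data subject.
from datetime import datetime, timezone

consent_ledger = {
    # (subject_id, purpose) -> evidence of when and how consent was captured
    ("user-42", "online_advertising"): {
        "given_at": datetime(2018, 6, 1, tzinfo=timezone.utc),
        "via": "signup form",
    },
}

def may_process(subject_id, purpose):
    """Allow processing only when consent for this exact purpose is on record."""
    return (subject_id, purpose) in consent_ledger

print(may_process("user-42", "online_advertising"))  # True: consent documented
print(may_process("user-42", "data_resale"))         # False: no evidence, so do not process
```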
Cross-References

▶ Big Data Theory
▶ Charter of Fundamental Rights (EU)
▶ European Commission
▶ European Union
▶ Privacy

Further Reading

Bennett, C. J., & Raab, C. D. (2006). The governance of privacy: Policy instruments in global perspective. Cambridge, MA: MIT Press.
Birnhack, M. D. (2008). The EU data protection directive: An engine of a global regime. Computer Law and Security Review, 24(6), 508–520.
CEC. (1995, November 23). Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Union, L281. http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX:31995L0046.
CEC. (2006, April 13). Directive 2006/24/EC of the European Parliament and of the Council of 15 March 2006 on the retention of data generated or processed in connection with the provision of publicly available electronic communications services or of public communications networks and amending Directive 2002/58/EC. Official Journal of the European Union, L105/54. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32006L0024:en:HTML.
Chen, F. Y. (2014, May 13). European court says Google must respect 'right to be forgotten'. Reuters US Edition. http://www.reuters.com/article/2014/05/13/us-eu-google-dataprotection-idUSBREA4C07120140513. Accessed 8 Oct 2014.
De Hert, P., & Papakonstantinou, V. (2012). The proposed data protection regulation replacing directive 95/46/EC: A sound system for the protection of individuals. Computer Law and Security Review, 28(2), 130–142.
European Commission. (2012, January 25). Commission proposes a comprehensive reform of data protection rules to increase users' control of their data and to cut costs for businesses. Press release. http://europa.eu/rapid/press-release_IP-12-46_en.htm. Accessed 10 Oct 2014.
European Commission. (2013, July 16). European data protection supervisor. http://ec.europa.eu/justice/data-protection/bodies/supervisor/index_en.htm. Accessed 10 Oct 2014.
European Commission. (2014). Policies and activities. DG Justice. http://ec.europa.eu/justice/index_en.htm#newsroom-tab. Accessed 10 Oct 2014.
European Commission. (2016, July 6). Reform of EU data protection rules. http://ec.europa.eu/justice/data-protection/reform/index_en.htm. Accessed 7 July 2016.
Rees, C., & Heywood, D. (2014). The 'right to be forgotten' or the 'principle that has been remembered'. Computer Law and Security Review, 30(5), 574–578.

European Union

Chiara Valentini
Department of Management, Aarhus University, School of Business and Social Sciences, Aarhus, Denmark

Introduction

The development and integration of big data concern legislators and governments around the world. In Europe, legislation regulating big data and initiatives promoting the development of the digital economy are handled at the European Union level. The European Union (EU) is a union of European member states. It was formally established in 1993 when the Maastricht Treaty came into force to overcome the limits of the European Community and strengthen the economic and political agreements of the participating countries. The European Community was established in 1957 with the Treaty of Rome and had a primarily economic purpose: to establish a common market among six nation-states, namely Belgium, France, West Germany, Italy, Luxembourg, and the Netherlands. The EU is a supranational polity that acts in some policy areas as a federation, that is, its power is above member states' legislation, and in other policy areas as a confederation of independent states, similar to an intergovernmental organization, that is, it can provide some guidelines but decisions and agreements are not enforceable, and member states are free to decide whether or to what extent to follow them (Valentini 2008). Its political status is thus unique in several respects, because nation-states that join the European Union must accept relinquishing part of their national power in return for representation in the EU institutions. The EU comprises several supranational independent institutions such as the European Commission, the European Parliament, and the Council of the European Union, also known as the Council of Ministers.
It also operates through intergovernmentally negotiated decisions by member states that gather together, for instance, in the European Council.

The supranational polity has grown from the six founding European countries to the current 27, after the United Kingdom decided to leave the EU in summer 2016. In January 2016, the population of the EU was about 510 million people (Eurostat 2016). To become an EU member state, countries need to meet the so-called "Copenhagen criteria". These require that a candidate country has achieved institutional stability guaranteeing democracy, is based on the rule of law, has in place policies for protecting human and minority rights, has a functioning market economy, and can cope with competitive pressures and market forces (Nugent 2010, p. 43). Five countries are recognized as candidates for membership: Albania, Macedonia, Montenegro, Serbia, and Turkey. Other countries, such as Iceland, Liechtenstein, Norway, and Switzerland, are not EU members but are part of the European Free Trade Association and thus enjoy specific trade agreements (EFTA 2014).

EU Main Institutions

The EU's political direction is set by the European Council, which has no legislative powers but acts as a body that issues guidelines to the European Commission, the European Parliament, and the Council of the European Union. It comprises a President, the national heads of state or government, and the President of the European Commission. The three main institutions involved in the legislative process are the European Commission, the European Parliament, and the Council of the European Union. The European Commission is the institution that drafts and proposes legislation based on its own initiative but also on suggestions made by the European Council, the European Parliament, the Council of the European Union, or other external political actors. It comprises a President and 27 commissioners who are each responsible for one or more policy areas. The Commission is also responsible for monitoring the implementation of EU legislation once adopted (Nugent 2010).

The European Parliament has been elected every five years by direct universal suffrage since 1979. It is the only EU institution that is directly elected by citizens aged 18 years or older in all member states, except Austria, where the voting age is 16. Voting is compulsory in four member states (Belgium, Luxembourg, Cyprus, and Greece), and European citizens who reside in a member state other than their own have the right to vote in the European Parliament elections in their state of residence (European Parliament 2014). The European Parliament comprises a President and 751 members across seven political groups representing left, center, and right political positions. In the co-decision procedure, that is, the most common procedure for passing EU law, the Parliament together with the Council of the European Union is in charge of approving EU legislation.

The Council of the European Union represents the executive governments of the EU's member states and comprises a Presidency and a council of 27 ministers (one per member state) that changes according to the policy area under discussion. There are ten different configurations, that is, variations of the council's composition. The Presidency is a position held by a national government and rotates every 6 months among the governments of the member states. To maintain some consistency in the program, the Council has adopted an agreement called Presidency trios, under which three successive presidencies share common political programs. The administrative activities of the Council of the European Union are run by the Council's General Secretariat.

Decision-making in the Council can be by unanimity, by qualified majority (votes are weighted by the demographic clause, which means that highly populated countries have more votes than less populated ones), or by simple majority (Nugent 2010).
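A qualified-majority decision can be checked mechanically. The sketch below assumes the post-Lisbon "double majority" thresholds (at least 55% of member states representing at least 65% of the EU population); those thresholds and the toy population figures are added here only for illustration and are not given in this entry.

```python
# Illustrative check of a qualified-majority vote in the Council of the EU.
# The 55%/65% thresholds and the invented populations are assumptions for this sketch.
def qualified_majority(states_in_favour, populations, total_population):
    """states_in_favour: member states voting yes; populations: millions per state."""
    share_of_states = len(states_in_favour) / len(populations)
    population_in_favour = sum(populations[s] for s in states_in_favour)
    return share_of_states >= 0.55 and population_in_favour / total_population >= 0.65

populations = {"A": 80, "B": 60, "C": 10, "D": 5}  # invented figures, in millions
print(qualified_majority({"A", "B", "C"}, populations, total_population=155))  # True
```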
The legal power is given to the Court of Justice, which makes sure that EU law is correctly interpreted and implemented in the member states. The Court of Auditors checks cases of maladministration in the EU institutions and bodies. Typically, maladministration cases that are raised by citizens, businesses, and organizations are handled by the European Ombudsman. Citizens' privacy issues are handled, on the other hand, by the European Data Protection Supervisor, who is in charge of safeguarding the privacy of people's personal data.

While all EU member states are part of the single market, only 19 of them have joined the monetary union by introducing a single currency, the euro. The EU institution responsible for European monetary policy is the European Central Bank (ECB). Located in Frankfurt, Germany, the ECB's main role is to maintain price stability. It also defines and implements monetary policy, conducts foreign exchange operations, holds and manages the official foreign reserves of the euro area countries, and promotes the smooth operation of payment systems (Treaty of the European Union 1992, p. 69). The Union also has its own specific international financial institution, the European Investment Bank (EIB), publicly owned by all EU member states. The EIB finances EU investment projects and helps small businesses through the European Investment Fund (EIB 2013).

Another important EU institution is the European External Action Service, a unit supporting the activities of the High Representative of the Union for Foreign Affairs and Security Policy; its main role is to ensure that all diplomacy, trade, development aid, and work with global organizations that the EU undertakes are consistent and effective (EEAS 2014). Other interinstitutional bodies that play a role in the activities of the EU are the European Economic and Social Committee, which represents civil society, employers, and employees through a consultative assembly and issues opinions to the EU institutions, and the Committee of the Regions, which represents the interests of regional and local authorities in the member states (Nugent 2010).

Big Data and the EU

The EU's position toward big data is generally positive. It believes that data can create enormous value for the global economy, driving innovation, productivity, efficiency, and growth (Tene and Polonesky 2012). The EU particularly sees in big data opportunities to improve public services for citizens such as healthcare, transportation, and traffic regulation. It also believes that big data can increase innovation and clearly expresses an interest in further developing its use. In the last five years the EU has promoted the creation of a "cloud of public services" for the delivery of more flexible public services. These can be provided by combining building blocks, such as IT components for building e-government services involving ID, signatures, invoices, data exchange, and machine translation, and by allowing service sharing between public and private providers (Taylor 2014). EU Open Data rules were approved in spring 2013, and in the coming years it is expected that the new rules will make all public sector information across the EU available for reuse, provided that the data is generally accessible and not personal (European Commission 2013). The EU believes that better regulations that protect citizens' rights, as well as a better framework to help organizations take advantage of big data, are top priorities in the Europe 2020 digital agenda.

Even today, with new challenges related to collecting data across borders through mobile and other smart technologies, the EU and US positions traditionally tend to differ in relation to data protection. The EU has been cooperating closely with US law enforcement agencies to share information about online behavior in order to identify terrorists and other criminals. Among the existing agreements, the EU and US have information-sharing arrangements for their police and judicial bodies, two treaties on extradition and mutual legal assistance, and accords on container security and airline passenger data. Yet, the EU and US have different opinions on data privacy and data protection (Archick 2013).
The 1995 Data Protection Directive, which was the main regulation in the EU until 2016 and which prevented the personal data of individuals living in the EU from being shared with anyone else without express consent, was perceived by the US counterpart as too restrictive and as undermining the free market economy. Salbu (2002) noted that the EU 1995 Directive had negative impacts on global negotiation because companies had to comply with the EU requirements, which can be more restrictive than in other countries. Scholars observe that the 1995 Directive was not coercive, since countries outside the EU were not asked to change their laws to fit the directive. Yet, as Birnhack (2008) noted, countries that wished to engage in data transactions with EU member states were indirectly required to provide an adequate level of data protection.

The 1995 Data Protection Directive was considered to be one of the strictest data protection legislations in the world. Yet, various scandals and the increased concern among citizens about how their personal data is handled by organizations (Special Eurobarometer 359 2011) have brought the issue of privacy and security to the top of the EU political agenda. The European Parliament commissioned an investigation on the status of EU intelligence and identified that several countries have in place mass surveillance programs in which data collection and processing activities go beyond monitoring specific individuals for criminal or terrorist reasons. Milaj and Bonnici (2014) argue that mass surveillance not only damages the privacy of citizens but limits the guarantees offered by the principle of presumption of innocence during the stages of a legal process. Similarly, Leese (2014) argues that pattern-based categorizations in data-driven profiling can impact the EU's non-discrimination framework, because possible cases of discrimination will be less visible and traceable, leading to diminishing accountability.

As a result of these increasing concerns, in 2013 the EU launched a cybersecurity strategy to address shortcomings in the current system. The network information security (NIS) directive, adopted by the European Parliament on 6 July 2016 (European Commission 2017b), requires all member states to set up a national cybersecurity strategy including Computer Emergency Response Teams (CERTs) to react to attacks and security breaches. In September 2012, the EU decided to set up a permanent Computer Emergency Response Team (CERT-EU) for the EU institutions, agencies, and bodies, comprising IT security experts from the main EU institutions. The CERT-EU cooperates closely with other CERTs in the member states and beyond, as well as with specialized IT security companies (CERT-EU 2014). Furthermore, the EU revised the 1995 Data Protection Directive and approved a new regulation, the General Data Protection Regulation (GDPR). GDPR entered into force on 24 May 2016, but it will apply from 25 May 2018 to allow each member state to transpose it into its own national legislation (European Commission 2017a). GDPR poses a number of issues for international partners such as the US. These must abide by GDPR if personal information on EU citizens is collected by any organization or body, regardless of whether it is located in the EU or not. This means that cross-border transfer of EU citizens' personal data outside of the EU is only permissible when GDPR conditions are met. In practice, the entry into force of GDPR will require organizations collecting EU citizens' personal data to have a Data Protection Officer and to conduct privacy impact assessments to ensure they comply with the regulation in order to avoid being subject to substantial fines. GDPR is considered one of the most advanced data protection regulations in the world, yet it remains to be seen whether it benefits or hampers the EU's capacity to take advantage of the opportunities that big data can offer.

Cross-References

▶ Cloud Computing
▶ Data Mining
▶ European Commission
▶ European Commission: Directorate-General for Justice (Data Protection Division)
▶ Metadata
▶ National Security Administration (NSA)
▶ Privacy

Further Reading

Archick, K. (2013, September 14). U.S.-EU cooperation against terrorism. Congressional Research Service 7-5700. http://fas.org/sgp/crs/row/RS22030.pdf. Accessed 31 Oct 2014.
Birnhack, M. D. (2008). The EU data protection directive: An engine of a global regime. Computer Law and Security Review, 24(6), 508–520.
CERT-EU. (2014). About us. http://cert.europa.eu/cert/plainedition/en/cert_about.html. Accessed 31 Oct 2014.
EEAS. (2014). The EU's many international roles. European Union External Action. http://www.eeas.europa.eu/what_we_do/index_en.htm. Accessed 30 Oct 2014.
EFTA. (2014). The European free trade association. http://www.efta.int/about-efta/european-free-trade-association. Accessed 30 Oct 2014.
EIB. (2013). Frequently asked questions. http://www.eib.org/infocentre/faq/index.htm#what-is-the-eib. Accessed 30 Oct 2014.
European Commission. (2007). Framework for advancing transatlantic economic integration between the European Union and the United States of America. http://trade.ec.europa.eu/doclib/docs/2007/may/tradoc_134654.pdf. Accessed 21 Oct 2014.
European Commission. (2013). Commission welcomes parliament adoption of new EU open data rules. Press release. http://europa.eu/rapid/press-release_MEMO-13-555_en.htm. Accessed 30 Oct 2014.
European Commission. (2017a). Protection of personal data. http://ec.europa.eu/justice/data-protection/. Accessed 7 Sept 2017.
European Commission. (2017b). The Directive on security of network and information systems (NIS Directive). European Commission, Strategy, Single Market. https://ec.europa.eu/digital-single-market/en/network-and-information-security-nis-directive. Accessed 7 Sept 2017.
European Parliament. (2014). The European parliament: Electoral procedures. Factsheet on the European Union. http://www.europarl.europa.eu/ftu/pdf/en/FTU_1.3.4.pdf. Accessed 24 Oct 2014.
Eurostat. (2016). Population on 1 January. http://epp.eurostat.ec.europa.eu/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00001. Accessed 7 July 2016.
Leese, M. (2014). The new profiling: Algorithms, black boxes, and the failure of anti-discriminatory safeguards in the European Union. Security Dialogue, 45(5), 494–511.
Milaj, J., & Bonnici, J. P. M. (2014). Unwitting subjects of surveillance and the presumption of innocence. Computer Law and Security Review, 30(4), 419–428.
Nugent, N. (2010). The government and politics of the European Union (7th ed.). New York: Palgrave Macmillan.
Salbu, S. R. (2002). The European Union data privacy directive and international relations. Vanderbilt Journal of Transnational Law, 35, 655–695.
Special Eurobarometer 359. (2011). Attitudes on data protection and electronic identity in the European Union. Report. Gallup for the European Commission. http://ec.europa.eu/public_opinion/archives/ebs/ebs_359_en.pdf. Accessed 30 Oct 2014.
Taylor, S. (2014, June). Data: The new currency? European Voice. http://www.europeanvoice.com/research-papers/. Accessed 31 Oct 2014.
Tene, O., & Polonesky, J. (2012, February 2). Privacy in the age of big data: A time for big decisions. Stanford Law Review Online. http://www.stanfordlawreview.org/online/privacy-paradox/big-data. Accessed 30 Oct 2014.
Treaty of the European Union. (1992). Official Journal of the European Community. https://www.ecb.europa.eu/ecb/legal/pdf/maastricht_en.pdf. Accessed 30 Oct 2014.
Valentini, C. (2008). Promoting the European Union: Comparative analysis of EU communication strategies in Finland and in Italy. Doctoral dissertation. Jyväskylä Studies in Humanities, 87. Finland: University of Jyväskylä Press.

European Union Data Protection Supervisor

Catherine Easton
School of Law, Lancaster University, Bailrigg, UK

The European Union Data Protection Supervisor (EDPS) is an independent supervisory authority established by Regulation (EC) No 45/2001 on the processing of personal data. This regulation also outlines the duties and responsibilities of the authority, which, at a high level, focuses upon ensuring that the institutions of the European Union uphold individuals' fundamental rights and freedoms, in particular the right to privacy. In this way the holder of the office seeks to ensure that European Union provisions regarding data protection are applied and that measures taken to achieve compliance are monitored. The EDPS also has an advisory function and provides guidance to the EU institutions and data subjects on the application of data protection measures. The European Parliament and Council, after an open process, appoint the supervisor and the assistant supervisor, both for periods of 5 years. Since 2014 Giovanni Buttarelli has carried out the role, with Wojciech Wiewiórowski as his assistant.
Article 46 of Regulation 45/2001 outlines in further detail the duties of this authority, in addition to those outlined above: hearing and investigating complaints; conducting inquiries; cooperating with national supervisory bodies and EU data protection bodies; participating in the Article 29 working group; determining and justifying relevant exemptions, safeguards, and authorizations; maintaining the register of processing operations; carrying out prior checks of notified processing; and establishing his or her own rules of procedure.

In carrying out these duties, the authority has powers to, for example, order that data requests are complied with, give a warning to a controller, impose temporary or permanent bans on processing, refer matters to another EU institution, and intervene in relevant actions brought before the European Union's Court of Justice. Each year the EDPS produces a report on the authority's activities; this is submitted to the EU institutions and made available to the public.

The EDPS was consulted in the preparatory period before the EU's recent wide-ranging reform of data protection and published an opinion on potential changes. The EU's General Data Protection Regulation was passed in 2016, with the majority of its provisions coming into force within 2 years. This legislation outlines further provisions relating to the role of the EDPS; in its Article 68, it creates the European Data Protection Board, upon which the EDPS sits and for which it also provides a secretariat.

The authority of the EDPS is vital in ensuring that the rights of citizens are upheld in this increasingly complex area, in which technology is playing a fundamental role. By holding the EU institutions to account, monitoring, and providing guidance, the EDPS has maintained an active presence in developing and enforcing privacy-protecting provisions across the EU.

Event Stream Processing

▶ Complex Event Processing (CEP)

Evidence-Based Medicine

David Brown1,2 and Stephen W. Brown3
1Southern New Hampshire University, University of Central Florida College of Medicine, Huntington Beach, CA, USA
2University of Wyoming, Laramie, WY, USA
3Alliant International University, San Diego, CA, USA

Evidence-based medicine (EBM) and the term evidence-based medical practice (EBMP) are two interrelated advances in the medical and health sciences that are designed to improve individual, national, and world health. They do this by conducting sophisticated research to document treatment effectiveness and by delivering the highest-quality health services. The term evidence-based medicine refers to a collection of the most up-to-date medical and other health procedures that have scientific evidence documenting their efficacy and effectiveness. Evidence-based medical practice is the practice of medicine and other health services in a way that integrates the health-care provider's expertise, the patient's values, and the best evidence-based medical information. Big data plays a central role in all aspects of both EBM and EBMP.

EBM information is generated following a series of procedures known as clinical trials. Clinical trials are research-based applications of the scientific method. The first step of this method involves the development of a research hypothesis. This hypothesis is a logically reasoned speculation that a specific medication or treatment will have a positive outcome when applied for the treatment of some specific health malady in some specific population. Hypotheses are typically developed by reviewing the literature in recent health and medical journals and by using creative thinking to generate a possible new application of the reviewed material.
After the hypothesis has been generated, an experimental clinical trial is designed to test the validity of the hypothesis. The clinical trial is designed as a unique study; however, it usually has similarities to other studies that were identified while developing the hypothesis. Before the proposed clinical trial can be performed, it needs to be reviewed and approved by a neutral Institutional Review Board (IRB). The IRB is a group of scientists, practitioners, ethicists, and public representatives who evaluate the research procedures. Members of the IRB use their expertise to determine whether the study is ethical and whether the potential benefits of the proposed study will far outweigh its possible risks.

After, and only after, IRB approval has been obtained, the researcher begins the clinical trial by recruiting volunteer participants who of their own free will agree to participate in the research. Participant recruitment is a process whereby a large number of people who meet certain inclusion criteria (e.g., they have the specific disorder and they are members of the specific population of interest) and who don't have any exclusion criteria (e.g., they don't have some other health condition that might lead to erroneous findings) are identified. These people are then contacted and asked if they would be willing to participate in the study. The risks and benefits of the study are explained to each participant. After a large group of volunteers has been identified, randomization procedures are used to assign each person to one of two different groups. One of the groups is the treatment group; members of this group will all receive the experimental treatment. Members of the other group, the control group, will receive the treatment that is traditionally used to treat the disorder being studied. The randomization process of assigning people to groups helps ensure that each participant has an equal chance of appearing in either the treatment or the control group and that the two groups do not differ in some systematic way that might influence the results of the study.

After a reasonable period of time during which the control group received treatment as usual and the treatment group received the experimental treatment, big data techniques are used to analyze the results of the clinical trial in great detail. This information is big data that reports the characteristics of the people who were in the different groups, the unique experience of each research participant, the proportion of the treatment group and of the control group who got better, the proportion in each group that got worse, the proportion in each group that had no change, the proportion of people in each group that experienced different kinds of side effects, and a description of any unusual or unexpected events that might have occurred during the course of the study.

After data analysis, the researchers prepare an article that gives a detailed description of the logic that they used in designing the clinical trial, the exact and specific methods that were used in conducting the clinical trial, the quantitative and qualitative results of the study, and their interpretation of the results and suggestions for further research. The article is then submitted to the editor of a professional journal. The journal editor identifies several neutral experts in the discipline who are asked to review the article and determine if it has the scientific accuracy and value that make it suitable for publication. These experts judiciously review all aspects of the clinical trial to determine whether the proposed new treatment is safe and effective and whether, in at least some cases, it is superior to the traditional treatment used for the health problem under study. If the experts agree, the article is published in what is called a peer-reviewed health-care journal. The term "peer reviewed" means that an independent and neutral panel of experts has reviewed the article and that this panel believes the article is worthy of dissemination and study by other health-care professionals.

It should be noted that many different studies are usually conducted concerning the same treatment and the same disorder. However, each study is unique in that different people are studied and the treatment may be administered using somewhat different procedures. As an example, the specific people being studied may differ from study to study
(e.g., some trials may only include people between the ages of 18 and 35, some studies may include only Caucasians, and some studies may include only people who have had the disease for less than 1 year). Other studies may look at differences in the treatment procedures (e.g., some studies may only use very small doses, other studies may use large doses, some studies may administer the treatment early in the morning, and other studies may administer the treatment late at night). Clearly, there are many different variables that can be manipulated, and these changes can affect the outcome of a clinical trial.

After the article has been published, it joins a group of other articles that address the same general topic. Information about the article and all other similar articles is stored in multiple different online journal databases. These journal databases are massive big data files that contain the article citation as well as other important information about the article (e.g., abstract, language, country, key terms). Practitioners, researchers, students, and others consult these journal databases to determine what is known about the effects of different treatments on the health problem being studied. These big data online databases enable users to identify the most current, up-to-date information as well as historical findings about a condition and its treatment. Many journal databases have options that allow users to receive a copy of the full journal article in a matter of seconds; at other times, it may take as long as a week to retrieve an article from some distant country.

Now that the article has been published and listed in online journal databases, it joins a group of articles that all address the same general topic. By using search terms that relate to the specific treatment and the specific condition, an online journal database user can find the published clinical trial article discussed above as well as all of the other articles that concern the same topic. In reviewing the total group of articles, it becomes apparent that some of the articles show that the treatment under study is very highly effective for the condition being investigated, while other studies show it to be less effective. That is, the evidence is highly variable. Meta-analysis is a research technique that is designed to resolve these differences and determine if the new treatment is in fact an evidence-based treatment. In performing a meta-analysis, researchers use multiple online databases to identify all of the different articles that address the topic of using the specific treatment with the specific disorder. In a meta-analysis, each of the different identified articles is studied in great detail. Then, a statistic called effect size is calculated. The effect size describes the average amount of effect that a specific treatment has on a specific disorder. Each study's effect size is calculated by comparing the amount of improvement in the disorder that occurs when the new treatment is used with the amount of improvement in the disorder when the new treatment is not being used. After the effect size has been calculated for each of the articles being reviewed, an average effect size for the new treatment is calculated based on all of the different clinical trials. If the average effect size shows that the treatment has a significant positive effect on the condition being studied, then, and only then, it is labeled an evidence-based treatment, and this information is widely disseminated to practitioners and health researchers throughout the world.

with their health-care provider online or by telephone. In large health systems such as a health maintenance organization (HMO), these are big data systems that track the time availability and the location of many different providers. By using these systems, patients are able to schedule their own appointments with a provider at a time and place that best meets their needs. Very often, the automatic appointment scheduler arranges for reminder postcards, emails, and phone calls to remind patients of the time and place of the scheduled appointment.

Electronic health records (EHR) are comprehensive, secure electronic files that contain and collate all of the information about all aspects of a patient’s health. They contain information that is collected at each outpatient and inpatient health-care encounter. This includes the patient’s medical history, all of their diagnoses, all of the medications the patient has ever taken and a list of the medicines the patient is currently taking, all past and current treatment plans, dates and types of all immunizations, allergies, all past radiographic images, and the results from all laboratory and other tests. These data are usually portable, which means that any time a patient sees a provider, the provider can securely access all of the relevant information to provide the best possible care to the patient.

Conclusion

Evidence-based medicine and evidence-based medical practice are recent health-care advances. The use of these mechanisms depends upon the availability of big data, and as they are used, they generate more big data. It is a system that can only lead to improvements in individual, national, and worldwide health.

Cross-References

▶ Health Informatics
▶ Telemedicine

Further Reading

De Vreese, L. (2011). Evidence-based medicine and progress in the medical sciences. Journal of Evaluation in Clinical Practice, 17(5), 852–856.
Epstein, I. (2011). Reconciling evidence-based practice, evidence-informed practice, and practice-based research: The role of clinical data-mining. Social Work, 56(3), 284–287.
Ko, M. J., & Lim, T. (2014). Use of big data for evidence-based healthcare. Journal of the Korean Medical Association, 57(5), 413–418.
Michael, K., & Miller, K. W. (2013). Big data: New opportunities and new challenges. Computer, 46(6), 22–24.
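The effect size computation described in this entry can be illustrated with a small sketch. The trial names and summary numbers below are entirely hypothetical, and the standardized mean difference (Cohen's d) shown here is only one of several effect size measures a meta-analysis might use; the simple sample-size weighting is a stand-in for the inverse-variance weighting of formal meta-analysis.

```python
from math import sqrt

# Hypothetical summary statistics from three published trials of the same
# treatment: mean improvement, standard deviation, and sample size for the
# treatment (t) and control (c) groups in each study.
studies = [
    {"name": "Trial A", "mt": 12.0, "st": 4.0, "nt": 40, "mc": 9.0, "sc": 4.5, "nc": 38},
    {"name": "Trial B", "mt": 10.5, "st": 5.0, "nt": 25, "mc": 9.8, "sc": 5.2, "nc": 27},
    {"name": "Trial C", "mt": 14.0, "st": 3.5, "nt": 60, "mc": 10.0, "sc": 3.8, "nc": 61},
]

def cohens_d(mt, st, nt, mc, sc, nc):
    """Standardized mean difference between treatment and control groups."""
    pooled_sd = sqrt(((nt - 1) * st**2 + (nc - 1) * sc**2) / (nt + nc - 2))
    return (mt - mc) / pooled_sd

# Per-study effect sizes, weighted here by total sample size.
effects = [(s["name"],
            cohens_d(s["mt"], s["st"], s["nt"], s["mc"], s["sc"], s["nc"]),
            s["nt"] + s["nc"]) for s in studies]

for name, d, n in effects:
    print(f"{name}: d = {d:.2f} (n = {n})")

average_d = sum(d * n for _, d, n in effects) / sum(n for _, _, n in effects)
print(f"Weighted average effect size across trials: {average_d:.2f}")
```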
F

Facebook meaning what users are currently discussing on


Facebook. On Facebook’s main screen, users can
R. Bruce Anderson1,2 and Kassandra Galvez2 go to the trending section and view the top
1
Earth & Environment, Boston University, 10 most popular things that all Facebook users
Boston, MA, USA are discussing. This feature includes things that
2
Florida Southern College, Lakeland, FL, USA just your friends are discussing to what the nation
is talking about. Many of these topics are pre-
ceded with the number sign, #, to present a
When it comes to the possibility of gathering truly hashtag which links all these topics together. All
“big data” of a personal nature, it is impossible to of these conversations, information about individ-
think of a better source than Facebook. Millions uals, and location materials are potential targets
use Facebook every day, and likely give up every- for data harvesting.
thing they publish to data collectors every time These “trending topics” have provided
they use it. Facebook users instant knowledge about a spe-
Since its 2004 launch, Facebook has become cific event or topic. Facebook has integrated news
one of the biggest websites in the world with with social media. In the last 12 months, traffic
400 million people visiting the site each month. from home pages has dropped significantly across
Facebook allows any person with a valid email many websites while social media’s share of
address to simply sign up for an account and clicks has more than doubled, according to a
immediately start connecting with other users 2013 review of the BuzzFeed Partner Network, a
known as “friends.” Once you are connected conglomeration of popular sites including
with another “friend,” users are able to view the BuzzFeed, the New York Times, and Thought Cat-
other person’s information listed such as: birth- alog. Facebook, in particular, has opened the
day, relationship status, and political affiliation; spigot, with its outbound links to publishers grow-
however, some of this information may not be listed ing from 62 million to 161 million in 2013. Two
depending on the various privacy regulations that years ago, Facebook and Google were equal pow-
Facebook has. ers in sending clicks to the BuzzFeed network’s
By having a Facebook account, users have a sites. Today Facebook sends 3.5 times more
main screen known as the “newsfeed.” This traffic. Facebook has provided its 1 billion users
“newsfeed” shows users what his or her “friends” with a new way of accessing the news.
are currently doing online either from liking pic- However, such access can have a double edge.
tures or writing comments on statuses. Addition- For example, during the 2016 US election,
ally, “users” can see what is currently “trending” hackers from foreign sources apparently took

advantage of the somewhat laissez faire approach behavior, allowing targeted ad buys, directed
Facebook had towards users security (and content “teasers” and the like.
control) and set up false accounts to spread fake Unfortunately, the release of private informa-
information about candidates for office. tion has caused concern. Users of social-
With the increased use of Facebook, there has networking sites such as Facebook passively
also been an increase in research on Facebook. accept losing control of their personal information
Many psychologists believe that social media can add because they are not fully aware – or have given
to a child’s learning capacity, but it is also asso- up caring – about the possible implications for
ciated with a host of psychological disorders. their privacy. These users should make sure they
Additionally, social media can be “the training understand who could access their profiles, and
wheels of life” to social networking teens how that information could be used. They also
because it allows him or her to post public infor- should familiarize themselves with sites’ privacy
mation onto the sites, see how other users react to options.
that information, and learn as they go. Research Facebook has a variety of privacy settings that
has gone as far as to show “that people who can be useful to those who are wary of the private
engage in more Facebook activities – more status information online such as: “public,” “friends,”
updates, more photo uploads, more "likes" – also “only me,” and “custom.” Users who elect to
display more virtual empathy.” An example of have their information “public” allow any person
this “virtual empathy” is if someone posts he had on or off Facebook to see his or her profile and
a difficult day, and you post a comment saying, information. The “friends” privacy setting allows
"Call me if you need anything,” this displays users who are connected to another’s account to
virtual empathy in action. view the profile and information. Additionally,
While Facebook has benefited society in a Facebook provides two unique privacy settings
variety of ways, it has also provided a lack of which are: “only me” and “custom.” The “only
privacy. When creating a Facebook account, you me” privacy only allows the account user to view
are asked a variety of personal questions that may his or her own material. The last privacy setting is
be shared on your Facebook page. These ques- “custom” which allows the user to make his own
tions include your: birthday, marital status, polit- privacy settings which is a two-step process. The
ical affiliation, phone number, work and first step for this “custom” privacy setting asks
educations, family and relationships, and home- users to choose who he or she would like the
town. While some of this private and personal information to be shared with. The user is the
information may not seem as important to some able to write “friends” which allows all his or
users, it allows other users who are connected to her Facebook friends to view this information or
your account to access that information – but the user can write the names of specific people.
access by others is frankly very easy to obtain, If the user decides to opt for the second choice, all
making any notion of a firewall between your the user’s information will only be viewed by
information and the information of millions of those specific friends he or she chose. The second
others a very doubtful proposition. step asks users who he or she does not want the
With the introduction of the program information to be shared with. The user can then
“Facebook Pixel” – a program sold to advertisers – write the names of users he or she does not want to
commercial users can track, “optimize” and view the information. While the second step may
retarget site visitors to their products, ads, and seem redundant, the option has proven to be use-
related materials – keeping the data they have ful to teens who do not want their parents to see
gathered, of course, for further aggregation. everything on his or her Facebook page.
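The two-step “custom” privacy setting described above behaves like an allow list that is then filtered by a deny list. The sketch below is a hypothetical illustration of that logic only; the function and the names in it are invented and do not reflect Facebook’s actual implementation.

```python
def custom_audience(allowed, denied, all_friends):
    """Resolve a two-step 'custom' privacy setting.

    Step 1: 'allowed' is either the string "friends" or a set of named people.
    Step 2: anyone in 'denied' is removed, even if step 1 would include them.
    """
    base = set(all_friends) if allowed == "friends" else set(allowed)
    return base - set(denied)

# Hypothetical example: share with all friends except the user's parents.
friends = {"Alice", "Bob", "Carol", "Mom", "Dad"}
audience = custom_audience("friends", {"Mom", "Dad"}, friends)
print(sorted(audience))   # ['Alice', 'Bob', 'Carol']
```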
When data such as these are aggregated, projec- While the privacy settings may be handy, there
tions can be made about mass public consumer is some information that Facebook makes

publicly available. This type of information con- using or are running Facebook, such as when
sist of the user’s name, profile pictures and cover you look at another person’s account, send or
photos, networks, gender, user name, and user receive a message, search for a “friend” or page,
ID. The purpose of having your name public is click on view or otherwise interact with things,
so that other users may be able to search for your use a Facebook mobile app, or make purchases
account. When searching for a user, profile and through Facebook. Also, not only does Facebook
cover photos are normally the indicator that he or received data when a user posts photos or videos
she has found the person they are searching for. but it also receives data such as the time, date, and
While these photos remain public, the user may place you took the photo or video. The last para-
change the privacy settings for these specific pic- graph of that section states “We only provide data
tures once you have updated your account and to our advertising partners or customers after we
changed them. For example, a user may be tired have removed your name and any other person- F
of the same profile picture so he or she changes it, ally identifying information from it, or have com-
the old picture would still appear; however, he or bined it with other people’s data in a way that it no
she can change that specific picture to private now longer personally identifies you.” With all the data
that there is a new picture in its place. The same collected, Facebook customizes the ad’s that users
can be done with cover photos. Sadly, Facebook see on their account. For example, college student
does not allow every single picture to be private. users who constantly view text book websites will
Networks allow users to be linked to specific areas start seeing advertisements about these specific
such as: college or university, companies, and websites on his or her Facebook account.
other organizations. These networks provide Facebook can be compared to as the “all-seeing
other users who have linked to the same network eye” because it is constantly watching all 20 bil-
to search for you due to this custom audience. lion users.
College students will normally link their accounts The last section of the Facebook private policy
to these networks to that other college students is titled “How we Use the Information We
can search for them with ease. By having this Receive.” Facebook states that they use the infor-
information public, other users can look onto a mation received in connection with services and
person’s account and recognize the network and features they provide to users, partners, advertises
make that connection. Lastly, Facebook requires that purchase ads on the site, and developers of
the users gender, username, and user ID to be games, applications, and websites users use. With
public. According to Facebook privacy policies, the information received, Facebook uses it as part
“Gender allows us to refer to you properly.” of their efforts to keep their products, services, and
Username and user IDs provide users to supply integrations safe and secure; to protect Facebook’s
others with custom links to his or her profile. or others’ rights or property; to provide users with
When signing up for a Facebook account, location features and services, to measure and
many users forget to “read the fine print” meaning understand the effectiveness of ads shown to
that users do not read the terms of agreement users, to suggest Facebook features to make the
which includes Facebook’s policy on information. users experience easier such as: contact importer
If users were to read this section, he or she may not to find “friends” based on your cell phone con-
have created an account in the first place. In this tacts; and lastly internal operations that include
section, there is a sense that Facebook is con- troubleshooting, data analysis, testing, search and
stantly watching you especially in the section service improvement. According to Facebook,
titled “Other Information we receive about you.” users are granting the use of information by sim-
This section details how and when Facebook is ply signing up for an account. By granting
receiving data from his or her account. Facebook Facebook the use of information, Facebook, in
states that they receive data whenever you are turn, uses the information to provide users with

new and innovative features and services. The the app may post on your Facebook page with any
Facebook private policy does not state how long recent activity the user has done through that
it keeps and stores the data for. The only answer specific website.
users will receive is “we store data for as long as it An example of applications in motion is
is necessary to provide products and services to Instagram, which is a photo social networking
you and others. . .typically, information associated service owned by Facebook. Users that have
with your account will be kept until your account both Facebook and Instagram can connect each
is deleted.” account for a more personal experience.
Two ways of ridding users of Facebook are Instagram users who connect to Facebook
deactivation and deletion. Many Facebook users accounts provide Facebook with basic informa-
believe that deactivation and deletion are the tion from their Instagram account and in turn,
same; however, these users are mistaken. By Instagram may access data from Facebook from
deactivating an account, that specific account is other pages and applications used. All this con-
put on hold. While other users may not see his or nection and data opens up a “Pandora’s box” for
her information, Facebook does not delete any users because he or she may not know what appli-
information. Facebook believes that deactivating cation is getting what information and from
an account is the same as a user telling Facebook where. Furthermore, Instagram can access the
not to delete any information because he or she users data anytime even when he or she may not
might want to reactivate the account at some point be using the Instagram application. The applica-
in the future. The period of deactivation is tion may even post on the user’s behalf, which includes
unlimited: users can deactivate their accounts for objects Instagram users have posted and more.
years but Facebook will not delete their informa- Lastly, each additional application has its own
tion. By deleting a Facebook account, the account policy terms.
is permanently deleted from Facebook; however, While the nature of people around the world
it takes up to one month to delete but some infor- may be to be trusting, users of all ages need to be
mation may take up to 90 days to erase. The cautious of what they post online. Recently, there
misconception is that the account and all its infor- has been a shift of job seekers getting asked
mation is immediately erased; however, that is not for Facebook passwords. The reason for this shift
the case, even for Facebook itself. is that employers want to view what (possible)
In the case of resident programs or third parties future employees are posting. Since the rise of
that access or harvest the data itself, Facebook social networking, it has become common for
assumes no responsibility – there is nothing, for managers to review publicly available Facebook
example, to force advertisers to give up the mega- profiles, Twitter accounts and other sites to learn
data they collect (legally thus far) when con- more about job candidates. However, many users,
sumers visit their site on the platform. especially on Facebook, have their profiles set to
Facebook also provides users with outside private, making them available only to selected
applications. These applications allow users to people or certain networks. Other solutions to
gain a more tailored and unique experience private Facebook accounts have surfaced such as
through the social networking site. These applica- asking applicants to friend human resource man-
tions include games, shopping websites, music agers or to log in to a company computer during
players, video streaming websites, organizations, an interview. Once employed, some workers have
and more. Now in this day and age, websites will been required to sign non-disparagement agree-
ask users to connect their social networking sites ments that ban them from talking negatively about
to the specific website account. By granting that an employer on social media. While such mea-
access, users provide the websites their basic sures may seem fair because employers want to
information and email address. Additionally, the have employees that represent themselves well on
website can send the user notifications but more social media, these policies have raised concerns
importantly post on your behalf. This means that about the invasion of privacy.

Facebook has been a source for people to connect, communicate, research and view the latest trends, and post; however, it has also become a hub for private information. While Facebook has privacy policies available for all users to read and understand, users are not particularly interested in those aspects. Unfortunately, users who do not yet understand the importance of private information and create Facebook accounts to communicate with others do not pay attention to the things they are posting. The real question is: is Facebook really upholding its privacy policy rules and is my information really private? Given the number of pending and active cases against the platform, the answer is likely “not so much.”

In the end, the financial health of such platforms is predicated on turning a tidy profit through the sale of advertising, and advertiser access to the results of the passive and sometimes active harvesting of big data through the site.

Cross-References

▶ Cybersecurity
▶ Data Mining
▶ LinkedIn
▶ Social Media

Further Reading

Brown, U. (2011). The influence of Facebook usage on the academic performance and the quality of life of college students. Journal of Media and Communication Studies, 3(4), 144–150.
Data Use Policy. Facebook. 1 Apr 2014. Web. 24 Aug 2014.
Mcfarland, S. (2012). Job seekers getting asked for Facebook passwords. USATODAY.com, 3 Mar 2012. Web 24 Aug 2014.
Roberts, D., & Kiss, J. (2013). Twitter, Facebook and more demand sweeping changes to US surveillance. Theguardian.com. Guardian News and Media, 9 Dec 2013. Web 24 Aug 2014.
Thompson, D. (2014). The Facebook effect on the news. The Atlantic. Atlantic Media Company, 12 Feb 2014. Web 24 Aug 2014.
Turgeon, J. (2011). How Facebook and social media affect the minds of generation next. The Huffington Post. TheHuffingtonPost.com, 9 Aug 2011. Web 24 Aug 2014.

Facial Recognition Technologies

Gang Hua
Visual Computing Group, Microsoft Research, Beijing, China

Facial recognition refers to the task of automatically identifying or verifying a person from face images and videos. Face verification aims at arbitrating whether a pair of faces is from the same person or not, while face identification focuses on predicting the identity of a query face given a gallery face dataset with known identities. There are many applications of facial recognition technologies, in domains such as security, justice, social networks, and military operations.

While early face recognition technologies dealt with face images taken from well-controlled environments, the current focus in facial recognition research is pushing the frontier in handling real-world face images and videos taken from uncontrolled environments. There are two major unconstrained visual sources: (1) face images and videos taken by users and shared on the Internet, such as those images uploaded to Facebook, and (2) face videos taken from surveillance cameras.

In these unconstrained visual sources, face recognition technologies must contend with uncontrolled lighting, large pose variations, a range of facial expressions, makeup, changes in facial hair, eye-wear, weight gain, aging, and partial occlusions. Recent progress in real-world face recognition has greatly benefited from big face datasets drawn from unconstrained sources.

History

As one of the most intensively studied areas in computer vision, facial recognition researchers were among the first to advocate systematic data-driven performance benchmarking. We briefly review the history of face recognition research in the context of the datasets it has been evaluated on.
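Most current systems reduce the verification and identification tasks defined at the start of this entry to comparisons between learned face embeddings. The following sketch illustrates that idea with random vectors standing in for real embeddings; the threshold, dimensionality, and gallery names are arbitrary assumptions for demonstration, not parameters of any particular system.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(emb_a, emb_b, threshold=0.7):
    """Face verification: decide whether two embeddings show the same person."""
    return cosine_similarity(emb_a, emb_b) >= threshold

def identify(query_emb, gallery):
    """Face identification: return the gallery identity closest to the query."""
    return max(gallery, key=lambda name: cosine_similarity(query_emb, gallery[name]))

# Stand-in 128-dimensional embeddings; a real system would obtain these from a
# trained model such as the deep networks discussed later in this entry.
rng = np.random.default_rng(0)
gallery = {name: rng.normal(size=128) for name in ("person_1", "person_2", "person_3")}
query = gallery["person_2"] + 0.05 * rng.normal(size=128)   # a noisy view of person_2

print(verify(query, gallery["person_2"]))   # expected: True
print(identify(query, gallery))             # expected: 'person_2'
```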

Early work by Woody Bledsoe, Helen Chan images. The sparse representation-based face iden-
Wolf, and Charles Bisson, though not published tification method proposed by John Wright et al. in
due to restriction of funded research from an 2009 can be regarded as a smarter manifold repre-
unnamed intelligence agency, could be traced back sentation for face recognition.
to 1964 at Panoramic Research, Inc., and later These manifold learning-based face recogni-
continued by Peter Hart at Stanford Research tion algorithms have largely been evaluated on
Institute after 1966. The task was, given a photo, several popular face recognition benchmarks,
identifying from a book of mug shots, a small set including the Yale Face Database, the Extended
of candidate records that have one matched with Yale Face Database B, the ORL dataset, and the
the query photo. PIE dataset. These datasets are in the order of
Due to the limited capacity of computers back several hundreds to several thousands. One defect
then, human operators were involved in extracting of all these subspace representations is that they
a set of distances among a set of predefined facial would fail if the faces are not very well aligned. In
landmarks. These distances were then normalized other words, they would fail if the faces are under
and served as features to match different face large pose variations.
photos. The method proposed by Bledsoe was While these manifold learning-based methods
evaluated in a database of 2000 face photos, and largely operated on raw image pixels, other invari-
it consistently outperformed human judges. ant local descriptor-based method gained its pop-
Later work attempted to build fully automated ularity due to their robustness to pose variations.
computer program without involving human These include the elastic bunch graph matching
labors. Takeo Kanade, in 1977, built a benchmark method in 1999, the local binary pattern-based
dataset of 20 young people, each with two face face recognition method in 2004, the series of
images. Hence the dataset consists of 40 face local descriptor-based elastic part matching
images. Takeo’s program conducted fully auto- method published by Gang Hua and Haoxiang
mated analysis of different facial regions and Li between 2009 and 2015 including the series
extracted a set of features to characterize each of probabilistic elastic part (PEP) models
region. Back then, the evaluation on these 40 dig- published between 2013 and 2015, the Joint
itized face images are considered to be a large- Bayesian faces method in 2012, and the Fisher
scale evaluation. Takeo also evaluated his algo- Vector faces in 2013.
rithm on a dataset of 800 photos later on. The performance of these methods has
The Eigenfaces approach proposed by Mat- largely been evaluated on more recent real-world
thew Turk and Alex Pentland in 1991 was the face recognition benchmark dataset including the
first to have introduced statistical pattern recogni- labeled faces in the wild (LFW), the YouTube
tion method for face recognition. It conducted Faces Database, and the more recent point-and-
principal component analysis (PCA) on a set of shoot dataset. These face datasets are either col-
face images to identify a subspace representation lected from the Internet or collected by point-and-
for face images. The efficacy of the Eigenfaces shoot cameras in unconstrained settings.
representation was evaluated on a dataset of 2500 Since 2014, we have witnessed a surge of deep
digitized face images. learning-based face recognition systems, e.g., the
The Eigenfaces spurred a theme of work, DeepFace system from Facebook, the DeepID
namely, manifold learning, in identifying better systems from the Chinese University of Hong
face spaces for face recognition, including the Kong, and the FaceNet system from Google.
Fisherfaces method by Peter Belhumeur et al. in They are all trained with millions of face images.
1997 and the Laplacianfaces method by Shuicheng For example, the DeepFace system from
Yan et al. in 2005, which aims at identifying a Facebook has leveraged 4.4 million face images
discriminative subspace for representing face from 4030 people from Facebook for training, and

the FaceNet system leveraged 100 million to into generative model based and discriminative
200 million faces consisting of about 8M different model based. The seminal Eigenfaces method is
identities. a generative model, while the Fisherfaces method
In 2014, the US Government funded the is a discriminative model. From 2014, the recent
JANUS program under the Intelligence Advanced trend in face recognition is to exploit deep neural
Research Projects Activity (IARPA), which is network to learn discriminative face representa-
targeting on pushing the frontiers of facial recog- tions from a large amount of labeled face images.
nition technology in unconstrained environment
and emphasizing the comprehensive modeling of
age, pose, illumination, and facial expression Datasets and Benchmarks
(A-PIE) and unifying both image and video face
recognition. Accompanied with this program is a While early face recognition research worked on F
face recognition benchmark, namely, IARPA proprietary datasets which were not used by other
Janus Benchmark, from the National Institute of researchers, the face recognition research commu-
Standards and Technologies (NIST). The neural nity is perhaps the earliest in the computer vision
aggregation network invented by Gang Hua and community in adopting systematic and public
his colleagues in 2016 at Microsoft Research is data-driven benchmarking.
one representative of the current state of the art on This is catalyzed by the FERET dataset funded
this benchmark to date. by US Department of Defense’s Counterdrug
Technology Development Program through the
Defense Advanced Research Projects Agency
Approaches (DARPA) during 1993 and 1997. The final
FERET dataset consists of 14051 8-bit gray-
Face recognition technology can be categorized in scale images of human heads with views ranging
different ways. In terms of visual features from frontal view to left and right profile. The
exploited for face representation and hence for FERET dataset is the basis of the Face Recogni-
recognition, face recognition algorithms can be tion Vendor Test (FRVT) organized by NIST in
categorized as geometric feature based and 2000 and 2002.
appearance feature based. While early work has The FRVT in 2006 adopted the face recogni-
focused on geometric invariants, such as the size tion grand challenge (FRGC) dataset, which eval-
of certain facial components, and the distance uated performance of facial recognition systems
between certain facial landmarks, modern face from different vendors on high-resolution still
recognition algorithms largely focused on model- imagery (5–6 megapixels), 3D facial scans,
ing the appearances. multi-sample still facial imagery, and pre-
From a modeling point of view, facial recogni- processing algorithms that compensate for pose
tion technologies can be categorized as holistic and illumination. The winning team is Neven
methods or part-based methods. Holistic methods Vision, a Los Angeles start-up. Neven Vision is
build the representation based on the holistic later acquired by Google.
appearance of the face. The numerous manifold The most recent FRVT was organized in 2013;
learning-based methods belong to this category, facial recognition systems from various vendors
while part-based methods attempt to characterize are tested to identify up to 1.6 million individuals.
each facial part for robust matching. The series of The task is largely focused on visa photos and
PEP models developed by Gang Hua and mug shot photos, where the sources are more or
Haoxiang Li are one such example. less controlled. The system that ranked overall in
From the perspective of pattern recognition, the top is the NEC system. Other participants
face recognition technologies can be categorized include Toshiba, MorphoTrust, Cognitec,

etc. These FRVT tests organized by the US gov- images of 293 people and 2802 videos of 256 peo-
ernment in the past have largely been focused on ple. These photos and videos are taken with cheap
more controlled environment. The IAPRA digital cameras including those on smartphones.
JANUS benchmarking is currently ongoing, Compared with performance of face recognition
which will further stimulate more accurate face algorithms on the LFW and YouTube Faces Data-
recognition technologies. base, where nearly perfect verification accuracy
Meanwhile, there are also widely adopted has been achieved, current state-of-the-art verifi-
benchmark dataset from academia, including cation accuracy, up to September 2015, on the
the early small-scale datasets collected in point-and-shoot dataset, is 58% at the false accep-
1990s, such as the Yale and Extended Yale tance rate of 1%, achieved by the team from the
B datasets, the ORL dataset from the AT&T Chinese Academy of Science Institute of
Labs, and mid-scale datasets such as the PIE Computing.
and Multi-PIE datasets collected at CMU. The current largest publically available facial
These datasets are often collected to evaluate recognition dataset is the MegaFace dataset, with
some specific visual variations that confront one million faces obtained from Flickr. The cur-
facial recognition. Specifically, the Yale datasets rent state-of-the-art rank 1 identification accuracy
are designed for modeling illuminations; the with one million distractors is around 75%.
ORL dataset is constructed to evaluate occlusion
and facial expressions; and the PIE datasets are
designed to model poses, illuminations, and Software
facial expressions.
These datasets are more or less taken in well- Well-known commercial software systems that
controlled setting. The labeled faces in the wild have used facial recognition technology include
(LFW) dataset published in 2007 is the first the Google Picasa photo management system, the
dataset collected from the Internet, which released Apple iPhoto system, the photo application of
publically to the research community, for system- Facebook, Windows Live Photo Gallery, Adobe
atic evaluation of facial recognition technologies Photoshop Elements, and Sony Picture Motion
in uncontrolled settings. It contains 13,000 images Browser. The OKAO vision system from Omron
from 5749 celebrities. The benchmark task on provided advanced facial recognition technolo-
LFW has been mainly designed for face verifica- gies, which has been licensed to various compa-
tion, with different protocols depending if the nies for commercial applications.
algorithms are trained with external data. Later As software as a service (SaaS) becomes an
on, the YouTube Faces Database, published in industry common practice, more and more com-
2011, followed the same protocol as LFW, but panies are offering their latest face recognition
each face instance is a video clip instead of a technologies through the cloud. One of the most
single image. The dataset contains 3425 videos matured ones is the Face API provided by Micro-
from 1595 people. soft Cognitive Service. Other similar APIs are
One limitation of the LFW dataset as well as also offered by Internet giants such as Baidu and
the YouTube Faces Database is that the people in start-ups such as Megvii and SenseTime in China.
these datasets are celebrities. As a result, the
photos and videos published are often taken by
professional photographers. This is different from Cross-References
photos taken by amateur users in their daily life.
This largely motivated the construction of the ▶ Biometrics
point-and-shoot dataset, released from the Univer- ▶ Facebook
sity of Notre Dame. It is composed of 9376 still ▶ Social Media

Further Reading
Financial Data and Trend
Belhumeur, P. N., et al. (1997). Eigenfaces vs. fisherfaces: Prediction
Recognition using class specific linear projection. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
19(7), 711–720. Germán G. Creamer
Chen, D., et al. (2012). Bayesian face revisited: A joint School of Business, Stevens Institute of
formulation. In Proceedings of European Conference Technology, Hoboken, NJ, USA
on Computer Vision.
Huang, G. B., et al. (2007). Labeled faces in the wild:
A database for studying face recognition in
unconstrained environments. University of Massachu- Synonyms
setts, Amherst, Technical report (pp. 07–49).
Kanade, T. (1977). Computer recognition of human faces.
Interdisciplinary Systems Research, 47.
Financial econometrics; Machine learning; Pat- F
Li, H., et al. (2013). Probabilistic elastic matching for pose tern recognition; Risk analysis; Time series;
variant face verification. In Proceedings of IEEE Com- Financial forecasting
puter Society Conference on Computer Vision and Pat-
tern Recognition.
Schroff, F., et al. (2015). FaceNet: A unified embedding for
face recognition and clustering. In Proceedings of IEEE Introduction
Computer Society Conference on Computer Vision and
Pattern Recognition. The prediction of financial time series is the pri-
Simonyan, K., et al. (2014). Fisher vector faces in the
mary object of study in the area of financial econo-
wild. In Proceedings of. British Machine Vision
Conference metrics. The first step from this perspective is to
Taigman, Y., et al. (2014). DeepFace: Closing the gap to separate any systematic variation of these series
human-level performance in face verification. In Pro- from their random movements. Systematic
ceedings of IEEE Computer Society Conference on
changes can be caused by trends and seasonal
Computer Vision and Pattern Recognition.
Turk, M. A & Pentland, A. P. (1991) Face recognition and cyclical variations. Econometric models
using eigenfaces. In Proceedings of IEEE Computer include different levels of complexity to simulate
Society Conference on Computer Vision and Pattern the existence of these diverse patterns. However,
Recognition.
machine-learning algorithms can be used to fore-
Wolf, L., et al. (2011). Face recognition in unconstrained
videos with matched background similarity. In Pro- cast nonlinear time series as they can learn and
ceedings of IEEE Computer Society Conference on evolve jointly with the financial markets.
Computer Vision and Pattern Recognition. The most standard econometric approach to
Yang, J., et al. Neural aggregation network for video face
forecast trends of financial time series is the Box
recognition. http://arxiv.org/abs/1603.05474
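The Eigenfaces approach discussed above, which projects face images onto a low-dimensional subspace found through principal component analysis, can be sketched briefly. The example below uses scikit-learn on randomly generated stand-in images, so the image size, number of components, and nearest-neighbor matching rule are illustrative assumptions rather than a faithful reproduction of the original method.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 grayscale "face" images of size 32 x 32, flattened into
# row vectors. A real experiment would load an actual face dataset instead.
rng = np.random.default_rng(42)
faces = rng.random((200, 32 * 32))

# Learn a 20-dimensional "face space": the principal components play the role
# of eigenfaces, and each image is represented by its projection coefficients.
pca = PCA(n_components=20)
codes = pca.fit_transform(faces)                 # shape (200, 20)
eigenfaces = pca.components_.reshape(20, 32, 32)
print("Eigenface array shape:", eigenfaces.shape)

def identify(query_code, gallery_codes):
    """Nearest-neighbor identification in the reduced face space."""
    distances = np.linalg.norm(gallery_codes - query_code, axis=1)
    return int(np.argmin(distances))

print("Closest gallery index:", identify(codes[7], codes))   # 7 (matches itself)
```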
and Jenkins (1970) methodology. This approach
has three major steps: (1) identify the relevant
systematic variations of the time series (trend,
seasonal or cyclical effects), the input variables,
Factory of the Twenty-First and the dynamic relationship between the input
Century and the target variables; (2) estimate the parame-
ters of the model and the goodness-of-fit statistics
▶ Data Center of the prediction in relation to the actual data; and
(3) forecast the target variable.
The simplest forecasting models are based on
either the past values or the error term of the
FAIR Data financial time series. The autoregressive model
[AR(p)] assumes that the current value depends
▶ Data Storage on the “p” most recent values of the series plus an

error term, while the moving average model [MA numerical calculations to make investment deci-
(q)] simulates the current values based on the “q” sions. Additionally, technical analysis helps to
most recent past errors or innovation factors. The formalize traders’ rules: “buy when it breaks
combination of these two models leads to the through a high,” “sell when it is declining,” etc.
autoregressive moving average [ARMA(p,q)] The presence of technical analysis has been very
model, which is the most generic and complete limited in the finance literature because of its lack
model. These models can include additional fea- of a robust statistical or mathematical foundation,
tures to simulate seasonal or cyclical variations or its highly subjective nature, and its visual charac-
the effect of external events or variables that may ter. In the 1960s and 1970s, researchers studied
affect the forecast. trading rules based on technical indicators and did
Some of the main limitations of this method are not find them profitable. Part of the problem of
that it assumes a linear relationship between the these studies was the ad hoc specifications of the
different features when that relationship might be trading rules that led to data snooping. Later on,
nonlinear, and can manage only a limited number Allen and Karjalainen (1999) found profitable
of quantitative variables. Because the complexi- trading rules using a genetic algorithmic model
ties of the financial world have grown dramati- for the S&P 500 with daily prices from 1928 to
cally in the twenty-first century, better ways of 1995. However, these rules were not consistently
forecasting time series are needed. For example, better than a simple buy-and-hold strategy.
the 2007–2009 financial crisis brought on a global
recession with lingering effects still being felt
today. At that point, widespread failures in risk Machine-Learning Algorithms
management and corporate governance at almost
all the major financial institutions threatened a Currently, the major stock exchanges such as
systemic collapse of the global financial markets NYSE and NASDAQ have mostly transformed
leading to large bailouts by governments around their markets into electronic financial markets.
the world. Often, the use of simplified forecasting Players in these markets must process large
methods was blamed for the lack of transparency amounts of structured and unstructured data and
of the real risks embedded in the financial assets, make instantaneous investment decisions. As a
and their inability to deal with the complexity of result of these changes, new machine-learning
high-frequency datasets. algorithms that can learn and make intelligent
The high-frequency financial datasets share the decisions have been adapted to manage large,
four dimensions of big data: an increase in the fast, and diverse financial time series. Machine-
volume of transactions; the high velocity of the learning techniques help investors and corpora-
trades; the variety of information, such as text, tions discover inefficiencies in financial markets
images and numbers, used in every operation; and and recognize new business opportunities or
the veracity of the information in terms of quality potential corporate problems. These discoveries
and consistency required by the regulators. This can be used to make a profit and, in turn, reduce
explosion of big and nonlinear datasets requires the market inefficiencies. Also, corporations
the use of machine-learning algorithms, such as could save a significant amount of resources if
the methods that we introduce in the next sections, they can automate certain corporate finance func-
that can learn by themselves the changing patterns tions such as planning, risk management, invest-
of the financial markets, and that can combine ment, and trading.
many different and large datasets. There is growing interest in applying machine-
learning methods to discover new trading rules or
to formulate trading strategies using technical
Technical Analysis indicators or other forecasting methods. Machine
learning shares with technical analysis the empha-
Technical analysis seeks to detect and interpret sis on pattern recognition. The main problem with
patterns in past security prices based on charts or this approach is that every rule may require a

different set of updated parameters that should be • Unsupervised:


adjusted to every particular challenge. Creamer – Clustering: Aggregate stocks according to
(2012) proposed a method to calibrate a forecast- their returns and risk to build a diversified
ing model using many indicators with different portfolio. These techniques can also be used
parameters simultaneously. According to this in risk management to segment customers
approach, a machine-learning algorithm charac- by their risk profile.
terized by robust feature selection capability, such – Modeling: Uncover linear and nonlinear
as boosting (described below), can find an optimal relationships among economic and financial
combination of the different parameters for each variables.
market. Every time that a model is built, its param- – Feature selection: Select the most relevant
eters are optimized. This method has shown to be variables among vast and unstructured
profitable with stocks and futures. datasets that include text, news, and finan- F
The advantage of machine-learning methods cial variables.
over methods proposed by classical statistics is – Anomaly detection: Identify outliers that
that they do not estimate the parameters of the may represent a high level of risk. It can
underlying distribution and instead focus on mak- also help to build realistic future scenarios
ing accurate predictions for some variables given to forecast prices, such as anticipating
others variables. Breiman (2001) contrasts these spikes in electricity prices. The complex
two approaches as the data modeling culture and and chaotic nature of the forces acting on
the algorithmic modeling culture. While many energy and financial markets tend to defeat
statisticians adhere to the data-modeling ARMA models at extreme events.
approach, people in other fields of science and
engineering use algorithmic modeling to con-
struct predictors with superior accuracy.
Learning Algorithms

Classification of Algorithms to Detect The following are some of the best well-known
Different Financial Patterns learning algorithms that have been used to fore-
cast financial patterns:
The following categories describe the application Adaboost: Apply a simple learning algorithm
of machine-learning algorithms to various finan- to perform an iterative search to locate observa-
cial forecasting problems: tions that are difficult to predict, then it generates
particular rules to differentiate the most difficult
• Supervised: cases. Finally, it classifies every observation com-
– Classification: Classify observations using bining the different rules generated. This method,
two or more categories. These types of algo- invented by Freund and Schapire (1997), has
rithms can be beneficial to forecast asset demonstrated to be very useful to support auto-
price trends (positive or negative) or to pre- mated trading systems due to its feature selection
dict customers who may default on their capability and reliable forecasting potential.
loans or commit fraud. These algorithms Support vector machine (SVM): Preprocess
can also be used to calculate investors’ or data in a higher dimension than the original
news’ sentiment. space. As a result of this transformation proposed
– Regression: Forecast future prices or evalu- by Vapnick (1995), observations can be classified
ate the effect of several features into the into several categories. Support vector machine
target variable. Results could be similar to has been used for feature selection and financial
those generated by an ARMA model, forecasting of several financial products such as
although machine-learning methods may the S&P 500 index, and the US and German
capture nonlinear relationships among the government bond futures, using moving averages
different variables. and lagged prices.
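A boosting model built on technical indicators with different parameters, in the spirit of the Adaboost-based approach described above, might be prototyped as follows. The synthetic price series, the two moving-average features, and the model settings are all hypothetical; on such random data the directional accuracy will hover near chance, and the sketch is meant only to show the shape of the pipeline, not a profitable trading rule.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic daily prices standing in for a real stock or futures series.
rng = np.random.default_rng(7)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=1500)))

def moving_average(x, window):
    """Simple moving average; element i is the mean of x[i:i + window]."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

# Two technical-analysis indicators with different parameters, aligned so
# that row t uses only information available up to day t.
ma_fast = moving_average(prices, 5)      # window ending at day i + 4
ma_slow = moving_average(prices, 20)     # window ending at day i + 19
t0 = 19                                  # first day with both indicators defined
X = np.column_stack([
    prices[t0:-1] / ma_fast[t0 - 4:-1] - 1.0,   # price relative to fast average
    prices[t0:-1] / ma_slow[:-1] - 1.0,         # price relative to slow average
])
y = (np.diff(prices[t0:]) > 0).astype(int)      # next-day direction (up = 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)
clf = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)
print("Directional accuracy on held-out days:", clf.score(X_test, y_test))
```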

C4.5: This is a very popular decision-tree algo- high parallelism in data stream management, and
rithm. It follows a top-down approach where the in the data analysis, either directly or using map/
best feature, introduced as the root of the tree, can reduce architectures, which in turn will require
separate data according to a test, such as the infor- new algorithms to take full advantage of those
mation gain, and its branches are the values of this characteristics. This will provide some benefits
feature. This process is repeated successively with including independent analysis of diverse sources
the descendants of each node creating new nodes without high, initial synchronization require-
until there are no additional observations. At that ments; software that will run on relatively inex-
point, a leaf node is included with the most com- pensive, commodity hardware, and a mix of
mon value of the target attribute. Decision trees algorithms, along with innovative architectures,
are very useful to separate customers with differ- that can provide both real-time alerting as well
ent risk profiles. The advantage of decision trees is as in-depth analysis.
that their interpretation is very intuitive and may This paper introduced some of these machine-
help to detect unknown relationships among the learning algorithms that can learn new financial
various features. markets behaviors, approximate very complex
Neural network (connectionist approach): financial patterns embedded in big datasets, and
This is one of the oldest and the most com- predict trends on financial time series.
monly studied algorithms. Most trading sys-
tems generate trading rules using neural
networks where their primary inputs are tech- Further Reading
nical analysis indicators and the algorithm
build different layers of nodes simulating how Allen, F., & Karjalainen, R. (1999). Using genetic algo-
the brain works. Based on the final result, the rithms to find technical trading rules. Journal of Finan-
cial Economics, 51(2), 245–271.
model back propagates its errors and corrects Box, G. Y., Jenkins, G. (1970). Time Series Analysis:
the parameters until it has an acceptable accu- Forecasting and Control. San Francisco: Holden-Day.
racy rate. This approach has been applied to Breiman, L. (2001). Statistical modeling: The two cultures.
forecast and trade S&P 500 index futures, the Statistical Science, 16(3), 199–215.
Creamer, G. (2012). Model calibration and automated trad-
Warsaw stock price index 20 futures, and Korea ing agent for euro futures. Quantitative Finance, 12(4),
stock index 200 futures. 531–545.
Genetic algorithm (emergent approach): The Freund, Y., & Schapire, R. (1997). A decision-theoretic
genetic algorithm or genetic programming generalization of on-line learning and an application to
boosting. Journal of Computer and System Sciences,
approach is used to generate trading rules where 55, 119–139.
its features are coded as evolving chromosomes. Vapnik, V. (1995). The nature of statistical learning theory.
These rules are represented as binary trees in New York: Springer-Verlag.
which the leaves are technical indicators, and the
non-leaves are Boolean functions. Together they
represent simple decision functions. The advan-
tage of this approach is that the rules are interpret- Financial Econometrics
able and can change according to the financial
product under study. ▶ Financial Data and Trend Prediction

Conclusion
Financial Forecasting
The big data characteristics of the financial data
and the modeling requirements allow for very ▶ Financial Data and Trend Prediction

Insurance), new or fairly new firms with fairly


Financial Services famous names (like Ally Bank, Fidelity, and Pay
Pal), and firms so new and small that only their
Paul Anthony Laux customers know their names.
Lerner College of Business and Economics and The economic importance of the financial ser-
J.P. Morgan Chase Fellow, Institute for Financial vices sector can be sensed from its size. For exam-
Services Analytics, University of Delaware, ple, financial services accounts for a bit less than
Newark, DE, USA 10% of the total value added and GDP of the US
economy. It employs around 6 million people in
the USA. The US Department of Labor Statistics
The Nature of the Financial Services refers to it as the “Financial Activities Super-
Sector sector.” Even further, because its functions are so F
central to the functioning of the rest of the econ-
The financial services sector performs functions omy, the financial sector has importance beyond
that are crucial for modern economies. Most cen- its size. Famously, financial markets and the pro-
trally these are: vision of credit are subject to bouts of instability
with serious implications for the economy as a
• The transfer of savings from household savers whole.
to business investors in capital goods; this
enables capital formation and growth in the
economy over time. The Importance of Big Data for the
• The provision of payment systems for goods Financial Services Sector
and services; this enables larger, faster, and
more cost-efficient markets for goods and ser- Big data is a fast developing area. This entry
vices at any point in time. focuses on recent developments. A historical dis-
• The management of risk, including insurance, cussion of related issues, grouped under the term
information, and diversification; this enables “e-finance,” is provided by Allen et al. (2002).
individuals and firms to bear average risks Another discussion, using the term “electronic
and avoid being paralyzed by undue volatility. finance,” is given by Claessens, Glaessner, and
Klingebiel (2002). Even though an attempt to
These services are provided in contract-like delineate all the connections of big data with
ways (via bank accounts, money market and financial services runs the risk of being incom-
mutual funds, pensions and 401-Ks, credit cards, plete and soon obsolete, there are some key link-
business, car and mortgage loans, and life and ages that appear likely to persist over time. These
casualty insurance policies) and in security-like include:
ways (via stocks, bonds, options, futures, and the
like). These services are provided by traditional • Customer analytics and credit assessment
firms (like commercial and investment banks, • Fintech
mutual fund companies, asset managers, insur- • Financial data security/cybersecurity
ance companies, and brokerage firms) and new • Financial data systems for systemic risk
economy enterprises (like peer-to-peer lending, management
cooperative payment transfer systems, and risk • Financial data systems for trading, clearing,
sharing cooperatives). These services are pro- and risk management
vided by long-standing firms with famous names • Financial modeling and pricing
(like Goldman Sachs, Bank of America, General • Privacy
Electric, Western Union, AIG, and Prudential • Competitive issues
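The first linkage listed above, customer analytics and credit assessment, is discussed in the section that follows. As a purely illustrative sketch of the underlying idea, the toy model below scores a hypothetical applicant from a few behavioral features using logistic regression; the features, data, and decision threshold are invented and do not correspond to any actual lender's scoring model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical applicant features: [late payments last year, utilization of
# existing credit lines, years of purchase history available to the lender].
X = np.array([
    [0, 0.20, 6.0],
    [1, 0.35, 4.0],
    [5, 0.90, 1.0],
    [0, 0.10, 8.0],
    [3, 0.75, 2.0],
    [2, 0.50, 3.0],
])
y = np.array([0, 0, 1, 0, 1, 1])   # 1 = defaulted on a past loan

model = LogisticRegression().fit(X, y)

# Score a new applicant and convert the default probability into a decision.
applicant = np.array([[1, 0.40, 5.0]])
p_default = model.predict_proba(applicant)[0, 1]
print(f"Estimated default probability: {p_default:.2f}")
print("Offer credit" if p_default < 0.5 else "Decline or adjust terms")
```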

Customer analytics. One of the major early uses of big data methods within the financial services sector is for customer analytics. Several particular uses are especially important. The first of these is the use of big data methods to assess creditworthiness. Banks, credit card issuers, and finance companies need to decide on credit terms for new applicants and, in the case of revolving credit, to periodically update terms for existing customers. The use of payment and credit history, job, and personal characteristics for this purpose was well developed long ago. However, the use of data acquired as a natural part of doing business with retail credit customers (for example, the purchasing details of credit card users) has only recently been undertaken. In less developed credit markets (China, for example), where credit scores and the like are not available, lenders have been experimenting successfully with inferring creditworthiness from Internet purchasing history.

A second prominent use of big data for customer analytics is to tailor the offering of financial products via cross-selling, for example, using credit card purchase data to help decide which insurance products might be attractive.

A third prominent use of big data in financial services customer analytics arises because a credit relationship is inherently long-lived. Thus, decisions must be made over time to tailor the relationship with the customer. A specific example is the use of data extracted from recordings of phone calls with delinquent-pay mortgage borrowers. It is of interest to the lender to distinguish clients who want to stay in their homes but are experiencing financial trouble from those who are less committed to retaining ownership of a house. Experiments are underway at one large mortgage services provider to analyze voice data (stress levels, word choice, pacing of conversation, etc.) in an attempt to discern the difference. From a broader point of view, many of these activities fit in with recent thinking about customer analytics in service businesses more generally, in that they focus on ways to create value via customer engagement over time, as discussed in, for example, Bijmolt et al. (2010).
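To make the idea of purchase-history-based credit assessment concrete, the following minimal sketch fits a simple logistic-regression scorecard to synthetic behavioral features. The features, data, and coefficients are invented for illustration only; production credit-scoring systems are far more elaborate and subject to regulatory constraints.

```python
import numpy as np

# Hypothetical behavioral features derived from card-purchase histories, e.g.:
# [monthly spend volatility, share of discretionary spend, number of late payments].
# Data and coefficients are synthetic stand-ins for features a lender might
# engineer from transaction records.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([0.9, 0.6, 1.4])
y = (X @ true_w + rng.normal(scale=0.5, size=500) > 0).astype(float)  # 1 = later delinquent

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a logistic-regression scorecard by gradient descent on the log-loss.
w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)               # predicted delinquency probability
    w -= lr * X.T @ (p - y) / len(y)     # gradient step for the weights
    b -= lr * np.mean(p - y)             # gradient step for the intercept

new_applicant = np.array([0.4, 1.1, -0.2])  # the same three features for a new account
print(f"Estimated delinquency probability: {sigmoid(new_applicant @ w + b):.2f}")
```

The same basic pattern, engineered behavioral features feeding a probabilistic model, underlies many of the customer analytics applications described above.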

Fintech. Fintech refers to the provision of financing and financial services by nontraditional firms and networks supported by Internet communication and data provision. From a finance point of view, much of fintech fits into “shadow banking,” the provision by nonbanks of financial services traditionally provided by banks. Much of fintech also fits into the more general concept of peer-to-peer service provision. Examples of fintech include equity finance for projects via such networks as Kickstarter, interpersonal lending via networks such as Lending Club and Prosper, single currency payment networks like PayPal and CashEdge, and foreign exchange transfer services such as CurrencyFair and Xoom. The operations of many of these services are Internet-enabled but do not involve big data. Even so, the aggregation and policing of working-size peer-to-peer networks definitely involve big-data methods. For more on peer-to-peer payments developments, see Windh (2011). For a broader recent discussion, see The Economist (2013). The use of the term “fintech” has evolved over time; for an early commercial use of the term, which also gives a sense of the former nature of “big data,” see Bettinger (1972).

Financial data security/cybersecurity. Cybersecurity is a huge and growing need when it comes to financial services, as demonstrated by the frequency and size of successful hacker attacks on financial institutions (such as JP Morgan Chase) and financial transactions at retail firms (such as credit card transactions at Target). The banking system has a long experience of dealing with security in relatively closed systems (such as ATM networks). Increasingly, nonbank firms and more open networks are involved. With the advent of mobile banking, mobile payments, and highly dispersed point-of-payment networks, the issue will continue to grow in importance.

Financial data systems for systemic risk management. The global financial crisis of 2007–2008, and the recession that followed, exemplified with painful clarity the deep interconnections between the financial system and the real economy. Few expected that consumers’ and banks’ shared incentives to over-leverage real estate investments, the securitization boom that this supported, and the eventual sharp decline in values could trigger such massive difficulties. These included a near-failure of the banking system, a freeze-up in short-term lending markets, a global bear market in stocks, and extensive and persistent unemployment in the real economy. A core problem was convergence: in a crisis, all markets move down together, even though they may be less correlated in normal times. Big data research methods have the potential to help reduce the chance of a repeat by helping us understand the sources of cross-market correlation. For a broader discussion and implications for the future of data systems and risk in the financial system, see Kyle et al. (2013).

Financial data systems for trading, clearing, and risk management. From an economic point of view, banks, financial markets, businesses, and consumers are intrinsically interconnected, with effects flowing from one to the others in a global system. From a data systems point of view, the picture is more of moats than rivers. For example, the United States uses one numbering system for tracking stocks and bonds (CUSIP), while the rest of the world is not on this standard. Within the United States, the numbering systems for mortgage loans and mortgage securities are not tightly standardized with those for other financial products. To systematically trace from, say, payroll data to purchases of goods and services via credit cards, to effects on the ability to make mortgage payments, to the default experience on mortgage bonds is, to put it lightly, hugely difficult. The development of ontologies (structural frameworks for systematically categorizing and linking information) is a growing need.

Financial modeling and pricing. One of the earliest uses of extensive computing power within the banking industry was for large simulation models that could inform the buying, selling, and risk management of loans, bonds, derivatives, and other securities and contracts. This activity has become extremely well developed and somewhat commoditized at this point. Big data, in the sense of unstructured data, has been less used, though cutting-edge computer science methods are routinely employed. In particular, neural network methods (less recently) and machine learning methods (more recently) have been explored with particular application to trading and portfolio management.

Privacy. The tradeoffs of privacy and benefit for big data in financial services are qualitatively similar to those in other sectors. However, the issues are more central given that personal and company information is at the heart of the sector’s work. With more amassing, sharing, and analysis of data collected for one purpose to serve another purpose, there can be more benefit for the consumer/client and/or for the firm providing the service. Conflicts of interest are numerous, and the temptation (or even tendency) will be to use the consumers’ data for the benefit of the firm, or for the benefit of a third party to which the data is sold. As in other sectors, establishing clear ownership of the data themselves is key, as is establishing guidelines and legal limits for their use. Privacy issues seem certain to be central to a continuing public policy debate.

Competitive issues. Just as many of the big-data uses listed above have privacy implications, they also have implications for financial firms’ competitiveness. That is, big data may help us to better understand the interconnections of consumers, housing, banks, and financial markets if we can link consumer purchases, mortgage payments, borrowing, and stock trading. But financial products and services are relatively easy to duplicate, so customer identities and relationships are a special and often secret asset. In finance, information is a competitive edge and is likely to be jealously guarded.

Further Reading

Allen, F., McAndrews, J., & Strahan, P. (2002). E-finance: An introduction. Journal of Financial Services Research, 22(1–2), 5–27.
Bettinger, A. (1972). FINTECH: A series of 40 time shared models used at Manufacturers Hanover Trust Company. Interfaces, 2, 62–63.
Bijmolt, T. H., Leeflang, P. S., Block, F., Eisenbeiss, M., Hardie, B. G., Lemmens, A., & Saffert, P. (2010). Analytics for customer engagement. Journal of Service Research, 13(3), 341–356.
Kyle, A., Raschid, L., & Jagadish, L. V. (2013). Next generation community financial cyberinfrastructure for managing systemic risk. National Science Foundation Report for Grant IIS1237476.

The Economist. (2013). Revenge of the nerds: Financial-technology firms, 03 Aug, 408, 59.
Windh, J. (2011). Peer-to-peer payments: Surveying a rapidly changing landscape. Federal Reserve Bank of Atlanta, 15 Aug.

Forester

▶ Forestry

Forestry

Christopher Round
George Mason University, Fairfax, VA, USA
Booz Allen Hamilton, Inc., McLean, VA, USA

Synonyms

Forester; Silviculture; Verderer

Definition

Forestry is the science or practice of planting, managing, and caring for forests to meet human goals and environmental benefits (Merriam-Webster 2019). While originally viewed as a separate science, today it is considered a land-use science similar to agriculture. Someone who performs forestry is a forester. Forestry as a discipline pulls from the fields of environmental science, ecology, and genetics. Forestry can be applied to a myriad of goals such as, but not limited to, timber management for resource extraction, long-term forest management for carbon sequestration, and ecosystem management to achieve conservation goals. Forestry is ultimately a data-driven practice, relying on a combination of the previous experience of the forester and growth models. Big data is increasingly important for the field of forestry as it can improve both the knowledge and process of supply chain management, optimum growth and harvest strategies, and how to optimize forest management for different goals.

What Is Forestry?

Forestry is the science or practice of planting, managing, and caring for forests to meet human goals and environmental benefits (Merriam-Webster 2019). As a field, forestry has a long history, with evidence of practices dating back to ancient times (Pope et al. 2018). While originally viewed as a separate science, today it is considered a land-use science similar to agriculture. Someone who performs forestry is a forester. Forestry as a discipline pulls from the fields of environmental science, ecology, and genetics (Pope et al. 2018).

Relations to Other Disciplines

Forestry is differentiated from forest ecology in that forest ecology is a value-neutral study of forests as ecosystems (Barnes et al. 1998; Pope et al. 2018). Forestry is not value neutral, as it is focused on studying how to use forest ecosystems to achieve different socioeconomic and/or environmental conservation goals. While natural resource management, which is focused on the long-term management of natural resources over often intergenerational time scales, may use techniques from forestry, forestry is a distinct discipline (Epstein 2016).

Forestry is related to silviculture, and the two terms have been used interchangeably (Pope et al. 2018; United States Forest Service 2018). Silviculture, however, is exclusively concerned with the growth, composition, and establishment of timber (Pope et al. 2018), while forestry has a broader focus on the forest ecosystem. Thus, silviculture can be considered a subset of forestry.

Types of Forestry

Forestry can be applied to a myriad of goals such as, but not limited to, timber management for resource extraction, long-term forest management for carbon sequestration, and ecosystem management to achieve conservation goals.

Modern forestry is focused on the idea of forests having multiple uses (also known as the multiple-use concept) (Barnes et al. 1998; Pope et al. 2018; Timsina 2003). This leads to a focus on the sustained yield of forest products as well as recreational activities and wildlife conservation. Sustained yield is the act of extracting ecological resources without reducing the base of the resources themselves, in order to avoid the loss of ecological resources. Forestry can be used to manage watersheds and prevent issues with erosion (Pope et al. 2018). It is also connected to fire prevention, insect and disease control, and urban forestry (forestry in urban settings). Urban forestry is of particular concern for the burgeoning field of urban ecology (Francis and Chadwick 2013; Savard et al. 2000; United States Forest Service 2018).

Examples of Prominent Journals

Journal of Forestry, published by the Society of American Foresters (Scimago Institutions Rankings 2018)
Forest Ecology and Management, published by Elsevier BV (Scimago Institutions Rankings 2018)
Forestry, published by Oxford University Press (Scimago Institutions Rankings 2018)

Further Reading

Barnes, B. V., Zak, D. R., Denton, S. R., & Spurr, S. H. (1998). Forest ecology (4th ed.). New York: Wiley.
Epstein, C. (2016). Natural resource management. Retrieved 28 July 2018, from https://www.britannica.com/topic/natural-resource-management
Francis, R., & Chadwick, M. (2013). Urban ecosystems: Understanding the human environment. New York: Routledge.
Merriam-Webster. (2019). Definition of FORESTRY. Retrieved 12 September 2019, from https://www.merriam-webster.com/dictionary/forestry
Pope, P. E., Chaney, W. R., & Edlin, H. L. (2018, June 14). Forestry – Purposes and techniques of forest management. Retrieved 25 July 2018, from https://www.britannica.com/science/forestry
Savard, J.-P. L., Clergeau, P., & Mennechez, G. (2000). Biodiversity concepts and urban ecosystems. Landscape and Urban Planning, 48(3–4), 131–142. https://doi.org/10.1016/S0169-2046(00)00037-2.
Scimago Institutions Rankings. (2018). Journal rankings on forestry. Retrieved 28 July 2018, from https://www.scimagojr.com/journalrank.php?category=1107
Timsina, N. P. (2003). Promoting social justice and conserving montane forest environments: A case study of Nepal’s community forestry programme. The Geographical Journal, 169(3), 236–242.
United States Forest Service. (2018). Silviculture. Retrieved 28 July 2018, from https://www.fs.fed.us/forestmanagement/vegetation-management/silviculture/index.shtml

Fourth Amendment

Dzmitry Yuran
School of Arts and Communication, Florida Institute of Technology, Melbourne, FL, USA

The Fourth Amendment to the US Constitution is fundamental to privacy law. Part of the US Bill of Rights, ratified in 1791 and adopted in 1792, it was designed to ensure protection for citizens against unlawful and unreasonable searches and seizures of property by the government. The prime role of the Fourth Amendment has not changed since the eighteenth century, but today’s expanded number of threats to citizens’ privacy demands a wider range of applications for the amendment and brings to life a number of necessary clarifications.

Great amounts of information are generated by organizations and individuals every day. Ever-evolving technology makes capturing and storing this information increasingly simple by turning it into an automated, relatively cheap routine that every office and private business owner, website administrator and blogger, smartphone user and video gamer, car driver, and most anyone else engages in every day. The ease of storing, accessing, analyzing, and transferring digital information, made possible by technological advances, creates additional vulnerabilities to citizens’ privacy and security. Laws and statutes have been put in place to protect the privacy of US citizens and shield them from governmental and corporate abuse.

Many argue these regulations struggle to keep up with the rapidly evolving world of digital communications and are turning into obsolete and largely meaningless legislation. While clarifying certain moments about the ways in which US citizens’ privacy is protected by the law, the statutes ratified by the US government in its struggle to ensure the safety of the nation curb the protective power of constitutional privacy law. And while constitutional law limits, to a certain degree, the ability of law enforcement agencies and other governmental bodies to gather and use data on US citizens, private companies and contractors collect information that could later be used by the government and other entities, raising additional concerns.

Like many articles of the US Constitution and of the Bill of Rights, the Fourth Amendment has undergone a great deal of interpretation in court decisions and has been supplemented and limited by acts and statutes enacted by various bodies of the US government. In order to understand its applications in the modern environment, one has to consider the ever-evolving legal context as well as the evolution of the areas of application of the law. The latter is highly affected by the accelerating development of technology and is changing too quickly, some argue, for the law to keep up.

The full text of the Fourth Amendment was drafted in the late eighteenth century, during times when privacy concerns had not yet become topical. As the urban population of the United States was not much larger than 10%, no declarations of privacy were pertinent outdoors and in the shared quarters of one-room farm houses.

Amendment IV

The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no warrants shall issue, but upon probable cause, supported by oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

Before the electrical telegraph came into existence, and well before Facebook started using personal information for targeted advertisement and the PRISM program was an issue, the initial amendment text was mainly concerned with the physical spaces of citizens and their physical possessions. Up until the late 1960s, interpretations of the Fourth Amendment did not consider electronic surveillance without physical invasion of a protected area to be a violation of one’s constitutional rights. In other words, wiretapping and intercepting communications were legal. The Supreme Court decision in the Katz v. United States (1967) case (in which attaching listening and recording devices to telephone booths in public places was contended to be a violation of the Fourth Amendment) signified recognition of the existence of a constitutional right to privacy and expanded Fourth Amendment protection from places onto people.

Privacy is a constitutional right, despite the fact that the word “privacy” is not mentioned anywhere in the Constitution. That applies to the Bill of Rights and its first 8 and the 14th amendments, which combined are used to justify the right to be let alone by the government. The US Supreme Court decisions (beginning with the 1961 Mapp v. Ohio search and seizure case) created a precedent protecting citizens against unwarranted intrusions by the police and the government. The ruling in Katz added one’s electronic communications to the list of private things and extended private space into public areas. However, legislation has since been enacted by Congress which broadened significantly the authority of law enforcement agencies in conducting surveillance of citizens.

Constitutional law is not the only regulation of electronic surveillance and other forms of privacy infringement – federal laws as well as state statutes play crucial parts in the process as well. While individual state legislation acts vary significantly and have limited areas of influence, a rather complex statutory scheme implemented by Congress applies across state borders and law enforcement agencies. The four statutes below were designed to supplement and specify application of the constitutional right to be let alone in specific circumstances. At the same time, they open up new opportunities for law enforcement and national security services to collect and analyze information about private citizens.

Title III of the Omnibus Crime Control and Safe Streets Act (Title III) of 1968 regulates authorized electronic surveillance and wiretapping. It requires law enforcement representatives to obtain a court order prior to intercepting private citizens’ communications. If state legislation does not allow for the issuance of such orders, wiretapping and other forms of electronic surveillance cannot be authorized and carried out legally. The regulated surveillance includes that to which neither of the parties in the surveyed communication gives their consent – in other words, eavesdropping. “National security” eavesdropping has an exceptional status under the statute.

The Electronic Communications Privacy Act (ECPA) of 1986 amended Title III and added emails, voicemail, computing services, and wireless telephony to the list of regulated communications. And while the purpose of ECPA was to safeguard electronic communications from government intrusion and to keep electronic service providers from accessing personal information without users’ consent, the statute did add to the powers of law enforcement in electronic surveillance in some circumstances.

The Communications Assistance for Law Enforcement Act (CALEA) of 1994, in the language of the act, made “clear a telecommunications carrier’s duty to cooperate in the interception of communications for law enforcement purposes.” The statute was also aimed at safeguarding law enforcement’s authorized surveillance, protecting the privacy of citizens, and protecting the development of new technology. Yet again, while more clarity was brought to protecting citizens’ rights in light of new technological advances, more authority was granted to law enforcement agencies in monitoring and accessing electronic communications.

The heaviest blow that constitutional privacy rights have suffered from congressional statutes was dealt by the USA PATRIOT Act. Passed after the terrorist attacks on the United States on September 11, 2001, it was designed to address issues with communications between and within governmental agencies. While data sharing was streamlined and regulated, the investigative powers of several agencies were significantly increased.

The dualistic nature of the four statutes mentioned above makes them a source of great controversy in the legal and the political worlds. On the one hand, all four contain claims of protecting individuals’ privacy. On the other hand, they extend the law enforcement agencies’ power and authority, which in turn limits the rights of individuals. The proponents of the statutes argue that empowering law enforcement is absolutely necessary in combating terrorism and fighting crime, while their opponents raise the concern of undermining the Constitution and violating individual rights.

While the main scope of constitutional law is on protecting citizens from governmental abuses, regulation of data gathering and storage by private entities has proven to be altogether critical in this regard as well. Governmental entities and agencies outsource some of their operations to private contractors. According to the Office of the Director of National Intelligence, nearly 20% of intelligence personnel worked in the private sector in 2013. Bloomberg Industries analyses showed about 70% of the intelligence budget going to contractors that year. While the intelligence agencies do not have sufficient resources to oversee all the outsourced operations, access of private entities to sensitive information brings with it risks to national security. Disclosure of some of that information has also revealed questionable operations by government agencies, like the extent of the National Security Agency’s eavesdropping, leaked by Edward Snowden, a former employee of the Booz Allen Hamilton consulting firm. Some argue that the latter is a positive stimulus for the development of transparency, which is crucial for the healthy evolution of a democratic country. The other side of the argument focuses attention on the threats to national security brought about by untimely disclosure of secret governmental information.

Besides commissioning intelligence work to the private sector, the government often requests information from organizations unaffiliated with the federation or individual states.

According to the Google Transparency Report, during the 2013 calendar year, the corporation received 21,492 requests (over 40% of all requests by all countries that year) for information about its 39,937 users from the US government and provided some data on 83% of the requests. While not having to maintain files on private citizens and engage in surveillance activities requiring court orders, the government can get access to vast amounts of data collected by private entities for their own purposes. What kind of data Google and other private companies can store and then provide to the US government upon legitimate requests under congressional statutes is just as much of a Fourth Amendment issue and a government access to information concern as NSA eavesdropping.

In the 1979 Miller vs. Maryland case, the Supreme Court of the United States ruled that willingly disclosed information was not protected by the Fourth Amendment. Advances in technology allow for using the broad “voluntary exposure” terminology for a wide variety of information gathering techniques.

Quite a few things users share with electronic communications service providers voluntarily and knowingly. In many cases, however, people are unaware of the purposes that the data collected from them are serving. Sometimes, they don’t even know that information is being collected at all. As an example, Google installed tracking cookies bypassing the privacy settings of Safari (the Web browser developed by Apple Inc.) and had to pay a substantial fine (nearly 23 million dollars) for such behavior. The case illustrates that companies will go to great lengths and will engage in questionable activities to gather more information, often in violation of users’ expectations and without users possibly foreseeing it. As another example, Twitter, a social networking giant, used its iPhone application software to upload and store all the email addresses and phone numbers from its users’ devices in 2012 while leaving service subscribers oblivious to the move. The company kept the data for 18 months. In another instance from the same year, Walmart bought Social Calendar, a Facebook application. At the time, the app’s database contained information about 15 million users, including 110 million birthdays and other events. Users provided this information in order to receive updates about their family and friends, and now it ended up in the hands of a publicly held company, free to do anything it would like with that data.

Collecting data about users’ activities on the Internet via tracking cookies is considered voluntary exposure, even though users are completely unaware of what exactly is being tracked. No affirmative consent is needed for keeping a log of what a person reads, buys, or watches.

According to a 2008 study out of Carnegie Mellon University, it would take an average person 250 working hours a year to read all the privacy policy statements for all the new websites they visited, more than 30 full working days. By 2016, 8 years later, the number has likely increased considering the rise in the time people spend online. Not surprisingly, the percentage of people who actually read the privacy statements of the websites they visit is very low. And while privacy settings create an illusion of freedom from intrusion for the users of social networking sites and cloud-storage services alike, they grant no legal protection.

As users of digital services provide their information for the purpose of getting social network updates, gift recommendations, trip advice, purchase discounts, etc., it often ends up stored, sold, and used elsewhere, often for marketing purposes and sometimes by the government.

Storage and manipulation of certain information pose risks to the privacy and safety of the people providing it. By compiling bits and pieces of information, one can establish an identifiable profile. In an illustrative example, America Online released 20 million Web searches conducted over a three-month period by 650,000 users. While no names or IP addresses were left in the data set, the New York Times managed to piece together profiles from the data, full enough to establish people’s identities. And while a private company has gone so far just for the sake of proving the concept, governmental agencies, foreign regimes, terrorist organizations, and criminals could technically do the same with very different goals in mind.

Constitutional law of privacy has been evolving alongside technological developments and state security challenges. Due to the complexity of the relationships between governmental agencies and the private sector today, its reach exceeds the prescription of firsthand interactions between law enforcement officers and private citizens. Information gathered and stored by third parties allows for governmental infringement on citizens’ privacy just the same. And while the vague language of “voluntary disclosure” and “informed consent” is being used by private companies to collect private information, users are often unaware of the uses for the data they provide and the potential risks from its disclosure.

Cross-References

▶ National Security Administration (NSA)
▶ Privacy

Further Reading

Carmen, D., Rolando, V., & Hemmens, C. (2010). Criminal procedure and the supreme court: A guide to the major decisions on search and seizure, privacy, and individual rights. Lanham, MD: Rowman & Littlefield Publishers.
Gray, D., & Citron, D. (2013). The right to quantitative privacy. Minnesota Law Review, 98(1), 62–144.
Joh, E. E. (2014). Policing by numbers: Big data and the fourth amendment. Washington Law Review, 89(1), 35–68.
McInnis, T. N. (2009). The evolution of the fourth amendment. Lanham: Lexington Books.
Ness, D. W. (2013). Information overload: Why omnipresent technology and the rise of big data shouldn’t spell the end for privacy as we know it. Cardozo Arts & Entertainment Law Journal, 31(3), 925–957.
Schulhofer, S. J. (2012). More essential than ever: The fourth amendment in the twenty first century. Oxford: Oxford University Press.
United States Courts. What does the fourth amendment mean? http://www.uscourts.gov/educational-resources/get-involved/constitution-activities/fourth-amendment/fourth-amendment-mean.aspx. Accessed August 2014.

Fourth Industrial Revolution

Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Overview

The Fourth Industrial Revolution (4IR) is just beginning to unfold and take shape. Characterized by developments and breakthroughs in an array of emerging technologies (e.g., nanotechnology, artificial intelligence, blockchain, 3D printing, quantum computing, etc.), the 4IR – also known as Industry 4.0 – follows from three prior industrial revolutions (Schwab 2015):

1. Steam power and mechanization of manufacturing and agriculture (eighteenth century)
2. Electricity and mass production (early nineteenth century)
3. Information technology and automation of routine manual and cognitive processes (second half of the twentieth century)

While the 4IR builds on digital technologies and platforms that arose in the Third Industrial Revolution, i.e., the “digital revolution,” this latest period of disruptive technological and social change is distinct and unprecedented in its “velocity, scope, and systems” impact (Schwab 2015). Technology is progressing at an accelerating rate, advancing exponentially rather than linearly. Moreover, technologies tied to the 4IR touch large swaths of the globe and every industry and sector, “transforming entire systems of production, management, and governance” (Schwab 2015). Emerging technologies are also blurring the boundaries between the “physical, digital, and biological” worlds (Schwab 2015), with the capacity to assist, augment, and automate human behavior and intelligence in ways that were not possible before. Indeed, the 4IR is radically reshaping how we live, work, interact, and play in novel and remarkable ways.

Big data and big data analytics are vital elements of the Fourth Industrial Revolution. They play a prominent and essential role in all the technological pillars of the 4IR, including cyber-physical systems (CPS), the Internet of Things (IoT), cloud computing, artificial intelligence (AI), and blockchain, among others. As critical inputs and outputs to these systems and their components and related applications, big data (and big data analytics) can be considered the connective glue of the 4IR.

Role of Big Data and Analytics

Cyber-physical systems (CPSs), which are “smart systems” that integrate physical and cyber components seamlessly and automatically to perform sensing, actuating, computing, and communicating functions for controlling real-world systems, are big data engines (Atat et al. 2018). CPSs are everywhere – from autonomous vehicles to the smart grid to industrial control and robotics systems, and they are contributing to a tsunami of big data (Atat et al. 2018). Consider a single autonomous vehicle, which produces roughly 4,000 gigabytes of data per day from just an hour or so of driving (Nelson 2016). To handle the massive amounts of data it generates, a CPS relies on two functional components: system infrastructure and big data analytics (Xu and Duan 2019). The former supports activities tied to data acquisition, storage, and computing, while the latter enables real-time, actionable insight to be gleaned from the data. Both are critical for ensuring that CPSs are scalable, secure, resilient, and efficient and that the products and services they provide are customized to the needs and desires of consumers (Xu and Duan 2019).

The Internet of Things (IoT) serves as a critical bridge between CPSs, enabling data and information exchange between systems (Atat et al. 2018). The IoT is a massive (and continually growing) network of machine devices (e.g., sensors, robots, and wearables) – or “smart objects” – tied to the Internet. Each object has a “unique identifier” and the capacity to transfer data over a network without the need for a human-in-the-loop (Rose et al. 2015). Devices connected to the IoT and IoT applications (e.g., weather prediction systems, smart cities, and precision agriculture) produce continual data streams, thus providing an ongoing and real-time picture of people, places, industries, and the environment.

Big data generated by CPSs, the IoT, and other technologies relies heavily on distributed storage and processing technology based on cloud computing (Hashem et al. 2015). Cloud computing captures, stores, and processes information at data centers in the “cloud,” rather than on a local electronic device such as a computer. Through resource virtualization, parallel processing, and other mechanisms, cloud computing facilitates scalable data operations and warehousing (Hashem et al. 2015). Moreover, given that it uses a “software-as-a-service” model, any person, place, or organization can access it – at least, in theory. However, with all that said, cloud computing is not an efficient or scalable solution for managing the deluge of geo-temporal data produced by mobile devices and spatially embedded sensors, such as those connected to the IoT and CPSs. Accordingly, there is a move toward edge computing, which handles data at its source rather than at a centralized server.

Traditional computational and statistical approaches and techniques cannot effectively accommodate and analyze the volume and complexity of data produced by CPSs, the IoT, and other sources (Ochoa et al. 2017). In this regard, artificial intelligence (AI), referring to a suite of data-driven methods that mimic various aspects of human information processing and intelligence, has a critical role to play. For example, deep learning, a particular type of AI, has the capacity for “extracting complex patterns from massive volumes of data, semantic indexing, data tagging, fast information retrieval, and simplifying discriminative tasks” (Najafabadi et al. 2015). Cognitive computing is an emerging analytical paradigm, which leverages and integrates various methods and frameworks, including deep learning, natural language processing, and ontologies (Hurwitz et al. 2015). In contrast to AI alone, cognitive computing can learn at scale, interact with reason, understand the context, and naturally interact with humans (Vajradhar 2019). Therefore, it is a highly intelligent human-centered approach for making sense of and gaining actionable knowledge from big data.
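The kind of lightweight, near-the-sensor analytics implied by the shift toward edge computing can be illustrated with a small, self-contained sketch. The detector below is a toy rolling z-score filter rather than a deep learning model, and the sensor stream, window size, and threshold are invented for illustration; it simply shows how a device might flag unusual readings locally before anything is sent to the cloud.

```python
from collections import deque
from math import sqrt

class EdgeAnomalyDetector:
    """Toy rolling z-score detector, a stand-in for the lightweight analytics
    that might run on an IoT device before data ever reaches the cloud."""

    def __init__(self, window: int = 50, threshold: float = 3.0):
        self.window = deque(maxlen=window)  # recent readings only
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Return True if the new reading looks anomalous relative to recent history."""
        if len(self.window) >= 10:
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = sqrt(var) or 1e-9
            is_anomaly = abs(value - mean) / std > self.threshold
        else:
            is_anomaly = False  # not enough history yet
        self.window.append(value)
        return is_anomaly

# Simulated temperature stream from a single sensor, with one injected spike.
detector = EdgeAnomalyDetector()
stream = [20.0 + 0.1 * (i % 7) for i in range(200)]
stream[120] = 35.0  # injected fault
alerts = [i for i, v in enumerate(stream) if detector.update(v)]
print("Anomalous readings at indices:", alerts)
```

In practice, such local filtering would be combined with the large-scale, learned models described above, with only summaries or flagged events forwarded to centralized cloud resources.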

The quality, integrity, and security of big data used and produced by technological systems in the 4IR are enormous concerns. Blockchain, a decentralized, distributed, and immutable digital ledger, provides a possible means for addressing these issues. Unlike centralized ledgers, blockchain records transactions between parties directly, thus removing the intermediary. Each transaction is vetted and authenticated by powerful computer algorithms running across all the blocks and all the users, where consensus across the nodes is required to establish a transaction’s legitimacy. All data on the blockchain is encrypted and hashed. Accordingly, any data added to a blockchain should be accurate, immutable, and safe from intrusion and unauthorized alterations; however, in reality, this is not always the case (Alkhalifah et al. 2019).
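The hash-linking that underlies the immutability claim can be sketched in a few lines of Python. This is a deliberately minimal illustration, not a description of any production blockchain: real systems add consensus protocols, digital signatures, peer-to-peer replication, and much more. The block fields and transaction values below are invented for the example.

```python
import hashlib
import json
import time

def make_block(data, previous_hash):
    """Create a block whose identity depends on its contents and its predecessor."""
    block = {
        "timestamp": time.time(),
        "data": data,
        "previous_hash": previous_hash,
    }
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def chain_is_valid(chain):
    """Recompute each hash; any retroactive edit breaks the links that follow."""
    for i, block in enumerate(chain):
        payload = {k: block[k] for k in ("timestamp", "data", "previous_hash")}
        expected = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        if block["hash"] != expected:
            return False
        if i > 0 and block["previous_hash"] != chain[i - 1]["hash"]:
            return False
    return True

# Build a tiny ledger of transactions and then tamper with an early block.
chain = [make_block({"from": "A", "to": "B", "amount": 10}, previous_hash="0")]
chain.append(make_block({"from": "B", "to": "C", "amount": 4}, chain[-1]["hash"]))
chain.append(make_block({"from": "C", "to": "A", "amount": 1}, chain[-1]["hash"]))
print(chain_is_valid(chain))   # True
chain[1]["data"]["amount"] = 400
print(chain_is_valid(chain))   # False: the retroactive edit is detectable
```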

Conclusion

As with previous industrial revolutions, the 4IR is likely to benefit society in various ways – e.g., by increasing productivity, efficiency, and quality of life. However, it also comes with some downsides and dangers. The use and application of emerging technologies and big data raise various social and ethical issues and challenges. In this regard, one grave concern is that the 4IR could deepen existing gaps and disparities, such as the digital divides, or contribute to new inequalities and inequities altogether. Algorithmic bias and discrimination, privacy infringement, and degradation of human autonomy are additional concerns. Given the dehumanizing effects of the 4IR, some are envisioning the next industrial revolution, i.e., the Fifth Industrial Revolution (5IR), which ideally would facilitate trust in big data and technology by bringing humans back into proper focus (World Economic Forum 2019). In other words, in the 5IR, “humans and machines will dance together,” ensuring that, ultimately, humanity remains in the loop.

Cross-References

▶ Artificial Intelligence
▶ Blockchain
▶ Internet of Things (IoT)

Further Reading

Alkhalifah, A., Ng, A., Kayes, A. S. M., Chowdhury, J., Alazab, M., & Watters, P. (2019). A taxonomy of blockchain threats and vulnerabilities. Preprints.
Atat, R., Liu, L., Wu, J., Li, G., Ye, C., & Yang, Y. (2018). Big data meet cyber-physical systems: A panoramic survey. IEEE Access, 6, 73603–73636.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98–115.
Hurwitz, J., Kaufman, M., Bowles, A., Nugent, A., Kobielus, J. G., & Kowolenko, M. D. (2015). Cognitive computing and big data analytics. Indianapolis: Wiley.
Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
Nelson, P. (2016). Just one autonomous car will use 4,000 GB of data/day. Networkworld, December 7, 2016. https://www.networkworld.com/article/3147892/one-autonomous-car-will-use-4000-gb-of-dataday.html. Accessed 1 Feb 2021.
Ochoa, S. F., Fortino, G., & Di Fatta, G. (2017). Cyber-physical systems, internet of things and big data. Future Generation Computer Systems, 75, 82–84.
Rose, K., Eldridge, S., & Chapin, L. (2015). The internet of things: An overview. The Internet Society (ISOC), 80, 1–50.
Schwab, K. (2015). The Fourth Industrial Revolution: What it means and how to respond. Retrieved from https://www.foreignaffairs.com/articles/2015-12-12/fourth-industrial-revolution
Vajradhar, V. (2019). Benefits of cognitive technology. Medium, November 6, 2019. https://pvvajradhar.medium.com/benefits-of-cognitive-technology-c1bf35e4b103. Accessed 19 Jan 2021.
World Economic Forum. (2019). What the Fifth Industrial Revolution is and why it matters. https://europeansting.com/2019/05/16/what-the-fifth-industrial-revolution-is-and-why-it-matters/. Accessed 18 Jan 2021.
Xu, L. D., & Duan, L. (2019). Big data for cyber physical systems in industry 4.0: A survey. Enterprise Information Systems, 13(2), 148–169.

Fourth Paradigm

Kristin M. Tolle
University of Washington, eScience Institute, Redmond, WA, USA

In 2009, when the book The Fourth Paradigm: Data-Intensive Scientific Discovery (Hey et al. 2009) was published, few people understood the impact on science we are experiencing today as a result of having an overwhelming prevalence of data. To cope with this situation, the book espoused a shift to multidisciplinary research. The new reality of science was that scientists would either need to develop big data computing skills or, more likely, collaborate with computing experts to shorten time to discovery. It also proposed that data, computing, and science, combined, would facilitate a future of fundamental and amazing discovery.

Rather than heralding the obsolescence of scientific methods as some have suggested (Anderson 2007), The Fourth Paradigm espoused that science, big data, and technology, together, were greater than the sum of their parts. One cannot have science without scientists or scientific methodology. This paradigm shift was for science as a whole (well beyond the examples provided in the book). The point was that with the advancement of data generation and capture, science would need additional technological and educational advances to cope with the coming data deluge.

The extent of the fourth paradigm shift is as defined by physicist Thomas Kuhn in his 1962 book, The Structure of Scientific Revolutions (Kuhn 1962): “a fundamental change in the basic concepts and experimental practices of a scientific discipline.” To establish a baseline for the fourth paradigm and beyond, it is important to recognize the three earlier scientific paradigms: observation, modeling, and simulation. The availability of big data is the fourth and, as of this writing, current scientific paradigm shift.

The application of each of these scientific methodologies had profound impacts on science, and all are in use today. The availability of big data does not mean scientists no longer need to collect small data observations. In fact, the paradigm shifts, as they sequentially emerged, were in many ways complementary methodologies to facilitate scientific discovery.

A current example of an ongoing scientific endeavor that employs all four paradigms is NASA’s mission to Jupiter with the Juno spacecraft (NASA 2020). Simulation (Third Paradigm) is critical to scientific data collection when exact replication is not possible regarding the conditions under which these data will be collected. Simulation is valuable to extrapolate how models developed on actual observations can potentially apply to other celestial objects, both intra- and extra-solar. The spacecraft will not only be collecting massive amounts of streaming data (Fourth Paradigm), it will also collect individual direct observations (First Paradigm), such as when it plunges into Jupiter’s atmosphere. Modeling using statistical and machine learning techniques (Second Paradigm) will be needed to extrapolate beyond observations to create knowledge from data.

Various conclusions can be drawn from this example. (1) No paradigm obviates the need for the others. (2) Machine learning and artificial intelligence are comprehensively covered by the second paradigm. (3) It is virtually impossible for one person to be an expert at all four paradigms – thus, the need for collaboration. This last point bears further clarification. Experts in chemistry need not be experts in cloud computing and AI. In fact, their depth of knowledge acts as a method to evaluate all four paradigms. They have the ability to see a flaw in a computer simulation that a modeler often cannot, and though, over time, an applied mathematician might get better at seeing simulation anomalies, there is always a need for someone with extensive knowledge of a scientific discipline as a fail-safe against drawing inaccurate conclusions and potentially to determine that some data collections may be invalid or not applicable.

Much has changed for scientists since The Fourth Paradigm was published in 2009. A crucial change was the recognition and formalization of the field of data science as an educational discipline.

Due to a dearth of professionals needed in this area, the National Academies of Sciences, Engineering, and Medicine in the United States (U.S.) convened a committee and developed a report guiding the development of undergraduate programs for higher education, with the goal of developing a consistent curriculum across undergraduate institutions (National Academies of Sciences, Engineering, and Medicine 2018). Also impacting education is the availability of online education and, today, the risks involved with in-person education in the face of the COVID-19 pandemic (Q&A with Paul Reville 2020).

This is not to say that data science is a “new” field of study. For example, a case can be made that the “father” of modern genetics, Johann Mendel [1822–1884], used data science in his genetics experimentation with peas. Earlier still, Nicolaus Copernicus [1473–1543] used data observations and analysis to determine that the planets of our solar system orbited the sun, not the Earth as espoused by the educators of the day. Their data were “small” and their analysis methodologies simplistic, but their impact on science, driven by data collection, was no less data science than the methods and tools of today.

What is different is that today’s data are so vast that one cannot analyze them using hand calculations as Mendel and Copernicus did. New tools, like cloud computing and deep learning, are applied to Fourth Paradigm-sized data streaming in from everywhere, and every day new sources and channels are opening and more and more data are being collected. An example is the recent launch by the US National Aeronautics and Space Administration (NASA) of the Sentinel-6 “Dog Kennel” satellite used to map the Earth’s oceans (Sentinel-6 Mission, Jet Propulsion Laboratory 2020). As a public institution, NASA makes the data freely available to researchers and the public. It is the availability of these types of data and the rapidity with which they can be accessed that are driving the current scientific revolution.

What are potential candidates for a next paradigm shift? Two related issues immediately come to mind: quantum computing and augmented human cognition. Both require significant engineering advances and are theoretically possible. Quantum computing (QC) is a revolution occurring in computing (The National Academies of Sciences, Engineering, and Medicine 2019). Like its predecessors, supercomputers and cloud computing, QC has the potential to have a huge impact on data science because it will enable scientists to solve problems and build models that are not feasible with conventional computing today. Though QC is not a paradigm shift as such, it would be a tool, like supercomputing before it, that could enable future paradigm shifts and give rise to new ways by which scientists conduct science.

Like earlier paradigm shifts, binary-based, conventional computing (CC) will have a place in big data analysis, particularly regarding interfaces to increasingly powerful and portable sensors. Many scientific analyses do not need the excessive computational capacity that qubit-based QC can provide, and it also is likely that CC will always be more cost-effective to use than QC. The logistical problem of reducing the results of a QC process to one that can be conventionally accessed, stored, and used is just one of the many barriers to using QC today.

Qubits, as opposed to the 1’s and 0’s of binary, can store and compute over exponentially larger problem spaces than binary systems (The National Academies of Sciences, Engineering, and Medicine 2019), enabling the building of models that are challenging to perform today. For example, the ability to compute all potential biological nitrogen fixation in nitrogenase is especially important in that Earth is facing a food security problem as the population increases and arable land decreases (Reiher et al. 2017). Nitrogen fixation is a trivial problem for QC analysis, and solving it would enable the creation of custom fertilizers for specific food crops that are more efficient and less toxic to the environment. Solving this problem could help end hunger, one of the United Nations Sustainable Development Goals (United Nations 2015).

Another example of a problem that could be addressed by QC would be to monitor and flag hate speech in all public postings across social media sites and enable states to apply legal action to those who commit such acts within their borders.

This problem is beyond CC today and will likely remain so. Yet, many organizations, including the United Nations (United Nations 2020) and Amnesty International (Amnesty International 2012), have missions to protect humanity against hate speech and its ramifications. This remains a challenge even for the social media companies themselves, who initially rely on participants to flag offensive content and then train systems to identify additional cases; this method is fraught with challenges (Crawford and Gillespie 2014), the sovereignty, privacy, volume, and changing nature of these data being only a few.

Augmented human cognition (AHC), or human cognitive enhancement (HCE), in addition to QC, is another potential technological capability that will alter the conduct of science. HCE refers to the development of human capabilities to acquire and generate knowledge and understanding through technological enhancement (Cinel et al. 2019). An example of this would be a scientist who no longer needs an external device to perceive light in spectrums that are currently not visible to the human eye, such as ultraviolet. Such a capability would require a scientist to undergo various augmentations, including but not limited to performance-enhancing drugs, prosthetics, medical implants, and direct human-computer linkages.

HCE is possible because the human brain is estimated to have more processing power than today’s fastest supercomputers (Kurzweil 1999) and also will likely have more than near-term QC, although there are researchers investigating the quantum processing “speed limit” (Jordan 2017). The bigger challenge is the neurology – understanding how humans function to a sufficient degree to allow the engineering of working human-computer interfaces (Reeves et al. 2007). Overcoming such challenges, assuming they are not legally or politically prevented, could result in significant scientific changes and development.

There are likely to be several other candidates beyond HCE and QC with the potential to create a scientific paradigm shift. What these two examples illustrate, by looking beyond current abilities to leverage scientific discovery today, is that it is not known what the next paradigm shift in science will be or in what ways it will shift. Revolutionary changes are only conceivable as they emerge or in retrospect.

Further Reading

Amnesty International. (2012). Written contribution to the thematic discussion on racist hate speech and freedom of opinion and expression organized by the United Nations Committee on elimination of racial discrimination, August 28, 2012. https://www.amnesty.org/download/Documents/24000/ior420022012en.pdf. Accessed 9 Dec 2020.
Anderson, C. (2007). The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, 16:07. http://www.wired.com/science/discoveries/magazine/16-07/pb_theory. Accessed 15 Dec 2020.
Cinel, C., Valeriani, D., & Poli, R. (2019). Neurotechnologies for human cognitive augmentation: Current state of the art and future prospects. Frontiers in Human Neuroscience, 13. https://doi.org/10.3389/fnhum.2019.00013.
Crawford, K., & Gillespie, T. (2014). What is a flag for? Social media reporting tools and the vocabulary of complaint. New Media & Society, 18. https://doi.org/10.1177/1461444814543163.
Hey, A., Tansley, D., & Tolle, K. (2009). The fourth paradigm: Data driven scientific discovery. Redmond: Microsoft Research.
Jordan, S. P. (2017). Fast quantum computation at arbitrarily low energy. Physical Review A, 95, 032305. https://doi.org/10.1103/PhysRevA.95.032305.
Kuhn, T. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Kurzweil, R. (1999). The age of spiritual machines: When computers exceed human intelligence. New York: Viking.
NASA. (2020). Juno spacecraft and instruments, NASA website. https://www.nasa.gov/mission_pages/juno/spacecraft. Accessed 8 Dec 2020.
National Academies of Sciences, Engineering, and Medicine. (2018). Envisioning the data science discipline: The undergraduate perspective. Washington, DC: The National Academies Press.
Q&A with Paul Reville. (2020). The pandemic’s impact on education. Harvard Gazette. https://news.harvard.edu/gazette/story/2020/04/the-pandemics-impact-on-education/. Accessed 15 Dec 2020.
Reeves, L. M., Schmorrow, D. D., & Stanney, K. M. (2007). Augmented cognition and cognitive state assessment technology – Near-term, mid-term, and long-term research objectives (Lecture notes in computer science). In D. D. Schmorrow (Ed.) (pp. 220–228). Berlin: Springer.
Reiher, M., Wiebe, N., Svore, K., Wecker, D., & Troyer, M. (2017). Reaction mechanisms on quantum computers. Proceedings of the National Academy of Sciences, 114(29), 7555–7560.

Sentinel-6 Mission, Jet Propulsion Laboratory. (2020). The enthusiasm for big data in the French govern-
Sentinel-6: ‘Dog kennel’ satellite blasts off on ocean ment resulted in the establishment of the Ministry
mission – BBC News, JPL Website. https://www.jpl.
nasa.gov/missions/sentinel-6/. Accessed 15 Dec 2020. for Digital Affairs in 2014, under the leadership of
France

Sorin Nastasia and Diana Nastasia
Department of Applied Communication Studies, Southern Illinois University Edwardsville, Edwardsville, IL, USA

Introduction

In recent years, big data has become increasingly important in France, with impact on areas including government operations, economic growth, political campaigning, and legal procedures, as well as on fields such as agriculture, industry, commerce, health, education, and culture. Yet, the broad integration of big data into the country's life has not remained without criticism from those who worry about individual privacy protections and data manipulation practices.

The State of Big Data in France

When François Hollande became the president in 2012, he asserted that big data was a key element of the national strategy aimed at fostering innovation and increasing the competitiveness of the country in the European and the global contexts. The Ministry for Digital Affairs was headed by the notable French socialist party figure and French tech movement member Axelle Lemaire, who reported directly to the Minister of the Economy and Industry at the time, Emmanuel Macron. This ministry was instrumental in devising the Digital Republic Act, a piece of legislation adopted by the National Assembly in 2016, following a widely publicized consultation process. The law introduced provisions to regulate the digital economy, including in regards to open data, accessibility, and protections.

The Digital Republic Act had the goal of providing general guidelines for a big data policy to serve as the basis for further sectorial policies. This landmark piece of legislation made France the world's first nation to mandate local and central government to automatically publish documents and public data. According to this law, all data considered of public interest, including information derived from public agencies and private enterprises receiving support from public funds, should be accessible to the citizenry for free. In a 2015 interview, Axelle Lemaire highlighted the importance of open data as a mine of information for creative ideas to be generated by startups as well as established organizations. She stated: "Both innovation and economic growth are strongly driven by open data from either public or private sources. Open data facilitate the democratic process and injects vitality into society" (Goin and Nguyen 2015). The law also imposed administrative fines of up to 4% of an organization's total worldwide annual turnover for data protection violations.

The legislation was only one part of the project of streamlining the country's core data infrastructure. One component of the data infrastructure is data.gouv.fr, the government portal hosting over 40,000 public datasets from nearly 2000 different organizations. Another component, launched in 2017, is SIRENE, an open register listing legal and economic information about all the companies in France.
In 2017, the government of France also launched the Health Data Hub to promote open data and data sharing for supporting medical and health-care functions, including clinical decision-making, disease surveillance, and population health management. As part of this project, technology specialists are recruited into government, initially on short-term contracts on agile projects followed by full-time employment when the projects mature, at salaries that rival those they could receive in the private sector. However, critics contend that the digitalization of administrative documents and procedures remains unequal among different ministries, and some are largely falling behind (Babinet 2020).

The change of the presidency of France to Emmanuel Macron in 2017 did not diminish the critical significance of big data for the national strategy. In December 2017, the Minister for Europe and Foreign Affairs Jean-Yves le Drian unveiled France's international digital strategy, aimed at serving as a framework for big data use as well as a roadmap for big data practices in regards to government, the economy, and security in international settings. The document highlighted the country's commitment to advocating the inclusion of states, the private sector, and civil societies in the digital sphere, spreading digital commons for software, content, and open data, evaluating the impact of algorithms and encouraging their transparency and fairness, educating citizens about the encryption of communications and its implications, fighting the creation and spreading of misinformation, and ensuring the full effect of international law in cyberspace. The document also expressed support for the concepts of privacy by design and security by design in the way tech products are conceived and disseminated.

Moreover, pundits have claimed that big data helped secure the astounding victory of Emmanuel Macron in his presidential bid in 2017, followed by the triumph of his movement turned into a new political party, La République en marche!, in the French parliamentary elections and local elections held the same year. The campaigns of Macron and La République en marche! were supported by data-driven canvassing techniques implemented by LMP, an organization established by Harvard graduates Guillaume Liegey and Vincent Pons and MIT graduate Arthur Muller, who met while volunteering for the Obama campaign in 2008. The three founders of LMP sought to tap into advanced data analytics to reinvent the old-fashioned technique of door-to-door canvassing, devising an algorithm helping French politicians to connect directly to voters. The software package they created, Fifty Plus One, flags up the specific characteristics of political territories that need to be targeted. "We were all fascinated by how political parties use data to create mindsets," Liegey stated in an interview (Halliburton 2017).

As France has been developing and testing various approaches based on big data to domestic and international issues as well as to social and political issues, a more skeptical view in regards to the analytics aspects of big data has emerged too. In 2016, when France's government announced the creation of a new database to collect and store personal information on nearly everyone living in the country and holding a French identity card or passport, there was immediate outrage in the media. The controversial database, called Secure Electronic Documents, was aimed at cracking down on identity theft, but the government's selection of a centralized architecture to store everyone's biometric details raised huge concerns in regards to both the security of the data and the possibilities for misuse of the information. Despite the outcry and the concerns expressed publicly, the database was launched and is still in operation.

Another area of concern has been judicial analytics. The new article 33 of the Justice Reform Act adopted in 2019 makes it illegal to engage in the use of statistics and machine learning to understand or predict judicial behavior. The law states: "No personally identifiable data concerning judges or court clerks may be subject to any reuse with the purpose or result of evaluating, analyzing, or predicting their actual or supposed professional practices." The article applies to individuals as well as technology companies and establishes sanctions for prejudices pertaining to data processing. The law is supported by civil society groups and legislators believing that the increase in public access to data, even with some privacy protections in place, may result in unfair and discriminatory data manipulation and interpretation.
However, it has also had its opponents. "This new law is a complete shame for our democracy," stated Louis Larret-Chahine, the general manager and cofounder of Prédictice, a French legal analytics company (Livermore and Rockmore 2019).

Both governmental and nongovernmental entities in France have demanded the monitoring of and, when needed, action against companies using consumer data for the personalization of content, the behavioral analysis of users, and the targeting of ads. In 2019, France's data protection authority, whose sanctioning powers were reinforced through the Digital Republic Act, fined Google 50 million Euros for its intrusive ad personalization systems and its inadequate systems of notice and consent when users create accounts for Google services on Android devices. The data protection authority found that Google violated its duties by obfuscating essential information about data processing purposes, data storage periods, and categories of personal information used for ad personalization. The decision implied that behavioral analysis of user data for the personalization of advertising is not necessary to deliver mail or video hosting services to users.

Conclusion

In France, big data policy was considered by the Hollande administration and has continued to be considered by the Macron administration as a potential direct economic growth driver and an opportunity to establish the nation as a pioneer in regards to digital processes in global settings. France has become broadly thought of as one of the best in the world for open data, while the European data portal proclaimed it one of the trendsetters in European data policy. While the government of France has succeeded in building a strong belief that data openness and data processing for the benefit of people is key to leading government and business digital transformation, concerns remain in regards to privacy as well as to uses of data by such organizations as courts and businesses.

Further Reading

Babinet, G. (2020). The French government on digital – Midterm evaluation. Institut Montaigne. https://www.institutmontaigne.org/en/blog/french-government-digital-mid-term-evaluation
Goin, M., & Nguyen, L. T. (2015). A big bang in the French big data policy. https://globalstatement2015.wordpress.com/2015/10/30/a-big-bang-in-the-french-big-data-policy/
Halliburton, R. (2017). How big data helped secure Emmanuel Macron's astounding victory. Prospect Magazine. https://www.prospectmagazine.co.uk/politics/the-data-team-behind-macrons-astounding-victory
Livermore, M., & Rockmore, D. (2019). France kicks data scientists out of its courts. Slate. https://slate.com/technology/2019/06/france-has-banned-judicial-analytics-to-analyze-the-courts.html
Toner, A. (2019). French data protection authority takes on Google. Electronic Frontier Foundation. https://www.eff.org/deeplinks/2019/02/french-data-protection-authority-takes-google
Gender and Sexuality

Kim Lorber1 and Adele Weiner2
1Social Work Convening Group, Ramapo College of New Jersey, Mahwah, NJ, USA
2Audrey Cohen School For Human Services and Education, Metropolitan College of New York, New York, NY, USA

Introduction

Gender refers to the ways one self-defines as male, female, or elsewhere within this spectrum, based on social constructs separate from biological characteristics. Sexuality is one's expression of one's sexual identity. Individuals can be heterosexual, homosexual, or otherwise along a fluid spectrum that can change throughout one's life. In some societies one's gender or sexuality offers different social, economic, and political opportunities or biases.

Big data is a useful tool in understanding individuals in society and their needs. However, the multitude of available resources can provide a transparency to one's private life based on technology. Regardless of whether the information is gathered by retailers or the federal government, profiles can include much private information, ranging from television preferences to shopping profiles, all of which combined can reveal very personal information related to one's gender identity and sexuality.

A variety of resources, including social network profiles and shared photos, can provide group affiliations, favorite films, celebrities, friendships, causes, and other personal information. Facebook privacy concerns arise regularly. While intended for friends to see, data miners can easily draw conclusions about an individual's gender and sexuality. Short of living off the grid without traceable billing, banking, and finances, some residue of much of what we do in our lives is collected, somewhere. This entry proposes to demonstrate how current and future research will be based on very private elements of our lives in regard to attitudes toward and experiences of gender and sexuality.

Big Data on Gender and Sexuality

The immediacy and interconnectedness of big data become obvious when one finds an item looked at on Amazon.com, for example, as a regular advertisement on almost every accessed webpage with advertising. Big data calculations create recommendations based on likelihoods about one's personal lifestyle, gender, sexual orientation, entertainment, and retail habits. Emails abound with calculated recommendations. This can easily feel like an invasion of privacy. Is there a way to exclude oneself, and how can such personal information challenge one's life? Who has access to it?
Gender and sexuality are very personal to each individual. Most people do not blatantly discuss such private elements of their lives. Ideally, a utopian society would allow anyone to be whoever they are without judgment; our society is not this way, and implications from disclosure of certain types of information can result in bias at social, professional, medical, and other levels. Such biases may not be legal or relevant and can lead to detrimental results.

Friends, strangers, or casual acquaintances may know more about one's personal gender and sexuality details than spouses, other family members, and best friends. Can personal information be disclosed by personalized web advertising viewed by another while one innocently concentrates on elements of interest, not even noticing the other uninvited clues about their life?

The Internet had been a forum for individuals to anonymously seek information and connections, explore friendships and relationships, and live and imagine beyond their developed social identities. This has been particularly true for individuals seeking information about gender and sexuality, in a society where gender identity is presumed as fixed and discussions of sex may be taboo. Information now readily extrapolated by big data crunchers can make an environment unsafe and no longer anonymous, open for anyone to find many of one's most personal thoughts and feelings. For example, how much information can be found about someone's gender identity or sexual orientation via access to their online groups joined or "liked" on Facebook, followed on Twitter, etc., combined perhaps with retail purchases, consultations, or memberships, and social venues, which can strongly suggest another layer to an individual's calculated and assumed private identity?

Similarly, virtual friendships and relationships, groups, and information sites can be calculated into gender and sexual identity conclusions. If a man married to a woman purchases books, borrows similar titles from a library, joins an online group, and makes virtual friends who may be questioning their heterosexuality, a trail has been created. Perhaps uncomfortable exploring these concerns privately, advertisements on websites visited may now offer other resources which can be jarring, and there is no easy way to make them disappear.

Big Data and Gender

Big data analysis can also provide answers to long overlooked biases and inequities. Melinda Gates (2013) discussed eight Millennium Development Goals (MDGs), adopted by the UN in 2000 to serve as a charter for the field of development. Gender is not one of the big data items specifically explored beyond eliminating gender differences in education. Ideally, new priorities post-2015 will address gender-based areas such as violence, property rights for women, and the issue of child marriage. Increasingly precise goals addressing women and agriculture, for example, would show the strengths and weaknesses by country, presenting big data conclusions globally. In sub-Saharan Africa, women do the majority of farm work. However, agricultural programs are designed for a minority of male farmers, possibly because government trainers either prefer working with men or are not allowed to train women, resulting in female farmers being significantly less productive. Big data highlighting the disparity between the genders in such programs will be used to make programs more equitable.

Big Data and Sexual Identity

Google, a seemingly limitless and global information resource, allows for endless searching. Using the search autofill feature, some tests were done using key words. The top autofill responses to "homosexuality should" resulted in: "homosexuality should be banned," "homosexuality should be illegal," "homosexuality should be accepted," and "homosexuality should be punishable by death." "Gay men can't" "give blood" was a top autofill suggestion. "Bisexuality is" was autofilled with "bisexuality is not real," "bisexuality is real," and "bisexuality isn't real."
Many of these autofill responses, based on past searches, reflect biases and stereotypes related to gender and sexual orientation.

Kosinski et al. (2013) studied Facebook likes and demographic profiles of 58,000 volunteers. Using logistic/linear regressions, the model was able to identify homosexuals versus heterosexuals (88% accuracy).
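The general form of such an analysis can be illustrated with a minimal, hypothetical sketch in Python using scikit-learn: a logistic regression is trained on a binary matrix of "likes" to predict a binary attribute. The data below are synthetic, and the sizes and resulting accuracy are placeholders rather than figures from Kosinski et al. (2013); the sketch only shows the shape of the technique, not the original study's code or data.

```python
# Illustrative sketch only: predicting a binary attribute from "likes"
# with logistic regression, in the spirit of Kosinski et al. (2013).
# All data here are synthetic; no real user information is involved.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_users, n_likes = 2000, 300

# Each row is one hypothetical user; each column is 1 if the user
# "liked" a given page and 0 otherwise.
X = rng.binomial(1, 0.05, size=(n_users, n_likes))

# A hidden binary attribute that happens to correlate with a subset of
# likes, standing in for any private trait a model might try to infer.
weights = np.zeros(n_likes)
weights[:20] = 1.5
logits = X @ weights - 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The point of the sketch is that seemingly innocuous behavioral traces, once aggregated, can be turned into a classifier for a private attribute; the privacy concerns discussed below follow directly from that possibility.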
Conclusion

While great outcomes are possible from big data, as in the case of identifying discrimination based on gender or sexual orientation, there is also the risk of culling deeply personal information, which can easily violate gender and sexuality privacy. Like any other personal information, once it is made public, it cannot be taken back. Big data allows the curious, perhaps absent negative intentions, to know what we, as citizens of the world, wish to share on our own terms. The implications are dramatic and, in the case of gender and sexual identity, can transform individuals' lives in positive or, more likely, negative ways.

Cross-References

▶ Data Mining
▶ Data Profiling
▶ Privacy
▶ Profiling

Further Reading

Foremski, T. (2013). Facebook 'Likes' can reveal your sexuality, ethnicity, politics, and your parent's divorce. Retrieved on 28 Aug 2014 from www.zdnet.com/facebook-likes-can-reveal-your-sexuality-ethnicity-politics-and-your-parents-divorce-7000013295/.
Gates, M. (2013). Bridging the gender gap. Retrieved on 28 Aug 2014 from http://www.foreignpolicy.com/articles/2013/07/17/bridging_the_gender_gap_women_data_development.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 110(15), 5802–5805.

Genealogy

Marcienne Martin
Laboratoire ORACLE [Observatoire Réunionnais des Arts, des Civilisations et des Littératures dans leur Environnement], Université de la Réunion, Saint-Denis, France, Montpellier, France

Articulated around the "same" and the "different," the living world participates in strange phenomena: architect genes, stem cells, and varied species whose specificity is, however, derived from common trunks. Each unit forming the said universe is declined, simultaneously, into a similar object and a completely different object (Martin 2017). Through different genetic researches, geneticists have created new technologies such as CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats; https://www.aati-us.com/applications/crispr-cas9/), which allow the decoding of the transformation of a stem cell into a specific cell and the evolution of the latter. These new technologies correspond to a database and, indirectly, to big data.

The lexical-semantic analysis of the term "genealogy" refers, firstly, to the study of the transmission chains of genes; secondly, this lexical unit also refers to anthroponomy, which is the mode of symbolic transmission of the genetic structure through nomination in human societies. Moreover, beyond the symbolic transmission of genetic codes received through nomination, civil society is based on the transmission of property to descendants or to collaterals. Finally, genealogy also points to different beliefs in relation to the world of death.

The basis of genetics was founded by the discoveries made, in particular, by Johann Gregor Mendel (1822–1884), who studied the transmission of genetic characteristics with respect to plants (peas). These results formed the basis of what is now known as Mendel's laws, which define how genes are transmitted from generation to generation.
Furthermore, Darwin showed, in his analysis of the evolution of species in the living world, how a variety of life may be part of a species, based on differences and similarities as well as on adaptation to a specific environment. These scientific approaches have opened new fields of research in relation to the human world. In his entomological study, Jaisson (1993) states that, in relation to bees, some genetic markers are used to supply different behavioral predispositions among workers, such as reflex conditioning, conditional or acquired, according to Ivan Petrovich Pavlov. According to Dawkins (1990), who is a neo-Darwinian ethologist, evolution occurs step by step, through the survival of certain genes and the elimination of others in the gene pool. These studies raise the question of whether certain phenomena in human beings are innate or acquired.

Anthroponomy is the identity marker for the transmission of such genetic codes in a given human group. Another function of anthroponomy is to find a name for a new object in the world in analogy with an existing object integrated in the human paradigm. As Brunet et al. (2001–2002) pointed out, every individual carries a name that refers to his or her community but is also a reference to his or her cultural practices; the name will not be the same and will not be transmitted in the same mode, depending on the different human groups. Furthermore, Martin (2009) states that "the other," considered in its otherness alone, can also integrate divine paradigms. Louis XIV, who was named the Sun King, had, beside his status of a monarch, also divine right.

In relation to anthroponomy, identity is linked to one's descent. Indeed, if one has to name an object in the world, one also has to give it a sense, and to identify an individual means to recognize him or her, but also to place him or her in the group they belong to. The first group an individual belongs to is that of his or her gender, which is administratively materialized when registered as one's civil status after one's birth. In French society, in addition to gender, anthroponyms (full name, surname), date and place of birth as well as the identity of the parents are registered (Martin 2012).

From one culture to another, anthroponomy may take various turns, sometimes connected to practical considerations, such as in groups that do not name the newborn until he or she reaches an age where his or her survival is most likely. In relation to the nomination of newborn children, Mead (1963), an anthropologist who studied different ethnic groups, including the Arapesh, specifies that among the Arapesh, as soon as a newborn smiles when looking at his or her father, he or she will be given a name, in particular a name of a member of the father's clan. The creation that presides over the implementation of the nomen and of its transmission is often articulated around the genealogical chain. Thus, there are systems called "rope," which are links which connect either a man, his daughter, and his daughter's sons or a woman, her son, and his son's daughters (Mead, 1963). Ghasarian (1996), in a study on kinship, mentions that the name is not automatically given at birth. Similarly, Emperaire (1955), a researcher at the Musée de l'Homme in Paris, gives the example of the Alakalufs, an ethnicity living in Terre de Feu, who do not give any names to newborns; it is only when children begin to talk and walk that the father chooses one.

Anthroponomy may also point to one's family history. For example, Levi-Strauss (1962) gives the example of the Penans, a tribe of nomads living in Borneo, where at the birth of their first child the father and the mother adopt a "tecknonym" expressing their relationship to their specifically designated child, such as Tama Awing, Tinen Awing: father's (or mother's) Awing. Levi-Strauss also states that Western Penans have no less than 26 different necronyms corresponding to the degree of relationship, according to the deceased's age, the gender, and also to the birth order of children up to the ninth.

The story of the life of a newborn begins to be written at his or her identification; thus, in the Vietnamese culture, the first emblematic name given to the Vietnamese child is used for his or her private family use. This is not always a beautiful name, but a substantivized adjective that emerges according to the event or to the family experience when the baby is born. There was also a denomination created to disabuse an evil genius who could make the child more fragile when hearing his or her beautiful name.
The name given to the child usually has a meaning, a moral quality, or it is the name of an element of nature whose literature makes it a symbol. In Vietnam, which is a tropical country, the name tuyet means snow, and it is given in reference to its whiteness as a symbol of purity and of sharpness, as the literature inspired by the Chinese culture reports (see Luong, 2003). In the Judeo-Christian culture, hagiography refers to the exemplary life of a person considered in the context of this religious ideology. Furthermore, Héritier (2008), referring to the different kinship systems, specifies that this simply gives us indications on the fact that all human populations have thought about the same biological data, but they have not thought about it in the same way. Thus, in patriarchal societies, the group, or the individual, has the status of the dominant male, while in matriarchal societies it is women who integrate a dominant role.

Anthroponomy in the context of genealogy is a procedure that contributes to the creation of the identity of the individual. All of these modes allow the social subject to distance themselves from the group and to become an individual whose status is more or less affirmed according to a more or less meaningful group structure. And it is based on these procedures that identity takes its whole significance. The nominal identity incorporates de facto the individual in his or her genealogical membership group, irrespective of whether the latter is real or based on adoption. The process, inferring the identity structure of a social subject, seems to be initiated by the nomination act. This identity also reflects the place occupied by the family in the social group. The sets and the alliances of family are the subject of limited arrangements, as analyzed by Héritier (2008). The reason is that the rules we follow locally to find a partner are adapted to kinship systems and group classifications, as to kinship and alliance; they can be subtly altered by the games of power, witchcraft, and economy. Diamond (1992) showed that we share over 98% of our genetic program with primates (the pygmy chimpanzee of Zaire and the common chimpanzee from Africa). This has as a consequence the appearance of certain types of similar behaviors, as reflected by the establishment of dominant groups with alpha males, like that of nobility, especially royalty.

Nobility refers to a highly codified patronymic system referring to the notion of belonging. According to Brunet and Bideau (2001–2002), when identifying particular groups within society, such names generally include several elements: a surname (often prior to the ennoblement) and one or more land names and titles (in Le patronyme, histoire, anthropologie et société, 2001–2002). If we refer to the book by Debax (2003), who in her study focuses on the feudal system in the eleventh and the twelfth centuries in the Languedoc (region of France), it is shown that the relationship between lords and vassals is articulated around the concept of fiefdom, whose holding always involves at least two individuals, one holding the other, which results in the importance of the personal relationship when the specification is a fiefdom anthroponym. Hence, the term "fief" was used by Bernard d'Anduze, Bermond Pelet, and Guilhem de Montpellier. The particle "de" is a preposition which in this case marks origin, as with "calissons," sweets made in Aix-en-Provence (France). As part of this type of anthroponomy, we have several layers of onomastic formation whose interpenetration emphasizes different levels of hierarchical organization. In France, these names are often composed of a first name, called by those concerned the "surname," and a second, or even a third, name, called "land names," connected by a preposition (e.g., Bonnin de la Bonninière de Beaumont); these names usually place their holders in a category of nobility, which has been legally abolished for more than two centuries; nevertheless it has remained important in French society (Barthelemy, in Le patronyme, histoire, anthropologie et société, 2001–2002).

Death is part of an underlying theme in life through the use of some anthroponomical forms such as the necronyms cited by Lévi-Strauss (1962); this mode of nomination expresses the family relationship of a deceased relative to the subject.
The postmortem identification of the social subject is associated with the memorial inscription that is found, in particular, on the plates of memorials and of gravestones. This discursive modality, seeking to maintain, in a symbolic form, the past reality of the individual, also finds its realization in expressions like Mrs. X (Y widow) wife . . . late Mr. X., the remains of Mr. Z, or it can take reality in the name given to a newborn which has been given by a member of a group of deceased. The cult of ancestors is similar to the ceremonial practices addressed collectively to the ancestors belonging to a same lineage, often done on altars; this is another way to keep ascendants of the genealogical relationship in the memory of their descendants. Finally, the construction of a genealogical tree, which is very popular in Western societies, refers to a cult of symbolic ancestors.

Further Reading

Brunet, G., Darlu, P., & Zei, G. (2001–2002). Le patronyme, histoire, anthropologie et société. Paris: CNRS.
Darwin, C. (1973). L'origine des espèces. Verviers: Marabout Université.
Dawkins, R. (1990). Le gène égoïste. Paris: Odile Jacob.
Debax, H. (2003). La féodalité languedocienne XIe – XIIe siècle – Serments, hommages et fiefs dans le Languedoc de Trencavel. Toulouse-Le Mirail: Presses Universitaires du Mirail.
Diamond, J. (1992). Le troisième singe – Essai sur l'évolution et l'avenir de l'animal humain. Paris: Gallimard.
Emperaire, J. (1955). Les nomades de la mer. Paris: Gallimard.
Ghasarian, C. (1996). Introduction à l'étude de la parenté. Paris: Éditions du Seuil.
Héritier, F. (2008). L'identique et le différent. Paris: Diffusion Seuil.
Jaisson, P. (1993). La fourmi et le sociobiologiste. Paris: Odile Jacob.
Lévi-Strauss, C. (1962). La pensée sauvage. Paris: Plon.
Luong, C. L. (2003). L'accueil du nourrisson: la modernité de quelques rites vietnamiens. L'Information Psychiatrique, 79, 659–662. http://www.jle.com/fr/revues/medecine/ipe/e-docs/00/03/FC/08/article.md.
Martin, M. (2009). Des humains quasi objets et des objets quasi humains. Paris: Éditions L'Harmattan.
Martin, M. (2012). Se nommer pour exister – L'exemple du pseudonyme sur Internet. Paris: Éditions L'Harmattan.
Martin, M. (2017). The pariah in contemporary society. A black sheep or a prodigal child? Newcastle upon Tyne: Cambridge Scholars Publishing.
Mead, M. (1963). Mœurs et sexualité en Océanie – Sex and temperament in three primitive societies. Paris: Plon.

Geographic Information

▶ Spatial Data

Geography

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

Geography as a discipline is concerned with developing a greater understanding of processes that take place across the planet. While many geographers agree that big data presents opportunities to glean insights into our social and spatial world, and the processes that take place within it, many are also cautious about how it is used and the impact it may have on how these worlds are analyzed and understood. Given that big data are often either explicitly or implicitly spatially or temporally referenced, this makes them particularly interesting for geographers. Geography, then, becomes part of the big data phenomenon.

As a term that has only relatively recently become commonly used, definitions of big data still vary. Rob Kitchin suggests there are in fact seven characteristics of big data, extending beyond the three Vs proffered by Doug Laney (volume, velocity, and variety) which are widely cited:

1. Volume: often terabytes and sometimes petabytes of information are being produced.
2. Velocity: often a continuous flow created in near real time.
3. Variety: composed of both structured and unstructured forms.
4. Exhaustivity: striving to capture entire populations.
5. Fine grained: aiming to provide detail.
6. Relational: with common fields so data sets can be conjoined.
7. Flexible: so new fields can be added as required, the data can be extended and where necessary exhibit scalability.

This data is produced largely through three forms: directed, generated largely by digital forms of surveillance; automated, generated by inherent automatic functions of digital devices; and volunteered, provided by users, for example, via interactions on social media or crowdsourcing activities.

The prevalence of spatial data has grown massively in recent years, with the advent of real-time remote sensing and radar imagery, crowdsourcing map platforms such as OpenStreetMap, and digital trails created by ubiquitous mobile devices. This has meant there is a wealth of data to be analyzed about human behavior, in ways not previously possible.

Large data sets are not a new concept for geography. However, even some of the most widely used large data sets in geography, such as the census, do not constitute big data. While they are large in volume, seek to be exhaustive, and are high in resolution, they are very slow to be generated and have little flexibility. The type of big data now being produced is well exemplified by companies such as Facebook, which in 2012 processed over 2.5 billion pieces of content, 2.7 billion "likes," and 300 million photo uploads every day, or Walmart, which generated over 2.5 petabytes of data every hour in 2012. One of the key issues for using big data is that collecting, storing, and analyzing these kinds of data is very different from doing so for traditionally large data sets such as the census. These new forms of data creation are creating new questions about how the world operates, but also about how we analyze and use such data forms.

Governments are increasingly turning to big data sources to consider a variety of issues, for example, public transport. A frequently cited example of the production of big data related to public systems is the use of the Oyster card in London. Michael Batty discusses the example of public transport in London (tube, heavy rail, and buses) to consider some of the issues with big data sets. With around 45 million journeys every week, or around a billion every year, the data is endless. He acknowledges that big data is enriching our knowledge of how cities function, particularly with respect to how people move around them. However, it can be questioned how much this data can actually tell us. Around 85% of all travelers using public transport in London on these forms of transport use the Oyster card, and so clearly there is an issue about the representativeness of the data. Those that do not use the card, tourists, occasional users, and other specialist groups will not be represented. Furthermore, because we cannot actually trace where an individual begins and ends their journey, it only presents a partial view of the travel geographies of those in London. Nevertheless this data set is hugely important for the operation of transport systems in the city.

Disaster response using big data has also received significant media attention in recent years: the crisis mapping community after the 2010 Haiti earthquake, or collecting tweets in response to disaster events such as Hurricane Sandy. This has led to many governments and NGOs promoting the use of social media as potentially useful data sources for responding to disasters. While geo-referenced social media provides one lens on the impact of disaster events, it should not be relied on as a representative form of data covering all populations involved. Big data in these scenarios presents a particular view of the world based on the data creators and essentially can mask the complexity and multiplicity of scenarios that actually exist. Taylor Shelton, Ate Poorthuis, Mark Graham, and Matthew Zook explore the use of Twitter around the time of Hurricane Sandy, and they acknowledge that their research did not present any new insights into the geography of Twitter, but that it did show how subsets of big data could be used for particular forms of spatial analysis.

Trevor Barnes argues that criticisms of the quantitative revolution in geography are also applicable to the use of big data.
First, that a focus on the computational techniques and data collected can become disconnected from what is important, i.e., the social phenomena being researched. Second, that it may create an environment where quantitative information is deemed superior and where phenomena that cannot be counted will not be included. Third, that numbers do not speak for themselves – numbers created in data sets (of any size) emerge as a product of particular social constructions, even where they are automatically collected by technological devices.

The growth of big data as part of the data revolution presents a number of challenges for geographers. There has been much hype and speculation over the adoption of big data into societies, changing the ways that businesses operate, the way that governments manage places, and the way that organizations manage their operations. For some, the benefits are overstated. While it may be assumed that, because much technology contains GPS, the use of big data sets is a natural extension of the work of geographic information scientists, it should be noted that the emergence of such data sets created by mobile technology has created a large new amount of data, but also data on which geographic information scientists have not typically focused their efforts. Therefore work is needed to develop sound methodological frameworks to work with such data sets.

The sheer size of the data sets that are being created, sometimes with millions or billions of observations being created in a variety of forms on a daily basis, is a clear challenge. Traditional statistical analysis methods used in geography are designed to work using smaller data sets with much more known about the properties of the data being analyzed. Therefore new methodological techniques for data handling and analysis are required to be able to extract useful information for geographical research.

Data without a doubt are a key resource for the modern world; however it is important to remember that data do not exist independently of the systems (and people in them) from which they are produced. Big data sets have their own geographies; they are themselves social constructions formed from variegated socioeconomic contexts and therefore will present a vision of the world that is uneven in its representation of populations and their behavior. Big data, despite attempts to make it exhaustive, will always be spatially uneven and biased. Data will always be produced by systems that have been created with influences from different contexts and from groups of people with different interests.

Sandra González-Bailón highlights that technology has allowed geospatial data to be generated much more quickly than in the past, and, if mobilized in an efficient manner, people can use these technologies as a network of sensors. However, the use of digital tools can produce distorted maps or results if the inputs to the system are systematically biased, i.e., those who do not have access to the tools will not be represented. Therefore there are questions about how to extract valid information from the ever-growing data deluge. Then there are issues around privacy and confidentiality of the data produced and how it will be used potentially in both the public and private sectors.

Michael Goodchild highlights that while a lot of big data is geo-referenced and can contribute to a growing understanding of particular locations, there are issues about the quality of the data that is produced. Big data sets are often comprised of disparate data sources which do not always have quality controls or do not have metadata about the provenance of the data. This raises questions about the extent to which such data can be trusted, or used, to make valid conclusions. There is a need for geographers to explore how data can become more rigorous. Goodchild explains how Twitter streams continue to be seen as important sources of information about social movements, or events, but often little is known about the demographics of those tweeting, and so it is impossible to understand the extent to which these tweets represent the wider sentiments of society. Furthermore, only a small percentage of tweets are geo-referenced, and so the data is skewed toward the data provided by people who opt in to provide that level of data. Much like many other geographers writing on the topic of big data, he sees the potential for such sources of data to be useful, but questions need to be raised about how they are used and how their quality can be improved.

Mark Graham has begun to ask questions about the geographies of big data and considered which areas of the world are displayed through big data sets and what kinds of uneven geographies are produced by them. The geographies of how data is produced are revealing in themselves. This is exemplified by examining the content of Wikipedia: every article on Wikipedia was downloaded and placed on a map of the world. While this showed a global distribution, particularly for articles in the English language, the worlds displayed by those in Persian, for example, were much more limited. The key point here was that the representations made available to the world through the use of big data can lead to the omission of other worlds that still exist but may not be visible. These absences or "data shadows" are also a concern for geographers. They raise questions about what this says about the places they represent. In exploring this phenomenon, geographers are seeking to explore the geographies of data authorship in the data deluge, considering why there are differences in the production of information and asking questions about why some people produce large amounts of data while others are excluded.

It is without question that digital technologies have transformed the ways in which we can explore the way the world works; the flood of data now being produced can be used to create more maps of places, more models of behavior, and more views on the world. With companies, governments, and research funding agencies calling for more effort to be put into generating and exploring big data, some geographers have highlighted that in order to deliver significantly valuable insights into societal behavior, more effort is needed to ensure that big data collection and analysis are scientifically robust. Big data, and particularly data that are geo-referenced, have provided a new wealth of opportunities to understand more about people and places, asking new questions and measuring new processes and phenomena in ways not previously possible.

Cross-References

▶ Demographic Data
▶ Disaster Planning
▶ Smart Cities
▶ Socio-spatial Analytics
▶ Spatial Data

Further Reading

Barnes, T. (2013). Big data, little history. Dialogues in Human Geography, 3(3), 297–302.
Batty, M. (2013). Big data, smart cities and city planning. Dialogues in Human Geography, 3(3), 274–278.
Gonzalez-Bailon, S. (2013). Big data and the fabric of human geography. Dialogues in Human Geography, 3(3), 292–296.
Goodchild, M. (2013). The quality of big (geo)data. Dialogues in Human Geography, 3(3), 280–284.
Kitchin, R. (2013). Big data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography, 3(3), 262–267.
Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures and their consequences. London: Sage.
Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. Available from: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Accessed 18 Nov 2014.
Li, L., Goodchild, M., & Xu, B. (2013). Spatial, temporal, and socioeconomic patterns in the use of twitter and Flickr. Cartography and Geographic Information Science, 40(2), 61–77.
Shelton, T., Poorthuis, A., Graham, M., & Zook, M. (2014). Mapping the data shadows of hurricane Sandy: Uncovering the sociospatial dimensions of 'big data'. Geoforum, 52(1), 167–179.

Geospatial Big Data

▶ Big Geo-data

Geospatial Data

▶ Spatial Data
Geospatial Information

▶ Spatial Data

Geospatial Scientometrics

▶ Spatial Scientometrics

Google

Natalia Abuín Vences1 and Raquel Vinader Segura1,2
1Complutense University of Madrid, Madrid, Spain
2Rey Juan Carlos University, Fuenlabrada, Madrid, Spain

Google Inc. is an American multinational company specialized in products and services related to the Internet, software, electronic devices, and other technology services, but its main service is the search engine that gives name to the company and, according to data from Alexa, is the most visited site in the world.

The company was founded by Larry Page and Sergey Brin. These two entrepreneurs met each other while studying at Stanford University in 1995. In 1996, they created a search engine (initially called BackRub) that used links to determine the importance of specific web pages. They decided to call the company Google, making a play on the mathematical term "googol," used to describe a number followed by a hundred zeros. Google Inc. was born in 1998 when Andy Bechtolsheim, cofounder of Sun Microsystems, wrote a check for $100,000 to this organization, which until then did not exist. At present the company has more than 70 offices across 40 countries and has over 50,000 employees.

The corporate culture of the organization is geared toward the care and trust of human resources. In 2013 Google was elected for the fourth time as the best company to work for in America in a list compiled by Fortune. Their offices were designed so that working in the company feels like the place where it was germinated: a college campus.

People are what really make the company what Google is. They hire smart and determined people, and they value the capacity for work above experience. While Googlers (this is how the employees of this company are known) share goals and expectations about the company, they come from diverse professional fields and among them speak dozens of languages, representing the global audience for which they work.

Google keeps an open culture of the kind that usually occurs at the beginning of a company, when everyone contributes in a practical way and feels comfortable sharing ideas and opinions. Googlers do not hesitate to ask questions on any matter of the company directly to Larry, Sergey, and other executives, both in the Friday meetings, by email, and in the coffee shop.

The offices and coffee shops are designed to promote interaction between Googlers and encourage work and play. The offices offer massage services, a fitness center, and a wide range of services that allow workers to relax and interact with each other.

The company receives over one million resumes per year, and the selection process may extend over several months. Only 1 out of 130 applicants gets a place in the company, while Harvard, one of the best universities in the world, admits 1 out of 14.

Google: The Search Engine

The centerpiece of the company is the search engine, which processes more than two million requests per second, compared to the 10,000 that it processed daily during the year of its creation. The Google search index has a size of more than 100 million gigabytes. To put the figure in perspective, 100,000 1-TB hard drives would be needed to reach this capacity. Google aims to improve load times daily, searching and indexing more URLs and improving search.
Each month the search engine receives more than 100 million unique users who perform 12,800 million searches.

PageRank is one of the keys to the search engine's success: a family of algorithms used to numerically assign the relevance of web pages indexed by a search engine. PageRank relies on the democratic nature of the web by using its vast link structure as an indicator of the value of a particular page. Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. The votes cast by an "important" page, i.e., one with high PageRank, weigh more and help to mark other pages as "important." Therefore, the PageRank of a page reflects its importance within the Internet itself. Thus, the more links a site receives from others, the more lasting those links are, and the more they come from reputable spaces, the greater the chances of a page appearing in the top search results of Google.
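The core computation behind this idea can be sketched in a few lines of Python. The snippet below is a simplified, illustrative power iteration over a tiny made-up link graph; it is not Google's production system, which combines PageRank with hundreds of other signals, and the pages and damping value are only assumptions for the example.

```python
# Simplified PageRank by power iteration over a tiny, hypothetical link graph.
# Illustration of the general technique only, not Google's implementation.
links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85                 # probability of following a link vs. jumping to a random page
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):            # iterate until the scores stabilize
    new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share   # each link passes on a share of its page's rank
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, the pages that receive links from other well-linked pages ("C" and, through it, "A") end up with the highest scores, which is exactly the "vote" intuition described above.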
The route of a search begins long before it is typed into Google. The search engine uses computer robots (spiders or crawlers) that look for web pages to include in the Google search results. Google software stores data on these pages in data centers. The web is like a book with billions of pages that are being indexed by the search engine, a task on which Google has already spent more than one million hours.

When the user initiates a search, the Google algorithm starts searching for the information demanded on the Internet. The search runs an average of 2400 miles before offering an answer: it can stop in data centers around the world during the journey and travels hundreds of millions of miles per hour, at nearly the speed of light.

When the user enters a query, predicted searches and results appear before pressing Enter. This feature saves time and helps to get the required response faster. This is what is called Google Instant. This feature has several advantages:

• Faster searches: predicted searches show results before the user has finished drafting the query; Google Instant can save 2–5 s per search.
• More accurate predictions: even if the user does not know exactly what he is looking for, predictions will guide him in his search. The first prediction is shown in gray so the user can stop typing as soon as he finds what he is looking for.
• Instant results: when the user starts writing, the results begin to appear.

The algorithm checks the query and uses over 200 variables to choose the most relevant answers among millions of pages and content. Google updates its ranking algorithms with more than 500 improvements annually. Examples of the variables used to locate a page are:

• Updating of the content of a website
• Number of websites that link to a home page, and their links
• The website keywords
• Synonyms of the keywords searched
• Spelling
• Quality of the content of the site, URL, and title of the web page
• Personalization
• Recommendations of users to whom we are connected

The results are sorted by relevance and displayed on the page. In addition to instant results, we can get a preview of web pages by placing the cursor on the arrows to the right of a result to quickly decide whether we want to visit that site. The Instant Preview takes a split second to load (on average).

Some important figures related to the search are listed:

• Google has answered 450,000 million new queries (searches never seen before) since 2003.
• 16% of daily searches are new.

AdWords
The main source of income of the company and the search engine is advertising. In 2000 Google introduced AdWords, a method of advertising that is dynamic for the client: advertisers, with the concept of pay-per-click ads, pay only for those ads on which a surfer clicked. Site owners, for their part, are paid based on the number of clicks the ads on their websites have generated.

Running AdWords is simple: the advertiser writes an ad showing potential customers the products and services he offers. He then chooses the search terms that will make this announcement appear among Google search results. If the key words entered by users match the search terms selected by the advertiser, the advertising is displayed above or alongside search results. This tool generates 97% of company revenues.

Other Products and Services

As already mentioned above, the search engine is the flagship product of the company, on which it spends most of its human and material resources. But the company has a great capacity for innovation and has developed many products and services to market in the field of Information and Communication Technologies related to social media and communication, maps, location, web browsing, developers, etc. The major Google products and services are classified below according to their type of functionality.

Services and Information Searches
In addition to the web search engine, Google offers a number of specialized services that can locate only certain kinds of content:

• Google Images: searches only images and pictures stored on the web.
• Google News: shows the latest news on any topic, introducing a term. It allows us to subscribe and receive the news by email. This way, users get news tracked from several websites dedicated to information.
• Google Blog Search: locates content hosted on blogs.
• Google Scholar: searches for articles published in academic journals and books.
• Google Books: allows users to locate books on the web.
• Google Finance: seeks economic news.

Social and Communication Services
• Google Alerts: this is a notification system through email that sends alerts based on a search term when it appears in search engine results.
• Google Docs is a tool for creating and sharing documents, spreadsheets, and presentations online. It works similarly to the programs of an office suite.
• Google Calendar is a tool that lets you organize and share online calendar appointments and events, and it is integrated with Gmail. This tool is available to access data from computers or other devices.
• Google Mail (Gmail) is the main and most important email service on the Internet, with all the necessary features. Its speed, safety features, and options make it almost unnecessary to use another email service. In addition, through this mail service, the user has direct access to many of the services and products such as Google+, Google Drive, Calendar, YouTube, etc.
• Google Plus (Google+) is the social network of Google and in the first half of 2014 had nearly 350 million active users. This service is integrated into the main services of Google. When we create a Google account, we automatically become part of this network.
• Google Hangouts: this is an instant messaging service that also allows us to make video calls and conferences between several people. It began as a text chat with video support for up to ten participants and later took over SMS handling on Android devices.
• Google Groups: this application allows us to create discussion groups on almost any topic. Users can create new groups or join existing ones, make posts, and share files.
• Google Play: the Android content store. We can find millions of apps, books, movies, and songs, paid and free.
Maps, Location and Exploration
• Google Maps provides different maps from around the world and allows us to calculate distances between different geographic locations.
• Google Street View displays panoramic photos of places and sites from Google Maps.
• Google Earth: this is a virtual 3D World Atlas, used for satellite imagery and aerial photos. There are applications for portable devices. Other similar versions let the user admire the moon and Mars and explore space using photographs from NASA: Google Moon, Google Mars, and Google Sky.

Tools and Utilities
• Google Translate is a complete service to translate text or web pages in different languages. The service can be used offline with Android devices.
• Google Chrome: the Google web browser. It is simple and fast and has a lot of extensions to add functionality. All our navigation data such as history, saved passwords, bookmarks, and cache can be synchronized automatically with our Google account, so we have a backup in the cloud and have them on different computers or devices.
• Android is an open-source operating system for portable devices.
• Google AdSense is an advertising service to display ads on the pages of a website or a blog created in Blogger. The ads shown come from the Google AdWords service.
• Blogger is a publishing platform that lets us create personal blogs with their own domain name.
• Google Drive is a storage service in the cloud to store and share files such as photos, videos, documents, and any other file types.
• Picasa is an application that lets us store photos online, linked to our account in the social network Google+, including tools for editing, effects, face recognition, and geo-location.
• YouTube is a service that allows us to upload, view, and share videos. This service was acquired by Google in 2006 for 1650 million euros.

Developer Tools and Services
• Google Developers: a page with technical documentation for using and exploiting Google's resources in an advanced way.
• Google Code: a repository or warehouse where developers host (serve) code to use or share freely with others.
• Google Analytics: a powerful and useful statistics and analytics service for websites. Using a tracking code inserted into our site, it creates detailed reports with all types of visitor data.
• Google Webmaster Tools: a set of tools for those who have a blog or website on the Internet. It checks the status of indexing by the search engine to optimize our visibility, offers several reports that help us understand the reach of our publications, and sends notifications of any serious error.
• Google Fonts: a service that offers several online fonts for use on any website.
• Google URL Shortener: a service that allows URLs to be shortened and also offers statistics.
• Google Trends: this service provides tools to compare the popularity of different terms and to locate trends on a map of searches.
• Google Insights: a tool for checking the popularity of one or more search terms on the Internet, using the Google database.
• Apart from all these products and services related to the web and its many features, Google continues to diversify its market and is currently working on products whose purpose remains to make life easier for people.
• Google Glasses: a display device, a kind of augmented reality glasses that let us browse the Internet, take pictures, capture videos, etc.
• Driverless car is a Google project that aims to develop a technology that allows the marketing of driverless vehicles. Google is working with legislators to test autonomous vehicles on public roads and already has approval for the project in two states in the United States: California and Nevada. It is still unknown when the technology will be available on the market.

Cross-References

▶ Google Analytics

Further Reading

Jarvis, J. (2009). What would Google do? New York: Harper Collins.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the Web. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.
Poundstone, W. (2012). Are you smart enough to work at Google? New York: Little, Brown and Company.
Stross, R. (2009). Planet Google: One Company's audacious plan to organize everything we know. New York: Free Press.
Vise, D., & Malseed, M. The Google story. London: Pan Macmillan.

Google Analytics

Natalia Abuín Vences1 and Raquel Vinader Segura1,2
1 Complutense University of Madrid, Madrid, Spain
2 Rey Juan Carlos University, Fuenlabrada, Madrid, Spain

Google Analytics is a web statistics service offered by the North American technology giant Google since the beginning of 2006. It provides highly useful information and allows companies of all types to obtain data about their website traffic quickly and easily. Google Analytics not only measures sales and conversions but also offers statistics on how visitors use a website, how they reach it, and what can be done to keep them coming back. According to the company, it is a service aimed at executives, marketers, and web and content developers. It is useful for optimizing online marketing campaigns in order to increase their effectiveness, improve the site's content organization, increase interaction with a website, and learn the reasons why visitors leave a site without making any purchase.
This service is free of charge, a feature emphasized at its presentation on November 11, 2005. For experts, this redefined the business model existing at the time, which until then had been based on charging according to the volume of traffic analyzed. The benefit for Google is concentrated in its advertising products, AdWords and AdSense, with an annual turnover estimated at 20 billion dollars. However, it is important to note that although Google offers this service free of charge – except for those companies whose sites have more than five million visits per month – all clients still need to invest in the implementation of the system.

Features

The service offers the following features:

1. Monitoring of several websites. Through the account on the platform, the user of this service has multiple views and can consult specific reports for a domain or subdomain.
2. Monitoring of a blog, MySpace, or Facebook pages. It is possible to use Google Analytics on sites where the page code cannot be changed (for example, on MySpace). In that case, it is recommended to use third-party widgets in order to set up the service for the predefined templates of these kinds of websites (Facebook, MySpace, WordPress).
3. Tracking of visits from RSS feeds, for which it is necessary to run the service's tracking code beforehand. Since most programs for reading RSS or Atom feeds cannot execute JavaScript code, Google Analytics does not count page views recorded through an RSS reader. In order for those page views to be tracked by the service, visitors must run a JavaScript file hosted on Google's servers.
4. Compatibility with other web analytics solutions. Google Analytics can be used alongside other internal or third-party solutions that have been installed for this purpose. For a list of possible compatible solutions, please check its App Gallery.

The operation of Google Analytics is based on a page tag technique, which consists in collecting data through the visitor's browser and storing it on data collector servers. This is a popular data collection method because it is technically simpler and cheaper. It combines cookies, Internet browsers, and code placed within each page of the website. The information is collected through a JavaScript code (known as a tag or beacon) generated by logging into Google Analytics. This fragment of JavaScript code (several fragments, in the case of separate tracking for several domains) must be copied and inserted into the header of the website's source code. When a user visits a page that contains this tracking code, it is loaded along with the other elements of the site and generates a cookie: a small text message that a web server transmits to the web browser to track the user's activity on the website until the end of the visit. Meanwhile, all the captured data is uploaded to Google's servers, and the corresponding reports become available immediately in the Google Analytics panel.
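
The mechanism can be pictured with a small, self-contained sketch. The following is only a toy illustration of the general page-tag pattern described above – a tiny collector that logs each beacon request it receives – and not Google's actual endpoint, protocol, or parameter names; the query fields and the output file are invented for the example.

```python
# Toy beacon collector: the page requests a tiny URL with a few query
# parameters, and the collector appends one row per hit to a log file.
import csv
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

LOG_FILE = "hits.csv"  # hypothetical output file

class BeaconHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        row = [
            time.strftime("%Y-%m-%dT%H:%M:%S"),
            query.get("cid", ["anonymous"])[0],  # visitor id, normally kept in a cookie
            query.get("page", ["/"])[0],         # page being viewed
            self.headers.get("Referer", ""),     # where the visitor came from
        ]
        with open(LOG_FILE, "a", newline="") as f:
            csv.writer(f).writerow(row)
        self.send_response(204)  # empty response: the beacon has done its job
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), BeaconHandler).serve_forever()
```

A page would trigger such a beacon with a request like http://localhost:8000/collect?cid=123&page=/home (a hypothetical URL); a production tag sends far more fields and sets the visitor cookie itself.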
Thus, the system provides real-time reports that can also be customized according to each client's interests. In addition, it offers advanced segmentation and a traffic flow view, that is, an analysis of a visitor's path through a site, which allows users' interaction with the pages to be evaluated visually.

Metrics

The reports offered by Google Analytics on a certain web page provide the following information:

1. Visits
1.1. Visits received by the site. A visit is an interaction, by an individual, with a website.
1.2. Page views. The number of times a web page is loaded completely from the web server to a user's browser during a given period.
1.3. Pages/visit. The pages-per-visit ratio. It measures the depth of the visit and is closely related to the time spent on the website.
1.4. Bounce rate, or percentage of rebound. The percentage of visits that consult only one page of a site before leaving it.
1.5. Average time on site (average dwell time). The mean time that visitors to a site spend interacting with it, which serves to understand whether users are engaged with the site. It is calculated as the difference between the last and the first page view requested by the visitor, not the moment users leave the page.
1.6. % new visits. The percentage of new visitors to the page.
2. Traffic sources
2.1. Direct traffic. Visitors who arrive directly at the website by typing its address into the browser.
2.2. Referring sites, or pages from which visitors arrive at the measured site.
2.3. Search engines. Visits that reach the page from search engines.
3. Place. The origin location of the visitor
3.1. Number of visits by place of origin
3.2. Graph by region
3.3. Map indicating where visits come from.
4. Content: the main contents of the site visited.

Other variables that can be measured are the visitor's language setting, the browser used, and the type of Internet connection used by the user. Google Analytics displays these metrics both graphically and numerically, helping the user to interpret the main results.
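
As a minimal sketch of how the session metrics defined above relate to the underlying page-view records, the following example computes pages/visit, bounce rate, and average time on site from a handful of invented visits. It illustrates the definitions only and is not Google's implementation.

```python
# Each tuple is one recorded page view: (visit_id, seconds_since_visit_start, page).
from collections import defaultdict

page_views = [
    ("v1", 0, "/home"), ("v1", 40, "/products"), ("v1", 95, "/contact"),
    ("v2", 0, "/home"),                       # a one-page visit: a "bounce"
    ("v3", 0, "/blog"), ("v3", 120, "/home"),
]

visits = defaultdict(list)
for visit_id, t, page in page_views:
    visits[visit_id].append(t)

n_visits = len(visits)
pages_per_visit = sum(len(v) for v in visits.values()) / n_visits
bounce_rate = sum(1 for v in visits.values() if len(v) == 1) / n_visits
# Time on site runs from the first to the LAST page view of a visit, so a
# single-page visit contributes zero seconds, exactly as noted in metric 1.5.
avg_time_on_site = sum(max(v) - min(v) for v in visits.values()) / n_visits

print(f"Pages/visit: {pages_per_visit:.2f}")               # 2.00
print(f"Bounce rate: {bounce_rate:.0%}")                   # 33%
print(f"Average time on site: {avg_time_on_site:.0f} s")   # 72 s
```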

Benefits of Google Analytics

Among others, we must mention the following:

1. Compatibility with third-party services. It offers the possibility of using applications that enhance the data collected by Google Analytics. Many of these applications are contained in the App Gallery.
2. Loading speed. The Google Analytics code is light and is loaded asynchronously, so it will not affect the loading of the page.
3. Customized dashboards. It is possible to select abridged reports from the main sections of Google Analytics. Up to 12 different reports can be selected, modified, or added to the main page. These reports provide information in real time and can be set to be distributed via email.
4. Map overlay reports. A graphical way of presenting data that reflects where around the world visitors are connecting from when viewing a website. This is possible thanks to IP address location databases, which provide a clear representation of the parts of the world visitors are coming from.
5. Data export and scheduling. Report data can be manually exported in a variety of formats, including CSV, TSV, PDF, or the open-source XML.
6. Multiple language interfaces and support. Google Analytics can currently display reports in 25 languages, and this number is growing continually.
7. In addition to monitoring Facebook and other social spaces, Google Analytics can track mobile websites and mobile applications on all web-enabled devices, whether or not the device runs JavaScript.

Some disadvantages of Google Analytics that we can point out are:

– Technical limitations: if the visitor's browser does not have JavaScript enabled or if some cookie-capture functions are blocked, the results reported will not be accurate. This situation is not common, but it can occur and be relevant on sites with millions of hits.
– Some users claim that the interface is not as intuitive as is often suggested. It usually requires some preparation time to become familiar with the information offered. Google developers continue to work on the program to make it more intuitive.
– Loss of data by mistake. If the code provided by Analytics and integrated into the source code of the site is deleted by mistake, e.g., while updating a theme in WordPress, any records during the time it is not working will be lost.

Privacy

As discussed previously, the service uses cookies to collect information about the interactions of certain website users. The reports offered by Google Analytics provide non-personally identifiable information as part of Google policy and do not include any information about real IP addresses. That is, the information provided to Google's clients does not include anything that can be used to identify a user, such as personal names, addresses, emails, or credit card numbers, among others. In fact, Google has implemented a privacy policy committed to keeping the information stored in its computer systems secure. Google Analytics protects the confidentiality of data in different ways:

1. The clients of this service are not allowed to send personal information to Google.
2. Data are not shared without the user's consent, except in certain limited circumstances, such as disclosures required by law.
3. Investments in security systems. Engineering teams dedicated to security at Google fight against external data threats.

Certified Partner Network

Google Analytics has a network of certified partners around the world that provides support for the implementation of the system and offers expert analysis of the data collected. These are agencies and consultancies that provide Google web statistics applications, website analysis and testing services, and optimization services, and that have passed a long selection process.

To become a member of this network, interested companies must complete a training program, which necessarily involves obtaining the Google Analytics Individual Qualification (GAIQ) certified by the Google Analytics Conversion University.
A list of certified partners by country can be consulted on the Google Analytics site. The most important services provided are the following:

– Web measurement planning
– Basic, advanced, and custom implementations
– Mobile sites and applications development
– Technical assistance
– Google Analytics' API integrations
– Analysis and consulting for online media
– Online media channel allocation
– Websites and landing page testing
– Development of custom panels
– Training

At the same time, there is a program of authorized Google Analytics Premium partners, in which certified partners can offer the Google Analytics Premium package directly to customers.

Google Analytics Premium

The Premium service offers the same features as the free version of Google Analytics but includes extra features that make it an ideal tool for large companies. In exchange for a flat fee, it offers higher processing power for more detailed results, a dedicated service and support team, service guarantees, and capacity for up to thousands of millions of visits a month. Google Analytics Premium allows the client to collect, analyze, and share more data than ever before.

– Extended data limits. Measure much more than before: up to one billion visits per month.
– The ability to use 50 custom variables, 10 times more than the standard version.
– Unsampled report downloads: up to three million rows of data without sampling, for very precise analysis.

This service also offers a dedicated account manager, who works as a member of the company's team and provides technical assistance in real time in order to ensure the quick resolution of incidents.
Google Analytics is a service that offers an enormous amount of data about the traffic on a particular website. Its real importance lies in the interpretation of this information in order to optimize processes and to design and launch marketing campaigns.

Cross-References

▶ Google

Further Reading

Clifton, B. (2010). Advanced web metrics with Google analytics. Indianapolis: Wiley.
Google Inc. Google analytics support. http://www.google.com/analytics/. Accessed Aug 2014.
Ledford, J., Teixeira, J., & Tyler, M. E. (2010). Google analytics. Indianapolis: Wiley.
Quinonez, J. D. What Google Analytics is and how it works. http://wwwhatsnew.com/2013/08/27/que-es-y-como-funciona-google-analytics. Accessed Aug 2014.

Google Books Ngrams

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Synonyms

Google Books Ngrams

Introduction

The Google Books Ngram corpus is the largest publicly available collection of linguistic data in existence. Based on books scanned and collected as part of the Google Books Project, the Google Books Ngram Corpus lists "word n-grams" (groups of 1–5 adjacent words, without regard to grammatical structure or completeness) along with the dates of their appearance and their frequencies, but not the specific books involved. This database provides information about relatively low-level lexical features in a database that is orders of magnitude larger than any other corpus available. It has been used for a variety of research tasks, including the quantitative study of lexicography, language change and evolution, human culture, history, and many other branches of the humanities.

Google Books Ngrams

The controversial Google Books project was an ambitious undertaking to digitize the world's collection of print books. Google used high-speed scanners in conjunction with optical character recognition (OCR) to produce a machine-readable representation of the text contained in millions of books, mostly through collaboration with libraries worldwide. Although plagued by litigation, Google managed to scan roughly 25 million works between 2002 and 2015, roughly a fifth of the 130 million books published in human history (Somers, 2017).
In part to satisfy the demands of copyright law, Google does not typically make the full text of books available, but is allowed to publish and distribute snippets. The Google Books Ngram corpus (Michel et al., 2011) provides n-grams (groups of n consecutive nonblank characters, separated by whitespace) for five million books at values of n from 1 to 5. A typical 1-gram is just a word, but could also be a typing mistake (e.g., "hte"), an acronym ("NORAD"), or a number ("3.1416"). A typical 2-gram would be a two-word phrase, like "in the" or "grocery store," while a typical 5-gram would be a five-word phrase like "the State of New York" or "Yesterday I drove to the." These data are tabulated by year of publication and frequency to create the database. For example, the 1-gram "apple" appeared seven times in one book in 1515, 26 times across 16 books in 1750, and 24,689 times across 7871 books in 1966.
Google Books Ngrams provides data for several different major languages, including English, Chinese (simplified), French, German, Hebrew, Italian, Russian, and Spanish. Although there are relatively few books from the early years (only one English book was digitized from 1515), there are more than 200,000 books from 2008 alone in the collection, representing nearly 20 billion words published in that year. The English corpus as a whole has more than 350 billion words (Michel et al., 2011), substantially larger than other large corpora such as News on the Web (4.76 billion words), Wikipedia (1.9 billion words), Hansard (1.6 billion words), or the British National Corpus (100 million words), but it does not provide large samples or contextual information.

Using the Google Books Ngrams Corpus

Google provides web access through a form, the Ngram Viewer, at https://books.google.com/ngrams. Users can type the phrases that interest them into the form, choose the specific corpus, and select the time period of interest. A sample screen shot is attached as Fig. 1. This plots the frequency of the words (technically, 1-grams) "man," "woman," and "person" from 1950 to 2000. Three relatively clear trends are visible in this figure. The frequency of "man" is decreasing across the time period, a drop-off that accelerates after 1970, while the frequencies of both "woman" and "person" start to increase after 1970. This could be read as evidence of the growing influence of the feminist movement and its push towards more gender-inclusive language.

Google Books Ngrams, Fig. 1 Google Books Ngram Viewer (screen shot)

The Ngram Viewer provides several advanced features, like the ability to search for wildcards ("*") such as "University of *." Searching for "University of *" gives a list of the ten most common/popular words to complete that phrase. Perhaps as expected, this list is dominated in the 1700s and 1800s by the universities of Oxford and Cambridge, but by the early 1900s, the most frequent uses are the universities of California and Chicago, reflecting the increasing influence of US universities. The viewer can also search for specific parts of speech (for example, searching for "book_VERB" and/or "book_NOUN"), groups of inflectionally related words (such as "book," "booking," "books," and "booked"), or even words related in terms of dependency structure. In addition, the raw data is available for download from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html for people to use in running their own experiments.
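
For work that goes beyond the Viewer, the downloaded files can be processed directly. The following is a small, hedged sketch: it assumes the version-2 layout of one tab-separated record per ngram and year (ngram, year, match count, volume count), and the file name is a placeholder for a locally downloaded shard rather than a guaranteed path.

```python
# Tally the yearly occurrence counts of a single word from a raw 1-gram shard.
import csv
from collections import defaultdict

NGRAM_FILE = "googlebooks-eng-all-1gram-20120701-a"  # placeholder for a local shard

counts = defaultdict(int)  # year -> occurrences of the word "apple"
with open(NGRAM_FILE, encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        ngram, year, match_count = row[0], int(row[1]), int(row[2])
        if ngram == "apple":
            counts[year] += match_count

for year in sorted(counts):
    print(year, counts[year])
```

Dividing each year's count by the total number of words published that year (a separate totals file is distributed with the corpus) yields the relative frequencies of the kind the Ngram Viewer plots.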
Uses of the Corpus

The Google Books Ngram corpus has proven to be a widely useful research tool in a variety of fields. Michel et al. (2011) were able to show several findings, including the fact that the English lexicon is much larger than estimated by any dictionary; that there is a strong and measurable tendency for verbs to regularize (that is, for regular forms like "learned" or "burned" to replace irregular forms like "learnt" or "burnt"); that there is an increasingly rapid uptake of new technology into our discussions and, similarly, an increasing abandonment of old concepts such as the names of formerly famous people; and, finally, showed an effective method of detecting censorship and suppression. Other researchers have used this corpus to detect semantic shifts over time, to examine the cultural consequences of the shift from an urban to rural environment, to study the growth of individualization as a concept, and to measure the amount of information in human culture (Juola, 2012).

The Ngram corpus has also been useful for establishing how common a collocation is and, by extension, for helping assess the creativity of a proposed trademark. For both small-scale (at the level of words and collocations) and large-scale (at the level of language or culture itself) investigations, the Ngram corpus creates new affordances for many types of research.

Criticisms of the Google Books Ngram Corpus

The corpus has been sharply criticized for several perceived failings, largely related to the methods of collection and processing (Pechenick et al., 2015). The most notable and common criticism is the overreliance on OCR. Any OCR process will naturally contain errors (typically 1–2% of the characters will be misscanned with high-quality images), but the error rates are much higher with older documents. One aspect that has been singled out for criticism (Zhang, 2015) is the "long s" or "medial s," a pre-1800 style of printing the letter "s" in the middle of a word that looks very similar to the letter "f." Figure 2 shows an example of this from the US Bill of Rights. Especially in the earlier (2009) version of the Google Books Ngram corpus, the OCR engine was not particularly good at distinguishing between the two, so the word "son" could be read as "fon" in images of early books.

Google Books Ngrams, Fig. 2 Example of "medial s" from the United States Bill of Rights (1788)

Another issue is the representativeness of the corpus. In the early years of printing, book publishing was a rare event, and not all books from that period survive. The corpus contains a single book from 1505, none at all from 1506, and one each from 1507, 1515, 1520, 1524, and 1525. Not until 1579 (three books) is there more than one book from a single year, and still, in 1654, only two books are included. Even as late as 1800, there are only 669 books in the corpus. Of course, the books that have survived to be digitized are the books that were considered worth curating in the intervening centuries, and they probably do not accurately represent language as typically used or spoken.
More subtly, researchers like Pechenick et al. (2015) have shown that the corpus itself is not well balanced. Each book is equally weighted, meaning that a hugely influential book like Gulliver's Travels (1726) has less weight on the statistics than a series of largely unread sermons by a prolific eighteenth-century minister that happened to survive. Furthermore, the composition of the corpus changes radically over time. Over the twentieth century, scientific literature starts to become massively overrepresented in the corpus, while fiction decreases, despite being possibly a more accurate guide to speech and culture. (The English corpus offers fiction-only as an option, but the other languages do not.)

Conclusion

Despite these criticisms, the Google Books Ngram corpus has proven to be an easily accessible, powerful, and widely useful tool for linguistic and cultural analysis. At more than 50 times the size of the next largest linguistic data set, the Ngram corpus provides access to more raw data than any other corpus currently extant. While not without its flaws, these flaws are largely inherent to any large corpus that relies on large-scale collection of text.

Cross-References

▶ Corpus Linguistics

Further Reading

Juola, P. (2013). Using the Google N-Gram corpus to measure cultural complexity. Literary and Linguistic Computing, 28(4), 668–675.
Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176–182.
Pechenick, E. A., Danforth, C. M., & Dodds, P. A. (2015). Characterizing the Google Books Corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PloS One, 10(10), e0137041. https://doi.org/10.1371/journal.pone.0137041.
Somers, J. (2017). Torching the modern-day library of Alexandria. The Atlantic. https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/. Accessed 23 July 2017.
Zhang, S. (2015). The pitfalls of using Google Ngram to study language. Wired. https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/. Accessed 23 July 2017.

Google Flu

Kim Lacey
Saginaw Valley State University, University Center, MI, USA

Google Flu Trends (GFT) is a tool designed by Google to collect users' web searches in order to predict outbreaks of influenza. These trends are identified by tracking search terms related to symptoms of the virus combined with the geographic location of users. In terms of big data collection, GFT is seen as a success due to its innovative utilization of large amounts of crowd-sourced information. However, it has also been deemed somewhat of a failure due to the misunderstanding of flu symptoms and media-influenced searches. Matthew Mohebbi and Jeremy Ginsberg created GFT in 2008. As of 2014, GFT actively monitors 29 countries.
When GFT was first launched, reviewers praised its accuracy; in fact, GFT was touted to be 97% accurate. However, these numbers were discovered to be quite misleading. GFT analyzes Google searches to map geographic trends of illness. When an individual feels an illness coming on, one of the common, contemporary reactions is to investigate these symptoms online. Such searches might be accurate (i.e., people searching for the correct affliction), but many are not. As such, this results in the high number of physicians noting that patients are coming to a doctor's visit with lists of symptoms and possible conditions, all discovered on websites like WebMD or the Mayo Clinic. To further complicate the accuracy of GFT, many identify "cold" symptoms (e.g., runny nose) as flu symptoms, while in fact influenza is mainly a respiratory infection whose major symptoms include cough and chest congestion. One of the complications of GFT results from the algorithm for GFT not being accurately designed to "flag" the correct symptoms. In a separate study, David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani suggest that heavy media reporting of flu outbreaks led to many unnecessary Google searches about the flu. Because of this influx, GFT results spiked, falsely indicating the number of cases of the flu.
The combination of searching for incorrect flu symptoms and heavy media attention proved catastrophic for GFT's initial year of reporting. The inaccuracies actually went unnoticed until well after the 2008–2009 flu season. In 2009, an outbreak of H1N1 (popularly known as the swine flu) during the off-season also caused unexpected disruptions of GFT. Because GFT was designed to identify more common strains of influenza, the unpredicted H1N1 outbreak wreaked havoc on reporting trends. Even though H1N1 symptoms do not vary too much from common flu symptoms (it is the severity of the symptoms that causes the majority of the worry), its appearance during the off-season threw trackers of GFT for a loop. Additionally, H1N1 proved to be a global outbreak and thus another complication in the ability to accurately report the flu trends.
Within the first 2 years of GFT, it received a great amount of attention, for good and for ill. Some of the positive attention focused specifically on the triumph of big data. To put this phenomenon in perspective, Viktor Mayer-Schonberger and Kenneth Cukier point out that GFT utilizes billions of data points to identify outbreaks. These data points, which are collected from smaller sample groups, are then applied to larger populations to predict outbreaks.

Mayer-Schonberger and Cukier also note that before GFT, flu trends were always a week or so off track – in a sense, we were always playing catch-up. Google, on the other hand, recognized an opportunity. If Google could flag specific keyword searches, both by volume and geographic location, it might be able to identify a flu outbreak as it was happening or even before it occurred. On the surface, this idea seems to be an obvious and positive use of big data. Diving deeper, GFT has received a lot of critical attention due to skewed reporting, incorrect algorithms, and misinformed interpretation of the data sets.
Mayer-Schonberger and Cukier imply that correlation has a lot to do with the success and failure of GFT. For example, the more people who search for symptoms of the flu in a specific location, the more likely there is to be an outbreak in that area. (In a similar vein, we might be able to use Google analytics to track how many people are experiencing a different event specific to a geographic location, such as a drought or a flood.) However, what Google did not take into consideration was the influence the media would have on its product. Once GFT began receiving media attention, more users began searching both for additional information on the Google Flu project and for flu-related symptoms. These searches were discovered to have arisen only because users had heard about GFT on the news or in other reporting. Because the GFT algorithm did not take into consideration traffic from media attention, the algorithm did not effectively account for the lack of correlation between searches related to media coverage and searches related to actual symptoms. To this day, Google's data sets remain unstable and struggle with the relation between actual cases of the flu and media coverage. As of yet, GFT does not have an algorithm that has been effectively designed to differentiate between media coverage and influenza outbreaks.
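
The basic statistical idea behind this kind of system can be pictured with a toy sketch: fit a simple relationship between the historical volume of flu-related queries and officially reported influenza-like-illness (ILI) rates, then apply it to the newest query data, which is available sooner than official surveillance. This is only a schematic illustration with invented numbers, not Google's actual model, and – as the text notes – media-driven search spikes break exactly this assumed relationship.

```python
# Toy nowcast: regress official ILI rates on flu-related query volume.
import numpy as np

# Weekly fraction of searches matching flu-related terms (invented data)
query_share = np.array([0.8, 1.1, 1.9, 2.7, 3.4, 2.9, 2.0, 1.2])
# Matching weekly CDC-style influenza-like-illness rates (invented data)
ili_rate = np.array([1.0, 1.3, 2.2, 3.1, 3.8, 3.3, 2.4, 1.5])

# Least-squares fit: ili_rate ≈ slope * query_share + intercept
slope, intercept = np.polyfit(query_share, ili_rate, 1)

new_week_share = 3.0  # this week's query volume, available before official data
print(f"Nowcast ILI rate: {slope * new_week_share + intercept:.2f}")
```

If a news cycle drives query_share upward without any corresponding rise in actual illness, a fixed fit like this one will overestimate the outbreak – the failure mode described above.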
But still, there have been many researchers who have touted GFT as a triumph of collective intelligence, a signal that big data is performing in the ways researchers and academics imagined and hoped it would from the start. The ability to use large data sets is an impressive, and advantageous, use of mundane actions to establish health patterns and prepare for outbreaks. Even though all users agree to Google's terms of service anytime they utilize its search engines, few recognize what happens to this information beyond the returned search results. The returned results are not the only information being shared during a web search. And even though, on its privacy policy page, Google acknowledges the ways it collects user data, what it collects, and what it will do to secure the privacy of that information, some critics feel this exchange is unfair. In fact, users share much more than they realize, hence the ability for Google to create a project such as GFT based on search terms and geolocation. The seemingly harmless exchange of search terms for search results is precisely where Google draws its collection of data for GFT from. By aggregating the loads of metadata users provide, Google hoped it would be able to predict global flu trends.
While all this data collection for predictive health purposes sounds great, there has been a lot of hesitation regarding the ownership and use of private information. For one, GFT has received some criticism because it represents the shift in access to data from academics to companies. Put simply, rather than academics collecting, analyzing, and sharing the information they glean from years of complex research, putting these data sets in the hands of a large corporation (such as Google) gives users pause. In fact, for many users, the costs of health care and access to health providers leave few alternatives to web searches for finding information about symptoms, preventative measures, and care. Another concern about GFT is the long-term effect of collecting large amounts of data. Because big data collection is fairly new, we do not know the ramifications of collecting information to which individuals do not have easy access. Even though Google's privacy policy states that user information will remain encrypted, the ways in which this information can be used and shared remain vague. This hesitancy is not exclusive to GFT. For example, in another form of big data, albeit a more personalized version, many believe the results of DNA sequencing being shared with insurance companies will lead to a similar loss of control over personal data. Even though individuals should maintain legal ownership over personal health, the question of what happens when it merges with big data collection remains unclear.

Further, the larger implications of big data are unknown, but projects like GFT and DNA sequencing pose the question of who owns our personal health data. If we do not have access to our own health information, or if we do not feel able to freely search for health-related issues, then Google's tracking might pose more problems than it was designed to handle.
Along these lines, one of the larger concerns with GFT is the use of user data for public purposes. Once again echoing concerns about DNA sequencing, a critique of GFT is how Google collects its data and what it will do with it once a forecast has been made. Because Google's policies state that it may share user information with trusted affiliates or in response to enforceable government requests, some are worried that Google's collection of health-related data might lead to geographically specific ramifications (e.g., higher health insurance premiums). On the flip side, Miguel Helft, writing in The New York Times, notes that while some users are concerned about privacy issues, GFT did not alter any of Google's regular tracking devices, but instead allows users to become aware of flu trends in their area. Helft points out that Google is only using the data it originally set out to collect and has not adjusted or changed the ways it collects user information. This explanation, however, does not appease everyone, as some are still concerned with Google's lack of transparency in the process of collecting data. For example, GFT does not explain how the data is collected, nor does it explain how the data will be used. Google's broadly constructed guidelines and privacy policy are, to some, flexible enough to apply to many currently unimagined intentions.
In response to many of these concerns, Google attempted to adjust the algorithm once again. Unfortunately, once again the change did not consider (or did not consider enough) the high media coverage GFT would receive. This time, by adjusting the algorithm on the assumption that media coverage was to blame for the GFT spike, it only created more media coverage of the problems themselves, thus adding to (or at minimum sustaining) the spike in flu-related searches. At one point, Lazer, Kennedy, King, and Vespignani sought to apply the newly adjusted algorithm to previously collected data. To better evaluate the flu trends from the start of GFT, Lazer, Kennedy, King, and Vespignani applied the adjusted algorithm to backdated data found using the Wayback Machine, although trends were still difficult to identify because of the uncertainty of the influence of media coverage.
Another of the more troubling issues with GFT is that it is failing to do what it was designed to do: forecast the Centers for Disease Control's (CDC) results. The disconnect between what actually occurred (how many people actually had the flu) and what GFT predicted was apparent in year 1. Critics of GFT point to the fact that it consistently overshot the number of cases of the flu in almost every year since its inception in 2008. This overestimation is troubling not only because it signifies a misunderstanding of big data sets, but also because it could potentially cause a misuse of capital and human resources for research. The higher the predicted number of flu cases, the greater the amount of attention fighting that outbreak will receive. If these numbers remain high, financial resources will not be spent on projects which deserve more attention. David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani reviewed Google's algorithm and discovered that in the 2011–2012 flu season alone (3 years into the project) GFT overestimated the number of cases by as much as 50% more than what the CDC reported. The following year, after Google retooled its algorithm, GFT still overestimated the number of flu cases by approximately 30%. Additionally, Lazer, Kennedy, King, and Vespignani noticed GFT estimates were high in 100 of the 108 weeks they tracked. The same four scientists suggested that Google was guilty of what they call "big data hubris": the assumption that big data sets are more accurate than traditional data collection and analysis. Further, the team suggested that GFT couple its data with CDC data. Since GFT was not designed as a substitute for doctor visits, by linking the predictions with reported cases, GFT would be able to more effectively and accurately predict flu outbreaks.

GFT is not disappearing, however. It is a project that many still stand behind because of its potential to impact global flu outbreaks. Google continues to back its trend analysis system because of its potential to recognize outbreaks of the flu and, eventually, other more serious diseases. One of the ways they are addressing concerns is by using what they call "nowcasting": using data trends to provide daily updates rather than larger, seasonal predictions. Others, too, remain cautiously optimistic. Eric Topol suggests that while GFT is fraught with complications, the idea that big data collection could be applied to many different conditions is what we need to focus on.

Cross-References

▶ Bioinformatics
▶ Biosurveillance
▶ Correlation Versus Causation
▶ Data Mining Algorithms
▶ Google

Further Reading

Bilton, N. Disruptions: Data without context tells a misleading story. The New York Times: Bits Blog. http://bits.blogs.nytimes.com/2013/02/24/disruptions-google-flu-trends-shows-problems-of-big-data-without-context/?_php=true&_type=blogs&_r=0. Accessed 27 Aug 2014.
Blog.google.org. Official google.org Blog: Flu Trends updates model to help estimate flu levels in the US. http://blog.google.org/2013/10/flu-trends-updates-model-to-help.html. Accessed 27 Aug 2014.
Cook, S., Conrad, C., Fowlkes, A. L., & Mohebbi, M. H. (2011). Assessing Google Flu Trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PloS One, 6(8), e23610. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0023610. Accessed 27 Aug 2014.
Google.com. Privacy Policy – Privacy & Terms – Google. http://www.google.com/intl/en-US/policies/privacy/#infosecurity. Accessed 27 Aug 2014.
Helft, M. Is there a privacy risk in Google Flu Trends? The New York Times: Bits Blog. http://bits.blogs.nytimes.com/2008/11/13/does-google-flu-trends-raises-new-privacy-risks/?_php=true&_type=blogs&_r=0. Accessed 26 Aug 2014.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. The parable of Google Flu: Traps in big data analysis. Science, 343(6176), 1203–1205. http://www.sciencemag.org/content/343/6176/1203.full. Accessed 27 Aug 2014.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. Google Flu still appears sick: An evaluation of the 2013–2014 flu season. http://gking.harvard.edu/publications/google-flu-trends-still-appears-sick%C2%A0-evaluation-2013%E2%80%902014-flu-season. Accessed 26 Aug 2014.
Mayer-Schonberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. New York: Houghton Mifflin.
Salzberg, S. Why Google Flu is a failure. Forbes. http://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/. Accessed 27 Aug 2014.
Topol, E., Hill, D., & Tantor Media. (2012). The creative destruction of medicine: How the digital revolution will create better health care. New York: Basic Books.

Governance

Sergei A. Samoilenko1 and Marina Shilina2
1 George Mason University, Fairfax, VA, USA
2 Moscow State University (Russia), Moscow, Russia

The Impact of Big Data

The rise of Web 2.0 in the new millennium has drastically changed former approaches to information management. New social media applications, cloud computing, and software-as-a-service applications further contributed to the data explosion. The McKinsey Global Institute (2011) estimates that data volume is growing 40% per year and will continue to grow 44 times between 2009 and 2020. Interactive data, in particular, poses new challenges to enterprises that now have to deal with issues related to data quality and information life-cycle management. Companies constantly seek new ideas to better understand how to collect, store, analyze, and use big data in ways that are meaningful to them.

Big data is generally referred to within the context of "the three Vs" – volume, velocity, and variety. However, its polystructured nature has made it necessary to also view big data in the context of value, or ways of utilizing all kinds of data, including database content, log files, or web pages, in a cost-effective manner. In many ways, both internal data (e.g., enterprise application data) and external data (e.g., web data) have become a core business asset for an enterprise. Most organizations now recognize big data as an enterprise asset with financial value. They often use big data for predictive analytics to improve business results.
Big data has the potential to add value across all industry segments. For example, collecting sensor data through in-home health-care monitoring devices can help analyze elderly patients' health and vital statistics proactively. Health-care companies and medical insurance companies can then make timely interventions to save lives or prevent expenses by reducing hospital admission costs. In finance, capital markets generate large quantities of stock market and banking transaction data that can help detect fraud and maximize successful trades. Electronic sensors attached to machinery, oil pipelines, and equipment generate streams of incoming data that can be used preventively to avoid disastrous failures. Streaming media, smartphones, and other GPS devices offer advertisers an opportunity to target consumers when they are in close proximity to a store or a restaurant.

Why Big Data Governance

Big data governance is a part of a broader information governance program that manages policies relating to data optimization, privacy, and monetization. A report from the Institute for Health Technology Transformation demonstrates that a standardized format for data governance is essential for health-care organizations to leverage the power of big data.
The process of governance refers to profiling the data, understanding what it will be used for, and then determining the required level of data management and protection. In other words, information governance is the set of principles, policies, and processes that correspond to corporate strategy and define its operational and financial goals. These processes may include (a) following document policies relating to data quality, metadata, privacy, and information life-cycle management; (b) assigning new roles and responsibilities, such as data stewards for improving the quality of customer data; (c) monitoring compliance with data policies regulating the work of customer service and call center agents; and (d) managing various data issues, such as storing and eliminating duplicate records.
Big data governance determines what data is held, how and where it is held, and in what quality. According to Soares (2013), big data can be classified into five distinct types: web and social media, machine-to-machine, big transaction data, biometrics, and human-generated data. Organizations need to establish appropriate policies to prevent the misuse of big data and to assess the reputational and legal risks involved when handling various data. For example, a big data governance policy might state that an organization will not integrate a customer's Facebook profile into his or her master data record without that customer's informed consent.

Big Data Governance Management

A prerequisite to efficient data governance is proper data management. In order to minimize potential risks related to data misuse or privacy violations, a strong information management program should include a comprehensive data model supporting an enterprise's business applications, proper data management tools and methodology, and competent data specialists. A good data governance program assures adherence to privacy, security, and financial standards and to legal requirements. With effective information governance in place, business stakeholders tend to have greater trust and confidence in data.

According to Mohanty et al. (2013), every enterprise needs an ecosystem of business applications, data platforms to store and manage the data, and reporting solutions. The authors discuss an Enterprise Information Management (EIM) framework that allows companies to meet the information needs of their stakeholders in compliance with appropriate organizational policies. The first component of EIM is selecting the right business model. There are three types of organization models: the "decentralized model," the "shared services model," and the "independent model." The "decentralized" model enables rapid analysis and execution outcomes produced by separate analytics teams in various departments. At the same time, the insights generated are restricted to a particular business function, with little vision for strategic planning across an entire organization. The "shared services" model brings the analytics groups under a centralized management, which potentially slows down insight generation and decision-making. The "independent" model has direct executive-level reporting and can quickly streamline requirements; however, it often lacks specific insights from each department.
Next, an EIM program needs to assure efficient information management, so that data and content are managed properly and efficiently, and to create a reference architecture that integrates emerging technologies into the existing infrastructure. Business requirements and priorities often dictate which enterprise technology and architecture to follow. For example, if the company decides it would like to interact with its customers through mobile channels, then the enterprise technologies and architectures will need to make provisions for mobility. EIM helps establish a data-driven organization and culture and introduces new roles, such as data stewards, within the enterprise. The company's business priorities and road maps serve as a critical input for defining what kinds of business applications need to be built and when. For example, the Apache Hadoop ecosystem can be used for distributing very large data files across all the nodes of a very large grid of servers in a way that supports recovery from the failure of any node. Next, EIM helps in defining policies, standards, and procedures to find appropriate data models and data stores in an enterprise setup. For example, too many data models and data stores can cause severe challenges to the enterprise IT infrastructure and make it inefficient. Information life-cycle management is another process for monitoring the use of information by data officers throughout its life cycle, from creation through disposal, including compliance with legal, regulatory, and privacy requirements. Finally, EIM helps an organization estimate and address the regulatory risk that goes with data regulations and compliance. This helps industries like financial services and health care in meeting regulatory requirements, which are of the highest importance.
Soares (2012) introduces the IBM Information Governance Council Maturity Model as a necessary framework to address "the current state and the desired future state of big data governance maturity" (p. 28). This model is comprised of four groupings containing 11 categories:

Goals are the anticipated business outcomes of the information governance program, which focuses on reducing risk and costs and increasing value and revenues.
Enablers include the areas of organizational structures and awareness, stewardship, data risk management, and policy.
Core disciplines include data management, information life-cycle management, and information security and privacy.
Finally, supporting disciplines include data architecture, classification and metadata, and audit information logging and reporting.

Big Data Core Disciplines

Big data governance programs should be maintained according to new policies regarding the acceptable use of cookies, tracking devices, and privacy regulations. Soares (2012) addresses seven core disciplines of data governance.
The information governance organization assures the efficient integration of big data into an organizational framework by identifying the stakeholders in big data governance and assigning new roles and responsibilities.

According to the McKinsey Global Institute (2011), one of the biggest obstacles for big data is a shortfall of skills. With the accelerated adoption of deep analytical techniques, a 60% shortfall is predicted by 2018. Big data analytical capabilities include statistics, spatial analysis, semantics, interactive discovery, and visualization. The adoption of unstructured data from external sources and the increased demand for managing big data in real time require additional management functions.
According to Venkatasubramanian (2013), a data governance team comprises three layers. The executive layer is made up of senior management members who oversee the data governance function and ensure the necessary funding. A chief data officer (CDO) at this level is responsible for generating more revenue or decreasing costs through the effective use of data. The strategic layer is responsible for setting data characteristics, standards, and policies for the entire organization. Compliance officers integrate regulatory compliance and information retention requirements and help determine audit schedules. The legal team assesses information risk and determines whether information capture and deletion are legally defensible. Data scientists use statistical, mathematical, and predictive modeling to build algorithms in order to ensure that the organization effectively uses all data for its analytics. The tactical layer implements the assigned policies. Data stewards are required to assist data analysts in approving authorization of external data for business use. Data analysts conduct real-time analytics and use visualization platforms according to specific data responsibilities, such as processing master/transactional data, machine-generated data, social data, etc. According to Breakenridge (2012), public relations and communications professionals should also be engaged in the development of social media policies, training, and governance. This may include research or audit efforts to identify potential areas of concern related to their brand's social media properties. They should work with senior management to build the social media core team and to identify additional company policies that need to be incorporated into the social media policy (i.e., code of ethics, IT and computing policies, employee handbook, brand guidelines, etc.). The team will develop a communications plan to introduce the necessary training in policy enforcement for directors or managers and then roll out the social media policies to the overall employee population.
The big data metadata discipline refers to the organization of metadata, that is, data that describes other data's characteristics, such as its name, location, or value. The big data governance program integrates big data terms within the business glossary to define the use of technical terms and language within the enterprise. For example, the term "unique visitor" is a unit used to count individual users of a website. This important term may be used in click-stream analytics by organizations differently: either to measure unique visitors per month or per week. Also, organizations need to address data lineage and impact analysis to describe the state and condition of data as it goes through diverse application processes.
The introduction of new sources of external personal data can lead to a sudden security breach due to malware in the external data source and other issues. This could happen due to a lack of enterprise-wide data standards, minimal metadata management processes, inadequate data quality and data governance measures, unclear data archival policies, etc. A big data governance program needs to address two key practices related to the big data security and privacy discipline. First, it would need to address tools related to data masking. These tools are critical for de-identifying sensitive information, such as birth dates, bank account numbers, or social security numbers. Such tools use data encryption to convert plain text within a database into a format that is unreadable to outsiders. Database monitoring tools are especially useful when managing sensitive data. For example, call centers need to protect the privacy of callers when voice recordings contain sensitive information related to insurance, financial services, and health care. The Payment Card Industry (PCI) Security Standards Council suggests that organizations use technology to prevent the recording of sensitive data and securely delete sensitive data in call recordings after authorization.

The data quality discipline ensures that data is valid and accurate and can be trusted. Traditionally, data quality concerns relate to deciding on the data quality benchmarks that ensure the data will be fit for its intended use. This discipline also determines the measurement criteria for data quality, such as validity, accuracy, timeliness, and completeness. It includes clear communication of responsibilities for the creation, use, security, documentation, and disposal of information.
The business process integration program identifies key business processes that require big data, as well as key policies to support the governance of big data. For example, in the oil and gas industry, the big data governance program needs to establish policies around the retention period for sensor data such as temperature, flow, pressure, and salinity on an oil rig for the period of drilling and production.
Another discipline, called master data integration, refers to the process by which organizations enrich their master data with additional insight from big data. For example, they might want to link social media sentiment analysis with master data to understand whether a certain customer demographic is more favorably disposed to the company's products. The big data governance program needs to establish policies regarding the integration of big data into the master data management environment. It also seeks to organize customer data scattered across business systems throughout the enterprise. Each data record has specific attributes, such as a customer's contact information, that need to be complete and valid. For example, if an organization decides to merge Facebook data with other data, it needs to be aware that it cannot use data on a person's friends outside the context of the Facebook application. In addition, it needs to obtain explicit consent from the user before using any information other than basic account information such as name, e-mail, gender, birthday, current city, etc.
The components of big data life-cycle management include (a) information archiving of structured and unstructured information; (b) compliance with laws and regulations that determine how long documents should be kept and when they should be destroyed; and (c) legal holds and evidence collection, requiring companies to preserve potential evidence such as e-mail, instant messages, Microsoft Office documents, social media, etc.

Government Big Data Policies and Regulations

Communications service providers (CSPs) now have access to more complete data on network events, location, web traffic, channel clicks, and social media. The recently increased volume and types of biometric data require strict governance relating to privacy and data retention. Many CSPs actively seek to monetize their location data by selling it to third parties or by developing new services. However, big data should be used with consideration of the ethical and legal concerns and associated risks.
For many years, the European Union has established a formalized system of privacy legislation, which is regarded as more rigorous than the one in the USA. Companies operating in the European Union are not allowed to send personal data to countries outside the European Economic Area unless there is a guarantee that it will receive adequate levels of protection at a country level or at an organizational level. According to the European Union legal framework, employers may only adopt geolocation technology when it is demonstrably necessary for a legitimate purpose and the same goals cannot be achieved with less intrusive means. The European Union Article 29 Data Protection Working Party states that providers of geolocation applications or services should implement retention policies that ensure that geolocation data, or profiles derived from such data, are deleted after a "justified" period of time. In other words, an employee must be able to turn off monitoring devices outside of work hours and must be shown how to do so.
In January of 2012, the European Commission came up with a single law, the General Data Protection Regulation (GDPR), which was intended to unify data protection within the European Union (EU).
This major reform proposal was expected to become law in 2015. A proposed set of consistent regulations across the European Union would protect Internet users from clandestine tracking and unauthorized personal data usage. This new legislation would consider the important aspects of globalization and the impact of social networks and cloud computing. The Data Protection Regulation will also hold companies accountable for various types of violations based on their harmful effect. In the USA, data protection law is comprised of a patchwork of federal and state laws and regulations, which govern the treatment of data across various industries and business operations. The US legislation has been more lenient with respect to web privacy. Normally, the Cable Act (47 USC § 551) and the Electronic Communications Privacy Act (18 USC § 2702) prohibit operators and telephone companies from offering telephony services without the consent of clients and also prevent disclosure of customer data, including location. When a person uses a smartphone to place a phone call to a business, that person's wireless company cannot disclose his or her location information to third parties without first getting express consent. However, when that same person uses that same phone to look that business up on the Internet, the wireless company is legally free to disclose his or her location. While no generally applicable law exists, some federal laws govern privacy policies in specific circumstances, such as:

US-EU Safe Harbor is a streamlined process for US companies to comply with the EU Directive 95/46/EC on the protection of personal data. Intended for organizations within the EU or USA, the Safe Harbor Principles are designed to prevent accidental information disclosure or loss of customer data.

The Children's Online Privacy Protection Act (COPPA) of 1998 affects websites that knowingly collect information about or are targeted at children under the age of 13. Any such websites must post a privacy policy and adhere to enumerated information-sharing restrictions. Operators are required to take reasonable steps to ensure that children's personal information is disclosed only to service providers and third parties capable of maintaining the confidentiality, security, and integrity of such information. The law requires businesses whose apps and websites are directed at children to give parental notice and obtain consent before permitting third parties to collect children's personal information through plug-ins. At the same time, it only requires that personal information collected from children be retained only "as long as is reasonably necessary to fulfill the purpose for which the information was collected."

The Health Insurance Portability and Accountability Act (HIPAA) requires notice in writing of the privacy practices of health-care services. If someone posts a complaint on Twitter, the health plan might want to post a limited response and then move the conversation offline. The American Medical Association requires physicians to maintain appropriate boundaries within the patient-physician relationship according to professional ethical guidelines and to separate personal and professional content online. Physicians should be cognizant of patient privacy and confidentiality and must refrain from online postings of identifiable patient information.

The Geolocation Privacy and Surveillance Act (GPS Act), introduced in the US Congress in 2011, seeks to state clear guidelines for government agencies, commercial entities, and private citizens pertaining to when and how geolocation information can be accessed and used. The bill requires government agencies to get a probable cause warrant to obtain geolocation information such as signals from mobile phones and global positioning system (GPS) devices. The GPS Act also prohibits businesses from disclosing geographical tracking data about their customers to others without the customers' permission.

The Genetic Information Nondiscrimination Act of 2008 prohibits discrimination in health coverage and employment based on genetic information. Although this act does not extend to life insurance, disability insurance, or long-term care insurance, most states also have specific laws that prohibit the use of genetic information in these contexts.
Collection departments may use customer information from social media sites to conduct "skip tracing" to get up-to-date contact information on a delinquent borrower. However, they have to adhere to regulations such as the US Fair Debt Collection Practices Act (FDCPA) to prevent collectors from harassing debtors or infringing on their privacy. Also, collectors would be prohibited from creating a false profile to friend a debtor on Facebook or tweeting about an individual's debt.

Today, facial recognition technology enables the identification of an individual based on his or her facial characteristics publicly available on social networking sites. Facial recognition software with data mining algorithms and statistical re-identification techniques may be able to identify an individual's name, location, interests, and even the first five digits of the individual's social security number. The Federal Trade Commission (2012) offers recommendations for companies to disclose to consumers that the facial data they use might be used to link them to information from third parties or publicly available sources.

Some states have implemented more stringent regulations for privacy policies. The California Online Privacy Protection Act of 2003 – Business and Professions Code sections 22575–22579 – requires "any commercial web sites or online services that collect personal information on California residents through a web site to conspicuously post a privacy policy on the site." According to Segupta (2013), in 2014 California passed three online privacy bills. One gives children the right to erase social media posts, another makes it a misdemeanor to publish identifiable nude pictures online without the subject's permission, and a third requires companies to tell consumers whether they abide by "do not track" signals on web browsers. In 2014 Texas passed a bill that requires warrants for e-mail searches, while Oklahoma enacted a law meant to protect the privacy of student data. At least three states proposed measures to regulate who inherits digital data, including Facebook passwords, when a user dies. In March 2012 Facebook released a statement condemning employers for asking job candidates for their Facebook passwords. In April 2012, the state of Maryland passed a bill prohibiting employers from requiring employees to provide access to their social media content.

In 2014 the 11th US Circuit Court of Appeals issued a major opinion extending Fourth Amendment protection to cell phones even when searched incident to an arrest. Police need a warrant to track the cell phones of criminal suspects. Investigators must obtain a search warrant from a judge in order to obtain cell phone tower tracking data that is widely used as evidence to show suspects were in the vicinity of a crime. As such, obtaining the records without a search warrant is a violation of the Fourth Amendment's ban on unreasonable searches and seizures.

According to Byers (2014), "while most mobile companies do have privacy policies, but they aren't often communicated to users in a concise or standardized manner." The National Telecommunications and Information Administration suggested a transparency blueprint, designed in 2012 and 2013, that called for applications to clearly describe what kinds of information (e.g., location, browser history, or biometric data) they collect and share. While tech companies (e.g., Google and Facebook) have tried to be more clear about the data they collect and use, most companies still refuse to adhere to such a code of conduct due to increased liability concerns. A slow adoption of such guidelines is also partially due to the government's failure to push technology companies to act upon ideas for policy change. In May 2014, the President's Council of Advisors on Science and Technology (PCAST) released a new report, Big Data and Privacy: A Technological Perspective, which details the technical aspects of big data and new concerns about the nature of privacy and the means by which individual privacy might be compromised or protected. In addition to a number of recommendations related to developing privacy-related technologies, the report recommends that Congress pass national data breach legislation, extend privacy protections to non-US citizens, and update the Electronic Communications Privacy Act, which controls how the government can access e-mail.
Further Reading

Breakenridge, D. (2012). Social media and public relations: Eight new practices for the PR professional. New Jersey: FT Press.
Byers, A. (2014). W.H.'s privacy effort for apps is stuck in neutral. Politico. p. 33.
Federal Trade Commission. (1998). Children's online privacy protection rule ("COPPA"). Retrieved from http://www.ftc.gov/enforcement/rules/rulemaking-regulatory-reform-proceedings/childrens-online-privacy-protection-rule.
Federal Trade Commission. (2012). Protecting consumer privacy in an era of rapid change: Recommendations for businesses and policymakers. Retrieved from http://www.ftc.gov/reports/protecting-consumer-privacy-era-rapid-change-recommendations-businesses-policymakers.
Institute for Health Technology Transformation. (2013). Transforming health care through big data: Strategies for leveraging big data in the health care industry. Retrieved from http://ihealthtran.com/wordpress/2013/03/iht%C2%B2-releases-big-data-research-report-download-today/.
McKinsey Global Institute. (2011, May). Big data: The next frontier for innovation, competition, and productivity. Retrieved from http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation.
Mohanty, S., Jagadeesh, M., & Srivatsa, H. (2013). Big data imperatives: Enterprise 'big data' warehouse, 'BI' implementations and analytics (the Expert's voice). New York: Apress.
Segupta, S. (2013). No action in Congress, so states move to enact privacy laws. Star Advertiser. Retrieved from http://www.staradvertiser.com/news/20131031_No_Action_In_Congress_So_States_Move_To_Enact_Privacy_Laws.html?id=230001271.
Soares, S. (2012). Big data governance. Information Asset, LLC.
Soares, S. (2013). A platform for big data governance and process data governance. Boise, ID: MC Press Online, LLC.
The President's Council of Advisors on Science and Technology. (2014). Big data and privacy: A technological perspective. Retrieved from http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf.
Venkatasubramanian, U. (2013). Data governance for big data systems [White paper]. Retrieved from http://www.lntinfotech.com/resources/documents/datagovernanceforbigdatasystems_whitepaper.pdf.

Governance Instrument

▶ Regulation

Granular Computing

Davide Ciucci
Università degli Studi di Milano-Bicocca, Milan, Italy

Introduction

Granular Computing (GrC) is a recent discipline that deals with representing and processing information in the form of information granules, or simply granules, that arise in the process of data abstraction and knowledge extraction from data. The concept of information granularity was introduced by Zadeh in 1979 (Zadeh 1979); however, the term granular computing was coined by Lin in 1997 (Lin 1997), and in the same year it was used again by Zadeh (Zadeh 1997).

According to Zadeh, an information granule is a chunk of knowledge made of different objects "drawn together by indistinguishability, similarity, proximity or functionality" (Zadeh 2008). A granule is related to uncertainty management in the sense that it represents a lack of knowledge about a variable X. Indeed, instead of assigning it a precise value u, we use a granule, representing "some information which constrains possible values of u" (Zadeh 2008).

GrC is meant to group under the same formal framework a set of techniques and tools exploiting abstraction for approximate reasoning, decision theory, data mining, machine learning, and the like.

At present, it is not yet a formalized theory with a unique methodology, but it can rather be viewed as a unifying discipline of different fields of research. Indeed, it includes or intersects interval analysis, rough set theory, fuzzy set theory, interactive computing, and formal concept analysis, among others (Pedrycz et al. 2008).

Granule and Level Definition

The main concepts of GrC are of course granule and levels of granularity, which are closely
related: a level is the collection of granules of similar nature. Each level gives a different point of view (sometimes called a granular perspective (Keet 2008)) on the subject under investigation.

To take a simple and typical example, structured writing can be described through the GrC paradigm. An article or a book can be viewed at different levels of granularity, from top to bottom: the article (book) itself, chapters, sections, paragraphs, and sentences.

We can move from one level to another: going from top to bottom, a whole is decomposed into parts through a refinement process; or, the other way round, going to an upper level, parts are merged into wholes by a generalization process. For instance, in an animal taxonomy, the category of felines can be split into tigers, cats, lions, etc., or, in the opposite direction, Afghan Hound, Chow Chow, and Siberian Husky can all be seen as dogs at a more general (abstract) level.

Thus, according to the level of granularity taken into account, i.e., to the point of view, a granule "may be an element of another granule and is considered to be a part forming the other granule. It may also consist of a family of granules and is considered to be a whole" (Yao 2008).

Granulation can be characterized according to different dimensions: whether it is scale-based or not scale-dependent; the relationship between two levels; whether the focus is on the granules or on the levels; and the mathematical representation. This leads to a taxonomy of types of granularity; for a detailed discussion on this point, we refer to Keet (2008).

The idea of a granule is quite akin to that of a cluster: the elements of the same granule should be related, whereas the elements of two different granules should be sufficiently different to be separated. Thus, it is clear that all clustering algorithms can be used to granulate the universe. In particular, hierarchical algorithms produce not only the granulation at a fixed level but the whole hierarchical structure. Other typical tools to build granules from data arise in the computational intelligence field: rough sets, fuzzy sets, interval computation, shadowed sets, and formal concept analysis. Many of these tools are also typical of knowledge representation in the presence of uncertainty, as described in the following.

Rough set theory is a set of mathematical tools to represent imprecise and missing information and data mining tools to perform feature selection, rule induction, and classification. The idea of granulation is at the core of the theory; indeed, the starting point is a relation (typically, equivalence or similarity) used to group indiscernible or similar objects. This relation is defined on object features; thus, two objects are related if they have equal/similar values for the features under investigation. For instance, two patients are indiscernible if they have the same symptoms. The obtained granulation is thus a partition (in the case of an equivalence relation) or a covering (in the case of a weaker relation) of the universe, based on the available knowledge, that is, the features under investigation (which may also contain missing values).

Fuzzy sets are a generalization of Boolean sets, where each object is associated with a membership degree (typically a value in [0,1]) to a given subset of the universe, representing the idea that membership can be total or partial. A linguistic variable is then defined as a collection of fuzzy sets describing the same variable. For example, the linguistic variable Temperature can have values low, medium, or high, and these are defined as fuzzy sets on the range of temperature degrees. It turns out that these values (i.e., low, medium, and high) can be seen as graduated granular values. That is, medium temperature is not described by a unique and precise value but by a graduated collection of values (Zadeh 2008).
Shadowed sets can be seen as a simplified and more computationally treatable form of fuzzy sets, where uncertainty is localized according to a formal criterion.

Interval computation is based on the idea that measurements are always imprecise. Thus, instead of representing a measurement with a single precise value, an interval is used. Intervals can also be the result of the discretization of a continuous variable. A calculus with intervals is then needed to compute with intervals. In this approach, any interval is a granule; thus, different granulations can differ in precision and scale. An important aspect with respect to the discretization of continuous variables is the definition of the suitable granularity to adopt: it should be specific enough to address the problem at hand and make the desired characteristics emerge, yet avoid intractability due to too high a level of detail.

Formal concept analysis is a formal framework based on lattice theory, aimed at creating a concept hierarchy (named a concept lattice) from a set of objects, where each concept contains the objects sharing the same features. Hence, a concept can be viewed as a granule and the concept lattice as the hierarchy of levels.

Granular Computing and Big Data

GrC is a way to organize data at a suitable level of abstraction, by ignoring irrelevant details, before further processing. This is useful to handle noise in data and to reduce the computational effort. Indeed, a granule becomes a single point at the upper level, thus reducing the data volume. In this way, an approximate solution is obtained, which can be refined in a following step if needed. This idea is explored in (Slezak et al. 2018) by providing "an engine that produces high value approximate answers to SQL statements by utilizing granulated summaries of input data". It is to be noticed that in order to cope with big data, the engine gives approximate answers, so one should be aware that velocity comes at the price of precision.

Moreover, the representation of granules in a hierarchy makes it possible to represent and analyze the available data from different perspectives according to different granulations (multiview) or at different levels of abstraction (multilevel). From a more general and philosophical perspective, this possibility is also seen as a way to reconcile reductionism with system theory, since it preserves both the split of a whole into parts and the emerging behavior at an upper level in the hierarchy (Yao 2008).

Conclusion

Granular computing is an emerging discipline aimed at representing and analyzing data in the form of chunks of knowledge, the granules, connected in a hierarchical structure, and it exploits abstraction to reach its goals. As such, it complies with the human way of thinking and knowledge organization. In big data, granular computing can be used to reduce volume and to provide different points of view on the same data.

Cross-References

▶ Data Mining
▶ Ontologies

Further Reading

Keet, C. M. (2008). A formal theory of granularity. Ph.D. thesis, KRDB Research Centre, Faculty of Computer Science, Free University of Bozen-Bolzano, Italy.
Lin, T. Y. (1997). Granular computing: From rough sets and neighborhood systems to information granulation and computing in words. In Proceedings of the European congress on intelligent techniques and soft computing (pp. 1602–1606). Aachen: Germany.
Pedrycz, W., Skowron, A., & Kreinovich, V. (Eds.). (2008). Handbook of granular computing. Chichester: Wiley.
Slezak, D., Glick, R., Betlinski, P., & Synak, P. (2018). A new approximate query engine based on intelligent capture and fast transformations of granulated data summaries. Journal of Intelligent Information Systems, 50(2), 385–414.
Yao, Y. Y. (2008). Granular computing: Past, present and future. In Proceedings of the 2008 IEEE international conference on granular computing. Hangzhou: China.
Zadeh, L. (1979). Fuzzy sets and information granularity. In N. Gupta, R. Ragade, & R. Yager (Eds.), Advances in fuzzy set theory and applications (pp. 3–18). Amsterdam: North-Holland.
Zadeh, L. (1997). Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90(2), 111–127.
Zadeh, L. (2008). Is there a need for fuzzy logic? Information Sciences, 178, 2751–2779.
Graph-Theoretic Computations/Graph Databases

John A. Miller1, Arash Jalal Zadeh Fard1,2 and Lakshmish Ramaswamy1
1 Department of Computer Science, University of Georgia, Athens, GA, USA
2 Vertica (Hewlett Packard Enterprise), Cambridge, MA, USA

Introduction

The new millennium has seen a very dramatic increase in applications using massive datasets that can be organized in the form of graphs. Examples include Facebook, LinkedIn, Twitter, and ResearchGate. Consequently, a branch of big data analytics called graph analytics has become an important field for both theory and practice. Although its foundations come from graph theory and graph algorithms, graph analytics focuses on computations on large graphs, often to find interesting paths or patterns.

Graphs come in two flavors, undirected and directed. One may think of an undirected graph as having locations connected with two-way streets and a directed graph as having one-way streets. As directed graphs support more precision of specification and can simulate the connectivity of undirected graphs by replacing each undirected edge {u, v} with two directed edges (u, v) and (v, u), this reference will focus on directed graphs.

More formally, a directed graph or digraph may be defined as a two-tuple G(V, E) where we have:

V = set of vertices
E ⊆ V × V (set of edges)    (1)

A directed edge e ∈ E is an ordered pair of vertices e = (u, v) where u ∈ V and v ∈ V. Given a vertex u, the set {v | (u, v) ∈ E} is referred to as the children of u. Parents can be defined by following the edges in the opposite direction. Directed graphs are widely used in social networks. For example, V could be the set of registered users and E could represent a follows relationship, say in ResearchGate.

In graph analytics and databases, vertices and/or edges often have labels attached to provide greater information content. Formally, labels may be added to a digraph through functions mapping vertices and/or edges to labels:

l_v : V → L_v (vertex labeling function)
l_e : E → L_e (edge labeling function)    (2)

Many applications in graph analytics work with multi-digraphs that allow multiple edges from a vertex u to a vertex v, so long as the edge labels are distinct. As an example, let V be registered users in ResearchGate and E represent relationships between these users/researchers. Given three vertices u, v, and w, the vertices and edges could be labeled as follows: l_v(u) = "PhD Candidate", l_v(v) = "Post-Doc", l_v(w) = "Professor", l_e(u, w) = "follows", l_e(v, w) = "cites", and l_e(w, v) = "reads" (Fig. 1).

Graph-Theoretic Computations/Graph Databases, Fig. 1 An example graph in ResearchGate
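To make the labeling functions concrete, the following minimal Python sketch (an illustration, not code from the entry) represents the Fig. 1 example as a vertex-labeled, edge-labeled digraph using plain dictionaries.

# Vertex labels: l_v maps each vertex to its label.
l_v = {"u": "PhD Candidate", "v": "Post-Doc", "w": "Professor"}

# Edge labels: l_e maps each directed edge (source, target) to its label.
# Using (source, target, label) keys would allow a multi-digraph with
# several distinctly labeled edges between the same pair of vertices.
l_e = {("u", "w"): "follows", ("v", "w"): "cites", ("w", "v"): "reads"}

V = set(l_v)   # vertex set
E = set(l_e)   # edge set, a subset of V x V

def children(u):
    """Vertices reachable from u by a single directed edge."""
    return {t for (s, t) in E if s == u}

def parents(v):
    """Vertices with a directed edge into v."""
    return {s for (s, t) in E if t == v}

print(children("u"))   # {'w'}
print(parents("v"))    # {'w'}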
While graph theory can be traced back to work by Euler in the eighteenth century, work on graph algorithms began in earnest in the 1950s (e.g., the Bellman-Ford-Moore algorithm and Dijkstra's algorithm for finding shortest paths). Over the years, a wide range of graph computations and algorithms have been developed. For big data graph analytics, work can be categorized into four main areas: Paths in Graphs, Graph Patterns, Graph Partitions, and Graph Databases.

Paths in Graphs

Many problems in graph analytics involve finding paths in large graphs; e.g., what is the connection between two users in a social networking application, or what is the shortest route from address A to address B?

An intuitive way to define a path is to define it in terms of a trail. A trail of length n in a digraph or multi-digraph can be defined as a sequence of non-repeating edges

t = (e_1, ..., e_n)    (3)

such that for all i, the consecutive edges e_i, e_{i+1} must be of the form (u, v) and (v, w). A (simple) path p is then just a trail in which there are no repeating vertices. More specifically, path(u, v) is a path beginning with vertex u and ending with vertex v.

The existence of a path from u to v means that v is reachable from u. In addition to finding a path from u to v, some applications may be interested in finding all paths (e.g., evidence gathering in an investigation) or a sufficient number of paths (e.g., a reliability study or traffic flow).

As indicated, the length len(path(u, v)) is the number of edges in the path. The weighted length wlen(path(u, v)) is the sum of the edge labels, taken as weights, along the path.

A particular path path_s(u, v) is a shortest path when len(path_s(u, v)) is the least among all paths from u to v (it may also be defined in terms of wlen).

The position of a vertex within an undirected graph can also be defined in terms of paths. The eccentricity of a vertex u ∈ V is defined as the length of the maximum shortest path from u to any other vertex:

ecc(u) = max {len(path_s(u, v)) | v ∈ V}    (4)

Now, the radius of a (connected) graph is simply the minimum eccentricity, while the diameter of a graph is the maximum eccentricity. Although eccentricity can be defined for digraphs, typically eccentricity, radius, and diameter are given in terms of the underlying undirected graph (directed edges turned into undirected edges).

Issues related to paths/connectivity include measures of influence in social media graphs. Measures of influence of a vertex v (e.g., a Twitter user) include indegree, outdegree, and (undampened) PageRank (PR):

indegree(v) = |{u | (u, v) ∈ E}|
outdegree(v) = |{w | (v, w) ∈ E}|
PR(v) = Σ {PR(u) / outdegree(u) | (u, v) ∈ E}    (5)
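As a rough illustration of the formulas in (5), the sketch below computes indegree, outdegree, and a few fixed-point iterations of the undampened PageRank recurrence on a small, hypothetical follower graph; the edge list is invented for illustration.

# Hypothetical directed edges (follower -> followed).
E = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")}
V = {u for e in E for u in e}

indegree  = {v: sum(1 for (s, t) in E if t == v) for v in V}
outdegree = {v: sum(1 for (s, t) in E if s == v) for v in V}

# Undampened PageRank: PR(v) = sum of PR(u)/outdegree(u) over edges (u, v),
# computed here by simple fixed-point iteration from a uniform start.
pr = {v: 1.0 / len(V) for v in V}
for _ in range(50):
    pr = {v: sum(pr[u] / outdegree[u] for (u, t) in E if t == v) for v in V}

print(indegree)   # e.g., {'a': 1, 'b': 1, 'c': 3, 'd': 0}
print(outdegree)  # e.g., {'a': 2, 'b': 1, 'c': 1, 'd': 1}
print(pr)

Production systems typically add a damping factor to this recurrence; the undampened form is kept here to mirror Eq. (5).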
Graph Patterns

In simple terms, finding a pattern in a graph means finding a set of similar subgraphs in that graph. There are different models for defining similarity between two subgraphs, and we will introduce a few in this section. When unknown patterns need to be discovered (e.g., finding frequent subgraphs), it is called graph pattern mining. In comparison, when the pattern is known in advance and the goal is to find the set of its similar subgraphs, it is called graph pattern matching. In some applications of graph pattern matching, it is not the set of similar subgraphs that is important but its size. For example, counting the number of triangles in a graph is used in many applications of social networks (Tsourakakis et al. 2009).

The simplest form of pattern query is to take a query graph Q and match its labeled vertices to corresponding labeled vertices in a data graph G; i.e., pattern(Q, G) is represented by a multivalued function F:

F : Q.V → 2^{G.V} s.t. ∀ u′ ∈ F(u), l_v(u′) = l_v(u)    (6)

In addition to matching the labels of the vertices, patterns of connectivity should match as well. Common connectivity is established by examining edges (models may either ignore or take edge labels into account).

Traditional graph similarity matching can be grouped as graph morphism models. This group introduces complex and often quite constrained forms of pattern matching. The most famous models in this group are graph homomorphism and subgraph isomorphism.

• Graph homomorphism: It is a function f mapping each vertex u ∈ Q.V to a vertex f(u) ∈ G.V, such that (1) l_v(u) = l_v(f(u)) and (2) if (u, v) ∈ Q.E, then (f(u), f(v)) ∈ G.E. For graph pattern matching, all or a sufficient number of graph homomorphisms can be retrieved.
• Subgraph isomorphism: It is a more restrictive form of graph homomorphism where we simply change the mapping function f to a bijection onto a subgraph of G.

High computational complexity and the inability of these models to find certain meaningfully similar subgraphs in new applications have led to a more recently emerging group of graph pattern models called simulation. The major models in this group are graph simulation, dual simulation, strong simulation, strict simulation, tight simulation, and CAR-tight simulation.

• Graph simulation (Henzinger et al. 1995): Algorithms for finding graph simulation matches typically follow a simple approach. For each vertex u ∈ Q.V, initially compute the mapping set F(u) based on label matching. Then, repeatedly check the child match condition for all vertices to refine the mapping F until there is no change. The child match condition is simply that if u′ ∈ F(u), then the labels of the children of u′ that are themselves within F must include all the labels of the children of u.
• Dual simulation (Ma et al. 2011): It adds a parent match condition to graph simulation. The parent match condition is simply that if u′ ∈ F(u), then the labels of the parents of u′ that are themselves within F must include all the labels of the parents of u.
• Strong simulation (Ma et al. 2014): As dual simulation allows counterintuitive solutions that contain large cycles, various locality restrictions may be added to dual simulation to eliminate them. For strong simulation, any solution must fit inside a ball of radius equal to the diameter of the query graph Q.
• Strict simulation (Fard et al. 2013): Based on strong simulation, it applies dual simulation first to reduce the number of balls. Balls are only made from vertices that are in the image of F. This also reduces the number of solutions, making the results closer to those of traditional models like subgraph isomorphism.
• Tight simulation (Fard et al. 2014b): The solutions can be further tightened by reducing the number of balls and making them smaller. First a central vertex u_c (with ecc(u_c) equal to the radius) of the query graph Q is chosen, and then balls are created only for u′ ∈ F(u_c). In addition, the radius of the balls is now equal to the radius of Q, not its diameter as before.
• CAR-tight simulation (Fard et al. 2014a): Results even closer to subgraph isomorphism can be obtained by adding a further restriction to tight simulation. A cardinality restriction on child and parent matches pushes results toward one-to-one correspondences. This modification is referred to as cardinality restricted (CAR)-tight simulation.
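The following Python sketch is one possible reading of the graph simulation refinement just described (initial label matching followed by repeated enforcement of a child-match condition, here stated structurally rather than in terms of labels); the tiny query and data graphs are invented for illustration.

def graph_simulation(q_adj, q_label, g_adj, g_label):
    """Compute the graph simulation mapping F : Q.V -> subsets of G.V.

    q_adj/g_adj: dict vertex -> set of child vertices (directed edges).
    q_label/g_label: dict vertex -> label.
    """
    # Initial mapping based on label matching.
    F = {u: {x for x in g_label if g_label[x] == q_label[u]} for u in q_label}

    changed = True
    while changed:
        changed = False
        for u, q_children in q_adj.items():
            for v in q_children:                      # edge (u, v) in the query
                keep = set()
                for x in F[u]:
                    # Child match: x must have some child that can match v.
                    if g_adj.get(x, set()) & F[v]:
                        keep.add(x)
                if keep != F[u]:
                    F[u], changed = keep, True
    return F

# Invented example: a PhD Candidate who follows a Professor.
q_adj   = {"q1": {"q2"}, "q2": set()}
q_label = {"q1": "PhD Candidate", "q2": "Professor"}
g_adj   = {"a": {"b"}, "c": {"b"}, "b": set()}
g_label = {"a": "PhD Candidate", "b": "Professor", "c": "Post-Doc"}

print(graph_simulation(q_adj, q_label, g_adj, g_label))
# {'q1': {'a'}, 'q2': {'b'}}

If any F(u) becomes empty, the query has no simulation match in the data graph; dual simulation would add an analogous parent-match check.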
Figure 2 illustrates a simple example of subgraph pattern matching. For the given query Q (query graph), applying different pattern matching models on G (data graph) yields different results. In this figure, the numbers are the IDs of the vertices and the letters are their labels. Table 1 summarizes the results for different models, where F is a multivalued function (which could be represented as a relation) and f is a function from vertices of Q to vertices of G (f : Q.V → G.V). The table gives all such non-redundant functions/mappings. Subgraph pattern matching has applications in analyzing social networks, web graphs, bioinformatics, and graph databases.

Graph-Theoretic Computations/Graph Databases, Fig. 2 Example of subgraph pattern matching with different models (Q: Pattern; G: Data Graph; vertex labels A: Arts Book, B: Biography Book, C: Children's Book, M: Music CD)

Graph-Theoretic Computations/Graph Databases, Table 1 Results of different pattern matching models of Fig. 2 (model: subgraph results)
• Tight simulation: F(1, 2, 3, 4) → (1, 2, {3, 4, 5}, {3, 4, 5}), (12, 13, 14, 14)
• CAR-tight simulation: F(1, 2, 3, 4) → (1, 2, {3, 4, 5}, {3, 4, 5})
• Subgraph isomorphism: f(1, 2, 3, 4) → (1, 2, 3, 4), (1, 2, 3, 5), (1, 2, 4, 5)

Graph Partitions

Many problems in graph analytics can be sped up if graphs can be partitioned. A k-partition takes a graph G(V, E) and divides the vertex set V into k disjoint subsets V_i such that

⋃_{i=1}^{k} V_i = V    (7)

The usefulness of a partition is often judged positively by its evenness or size balance and negatively by the number of edges that are cut. Edge cuts result when an edge ends up crossing from one vertex subset to another. Each part of a partitioned graph is stored in a separate graph (either on a server with a large memory or on multiple servers in a cluster). Algorithms can then work in parallel on smaller graphs and combine results to solve the original problem. The extra work required to combine results is related to the number of cuts done in partitioning.

Although finding balanced min-cut partitions is an NP-hard problem, there are practical algorithms and implementations that do an effective job on very large graphs. One of the better software packages for graph partitioning is METIS (Karypis and Kumar 1995), as it tends to provide good balance with fewer edge cuts than alternative software. "METIS works in three steps: (1) coarsening the graph, (2) partitioning the coarsened graph, and (3) uncoarsening" (Wang et al. 2014). Faster algorithms that often result in more edge cuts than METIS include random partitioning and ordered partitioning, while label propagation partitioning trades off fewer edge cuts for less balance.

Related topics in graph analytics include graph clustering and finding graph components. In graph clustering, vertices that are relatively more highly interconnected are placed in the same cluster, e.g., friend groups. A subgraph is formed by including all edges (u, v) for which u and v are in the same cluster. At the extreme end, vertices could be grouped together so long as there exists a path(u, v) between any two vertices u and v in the group. The subgraphs formed from these groups are referred to as strongly connected components. Further, the subgraphs are referred to as weakly connected components if there is a path between any two vertices in the group in the underlying undirected graph (where directionality of edges is ignored).
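As a small illustration of the component definitions above, the sketch below finds weakly connected components by ignoring edge direction and running a breadth-first search; the vertex and edge sets are invented for illustration.

from collections import deque

def weakly_connected_components(vertices, edges):
    """Group vertices into weakly connected components of a digraph."""
    # Build the underlying undirected adjacency (directionality ignored).
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            x = queue.popleft()
            comp.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        components.append(comp)
    return components

V = {"a", "b", "c", "d", "e"}
E = {("a", "b"), ("b", "c"), ("d", "e")}
print(weakly_connected_components(V, E))  # e.g., [{'a', 'b', 'c'}, {'d', 'e'}]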
Graph Databases

A very large vertex- and edge-labeled multi-digraph where the labels are rich with information content can be viewed as a graph database. Property graphs, often mentioned in the literature, are extensions where a label (or property) is allowed to have multiple attributes (e.g., name, address, phone). For a graph database, the following capabilities should be provided: (1) persistent storage (it should be possible to access the data without completely loading a file), (2) update/transactional capability, and (3) a high-level query language. When data can be organized in the form of a graph, query processing in a graph database can be much faster than the alternative of converting the graph into relations stored in a relational database.

Examples of graph databases include Neo4j, OrientDB, and Titan (Angles 2012). In addition, Resource Description Framework (RDF) stores used in the Semantic Web are very similar to graph databases (certain restricted forms would qualify as graph databases). Query languages include Cypher, Gremlin, and SPARQL. The following example query written in the Cypher language (used by Neo4j) expresses the graph pattern discussed in the section "Introduction":

MATCH (u:PhDCandidate)-[:FOLLOWS]->(w:Professor),
      (v:PostDoc)-[:CITES]->(w),
      (w)-[:READS]->(v)
RETURN u, v, w

The answer would be all (or a sufficient number of) matching patterns found in the graph database, with the vertex variables u, v, and w replaced by actual ResearchGate users.

Query processing and optimization for graph databases (Gubichev 2015) include parsing a given query expressed in the query language, building an evaluation-oriented abstract syntax tree (AST), optimizing the AST, and evaluating the optimized AST. Bottom-up evaluation could be done by applying algorithms for graph algebra operators (e.g., selection, join, and expand) (Gubichev 2015). Where applicable, the pattern matching algorithms discussed in section "Graph Patterns" may be applied as well. In Neo4j, query processing corresponds to the subgraph isomorphism problem, while for RDF/SPARQL stores, it corresponds to the homomorphism problem (Gubichev 2015).

Conclusions

Graph analytics and databases are growing areas of interest. A brief overview of these areas has been given here. More detailed surveys and historical background may be found in the following literature: a historical view of graph pattern matching covering exact and inexact pattern matching is given in (Conte et al. 2004); research issues in big data graph analytics are given in (Miller et al. 2015); and a survey of big data frameworks supporting graph analytics, including Pregel and Apache Giraph, is given in (Batarfi et al. 2015).

Further Reading

Angles, R. (2012). A comparison of current graph database models. In IEEE 28th international conference on data engineering workshops (ICDEW) (pp. 171–177). Washington, DC: IEEE.
Batarfi, O., ElShawi, R., Fayoumi, A., Nouri, R., Beheshti, S. M. R., Barnawi, A., & Sakr, S. (2015). Large scale graph processing systems: Survey and an experimental evaluation. Cluster Computing, 18(3), 1189–1213.
Conte, D., Foggia, P., Sansone, C., & Vento, M. (2004). Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(03), 265–298.
Fard, A., Nisar, M. U., Ramaswamy, L., Miller, J. A., & Saltz, M. (2013). A distributed vertex-centric approach for pattern matching in massive graphs. In IEEE international conference on Big Data (pp. 403–411). Washington, DC: IEEE.
Fard, A., Manda, S., Ramaswamy, L., & Miller, J. A. (2014a). Effective caching techniques for accelerating pattern matching queries. In IEEE international conference on Big Data (Big Data) (pp. 491–499). Washington, DC: IEEE.
Fard, A., Nisar, M. U., Miller, J. A., & Ramaswamy, L. (2014b). Distributed and scalable graph pattern matching: Models and algorithms. International Journal of Big Data, 1(1), 1–14.
Gubichev, A. (2015). Query processing and optimization in graph databases (PhD thesis). Technische Universität München, München.
Henzinger, M. R., Henzinger, T. A., & Kopke, P. W. (1995). Computing simulations on finite and infinite graphs. In Proceedings, 36th annual symposium on foundations of computer science (pp. 453–462). Washington, DC: IEEE.
Karypis, G., & Kumar, V. (1995). Analysis of multilevel graph partitioning. In Proceedings of the 1995 ACM/IEEE conference on supercomputing (p. 29). New York: ACM.
Ma, S., Cao, Y., Fan, W., Huai, J., & Wo, T. (2011). Capturing topology in graph pattern matching. Proceedings of the VLDB Endowment, 5(4), 310–321.
Ma, S., Cao, Y., Fan, W., Huai, J., & Wo, T. (2014). Strong simulation: Capturing topology in graph pattern matching. ACM Transactions on Database Systems (TODS), 39(1), 4.
Miller, J. A., Ramaswamy, L., Kochut, K. J., & Fard, A. (2015). Directions for big data graph analytics research. International Journal of Big Data (IJBD), 2(1), 15–27.
Tsourakakis, C. E., Kang, U., Miller, G. L., & Faloutsos, C. (2009). Doulion: Counting triangles in massive graphs with a coin. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 837–846). New York: ACM.
Wang, L., Xiao, Y., Shao, B., & Wang, H. (2014). How to partition a billion-node graph. In IEEE 30th international conference on data engineering (ICDE) (pp. 568–579). Washington, DC: IEEE.

H

Harnessing the Data Revolution

▶ Big Data Research and Development Initiative (Federal, U.S.)

HDR

▶ Big Data Research and Development Initiative (Federal, U.S.)

Health Care Delivery

Paula K. Baldwin
Department of Communication Studies, Western Oregon University, Monmouth, OR, USA

Peter Groves, Basel Kayyali, David Knott, and Steve Van Kuiken note that the evolution and use of big data are in their formative stages, and their true potential has yet to be revealed. Mike Cottle, Waco Hoover, Shadaab Kanwal, Marty Kohn, Trevor Strome, and Neil W. Treister write that in 2011, US health care data totaled 150 exabytes, and that number is increasing. To put that figure in a relatable perspective, an exabyte equals 10^18 bytes (one billion gigabytes), and five exabytes is calculated to be the sum of all words in the human vocabulary. In addition, Cottle, Hoover, Kanwal, Kohn, Strome, and Treister note that there are five separate categories of big data relating specifically to health and health care delivery. First, there are web and social media data, which also include health plan websites and smart phone apps, to name a few. Second, there are the machine-to-machine data that originate from sensors, meters, and other devices. Third on the list is big transaction data consisting of health care claims and other billing records. Fourth is biometric data consisting of fingerprints, genetics, handwriting, retinal scans, and other similar types of data, including x-rays and other types of medical imaging. Finally, there is the data generated by electronic medical records (EMRs), health care providers' notes, electronic correspondence, and paper documents.

Other industries such as retail and banking have embraced utilizing big data to benefit both the organization and the consumer, but the health care industry is behind in that process. Catherine M. DesRoches, Dustin Charles, Michael F. Furukawa, Maulik S. Joshi, Peter Kralovec, Farzad Mostashari, Chantal Worzala, and Ashish K. Jha report that in 2012, only 44% of US hospitals reported using a basic electronic health records (EHRs) system, and rural and nonteaching
hospitals lag significantly behind in adopting EHRs systems.

As the exploration of possible uses of big data in health care delivery continues, the potential increases to affect both the health care provider and the health care recipient positively. However, as Groves, Kayyali, Knott, and Van Kuiken write, the health care industries suffer from several inhibitors: resistance to change, lack of sufficient investment in technology, privacy concerns for health care recipients, and lack of technology for integrating data across multiple systems.

Creating Health Care Delivery Systems

IBM's Information Management identified three areas important to a successful health care delivery transition. First, build health care systems that can efficiently manage resources and improve patient care while reducing the cost of care. Second, health care organizations should focus on improving the quality and efficiency of care by developing a deep understanding of health care recipients' needs. Finally, in order to fully engage with all segments of the US population, emphasis on increasing access to health care is crucial.

Health Care Delivery System Benefits

The Healthcare Leadership Council identified three key benefits for health care recipients. First, both individuals and families will have multiple options available to them for their health care delivery, with the focus being on the "right treatment at the right time in the right place to each patient." Second, shifting the emphasis of health care delivery to better long-term value rather than the current strategy of reducing short-term costs will provide more economic relief for health care recipients. Finally, the development and implementation of a universal health care delivery system will create a major shift in the health care industry itself, away from an industry made up of disparate parts and toward an integrated model for providing premium care for all health care recipients.

Challenges for Health Care Delivery Systems

The Healthcare Leadership Council identified two areas critical to the future use and implementation of big data in health care delivery systems. First, a new platform for linking health care recipients' medical records and health care must be developed, and second, the USA must make a serious commitment to health care research and development as well as the education of future generations of health care providers. Erin McCann seconds that and writes that the biggest challenge in effectively caring for patients today is that data on patients come from different institutions and different states, all using multiple data tracking systems. In order for the patient information to be useful, technology must develop a universal platform through which the various tracking systems and electronic health records (EHRs) can communicate accurately.

Communication between the technology and the health care provider is challenged and driven by the health care recipients themselves as they change doctors, institutions, or insurance. These changes are driven by changes in locale, changes in health care needs, and other life changes; therefore, patient engagement in the design of these changes is paramount. In order for health care delivery to be successful, the communication platform must be able to identify and adapt to those changes. The design and implementation of a common platform for the different streams of medical information continue to evolve as the technology advances. With these advances, health care recipients in rural and urban settings will have equal access to health care.

Cross-References

▶ Electronic Health Records (EHR)
▶ Health Care Delivery
▶ Health Informatics
Further Reading

Cottle, M., et al. (2013). Transforming health care through big data: Strategies for leveraging big data in the health care industry. Institute for Health Technology Transformation. Washington, D.C.
DesRoches, C. M., et al. (2013). Adoption of electronic health records grows rapidly, but fewer than half of U.S. hospitals had at least a basic system. Health Affairs, 32(8), 1478–1485.
Groves, P., et al. (2014). The 'Big Data' revolution in healthcare: Accelerating value and innovation. Center for US health system reform business technology office, McKinsey & Company.
Healthcare Leadership Council. (n.d.). Key issues. http://www.hlc.org/key-issues/. Accessed Nov 2014.
IBM. (n.d.). Harness your data resources in healthcare. Big data at the speed of business. http://www-01.ibm.com/software/data/bigdata/industry-healthcare.html. Accessed Nov 2014.
McCann, E. (2014). No interoperability? Goodbye big data. Healthcare IT News.
McCarthy, R. L., et al. (2012). Introduction to health care delivery. Sudbury/Mass: Jones & Bartlett Learning.

Health Informatics

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Background

The growth of informatics as a technical discipline reflects the increased computerization of business operations in both the private and government sectors. Informatics focus on how information technologies (IT) are applied in social, cultural, organizational, and economic settings. Although informatics have their genesis in the mainframe computer era, it has only been since the 1980s that health informatics, as Marsden S. Blois notes, have gained recognition as a technical discipline by concentrating on the information requirements of patients, health-care providers, and payers. Health informatics also support the requirements of researchers, vendors, and oversight agencies at the federal, state, and local levels.

The health informatics domain is extensive, encompassing not only patient health and continuity of care but also epidemiology and public health. Due to the increased use of IT in health-care delivery and management, the purview of health informatics is expected to continue to grow.

The advent of the Internet; the availability of inexpensive high-speed computers, voice recognition and mobile technologies, and large data sets in diverse formats from diverse sources; and the use of social media have provided opportunities for health-care professionals to incorporate IT applications in their practices. In the United States, the impetus for health informatics came during 2008–2010. Under the Health Information Technology (HIT) for Economic and Clinical Health (HITECH) component of the American Recovery and Reinvestment Act of 2009 (ARRA), the Centers for Medicare and Medicaid Services (CMS) reimburse health service providers for using electronic documents in formats certified to comply with HITECH's Meaningful Use (MU) standards. The Patient Protection and Affordable Care Act of 2010 (ACA) promotes access to health care and greater use of electronically transmitted documentation. Health informatics are expected to provide a framework for the electronic exchange of health information that complies with all legal requirements and standards.

The increased acceptance of HIT is also expected to expand the delivery of comparative effectiveness- and evidence-based medicine. However, while there are important benefits to health data sharing among clinicians, caregivers, and payers, the rates of HIT technical advances have proven to be greater than their rates of assimilation.

Electronic Health Records

Electronic health documentation overcomes the limitations imposed by paper records: idiosyncratic interpretability, inconsistent formats, and indifferent quality of information. Electronic
health documentation usually takes one of three forms, each of which must comply with predetermined standards before they are authorized for use: electronic medical records (EMRs), electronic health records (EHRs), and personal health records (PHRs). The National Alliance for Health Information Technology (2008) distinguishes them as follows: An EMR provides information about an individual for use by authorized personnel within a health-care organization. A PHR provides health-care information about an individual from diverse sources (clinicians, caregivers, insurance providers, and support groups) for the individual's personal use. An EHR provides health-related information about an individual that may be created, managed, and exchanged by authorized clinical personnel. EHRs may contain both structured and unstructured data, so that it is possible to share coded diagnostic data, clinician's notes, personal genomic data, and X-ray images in the same document, with substantially less likelihood of error in interpretation or legibility.

Health Level 7 (HL7), an international organization, has promulgated a set of document architecture standards that enable the creation of consistent electronic health documents. The Consolidated Clinical Document Architecture (C-CDA) and the Quality Reporting Document Architecture (QRDA) provide templates that reflect the HL7 Reference Information Model (RIM) and can be used to structure electronic health documents. The C-CDA consolidates the initial CDA with the Continuity of Care Document (CCD) developed by the Healthcare Information Technology Standards Panel (HITSP). The C-CDA and the QRDA support the Extensible Markup Language (XML) standard, so that any documents developed according to these standards are both human- and machine-readable. C-CDA and QRDA documents may contain structured and unstructured data. HL7 recommends the use of data standards, such as the Logical Observation Identifiers Names and Codes (LOINC), managed by the Regenstrief Institute, to ensure consistent interpretability.

Health Information Exchange

With electronic health records, health information exchange (HIE) provides the foundation of health informatics. Adhering to national standards, HIE operationalizes the HITECH MU provisions by enabling the electronic conveyance of health information among health-care organizations. Examples are case management and referral data, clinical results (laboratory, pathology, medication, allergy, and immunization data), clinical summaries (CCD and PHR extracts), images (including radiology reports and scanned documents), free-form text (office notes, discharge notes, emergency room notes), financial data (claims and payments), performance metrics (providers and institutions), and public health data.

The US Department of Health and Human Services Office of the National Coordinator (DHHS ONC) has established the eHealth Exchange as a network of networks to support HIE by formulating the policies, services, and standards that apply to HIE. HL7 has produced the Fast Healthcare Interoperability Resources (FHIR) framework for developing web-based C-CDA and QRDA implementations that comply with web standards, such as XML, JSON, and HTTP. An older standard, the American National Standards Institute X12 Electronic Data Interchange (ANSI X12 EDI), supports the transmission of Health Care Claim and Claim Payment/Advice data (transactions 835 and 837).
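Since FHIR resources are commonly exchanged as JSON over HTTP, the following minimal Python sketch shows what a FHIR-style Patient resource might look like; the identifier, name, and endpoint mentioned in the comments are invented for illustration and are not taken from this entry or from any real system.

import json

# A minimal, illustrative FHIR-style Patient resource (JSON representation).
patient = {
    "resourceType": "Patient",
    "id": "example-001",                      # hypothetical identifier
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-04-12",
}

payload = json.dumps(patient, indent=2)
print(payload)

# In a FHIR-based exchange, such a resource would typically be retrieved or
# submitted over HTTP, e.g., GET <base-url>/Patient/example-001, where
# <base-url> is the address of a FHIR server (hypothetical here).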
Health Domain Data Standards

To ensure semantic consistency and data quality, the effective use of health informatics depends on the adoption of data standards, such as the internationally recognized Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT), maintained by the International Health Terminology Standards Development Organisation (IHTSDO), a multilingual lexicon that provides coded clinical terminology extensively used in EHR management. RxNorm, maintained by the National Institutes of Health's National Library of Medicine (NIH NLM), provides a common ("normalized") nomenclature for clinical drugs with links to their equivalents in other drug vocabularies commonly used in pharmacology and drug interaction research. The Logical Observation Identifiers Names and Codes (LOINC), managed by the Regenstrief Institute, provides a standardized lexicon for reporting lab results. The International Classification of Diseases, ninth and tenth editions (ICD-9 and ICD-10), are also widely used.

Health Data Analytics

The accelerated adoption of EHRs has increased the availability of health-care data. The availability of large data sets, in diverse formats from different sources, is the norm rather than the exception. By themselves, data are not particularly valuable unless researchers and analysts can discern patterns of meaning that collectively constitute information useful to meet strategic and operational requirements. Health data analytics, comprising statistics-based descriptive and predictive modeling, data mining, and text mining, supported by natural language processing (NLP), provide the information necessary to improve population well-being. Data analytics can help reduce operational costs and increase operational efficiency by providing information needed to plan and allocate resources where they may be used most effectively. From a policy perspective, data analytics are helpful in assessing programmatic successes and failures, enabling the modification and refinement of policies to effect their desired outcomes at an optimum level.

Big Data and Health Informatics

Big Data, with their size, complexity, and velocity, can be beneficial to health informatics by expanding, for example, the range and scope of research opportunities. However, the increased availability of Big Data has also increased the need for effective privacy and security management, defense against data breaches, and data storage management. Big Data retrieval and ingestion capabilities have increased the dangers of unauthorized in-transit data extractions, transformations, and assimilation unbeknownst to the authorized data owners, stewards, or recipients. To ensure the privacy of individually identifiable health information, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) requires health data records to be "anonymized" by removing all personally identifiable information (PII) prior to their use in data analytics. As data analytics and data management tools become more sophisticated and robust, the availability of Big Data sets will increase. Issues affecting the management and ethical, disciplined use of Big Data will continue to inform policy discussions. With the appropriate safeguards, Big Data analytics enhance our capabilities to capture program performance metrics focused on costs, comparative effectiveness of diagnostics and interventions, fraud, waste, and abuse.

Challenges and Future Trends

Health informatics hold the promise of improving health care in terms of access and outcomes. But many challenges remain. For example, Big Data analytics tools are in their infancy. The processes to assure interorganizational data quality standards are not fully defined. Likewise, anonymization algorithms need additional refinement to ensure the privacy and security of personally identifiable information (PII). In spite of the work that still needs to be done, the importance of health informatics will increase as these issues are addressed, not only as technical challenges but also to increase the social good.

Further Reading

Falik, D. (2014). For big data, big questions remain. Health Affairs, 33(7), 1111–1114.
Miller, R. H., & Sim, I. (2004). Physicians' use of electronic medical records: Barriers and solutions. Health Affairs, 23(2), 116–126.
Office of the National Coordinator. (2008). The National Alliance for Health Information Technology Report to
Office of the National Coordinator. (2008). The National Alliance for Health Information Technology Report to the National Coordinator for Health Information Technology on Defining Key Health Information Technology Terms. Health Information Technology. Available from http://www.hitechanswers.net/wp-content/uploads/2013/05/NAHIT-Definitions2008.pdf.
Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: Promise and potential. Health Information Science and Systems, 2(3), 1–10. Available from http://www.hissjournal.com/content/2/1/3.
Richesson, R. L., & Krischer, J. (2007). Data standards in clinical research: Gaps, overlaps, challenges and future directions. Journal of the American Medical Informatics Association, 14(6), 687–696.

High Dimensional Data

Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Overview

While big data is typically characterized as having a massive number of observations, it also refers to data with high dimensionality. High-dimensional data contains many attributes (variables) relative to the sample size, including instances where the number of attributes exceeds the number of observations. Such data are common within and across multiple domains and disciplines, from genomics to finance and economics to astronomy. Some examples include:

• Electronic Health Records, where each record contains various data points about a patient, including demographics, vital signs, medical history, diagnoses, medications, immunizations, allergies, radiology images, lab and test results, and other items
• Earth Observation Data, which contains locational and temporal measurements of different aspects of our planet, e.g., temperature, rainfall, altitude, soil type, humidity, terrain, etc.
• High-Frequency Trading data, which comprises real-time information on financial transactions and stock prices, along with unstructured content, such as news and social media posts, reflecting consumer and business sentiments, among other things
• Micro-array data, where each microarray comprises tens of thousands of genes/features, but there are only a limited number of clinical samples
• Unstructured documents, where each document contains numerous words, terms, and other attributes

High dimensional data raise unique analytical, statistical, and computational issues and challenges. Data with both a high number of dimensions and observations raise an additional set of issues, particularly in terms of algorithmic stability and computational efficiency. Accordingly, the use of high-dimensional data requires specific kinds of methods, tools, and techniques.

Issues and Challenges

Regression models based on high dimensional data are vulnerable to statistical problems, including noise accumulation, spurious correlations, and incidental endogeneity. Noise can propagate in models with many variables, particularly if there is a large share of poor predictors. Additionally, uncorrelated random variables in such models also can show a strong association in the sample, i.e., there is the possibility for spurious correlations. Finally, in large multivariate models, covariates may be fortuitously correlated with the residuals, which is the essence of incidental endogeneity. These issues can compromise the validity, reliability, interpretability, and appropriateness of regression models. In the particular case where the number of attributes exceeds the sample size, there is no longer a unique least-squares solution, as the variance of each of the estimators becomes infinite.
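The spurious-correlation problem in particular is easy to reproduce. The short simulation below is a sketch only; NumPy is assumed to be available, and the sample sizes are made up for illustration. It draws thousands of predictors that are completely unrelated to the outcome, yet the strongest sample correlation is typically far from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 50, 5000                 # few observations, many predictors
X = rng.standard_normal((n_obs, n_vars))
y = rng.standard_normal(n_obs)           # outcome unrelated to every predictor

# Sample correlation of each predictor with the outcome.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n_obs

print(f"largest |correlation| among {n_vars} unrelated predictors: {abs(corr).max():.2f}")
# Typically prints roughly 0.5-0.6 even though every true correlation is zero.
```

The more predictors that are screened, the larger the largest purely accidental correlation tends to be, which is why naive variable screening in high dimensions can be misleading.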
Another related problem is the "curse of dimensionality," which has implications for the accuracy and generalizability of statistical learning models. For supervised machine learning, a model's predictive performance hinges critically on how well the data used for training accurately reflects the phenomenon being modeled.
In this regard, the sample data should contain a representative combination of predictors and outcomes. However, high-dimensional data tends to be sparse, a situation in which the training examples given to the model fail to capture all possible combinations of the predictors and outcomes, including infrequent occurrences. This situation can lead to "overfitting," where the trained model has poor predictive performance when using data outside the training set. As a general rule, the amount of data needed for accurate model generalization increases exponentially with the dimensionality.
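A back-of-the-envelope calculation makes that exponential growth concrete. If each predictor is coarsely divided into just ten bins (an arbitrary choice for illustration), the number of distinct predictor-combination cells a training set would need to cover grows as ten to the power of the number of predictors:

```python
# Number of predictor-combination cells to cover when each predictor is split
# into a fixed number of bins; the count grows exponentially with dimension.
bins_per_predictor = 10
for d in (1, 2, 5, 10, 20):
    print(f"{d:>2} predictors -> {bins_per_predictor ** d:,} cells")
```

With twenty predictors the cell count already exceeds the number of observations in any realistic training set, which is why high-dimensional training data are almost always sparse in this sense.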
In instances where high-dimensional data contain a large number of observations, model optimization can be computationally expensive. Accordingly, scalability and computational complexity must also be considered when selecting models for such data.

Strategies and Solutions

Two approaches for addressing the problems associated with high-dimensional data involve reducing the dimensionality of the data before they are analyzed or selecting models specifically designed to handle high dimensional data.

Subset or Feature Selection
This strategy involves the removal of irrelevant or redundant variables from the data prior to modeling. As feature selection keeps only a subset of the original features, it has the advantages of making the final model more interpretable and minimizing costs associated with data processing and storage. Feature selection can be accomplished in a couple of different ways, each of which has advantages and disadvantages. One tactic is to simply extract predictors that we believe are most strongly associated with the output. However, the drawback of this technique is that it requires a priori knowledge of what the appropriate predictors are, which can be difficult when working with massive numbers of variables. An alternative is to apply "best subset" selection, which fits separate regression models for each possible combination of predictors. In this approach, we fit all possible models that contain precisely one predictor, then move on to models with exactly two predictors, and so on. We then examine the entire collection of models to see which one performs best while minimizing the number of covariates in the model. Indeed, this is a simple and intuitively appealing approach. However, this technique can become computationally intractable when there are large numbers of variables. Further, the larger the search space, the higher the chance of finding models that look good on the training data but have low predictive power. An enormous search space can lead to overfitting and high variance of the coefficient estimates. For these reasons, stepwise methods – forward or backward elimination – are attractive alternatives to best subset selection.
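The contrast between exhaustive and stepwise search is easy to see in code. The sketch below uses scikit-learn (assumed to be available; it is not referenced in the entry) to run a forward stepwise search; an exhaustive best-subset search over the same thirty predictors would have to examine 2^30, roughly one billion, candidate models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(1)
n, p = 100, 30
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.standard_normal(n)   # only two real predictors

# Forward stepwise selection: add one predictor at a time, scored by cross-validation.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X, y)
print("selected columns:", np.flatnonzero(selector.get_support()))
```

Backward elimination works the same way with direction="backward", starting from the full model and dropping one predictor at a time.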
Shrinkage Methods
Shrinkage (regularization) methods involve fitting a model with all the predictors, allowing for small or even null values for some of the coefficients. Such approaches not only help in selecting predictors but also reduce model variance, in turn reducing the chances of overfitting. Ridge regression and lasso regression are two types of models which utilize regularization methods to "shrink" the coefficients. They accomplish this through the use of a penalty function. Ridge regression includes a weight in the objective function used for model optimization to create a penalty for adding more predictors. This has the effect of shrinking one or more of the coefficients to values close to zero. On the other hand, lasso regression uses the absolute values of the coefficients in the penalty function, which allows for coefficients to go to zero.
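A minimal comparison of the two penalties, again sketched with scikit-learn (assumed available) on made-up data with more predictors than observations, shows the practical difference: ridge pulls coefficients toward zero, while the lasso's absolute-value penalty sets many of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
n, p = 80, 200                                   # more predictors than observations
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)               # squared-coefficient penalty
lasso = Lasso(alpha=0.1).fit(X, y)               # absolute-value penalty

print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0.0)), "of", p)
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0.0)), "of", p)
```

The penalty strength (alpha in scikit-learn) controls how aggressively coefficients are shrunk; the values above are illustrative rather than tuned.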
Dimensionality Reduction
Another approach for managing high-dimensional data is to reduce the complexity of the data prior to modeling. With dimensionality reduction methods, we do not lose any of the original variables. Instead, all the variables get folded into the high-order dimensions extracted. There are two categories of dimensionality reduction techniques. In data-oblivious approaches, we do the dimensionality-reducing mapping without using the data or knowledge about the data.
Random projection and sketching are two popular methods in this category. The advantages of the data-oblivious approach are that (1) it is not computationally intensive and (2) it does not require us to "see" or understand the underlying data. Data-aware reduction, in contrast, builds the mapping from the data itself; rather than relying on explicit prior knowledge of the data's contents, it "learns" the data structure. Principal Component Analysis (PCA) and clustering algorithms are examples of such dimensionality reduction methods.
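Both styles of reduction can be sketched in a few lines with scikit-learn (assumed available; the dimensions below are invented for illustration). The Gaussian random projection builds its mapping without looking at the data, whereas PCA learns directions of maximal variance from the data itself.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 10_000))        # 500 observations, 10,000 attributes

# Data-oblivious: the projection matrix is random and independent of X.
X_rp = GaussianRandomProjection(n_components=50, random_state=3).fit_transform(X)

# Data-aware: the components are estimated from X.
X_pca = PCA(n_components=50).fit_transform(X)

print(X_rp.shape, X_pca.shape)                # both reduced to (500, 50)
```

In either case all 10,000 original attributes contribute to the 50 derived dimensions, which is the sense in which no variable is discarded.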
Other Methods
Support Vector Machines (SVMs), a class of machine learning algorithms, are also suitable for modeling high-dimensional data. SVMs are not sensitive to the data's dimensionality, and they can effectively deal with noise and nonlinearities. Ensemble data analysis can reduce the likelihood of overfitting models based on high-dimensional data. Such methods use multiple, integrated algorithms to extract information from the entire data set. Bootstrapping, boosting, bagging, stacking, and random forests are all used in ensemble analysis.
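The sketch below (scikit-learn assumed available; data and settings are illustrative) fits two of the approaches just mentioned, a linear support vector classifier and a random forest, to the same high-dimensional synthetic data and scores them by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, p = 200, 2000
X = rng.standard_normal((n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # labels depend on 2 of the 2,000 features

for name, model in [("SVM", SVC(kernel="linear")),
                    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated accuracy = {score:.2f}")
```

Cross-validation is used here because, as discussed above, accuracy measured only on the training data is an unreliable guide in high dimensions.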
Concluding Remarks

While high-dimensional big data provide rich opportunities for understanding and modeling complex phenomena, their use comes with various issues and challenges, as highlighted. Certain types of high dimensional big data – e.g., spatial or network data – can contribute to additional problems unique to the data's particular nuances. Accordingly, when working with high-dimensional data, it is imperative first to understand and anticipate the specific issues that may arise in modeling the data, which can help optimize the selection of appropriate methods and models.

Cross-References

▶ Data Reduction
▶ Ensemble Methods

Further Reading

Bühlmann, P., & Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. New York: Springer.
Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101.
Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review, 1(2), 293–314.
Genender-Feltheimer, A. (2018). Visualizing high dimensional and big data. Procedia Computer Science, 140, 112–121.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.

HIPAA

William Pewen
Department of Health, Nursing and Nutrition, University of the District of Columbia, Washington, DC, USA

As health concerns are universal, and health care spending is escalating globally, big data applications offer the potential to improve health outcomes and reduce expenditures. In the USA, access to an individual's health data including both clinical and fiscal records has been regulated under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). The Act established a complex and controversial means of regulating health information which has been both burdensome to the health sector and has failed to fully meet public expectations for ensuring the privacy and security of information.

In the absence of a broad statutory regime addressing privacy, legislative efforts to address the public concern regarding health information have been sector specific and reactive. Standards have relied upon a foundation of medical ethics beginning with the Hippocratic Oath. Yet even such a recognized standard has not been a static one, as modern versions of the oath exhibit substantial changes from the original form.
The development of federal patient protections drew substantially from reforms established in both the Nuremberg Code and the Helsinki Declaration, with the latter undergoing periodic revisions, including recent recognition that the disclosure of "identifiable human material or identifiable data" can pose substantial risks to individuals.

Enactment

Enactment of HIPAA in 1996 provided a schema for the regulation of the handling of individually identifiable health information by "covered entities" – those medical service providers, health plans, and certain other organizations involved in treatment and related financial transactions. Such information includes critical primary data on a healthcare sector now impacting over 325 million Americans and involving over $3 trillion in annual spending.

HIPAA regulates the disclosure and use of health data, rather than its collection, and functions primarily as a tool for maintaining confidentiality. The pursuance of rules to implement the Act spanned six contentious years ending in 2002 after requirements for active consent by individuals prior to data disclosures were substantially reduced.

A critical construct of HIPAA is the concept of protected health information ("PHI"), which is defined as "individually identifiable health information." Only PHI is protected under the Act. HIPAA provides for required disclosure of PHI when requested by the patient or by the Secretary of Health and Human Services for certain defined purposes including audit and enforcement. Aside from this, disclosure is permitted for the purposes of treatment, payment, and healthcare operations without the need for specific patient authorization.

HIPAA also provides for significant exceptions under which consent for individual data disclosure is not required. Notable among these is a public health exception under which the use of data without consent for reporting and research is permitted. In addition, after implementation of the 1996 Act, many diverse activities were conducted as "healthcare operations," ranging from a covered entity's quality assurance efforts to unsolicited marketing to patients.

Under HIPAA personal health information may be "de-identified" by removal of 18 specified identifiers or by a process in which expert certification is obtained to ensure a low probability of identification of an individual patient. Such de-identified data is no longer considered PHI and is not protected under HIPAA.
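In practice the removal-of-identifiers route ("safe harbor") amounts to stripping a defined set of fields before data are released for analysis. The sketch below is a minimal illustration only: the field names are hypothetical, and the set shown is a small subset of the 18 identifier categories, not the full regulatory list.

```python
# Illustrative safe-harbor-style stripping of direct identifiers.
IDENTIFIER_FIELDS = {
    "name", "street_address", "phone", "email",
    "ssn", "medical_record_number", "full_face_photo",
}

def deidentify(record: dict) -> dict:
    """Return a copy of a record with direct identifier fields removed."""
    return {key: value for key, value in record.items() if key not in IDENTIFIER_FIELDS}

patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip3": "300",                  # generalized geography retained
    "diagnosis_code": "E11.9",
    "lab_glucose_mmol_l": 6.3,
}
print(deidentify(patient))          # clinical fields retained, identifiers dropped
```

As discussed below, removing these fields can conflict with analytic goals, and it does not by itself guarantee that individuals cannot be re-identified by linking the remaining fields to other data sets.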
HITECH Amendments to HIPAA

The enactment of the Health Information Technology for Economic and Clinical Health (HITECH) Act resulted in a number of changes to HIPAA. These focused primarily on bringing the business associates of covered entities under HIPAA regulation, increasing penalties for violations, establishing notification and penalties for data breaches, and limiting the sale of PHI without a patient's consent. However, three key aspects remained essentially intact.

First, the public health exception remained and explicitly permits the sale of PHI for research purposes – now including sale by covered entities for public health purposes at a profit. This provides a major potential data stream for big data applications development. Second, the de-identification scheme was preserved as a construct to utilize data outside the limitations imposed for PHI (although some guidance in the use of de-identified and limited data sets has been issued). Finally, the scope of HIPAA remains limited to health data handled by covered entities. A vast spectrum of data sharing including health-related websites, organizations, and even "personal health records" remain outside HIPAA regulation.

Big Data Utilization of HIPAA Data

The healthcare applications for which big data offers the most immediate and clear promise are those aimed at utilizing data to realize improved outcomes and cost savings. Research and innovation to achieve such advances rely on a range of data sources but in particular involve access to the electronic health record (EHR) and claims and payment data.
Both lie within the scope of HIPAA regulation. Given that access to primary data is of critical importance for big data healthcare applications, three strategies for access under the HIPAA regime are evident.

The first of these utilizes a provision of the HITECH Act which addressed the "healthcare operations" exception and provides means of conducting internal quality assurance and process improvement within a covered entity. This facilitates a host of big data applications which may be employed to analyze and improve care. However, this strategy presents issues of both ethical consent and legal access in any subsequent synthesis of data involving multiple covered entities.

A second strategy involves the use of de-identified data. HIPAA's requirements for the utilization of such data are minimal, and use does not require the consent of individual patients. However, this strategy presents two major problems linked to the means of de-identification of PHI. The first of these is that, for many studies, the loss of the 18 designated identifiers may compromise the aims of the project. Identifying fields can be critical to an application. Attempts at longitudinal study are particularly impacted by such a strategy. Alternatively, a study may either discard the designated identifiers or rely upon a certifying party attesting that the data have been de-identified to the extent that re-identification is highly unlikely – yet the latter approach may offer an uncertain shield with regard to ultimate liability.

The use of de-identified data also presents a clear contradiction when relied upon for big data applications: one key aspect of big data involves the ability to link disparate data sets; consequently, this key attribute undermines the fundamental premise of de-identification. To the extent that big data techniques are effective, the "safe harbor" of de-identification is thus negated.

A third strategy is indisputably HIPAA compliant. The acquisition of consent from individuals does not rely upon uncertain de-identification nor is it constrained to data sets contained only within a single covered entity. While consent mechanisms undoubtedly could be constructed so as to be more concise, and the process less cumbersome, the use of consent also addresses a key observation underlying continued debate around HIPAA: the finding of the Institute of Medicine that only approximately 10 percent of Americans support access to health data for research without their consent.

Data Outside the Scope of HIPAA

The original drafting of HIPAA occurred in a context in which electronic health data exchange was in its infancy. The HITECH Act remains constrained by the HIPAA construct and consequently does not address health data outside of the "covered entity" construct. HIPAA thus fails to regulate collection, disclosure, or the use of data if there is no covered entity relationship. While access to the most desirable data may be protected by HIPAA, a huge expanse of health-related information is not, such as purchases of over-the-counter drugs and data sharing on most health-related websites.

Nonregulated data is highly variable in validity, reliability, and precision, raising concerns regarding its application in the study of health states, outcomes, and costs. Such data may be relatively definitive, such as information shared by the patient and information gleaned through commercial transactions such as consumer purchases. A report that Target Stores developed a highly accurate means of identifying pregnant women through retail purchase patterns is illustrative of the power of secondary data in deriving health information which would otherwise be protected under HIPAA.

The use of such surrogate data presents a host of problems, including both public objection to the use of technology to undermine the statutory intent of HIPAA, as well as the application of big data in facilitating discrimination which could evade civil rights protections. In a context in which the majority of Americans lack confidence in the HIPAA framework for maintaining the confidentiality and security of individual health information, the current regulatory framework may not remain static.
Cross-References

▶ Biomedical Data
▶ De-identification/Re-identification
▶ Health Informatics
▶ Patient Records

Further Reading

Caines, K., & Hanania, R. (2013). Patients want granular privacy control over health information in electronic medical records. Journal of the American Medical Informatics Association, 20, 7–15.
Duhigg, C. (2012). How companies learn your secrets. New York Times. 16 Feb 2012.
Pewen, W. Protecting our civil rights in the era of digital health. The Atlantic. 2 Aug 2012. http://www.theatlantic.com/health/archive/2012/08/protecting-our-civil-rights-in-the-era-of-digital-health/260343/. Accessed Aug 2016.
110 Stat. 1936 – Health Coverage Availability and Affordability Act of 1996. https://www.gpo.gov/fdsys/granule/STATUTE-110/STATUTE-110-Pg1936. Accessed Sept 2017.
U.S. Department of Health and Human Services. Guidance regarding methods for de-identification of protected health information in accordance with the health insurance portability and accountability act (HIPAA) privacy rule. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html. Accessed Sept 2017.
World Medical Association. Declaration of Helsinki – Ethical principles for medical research involving human subjects. https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-for-medical-research-involving-human-subjects/. Accessed Sept 2017.

Human Resources

Lisa M. Frehill
Energetics Technology Center, Indian Head, MD, USA

Human resources (HR) management is engaged and applied throughout the full employee lifecycle, including recruitment and hiring, talent management and advancement, and exit/retirement. HR includes operational processes and, with the expansion of the volume, variety, and velocity of data in the past 20 years, more emphasis has been placed on strategic planning and future-casting. As an organizational function that has long embraced the use of information technology and data, HR is well-positioned to deploy big data in many ways, but also faces some challenges.

Enterprise data warehouses (EDW) have been a key tool for HR for many years, with the more recent development of interactive HR dashboards to enable managers to access employee data and analytics (business intelligence, or BI tools) to monitor key metrics such as turnover, employee engagement, workforce diversity, absenteeism, productivity, and the efficiency of hiring processes, among others. In the past decade, the availability of big data has meant that organizational data systems now ingest unstructured, text-based, and naturally occurring data from our increasingly digital world. While EDW were designed to function well with structured, fixed semantics data, much of big data, with highly variable semantics, necessitates harmonization and analysis in "data lakes" prior to being operationally useful to organizations.

There are a number of ways HR uses big data. Examples include, but are not limited to, the following:

• Recruitment and hiring: Organizations use big data to manage their brands to potential employees, making extensive use of social media platforms (e.g., Twitter, LinkedIn, and Facebook). These platforms are also sources of additional information about potential employees, with many organizations developing methods of ingesting data from applicants' digital presence and using these data to supplement the information available from résumés and interviews. Finally, big data has provided tools for organizations to increase the diversity of those they recruit and hire by both providing a wider net to increase the size and variety of applicant pools and to counter interviewers' biases by gathering and analyzing observational data in digital interviews.
• Talent management: There are many aspects of this broad HR function, which involves performance and productivity management, workforce engagement, professional development and advancement, and making sure the salary and benefits portfolio keeps pace with the rewards seen as valuable by the workforce, among others. For example, big data has been cited as a valuable tool for identifying skills gaps within an organization's workforce in order to determine training that may need to be offered. Some organizations have employed gamification strategies to gather data about employee skills. Finally, big data provides the means for management to exert control over the workforce, especially a geographically dispersed one, such as became common during the coronavirus pandemic of 2020–2021.
• Knowledge management: Curation of the vast store of organizational information is another critical HR big data task. Early in the life of the World Wide Web, voluminous employee handbooks moved online even as they became larger, more detailed, and connected with online digital forms associated with a multitude of organizational processes (e.g., wage and benefits, resource requisitions, performance management). Big data techniques have been important in expanding these knowledge management functions. They have also expanded on these traditional functions to include capturing and preserving institutional knowledge for more robust succession planning, which has recently been important as baby boomers enter retirement.

Big data poses important challenges for HR professionals. First, issues of privacy and transparency are important as more data about people are more rapidly available. For example, in the past several years, employees' behavior outside the workplace has been more easily surveilled by employers via the proliferation of social media platforms, with consequential outcomes for employees whose behavior is considered inappropriate by their employers. Additionally, algorithms applied to ingested digital data, along with the different ways various devices process application information (including assessment tests of applicants), run the risk of reducing the transparency of hiring processes and of inadvertently introducing the biases HR professionals hope to reduce.

Second, the proliferation of big data, the transition from EDW to data lakes, and the greater societal pressure on organizations to be more transparent with respect to human resources mean HR professionals need additional data analytics skills. HR professionals need to understand limitations associated with big data such as quality issues (e.g., reliability, validity, and bias), but they also need to be able to more meaningfully connect data with organizational outcomes without falling into the trap of spurious results.

In closing, big data has been an important resource for data-intensive HR organizational functions. Such unstructured, natural data has provided complementary information to the highly structured and designed data HR professionals have long used to recruit, retain, and advance the employees necessary for efficient and productive organizations.

Disclaimer The views expressed in this entry are those of the author and do not necessarily represent the views of Energetics Technology Center, the Institute of Museum and Library Services, the U.S. Department of Energy, or the government of the United States.

Bibliography/Further Readings

Corritore, M., Goldberg, A., & Srivastava, S. B. (2020, January). The new analytics of workplace culture. SHRM online at: https://shrm.org/resourcesandtools/hr-topics/technology/pages/the-new-analytics-of-workplace-culture.aspx.
Friedman, T., & Heudecker, N. (2020, February). Data hubs, data lakes and data warehouses: How they are different and why they are better together. Gartner online at: https://www.gartner.com/doc/reprints?id=1-24IZJZ2F&ct=201103&st=sb.
Garcia-Arroyo, J., & Osca, A. (2019). Big data contributions to human resource management: A systematic review. International Journal of Human Resource Management. https://doi.org/10.1080/09585192.2019.1674357.
Giacumo, L. A., & Breman, J. (2016). Emerging evidence on the use of big data and analytics in workplace learning: A systematic literature review. The Quarterly Review of Distance Education, 17(4), 21–38.
Howard, N., & Wise, S. (2018). Best practices in linking data to organizational outcomes. Bowling Green: Society for Industrial and Organizational Psychology (SIOP). Online at: https://www.siop.org/Portals/84/docs/White%20Papers/Visibility/DataLinking.pdf.
Noack, B. (2019). Big data analytics in human resource management: Automated decision-making processes, predictive hiring algorithms, and cutting-edge workplace surveillance technologies. Psychosociological Issues in Human Resource Management, 7(2), 37–42.
Wright, P. M., & Ulrich, M. D. (2017). A road well traveled: The past, present, and future journey of strategic human resource management. Annual Review of Organizational Psychology and Organizational Behavior, 4(1), 45–65.

Humanities (Digital Humanities)

Ulrich Tiedau
Centre for Digital Humanities, University College London, London, UK

Big Data in the Humanities

Massive use of "Big Data" has not traditionally been a method of choice in the humanities, a field in which close reading of texts, serendipitous finds in archives, and individual hermeneutic interpretations have dominated the research culture for a long time. This "economy of scarcity," as it has been called, has now been amended by an "economy of abundance," the possibility to distance-read, interrogate, visualize, and interpret a huge number of sources that would be impossible to be read by any individual scholar in their lifetime, simultaneously by using digital tools and computational methods.

Since the mid-2000s, the latter approach is known as "Digital Humanities" (hereafter DH), in analogy to "e-Science" sometimes also as "e-Humanities," although under the name of "Computing in the Humanities," "Humanities Computing" or similar it has been in existence for half a century, albeit somewhat on the fringes of the Humanities canon. There are also overlaps of DH with corpus linguistics and quantitative methods in the Humanities (e.g., in social and economic history) that are often borrowed from the social sciences.

DH thus has not only introduced computational methods to the Humanities but also significantly widened the field of inquiry, enabling new types of research that would have been impossible to pursue in a pre-digital age, as well as old research questions to be asked in new ways, promising to lead to new insights. Like many paradigm-shifting movements, it has sometimes been perceived in terms of culture and counterculture and alternatively been portrayed as a "threat" or a "savior" for the Humanities as a whole.

Definitions of Digital Humanities

Scholars are still debating the question whether DH is primarily a set of methodologies or whether it constitutes a new discipline, or is in the process of becoming a discipline, in its own right. The current academic landscape certainly allows both interpretations. Proponents of the methodological nature of DH argue that the digital turn that has embraced all branches of scholarship has not stopped for the Humanities and thus there would be no need to qualify this part of the Humanities as "Digital Humanities". In this view, digital approaches are a novel but integral part of existing Humanities disciplines, or a new version of the existing Humanities, an interpretation that occasionally even overshoots the target by equating DH with the Humanities as a whole.
On the other hand, indicators for the increasingly disciplinary character of DH are not just a range of devoted publications, e.g., Journal of Digital Humanities (JDH), Digital Humanities Quarterly (DHQ), Literary and Linguistic Computing: the Journal of Digital Scholarship in the Humanities (LLC), etc., and organizations, e.g., the Alliance of Digital Humanities Organizations (ADHO), an umbrella organization of five scholarly DH associations with various geographical and thematic coverage that organizes the annual Digital Humanities conference, but also the rapid emergence of DH centers all over the world, as this process of institutionalization following the emergence of a free and novel form of inquiry is how all academic disciplines came into being originally. Melissa Terras (2012) counts 114 physical centers in 24 countries that complement long-established pioneering institutions at, e.g., the University of Virginia, the University of Maryland, and George Mason University in the USA, at the University of Victoria in Canada, or at the University of Oxford, King's College London and, somewhat newer, University College London and the University of Sussex in the UK.

History of the Field

The success of the World Wide Web and the pervasion of academia as well as everyday life by modern information and communication technology, including mobile and tablet devices, which has led to people conducting a good part of their lives online, are part of the explanation for the rapid and pervasive success of DH in academia. Against this wider technological and societal background, as well as a new iteration of the periodically recurring crises of the Humanities, a trigger for the rapid rise of the field has been its serendipitous rebranding to "Digital Humanities," a term that has caught on widely. Whereas "Humanities Computing" emphasized the computational nature of the subject, thus tools and technology, and used "Humanities" only to qualify "Computing," "Digital Humanities" has reversed the order, placing the emphasis firmly on the humanistic nature of the inquiry and subordinating technology to its character, thus appealing to a great number of less technologically orientated Humanities scholars. Kathleen Fitzpatrick (2011) recounts the decisive moment when Susan Schreibman, Ray Siemens, and John Unsworth, the editors of the then planned Blackwell Companion to Humanities Computing, countered the publisher's alternative title suggestion Companion to Digitized Humanities with the final Companion to Digital Humanities (2004) because the field extended far beyond mere digitization. The name has stuck ever since, helping to bring about a paradigmatic shift in the Humanities, which has quickly been followed by changing funding regimes, of which the establishment of an Office of Digital Humanities (2006) by the National Endowment for the Humanities (NEH) may serve as an example here.

The origins of DH can be traced all the way back to the late 1940s when the Italian priest Roberto Busa S. J., who is generally considered to be the founder of the subject and in whose honor the Alliance of Digital Humanities Organizations (ADHO) awards the annual Busa Prize, in conjunction with IBM, started working on the Index Thomisticus, a digital search tool for the massive corpus of Thomas Aquinas's works (11 million words of medieval Latin), originally with punch card technology, resulting in a 52-volume scholarly edition that was finally published in the 1970s. The Hidden Histories project reconstructs the subject's history, or prehistory, from these times to the present with an oral history approach (Nyhan et al. 2012).

Subfields of Digital Humanities

Given its origins and the textual nature of most of the Humanities disciplines, it is no wonder that textual scholarship has traditionally been at the heart of DH, although the umbrella term also includes non-textually based digital scholarship as well. Especially in the USA, DH programs frequently developed in English departments (Kirschenbaum 2010), and the 2009 Convention of the Modern Language Association (MLA) in Philadelphia is widely seen as the breakthrough moment, at which DH became "mainstream."

Still being a field with comparatively low maturity, there also is no clear distinction between scholars and practitioners of DH, e.g., in the heritage sector. In fact some of the most eminent work has come from, and continues to be done in, the world of libraries, archives, museums, and other heritage institutions.
Generally speaking, digitization has characterized a good part of early DH work, throughout the second half of the 1990s and early 2000s, creating the basis for later interpretative work, before funding bodies, in a move to justify their investment, shifted their emphasis to funding research that utilized previously digitized data (e.g., JISC in the UK or the joint US/Canadian/British/Dutch "Digging into Data" program) rather than accumulate more digitized material that initially remained somewhat underused. Commercial companies, first and foremost Google, have been another player and contributed to the development of DH, i.e., by providing Big Humanities Data on unrivalled scales, notably Google Books with its more than 30 million digitized books (2013), although Google readily admits that this still only amounts to a fraction of all books ever published, and including integrated analytical tools like the Google n-gram viewer.

The Text Encoding Initiative (TEI) holds an important place in the development of digital textual scholarship. Growing from research into hypertext and the need for a standard encoding scheme for scholarly editions, this format of XML is one of the foremost achievements of early humanities computing (1987–). Apart from text encoding, digital editing, textual analytics, corpus linguistics, text mining, and language processing, central nontextual fields of activities of DH include digital image processing (still and moving), geo-spatial information systems (GIS) and mapping, data visualization, user and reader studies, social media studies, crowdsourcing, 3D/4D scanning, digital resources, subject-specific databases, web-archiving and digital long-term preservation, the semantic web, open access, and open educational practices, to name but the most important. An agreed upon "canon" of what constitutes DH next to its core of digital textual scholarship, although emerging, does not yet exist, and interpretations differ in this fluid field.

Controversies and Debates

DH's tendency to present itself, rightfully or wrongfully, as revolutionary has not won it only friends, and like any revolutionary movement, the field also encounters criticism and backlashes. Critics, while often acknowledging the potential of DH as an auxiliary means for "further knowing the already known" or for providing shiny visualizations of research for public engagement, also question what added value "asking old research questions in new ways" has and point to the still comparatively few projects that are driven first and foremost by a humanistic research question, rather than by tool development, proof of concept, etc. Well-noted critiques came from, e.g., Stanley Fish in a series of New York Times columns in 2011/2012 and from Adam Kirsch in the New Republic in 2014, sparking ongoing controversies ever since.

Still, it is early days, and these projects are underway. As the field is maturing, it promises to transform scholarship to an even greater measure than it has already done. It does so not the least by a second novel element that DH has introduced to the Humanities with its traditionally dominant "lone-scholar ideal": a research culture resembling the more team-based type of research done in science, technology, engineering, and medicine (STEM) subjects, in other words collaboration, an innovation just as important and potentially transformative as the use of computational methods for humanistic enquiry. DH can thus also be seen as an attempt at ending the separation of the "two cultures" in academia, the Humanities on the one hand and the STEM subjects on the other, a notion that C. P. Snow first suggested in the 1950s. Programmatically, the final report of the first "Digging into Data" program, a collaborative funding program by US, Canadian, British, and Dutch funding bodies, bears the title "One Culture" (Wiliford and Henry 2012).

Cross-References

▶ Big Humanities Project
▶ Curriculum, Higher Education, Humanities
▶ Visualization

Further Reading

Berry, D. M. (Ed.). (2012). Understanding digital humanities. Basingstoke: Palgrave MacMillan.
Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2012). Digital humanities. Cambridge, MA: MIT Press.
Fish, S. (2011, December 26). The old order changeth. Opinionator Blog, New York Times. http://opinionator.blogs.nytimes.com/2011/12/26/the-old-order-changeth/. Accessed August 2014.
Fitzpatrick, K. (2011). The humanities done digitally. The Chronicle of Higher Education. http://chronicle.com/article/The-Humanities-Done-Digitally/127382/. Accessed August 2014.
Gold, M. (Ed.). (2012). Debates in the digital humanities. Minneapolis: Minnesota University Press.
Kirsch, A. (2014, May 2). Technology is taking over English departments: The false promise of the digital humanities. New Republic. http://www.newrepublic.com/article/117428/limits-digital-humanities-adam-kirsch. Accessed August 2014.
Kirschenbaum, M. G. (2010). What is digital humanities and what's it doing in English departments? ADE Bulletin, 150, 1–7.
McCarty, W. (2005). Humanities computing. Basingstoke: Palgrave.
Nyhan, J., Flynn, A., & Welsh, A. (2012). A short introduction to the Hidden Histories project. Digital Humanities Quarterly, 6(3). http://www.digitalhumanities.org/dhq/vol/6/3/000130/000130.html. Accessed August 2014.
Schreibman, S., Siemens, R., & Unsworth, J. (Eds.). (2004). A companion to digital humanities. Oxford: Blackwell.
Schreibman, S., Siemens, R., & Unsworth, J. (Eds.). (2007). A companion to digital literary studies. Oxford: Blackwell.
Terras, M. (2012). Infographic: Quantifying digital humanities. http://blogs.ucl.ac.uk/dh/2012/01/20/infographic-quantifying-digital-humanities/. Accessed August 2014.
Terras, M., Nyhan, J., & Vanhoutte, E. (Eds.). (2013). Defining digital humanities: A reader. Farnham: Ashgate. ISBN 978-1-4094-6963-6.
Warwick, C., Terras, M., & Nyhan, J. (Eds.). (2012). Digital humanities in practice. London: Facet.
Wiliford, C., & Henry, C. (2012). One culture: Computationally intensive research in the humanities and social sciences. A report on the experiences of first respondents to the digging into data challenge (CLIR Publication No. 151). Washington, DC: Council on Library and Information Resources.
Indexed Web, Indexable Web

▶ Surface Web vs Deep Web vs Dark Web

Indicator Panel

▶ Dashboard

Industrial and Commercial Bank of China

Jing Wang1 and Aram Sinnreich2
1School of Communication and Information, Rutgers University, New Brunswick, NJ, USA
2School of Communication, American University, Washington, DC, USA

The Industrial and Commercial Bank of China (ICBC)

The Industrial and Commercial Bank of China (ICBC) was the first state-owned commercial bank of the People's Republic of China (PRC). It was founded on January 1st, 1984, and is headquartered in Beijing. In line with Deng Xiaoping's economic reform policies launched in the late 1970s, the State Council (chief administrative authority of China) decided to transfer all the financial businesses related to industrial and commercial sectors from the central bank (People's Bank of China) to ICBC (China Industrial Map Committee 2016). This decision, made in September 1983, is considered a landmark event in the evolution of China's increasingly specialized banking system (Fu and Hefferman 2009). While the government retains control over ICBC, the bank began to take on public shareholders in October 2006. As of May 2016, ICBC was ranked as the world's largest public company by the Forbes "Global 2000" (Forbes Ranking 2016). With its combination of state and private ownership, state governance, and commercial dealings, ICBC serves as a perfect case study to examine the transformation of China's financial industry.

Big data collection and database construction are fundamental to ICBC's management strategies. Beginning in the late 1990s, ICBC paid unprecedented attention to the implementation of information technology (IT) in its daily operations. Several branches adopted computerized input and internet communication of transactions, which had previously relied upon manual practices by bank tellers. Technological upgrades increased work efficiency and also helped to save labor costs. More importantly, compared to the labor-driven mechanism, the computerized system was more effective for retrieving data from historical records and analyzing these data for business development.
At the same time, it became easier for the headquarters to control the local branches by checking digitalized information records. Realizing the benefits of these informatization and centralization tactics, the head company assigned its Department of Information Management to develop a centralized database collecting data from every single branch. This database is controlled and processed by ICBC headquarters but is also available for use by local branches with the permission of top executives.

In this context, "big data" refers to all the information collected from ICBC's daily operations and can be divided into two general categories: "structured data" (which is organized according to preexisting database categories) and "unstructured data" (which is not) (Davenport and Kim 2013). For example, a customer's account information is typically structured data. The branch has to input the customer's gender, age, occupation, etc., into the centralized network. This information then flows into the central database, which is designed specifically to accommodate it. Any data other than the structured data will be stored as raw data and preserved without processing. For example, the video recorded at a local branch's business hall will be saved with only a date and a location label. Though "big data" in ICBC's informational projects refers to both structured and unstructured data, the former is the core of ICBC's big data strategy and is primarily used for data mining.
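The structured/unstructured distinction the entry draws can be pictured with two toy records (the field names below are hypothetical illustrations, not ICBC's actual schema): the first fits a predefined set of columns, while the second is kept as a raw object with only minimal labels.

```python
# Structured: matches a predefined database schema and can be queried directly.
structured_record = {
    "customer_id": "C-000001",
    "gender": "F",
    "age": 42,
    "occupation": "engineer",
}

# Unstructured: raw content preserved without processing, labeled only with a
# date and a location.
unstructured_record = {
    "blob_uri": "storage://branch-videos/2014-05-01-hall.mp4",
    "date": "2014-05-01",
    "location": "branch business hall",
}
```

Only the first kind of record feeds directly into the prestructured database described below; the second is retained for possible later processing.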
Since the late 1990s, ICBC has invested in big data development with increasingly large economic and human resources. On September 1st, 1999, ICBC inaugurated its "9991" project, which aimed at centralizing the data collected from ICBC branches nationwide. This project took more than 3 years to accomplish its goal. Beginning in 2002, all local branches were connected to ICBC's Data Processing Center in Shanghai – a data warehouse with a 400 terabyte (TB) capacity. The center's prestructured database enables ICBC headquarters to process and analyze data as soon as they are generated, regardless of the location. With its enhanced capability in storing and managing data, ICBC also networked and digitized its local branch operations. Tellers are able to input customer information (including their profiles and transaction records) into the national Data Center through their computers at local branches. These two-step strategies of centralization and digitization allow ICBC to converge local operations on one digital platform, which intensifies the headquarters' control over national businesses. In 2001, ICBC launched another data center in Shenzhen, China, which is in charge of the big data collected from its overseas branches. ICBC's database thus enables the headquarters' control over business and daily operations globally and domestically.

By 2014, ICBC's Data Center in Shanghai had collected more than 430 million individual customers' profiles and more than 600,000 commercial business records. National transactions – exceeding 215 million on a daily basis – have all been documented at the Data Center. Data storage and processing on such a massive scale cannot be fulfilled without a powerful and reliable computer system. The technology infrastructure supporting ICBC's big data strategy consists of three major elements: hardware, software, and cloud computing. Suppliers are both international and domestic, including IBM, Teradata, and Huawei.

Further, ICBC has also invested in data backup to secure its database infrastructure and data records. The Shanghai Data Center has a backup system in Beijing which can record data when the main server fails to work properly. The Beijing data center serves as a redundant system in case the Shanghai Data Center fails. It takes less than 30 seconds to switch between the two centers. To speed data backup and minimize data loss in significant disruptive events, ICBC undertakes multiple disaster recovery (DR) tests on a regular basis.

The accumulation and construction of big data is significant for ICBC's daily operation in three respects. First of all, big data allows ICBC to develop its customers' business potential through a so-called "single-view" approach. A customer's business data collected from one of ICBC's 35 departments are available for all the other departments.
By mining the shared database, ICBC is able to evaluate both a customer's comprehensive value and the overall quality of all existing customers. Cross-departmental business has also been propelled (e.g., the Credit Card Department may share business opportunities with the Savings Department). Second, the ICBC marketing department has been using big data for email-based marketing (EBM). Based on the data collected from branches, the Marketing and Business Development Department is able to locate its target customers and follow up with customized marketing/advertising information via customized email communications. This data-driven marketing approach is increasingly popular among financial institutions in China. Third, customer management systems rely directly on big data. All customers have been segmented into six levels, ranging from "one star" to "seven stars" (one star and two stars fall into a single segment), which indicate the customers' savings or investment levels at ICBC. "Seven Stars" clients have the highest level of credit and enjoy the best benefits provided by ICBC.

Big data has influenced ICBC's decision-making on multiple levels. For local branches, market insights are available at a lower cost. Consumer data generated and collected at local branches have been stored on a single platform provided and managed by the national data center. For example, a branch in an economically developing area may predict demand for financial products by checking the purchase data from branches in more developed areas. The branch could also develop greater insights regarding the local consumer market by examining data from multiple branches in the geographic area. For ICBC headquarters, big data fuels a dashboard through which it monitors ICBC's overall business and is alerted to potential risks. Previously, individual departments used to manage their financial risk through their own balance sheets. This approach was potentially misleading and even dangerous for ICBC's overall risk profile. A given branch providing many loans and mortgages may be considered to be performing well, but if a large number of branches overextended themselves, the emergent financial consequences might create a crisis for ICBC or even for the financial industry at large. Consequently, today, a decade after its data warehouse construction, ICBC considers big data indispensable in providing a holistic perspective, mitigating risk for its business and development strategies.

To date, ICBC has been a pioneer in big data construction among all the financial enterprises in China. It was the first bank to have all local data centralized in a single database. As the Director of ICBC's Informational Management Department claimed in 2014, ICBC has the largest Enterprise Database (EDB) in China.

Parallel to its aggressive strategies in big data construction, the issue of privacy protection has always been a challenge in ICBC's customer data collection and data mining. The governing policies primarily regulate the release of data from ICBC to other institutions, yet the protection of customer privacy within ICBC itself has rarely been addressed. According to the central bank's Regulation on the Administration of the Credit Investigation Industry issued by the State Council in 2013, interbank sharing of customer information is forbidden. Further, a bank is not eligible to release customer information to its nonbanking subsidiaries. For example, the fund management company (ICBCCS) owned by ICBC is not allowed to access customer data collected from ICBC banks. The only situation in which ICBC could release customer data to a third party is when such information has been linked to an official inquiry by law enforcement. These policies prevent consumer information from leaking to other companies for business purposes. Yet, the policies have also affirmed the fact that ICBC has full ownership of the customer information, thus giving ICBC greater power to use the data in its own interests.

Cross-References

▶ Data Mining
▶ Data Warehouse
▶ Structured Data
Further Reading

China Industrial Map Editorial Committee, China Economic Monitoring & Analysis Center & Xinhua Holdings. 2016. Industrial map of China's financial sectors, Chapter 6. World Scientific Publishing.
Davenport, T., & Kim, J. (2013). Keeping up with the quants: Your guide to understanding and using analytics. Boston: Harvard Business School Publishing.
Fu, M., & Hefferman, S. (2009). The effects of reform on China's bank structure and performance. Journal of Banking & Finance, 33(1), 39–52.
Forbes Ranking (2016). The World's Biggest Public Company. Retrieved from https://www.forbes.com/companies/icbc/.

Informatics

Anirudh Prabhu
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA

Synonyms

Information Engineering; Information Science; Information Systems; Information Studies; Information Theory; Informatique

Definition

Informatics is the science of information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access, and communicate information (Xinformatics Concept 2012).

The advent of "big data" has brought many opportunities for people and organizations to leverage large amounts of data to answer previously unanswered questions, but along with these opportunities come problems of storing and processing these data. Expertise in the field of Informatics becomes essential to build new information systems or adapt existing information systems to address "big data."

History

The French term "Informatique" was coined in March 1962 by Phillipe Dreyfus – along with translations in various other languages. Simultaneously and independently, Walter Bauer and his associates proposed the English term "Informatics" when they co-founded Informatics Inc. (Fourman 2002). A very early definition of "Informatics" from Mikhailov in 1967 states that "Informatics is the discipline of science which investigates the structure and properties (not specific content) of scientific information, as well as the regularities of scientific information activity, its theory, history, methodology and organization" (Fourman 2002). But in recent times, the scope of Informatics has moved well beyond just scientific information. It now extends to all information in the modern age.

Need for "X-Informatics"

The word Informatics is used as a compound, in conjunction with the name of a discipline, for example, Bioinformatics, Geoinformatics, Astroinformatics. Earlier, people who had deep knowledge of a specific domain would work on processing and engineering information systems that would be designed only for that domain.

In the last decade, fueled by the rapid increase in data and information resources, Informatics has gained greater visibility across a broad range of disciplines. As the popularity of Informatics increased through time, there has been a widespread need for people who specialize in the field of X-informatics. Informaticians (or Informaticists) and Data Scientists are trained in Informatics Theory, which is a combination of information science, cognitive science, social science, library science, and computer science; this training enables these people to engineer information systems in various domains using Informatics methodologies. In the term X-informatics, "X" is a variable for the domain, which can be bio, geo, chem, astro, etc.
different domains. Hence, there are many academic institutions across the world which have specialized courses and even degrees in Informatics.

Informatics in the Data-Information-Knowledge Ecosystem

The amount of data in the world is rising exponentially, but these data are not directly useful to the majority of people. In order for the data to be used to their fullest potential, they need to be processed, represented, and communicated in a meaningful way. This is where Informatics methods come into the picture. Figure 1 shows the Data-Information-Knowledge ecosystem and the focus of Informatics methods in this ecosystem. Informatics methods focus on transforming raw data into information that can be easily understood. Once meaningful information has reached the "consumer," they can draw inferences, have conversations, combine the new information with previous experiences, and gain knowledge on a specific subject.

Concepts in Informatics

Data – Data are encodings that represent the qualitative or quantitative attributes of a variable or a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived (Fox 2016).

Information – Representations of "Data" in a form that lends itself to human use. Information has three indivisible ingredients – content, context, and structure (Fox 2016).

Information Theory – "Information theory is the branch of mathematics that describes how uncertainty should be quantified, manipulated and represented" (Ghahramani 2003). Information theory is one of the rare scientific fields to have an identifiable origin. It was originally proposed by Claude E. Shannon in 1948 in a landmark paper titled "A Mathematical Theory of Communication." In this paper, "information" can be considered as a set of messages, where the goal is to send this "information" over a noisy channel and then to have the receiver reconstruct the message with a low probability of error.

Informatics, Fig. 1 Data-Information-Knowledge ecosystem (Fox 2016). (Figure labels: Producers, Consumers; Data, Information, Knowledge; Creation, Gathering; Presentation, Organization; Integration, Conversation; Experience; Context.)
Informatics, Fig. 2 Information life cycle. (Figure labels: Acquisition, Curation, Management, Preservation, Stewardship.)

Information Entropy – Information entropy is defined as the average amount of information produced by a probabilistic (stochastic) source of data. Mathematically, information entropy can be defined as

H = -\sum_{i=1}^{n} p_i \log_2 (p_i)

where H is the "Entropy" and p_i is the probability of occurrence of the i-th possible value of the source message. Information Entropy is commonly measured in "bits," a unit that represents 2 possible states; hence, the base of the logarithm in the definition of entropy is 2. If the entropy measurement unit changes, then the base of the logarithm also changes accordingly. For example, if information entropy is measured in decimal digits, then the base of the logarithm would change to 10.

"In information theory, entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process" (Ghahramani 2003). Therefore, entropy is a key measure in informatics.
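To make the unit dependence concrete, the short sketch below (a minimal illustration added here, not part of the original formula, using a hypothetical four-message source) computes the entropy in bits (base 2) and in decimal digits (base 10).

```python
import math

def shannon_entropy(probabilities, base=2):
    # H = -sum(p_i * log_base(p_i)); base 2 yields bits, base 10 yields decimal digits
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

# Hypothetical source with four possible messages of unequal probability
p = [0.5, 0.25, 0.125, 0.125]
print(shannon_entropy(p))           # 1.75 bits
print(shannon_entropy(p, base=10))  # about 0.527 decimal digits
```

The same distribution yields different numbers only because the unit (the logarithm base) changes, which is exactly the point made above.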
Information Architecture – Information Architecture is the art of expressing a model or concept of information used in activities that require explicit details of complex systems. Richard Saul Wurman, an architect and graphic designer, popularized the usage of the term. Wurman defines an information architect as follows – ". . . I mean architect as in the creating of systemic, structural, and orderly principles to make something work – the thoughtful making of either artifact, or idea, or policy that informs because it is clear" (Fox 2016).

Information Life Cycle – The Information Life Cycle refers to the steps stored information goes through from its creation to its deletion or archival (Fig. 2).

Stages of the Information Life Cycle are as follows (Fox 2016):

Acquisition: The process of recording or generating a concrete artifact from the concept
Curation: The activity of managing the use of data from its point of creation to ensure it is available for discovery and re-use in the future
Preservation: The process of retaining usability of data in some source form for intended and unintended use
Stewardship: The process of maintaining integrity across acquisition, curation, and preservation
Management: The process of arranging for discovery, access and use of data, information and all related elements. Management also includes overseeing processes for acquisition, curation, preservation, and stewardship

Informatics in Scientific Research

As mentioned earlier, in the past, Informatics efforts emerged largely in isolation across a number of disciplines. "Recently, certain core elements in informatics have been recognized as being applicable across disciplines. Prominent domains of informatics have two key factors in
common: i) a distinct shift toward methodologies and away from dependence on technologies and ii) understanding the importance of and thereby using multidisciplinary and collaborative approaches in research" (Fox 2011).

As a domain, Informatics builds on several existing academic disciplines, primarily Artificial Intelligence, Cognitive Science and Computer Science. Cognitive Science concerns the study of natural information processing systems; Computer Science concerns the analysis of computation, and the design of computing systems; Artificial Intelligence plays a connecting role, producing systems designed to emulate those found in nature (Fourman 2002).

Example Use Case

The "Co-Evolution of the Geo- and Biosphere" project will be used as the example use case for exhibiting Informatics techniques. The main goal of this project is to use known informatics techniques in diverse disciplines (like mineralogy, petrology, paleobiology, paleotectonics, geochemistry, proteomics, etc.) to discover patterns in the evolution of Earth's environment that exemplify the abovementioned "Co-evolution."

There are vast amounts of data available in each scientific discipline. Not all of them are immediately useable for analysis. The first step is to process the information into a format that can be used for modeling and visualizing the data.

For example, the most commonly used databases in mineralogy are "mindat.org" and "RRUFF.info." Combined, these databases contain data on all the mineral localities on Earth. Along with data on the localities, they also contain data on the different minerals, their chemical composition, age of minerals, and other geologic properties. Since one of the goals is to discover patterns and trends in the evolution of Earth's environment, minerals that most often occurred together need to be observed and analyzed. One of the best ways to represent this co-occurrence information is to create a network visualization where every node represents a mineral and the edges (or connections) imply that two minerals occur at the same locality. To create this visualization, the raw data need to be processed into the required format. Adjacency matrices and edge lists are appropriate formats for a network structure. In Tables 1 and 2, the difference between the "Raw data" and the "Processed Data" can be seen. Once the data have been converted to the required format, the network visualization can be created.
Informatics, Table 1 Raw data – manganese minerals from the website "RRUFF.info"

Names | RRUFF ID | Ideal chemistry | Source | Locality
Akatoreite | R060230 | Mn2+9Al2Si8O24(OH)8 | Michael Scott S100146 | Near mouth of Akatore Creek, Taieri, Otago Province, New Zealand
Akrochordite | R100028 | Mn2+5(AsO4)2(OH)4·4H2O | William W. Pinch | Langban, Filipstad, Varmland, Sweden
Alabandite | R070174 | MnS | Michael Scott S101601 | Mina Preciosa, Sangue de Cristo, Puebla, Mexico
Allactite | R070175 | Mn2+7(AsO4)2(OH)8 | Michael Scott S102971 | Langban, Filipstad, Varmland, Sweden
Allactite | R150120 | Mn2+7(AsO4)2(OH)8 | Steven Kuitems | Sterling Mine, 12000 Level, Ogdensburg, New Jersey, USA
Alleghanyite | R060904 | Mn2+5(SiO4)2(OH)2 | Michael Scott S100995 | Near Bald Knob, Alleghany County, North Carolina, USA
Informatics, Table 2 Processed data – co-occurrence edge list for manganese minerals

Source | Target | Value
Agmantinite | Alabandite | 0.1
Akatoreite | Alabandite | 0.75
Akhtenskite | Alabandite | 0.5
Akrochordite | Alabandite | 0.8
Akrochordite | Allactite | 0.5
Alabandite | Allactite | 0.846153846
Akhtenskite | Alleghanyite | 0.333333333
Akrochordite | Alleghanyite | 0.6
Alabandite | Alleghanyite | 0.632183908
Allactite | Alleghanyite | 0.461538462
Alabandite | Alluaivite | 0.75
Alabandite | Andrianovite | 0.666666667
Alluaivite | Andrianovite | 0.666666667
Alabandite | Ansermetite | 0.666666667
Alleghanyite | Ansermetite | 0.888888889
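As an illustration of this processing step, the sketch below derives a co-occurrence edge list in the spirit of Table 2 from raw locality records like those in Table 1. The records and the weighting scheme (pair counts normalized by the rarer mineral's locality count) are hypothetical assumptions for illustration, not the procedure used to produce the published tables.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical raw records: (mineral name, locality), as might be extracted from RRUFF or mindat
records = [
    ("Akrochordite", "Langban, Sweden"), ("Allactite", "Langban, Sweden"),
    ("Alabandite", "Mina Preciosa, Mexico"), ("Allactite", "Sterling Mine, USA"),
    ("Alabandite", "Sterling Mine, USA"),
]

# Group minerals by locality, then count how often each pair occurs together
localities = defaultdict(set)
for mineral, locality in records:
    localities[locality].add(mineral)

pair_counts = defaultdict(int)
mineral_counts = defaultdict(int)
for minerals in localities.values():
    for m in minerals:
        mineral_counts[m] += 1
    for a, b in combinations(sorted(minerals), 2):
        pair_counts[(a, b)] += 1

# Edge list rows: source, target, normalized co-occurrence value
edges = [(a, b, n / min(mineral_counts[a], mineral_counts[b]))
         for (a, b), n in pair_counts.items()]
for edge in sorted(edges):
    print(edge)
```

An edge list in this form can then be loaded into a network tool to draw the force directed layouts described next.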
It is important that the visualization created communicates the intended information (in this case the Manganese Mineral Environment) to the "audience." For example, a force directed network not only indicates which nodes are connected to each other, it also indicates the most connected nodes (node size) and produces a stable geometric layout in which similar nodes in the network sit in closer proximity to each other; in addition, all the nodes can be grouped on many different properties, thereby also indicating how each group behaves in the network environment. Interactive 2D and 3D Manganese Networks (with 540 mineral nodes) can be found at https://dtdi.carnegiescience.edu/sites/all/themes/bootstrap-d7-theme/networks/Mn/network/Mn_network.html and https://deeptime.tw.rpi.edu/viz/3D_Network/Mn_Network/index.html. These visualizations show a force directed layout with nodes grouped by Oxidation State (an indicator of loss of electrons of an atom in a chemical compound), Paragenetic Mode (the formational conditions of the mineral), or Mineral Age. With these visualizations, the "audience" can explore some complex and hidden patterns of diversity and distribution in the mineral environment.

Conclusion

As digital data continue to increase at an unprecedented rate, the importance of informatics applications increases. It is for this very reason that many universities across the world (especially in the USA) have started multidisciplinary informatics programs. These programs give students the freedom to apply their knowledge of informatics to their field of choice. With "big data" becoming increasingly common in most disciplines, informatics as a domain will not be losing traction anytime soon.

Further Reading

Fourman, M. (2002). Informatics. In International encyclopaedia of information and library science. London, UK: Routledge.
Fox, P. (2011). The rise of informatics as a research domain. In Proceedings of WIRADA Science Symposium, Melbourne (Vol. 15, pp. 125–131).
Fox, P. (2016). Tetherless world constellation. Retrieved 22 Sept 2017, from https://tw.rpi.edu/web/courses/xinformatics/2016.
Ghahramani, Z. (2003). Information theory. In Encyclopedia of cognitive science. London, UK: Nature Publishing Group.
Xinformatics Concept. (2012). Tetherless World Constellation. Retrieved 22 Sept 2017, from https://tw.rpi.edu//web/concept/XinformaticsConcept.

Information Commissioner, United Kingdom

Ece Inan
Girne American University Canterbury, Canterbury, UK

The Information Commissioner's Office (ICO) is the UK's independent public authority responsible for data protection mainly in England, Scotland, Wales, and Northern Ireland; the ICO also has the right to conduct some international duties. The ICO was first set up to uphold information rights by implementing the Data Protection Act 1984. The ICO declared its mission statement as to promote respect for the private lives of individuals and, in particular, for the privacy of their information by implementing the Data Protection Act 1984 and also influencing national and
international thinking on privacy and personal related. In order to comply with the Act, a data
information. controller must comply with the following eight
ICO enforces and oversees all the data protec- principles as “data should be processed fairly and
tion issues by following the Freedom of Informa- lawfully; should be obtained only for specified
tion Act 2000, Environmental Information and lawful purposes; should be adequate, rele-
Regulations 2004, and Privacy and Electronic vant, and not excessive; should be accurate and,
Communications Regulations 2003, and also where necessary, kept up to date; should not be
ICO has some limited responsibilities under the kept longer than is necessary for the purposes for
INSPIRE Regulations 2009, in England, Wales, which it is processed; should be processed in
Northern Ireland, and UK-wide public authorities accordance with the rights of the data subject
based in Scotland. On the other hand, Scotland under the Act; should be appropriate technical
has complementary INSPIRE Regulations and and organisational measures should be taken
its own Scottish Environmental Information Reg- against unauthorised or unlawful processing of
ulations regulated by the Scottish Information personal data and against accidental loss or
Commissioner and the Freedom of Information destruction of, or damage to, personal data; and
(Scotland) Act 2002. should not be transferred to a country or territory
The Information Commissioner is appointed outside the European Economic Area unless that
I
by the Queen and reports directly to Parliament. country or territory ensures an adequate level of
The Commissioner is supported by the manage- protection for the rights and freedoms of data
ment board. The ICO’s headquarter is in subjects in relation to the processing of personal
Wilmslow, Cheshire; in addition to this, three data.”
regional offices in Northern Ireland, Scotland, In 1995, The EU formally adopted the Gen-
and Wales are aimed to provide relevant services eral Directive on Data Protection. In 1997,
where legislation or administrative structure is DUIS, the Data User Information System, was
different. implemented, and the Register of Data Users
Under the Freedom of Information Act, Envi- was published on the internet. In 2000, the
ronmental Information Regulations, INSPIRE majority of the Data Protection Act comes into
Regulations, and associated codes of practice, force. The name of the office was changed from
the functions of the ICOs contain noncriminal the Data Protection Registrar to the Data Protec-
enforcement and assessments of good practice, tion Commissioner. Notification replaced the
providing information to individuals and orga- registration scheme established by the 1984
nizations, taking appropriate action when the Act. Revised regulations implementing the pro-
law an freedom of information is broken, con- visions of the Data Protection Telecommunica-
sidering complaints, disseminating publicity tions Directive 97/66/EC came into effect. In
and encouraging sectoral codes of practice, and January 2001, the office was given the added
taking action to change the behavior of organi- responsibility of the Freedom of Information
zations and individuals that collect, use, and Act and changed its name to the Information
keep personal information. The main aim is to Commissioner’s Office. On 1 January, 2005,
promote data privacy for individuals, for provid- the Freedom of Information Act 2000 was fully
ing this service, the ICO has different tools such implemented. The Act was intended to improve
as criminal prosecution, noncriminal enforce- the public’s understanding of how public author-
ment, and audit. The Information Commissioner ities carry out their duties, why they make the
also has the power to serve a monetary penalty decisions they do, and how they spend their
notice on a data controller and promotes open- money. Placing more information in the public
ness to public. domain would ensure greater transparency and
The Data Protection Act 1984 introduced basic trust and widen participation in policy debate. In
rules of registration for users of data and rights of October 2009, the ICO adopted a new mission
access to that data for the individuals to which it statement: “The ICO’s mission is to uphold
information rights in the public interest, promoting openness by public bodies and data privacy for individuals." In 2011, the ICO launched the "data sharing code of practice" at the House of Commons and was enabled to impose monetary penalties of up to £500,000 for serious breaches of the Privacy and Electronic Communications Regulations.

Cross-References

▶ Open Data

Further Reading

Data Protection Act 1984. http://www.out-law.com/page-413. Accessed Aug 2014.
Data Protection Act 1984. http://www.legislation.gov.uk/ukpga/1984/35/pdfs/ukpga_19840035_en.pdf?view=extent. Accessed Aug 2014.
Smartt, U. (2014). Media & entertainment law (2nd ed.). London: Routledge.

Information Discovery

▶ Data Discovery
▶ Data Processing

Information Engineering

▶ Informatics

Information Extraction

▶ Data Processing

Information Hierarchy

▶ Data-Information-Knowledge-Action Model

Information Overload

Deepak Saxena1 and Sandul Yasobant2
1Indian Institute of Public Health Gandhinagar, Gujarat, India
2Center for Development Research (ZEF), University of Bonn, Bonn, Germany

Background

With the advent of technology, humans are now afforded greater access to information than ever before (Lubowitz and Poehling 2010), and many can access any information irrespective of its relevance. However, evidence indicates that humans have a limited capacity to process and retain new information (Lee et al. 2017; Mayer and Moreno 2003). This capacity is influenced by multiple personal factors such as anxiety (Chae et al. 2016), motivation to learn, and existing knowledge base (Kalyuga et al. 2003). Information overload occurs when the volume or complexity of information accessed by an individual exceeds their capacity to process the information within a given timeframe (Eppler and Mengis 2004; Miller 1956).

History of Information Overload

The term "information overload" has been in existence for more than 2,000 years, yet it has been re-emerging as a new phenomenon in the recent digital world. From the introduction of the printing machine in Europe in the fifteenth century to the current millions of Google searches on the Internet, the problem of information overload has remained a conundrum (Blair 2011).

Definition

Although a user-friendly definition of information overload is still missing, Roetzel (2018) contributed a working definition:

Information overload is a state in which a decision-maker faces a set of information (i.e., an information
load with informational characteristics such as an amount, a complexity, and a level of redundancy, contradiction and inconsistency) comprising the accumulation of individual informational cues of differing size and complexity that inhibit the decision maker's ability to optimally determine the best possible decision. The probability of achieving the best possible decision is defined as decision-making performance. The suboptimal use of information is caused by the limitation of scarce individual resources. A scarce resource can be limited individual characteristics (such as serial processing ability, limited short-term memory) or limited task-related equipment (e.g., time to make a decision, budget).

Information Overload: Double-Edged Sword: Problem or Opportunity?

The simplicity of creating, duplicating, and sharing information online in high volumes has resulted in information overload. The most cited causes of information overload are the existence of multiple sources of information, over-abundance of information, difficulty in managing information, irrelevance/unimportance of the received information, and scarcity of time on the part of information users to analyze and understand information (Eppler and Mengis 2004).

The challenge is how to alleviate the burden of information. As there is no rule of thumb for this, keeping things simple, relevant, clear, and straightforward is a step toward reducing overload. Blair identified four "S's" for managing information overload: "storing, sorting, selecting, and summarizing" (Morrison 2018). One puzzle raised by the issue of information overload is that infinitely increasing both information and the capacity to use that information does not guarantee better decisions leading to desired outcomes. After all, information is often irrelevant, because either people are simply set in their ways, or natural and social systems are too unpredictable, or people's ability to act is somehow restrained. What is required, then, is not just a skill in prioritizing information, but an understanding of when information is not needed. In practice, information overload might prevent the right decision or action from being taken because of its sheer volume; however, with careful use, it can be managed to support the right policy decision.

Specialists agree that, for information users and information professionals alike, achieving information literacy is vital for successfully dealing with information overload (Bruce 2013). Information literacy has been defined by Edmunds and Morris as "a set of abilities requiring individuals to recognize when information is needed and have the ability to locate, evaluate, and use effectively the needed information" (Edmunds and Morris 2000). An information literate person can determine the extent of information needed, access the needed information, evaluate it, and incorporate and use it effectively (Gausul Hoq 2014). The scholarship indicates that, to judiciously use information from various sources for problem-solving, a person should acquire at least a moderate level of information literacy. Admittedly, this is not an easy task, and even the most expert information seekers can be overwhelmed by the huge quantity of information from which to find their required information. However, as one continues acquiring, upgrading, and refining information literacy skills, one will find it easier to deal with information overload in the long run (Benselin and Ragsdell 2016).

Conclusion

The overload of information experienced today, with millions of Google search results returned in a fraction of a second, can surely be a privilege: it massively increases access to the consumption and production of information in the digital age, yet deciding which information to utilize, absorb, and imbibe is difficult. Although information overload creates problems, it has also inspired important solutions for evidence generation. The foregoing discussion has made it clear that the problem of information overload is here to stay and, with a growing focus on research and development in the coming decade, its intensity will only increase. With the advent of new technologies and various techniques of self-publishing, information overload will surely present itself to a worldwide audience in new shapes and
dimensions in the near future. There might be great potential for policymakers to use this information overload in a positive way in the process of evidence-based policy formulation. Although the quality of life is greatly influenced by information overload either way, the ease of accessing information within a fraction of a second needs to be considered as its positive aspect. Ultimately, however, it depends on the user who is accessing this huge amount of information, and on that user's decision capacity and knowledge level, to use it effectively and efficiently.

Further Reading

Benselin, J. C., & Ragsdell, G. (2016). Information overload: The differences that age makes. Journal of Librarianship and Information Science, 48(3), 284–297. https://doi.org/10.1177/0961000614566341.
Blair, A. (2011). Information overload's 2,300-year-old history. Harvard Business Review, 1. Retrieved from https://hbr.org/2011/03/information-overloads-2300-yea.html.
Bruce, C. S. (2013). Information literacy research and practice: An experiential perspective (pp. 11–30). https://doi.org/10.1007/978-3-319-03919-0_2.
Chae, J., Lee, C., & Jensen, J. D. (2016). Correlates of cancer information overload: Focusing on individual ability and motivation. Health Communication, 31(5), 626–634. https://doi.org/10.1080/10410236.2014.986026.
Edmunds, A., & Morris, A. (2000). The problem of information overload in business organisations: A review of the literature. International Journal of Information Management, 20(1), 17–28. https://doi.org/10.1016/S0268-4012(99)00051-1.
Eppler, M. J., & Mengis, J. (2004). The concept of information overload: A review of literature from organization science, accounting, marketing, MIS, and related disciplines. The Information Society, 20(5), 325–344. https://doi.org/10.1080/01972240490507974.
Gausul Hoq, K. M. (2014). Information overload: Causes, consequences and remedies: A study. Philosophy and Progress, LV–LVI, 49–68. https://doi.org/10.3329/pp.v55i1-2.26390.
Kalyuga, S., Ayres, P., Chandler, P., & Sweller, J. (2003). The expertise reversal effect. Educational Psychologist, 38(1), 23–31. https://doi.org/10.1207/S15326985EP3801_4.
Lee, K., Roehrer, E., & Cummings, E. (2017). Information overload in consumers of health-related information. JBI Database of Systematic Reviews and Implementation Reports, 15(10), 2457–2463. https://doi.org/10.11124/JBISRIR-2016-003287.
Levitin, D. J. (2014). The organized mind: Thinking straight in the age of information overload. New York: Dutton. ISBN-13: 978-0525954187.
Lubowitz, J. H., & Poehling, G. G. (2010). Information overload: Technology, the internet, and arthroscopy. Arthroscopy: The Journal of Arthroscopic & Related Surgery, 26(9), 1141–1143. https://doi.org/10.1016/j.arthro.2010.07.003.
Mayer, R. E., & Moreno, R. (2003). Nine ways to reduce cognitive load in multimedia learning. Educational Psychologist, 38(1), 43–52. Retrieved from http://faculty.washington.edu/farkas/WDFR/MayerMoreno9WaysToReduceCognitiveLoad.pdf.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 101. Retrieved from http://spider.apa.org/ftdocs/rev/1994/april/rev1012343.html.
Morrison, R. (2018). Empires of knowledge: Scientific networks in the early modern world (P. Findlen, Ed.). New York: Routledge. https://doi.org/10.4324/9780429461842.
Orman, L. V. (2016). Information overload paradox: Drowning in information, starving for knowledge. North Charleston: CreateSpace Independent Publishing Platform. ISBN-13: 978-1522932666.
Pijpers, G. (2012). Information overload: A system for better managing everyday data. Hoboken: Wiley Online Library. ISBN 9780470625743.
Roetzel, P. G. (2018). Information overload in the information age: A review of the literature from business administration, business psychology, and related disciplines with a bibliometric approach and framework development. Business Research, 1–44. https://doi.org/10.1007/s40685-018-0069-z.
Schultz, T. (2011). The role of the critical review article in alleviating information overload. Annual Reviews, 56. Available from: https://www.annualreviews.org/pb-assets/ar-site/Migrated/Annual_Reviews_WhitePaper_Web_2011-1293402000000.pdf.

Information Quantity

Martin Hilbert
Department of Communication, University of California, Davis, Davis, CA, USA

The question of "how much information" there is in the world goes at least back to the times when Aristotle's student Demetrius (367 BC – ca. 283 BC) was asked to organize the Library of Alexandria in order to collect and quantify "how many thousand books are there" (Aristeas 200AD, sec. 9). Pressed by the exploding number of information and communication technologies (ICT)
during recent decades, several research projects have taken up this question again since the 1960s. They differ considerably in focus, scope, and measurement variable. Some used US$ as a proxy for information (Machlup 1962; Porat 1977), others the number of words (Ito 1981; Pool 1983; Pool et al. 1984); some focused on the national level of a single country (Dienes 1986, 2010), others made broad estimations for the entire world (Gantz and Reinsel 2012; Lesk 1997; Turner et al. 2014); some focused on unique information (Bounie 2003; Lyman et al. 2000), and others on a specific sector of society (Bohn and Short 2009; Short et al. 2011) (for a methodological comparison and overview see Hilbert 2012, 2015a).

The big data revolution has provided much new interest in the idea of quantifying the amount of information in the world. The idea is that an important early step in understanding a phenomenon consists in quantifying it: "when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind" (Lord Kelvin, quoted from Bartlett 1968, p. 723). Understanding the data revolution implies understanding how it grows and evolves.

In this inventory, we mainly follow the methodology of what has become a standard reference in estimating the world's technological information capacity: Hilbert and López (2011). The total technological storage capacity is calculated as the sum of the product of technological devices and their informational storage performance. Technological performance is measured in the installed binary hardware digits, which is then normalized on compression rates. The hardware performance is estimated as "installed capacity" (not the effectively used capacity), which implies that it is assumed that all technological capacities are used to their maximum. For storage this evaluates the maximum available storage space ("as if all storage were filled"). The normalization on software compression rates is important for the creation of meaningful time series, as compression algorithms have made it possible to store ever more information
on the same hardware infrastructure over recent decades (Hilbert 2014a; Hilbert and López 2012a). We normalize on "optimally compressed bits" (as if all content were compressed with the best compression algorithms possible in 2014) (Hilbert and López 2012b). It would also be possible to normalize on a different standard, but the optimal level of compression has a deeper information-theoretic conceptualization, as it approaches the entropy of the source (Shannon 1948). For the estimation of compression rates of different content, justifiable estimates are elaborated for 7-year intervals (1986, 1993, 2000, 2007, 2014). For more see Hilbert (2015b) and López and Hilbert (2012).

For the following result, the estimations for the period 1986–2007 follow Hilbert and López (2011). The update for 2007–2014 follows a mix of estimates, including comparisons with more current updates (Gantz and Reinsel 2012; Turner et al. 2014).

Figure 1 shows that the world's technological capacity to store information had almost reached 5 zettabytes in 2014 (growing from 2.6 exabytes in 1986 to 4.6 zettabytes in 2014). This results in a compound annual growth rate of some 30%, which is about five times faster than the world economy grew during the same period. The digitalization of the world's information stockpile happened in what is a historic blink of an eye: in 1986, less than 1% of the world's mediated information was stored in digital format; by 2014, less than 0.5% was stored in analog media. Some analog storage media are still growing strongly today. For example, it is well known that the long-promised "paperless office" has still not arrived. The usage of paper still grows with some 15% per year (some 2.5 times faster than the economy), but digital storage is growing at twice that speed. The nature of this exponential growth trend leads to the fact that until not too long ago (until the year 2002) the world still stored more information in analog than in digital format. Our estimates determine the year 2002 as the "beginning of the digital age" (over 50% digital).
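A quick check of the growth-rate arithmetic behind the figure cited above (added here as an illustration, not part of the original entry): growing from 2.6 exabytes in 1986 to 4.6 zettabytes in 2014 over 28 years corresponds to a compound annual growth rate of roughly 30%.

```python
# Compound annual growth rate from 2.6 exabytes (1986) to 4.6 zettabytes (2014)
start, end, years = 2.6e18, 4.6e21, 2014 - 1986
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")  # about 30.6% per year
```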
Information Quantity, Fig. 1 World's technological capacity to store information 1986–2014 (log on y-axis). Chart title: "World's technological capacity to store information (in optimally compressed bytes)"; y-axis: log (compressed MB); series: analog, digital, TOTAL; labeled values range from 2.6 exabytes (1986) to 4.6 zettabytes (2014). (Source: Based on the methodology of Hilbert and López (2011), with own estimates for 2007–2014)
It is useful to put these mind-boggling numbers into context. If we would store the 4.6 optimally compressed zettabytes of 2014 on 730 MB CD-ROM discs (of 1.2 mm thickness), we could build about 20 stacks of discs from the earth to the moon. If we would store the information equivalent in alphanumeric symbols in double-printed books of 125 pages, all the world's landmasses could have been covered with one layer of double-printed book paper back in 1986. By 1993 it would have grown to 6 pages, and to 20 pages in the year 2000. By 2007 it would be one layer of books that covers every square centimeter of the world's land masses, two layers by 2010/2011, and some 14 layers by 2014 (letting us literally stand "knee-deep in information"). If we would make piles of these books, we would have about 4,500 piles reaching from the Earth to the sun.

Estimating the amount of the world's technological information capacity is only the first step. It can and has been used as an input variable to investigate a wide variety of social science questions of the data revolution, including its international distribution, which has shown that the digital divide carries over to the data age (Hilbert 2014b, 2016); the changing nature of content, which has shown that the big data age counts with a larger ratio of alphanumeric text over videos than the pre-2000s (Hilbert 2014c); the crucial role of compression algorithms in the data explosion (Hilbert 2014a); and the impact of data capacity on issues like international trade (Abeliansky and Hilbert 2017).

Further Reading

Abeliansky, A. L., & Hilbert, M. (2017). Digital technology and international trade: Is it the quantity of subscriptions or the quality of data speed that matters? Telecommunications Policy, 41(1), 35–48. https://doi.org/10.1016/j.telpol.2016.11.001.
Aristeas. (200AD, ca). The letter of Aristeas to Philocrates. http://www.attalus.org/translate/aristeas1.html.
Bartlett, J. (1968). William Thompson, Lord Kelvin, popular lectures and addresses [1891–1894]. In Bartlett's familiar quotations (14th ed.). Boston: Little Brown & Co.
Bohn, R., & Short, J. (2009). How much information? 2009 report on American consumers. San Diego: Global Information Industry Center of University of California, San Diego.
Bounie, D. (2003). The international production and dissemination of information (Special project on the economics of knowledge, Autorità per le Garanzie nelle Comunicazioni). Paris: École Nationale Supérieure des Télécommunications (ENST).
de S. Pool, I. (1983). Tracking the flow of information. Science, 221(4611), 609–613. https://doi.org/10.1126/science.221.4611.609.
de S. Pool, I., Inose, H., Takasaki, N., & Hurwitz, R. (1984). Communication flows: A census in the United States and Japan. Amsterdam: North-Holland and University of Tokyo Press.
Dienes, I. (1986). Magnitudes of the knowledge stocks and information flows in the Hungarian economy. In Tanulmányok az információgazdaságról (in Hungarian, pp. 89–101). Budapest.
Dienes, I. (2010). Twenty figures illustrating the information household of Hungary between 1945 and 2008 (in Hungarian). http://infostat.hu/publikaciok/10_infhazt.pdf.
Gantz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the Far East. IDC (International Data Corporation) sponsored by EMC.
Hilbert, M. (2012). How to measure "how much information"? Theoretical, methodological, and statistical challenges for the social sciences. International Journal of Communication, 6 (Introduction to Special Section on "How to measure 'How-Much-Information'?"), 1042–1055.
Hilbert, M. (2014a). How much of the global information and communication explosion is driven by more, and how much by better technology? Journal of the Association for Information Science and Technology, 65(4), 856–861. https://doi.org/10.1002/asi.23031.
Hilbert, M. (2014b). Technological information inequality as an incessantly moving target: The redistribution of information and communication capacities between 1986 and 2010. Journal of the Association for Information Science and Technology, 65(4), 821–835. https://doi.org/10.1002/asi.23020.
Hilbert, M. (2014c). What is the content of the world's technologically mediated information and communication capacity: How much text, image, audio, and video? The Information Society, 30(2), 127–143. https://doi.org/10.1080/01972243.2013.873748.
Hilbert, M. (2015a). A review of large-scale 'how much information' inventories: Variations, achievements and challenges. Information Research, 20(4). http://www.informationr.net/ir/20-4/paper688.html.
Hilbert, M. (2015b). Quantifying the data deluge and the data drought (SSRN scholarly paper no. ID 2984851). Rochester: Social Science Research Network. https://papers.ssrn.com/abstract=2984851.
Hilbert, M. (2016). The bad news is that the digital access divide is here to stay: Domestically installed bandwidths among 172 countries for 1986–2014. Telecommunications Policy, 40(6), 567–581. https://doi.org/10.1016/j.telpol.2016.01.006.
Hilbert, M., & López, P. (2011). The world's technological capacity to store, communicate, and compute information. Science, 332(6025), 60–65. https://doi.org/10.1126/science.1200970.
Hilbert, M., & López, P. (2012a). How to measure the world's technological capacity to communicate, store and compute information? Part I: Results and scope. International Journal of Communication, 6, 956–979.
Hilbert, M., & López, P. (2012b). How to measure the world's technological capacity to communicate, store and compute information? Part II: Measurement unit and conclusions. International Journal of Communication, 6, 936–955.
Ito, Y. (1981). The Johoka Shakai approach to the study of communication in Japan. In C. Wilhoit & H. de Bock (Eds.), Mass communication review yearbook (Vol. 2, pp. 671–698). Beverly Hills: Sage.
Lesk, M. (1997). How much information is there in the world? lesk.com. http://www.lesk.com/mlesk/ksg97/ksg.html.
López, P., & Hilbert, M. (2012). Methodological and statistical background on the world's technological capacity to store, communicate, and compute information (online document). http://www.martinhilbert.net/WorldInfoCapacity.html.
Lyman, P., Varian, H. R., Dunn, J., Strygin, A., & Swearingen, K. (2000). How much information 2000. University of California at Berkeley.
Machlup, F. (1962). The production and distribution of knowledge in the United States. Princeton: Princeton University Press.
Porat, M. U. (1977, May). The information economy: Definition and measurement. Superintendent of Documents, U.S. Government Printing Office, Washington, DC. 20402 (Stock No. 003-000-00512-7).
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656. https://doi.org/10.1145/584091.584093.
Short, J., Bohn, R., & Baru, C. (2011). How much information? 2010 report on enterprise server information. San Diego: Global Information Industry Center at the School of International Relations and Pacific Studies, University of California, San Diego. http://hmi.ucsd.edu/howmuchinfo_research_report_consum_2010.php.
Turner, V., Gantz, J., Reinsel, D., & Minton, S. (2014). The digital universe of opportunities: Rich data and the increasing value of the internet of things. IDC (International Data Corporation) sponsored by EMC.

Information Science

▶ Informatics

Information Society

Alison N. Novak
Department of Public Relations and Advertising, Rowan University, Glassboro, NJ, USA

The information age refers to the period of time following the industry growth set forth by the industrial revolution. Although scholars widely debate the start date of this time period, it is often noted that the information age co-occurred with the building and growth in popularity of the Internet. The information age refers to the increasing access, quantification, and collection of digital data, often referred to as big datasets.
Edward Tenner writes that the information age is often called a new age in society because it simultaneously addresses the increasing digital connections between citizens across large distances. Tenner concludes that the information age is largely technology about technology. This suggests that many of the advancements connected to the information age are technologies that assist our understanding of, and connections through, other technologies. These include the expansion of the World Wide Web, mobile phones, and GPS devices. The expansion in these technologies has facilitated the ability to connect digitally, collect data, and analyze larger societal trends.

Similarly, the collection and analysis of big datasets was facilitated by many of these information age technologies. The Internet, social networking sites, and GPS systems allow researchers, industry professionals, and government agencies to seamlessly collect data from users to later be analyzed and interpreted. The information age, through the popularization and development of many of these technologies, ushered in a new age of big data research.

Big data in the information age took shape through large, often quantifiable groups of information about individual users, groups, and organizations. As users input data into information age technologies, these platforms collected and stored the data for later use. Because the information age elevated the importance and societal value of being digitally connected, users entered large amounts of personal data into these technologies in exchange for digital presence.

John Pavolotsky notes it is for this reason that privacy rose as a central issue in the information age. As users provided data to these technology platforms, legal and ethical issues over who owns the data, who has the right to sell or use the data, and what rights to privacy users have became critical. It is for this reason that further technologies (such as secure networks) needed to be developed to encourage safety among big data platforms.

As Pavolotsky evidences, the information age is more than just a period in time; it also reshaped values, priorities, and the legal structure of global society. Being connected digitally encouraged more people to purchase personal technologies such as laptops and phones to participate. Further, this change in values similarly altered the demand for high-speed information. Because digital technologies during this period of time encouraged more connections between individuals in the network, information such as current events and trends spread faster than before. This is why the information age is alternatively called a networked society.

Morris and Shin add that the information age changed the public's orientation toward publicly sharing information with a large, diverse, and unknown audience. While concerns of privacy grew during the information age, so did the ability to share and document previously private thoughts, behaviors, and texts. This was not just typical of users but also of media institutions and news organizations. What is and is not considered public information became challenged in an era when previously hidden actions were now freely documented and shared through new technologies such as social networking sites. The effect this user and institutional sharing has had on mass society is still heavily debated. However, it did mean that new behaviors previously not shared or documented in datasets were now freely available to those archiving big datasets and analyzing these technologies.

The information age is also centrally related to changes in the global economy, jobs, and development industries. Croissant, Rhoades, and Slaughter suggest that the changes occurring during the information age encouraged learning institutions to focus students toward careers in science, technology, engineering, and mathematics (popularly known as STEM). This focus was because of the rapid expansion in technology and the creation of many new companies and organizations dedicated to expanding the digital commercial front. These new organizations were termed Web 1.0 companies because
of their focus on turning the new values of the information age into valuable commodities. Many of these companies used big datasets collected from user-generated information to target their campaigns and create personalized advertising.

In addition, the information age also affected the structure of banking, financial exchanges, and the global market. As companies expanded their reach using new digital technologies, outsourcing and allocating resources to distant regions became a new norm. Because instantaneous communication across large spaces was now possible and encouraged by the shift in public values, it became easy to maintain control of satellite operations abroad.

The shift to an information society is largely related to the technologies that facilitated big dataset collection and analysis. Although the exact dates of the information society are still debated, the proliferation of social media sites and other digital spaces supports the view that the information age is ongoing, thus continuing to support the emergence and advancement of big data research.

Cross-References

▶ Mobile Analytics
▶ Network Analytics
▶ Network Data
▶ Privacy
▶ Social Media

Further Reading

Croissant, J. L., Rhoades, G., & Slaughter, S. (2001). Universities in the information age: Changing work, information, and values in academic science and engineering. Bulletin of Science, Technology & Society, 21(1), 108–118.
Morris, S., & Shin, H. S. (2002). Social value of public information. American Economic Review, 92(5), 1521–1534.
Pavolotsky, J. (2013). Privacy in the age of big data. The Business Lawyer, 69(1), 217–225.
Tenner, E. (1992). Information age at the National Museum of American History. Technology and Culture, 33(4), 780–787.

Information Studies

▶ Informatics

Information Systems

▶ Informatics

Information Theory

▶ Informatics

Information Visualisation

▶ Data Visualization

Information Visualization

▶ Data Visualization
▶ Visualization

Informatique

▶ Informatics

Instrument Board

▶ Dashboard
Integrated Data System

Ting Zhang
Department of Accounting, Finance and Economics, Merrick School of Business, University of Baltimore, Baltimore, MD, USA

Definition/Introduction

Integrated Data Systems (IDS) typically link individual-level administrative records collected by multiple agencies such as K–12 schools, community colleges, other colleges and universities, departments of labor, justice, human resources, human and health services, police, housing, and community services. The systems can be used for a quick knowledge-to-practice development cycle (Actionable Intelligence for Social Policy 2017), case management, program or service monitoring, tracking, and evaluation (National Neighborhood Indicators Partnership 2017), research and policy analysis, strategic planning and performance management, and so on. They can also help evaluate how different programs, services, and policies affect individual persons or individual geographic units. The linkages between different agency records are often made through a common individual personal identification number, a shared case number, or a geographic unit.

Purpose of an IDS

With the rising attraction of big data and the exploding need to share existing data, the need to link various already collected administrative records rises. The systems allow government agencies to integrate various databases and bridge the gaps that have traditionally formed within individual agency databases. An IDS can be used for a quick knowledge-to-practice development cycle to better address the often interconnected needs of citizens efficiently and effectively (Actionable Intelligence for Social Policy 2017); for case management (National Neighborhood Indicators Partnership 2017); for program or service monitoring, tracking, and evaluation; for developing and testing an intervention and monitoring the outcomes (Davis et al. 2014); and for research and policy analysis, strategic planning and performance management, and so on. It can test social policy innovations through high-speed, low-cost randomized control trials and quasi-experimental approaches, can be used for continuous quality improvement efforts and benefit-cost analysis, and can also help provide a complete account of how different programs, services, and policies affect individual persons or individual geographic units to more efficiently and effectively address the often interconnected needs of the citizens (Actionable Intelligence for Social Policy 2017).

Key Elements to Build an IDS

According to Davis et al. (2014) and Zhang and Stevens (2012), typical crucial factors related to a successful IDS include:

• A broad and steady institutional commitment to administrate the system
• Individual-level data (whether on individual persons or individual geographic units) to measure outcomes
• The necessary data infrastructure
• Linkable data fields, such as Social Security numbers, business identifiers, shared case numbers, and addresses
• The capacity to match various administrative records
• A favorable state interpretation of the data privacy requirements, consistent with federal regulations
• The funding, knowhow, and analytical capacity to work with and maintain the data
• Successfully obtaining participation from multiple data-providing agencies, with clearance to use those data.

Maintenance

Administrative data records are typically collected by public and private agencies. An IDS
often requires extracting, transforming, cleaning, and linking information from various source administrative databases and loading it into a data warehouse. Many data warehouses offer a tightly coupled architecture, so that it usually takes little time to resolve queries and extract information (Widom 1995).
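A rough sketch of this extract-transform-load step is shown below; the table layout, field names, and in-memory SQLite back end are illustrative assumptions for this entry, not a prescribed IDS design, and the records are fabricated placeholders.

```python
import sqlite3

# Hypothetical extracts from two source agencies, already cleaned to a common schema
education_records = [("123-45-6789", "1998-02-14", "HS diploma 2016")]
workforce_records = [("123-45-6789", "1998-02-14", "employed Q3 2017")]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE person_events
                     (person_id TEXT, birth_date TEXT, source TEXT, event TEXT)""")

# Load both sources into one warehouse table, tagging each row with its origin agency
warehouse.executemany("INSERT INTO person_events VALUES (?, ?, 'education', ?)", education_records)
warehouse.executemany("INSERT INTO person_events VALUES (?, ?, 'workforce', ?)", workforce_records)

# A linked view of one individual's records across agencies
for row in warehouse.execute(
        "SELECT source, event FROM person_events WHERE person_id = ? ORDER BY source",
        ("123-45-6789",)):
    print(row)
```

The linkage here rests on a shared identifier being present and valid in both extracts, which is precisely the assumption the next section questions.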
Challenges

Identity Management and Data Quality
One challenge in building an IDS is to have effective and appropriate individual record identity management diagnostics that include consideration of the consequences of gaps in common identifier availability and accuracy. This is the first key step for the data quality of IDS information. However, some of the relevant databases, particularly student records, do not include a universally linkable personal identifier, that is, a Social Security number; some databases are unable to ensure that a known-to-be-valid Social Security number is paired with one individual, and only that individual, consistently over time; and some databases are unable to ensure that each individual is associated with only one Social Security number over time (Zhang and Stevens 2012). Zhang and Stevens (2012) included an ongoing collection of case studies documenting how SSNs can be extracted, validated, and securely stored offline. With the established algorithms required for electronic financial transactions, the spreading adoption of electronic medical records, and rising interest in big data, there is an extensive, and rapidly growing, literature illustrating probabilistic matching solutions and various software designs to address the identity management challenge. Often the required accuracy threshold is application specific; assurance of an exact match may not be required for some anticipated longitudinal data system uses (Zhang and Stevens 2012).
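The sketch below illustrates, under simplified and purely hypothetical rules, the kind of matching logic such literature describes: exact agreement on a validated SSN is accepted outright, while records lacking one are scored on name and date-of-birth similarity against a tunable threshold. The weights, threshold, and sample records are assumptions for illustration; production matching engines are considerably more sophisticated.

```python
from difflib import SequenceMatcher

def match_score(rec_a, rec_b):
    """Crude record-linkage score in [0, 1] for dicts with 'ssn', 'name', and 'dob' keys."""
    if rec_a.get("ssn") and rec_a.get("ssn") == rec_b.get("ssn"):
        return 1.0  # exact identifier match
    name_sim = SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    dob_sim = 1.0 if rec_a["dob"] == rec_b["dob"] else 0.0
    return 0.6 * name_sim + 0.4 * dob_sim  # illustrative weights only

education = {"ssn": None, "name": "Jane Q. Doe", "dob": "1998-02-14"}
workforce = {"ssn": None, "name": "Doe, Jane Q", "dob": "1998-02-14"}

THRESHOLD = 0.8  # the required accuracy threshold is application specific
score = match_score(education, workforce)
print(score, score >= THRESHOLD)
```

With these toy inputs, the difference in name ordering keeps the score below the threshold, which is exactly the kind of sensitivity that motivates name standardization and validated identifiers before matching.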
important roles in safeguarding data and data use.
Data Privacy
To build and use an IDS, issues related to privacy
of personal information within the system is Examples
important. Many government agencies have rele-
vant regulations. For example, a nationally wide- Example of IDS in the United States include:
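As a rough illustration of the probabilistic matching solutions mentioned under identity management above, the following sketch scores how likely two administrative records are to describe the same person by combining field-level similarities. It is a simplified, hypothetical example: the field names, weights, and acceptance threshold are invented for illustration, it relies only on the Python standard library, and real IDS record linkage uses far more sophisticated and carefully validated methods.

```python
# Hypothetical probabilistic record-matching sketch; field names, weights,
# and the threshold are illustrative, not an actual IDS implementation.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude 0-1 similarity between two field values."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted evidence that two records refer to the same individual."""
    weights = {"name": 0.5, "dob": 0.3, "zip": 0.2}
    return sum(w * similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in weights.items())

student_record = {"name": "Jane Q. Doe", "dob": "1990-03-14", "zip": "21201"}
wage_record = {"name": "Jane Q Doe", "dob": "1990-03-14", "zip": "21201"}

THRESHOLD = 0.9  # application specific, as noted above
score = match_score(student_record, wage_record)
print(f"score={score:.2f}, match={score >= THRESHOLD}")
```

In practice, the weights and threshold would be calibrated against labeled pairs of records for the specific linkage task, and matching would draw on many more fields than shown here.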
Examples

Examples of IDS in the United States include:

Chapin Hall's Planning for Human Service Reform Using Integrated Administrative Data
Jacob France Institute's database for education, employment, human resources, and human services
Juvenile Justice and Child Welfare Data Crossover Youth Multi-Site Research Study
Actionable Intelligence for Social Policy's integrated data systems initiatives for policy analysis and program reform
Florida's Common Education Data Standards (CEDS) Workforce Workgroup and the later Florida Education & Training Placement Information Program
Louisiana Workforce Longitudinal Data System (WLDS) housed at the Louisiana Workforce Commission
Minnesota's iSEEK data, managed by an organization called iSEEK Solutions
Heldrich Center data at Rutgers University
Ohio State University's workforce longitudinal administrative database
University of Texas Ray Marshall Center database
Virginia Longitudinal Data System
Washington's Career Bridge, managed by the Workforce Training and Education Coordinating Board
Connecticut's Preschool through Twenty and Workforce Information Network (P-20 WIN)
Delaware Department of Education's Education Insight Dashboard
Georgia Department of Education's Statewide Longitudinal Data System and Georgia's Academic and Workforce Analysis and Research Data System (GA AWARDS)
Illinois Longitudinal Data System
Indiana Network of Knowledge (INK)
Maryland Longitudinal Data System
Missouri Comprehensive Data System
Ohio Longitudinal Data Archive (OLDA)
South Carolina Longitudinal Information Center for Education (SLICE)
Texas Public Education Information Resource (TPEIR) and Texas Education Research Center (ERC)
Washington P-20W Statewide Longitudinal Data System

Conclusion

Integrated Data Systems (IDS) typically link individual-level administrative records collected by multiple agencies. The systems can be used for case management, program or service monitoring, tracking and evaluation, research and policy analysis, etc. A successful IDS often requires a broad and steady institutional commitment to administer the system, individual-level data, the necessary data infrastructure, linkable data fields, capacity and know-how to match various administrative records and maintain them, data access permission, and data privacy procedures. Main challenges in building a sustainable IDS include identity management, data quality, data privacy, ethics, data sharing, and data security. There are many IDS in the United States.

Further Reading

Actionable Intelligence for Social Policy. (2017). Integrated Data Systems (IDS). Retrieved in March 2017 from the World Wide Web at https://www.aisp.upenn.edu/integrated-data-systems/.
Davis, S., Jacobson, L., & Wandner, S. (2014). Using workforce data quality initiative databases to develop and improve consumer report card systems. Washington, DC: Impaq International.
National Neighborhood Indicators Partnership. (2017). Resources on Integrated Data Systems (IDS). Retrieved in March 2017 from the World Wide Web at http://www.neighborhoodindicators.org/resources-integrated-data-systems-ids.
U.S. Department of Education. (2017). Family Educational Rights and Privacy Act (FERPA). Retrieved on May 14, 2017 from the World Wide Web https://ed.gov/policy/gen/guid/fpco/ferpa/index.html.
U.S. Department of Health & Human Services. (2017). Summary of the HIPAA Security Rule. Retrieved on May 14, 2017 from the World Wide Web https://www.hhs.gov/hipaa/for-professionals/security/laws-regulations/.
Widom, J. (1995). Research problems in data warehousing. CIKM '95 Proceedings of the fourth international conference on information and knowledge management (pp. 25–30). Baltimore.
Zhang, T., & Stevens, D. (2012). Integrated data system person identification: Accuracy requirements and methods. Jacob France Institute. Available at SSRN: https://ssrn.com/abstract=2512590 or https://doi.org/10.2139/ssrn.2512590 and http://www.workforcedqc.org/sites/default/files/images/JFI%20wdqi%20research%20report%20January%202014.pdf.
Intelligent Agents

▶ Artificial Intelligence

Intelligent Transportation Systems (ITS)

Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Overview

In the last half-century, digital technologies have transformed the surface transportation sector – that is, highway, rail, and public transport. It is in this context that the concept of Intelligent Transportation Systems (ITS) transpired. ITS generally pertains to the use of advanced technologies and real-time information for monitoring, managing, and controlling surface transportation modes, services, and systems. The first generation of ITS, referred to as Intelligent Vehicle Highway Systems (IVHS), was focused primarily on applying Information and Communications Technology (e.g., computers, robotics, and control software) to highway systems for improving mobility, safety, and productivity. With new and expanding sources of big data, coupled with advancements in analytical and computational capabilities and capacities, we are on the cusp of another technological revolution in surface transportation. This latest phase of ITS is more sophisticated, integrated, and broader in scope and purpose than before. However, while state-of-the-art ITS applications promise to benefit society in many ways, they also come with various technical, institutional, ethical, legal, and informational issues, challenges, and complexities.

Opportunities, Prospects, and Applications

Intelligent Transportation Systems (ITS) support various functions and activities (see Table 1). Different types and sources of big data, along with big data analytics, help support the goals and objectives of each of these systems and applications.

Intelligent Transportation Systems (ITS), Table 1 ITS uses and applications (Application – Aims and objectives)
Traffic and travel information – To provide continuous and reliable traffic and travel data and information for transportation producers and consumers
Traffic and public transport management – To improve traffic management in cities and regions for intelligent traffic signal control, incident detection and management, lane control, speed limits enforcement, etc.
Navigation services – To provide route guidance to transportation users
Smart ticketing and pricing – To administer and collect transportation fees for the pricing of transport services, based on congestion, emissions, or some other consideration, and to facilitate "smart ticketing" systems
Safety and security – To reduce the number and severity of accidents and other safety issues
Freight transport and logistics – To gather, store, analyze, and provide access to cargo data for helping freight operators to make better decisions
Intelligent mobility and co-modality services – To provide real-time information and analysis to transportation users for facilitating trip planning and management
Transportation automation (smart and connected vehicles) – To enable fully or partially automated movement of vehicles or fleets of vehicles
Source: Adapted from Giannopoulos et al. (2012)

Novel sources of big data create fresh opportunities for the management, control, and provision of transportation and mobility. First, the
proliferation of GPS-equipped (location-enabled) devices, such as mobile phones, RFID tags, and smart cards, enables real-time and geographically precise tracking of people, information, animals, and goods. Second, data produced by crowdsourcing platforms, Web 2.0 "apps," and social media – actively and passively – help to facilitate an understanding of the transportation needs, preferences, and attitudes of individuals, organizations, firms, and communities. Third, satellites, drones, and other aerial sensors offer an ongoing and detailed view of our natural and built environment, enabling us to better understand the factors that not only affect transportation (e.g., weather conditions) but that are also affected by transportation (e.g., land use patterns, pollution). Fourth, the Internet of Things (IoT), which comprises billions of interconnected sensors tied to the Internet, combined with Cyber-Physical Systems, is monitoring various aspects and elements of transportation systems and operations for anomaly detection and system control. Fifth, transportation automation, which comes in various forms – from automated vehicles (AVs) to drones and droids – is producing vast amounts of streaming data on traffic and road conditions and other aspects of the environment. Lastly, video cameras and other modes of surveillance, which have become ubiquitous, are contributing to a massive collection of dynamic, unstructured data, which provides new resources for monitoring transportation systems and their use.

Big data analytics, tools, and techniques also are playing a vital role in ITS, particularly for mining and analyzing big data to understand and anticipate issues and problems (e.g., accidents, bottlenecks) and ultimately to develop the intelligence needed to act and intervene efficiently and appropriately to enhance transportation systems. Deep neural learning – a powerful form of machine learning that mimics aspects of information processing in the human brain – is running behind the scenes literally everywhere to optimize and control transportation modes, systems, and services. Specifically, deep learning is being used for transportation performance evaluation, traffic and congestion prediction, avoidance of incidents and accidents, vehicle identification, traffic signal optimization, ridesharing, public transport, and visual recognition tasks, among others. New developments and breakthroughs in Natural Language Processing (NLP) (including sentiment analysis) and image, video, and audio processing facilitate the analysis and mining of unstructured big data, such as that produced by social media feeds, news stories, and video surveillance cameras. Innovations in network analysis and visualization tools and algorithms, along with improvements in computational and storage capacity, now enable large, complex, and dynamic networks (e.g., logistics, critical infrastructure) to be tracked and analyzed in real time. Finally, cloud robotics, which combines cloud computing and machine learning (e.g., reinforcement learning), is the "brain" behind automated systems – for example, autonomous vehicles – enabling them to learn from their environment and from each other to adapt and respond in an optimal way.

In ITS, technologies are also crucial for facilitating the storage, communication, and dissemination of data and information within and across organizations and to travelers and other transportation consumers. New and emerging technological systems and platforms are leading to various innovations in this regard. For example, methods for transmitting data in both public and commercial settings have evolved from wired systems to wireless networks supported by cloud platforms. Modes of disseminating messages to the public (e.g., advisory statements) have shifted from static traffic signage and radio and television broadcasting to intelligent Variable Message Signs (VMS), mobile applications, and in-vehicle information services. Blockchain technology is just beginning to replace traditional database systems, particularly for vetting, archiving, securing, and sharing information on transportation transactions and activities, such as those tied to logistics and supply chains and mobility services.
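To give a flavor of the congestion-prediction applications of deep learning noted above, the sketch below trains a small recurrent network to forecast the next 5-minute average speed at a loop detector from the previous hour of readings. It is a minimal, hypothetical illustration only: it assumes TensorFlow/Keras is installed, uses simulated data, and does not represent any specific deployed ITS model.

```python
# Minimal congestion-prediction sketch: an LSTM forecasting the next
# 5-minute average speed from the previous hour of synthetic readings.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

rng = np.random.default_rng(0)
# Synthetic stand-in for loop-detector speeds (km/h) at 5-minute intervals.
speeds = 60 + 10 * np.sin(np.linspace(0, 60, 2000)) + rng.normal(0, 2, 2000)

LOOKBACK = 12  # one hour of 5-minute readings
X = np.array([speeds[i:i + LOOKBACK] for i in range(len(speeds) - LOOKBACK)])
y = speeds[LOOKBACK:]
X = X[..., np.newaxis]  # shape: (samples, timesteps, features)

model = Sequential([
    LSTM(32, input_shape=(LOOKBACK, 1)),
    Dense(1),  # predicted speed for the next interval
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)

next_speed = model.predict(X[-1:], verbose=0)[0, 0]
print(f"Predicted next 5-minute speed: {next_speed:.1f} km/h")
```

Comparable models in practice are trained on large archives of detector, probe-vehicle, or GPS data and are combined with the control and dissemination technologies described above.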
Issues, Challenges, and Complexities

While big data (and big data analytics) bring many benefits for transportation and mobility, their use in this context also raises an array of ethical, legal, and social downsides and dangers. One serious issue, in particular, is algorithmic bias and discrimination, a problem in which the outcomes of machine learning models and decisions based on them have the potential to disadvantage and even harm certain segments of the population. One source of this problem is the data used for training, testing, and validating such models. Algorithms learn from the real world; accordingly, if there are societal gaps and disparities reflected in the data in the first place, then the output of machine learning and its application and use for decision-making may reinforce or even perpetuate inequalities and inequities. This problem is particularly relevant for ITS, as people, organizations, and places are not affected or benefited by transportation in the same way. For example, people of color, indigenous populations, women, and the poor generally have lower mobility levels and fewer transportation options than others. Some of these same groups and communities are disproportionately negatively impacted by transportation externalities, such as noise, air, and water pollution. Privacy is another matter. Many of the big data sources and enabling technologies used in Intelligent Transportation Systems contain sensitive information on the activities of individuals and companies in time and space. For instance, ITS applications that rely on photo enforcement have the potential for privacy infringement, given their active role in tracking and identifying vehicles. Another set of related concerns stems from the increasing presence and involvement of technology companies in designing, implementing, and managing ITS in public spaces. The private sector's goals and objectives tend to be incongruent with those of the public sector, where the former is generally interested in maximizing profit and rate-of-return and the latter in enhancing societal welfare. Accordingly, the collection and use of big data, along with big data algorithms, have the potential to reflect the interests and motivations – and values – of companies rather than those of the public-at-large. This situation also compounds knowledge and power asymmetries and imbalances between the public and private sectors, where information on transportation systems in the public sphere is increasingly in the hands of commercial entities, rather than planners and government managers.

The use of big data in ITS applications raises various technical and informational challenges. Ensuring the interoperability of vehicles, mobile devices, infrastructure, operations centers, and other ITS elements poses significant challenges, particularly given that new technologies and information systems comprise many moving parts that require careful integration and coupling in real time. Other challenges relate to assessing how accurately big data captures different aspects of transportation systems and integrating big data with conventional data sources (e.g., traffic counts and census records).

Conclusion

Nearly every aspect of our lives depends critically on transportation and mobility. Transportation systems are vital for the production, consumption, distribution, and exchange of goods and services and, accordingly, are critical drivers of economic growth, development, and prosperity. Moreover, as a means for accessing opportunities and activities, such as healthcare, shopping, and entertainment, transportation is a social determinant of health and well-being. In these regards, new and advancing ITS applications, supported by big data, big data analytics, and emerging technologies, help maximize the full potential of surface transportation. At the same time, policies, standards, and practices, including ethical and legal frameworks, are needed to ensure that the benefits of ITS are equitably distributed within and across communities and that no one is disproportionately negatively impacted by transportation innovation.

Cross-References

▶ Cell Phone Data
▶ Mobile Analytics
▶ Sensor Technologies
▶ Smart Cities
▶ Supply Chain and Big Data
▶ Transportation Visualization
Further Reading

Chen, Z., & Schintler, L. A. (2015). Sensitivity of location-sharing services data: Evidence from American travel pattern. Transportation, 42(4), 669–682.
Fries, R. N., Gahrooei, M. R., Chowdhury, M., & Conway, A. J. (2012). Meeting privacy challenges while advancing intelligent transportation systems. Transportation Research Part C: Emerging Technologies, 25, 34–45.
Giannopoulos, G., Mitsakis, E., Salanova, J. M., Dilara, P., Bonnel, P., & Punzo, V. (2012). Overview of Intelligent Transport Systems (ITS) developments in and across transport modes. JRC Scientific and Policy Reports, 1–34.
Haghighat, A. K., Ravichandra-Mouli, V., Chakraborty, P., Esfandiari, Y., Arabi, S., & Sharma, A. (2020). Applications of deep learning in intelligent transportation systems. Journal of Big Data Analytics in Transportation, 2(2), 115–145.
Schintler, L. A., & McNeely, C. L. (2020). Mobilizing a culture of health in the era of smart transportation and automation. World Medical & Health Policy, 12(2), 137–162.
Sumalee, A., & Ho, H. W. (2018). Smarter and more connected: Future intelligent transportation system. IATSS Research, 42(2), 67–71.
Zhang, J., Wang, F. Y., Wang, K., Lin, W. H., Xu, X., & Chen, C. (2011). Data-driven intelligent transportation systems: A survey. IEEE Transactions on Intelligent Transportation Systems, 12(4), 1624–1639.

Interactive Data Visualization

Andreas Veglis
School of Journalism and Mass Communication, Aristotle University of Thessaloniki, Thessaloniki, Greece

Definition

Data visualization is a modern branch of descriptive statistics that involves the creation and study of the visual representation of data. It is the graphical display of abstract information for data analysis and communication purposes. Static data visualization offers only precomposed "views" of data. Interactive data visualization supports multiple static views in order to present a variety of perspectives on the same information. Important stories include "hidden" data, and interactive data visualization is the appropriate means to discover, understand, and present these stories. In interactive data visualization there is a user input (a control of some aspect of the visual representation of information), and the changes made by the user must be incorporated into the visualization in a timely manner. Interactive visualizations are based on existing sets of data, and obviously this subject is strongly related to the issue of big data. Data visualization is the best method for transforming chunks of data into meaningful information (Ward et al. 2015).

History

Although people have been using tables in order to arrange data since the second century BC, the idea of representing quantitative information graphically first appeared in the seventeenth century. Rene Descartes, a French philosopher and mathematician, proposed a two-dimensional coordinate system for displaying values, consisting of a horizontal axis for one variable and a vertical axis for another, primarily as a graphical means of performing mathematical operations. In the eighteenth century William Playfair began to exploit the potential of graphics for the communication of quantitative information, by developing many of the graphs that are commonly used today. He was the first to employ a line moving up and down as it progressed from left to right to show how values changed through time. He invented the bar graph, as well as the pie chart. In the 1960s Jacques Bertin proposed that visual perception operates according to rules that can be followed to express information visually in ways that represent it intuitively, clearly, accurately, and efficiently. Also, John Tukey, a statistics professor, set the basis of exploratory data analysis by demonstrating the power of data visualization as a means for exploring and making sense of quantitative data (Few 2013).

In 1983, Edward Tufte published his groundbreaking book "The Visual Display of Quantitative Information," in which he distinguished between the effective ways of displaying data visually and the ways that most people are doing
it without much success. Also around this time, William Cleveland extended and refined data visualization techniques for statisticians. At the end of the century, the term information visualization was proposed. In 1999, Stuart Card, Jock Mackinlay, and Ben Shneiderman published their book entitled "Readings in Information Visualization: Using Vision to Think." Moving to the twenty-first century, Colin Ware published two books entitled "Information Visualization: Perception for Design" (2004) and "Visual Thinking for Design" (2008), in which he compiled, organized, and explained what we have learned from several scientific disciplines about visual thinking and cognition and applied that knowledge to data visualization (Few 2013).

Since the turn of the twenty-first century, data visualization has been popularized, and it has reached the masses through commercial software products that are distributed through the web. Many of these data visualization products promote more superficially appealing esthetics and neglect useful and effective data exploration, sense-making, and communication. Nevertheless, there are a few serious contenders that offer products which help users fulfill data visualization potential in practical and powerful ways.

From Static to Interactive

Visualization can be categorized into static and interactive. In the case of static visualization, there is only one view of data, and on many occasions, multiple views are needed in order to fully understand the available information. Also, the number of dimensions of data is limited. Thus representing multidimensional datasets fairly in static images is almost impossible. Static visualization is ideal when alternate views are neither needed nor desired and is especially suited to a static medium (e.g., print) (Knaffic 2015). It is worth mentioning that infographics are also part of static visualization. Infographics (or information graphics) are graphic visual representations of data or knowledge, which are able to present complex information quickly and clearly. Infographics have been used for many years, and recently the availability of many easy-to-use free tools has made the creation of infographics available to every Internet user (Murray 2013).

Of course, static visualizations can also be published on the World Wide Web in order to be disseminated more easily and rapidly. Publishing on the web is considered to be the quickest way to reach a global audience. An online visualization is accessible by any Internet user who employs a recent web browser, regardless of the operating system (Windows, Mac, Linux, etc.) and device type (laptop, desktop, smartphone, tablet). But the true capabilities of the web are being exploited in the case of interactive data visualization.

Dynamic, interactive visualizations can empower people to explore data on their own. The basic functions of most interactive visualization tools were set back in 1996, when Ben Shneiderman proposed a "Visual Information-Seeking Mantra" (overview first, zoom and filter, and then details on demand). These functions allow data to be accessible to every user, from the one who is just browsing or exploring the dataset to the one who approaches the visualization with a specific question in mind. This design pattern is the basic guide for every interactive visualization today.

An interactive visualization should initially offer an overview of the data, but it must also include tools for discovering details. Thus it will be able to facilitate different audiences, from those who are new to the subject to those who are already deeply familiar with the data. Interactive visualization may also include animated transitions and well-crafted interfaces in order to engage the audience with the subject it covers.

User Control

In the case of interactive data visualization, users interact with the visualization by introducing a number of input types. Users can zoom in on a particular part of an existing visualization, pinpoint an area that interests them, select an option from an offered list, choose a path, and input numbers or text that customize the visualization. All the previously mentioned input types can be accomplished
by using keyboard, mice, touch screens, and other more specialized input devices. With the help of these input actions, users can control both the information being represented on the graph and the way that the information is being presented. In the second case, the visualization is usually part of a feedback loop. In most cases the actual information remains the same, but the representation of the information does change. One other important parameter in interactive data visualizations is the time it takes for the visualization to be updated after the user has introduced an input. A delay of more than 20 ms is noticeable by most people. The problem is that when large amounts of data are involved, this immediate rendering is impossible.

Interactive framerate is a term that is often used to measure the frequency with which a visualization system generates an image. When the rapid response time required for interactive visualization is not feasible, there are several approaches that have been explored in order to provide people with rapid visual feedback based on their input. These approaches include:

Parallel rendering: in this case the image is being rendered simultaneously by two or more computers (or video cards). Different frames are being rendered at the same time by different computers, and the results are transferred over the network for display on the user's computer.
Progressive rendering: in this case a framerate is guaranteed by rendering some subset of the information to be presented. It also provides progressive improvements to the rendering when the visualization is no longer changing.
Level-of-detail (LOD) rendering: in this case simplified representations of information are rendered in order to achieve the desired frame rate, while a user is providing input. When the user has finished manipulating the visualization, then the full representation is used in order to generate a still image.
Frameless rendering: in this type of rendering, the visualization is not presented as a time series of images. Instead a single image is generated where different regions are updated over time.
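The following minimal sketch illustrates the interaction loop described above, with an overview of the full dataset, a control that filters it, and details retrieved on demand, using Python's matplotlib and synthetic data. The data and widget choices are hypothetical; web-based libraries such as D3.js (listed in the Tools section below) are more typical for published interactive visualizations.

```python
# A minimal sketch of "overview first, zoom and filter, details on demand"
# using matplotlib widgets and synthetic data; illustrative only.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 500)   # e.g., one indicator per record
y = rng.normal(50, 15, 500)    # e.g., a second indicator

fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.2)
points = ax.scatter(x, y, s=15, picker=True)  # overview of all records
ax.set_xlabel("Indicator A")
ax.set_ylabel("Indicator B")

# Zoom and filter: hide records whose Indicator A is below the slider value.
slider_ax = fig.add_axes([0.15, 0.06, 0.7, 0.03])
min_a = Slider(slider_ax, "Min A", 0, 100, valinit=0)

def update(_val):
    points.set_sizes(np.where(x >= min_a.val, 15, 0))
    fig.canvas.draw_idle()

min_a.on_changed(update)

# Details on demand: clicking a point prints its underlying values.
def on_pick(event):
    for i in event.ind:
        print(f"record {i}: A={x[i]:.1f}, B={y[i]:.1f}")

fig.canvas.mpl_connect("pick_event", on_pick)
plt.show()
```

The same pattern scales up in production tools, where filtering and detail retrieval are typically pushed to a server or database so that only the currently relevant subset of a large dataset is rendered.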
Types of Interactive Data Visualizations

Information, and more specifically statistical information, is abstract, since it describes things that are not physical. It can concern education, sales, diseases, and various other things. But everything can be displayed visually, if a way is found to give it a suitable form. The transformation of the abstract into a physical representation can only succeed if we understand a bit about visual perception and cognition. In other words, in order to visualize data effectively, one must follow design principles that are derived from an understanding of human perception.

Heer, Bostock, and Ogievetsky (2010) defined the types (and also their subcategories) of data visualization:

(i) Time series data (index charts, stacked graphs, small multiples, horizon graphs)
(ii) Statistical distributions (stem-and-leaf plots, Q-Q plots, scatter plot matrix (SPLOM), parallel coordinates)
(iii) Maps (flow maps, choropleth maps, graduated symbol maps, cartograms)
(iv) Hierarchies (node-link diagrams, adjacency diagrams, enclosure diagrams)
(v) Networks (force-directed layout, arc diagrams, matrix views)

Tools

There are a lot of tools that can be used for creating interactive data visualizations. All of them are either free or offer a free version (alongside a paid version that includes more features). According to datavisualization.ch, the list of the tools that most users employ includes: Arbor.js, CartoDB, Chroma.js, Circos, Cola.js, ColorBrewer, Cubism.js, Cytoscape, D3.js, Dance.js, Data.js, DataWrangler, Degrafa, Envision.js, Flare, GeoCommons, Gephi, Google Chart Tools, Google Fusion Tables, I Want Hue, JavaScript InfoVis Toolkit, Kartograph, Leaflet, Many Eyes, MapBox, Miso, Modest Maps, Mr. Data Converter, Mr. Nester, NVD3.js, NodeBox, OpenRefine, Paper.js, Peity, Polymaps, Prefuse, Processing, Processing.js, Protovis, Quadrigram, R, Raphael, Raw, Recline.js, Rickshaw, SVG Crowbar, Sigma.js, Tableau Public, Tabula, Tangle, Timeline.js, Unfolding, Vega, Visage, and ZingCharts.

Conclusion

Data visualization is a significant discipline that is expected to become even more important as we, as a society, gradually move into the era of big data. Especially in the case of interactive data visualization, data analysts can turn complex data into meaningful information that can be searched, explored, and understood by end users.

Cross-References

▶ Business Intelligence
▶ Tableau Software
▶ Visualization

Further Reading

Few, S. (2013). Data visualization for human perception. In S. Mads & D. R. Friis (Eds.), The encyclopedia of human-computer interaction (2nd ed.). Aarhus: The Interaction Design Foundation. http://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception. Accessed 12 July 2016.
Heer, J., Bostock, M., & Ogievetsky, V. (2010). A tour through the visualization zoo. Communications of the ACM, 53(6), 59–67.
Knaffic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. Hoboken, New Jersey: John Wiley & Sons Inc.
Murray, S. (2013). Interactive data visualization for the web. Sebastopol, CA: O'Reilly Media, Inc.
Ward, M., Grinstein, G., & Keim, D. (2015). Interactive data visualization: Foundations, techniques, and applications. Boca Raton, FL: CRC Press, Taylor & Francis Group.

International Development

Jon Schmid
Georgia Institute of Technology, Atlanta, GA, USA

Big data can affect international development in two primary ways. First, big data can enhance our understanding of underdevelopment by expanding the evidence base available to researchers, donors, and governments. Second, big data-enabled applications can affect international development directly by facilitating economic behavior, monitoring local conditions, and improving governance. The following sections will look first at the role of big data in increasing our understanding of international development and then look at examples where big data has been used to improve the lives of the world's poor.

Big Data in International Development Research

Data quality and data availability tend to be low in developing countries. In Kenya, for example, poverty data was last collected in 2005, and income surveys in other parts of sub-Saharan Africa often take up to 3 years to be tabulated. When national income-accounting methodologies were updated in Ghana (2010) and Nigeria (2014), GDP calculations had to be revised upward by 63% and 89%, respectively. Poor-quality or stale data prevent national policy makers and donors from making informed policy decisions.

Big data analytics has the potential to ameliorate this problem by providing alternative methods for collecting data. For example, big data applications may provide a novel means by which national economic statistics are calculated. The Billion Prices Project – started by researchers at the Massachusetts Institute of Technology – uses daily price data from hundreds of online retailers to calculate changes in price levels.
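The core arithmetic behind such a daily online-price index can be sketched as follows. The products and prices below are hypothetical, and this is not the Billion Prices Project's actual methodology or code; it simply chains a geometric mean of day-over-day price relatives into an index.

```python
# A minimal sketch of a daily online-price index: a geometric mean of
# day-over-day price relatives, chained into an index series.
# Prices are hypothetical placeholders for scraped retailer data.
from math import prod

prices = {
    "rice_1kg":   [1.00, 1.02, 1.05],
    "milk_1l":    [0.80, 0.80, 0.82],
    "bread_500g": [0.50, 0.51, 0.51],
}

def daily_index(prices: dict, base: float = 100.0) -> list:
    """Chain geometric means of daily price relatives into an index."""
    days = len(next(iter(prices.values())))
    index = [base]
    for t in range(1, days):
        relatives = [series[t] / series[t - 1] for series in prices.values()]
        index.append(index[-1] * prod(relatives) ** (1 / len(relatives)))
    return index

print([round(v, 2) for v in daily_index(prices)])
```

Each day's index value is the previous day's value scaled by the geometric mean of that day's price relatives across the sampled products.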
In countries where inflation data is unavailable – or in cases such as Argentina where official data is unreliable – these data offer a way of calculating national statistics that does not require a high-quality national statistics agency.

Data from mobile devices is a particularly rich source of data in the developing world. Roughly 20% of mobile subscriptions are held by individuals who earn less than $5 a day. Besides emitting geospatial, call, and SMS data, mobile devices are increasingly being used in the developing world to perform a broad array of economic functions such as banking and making purchases. In many African countries (nine in 2014), more people have online mobile money accounts than have traditional bank accounts. Mobile money services such as M-Pesa and MTN Money produce trace data and thus offer intriguing possibilities for increasing understanding of spending and saving behavior in the developing world. As the functionality provided by mobile money services extends into loans, money transfers from abroad, cash withdrawal, and the purchase of goods, the data yielded by these platforms will become even richer.

The data produced by mobile devices has already been used to glean insights into complex economic or social systems in the developing world. In many cases, the insights into local economic conditions that result from the analysis of mobile device data can be produced more quickly than national statistics. For example, in Indonesia the UN Global Pulse monitored tweets about the price of rice and found them to be highly correlated with national spikes in food prices. The same study found that tweets could be used to identify trends in other types of economic behavior such as borrowing. Similarly, research by Nathan Eagle has shown that reductions in additional airtime purchases are associated with falls in income. Researchers Han Wang and Liam Kilmartin examined Call Detail Record (CDR) data generated from mobile devices in Uganda and identified differences in the way that wealthy and poor individuals respond to price discounts. The researchers also used the data to identify centers of economic activity within Uganda.

Besides providing insight into how individuals respond to price changes, big data analytics allows researchers to explore the complex ways in which the economic lives of the poor are organized. Researchers at Harvard's Engineering Social Systems lab have used mobile phone data to explore the behavior of inhabitants of slums in Kenya. In particular, the authors tested theories of rural-to-urban migration against spatial data emitted by mobile devices. Some of the same researchers have used mobile data to examine the role of social networks in economic development and found that diversity in individuals' network relationships is associated with greater economic development. Such research supports the contention that insular networks – i.e., highly clustered networks with few ties to outside nodes – may limit the economic opportunities that are available to members.

Big data analytics are also being used to enhance understanding of international development assistance. In 2009, the College of William and Mary, Brigham Young University, and Development Gateway created AidData (aiddata.org), a website that aggregates data on development projects to facilitate project coordination and provide researchers with a centralized source for development data. AidData also maps development projects geospatially and links donor-funded projects to feedback from the project's beneficiaries.

Big Data in Practice

Besides expanding the evidence base available to international development scholars and practitioners, large data sets and big data analytic techniques have played a direct role in promoting international development. Here the term "development" is considered in its broad sense as referring not to a mere increase in income, but to improvements in variables such as health and governance.

The impact of infectious diseases on developing countries can be devastating. Besides the obvious humanitarian toll of outbreaks, infectious diseases prevent the accumulation of human capital and strain local resources. Thus there is great
potential for big data-enabled applications to enhance epidemiological understanding, mitigate transmission, and allow for geographically targeted relief. Indeed, it is in the tracking of health outcomes that the utility of big data analytics in the developing world has been most obvious. For example, Amy Wesolowski and colleagues used mobile phone data from 15 million individuals in Kenya to understand the relationship between human movement and malaria transmission. Similarly, after noting in 2008 that search trends could be used to track flu outbreaks, researchers at Google.org have used data on searches for symptoms to predict outbreaks of the dengue virus in Brazil, Indonesia, and India. In Haiti, researchers from Columbia University and the Karolinska Institute used SIM card data to track the dispersal of people following a cholera outbreak. Finally, the Centers for Disease Control and Prevention used mobile phone data to direct resources to appropriate areas during the 2014 Ebola outbreak.

Big data applications may also prove useful in improving and monitoring aspects of governance in developing countries. In Kenya, India, and Pakistan, witnesses of public corruption can report the incident online or via text message to a service called "I Paid A Bribe." The provincial government in Punjab, Pakistan, has created a citizens' feedback model, whereby citizens are solicited for feedback regarding the quality of government services they received via automated calls and texts. In an effort to discourage absenteeism in India and Pakistan, certain government officials are provided with cell phones and required to text geocoded pictures of themselves at jobsites. These mobile government initiatives have created a rich source of data that can be used to improve government service delivery, reduce corruption, and more efficiently allocate resources.

Applications that exploit data from social media have also proved useful in monitoring elections in sub-Saharan Africa. For example, Aggie, a social media tracking software designed to monitor elections, has been used to monitor elections in Liberia (2011), Ghana (2012), Kenya (2013), and Nigeria (2011 and 2014). The Aggie system is first fed with a list of predetermined keywords, which are established by local subject matter experts. The software then crawls social media feeds – Twitter, Facebook, Google+, Ushahidi, and RSS – and generates real-time trend visualizations based on keyword matches. The reports are monitored by a local Social Media Tracking Center, which identifies instances of violence or election irregularities. Flagged incidents are passed on to members of the election commission, police, or other relevant stakeholders.

The history of international economic development initiatives is fraught with would-be panaceas that failed to deliver. White elephants – large-scale capital investment projects for which the social surplus is negative – are strewn across poor countries as reminders of the preferred development strategies of the past. While more recent approaches to reducing poverty that have focused on improving institutions and governance within poor countries may produce positive development effects, the history of development policy suggests that optimism should be tempered. The same caution holds in regard to the potential role of big data in international economic development. Martin Hilbert's 2016 systematic review article rigorously enumerates both the causes for optimism and reasons for concern. While big data may assist in understanding the nature of poverty or lead to direct improvements in health or governance outcomes, the availability and ability to process large data sets are not a panacea.

Cross-References

▶ Economics
▶ Epidemiology
▶ International Development
▶ World Bank

Further Reading

Hilbert, M. (2016). Big data for development: A review of promises and challenges. Development Policy Review, 34(1), 135–174.
Wang, H., & Kilmartin, L. (2014). Comparing rural and urban social and economic behavior in Uganda:
Insights from mobile voice service usage. Journal of Urban Technology, 21(2), 61–89.
Wesolowski, A., et al. (2012). Quantifying the impact of human mobility on malaria. Science, 338(6104), 267–270.
World Economic Forum. (2012). Big data, big impact: New possibilities for international development. Cologny/Geneva, Switzerland: World Economic Forum. http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf.

International Labor Organization

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

Every day, people across the world in both developed and developing economies are creating an ever-growing ocean of digital data. This "big data" represents a new resource for international organizations, with the potential to revolutionize the way policies, programs, and projects are generated. The International Labour Organization (ILO) is no exception to this and has begun to discuss and engage with the potential uses of big data to contribute to its agenda.

Focus

The ILO, founded in 1919 in the wake of the First World War, became the first specialized agency of the United Nations. It focuses on labor issues including child labor, collective bargaining, corporate social responsibility, disability, domestic workers, forced labor, gender equality, informal economy, international labor migration, international labor standards, labor inspection, microfinance, minimum wages, rural development, and youth employment. By 2013 the ILO had 185 members (of the 193 member states of the United Nations). Among its multifarious activities, it is widely known for its creation of Conventions and Recommendations (189 and 203, respectively, by 2014) related to labor market standards.

Where Conventions are ratified, come into force, and are therefore legally binding, they create a legal obligation for ratifying nations. For many Conventions, even in countries where they are not ratified, they are often adopted and interpreted as the international labor standard. There have been many important milestones created by the ILO to shape the landscape to encourage the promotion of improved working lives globally, although a significant milestone is often considered to be the 1998 Declaration on the Fundamental Principles and Rights to Work, which had four key components: the right of workers to associate freely and collectively, the end of forced and compulsory labor, the end of child labor, and the end of unfair discrimination among workers. ILO members have an obligation to work toward these objectives and respect the principles which are embedded in the Conventions.

Decent Work Agenda

The ILO believes that work plays a crucial role in the well-being of workers and families and therefore the broader social and economic development of individuals, communities, and societies. While the ILO works on many issues related to employment, its key agenda, which has dominated activities in recent decades, is "decent work."

"Decent work" refers to an aspiration for people to have work that is productive, provides a fair income with security and social protection, safeguards basic rights, and offers equal opportunities and treatment, opportunities for personal development, and a voice in society. "Decent work" is central to efforts to reduce poverty and is a path to achieving equitable, inclusive, and sustainable development; ultimately it is seen as a feature which underpins peace and security in communities and societies (ILO 2014a).

The "decent work" concept was formulated by the ILO in order to identify the key priorities to
focus their efforts. "Decent work" is designed to reflect priorities on the social, economic, and political agenda of countries as well as the international system. In a relatively short time, this concept has formed an international consensus among governments, employers, workers, and civil society that decent work is central to equitable globalization, a path to reducing poverty as well as to inclusive and sustainable development. The overall goal of "decent work" is to instigate positive change in and for people at all spatial scales.

Putting the decent work agenda into practice is achieved through the implementation of the ILO's four strategic objectives, with gender equality as a crosscutting objective:

1. Creating jobs to foster an economy that generates opportunities for investment, entrepreneurship, skills development, job creation, and sustainable livelihoods.
2. Guaranteeing rights at work in order to obtain recognition for work achieved as well as respect for the rights of all workers.
3. Extending social protection to promote both inclusion and productivity of all workers. This is to be enacted by ensuring both women and men experience safe working conditions, allowing free time, taking into account family and social values and situations, and providing compensation where necessary in the case of lost or reduced income.
4. Promoting social dialogue by involving both workers and employers in organizations in order to increase productivity, avoid disputes and conflicts at work, and more broadly build cohesive societies.

ILO Data

The ILO produces research on important labor market trends and issues to inform constituents, policy makers, and the public about the realities of employment in today's modern globalized economy and the issues facing workers and employers in countries at all development stages. In order to do so, it draws on data from a wide variety of sources.

The ILO is a major provider of statistics, as these are seen as important tools to monitor progress toward labor standards. In addition to the maintenance of key databases (ILO 2014b) such as LABOURSTA, it also publishes compilations of labor statistics, such as the Key Indicators of Labour Markets (KILM), a comprehensive database of country-level data for key indicators in the labor market which is used as a research tool for labor market information. Other databases include ILOSTAT, a series of databases with labor-related data; NATLEX, which includes legislation related to labor markets, social security, and human rights; and NORMLEX, which brings together ILO labor standards and national labor and security laws (ILO 2014c). The ILO database provides a range of datasets with annual labor market statistics including over 100 indicators worldwide, covering annual indicators as well as short-term indicators, estimates and projections of total population, and labor force participation rates.

Statistics are vital for the development and evaluation of labor policies, as well as more broadly to assess progress toward key ILO objectives. The ILO supports member states in the collection and dissemination of reliable and recent data on labor markets. While the data produced by the ILO are both wide ranging and widely used, they are not considered by most to be "big data," and this has been recognized.

ILO, Big Data, and the Gender Data Gap

In October 2014, a joint ILO-Data2X roundtable event held in Switzerland identified the importance of developing innovative approaches to the better use of technology, including big data, in particular where it can be sourced and where innovations can be made in survey technology. This event, which brought together representatives from national statistics offices, key international and regional organizations, and nongovernmental organizations, was organized to discuss where there were gender data gaps, particularly focusing on informal and unpaid work as well as agriculture. These discussions
were sparked by wider UN discussions about the data revolution and the importance of development data in the post-2015 development agenda. It is recognized that big data (including administrative data) can be used to strengthen the existing collection of gender statistics, but there need to be more efforts to find new and innovative ways to work with new data sources to meet a growing demand for more up-to-date (and frequently updating) data on gender and employment (United Nations, 2013). The fundamental goal of the discussion was to improve gender data collection, which can then be used to guide policy and inform the post-2015 development agenda, and here big data is acknowledged as a key component. At this meeting, four types of gender data gaps were identified: coverage across countries and/or regular country production, international standards to allow comparability, complexity, and granularity (sizeable and detailed datasets allowing disaggregation by demographic and other characteristics). Furthermore, a series of big data types that have the potential to increase the collection of gender data were identified:

• Mobile phone records: for example, mobile phone use and recharge patterns could be used as indicators of women's socioeconomic welfare or mobility patterns.
• Financial patterns: exploring engagement with financial systems.
• Online activity: for example, Google searches or Twitter activity, which might be used to gain insights into women's maternal health, cultural attitudes, or political engagement.
• Sensing technologies: for example, satellite data, which might be used to examine agricultural productivity, access to healthcare, and education services.
• Crowdsourcing: for example, disseminating apps to gain views about different elements of societies.

A primary objective of this meeting was to highlight that existing gender data gaps are large, and often reflect traditional societal norms, and that no data (or poor data) can have significant development consequences. Big data here has the potential to transform the understanding of women's participation in work and communities. Crucially, it was posited that while better data is needed to monitor the status of women in informal employment conditions, it is not necessarily important to focus on trying to extract more data but to make an impact with the data that is available to try and improve wider social, economic, and environmental conditions.

ILO, the UN, and Big Data

The aforementioned meeting represented one example of where the ILO has engaged with other stakeholders to not only acknowledge the importance of big data but begin to consider potential options for its use with respect to their agendas. However, as a UN agency, it partakes in wider discussion with the UN regarding the importance of big data, as was seen in the 45th session of the UN Statistical Commission in March 2014, where the report of the Secretary-General on "big data and the modernization of statistical systems" was discussed (United Nations, 2014).

The report makes reference to the UN "Global Pulse," an initiative on big data established in 2009 which included a vision of a future where big data is utilized safely and responsibly. Its mission was to accelerate the adoption of big data innovation. Partnering with UN agencies such as the ILO, governments, academics, and the private sector, it sought to achieve a critical mass of implemented innovation and strengthen the adoption of big data as a tool to foster the transformation of societies.

There is a recognition that the national statistical system is essentially now subject to competition from other actors producing data outside of their system, and there is a need for the data collection of national statistics to adjust in order to make use of the mountain of data now being produced almost continuously (and often automatically).
To make use of big data, a shift may be required from the traditional survey-oriented collection of data to a more secondary data-focused orientation drawing on data sources that are high in volume, velocity, and variety. Increasing demand from policy makers for real-time evidence, in combination with declining response rates to national household and business surveys, means that organizations like the ILO will have to acknowledge the need to make this shift. There are a number of different sources of big data which may be potentially useful for the ILO: sources from administration, e.g., bank records; commercial and transaction data, e.g., credit card transactions; sensor data, e.g., satellite images or road sensors; tracking devices, e.g., mobile phone data; behavioral data, e.g., online searches; and opinion data, e.g., social media. Official statistics like those presented in ILO databases often rely on administrative data, and these are traditionally produced in a highly structured manner, which can in turn limit their use. If administrative data were collected in real time, or on a more frequent basis, then they would have the potential to become "big data."

There are, however, a number of challenges related to the use of big data which face the UN, its agencies, and national statistical services alike:

• Legislative: in many countries, there will not be legislation in place to enable the access to, and use of, big data, particularly from the private sector.
• Privacy: a dialogue will be required in order to gain public trust around the use of data.
• Financial: related to the costs of accessing data.
• Management: policies and directives to ensure management and protection of data.
• Methodological: data quality, representativeness, and volatility are all issues which present potential barriers to the widespread use of big data.
• Technological: the nature of big data, particularly the volume in which it is often created, means that some countries would need enhanced information technology.

An assessment of the use of big data for official statistics carried out by the UN indicates that there are good examples where it has been used, for example, using transactional, tracking, and sensor data. However, in many cases, a key implication is that statistical systems and IT infrastructures need to be enhanced in order to be able to support the storage and processing of big data as it accumulates over time.

Modern society has witnessed an explosion of the quantity and diversity of real-time information, known more commonly as big data, presenting a potential paradigm shift in the way official statistics are collected and analyzed. In the context of increased demand for statistical information, organizations recognize that big data has the potential to generate new statistical products in a timelier manner than traditional official statistical sources. The ILO, alongside a broader UN agenda to acknowledge the data revolution, recognizes the potential for future uses of big data at the global level, although there is a need for further investigation of the data sources, the challenges and areas of use of big data, and its potential contribution to efforts working toward the "better work" agenda.

Cross-References

▶ United Nations Educational, Scientific and Cultural Organization (UNESCO)

Further Reading

International Labour Organization. (2014a). Key indicators of the labour market. International Labour Organization. http://www.ilo.org/empelm/what/WCMS_114240/lang--en/index.htm. Accessed 10 Sep 2014.
International Labour Organization. (2014b). ILO databases. International Labour Organization. http://www.ilo.org/public/english/support/lib/resource/ilodatabases.htm. Accessed 1 Oct 2014.
International Labour Organization. (2014c). ILOSTAT database. International Labour Organization. http://www.ilo.org/ilostat/faces/home/statisticaldata?_afrLoop=342428603909745. Accessed 10 Sep 2014.
United Nations. (2013). Big data and modernization of statistical systems. Report of the Secretary-General. United Nations Economic and Social Council. Available at: http://unstats.un.org/unsd/statcom/doc14/2014-11-BigData-E.pdf. Accessed 1 Dec 2014.
United Nations. (2014). UN global pulse. United Nations. Available at: http://www.unglobalpulse.org/. Accessed 10 Sep 2014.
International Nongovernmental Organizations (INGOs)

Lázaro M. Bacallao-Pino
University of Zaragoza, Zaragoza, Spain
National Autonomous University of Mexico, Mexico City, Mexico

In general terms, international nongovernmental organizations (INGOs) refer to private international organizations that are focused on solving various societal problems, often in developing countries. For example, INGOs might operate to provide access to basic services for the poor and to promote their interests, to provide relief to people who are suffering from disasters, or to work toward environmental protection and community development. INGOs have been included in what has been defined as the global civil society, sharing the same missions as other nongovernmental organizations (NGOs), but with an international scope. INGOs typically have outposts in countries around the world, aimed at ameliorating a variety of problems.

INGOs have grown in number and have taken on increasingly important roles, especially in the post-World War II era, to the extent that they have been considered central to, and engines for, issues such as the global expansion of human rights, increasing environmental concerns, climate change, and sustainable development. The importance of these global civil society actors has been recognized by a range of international actors and stakeholders. In fact, the United Nations (UN) has created mechanisms and rules for INGO participation in international conferences, arranging for consultations and clarifying their roles and functions as part of the international community. Across the board, INGOs are increasingly employing massive amounts of data in virtually all areas of concern to improve their work and decision-making.

Emergence and Main Characteristics of INGOs

The growth of INGOs has been explained based on various theoretical approaches. On the one hand, some perspectives offer "top-down" approaches, arguing that the rise of INGOs is associated with the degree of a country's integration in world polity and the international economy. On the other hand, "bottom-up" perspectives underline the evolution of democracy and the success of domestic economies as significant factors facilitating the growth of INGOs within certain countries. However, other approaches explain the rise of INGOs by taking into account a complex articulation of both economic and political factors and at two simultaneous levels of analysis: national and global.

The rising importance and presence of INGOs in the international policy arena over the second half of the twentieth century, and particularly since the 1990s, has been associated with factors such as the proliferation of complex humanitarian emergencies during the post-Cold War era, the divides produced by the withdrawal of state service provision as a result of the neoliberal privatization of public services, the failures of the schemes of government-to-government aid, the ineffectiveness and waste associated with the action of multilateral organizations, the growing distrust of politics and governments, and/or the emergence of evermore complex and challenging global problems. The convergence of these tendencies has created a space for the action of INGOs, with capacities and network structures in line with the emergent global-local scenario. As evidence of that importance, in February 1993, the UN Economic and Social Council (ECOSOC) set up an open-ended working group (OEWG) to review and update its agreements for consultation with NGOs and to establish consistent rules for the participation of these entities in international conferences organized by the UN.

Initially, INGOs were generally small and worked in particular places of the world, maintaining close relationships with target
International Nongovernmental Organizations (INGOs) 591

beneficiaries. Later, they gained a positive reputa- countries, while carrying out their activities in
tion with donors and developing countries based developing countries, operating across national
on their actions. They also expanded, developing borders and not identifying themselves as domes-
larger programs and activities, covering more tic actors.
technical and geographical areas. Significant However, INGOs are differentiated from other
expansion of INGOs took place in the 1980s, third sector actors, e.g., civil society organiza-
with funding peaking in contexts where donor tions, based on aspects such as the INGO articu-
conditions were relatively less rigorous. In such lation in global consortia and their extensive
contexts, many INGOs put in practice processes global programmatic reach and the international
of decentralization, increasing their networks by arena in which they operate; their size and scope
setting up regional or national offices in different being much larger in terms of budgets, staffs, or
countries. operations; their greater organizational capacities
INGOs are often defined in contrast to interna- and broader range of partnerships; and their
tional governmental organizations (IGOs). higher profile, derived from the professionalism,
INGOs have been described as any international credibility, and legitimacy that donors and the
organizations that are not established by inter- public associate with their actions. It is in this
governmental agreements. Not constituted by regard also that INGOs are making bigger com-
I
states and not having structures of mitments and investments in big data collection
decision-making controlled by states are a defin- and use. As mentioned, INGO activities include a
ing characteristic of INGOs, although they may wide range of issues, from humanitarian and
have contacts in governmental institutions or, as development assistance to human rights, gender,
often happens, receive funding from states. environment, poverty, education, research, advo-
Although sometimes described as apolitical in cacy, and international relief. To a great degree,
character and as separate from political parties, INGOs are considered as spokespersons for
this does not mean that INGOs cannot take polit- global civil society on such themes since they
ical stand. In fact, their actions can have important become important spaces for collective action, as
political implications at both international and well as resources for participation in the global
domestic levels in a number of relevant issues – public sphere, contributing to the emergence and
such as human rights – by, for instance, development of a global civic culture that includes
recommending certain policies and political related issues.
actions. In this regard, they have been discussed
in terms of “soft laws,” taking place and facilitat-
ing the participation of non-state actors such as INGO Nonprofit Nature and Debates on
INGOs in policy processes, influencing what has Sources of Funding
traditionally been framed as exclusive nation-state
domains. A defining characteristic of INGOs is their not-
In this sense, INGOs have been included for-profit nature. This is particularly important as
within the global “third sector.” The third sector it is related to the autonomy of INGOs. Sources of
refers to those entities that lie between the market funding of INGOs can include individual donors
and the state, separate from governmental struc- who become members or partners of the organi-
tures and private enterprises. INGOs work outside zations and philanthropy from private funds –
both the global economy, a space dominated by such as private foundations – as well as donations
transnational corporations and other financial from official development assistance programs
institutions, and the global interstate system, con- provided by developed countries, churches and
figured by IGOs and typically centered around the religious groups, artists, or some commercial
UN. INGOs often are headquartered in developed activities, e.g., fair trade initiatives. Some claims
592 International Nongovernmental Organizations (INGOs)

indicate that their nonprofit designations mean turned on a more general issue: the relationships
that INGOs may not conduct any operations that between INGOs and governments. Of particular
generate some private benefits. However, for concern is rising bilateralization in the sense that
example, INGOs are frequently professional orga- an increasing amount of funding flows are
nizations that have to, for instance, provide sala- directed toward specific countries and for partic-
ries to their employees and also sometimes ular purposes, while unrestricted funding has
participate in marketing campaigns to support decreased. This tendency points not only to the
their actions and agendas. Hence, their nonprofit matter of INGOs’ autonomy but also to ideologi-
nature only means that INGOs are differentiated cal debates on the interaction between govern-
from other private organizational actors, such as ments and INGOs, summarized in two opposite
enterprises and corporations, because they are not positions: on the one hand, those who consider
explicitly for-profit organizations. that the private nonprofit sector is the best mech-
Big data analytics are being used to identify anism for addressing social and economic needs,
and track funding sources and donors. In that separating governments and INGOs, and, on the
respect, INGO fundraising efforts have engaged other hand, those who defend a strong welfare
big data to highlight aspects such as trends in state, possibly minimizing the explicit need for
individual giving that have decreased among nonprofit organizations depending on how they
younger generations or the preference of large provide their services.
private funding sources to seek similar large
INGOs for partnership. Related analyses also
underscore the consequences of dependency on INGOs for Development Cooperation,
government funds from official development Humanitarian Aid, and Human Rights
assistance programs, which can be reduced as
part of budgetary cuts during economic crises or Three of the main areas of INGO action have been
political shifts. Besides dependence on fickle development cooperation, humanitarian aid, and
donor funds, whether public or private, possible human rights. Some of the largest INGOs are
shortages in organizational autonomy – conceptu- focused on one or more of these issues, and
alized as the decision-making capacity of numerous approaches consider that the action of
INGOs – have been noted due to funding source. these organizations – and, in general, of non-
That is, INGO operational and managerial auton- governmental agents – has been the engine for
omy may be constrained depending on funding the global expansion and increasing importance
source. Constraints associated with funding can of those topics, especially during the second half
include factors such as evaluation and perfor- of the twentieth century.
mance controls, audit requirements, and various INGOs focused on humanitarian aid have
rules, regulations, and conditionalities. However, played a relevant role in large-scale humanitarian
as a correlate of influence on INGOs exerted projects around the world, particularly during the
through funding and as an example of the last decades. They have provided emergency
abovementioned roles and impacts of INGOs’ relief to millions of people and delivered impor-
actions, they also can influence their funding tant amounts of international humanitarian aid,
sources through strategies such as exerting influ- assisting refugees, displaced persons, and persons
ence on the design and implementation of pro- living in conflict zones and scenarios of humani-
grams, contract negotiations, and revenue tarian crises or emergency, and long-term medical
diversification or even by not applying for or care to vulnerable populations. The actions of
accepting funds from certain sources that would these INGOs are focused on aspects such as relief
constrain their autonomy. and rehabilitation, humanitarian mine action, and
From a more complex point of view, many post-conflict recovery. Many of them also act in
debates on funding and INGO autonomy have capacity building and cooperation between
International Nongovernmental Organizations (INGOs) 593

authorities at different levels and implement activ- corruption, and/or criminal justice abuses. By
ities in areas such as housing and small-scale conducting campaigns on these themes, INGOs
infrastructure, income generation through grants have drawn attention to human rights, mobilizing
and micro-finance, food security and agricultural public opinion and pressuring governments to
rehabilitation and development, networking and observe human rights in general.
capacity development, and advocating for equal
access to healthcare worldwide. Among others,
some important humanitarian INGOs are the Dan- Conclusion
ish Refugee Council, CARE International, and
Médecins Sans Frontières (MSF). In all of these situations, INGOs collect and utilize
INGOs also have become increasingly relevant massive amounts of data for relevant planning,
in the international development arena, with a implementing, monitoring, and accountability
rising amount of aid to developing countries, activities. INGOs (and NGOs more generally)
with budgets that, in the case of particularly are putting data to effective use to measure and
large INGOs, have even surpassed those of some increase their impact, cut costs, identify and man-
donor developed countries. Although INGOs age donors, and track progress.
involved in development cooperation assume
I
diverse roles, there are significant similarities in
their goals. Among their most frequent objectives Cross-References
are reducing poverty and inequality and the real-
ization of rights, mainly for marginalized groups; ▶ Human Resources
the promotion of gender equality and social jus- ▶ International Development
tice; the reinforcement of civil society and prac-
tices of democratic governance; and the protection
of the environment. An increasing trend is to Further Reading
include research and learning processes as part
of their strategies of action and as sources of Boli, J., & Thomas, J. M. (1999). Constructing world
data for establishing a more consolidated evi- culture: International Nongovernmental Organizations
since 1875. Redwood City: Stanford University Press.
dence base for both program experience and Hobe, S. (1997). Global challenges to statehood: The
knowledge and policy influence. To mention increasingly important role of nongovernmental orga-
only a few, some of the largest INGOs involved nizations. Indiana Journal of Global Legal Studies,
in development cooperation are BRAC, World 5(1), 191–209.
McNeely, C. L. (1995). Constructing the nation-state:
Vision International, Oxfam International, and International organization and prescriptive action.
Acumen Fund. Westport: Greenwood Press.
Finally, the promotion and defense of human Otto, D. (1996). Nongovernmental organizations in the
rights have been a particularly important action United Nations system: The emerging role of interna-
tional civil society. Human Rights Quarterly, 18(1),
area for INGOs, becoming significant spaces for 107–141.
participation in the global human rights move- Plakkot, V. (2015). 7 NGOs that are using data for impact
ment. Human rights INGOs, such as Amnesty and why you should use it too. https://blog.socialcops.
International and Human Rights Watch, have com/intelligence/best-practices/7-ngos-using-data-for-
impact.
been key agents in promoting human rights and Powell, W. W., & Steinberg, R. (2006). The nonprofit
in making known human rights violations. These sector: A research handbook. New Haven: Yale Uni-
INGOs oppose violations of rights such as free- versity Press.
dom of religion or discrimination on the basis of Tsutsui, K., & Min Wotipka, C. (2004). Global civil society
and the international human rights movement: Citizen
sexual orientation, denouncing infringements participation in human rights International Non-
related to gender discrimination, torture, military governmental Organizations. Social Forces, 83(2),
use of children, freedom of the press, political 587–620.
594 Internet Association, The

eBay, Yelp, IAC, Uber Technologies Inc,


Internet Association, The Expedia, and Netflix. As part of both their purpose
and mission statements, the Internet Association
David Cristian Morar believes that the decentralized architecture of the
Schar School of Policy and Government, George Internet, which it vows to protect, is what led it to
Mason University, Fairfax, VA, USA become one of the world’s most important engines
for growth, economically and otherwise. The
Association’s representational role, also referred
Synonyms to as a lobbying, is portrayed as not simply an
annex of Silicon Valley but as a voice of its com-
Internet Lobby; Internet Trade Association; munity of users as well. The policy areas it pro-
Internet Trade Organization motes are explained with a heavy emphasis on the
user and the benefits and rights the user gains.
The President and CEO, Michael Beckerman,
Introduction a former congressional staffer, is the public face of
the Internet Association, and he is usually the one
The Internet Association is a trade organization that signs statements or comments on important
that represents a significant number of the world’s issues on behalf of the members. Beyond their
largest Internet companies, all of whom are based, “business crawl” efforts promoting local busi-
founded, or ran in the United States of America. nesses and their connection to, and success yield-
While issues such as net neutrality or copyright ing from the Internet economy, the Association is
reform are at the forefront of their work, the Inter- active in many other areas. These areas include
net Association is also active in expressing the Internet freedom (nationally and worldwide) and
voice of the Internet industry in matters of Big patent reform, among others, with their most
Data. On this topic, it urges a commitment to important concern being net neutrality. As Big
status quo in privacy regulation and increased Data is associated with the Internet, and the indus-
government R&D for innovative ways of enhanc- try is interested in being an active stakeholder in
ing the benefits of Big Data, while also calling for related policy, the Association has taken several
dispelling the belief that the web is the only sector opportunities to make its opinions heard on the
that collects large data sets, as well as for a more matter. These opinions can also be traced through-
thorough review of government surveillance. out the policies it seeks to propose in other
These proposals are underlined by the perspective connected areas.
that the government has a responsibility to protect Most notably, after the White House Office of
the economic interests of the US industries, inter- Science and Technology Policy’s (OSTP) 2014
nationally, and a responsibility to protect the pri- request for information, as part of their 90-day
vacy of the American citizens, nationally. review on the topic of Big Data, the Internet
Association has released a set of comments that
crystallize their views on the matter. Prior com-
Main Text munications have also brought up certain aspects
related to Big Data; however, the comments made
Launched in 2012 with 14 members and designed to the OSTP have been the most comprehensive
as the unified voice in Washington D.C. for the and detailed public statement to date by the indus-
industry, the Internet Association now boasts try on issues of Big Data, privacy, and govern-
41 members and is dedicated, according to their ment surveillance.
statements, to protecting the future of the free and In matters of privacy regulation, the Associ-
innovative Internet. Among these 41 members, ation believes that the current framework is both
some of the more notable include Amazon, robust and effective in relation to commercial
AOL, Groupon, Google, Facebook, Twitter, entities. In their view, reform is mostly
Internet Association, The 595

necessary in the area of government surveil- consumer-oriented approach that would permeate
lance, by adopting an update to the Electronic the whole range of practices from understudied
Communications Privacy Act (which would sectors to the Internet, centered around increasing
give service providers a legal basis in denying user knowledge on how their data is being han-
government requests for data that are not accom- dled. This would allow the user to understand the
panied by a warrant), prohibiting bulk govern- entire processes that go on beyond the visible
mental collection of metadata from interfaces, without putting any more pressure on
communications and clearly bounding surveil- the industries to change their actions.
lance efforts by law. While the Internet Association considers
The Internet Association subscribes to the that commercial privacy regulation should be
notion that the current regime for private sector left virtually intact, substantial government
privacy regulation is not only sufficient but also funding for research and development should be
perfectly equipped to deal with potential concerns funneled into unlocking future and better societal
brought about by Big Data issues. The status quo benefits of Big Data. These funds, administered
is, in the acceptation of the Internet industry, a through the National Science Foundation and
flexible and multilayered framework, designed for other instruments, would be directed toward a
businesses that embrace privacy protective prac- deeper understanding of the complexities of Big
I
tices. The existing framework, beyond a some- Data, including accountability mechanisms,
times overlapping federal-state duality of levels, de-identification, and public release. Prioritizing
also includes laws in place through the Federal such government-funded research over new regu-
Trade Committee that guard against unfair prac- lation, the industry believes that current societal
tices and that target and swiftly punish the bad benefits from commercial Big Data usage
actors that perpetrate the worst harms. This allows (ranging from genome research to better spam
companies to harness the potential of Big Data filters) would multiply in number and effect.
within a privacy-aware context that does not allow The Association deems that the innovation
or tolerate gross misconduct. In fact, the Associa- economy would suffer from any new regulatory
tion even cites the White House’s 2012 laudatory approaches that are designed to restrict the free
comments on the existing privacy regimes, to flow of data. In their view, not only would the
strengthen its argument for regulatory status quo, companies not be able to continue with their com-
beyond simply an industry’s desire to be left to its mercial activities, which would hurt the sector,
own devices to innovate without major and the country, but the beneficial aspects of Big
restrictions. Data would suffer as well. Coupled with the rev-
The proposed solutions by the industry would elations about the data collection projects of the
center on private governance mechanisms that National Security Agency, this would signifi-
include a variety of stakeholders in the decision- cantly impact the standing of the United States
making process and are not, in fact, a product of internationally, as important international agree-
the legislative system. Such actions have been ments, such as the Transatlantic Trade and Invest-
taken before and, according to the views of the ment Partnership with the EU, are in jeopardy,
Association, are successful in the general sector of says the industry.
privacy, and they allow industry and other actors
that are involved in the specific areas to have a seat
at the table beyond the traditional lobbying route. Conclusion
One part that needs further action, according to
the views of the Association, is educating the The Internet Association thus sees privacy as a
public on the entire spectrum of activities that significant concern with regard to Big Data. How-
lead to the collection and analysis of large data ever, it strongly emphasizes governmental mis-
sets. With websites as the focus of most privacy- steps in data surveillance, and offers an
related research, the industry advocates a more unequivocal condemnation of such actions,
596 Internet Lobby

while lauding and extolling the virtues of the


regulatory framework in place to deal with the Internet Lobby
commercial aspect. The Association believes
that current nongovernmental policies, such as ▶ Internet Association, The
agreements between users and service providers,
or industry self-regulation, are also adequate, and
promoting such a user-facing approach to a major-
ity of privacy issues would continue to be useful. Internet of Things (IoT)
Governmental involvement is still desired by the
industry, primarily through funding for what Erik W. Kuiler
might be called basic research into the Big Data George Mason University, Arlington, VA, USA
territory, as the benefits of this work would be
spread around not just between the companies
involved but also with the government, as best The Internet of Things (IoT) is a global comput-
practices would necessarily involve governmental ing-based network infrastructure, comprising
institutions as well. uniquely identifiable objects embedded in entities
connected via the Internet that can collect, share,
and send data and act on the data that they have
Cross-References received. IoT defines how these objects will be
connected through the Internet and how they will
▶ De-identification/Re-identification communicate with other objects by publishing
▶ Google their capabilities and functionalities as services
▶ National Security Agency (NSA) and how they may be used, merging the digital
▶ Netflix (virtual) universe and the physical universe.
The availability of inexpensive computer chips;
advances in wireless sensor networks technologies;
Further Reading the manufacture of inexpensive radio-frequency
identification (RFID) tags, sensors, and actuators;
The Internet Association. Comments of the Internet Associa- and the ubiquity of wireless networks have made it
tion in response to the White House Office of Science and
possible to turn anything, such as telephony
Technology Policy’s Government ‘Big Data’ Request for
Information. http://internetassociation.org/wp-content/ devices, household appliances, and transportation
uploads/2014/03/3_31_-2014_The-Internet-Association- systems, into IoT participants. From a tropological
Comments-Regarding-White-House-OSTP-Request-for- perspective, IoT represents physical objects
Information-on-Big-Data.pdf. Accessed July 2016.
(things) as virtual entities that inhabit the Internet,
The Internet Association. Comments on ‘Big Data’ to the
Department of Commerce. http://internetassociation.org/ thereby providing a foundation for cloud-based big
080614comments/. Accessed July 2016. data analytics and management.
The Internet Association. Policies. https://internetassociat Conceptually, the IoT comprises a framework
ion.org/policy-platform/protecting-internet-freedom/.
with several interdependent tiers:
Accessed July 2016.
The Internet Association. Privacy. http://internetassociation.
org/policies/privacy/. Accessed July 2016. Code tier – the code tier provides the foundation
The Internet Association. Statement on the White House Big for IoT, in which each object is assigned a
Data Report. http://internetassociation.org/050114bigdata/.
Accessed July 2016.
unique identifier to distinguish it from other
The Internet Association. The Internet Association’s Press Kit. IoT objects.
http://internetassociation.org/the-internet-associations-press- Identification and recognition tier – the identi-
kit/. Accessed July 2016. fication and recognition tier comprises, for
The Internet Association. The Internet Association Statement
example, RFID tags, IR sensors, or other sen-
on White House Big Data Filed Comments. http://
internetassociation.org/bigdatafilingstatement/. Accessed sor networks. Devices in this tier gather infor-
July 2016. mation about objects from the sensor devices
Internet of Things (IoT) 597

linked with them and convert the information and information is increasingly difficult. In fact,
into digital signals which are then passed onto miniaturization also plays a role in this regard.
the network tier for further action. Many IoT personal devices are reduced to the
Network tier – the devices in the network tier point of invisibility, minimizing transparency to
receive and transmit the digital signals from human overview and management.
devices in the identification and recognition The IoT comprises the network of devices
tier and transmit it to the processing systems embedded in everyday objects that are enabled to
in the middleware tier through various media, receive, act on, and send data to each other via the
e.g., Bluetooth, WiMaX, Zigbee, GSM, 3G, Internet. For efficacy and efficiency, the IoT relies
etc., using the appropriate protocols (IPv4, on a multi-tiered framework that ensues syntactic
IPv6, MQTT, DDS, etc.). conformance, semantic congruence, and techno-
Middleware tier – devices in this tier process the logical reliability. In general terms, it can be framed
information received from the sensor devices. in terms of autonomous agency relative to the
The middleware tier includes the cloud-based increasing prevalence, reliance, and risks of
ubiquitous computing functions that ensure (unintended spontaneous) intervention in human
direct access to the appropriate data stores for events. As such, the IoT also reflects ontological
processing. ambiguity, blurring distinctions between human
I
Application tier – software applications in this beings, natural objects, and artifacts as parts of
tier instantiate support for IoT-dependent the broader smart and digitized environment.
applications, such as smart homes, smart trans-
portation systems, smart and connected cities,
etc. Cross-References
Business tier – software applications in this tier
support IoT-related research and development ▶ Data Streaming
as well as the evolution of business strategies,
models, and products.
Further Reading
Relative to these various tiers, the IoT is typi-
cally discussed in terms of technological advances Bandyopadhyay, D., & Sen, J. (1995). Internet of Things –
Applications and challenges in technology and stan-
and improvements to the human condition. How-
dardization. Wireless Personal Communications, 58
ever, there are issues that require more critical (1), 49–69.
review and consideration. For example, security Cheng, X., Zhang, M., & Sun, F. (2012). Architecture of
is a principal concern. IoT security breaches may internet of things and its key technology integration
based on RFID. In IEEE fifth international symposium
take the form of unauthorized access to RFID, on computational intelligence and design (pp. 294–
breaches of sensor-nodes security, cloud-based 297).
computing abuse, etc. Also, to ensure reliability European Research Cluster on the Internet of Things
and efficacy, IoT devices and networks must (IERC). (2015). Internet of Things IoT semantic inter-
operability: research challenges, best practices, rec-
ensure interoperability – technical interoperabil-
ommendations and next steps. Retrieved from: http://
ity, syntactical interoperability, semantic interop- www.internet-of-things-research.eu/.
erability, and organizational interoperability – but, Voas, J. (2016). NIST special publication 800-183: Net-
again, that raises further security issues. In the works of ‘Things.’ Retrieved from: Networks of
‘Things’ (nist.gov).
ubiquitous IoT environment, there are no clear Wu, M., Lu, T.-L., Ling, F.-Y., Sun, L., & Du, H.-Y. (2010).
ways to establish and secure human anonymity. Research on the architecture of Internet of things. In
In addition to deliberate (positive or negative) Advanced computer theory and engineering (pp. 484–
purposes, the inadvertent dissemination of per- 487).
Zhang, Y. (2011). Technology framework of the Internet of
sonally identifiable information (PII), privacy Things, and its application. In IEEE third international
information, and similar information occurs all conference on electronics and communication engi-
too frequently, and oversight of related devices neering (pp. 4109–4112).
598 Internet Trade Association

an electronic brain.” Dr. Clever has proposed


Internet Trade Association combining Arpanet, NSF, Bitnet, Usenet, and
all other networks into a single entity called the
▶ Internet Association, The Internet. The Internet has become “a homoge-
neous material resulting from a large number of
individual networks that are composed of many
heterogeneous computer systems (individuals, busi-
Internet Trade Organization nesses, government institutions)” (Mark 1999).
Tanenbaum (2001) shows that the reticular
▶ Internet Association, The structure of digital society is structured around
six levels of language, that is, among others, the
machine language, the programming language,
the language used by the user of this media, or
Internet: Language natural language, let alone the new language illus-
trated by the smiley. Tanenbaum specifies that
Marcienne Martin every language is built on its predecessor so that
Laboratoire ORACLE [Observatoire Réunionnais we can see a computer as a multilayer stack or
des Arts, des Civilisations et des Littératures dans levels. The language of the bottom is the simplest,
leur Environnement] Université de la Réunion the top one the most complex. Machine language,
Saint-Denis France, Montpellier, France which is the structural basis of the Internet, is
a binary sequence (e.g., 0111 1001) which can
only be understood by experts and, therefore, is
It is the Arpanet network that will be at the unusable as such in everyday communication. It is
origin the Internet. It was established in 1969 by something of a raw material to pass through a
the United States Department of Defense. As number of transformations in order to be used.
Mark (1999) mentions, the Internet has several A great number of researchers agree on the fact
characteristics including the decentralization of that this new digital paradigm, which is part of the
transmissions, which means that when a line of Internet, forms the basis for a transformation in
communication becomes inoperable the two social behavior that affects a large proportion of
remote machines will search for a new path to the world population. The analysis of the digital
transfer the data (the circuit can start on the East society is different from one researcher to another.
Coast of Canada, through the province of Ontario, Marshall McLuhan (2001) mentions the Internet
and finally lead to Saskatchewan). Arpanet has as a global village, without borders, without law,
a special mode of communication between com- and without constraint. For Wolton (2000) the
puters; Internet Protocol [IP]. [IP] works as a sort screen of the computer will simplify the commu-
of electronic envelope into which data are put. In nication between human beings and make it more
January 1994, the vice president of the United direct and transparent, while the computer system
States, Al Gore, for the first time used the term will be more regulated and more closed and more
“information highway” to describe the American coded. Wolton mentions that in civil society there
project to construct a national network of modern is never a transparent social relation. Furthermore,
communication. The network as we know it has the author specifies that access to knowledge and
been promoted by a group of research institutes information is the source of the revival of inequal-
and universities under the direction of Professor ity. The risk is that there is a place for everyone,
Clever. It consisted of five interconnected super- but yet every one remains in their place. Proulx
computers located in different geographic areas. (2004) found that communication in the digital
According to Mark (1999), “Computers are able society transforms the space-time relation. The
to collaborate, forming interconnected cells of user has access to information anytime and
Internet: Language 599

anywhere, which results in a generalization of con- to another and the semantic content of a particular
sultation of sites located in different parts of the meaning can take different values, generating sit-
world in delayed time. Contrary to users of tradi- uations of misunderstanding or even of conflict.
tional media (for example, television and radio However, when referring to the space of the
broadcasting), the Internet user is an innovator in Internet, one has to consider a new order of the
the management of a written code that uses the number of participants involved in the conversa-
style and syntax of an oral code. This innovative tional exchange. Indeed, the digital technology
character takes into account what is already there. that forms the basis of this medium permits an
So communication through this media is at the unlimited number of Internet users to connect to a
origin of a new language using the alphanumeric particular chat room. Some chat rooms can dis-
signs and symbols located on the keys of the phys- play a large number of participants. This means
ical or digital keyboard and this, in the context of an that we are far from being faced with conversa-
innovative semantic context. While the user is in tional patterns in real life and for which such
front of their screen, they do not see their interloc- exchanges would be doomed to fail. In the digital
utors. This staging of reality refers only to the society, each user is alone behind their machine,
imagination of the Net surfer and their and it is all of these units that form an informal
interpretation of the situation communication. group composed by the participants of a particular
I
Unlike the television in which the subject is rather chat room. Furthermore, the perception that Net
passive – it can use the “zapping” or turn off the TV surfers can have concerning the number of
– the Internet user can break off a conversation if speakers involved in the activity in which they
they find it inappropriate, without giving any justi- participate may be misleading. That is why to
fication, which is not the case in an exchange of overcome the problem posed by the large number
traditional communication. The rules of etiquette of users connected at the same time, in the same
(good manners), even if they are advocated on the chat room; discussions were set up called “pri-
Web, may, however, be ignored. In civil society, a vate” and expressed by the abbreviation “PV.”
speaker who would not make dialogic openings Regarding the exchange turns, here we are in the
and closures inherent to their culture would be case where the digital structure that underlies the
sanctioned by the rejection whatsoever from the Internet medium supports the management of this
caller and/or their group of belonging. event. Thus, the electrical impulses that work to
Communication in humans via conversational create equations, so-called Boolean, operate in
exchanges has been the subject of numerous stud- consecutive ranking. This order is reflected in
ies. A specific mode of verbal interaction, that the upper layers of more sophisticated program-
is conversation, was studied in particular by ming languages than at the level of Net surfers.
Kerbrat-Orecchioni (1996); she shows that the Turn-taking of speakers is, therefore, not managed
principal characteristics are the implication of a by users but by the computer.
limited number of participants playing roles not Moreover, the only organs solicited within the
predetermined, benefiting normally of the same framework of communicative exchanges on the
rights and duties and having the pleasure of Internet are eyes for reading on the screen and
conversing; conversation has a familiar and writing the message on the keyboard, as well as
improvised nature whatsoever at the themes, the the touch when using the keyboard keys; this
duration of the exchange, the order of the implies that in this particular universe there is
speeches. As specified by the author: “The inter- the absence of any kinesics manifestation, that
action is symmetric and egalitarian.” However, the opening of a dialogue takes place on the
some parameters are involved in the proper con- basis of a soliloquy, that the establishment of a
duct of this type of interaction: it is the sharing to single proxemic distance is common to all Internet
the same linguistic and cultural heritage. Indeed, users, namely, the physical distance that separates
the reports in the world can differ from one group them from the computer tool.
600 Internet: Language

The New Language on Internet Some of these graphs are shown in Table 1; for
each of them, the basic icons are those listed in the
The field of writing seems to correspond to a table of characters on the keyboard used. Thus, the
widening of the field of speech both on a spatial semantic field of facial expressions has several
and temporal level. Boulanger (2003) contends keys that initiate eyes, mouth, and nose, respec-
that through the medium of limited sounds and tively, as we can see in Table 1. Simplified picto-
possible actions, man has forged a speech orga- graphs are unambiguous and monosemic, but in
nized and filled with meaning. For Leroi- their more complex version the reading of these
Gourhan, anthropologist, the history of writing icons request the use of a legend. Usually their
begins with tracings and visuals of the end of the creators add a small explanatory text. Moreover,
Mousterian period, around 50,000 BC, and then it these symbols punctuate the linguistic discourse,
propagates around 30,000 BC. These tracings due to the inability to compensate paraverbal and
open to interpretation would have served as a nonverbal exchange set up during the usual con-
mnemonic support. This proto writing consisted versations implemented in civil society.
of incisions (lines, points, grooves, sticks, etc.)
regularly spaced and formed in stones or bones.
This is the development of external oral code Identity on the Internet
through the writing support.
Referring to the language used on Internet In the image of the complexity of the universe
means evoking a hybrid structure, first take into including humans and the organizations in which
account written support to express a message and they are part, subsuming various paradigms, such
the other, makes extensive use of terms used in the divine, the human, and the objectal, etc., nom-
spoken in the lexical-semantic phrases. Thus, ination is a fascinating phenomenon but difficult,
the identification and analysis of discursive almost impossible, to define in its entirety. Num-
sequences show that the form of rebus with the ber of parameters and factors modify both fixed
use of logograms has been adopted, such as num- and variable components. By the consciousness of
bers and the sign arrobas: @; these characters are being in the world, while questioning the strange-
at the origin of the phonetic support of written ness live to die, humans take place in reality by
medium objects and of its oral version, rapid naming them. Anthroponomy takes part in this
writing which is the abbreviation of words like process; its organization reflects the culture of
Pls (please), which reduce the message to its which it is part. Patronymic and first names have
phonetic transcription as ID (idea), or use a mix- in common the quality of nomen verum (veritable
ture of rapid writing and phonetic transcription. name) (Laugaa 1986). Ghasarian (1996) empha-
The personal creation governs linguistic innova- sizes with the patronymic as the noun of
tion in the Internet; it is manifested in rebus, rapid relationship that an individual receives at birth,
writing, and phonetic reduction, etc. So the puzzle demonstrating its identity. Moreover, the first
is made, often logograms, phonemes, stylistic fig- name would be similar to that pseudonym its
ures, etc. as C*** (cool). Poets have thus used as actualization occurs in the synchronic time and
Queneau (1947) with the use of these stylistic
figures in writing poems.
In addition, the writing on the Internet uses Internet: Language, Table 1 Semantic field of expres-
semantic keys to the image of Chinese characters, sions of the face
on one hand would serve to create complex logo- Eyes
grams and on the other hand, would initialize : ; ‘ ,
particular field semantics (Martin 2010). These Open eyes Nod Left eyebrow Right eyebrow
keys are at the origin of basic pictographs com- Expressive gestures of the mouth
posed of a simple graph; in their combined form, ) ( ! <
Smile Pout Indifference Disappointment
these graphs result in more complex pictograms.
Internet: Language 601

not in the diachronic time (transgenerational) as to emotional vector. Marker of identity at the base,
the surname. it is an anthroponym called pseudonym. Like the
As opposed to the surname, pseudonyms do mask, it has a plural vocation.
not infer any genealogical connection. However, Social networks are an extension of chat
if the construction is personal creation order, it rooms with more personalized communication
remains highly contextualized. Thus autonyms modalities. One example is the Facebook social
created by users for the need to surf the Internet network that allows any user of the Web to create
space while preserving the confidentiality of their a personal space with the ability to upload
privacy will be motivated by the particularity of photos and videos, to write messages on an
this media. However, the fact to evoke the place of interface (the wall) which can be consulted by
the individuals in the genealogical chain implic- relatives and friends, or by all the members of
itly refers to the construction of their identity, of the social network to the extent that the user
their groups of belonging and/or of opposition, accepts this possibility. It was in 2004 that the
and finally to the definition of their status. How- social network Facebook was created by
ever, the construction of a pseudonym on the Zuckerberg and his fellow students at Harvard
Internet actualizes new social habits that will University, Eduardo Saverin, Dustin Moskovitz,
depend on both the personal choice of the user and Chris Hughes. Other social networks like
I
and of the virtual society they wish to join. LinkedIn, established in 2003, belong to profes-
Digital media, including network structures sional online social networks; their network
(networking), form the basis of the function of structure works from several levels of connec-
nomen falsum (false name) is plural. A nickname tion: direct contacts, contacts of direct contacts,
is rigid and an identity marker at a given time. and then the contacts to the second degree. There
Indeed, because of the configuration of the com- are also social networks like Twitter whose char-
puter system which is running according to a acteristic is sending messages limited to 140
binary mode, homonymic names are only recog- characters. Twitter took its name from the com-
nized as a single occurrence. However, users of pany Twitter Inc. creator of this social network;
the Internet can change their pseudonym ad it is a blogging platform that allows a user to
libitum. The nominal sustainability is not corre- send free short messages called tweets on the
lated to the holder of such surnames as is the case internet, instant messaging or SMS.
in civil society where the law lays down strict Social networks are a way to enrich the lives of
rules for the official nomination. The creation Internet users by virtually meeting with users
of the pseudonym is made from data taken, sharing the same tastes, the same opinions, etc.
among others, in the private life of the Net surfer Social networks can be at the origin of the
(Martin 2006, 2012). reorientation of political or social opinion. Never-
In civil society, anthroponomy sets the social theless, connecting to Facebook can often be the
being within a group. In order to join discussion cause of a form of addiction, since acting on this
forums or chat rooms, the internet user has to social network allows users to create a large net-
choose a pseudonym. However, using a pseudo- work of virtual friends, which can also affect the
nym on the Web is not done for the sole purpose of user’s image. Thus, having a lot of friends may
naming an individual. The main feature of the refer to an overvalued self-image, while the oppo-
pseudonym on the Internet is its richness in site may result in a devaluation of one’s image.
terms of creativity. Moreover, some nicknames The study of the territory of the internet and
become discursive spaces where users claim posi- social practices that it induces, refers, on the one
tions already taken, issue opinions, or express hand, to the Internet physical territory occupied by
their emotions. A nomen falsum can serve the the user, that is to say, a relationship between the
user’s speech. Both designator and discursive keyboard and the screen on a space belonging to
unity, the pseudonym amplifies by synthesizing what Hall defines as “intimate distance” and, on
the speech of the user. It can also act as an the other hand, the symbolic territory that
602 Invisible Web, Hidden Web

registered the other in a familiar space. These


different modes of running of the pseudonym are Italy
correlated to the development of nomen falsum
(nickname) on the personal territory of the Net Chiara Valentini
surfer, both physical and symbolic, and more spe- Department of Management, Aarhus University,
cifically in the context of its intimate sphere, School of Business and Social Sciences, Aarhus,
which has profound implications concerning the Denmark
relations between the communication of Internet
users. The Internet is a breeding ground where
creativity takes shape and grows exponentially. Introduction
These are the new locations of speech in which
the exchange engaged by users can have repercus- Italy is a Parliamentary republic in southern
sions in civil society. Europe. It has a population of about 60 million
people of which, 86.7%, are Internet users
(Internet World Stat 2017). Public perception of
Further Reading handling big data is generally very liberal, and the
phenomenon has been associated with more trans-
Boulanger, J.-C. (2003). Les inventeurs de dictionnaires. parency and digitalized economic and social sys-
Ottawa: Les presses de l’Université d’Ottawa.
Ghasarian, C. (1996). Introduction à l’étude de la parenté.
tems. The collection and processing of personal
Paris: Editions du Seuil. data have been increasingly used to counter tax
Hall, T. E. (1971). La dimension cachée, édition originale. evasion which is one of the major problems of
Paris: Seuil. Italian economy. The Italian Revenue Agency is
Kerbrat-Orecchioni, C. (1996). La conversation. Paris:
using data collected through different private and
Seuil.
Laugaa, M. (1986). La pensée du pseudonyme. Paris: PUF. public data collectors to cross-check tax declara-
Leroi-Gourhan, A. (1964). Le Geste et la Parole, tions (DPA 2014a).
Technique et langage. Paris: Albin Michel. According to the results of a study on Italian
Mark, T. R. (1999). Internet, surfez en toute simplicité sur
companies' perception of big data conducted by
le plus grand réseau du monde. Paris: Micro
Application. researchers at the Big Data Analytics & Business
Martin, M. (2006). Le pseudonyme sur Internet, une nom- Intelligence Observatory of Milan Polytechnic,
ination située au carrefour de l’anonymat et de la more and more companies (þ22% in 2013) are
sphère privée. Paris: L’Harmattan.
interested in investing in technologies that allow
Martin, M. (2010). Dictionnaire des pictogrammes
numériques et du lexique en usage sur Internet et sur to handle and use big data. Furthermore, the num-
les téléphones portables. Paris: L’Harmattan. ber of companies seeking professional managers
Martin, M. (2012). Se nommer pour exister – L’exemple du that are capable of interpreting data and assisting
pseudonyme sur Internet. Paris: L’Harmattan.
senior management on decision-making is also
McLuhan, M., & Fiore, Q. (2001). The medium is the
MASSAGE. Hamburg/Berkeley: Gingko Press. increasing. Most of the Italian companies (76%
Proulx, S. (2004). La révolution Internet en question. of 184 interviewed) claim that they use basic
Montréal: Québec Amérique. analytics strategically and another 36% use more
Queneau, R. (1947). Exercices de style. Paris: Gallimard.
sophisticated tools for forecasting activities
Tanenbaum, A. (2001). Architecture de l’ordinateur.
Paris: Dunod. (Mosca 2014, January 7).
Wolton, D. (2000). Internet et après? Paris: Flammarion.

Data Protection Agency and Privacy


Issues
Invisible Web, Hidden Web
Despite the positive attitude and increased use of
▶ Surface Web vs Deep Web vs Dark Web big data by Italian organizations, an increasing
Italy 603

public expectation for privacy protection has criminal records, ethnicity, religion or other
emerged as a result of raising debates on personal beliefs, political opinions, membership of
data, data security, and protection in the whole parties, trade unions and/or associations, health,
European Union. In the past years, the Italian or sex life. Access to sensitive and judicial data is
Data Protection Authority (DPA) reported several instances of data collection of telephone and Internet communications of Italian users which may have harmed Italians' fundamental rights (DPA 2014b). Personal data laws have been developed as these are considered important instruments for the overall protection of fundamental human rights, thereby adding new legal specifications to the existing privacy framework. The first specific law on personal data was adopted by the Italian Parliament in 1996, and it incorporated a number of guidelines already included in the European Union 1995 Data Protection Directive. At the same time, an independent authority, the Italian Data Protection Authority (Garante per la protezione dei dati personali), was created in 1997 to protect the fundamental rights and freedoms of people when personal data are processed. The Italian Data Protection Authority (DPA) is run by a four-member committee elected by the Italian Parliament for a seven-year mandate (DPA 2014a).

The main activities of the DPA consist of monitoring and assuring that organizations comply with the latest regulations on data protection and individual privacy. In order to do so, the DPA carries out inspections of organizations' databases and data storage systems to guarantee that their standards for preserving individual freedom and privacy are high. It checks that the activities of the police and the Italian Intelligence Service comply with the legislation, reports privacy infringements to judicial authorities, and encourages organizations to adopt codes of conduct promoting fundamental human rights and freedom. The authority also handles citizens' reports and complaints of privacy loss or any misuse or abuse of personal data. It bans or blocks activities that can cause serious harm to individual privacy and freedom. It grants authorizations to organizations and institutions to have access to and use sensitive and/or judicial data. Sensitive and judicial data concern, for instance, information on a person's [...]; access to such data is granted only for specific purposes, for example, in situations where it is necessary to know more about a certain individual for national security reasons (DPA 2014b).

The DPA participates in data protection activities involving the European Union and other international supervisory authorities and follows existing international conventions (Schengen, Europol, and the Customs Information System) when regulating Italian data protection and security matters. It carries out an important role in increasing public awareness of privacy legislation and in soliciting the Italian Parliament to develop legislation on new economic and social issues (DPA 2014b). The DPA has also formulated specific guidelines on cloud computing to help Italian businesses. Yet, according to this authority, these cloud computing guidelines require that Italian laws be updated to be fully effective in regulating this area. Critics indicate that there are limits in existing Italian laws concerning the allocation of liabilities, data security, jurisdiction, and notification of infractions to the supervisory authority (Russo 2012).

Another area of great interest for the DPA is the collection of personal data via video surveillance in both the public and the private sector. The DPA has acted on specific cases of video surveillance, sometimes banning and other times allowing it (DPA 2014c). For instance, the DPA reported having banned the use of webcams in a nursery school to protect children's privacy and to safeguard freedom of teaching. It banned police headquarters from processing images collected via CCTV cameras installed in streets for public safety purposes because such cameras also captured images of people's homes. The use of customers' pre-recorded, operator-unassisted phone calls for debt collection purposes is among the activities that have been prohibited by this authority. Yet, the DPA permits the use of video surveillance in municipalities for counter-vandalism purposes (DPA 2014b).

Conclusion

Overall, Italy is advancing with the regulation of the big data phenomenon, following also the impetus given by the EU institutions and international debates on data protection, security, and privacy. Nonetheless, Italy is still lagging behind many Western and European countries regarding the adoption and development of frameworks for a full digital economy. According to the Networked Readiness Index 2015 published by the World Economic Forum, Italy is ranked 55th. As indicated by the report, Italy's major weakness is still a political and regulatory environment that does not facilitate the development of a digital economy and its innovation system (Bilbao-Osorio et al. 2014).

Cross-References

▶ Cell Phone Data
▶ Data Security
▶ European Union
▶ Privacy

References

Bilbao-Osorio, B., Dutta, S., & Lanvin, B. (2014). The global information technology report 2014: Rewards and risks of big data. World Economic Forum. http://www3.weforum.org/docs/WEF_GlobalInformationTechnology_Report_2014.pdf. Accessed 31 Oct 2014.
DPA. (2014a). Summary of key activities by the Italian DPA in 2013. http://www.garanteprivacy.it/web/guest/home/docweb/-/docweb-display/docweb/3205017. Accessed 31 Oct 2014.
DPA. (2014b). Who we are. http://www.garanteprivacy.it/web/guest/home_en/who_we_are. Accessed 31 Oct 2014.
DPA. (2014c). "Compiti del Garante" [Tasks of the DPA]. http://www.garanteprivacy.it/web/guest/home/autorita/compiti. Accessed 31 Oct 2014.
Internet World Stats. (2017). Italy. http://www.internetworldstats.com/europa.htm. Accessed 15 May 2017.
Mosca, G. (2014, January 7). Big data, una grossa opportunità per il business, se solo si sapesse come usarli. La situazione in Italia [Big data, a great opportunity for business, if only one knew how to use it: The situation in Italy]. La Stampa. http://www.ilsole24ore.com/art/tecnologie/2014-01-07/big-data-grossa-opportunita-il-business-se-solo-si-sapesse-come-usarli-situazione-italia-110103.shtml?uuid=ABuGM6n. Accessed 31 Oct 2014.
Russo, M. (2012). Italian data protection authority releases guidelines on cloud computing. In McDermott Will & Emery (Eds.), International News (Focus on Data Privacy and Security, 4). http://documents.lexology.com/475569eb-7e6b-4aec-82df-f128e8c67abf.pdf. Accessed 31 Oct 2014.
J

Journalism

Brian E. Weeks1, Trevor Diehl2, Brigitte Huber2 and Homero Gil de Zúñiga2
1Communication Studies Department, University of Michigan, Ann Arbor, MI, USA
2Media Innovation Lab (MiLab), Department of Communication, University of Vienna, Wien, Austria

The Pew Research Center notes that journalism is a mode of communication that provides the public verified facts and information in a meaningful context so that citizens can make informed judgments about society. As aggregated, large-scale data have become readily available, the practice of journalism has increasingly turned to big data to help fulfill this mission. Journalists have begun to apply a variety of computational and statistical techniques to organize, analyze, and interpret these data, which are then used in conjunction with traditional news narratives and reporting techniques. Big data are being applied to all facets of news including politics, health, the economy, weather, and sports.

The growth of "data-driven journalism" has changed many journalists' news gathering routines by altering the way news organizations interact with their audience, providing new forms of content for the public, and incorporating new methodologies to achieve the objectives of journalism. Although big data offer many opportunities for journalists to report the news in novel and interesting ways, critics have noted data journalism also faces potential obstacles that must be considered.

Origins of Journalism and Big Data

Contemporary data journalism is rooted in the work of reporters like Philip Meyer, Elliot Jaspin, Bill Dedman, and Stephen Doig. In his 1973 book, Meyer introduced the concept of "precision journalism" and advocated applying social science methodology to investigative reporting practices. Meyer argued that journalists needed to employ the same tools as scientific researchers: databases, spreadsheets, surveys, and computer analysis techniques.

Based on the work of Meyer, computer-assisted reporting developed as a niche form of investigative reporting by the late 1980s, as computers became smaller and more affordable. A notable example from this period was Bill Dedman's Pulitzer Prize winning series "The Color of Money." Dedman obtained lending statistics on computer tape through the federal Freedom of Information Act. His research team combined that data with demographic information from the US Census. Dedman found widespread racial discrimination in mortgage lending practices throughout the Atlanta metropolitan area.

Over the last decade, the ubiquity of large, often free, data sets has created new opportunities
for journalists to make sense of the world of big data. Where precision journalism was once the domain of a few investigative reporters, data-driven reporting techniques are now a common, if not necessary, component of contemporary news work. News organizations like The Guardian, The New York Times' Upshot, and The Texas Tribune represent the mainstream embrace of big data. Some websites, like Nate Silver's FiveThirtyEight, are entirely devoted to data journalism.

How Do Journalists Use Big Data?

Big data provide journalists with new and alternative ways to approach the news. In traditional journalism, reporters collect and organize information for the public, often relying on interviews and in-depth research to report their stories. Big data allow journalists to move beyond these standard methods and report the news by gathering and making sense of aggregated data sets. This shift in methods has required some journalists and news organizations to change their information-gathering routines. Rather than identifying potential sources or key resources, journalists using big data must first locate relevant data sets, organize the data in a way that allows them to tell a coherent story, analyze the data for important patterns and relationships, and, finally, report the news in a comprehensible manner. Because of the complexity of the data, news organizations and journalists are increasingly working alongside computer programmers, statisticians, and graphic designers to help tell their stories.

One important aspect of big data is visualization. Instead of writing a traditional story with text, quotations, and the inverted-pyramid format, big data allow journalists to tell their stories using graphs, charts, maps, and interactive features. These visuals enable journalists to present insights from complicated data sets in a format that is easy for the audience to understand. These visuals can also accompany and buttress news articles that rely on traditional reporting methods.

Nate Silver writes that big data analyses provide several advantages over traditional journalism. They allow journalists to further explain a story or phenomenon through statistical tests that explore relationships, to more broadly generalize information by looking at aggregate patterns over time, and to predict future events based on prior occurrences. For example, using an algorithm based on historical polling data, Silver's website, FiveThirtyEight (formerly hosted by the New York Times), correctly predicted the outcome of the 2012 US presidential election in all 50 states. Whereas methods of traditional journalism often lend themselves to more microlevel reporting, more macrolevel and general insights can be gleaned from big data.
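To give a sense of the kind of aggregation such forecasting rests on, the short Python fragment below is a deliberately simplified sketch. It is not FiveThirtyEight's model; the polls, field names, and weighting rule are hypothetical. It averages a handful of state polls, giving more weight to recent surveys with larger samples, and reports the implied leader.

# Toy illustration of poll aggregation for election forecasting.
# This is NOT FiveThirtyEight's actual model; it simply weights recent,
# larger polls more heavily and reports the leading candidate.

def aggregate_polls(polls):
    """polls: list of dicts with 'dem', 'rep' (percent), 'size', 'days_old'."""
    weighted = {"dem": 0.0, "rep": 0.0}
    total_weight = 0.0
    for p in polls:
        # Weight grows with sample size and decays with the poll's age.
        weight = p["size"] / (1.0 + p["days_old"])
        weighted["dem"] += weight * p["dem"]
        weighted["rep"] += weight * p["rep"]
        total_weight += weight
    dem = weighted["dem"] / total_weight
    rep = weighted["rep"] / total_weight
    return {"dem": dem, "rep": rep, "leader": "dem" if dem > rep else "rep"}

# Hypothetical polls for one state
state_polls = [
    {"dem": 51.0, "rep": 47.0, "size": 1200, "days_old": 2},
    {"dem": 49.5, "rep": 48.5, "size": 800, "days_old": 10},
    {"dem": 52.0, "rep": 46.0, "size": 600, "days_old": 20},
]
print(aggregate_polls(state_polls))

Real forecasting models add many further adjustments (house effects, demographic weighting, uncertainty estimates), but the basic step of pooling many noisy measurements into one more stable estimate is the same.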
An additional advantage of big data is that, in some cases, they reduce the resources needed to report a story. Stories that would otherwise have taken years to produce can be assembled relatively quickly. For example, WikiLeaks provided news organizations nearly 400,000 unreleased US military reports related to the war in Iraq. Sifting through these documents using traditional reporting methods would take a considerable amount of time, but news outlets like The Guardian in the UK applied computational techniques to quickly identify and report the important stories and themes stemming from the leak, including a map noting the location of every death in the war.

Big data also allow journalists to interact with their audience to report the news. In a process called crowdsourcing the news, large groups of people contribute relevant information about a topic, which in the aggregate can be used to make generalizations and identify patterns and relationships. For example, in 2013 the New York Times website released an interactive quiz on American dialects that used responses to questions about accents and phrases to demonstrate regional patterns of speech in the US. The quiz became the most visited content on the website that year.

Data Sets and Methodologies

Journalists have a multitude of large data sets and methodologies at their disposal to create news
stories. Much of the data used is public and originates from government agencies. For example, the US government has created a website, data.gov, which offers over 100,000 datasets in a variety of areas including education, finance, health, jobs, and public safety. Other data, like the WikiLeaks reports, were not intended to be public but became primary sources of big data for journalists. News organizations can also utilize publicly available data from private Internet companies like Google or social networking sites such as Facebook and Twitter to help report the news.

Once the data are secured, journalists can apply numerous techniques to make sense of the data. For example, at a basic level, journalists could get a sense of public interest about a topic or issue by examining the volume of online searches about the topic or the number of times it was referenced in social media. Mapping or charting occurrences of events across regions or countries also offers basic descriptive visualizations of the data. Journalists can also apply content or sentiment analyses to get a sense of the patterns of phrases or tone within a set of documents. Further, network analyses could be utilized to assess connections between points in the data set, which could provide insights on the flow or movement of information, or on power structures.
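A minimal sketch can make these techniques concrete. The Python fragment below is a toy illustration, not any newsroom's actual pipeline; the example documents and stopword list are invented for the purpose. It performs a very basic content analysis (term counts) and builds a simple word co-occurrence network, whose most connected terms hint at the concepts around which a document set revolves.

# Minimal sketch: basic content analysis plus a word co-occurrence
# "network" built from a handful of hypothetical document excerpts.
from collections import Counter
from itertools import combinations

documents = [
    "checkpoint incident reported near the river district",
    "patrol reported small arms fire near the checkpoint",
    "river district patrol logged an improvised device",
]

stopwords = {"the", "an", "near", "a"}
term_counts = Counter()
edge_counts = Counter()

for doc in documents:
    terms = sorted({w for w in doc.lower().split() if w not in stopwords})
    term_counts.update(terms)
    # Each pair of terms appearing in the same document forms an edge.
    edge_counts.update(combinations(terms, 2))

print("Most frequent terms:", term_counts.most_common(3))
print("Strongest co-occurrence links:", edge_counts.most_common(3))

# Degree = how many distinct terms each term is linked to, a crude
# indicator of which concepts sit at the centre of the document set.
degree = Counter()
for (a, b), _ in edge_counts.items():
    degree[a] += 1
    degree[b] += 1
print("Most connected terms:", degree.most_common(3))

In practice, projects of the scale described here would rely on proper natural language processing and graph analysis tools rather than raw word splitting, but the underlying logic of counting terms and the links between them is the same.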
These methods can be combined to produce a more holistic account of events. For example, journalists at the Associated Press used textual and network analysis to examine almost 400,000 WikiLeaks documents related to the Iraq war and identified related clusters of words used in the reports. In doing so, they were able to demonstrate patterns of content within the documents, which shed previously unseen light on what was happening on the ground during the war.

Computer algorithms, and self-taught machine learning techniques, also play an important role in the big data journalistic process. Algorithms can be designed to automatically write news stories, without a human author. These automated "robot journalists" have been used to produce stories for news outlets like the Associated Press and The Los Angeles Times. Algorithms have also changed the way news is delivered, as news aggregators like Google News employ these methods to collect and provide users with personalized news feeds.

Limitations of Big Data for Journalism

Although big data offer numerous opportunities to journalists reporting the news, scholars and practitioners have highlighted several general limitations of these data. As much as big data can help journalists in their reporting, they need to make an active effort to contextualize the information. Big data storytelling also elicits moral and ethical concerns with respect to the collection of individuals' data as aggregated information. These reporting techniques also need to bear in mind potential data privacy transgressions.

Cross-References

▶ Computational Social Sciences
▶ Data Visualization
▶ Digital Storytelling, Big Data Storytelling
▶ Information Society
▶ Interactive Data Visualization
▶ Open Data

Further Reading

Pew Research Center. The core principles of journalism. http://www.people-press.org/1999/03/30/section-i-the-core-principles-of-journalism. Accessed April 2016.
Shorenstein Center on Media, Politics and Public Policy. Understanding data journalism: Overview of resources, tools and topics. http://journalistsresource.org/reference/reporting/understanding-data-journalism-overview-tools-topics. Accessed April 2016.
Silver, N. What the fox knows. http://fivethirtyeight.com/features/what-the-fox-knows. Accessed August 2014.

Special Issues and Volumes
Digital Journalism – Journalism in an Era of Big Data: Cases, concepts, and critiques. v. 3/3 (2015).
Social Science Computer Review – Citizenship, Social Media, and Big Data: Current and Future Research in the Social Sciences (in press).
The ANNALS of the American Academy of Political and Social Science – Toward Computational Social Science: Big Data in Digital Environments. v. 659/1 (2015).
K

KDD

▶ Data Discovery

KDDM

▶ Data Discovery

Keycatching

▶ Keystroke Capture

Keylogger

▶ Keystroke Capture

Keystroke Capture

Gordon Alley-Young
Department of Communications and Performing Arts, Kingsborough Community College, City University of New York, New York, NY, USA

Synonyms

Keycatching; Keylogger; Keystroke logger; Keystroke recorder

Introduction

Keystroke capture (KC) tracks a computer or mobile device user's keyboard activity using hardware or software. KC is used by businesses to keep employees from misusing company technology, in families to monitor the use and possible misuse of family computers, and by computer hackers who seek gain through secretly possessing an individual's personal information and account passwords. KC software can be purchased for use on a device or may be placed
maliciously without the user's knowledge through contact with untrusted websites or e-mail attachments. KC hardware can also be purchased and is disguised to look like computer cords and accessories. KC detection can be difficult because software and hardware are designed to avoid detection by anti-KC programs. KC can be avoided by using security software as well as through careful computing practices. KC affects individual computer users as well as small, medium, and large organizations internationally.

How Keystroke Capture (KC) Works

Keystroke capture (KC), also called keystroke logger, keylogger, keystroke recorder, and keycatching, tracks a computer or mobile device user's activities, including keyboard activity, using hardware or software. KC is knowingly employed by businesses to deter their employees from misusing company devices and also by families seeking to monitor the technology activities of vulnerable family members (e.g., teens, children). Romantic partners and spouses use KC to catch their significant others engaged in deception and/or infidelity. Computer hackers install KC onto unsuspecting users' devices in order to steal their personal data, website passwords, and financial information, read their correspondence/online communication, stalk/harass/intimidate users, and/or sabotage organizations or individuals that hackers consider unethical. When used covertly to hurt and/or steal from others, KC is called malware, malicious software used to interfere with a device, and/or spyware, software used to steal information or to spy on someone.

KC software (e.g., WebWatcher, SpectorPro, Cell Phone Spy) is available for free and also for purchase, and it is usually downloaded onto the device, where it either saves captured data onto the hard drive or sends it through networks/wirelessly to another device/website. KC hardware (e.g., KeyCobra, KeyGrabber, KeyGhost) may be an adaptor device into which a keyboard/mouse USB cord is plugged before it is inserted into the computer, or it may look like an extension cable. Hardware can also be installed inside the computer/keyboard. KC is placed on devices maliciously by hackers when computer and mobile device users visit websites, open e-mail attachments, or click links to files that are from untrusted sources. Individual technology users are frequently lured by untrusted sources and websites that offer free music files or pornography. KCs infiltrate organizations' computers when an employee is completing company business (i.e., financial transactions) on a device that they also use to surf the Internet in their free time.

When a computer is infected with a malicious KC, it can be turned into what is called a zombie, a computer that is hijacked and used to spread KC malware/spyware to other unsuspecting individuals. A network of zombie computers that is controlled by someone other than the legitimate network administrator is called a botnet. In 2011, the FBI shut down the Coreflood botnet, a global KC operation affecting 2 million computers. This botnet spread KC software via an infected e-mail attachment and seemed to infect only computers using Microsoft Windows operating systems. The FBI seized the operators' computers and charged 13 "John Doe" defendants with wire fraud, bank fraud, and illegally intercepting electronic communication. Then in 2013 security firm SpiderLabs found 2 million passwords in the Netherlands stolen by the Pony botnet. While researching the Pony botnet, SpiderLabs discovered that it contained over a million and a half Twitter and Facebook passwords and over 300,000 Gmail and Yahoo e-mail passwords. Payroll management company ADP, with over 600,000 clients in 125 countries, was also hacked by this botnet.

The Scope of the Problem Internationally

In 2013 the Royal Canadian Mounted Police (RCMP) served White Falcon Communications with a warrant that alleged that the company was controlling an unknown number of computers known as the Citadel botnet (Vancouver Sun 2013). In addition to distributing KC malware/
spyware, the Citadel botnet also distributed spam and conducted network attacks that reaped over $500 million in illegal profit, affecting more than 5 million people globally (Vancouver Sun 2013). The Royal Bank of Canada and HSBC in Great Britain were among the banks attacked by the Citadel botnet (Vancouver Sun 2013). The operation is believed to have originated from Russia or Ukraine, as many websites hosted by White Falcon Communications end in the .ru suffix (i.e., the country code for Russia). Microsoft claims that the 1,400 botnets running Citadel malware/spyware were interrupted due to the RCMP action, with the highest infection rates in Germany (Vancouver Sun 2013). Other countries affected were Thailand, Italy, India, Australia, the USA, and Canada. White Falcon owner Dmitry Glazyrin's voicemail claimed he was out of the country on business when the warrant was served (Vancouver Sun 2013).

Trojan horses allow others to access and install KC and other malware. Trojan horses can alter or destroy a computer and its files. One of the most infamous Trojan horses is called Zeus. Don Jackson, a senior security researcher with Dell SecureWorks who has been widely interviewed, claims that Zeus is so successful because those behind it, seemingly in Russia, are well funded and technologically experienced, and this allows them to keep Zeus evolving into different variations (Button 2013). In 2012 Microsoft's Digital Crimes Unit with its partners disrupted a variation of Zeus botnets in Pennsylvania and Illinois responsible for an estimated 13 million infections globally. Another variation of Zeus called GameOver tracks computer users' every login and uses the information to lock them out and drain their bank accounts (Lyons 2014). In some instances GameOver works in concert with CryptoLocker. If GameOver finds that an individual has little in the bank, then CryptoLocker will encrypt users' valuable personal and business files, agreeing to release them only once a ransom is paid (Lyons 2014). Often ransoms must be paid in Bitcoin, which is Internet based and currently anonymous and difficult to track. Victims of CryptoLocker will often receive a request for a one-Bitcoin ransom (estimated to be worth 400€/$500USD) to unlock the files on their personal computer, which could include records for a small business, academic research, and/or family photographs (Lyons 2014).

KC is much more difficult to achieve on a smartphone, as most operating systems operate only one application at a time, but it is not impossible. As an experiment, Dr. Hao Chen, an Associate Professor in the Department of Computer Science at the University of California, Davis, with an interest in security research, created KC software that operates using smartphone motion data. When tested, Chen's application correctly guessed more than 70% of the keystrokes on a virtual numerical keypad, though he asserts that it would probably be less accurate on an alphanumerical keypad (Aron 2011). Point-of-sale (POS) data, gathered when a credit card purchase is made in a retail store or restaurant, is also vulnerable to KC software (Beierly 2010). In 2009 seven Louisiana restaurant companies (i.e., Crawfish Town USA Inc., Don's Seafood & Steak House Inc., Mansy Enterprises LLC, Mel's Diner Part II Inc., Sammy's LLC, Sammy's of Zachary LLC, and B.S. & J. Enterprises Inc.) sued Radiant Systems Inc., a POS system maker, and Computer World Inc., a POS equipment distributor, charging that the vendors did not secure the Radiant POS systems. The customers were then defrauded by KC software, and restaurant owners incurred financial costs related to this data capture. Similarly, Patco Construction Company, Inc. sued People's United Bank for failing to implement sufficient security measures to detect and address suspicious transactions due to KC. The company finally settled for $345,000, the cost that was stolen plus interest.

Teenage computer hackers, so-called hacktivists (people who protest ideologically by hacking computers), and governments under the auspices of cyber espionage engage in KC activities, but cyber criminals attain the most notoriety. Cyber criminals are as effective as they are evasive due to the organization of their criminal gangs. After taking money from bank accounts via KC, many cyber criminals send the payments to a series of money mules. Money mules are sometimes unwitting participants in fraud who are recruited via the Internet with promises of money for
working online. The mules are then instructed to wire the money to accounts in Russia and China (Krebs 2009). Mules have no face-to-face contact with the heads of KC operations, so it can be difficult to secure prosecutions, though several notable cyber criminals have been identified, charged, and/or arrested. In late 2013 the RCMP secured a warrant for Dmitry Glazyrin, the apparent operator of a botnet, who left Canada before the warrant could be served. Then in early 2014, Russian SpyEye creator Aleksandr Panin was arrested for cyber crime (IMD 2014). Another is Estonian Vladimir Tsastsin, the cyber criminal who created DNSChanger and became rich off online advertising fraud and KC by infecting millions of computers. Finnish Internet security expert Mikko Hermanni Hyppönen claimed that Tsastsin owned 159 Estonian properties when he was arrested in 2011 (IMD 2014). Tsastsin was released 10 months after his arrest due to insufficient proof. As of 2014 Tsastsin has been extradited to the US for prosecution (IMD 2014). Also in 2014 the US Department of Justice (DOJ) filed papers accusing a Russian, Evgeniy Mikhailovich Bogachev, of leading the gang behind GameOver Zeus. The DOJ claims GameOver Zeus caused $100 million in losses to individuals and large organizations.

Suspected Eastern European malware/spyware oligarchs have received ample media attention for perpetuating KC via botnets and Trojan horses, while other perpetrators have taken the public by surprise. In 2011 critics accused software company Carrier IQ of placing KC and geographical position spyware in millions of users' Android devices (International Business Times 2011). The harshest critics have alleged illegal wiretapping on the part of the company, while Carrier IQ has rebutted that what was identified as spyware is actually diagnostic software that provides network improvement data (International Business Times 2011). Further, the company stated that the data was both encrypted and secured and not sold to third parties. In January 2014, 11 students were expelled from Corona del Mar High School in California's affluent Orange County for allegedly using KC to cheat for several years with the help of tutor Timothy Lai. Police report being unable to find Lai, a former resident of Irvine, CA, since the allegations surfaced in December 2013. The students are accused of placing KC hardware onto teachers' computers to get passwords to improve their grades and steal exams. All 11 students signed expulsion agreements in January 2014 whereby they abandoned their right to appeal their expulsions in exchange for being able to transfer to other schools in the district. Subsequently, five of the students' families sued the district for denying the students the right to appeal and/or claiming tutor Lai committed the KC crimes. By the end of March, the school district had spent almost $45,000 in legal fees.

When large organizations are hacked via KC, the news is reported widely. For instance, Visa found KC software being able to transmit card data to a fixed e-mail or IP address where hackers could retrieve it. Here the hackers attached KC to a POS system. Similarly, KC was used to capture the keystrokes of pilots flying the US military's Predator and Reaper drones that have been used in Afghanistan (Shachtman 2011). Military officials were unsure whether the KC software was already built into the drones or was the work of a hacker (Shachtman 2011). Finally, Kaspersky Labs has publicized how it is possible to get control of BMW's Connected Drive system via KC and other malware, and thus gain control of a luxury car that uses this Internet-based system.

Research by Internet security firm Symantec shows that many small and medium-sized businesses believe that malware/spyware is a problem for large organizations (e.g., Visa, the US military). However, since 2010 the company notes that 40% of all companies attacked have fewer than 500 employees, while only 28% of attacks target large organizations. A case in point is a 2012–2013 attack on a California escrow firm, Efficient Services Escrow Group of Huntington Beach, CA, that had one location and nine employees. Using KC malware/spyware, the hackers drained the company of $1.5 million in three transactions wired to bank accounts in China and Russia. Subsequently, $432,215 sent to a Moscow bank was recovered, while the $1.1 million sent to China was never recouped. The
loss was enough to shutter the business's one office and put its nine employees out of work.

Though popular in European computer circles, the relatively low-profile Chaos Computer Club learned that German state police were using KC malware/spyware as well as saving screenshots and activating the cameras/microphones of club members (Kulish and Homola 2014). News of the police's actions led the German justice minister to call for stricter privacy rules (Kulish and Homola 2014). This call echoes a 2006 commission report to the EU Parliament that calls for strengthening the regulatory framework for electronic communications. KC is a pressing concern in the US: as of 2014, 18 states and one territory (i.e., Alaska, Arizona, Arkansas, California, Georgia, Illinois, Indiana, Iowa, Louisiana, Nevada, New Hampshire, Pennsylvania, Rhode Island, Texas, Utah, Virginia, Washington, Wyoming, and Puerto Rico) have anti-spyware laws on the books (NCSL 2015).

Tackling the Problem

The problem of malicious KC can be addressed through software interventions and changes in computer users' behaviors, especially when online. Business travelers may be at a greater risk for losses if they log onto financial accounts using hotel business centers, as these high-traffic areas provide ample opportunities to hackers (Credit Union Times 2014). Many Internet security experts recommend not using public wireless networks, where KC spyware thrives. Experts at Dell also recommend that banks have separate computers dedicated only to banking transactions, with no emailing or web browsing.

Individuals without the resources to devote one computer to financial transactions can, experts argue, protect themselves from KC by changing several computer behaviors. First, individuals should change their online banking passwords regularly. Second, they should not use the same password for multiple accounts or use common words or phrases. Third, they should check their bank accounts on a regular basis for unauthorized transfers. Finally, it is important to log off of banking websites when finished with them and to never click on third-party advertisements that post to online banking sites and take you to a new website upon clicking.

Configurations of one's computer features, programs, and software are also urged to thwart KC. This includes removing remote access (i.e., accessing one's work computer from home) configurations when they are not needed, in addition to using a strong firewall (Beierly 2010). Users need to continually check their devices for unfamiliar hardware attached to mice or keyboards as well as check the listings of installed software (Adhikary et al. 2012; Beierly 2010). Many financial organizations are opting for virtual keypads and virtual mice, especially for online transactions (Kumar 2009). Under this configuration, instead of typing a password and username on the keyboard using number and letter keys, the user scrolls through numbers and letters using the cursor on a virtual keyboard. Experts advise always using the online virtual keyboard, when available, for banking passwords to avoid the risk of keystrokes being logged.

Conclusion

Having anti-KC/malware/spyware alone does not guarantee protection, but experts agree that it is an important component of an overall security strategy. Anti-KC programs include SpyShelter Stop-Logger, Zemana AntiLogger, KeyScrambler Premium, Keylogger Detector, and GuardedID Premium. Some computer experts claim that PCs are more susceptible to KC malware/spyware than are Macs, as KC malware/spyware is often reported to exploit holes in PC operating systems, but new wisdom suggests that all devices can be vulnerable, especially when programs and plug-ins are added to devices. Don Jackson, a senior security researcher with Dell SecureWorks, argues that one of the most effective methods for preventing online business fraud, the air-gap technique, is not widely utilized despite being around since 2005. The air-gap technique creates a unique verification code that is transmitted as a digital token, text message, or other device not connected
to the online account device, so the client can read and then key in the code as a signature for each transaction over a certain amount. Alternatively, in 2014 Israeli researchers presented research on a technique to hack an air-gap network using just a cellphone.

Cross-References

▶ Cyber Espionage
▶ Data Brokers and Data Services
▶ Industrial and Commercial Bank of China

Further Reading

Adhikary, N., Shrivastava, R., Kumar, A., Verma, S., Bag, M., & Singh, V. (2012). Battering keyloggers and screen recording software by fabricating passwords. International Journal of Computer Network & Information Security, 4(5), 13–21.
Aron, J. (2011). Smartphone jiggles reveal your private data. New Scientist, 211(2825), 21.
Beierly, I. (2010). They'll be watching you. Retrieved from http://www.hospitalityupgrade.com/_files/File_Articles/HUSum10_Beierly_Keylogging.pdf.
Button, K. (2013). Wire and online banking fraud continues to spike for businesses. Retrieved from http://www.americanbanker.com/issues/178_194/wire-and-online-banking-fraud-continues-to-spike-for-businesses-1062666-1.html.
Credit Union Times. (2014). Hotel business centers hacked. Credit Union Times, 25(29), 11.
IMD: International Institute for Management Development. (2014). Cybercrime buster speaks at IMD. Retrieved from http://www.imd.org/news/Cybercrime-buster-speaks-at-IMD.cfm.
International Business Times. (2011). Carrier IQ spyware: Company's Android app logging the keystrokes of millions. Retrieved from http://www.ibtimes.com/carrier-iq-spyware-companys-android-app-logs-keystrokes-millions-video-377244.
Krebs, B. (2009). Data breach highlights role of 'money mules'. Retrieved from http://voices.washingtonpost.com/securityfix/2009/09/money_mules_carry_loot_for_org.html.
Kulish, N., & Homola, V. (2014). Germans condemn police use of spyware. Retrieved from http://www.nytimes.com/2011/10/15/world/europe/uproar-in-germany-on-police-use-of-surveillance-software.html?_r=0.
Kumar, S. (2009). Handling malicious hackers & assessing risk in real time. Siliconindia, 12(4), 32–33.
Lyons, K. (2014). Is your computer already infected with dangerous Gameover Zeus software? Virus could be lying dormant in thousands of Australian computers. Retrieved from http://www.dailymail.co.uk/news/article-2648038/Gameover-Zeus-lying-dormant-thousands-Australian-computers-without-knowing.html#ixzz3AmHLKlZ9.
NCSL: National Conference of State Legislatures. (2015). State spyware laws. Retrieved from http://www.ncsl.org/research/telecommunications-and-information-technology/state-spyware-laws.aspx.
Shachtman, N. (2011). Exclusive: Computer virus hits US drone fleet. Retrieved from http://www.wired.com/2011/10/virus-hits-drone-fleet/.
Vancouver Sun. (2013). Police seize computers linked to large cybercrime operation: Malware responsible for over $500 million in losses has affected more than five million people globally. Retrieved from http://www.vancouversun.com/news/Police+seize+computers+linked+large+cybercrime+operation/8881243/story.html#ixzz3Ale1G13s.

Keystroke Logger

▶ Keystroke Capture

Keystroke Recorder

▶ Keystroke Capture

Key-Value-Based Database

▶ NoSQL (Not Structured Query Language)

Knowledge Discovery

▶ Data Discovery

Knowledge Graph

▶ Ontologies

Knowledge Hierarchy

▶ Data-Information-Knowledge-Action Model

Knowledge Management

Magdalena Bielenia-Grajewska
Division of Maritime Economy, Department of Maritime Transport and Seaborne Trade, University of Gdansk, Gdansk, Poland
Intercultural Communication and Neurolinguistics Laboratory, Department of Translation Studies, University of Gdansk, Gdansk, Poland

There are different definitions of knowledge management. As Gorelick et al. (2004, p. 4) state, "knowledge management is a vehicle to systematically and routinely help individuals, groups, teams, and organizations to: learn what the individual knows; learn what others know (e.g. individuals and teams); learn what the organization knows; learn what you need to learn; organize and disseminate these learnings effectively and simply; apply these learnings to new endeavours". Knowledge can also be defined by juxtaposing it with another phenomenon that is relatively close to it. As Foray (2006, p. 4) claims, "in my conception, knowledge has something more than information: knowledge – in whatever field – empowers its possessors with the capacity for intellectual or physical action. What I mean by knowledge is fundamentally a matter of cognitive ability. Information, on the other hand, takes the shape of structured and formatted data that remain passive and inert until used by those with the knowledge needed to interpret and process them." He adds that "therefore, the reproduction of knowledge and the reproduction of information are clearly different phenomena. While one takes place through learning, the other takes place simply through duplication. Mobilization of a cognitive resource is always necessary for the reproduction of knowledge, while information can be reproduced by a photocopy machine" (Foray 2006, p. 4). In short, knowledge management (KM) can be defined as a set of tools and methods connected with organizing knowledge. It encompasses such activities as creating, encoding, systematizing, distributing, and acquiring knowledge. There are a number of reasons why knowledge management is very crucial in modern times. First of all, it should be mentioned that the twenty-first century can be characterized by the large amount of data that modern people are surrounded by. Secondly, many spheres of modern life depend on knowledge flows; information societies demand not only access to knowledge but also its effective management. Thirdly, technological advancements facilitate the effectiveness connected with different stages of knowledge management. Thus, the need to manage knowledge has become more important nowadays than it was in other centuries. Knowledge management is classified by taking into account both the process and the subject approach. However, the processual perspective reflects the changing nature of knowledge that has to constantly adapt to new conditions of the environment and expectations of the target audience. Thus, KM is studied by taking into account the processes accompanying creating, codifying, and disseminating as well as teaching and learning. Apart from processes, KM should also be investigated through the prism of the different types of knowledge involved in knowledge management. As far as other features of knowledge management are concerned, Jemielniak (2012) stresses that knowledge is a primary resource that allows other resources to be created and acquired. Moreover, in the process of using it, knowledge is not used up but grows continually.

Types of Knowledge

Knowledge can be classified by taking into account different factors. The famous division is the one by Nonaka and Konno (1998), who discuss the concepts of tacit and explicit knowledge. "Explicit knowledge can be expressed in words and numbers and shared in the form of data, scientific formulae, specifications, manuals, and the like. Tacit knowledge is highly personal and hard to formalize, making it difficult to communicate or share with others" (Nonaka and Konno 1998, p. 42). Another way is to look at knowledge management through the prism of knowledge architects. The first notion that can be taken into account is the level of professionalism among information creators. Thus, such types of knowledge can be distinguished as professional/expert knowledge and laymen knowledge. Professional/expert knowledge is connected with knowledge that can be acquired exclusively by vocational schooling, professional experience, and/or specialized training. On the other hand, laymen knowledge is associated with the knowledge on a topic possessed by an average human being, resulting from one's experience with using, e.g., a given device or from information gained from others. Knowledge can also be categorized by taking into account the acceptable level of information disclosure; consequently, open and closed knowledge can be distinguished. Open knowledge is available freely to everybody, whereas closed knowledge is directed at a selected group of users. An example of open knowledge is an article published online in an open-access journal, whereas the same article published in a subscription journal belongs to closed knowledge. Knowledge can also be classified by taking into account the notion of tangibility. Tangible knowledge is the type of knowledge that can be easily perceived and measured (e.g., by points and marks). On the other hand, intangible knowledge encompasses knowledge that cannot be easily perceived and managed. Knowledge can also be classified by analyzing the channel used for creating and disseminating knowledge. The first division concerns the type of sense used for knowledge processing; the most general division includes the classification into verbal and nonverbal knowledge. Verbal knowledge encompasses knowledge produced and disseminated in a verbal way, by relying on a commonly known linguistic system of communication. Verbal knowledge includes, e.g., words and phrases characteristic of a given language and culture. Verbal knowledge can also be subcategorized by observing, e.g., the length of information. Thus, the micro approach encompasses morphemes, words, and phrases, the meso dimension focuses on texts, whereas the macro dimension concerns, e.g., corporate or national linguistic policies. Nonverbal knowledge encapsulates other than verbal types of knowledge. For example, auditory knowledge encompasses elements of knowledge disseminated through the audio channel; it is represented in jingles and songs. Olfactory knowledge includes knowledge gained by the sense of smell, and it concerns, e.g., the flavors connected with regional festivities. Another type of knowledge is tactile knowledge, being the type of knowledge acquired through the physical experience of touching objects. The advancement of modern technology has also led to the classification of online knowledge and offline knowledge. Online knowledge is the type of knowledge created and made available on the Internet, whereas offline knowledge encompasses the knowledge made and published outside the web.

Knowledge Transfer
Knowledge transfer or knowledge flow can be briefly defined as moving knowledge from one person/organization to another. Teece (2001) divides knowledge transfer into internal and external knowledge transfer. Internal transfer takes place within an organization, e.g., between workers, whereas external transfer takes place from one company to another. The latter includes technology transfer and intellectual property rights. As Zizzo (2005) claims, transfer of knowledge can be vertical or horizontal. Vertical transfer of knowledge is connected with using rules and characteristics in similar situations, whereas horizontal transfer of knowledge is represented in the direct and context-dependent adaptation of a problem to a similar one.

Factors Determining Knowledge Management
One of the key factors determining knowledge management is language. The first linguistic issue shaping KM is the opportunity to access information, taking into account the language used to create and disseminate knowledge. Thus, the lack of linguistic skills in a given language may lead to limited or no access to required data. For example, knowledge created and disseminated in English can be reached only by users of the lingua franca. To meet the growing demand for knowledge among linguistically diverse users, translation is one of the methods directed at effective knowledge management. Another approach to organizational knowledge management can be observed in many international companies; they adopt a corporate linguistic policy that regulates linguistic matters in business entities by analyzing both corporate and individual linguistic needs. Apart from languages understood in the broad sense, that is, as the linguistic repertoire used by a nation or a large cultural group, they may also be studied by taking into account dialects or professional genres. As far as knowledge management is concerned, attention is focused on making knowledge created within a small linguistic community relatively accessible to all interested stakeholders. Analyzing the example of corporations, knowledge produced in, e.g., the professional discourse of accountants should be comprehensible to representatives of other professions during corporate meetings. Linguistic knowledge management also concerns the selection of linguistic tools to manage knowledge effectively. An example of the mentioned linguistic approach is to select the verbs or adjectives that are supposed to attract readers to knowledge, make the content reliable and informative, as well as invoke certain reactions. In addition, the discussion on linguistic tools should also encompass the role of literal and nonliteral language in KM. As Bielenia-Grajewska (2014, 2015, 2018) stresses, the nonliteral dimension encompasses the figurative tools used in discourse, such as idioms, puns, similes, and metaphors. Taking the example of metaphors, they serve different functions in effective knowledge management. First of all, metaphors facilitate the understanding of complex and novel information. Using a familiar domain to explain a new concept turns out to be very effective. Secondly, metaphors serve as an economical way of providing information. Instead of using long phrases to facilitate the explanation of novel approaches in knowledge, a metaphor relying on a well-known domain makes the concept comprehensible. The next important factor of knowledge management is technology; technological advancements accompany all stages of KM. For example, technology facilitates the development of knowledge as well as its subsequent presentation in online and offline informational outlets and special databases. Knowledge management is also determined by individual and social factors. Taking into account the personal dimension, the attitude to knowledge management is shaped by such factors as gender, age, profession, and interest in innovation. Knowledge management also depends on group features, namely, how a given community perceives the importance of knowledge management. Knowledge management also depends on the political and economic situation of a country. Thus, it should be stated that there are different factors of micro, meso, and macro nature that determine the way knowledge is created, disseminated, and stored.

Fantino and Stolarz-Fantino (2005) discuss the role of different types of context, such as spatial context, temporal context, historical context, and linguistic context, in understanding one's behavior. They also stress that context can be understood in different ways. "Context refers to many aspects of our environment that play an important role in determining our behavior. For example, in the laboratory, the term context may be use to refer to any of the following: background stimuli that affect the degree of conditioning to foreground stimuli; historical events that affect subjects' appreciation of contemporary stimuli; rules or superordinate stimuli that stipulate the correct response to a target stimulus in a situation. In the laboratory it is simple to demonstrate stimulus control, by which we mean that a behavior will be maintained in the presence of one stimulus (or context) but not in another. The more similar two contexts (or stimulus configurations), the more likely a behavior acquired in the presence of one context is likely to transfer to (occur in the presence of) the other context" (2005, p. 28). Nonaka and Konno (1998) discuss that knowledge is embedded in ba (shared spaces). Ba is a concept coined by the Japanese philosopher Kitaro Nishida to denote a shared space for emerging relationships. The space can be of different character: physical (e.g., office, home), virtual (e.g., online chat), mental (shared experience), or a combination of the mentioned features. Ba is perceived as a platform for facilitating individual and collective knowledge.

Knowledge Management and Methodology

Researchers rely on different methods in investigating KM. For example, the selection of tools depends on the nature of the knowledgeable elements. Methodology can be classified, e.g., by taking into account the type of stimuli, such as verbal and nonverbal elements. Thus, audio elements, such as sounds, songs, or jingles, require different methods of data management than olfactory data. Taking into account the magnitude of factors determining KM, there are certain methodologies that prove to be used more often than others. One of the methods applied to study KM is the network approach. Within the network perspectives, Bielenia-Grajewska (2011) highlights the one called actor-network analysis which, stressing the importance of both living and nonliving entities in the way a given person, thing, or organization performs, turns out to be useful in KM. Applying the ANT approach, it is possible to highlight the role of things, such as telephones or computers, in transmitting and storing knowledge. In addition, it can be researched how, e.g., the technological defects of machines influence the effectiveness of knowledge management. Another network technique – social network analysis – concentrates on the relations between individuals. This approach may provide information on how data is distributed among network members and how the types of nodes and ties determine the way KM is handled. Apart from interdisciplinary approaches, KM may also rely on various disciplines. It should also be stated that the methods used in linguistics may support the research on knowledge management. For example, critical discourse analysis facilitates the understanding of verbs, nouns, adjectives, or numerals in managing knowledge. CDA may help to create a text that will be understandable by a relatively large group of people; verbal and pictorial tools of communication are studied to show how they, separately as well as together, determine knowledge management on a given topic. In addition, CDA offers the option to study how knowledge management changes depending on the type of situation. For example, risky conditions demand other communication tools than the coverage of leisure activities. Another discipline that facilitates the research on knowledge management is neuroscience, which offers a plethora of methods to investigate the way knowledge is perceived and understood. For example, modern neuroscientific apparatus makes it possible to study the effectiveness of knowledge. As Bielenia-Grajewska (2013) stresses, this is visible in the application of neuroscientific tools in modern management. One of the possible tools used in neuroscientific investigations is fMRI. Functional magnetic resonance imaging is a technique that uses the advancements of magnetic resonance to research brain performance. The investigation concerns mainly two stages. The first part is devoted to taking the anatomical scans of the subject's brain when the person lies still in a scanner. The next stage concerns the active involvement of a subject in some activity (e.g., choosing a word, a phrase, or a picture). The apparatus measures the BOLD (blood oxygen level dependent) signal that shows which parts of the brain are active. Such experiments facilitate effective knowledge management since they show which pieces of information are more easily understood. It should be mentioned that investigations on other parts of the body may also provide information on how knowledge is understood. The emotional response to knowledge management can be investigated by analyzing the way face muscles respond to a given stimulus. Facial electromyography (fEMG) measures the face muscle nerves (e.g., the zygomatic major muscle) when the subject is shown a stimulus. In addition, the emotional attitude of the subject to the presented knowledge can be checked by observing electrodermal reactions, using the technique called galvanic skin response. Researchers may also observe the heart rate or blood pressure to check the reaction of the subject to a given stimulus. In addition, knowledge management can be researched from both qualitative and quantitative perspectives. As far as the quantitative focus is concerned, knowledge management can be supported by statistical tools that organize big data. It should also be stated that the growing role of the Internet in modern life has led to an interest in online and offline approaches to knowledge management. Knowledge management uses different tools to disseminate knowledge in a quick and effective way. One of the ways is to use stories in KM. As Gorelick et al. (2004) state, real stories based on one's experience become codified and become available in the form of knowledge assets. Among different tools, Probst (2002) discusses the role of case-writing for knowledge management. First, he mentions that case writing is used as a teaching tool in, e.g., MBA studies since it offers students learning new knowledge from real-life situations. Secondly, the narrative style of case writing offers discussion and reflection on issues presented in cases. Thirdly, cases are an effective tool for increasing the skills and knowledge of managers. He suggests that companies should write cases about their situations that show how experience and knowledge were acquired over a period of time. In the case of collective case writing, learning is fostered in a spiral way from the individual, through the group, to the corporate level.

Knowledge and Big Data

The place of knowledge management in big data is often discussed by taking into account the novelties in the sphere of dealing with information. Knowledge nowadays can be extracted from different types of data, namely, structured data and unstructured data. Structured data is well-organized and can be found in different databases. It may include names, telephones, and addresses, among others. Unstructured data, on the other hand, is not often to be found in databases and is not as searchable as structured data is. It includes material of different natures, such as written, video, or audio material, including websites, texts, emails, and conversations. In the area of big data, information exists in different types and in different quantities that can be extracted by both humans and machines. As Neef (2015) discusses, two concepts are associated with big data. Social intelligence is connected with monitoring social media and paying attention to data connected with likes, dislikes, sentiment data, and brand names. Social analytics comprises the tools applied to analyzed data, connected with what users discuss (share), what their opinion about these things is (engagement), and the way they discuss them (reach).

Future of Knowledge Management

It can be predicted that the increase in the amount of knowledge should be supported with more advanced tools that will enable not only the acquisition of extensive data but also its use and storage. Thus, it can be estimated that future knowledge management will depend even more on the advancements of modern technology. One example of using the improvements in other domains of science is the application of neuroscientific expertise in the field of knowledge management, as well as statistical methods aimed at analyzing large quantities of data. In addition, since big data are becoming more and more important in the reality of the twenty-first century, knowledge management has to rely on diverse and complicated tools that will facilitate the creation and dissemination of data. Consequently, the interrelation between KM and other disciplines is supposed to be growing in the coming years.

Cross-References

▶ Information Society
▶ Social Media
▶ Social Network Analysis
▶ Statistics

Further Reading

Bielenia-Grajewska, M. (2011). A potential application of actor network theory in organizational studies: The company as an ecosystem and its power relations from the ANT perspective. In A. Tatnall (Ed.), Actor-network theory and technology innovation: Advancement and new concepts. Hershey: Information Science Reference.
Bielenia-Grajewska, M. (2013). International neuromanagement. In D. Tsang, H. H. Kazeroony, & G. Ellis (Eds.), The Routledge companion to international management education. Abingdon: Routledge.
Bielenia-Grajewska, M. (2014). CSR online communication: The metaphorical dimension of CSR discourse in the food industry. In R. Tench, W. Sun, & B. Jones (Eds.), Communicating corporate social responsibility: Perspectives and practice (Critical studies on corporate responsibility, governance and sustainability, volume 6). Bingley: Emerald Group Publishing Limited.
Bielenia-Grajewska, M. (2015). The role of figurative language in knowledge management: Knowledge encoding and decoding from the metaphorical perspective. In M. Khosrow-Pour (Ed.), Encyclopedia of information science and technology. Hershey: IGI Publishing.
Bielenia-Grajewska, M. (2018). Knowledge management from the metaphorical perspective. In M. Khosrow-Pour (Ed.), Encyclopedia of information science and technology (4th ed.). Hershey: IGI Publishing.
Fantino, E., & Stolarz-Fantino, S. (2005). Context and its effect on transfer. In D. J. Zizzo (Ed.), Transfer of knowledge in economic decision making. Basingstoke: Palgrave Macmillan.
Foray, D. (2006). The economics of knowledge. Cambridge, MA: The MIT Press.
Gorelick, C., Milton, N., & April, K. (2004). Performance through learning: Knowledge management in practice. Oxford: Elsevier.
Jemielniak, D. (2012). Zarządzanie wiedzą. Podstawowe pojęcia [Knowledge management: Basic concepts]. In D. Jemielniak & A. K. Koźmiński (Eds.), Zarządzanie wiedzą [Knowledge management]. Warszawa: Oficyna Wolters Kluwer.
Neef, D. (2015). Digital exhaust: What everyone should know about big data, digitization and digitally driven innovation. Upper Saddle River: Pearson Education.
Nonaka, I., & Konno, N. (1998). The concept of "Ba": Building a foundation for knowledge creation. California Management Review, 40(3), 40–54.
Probst, G. J. B. (2002). Putting knowledge to work: Case-writing as a knowledge management and organizational learning tool. In T. H. Davenport & G. J. B. Probst (Eds.), Knowledge management case book. Erlangen: Publicis Corporate Publishing and John Wiley & Sons.
Teece, D. J. (2001). Strategies for managing knowledge assets: The role of firm structure and industrial context. In I. Nonaka & D. J. Teece (Eds.), Managing industrial knowledge: Creation, transfer and utilization. London: SAGE Publications.
Zizzo, D. J. (2005). Transfer of knowledge and the similarity function in economic decision-making. In D. J. Zizzo (Ed.), Transfer of knowledge in economic decision making. Basingstoke: Palgrave Macmillan.

Knowledge Pyramid

▶ Data-Information-Knowledge-Action Model
L

LexisNexis and testing. The entire complex serves its Reed


Elsevier sister companies while also providing
Jennifer J. Summary-Smith LexisNexis customers with the following: backup
Florida SouthWestern State College, Fort Myers, services, data hosting, and online services.
FL, USA LexisNexis opened its first remote data center
Culver-Stockton College, Canton, MO, USA and development facility in Springfield, Ohio, in
2004, which hosts new product development.
Both data centers function as a backup and recov-
As stated on its website, LexisNexis is a leading ery facility for each other.
global provider of content-enabled workflow According to the LexisNexis’ website, its
solutions. This corporation provides data and customers use services that span multiple
solutions for professionals in areas such as the servers and operating systems. For example,
academia, accounting, corporate world, govern- when a subscriber submits a search request, the
ment, law enforcement, legal, and risk manage- systems explore and sift through massive
ment. LexisNexis is a subscription-based service, amounts of information. The answer set is typi-
with two data centers located in Springfield and cally returned to the customer within 6–10 s,
Miamisburg, Ohio. The centers are among the resulting in a 99.99% average for reliability
largest complexes of their kind in the United and availability of the search. This service is
States, providing LexisNexis with “one of the accessible to five million subscribers, with
most complete comprehensive collections of nearly five billion documents of source informa-
online information in the world.” tion available online and stored in the
Miamisburg facility. The online services also
provide access to externally hosted data from
Data Centers the Delaware Secretary of State, Dun & Brad-
street Business Reports, Historical Quote, and
The LexisNexis data centers hold network Real-Time Quote. Given that a large incentive
servers, software, and telecommunication equip- for data center services is to provide expansion
ment, which is a vital component of the entire capacity for all future hosting opportunities, this
range of LexisNexis products and services. The has led to an increase in the percentage of total
data centers service the LexisNexis Group revenue for Reed Elsevier. Currently, the
Inc. providing assistance for application develop- Miamisburg data center supports over two bil-
ment, certification and administrative services, lion dollars in online revenue for Reed Elsevier.

© Springer Nature Switzerland AG 2022


L. A. Schintler, C. L. McNeely (eds.), Encyclopedia of Big Data,
https://doi.org/10.1007/978-3-319-32010-6
622 LexisNexis

Mainframe Servers

There are over 100 servers housed in the Springfield center, managing over 100 terabytes of data storage. As for the Miamisburg location, this complex holds 11 huge mainframe servers, running 34 multiple virtual storage (MVS) operating system images. The center also has 300 midrange Unix servers and almost 1,000 multiprocessor NT servers. They provide a wide range of computer services including patent images to customers, preeminent US case law citation systems, hosting channel data for Reed Elsevier, and computing resources for the LexisNexis enterprise. As the company states, its processors have access to over 500 terabytes (or one trillion characters) of data storage capacity.

Telecommunications

LexisNexis has developed a large telecommunications network, permitting the corporation to support its data collection requirements while also serving its customers. As noted on its website, subscribers to the LexisNexis Group have a search rate of one billion times annually. LexisNexis also provides bridges and routers and maintains firewalls, high-speed lines, modems, and multiplexors, providing an exceptional degree of connectivity.

Physical Dimensions of the Miamisburg Data Center

LexisNexis Group has hardware, software, electrical, and mechanical systems housed in a 73,000 ft² data center hub. Its sister complex, located in Springfield, comprises a total of 80,000 ft². In these facilities, the data center hardware, software, electrical, and mechanical systems have multiple levels of redundancy, in the event that a single component fails, ensuring uninterrupted service. The company's website states that its systems are maintained and tested on a regular basis to ensure they perform correctly in case of an emergency. The LexisNexis Group also holds and stores copies of critical data off-site. Multiple times a year, emergency business resumption plans are tested. Furthermore, the data center has system management services 365 days a year and 24 h a day provided by skilled operations engineers and staff. If needed, there are additional specialists on site, or on call, to provide the best support to customers. According to its website, LexisNexis invests a great deal in protection architecture to prevent hacking attempts, viruses, and worms. In addition, the company also has third-party contractors which conduct security studies.

Security Breach

In 2013, Byron Acohido reported that a hacking group hit three major data brokerage companies. LexisNexis, Dun & Bradstreet, and Kroll Background America are companies that stockpile and sell sensitive data. The group that hacked these data brokerage companies specialized in obtaining and selling social security numbers. The security breach was disclosed by the cybersecurity blogger Brian Krebs. He stated that the website ssndob.ms (SSNDOB), whose acronym stands for social security number and date of birth, markets itself on underground cybercrime forums, offering services to its customers who want to look up social security numbers, birthdays, and other data on any US resident. LexisNexis found an unauthorized program called nbc.exe on its two systems listed in the botnet interface network located in Atlanta, Georgia. The program was placed as far back as April 2013, compromising their security for at least 5 months.

LexisNexis Group Expansion

As of July 2014, LexisNexis Risk Solutions expanded its healthcare solutions to the life science marketplace. In an article, Amanda Hall notes that an internal analysis revealed that 40% of the customer files have missing or inaccurate information in a typical life science company. LexisNexis Risk Solutions has leveraged its leading databases, reducing costs, improving

effectiveness, and strengthening identity transparency. LexisNexis is able to deliver data to over 6.5 million healthcare providers in the United States. This will benefit life science companies, allowing them to tailor their marketing and sales strategies and to identify the correct providers to pursue. The LexisNexis databases are more efficient, which will help health science organizations gain compliance with federal and state laws.

Following the healthcare solutions announcement, Elisa Rodgers writes that Reed Technology and Information Services, Inc., a LexisNexis company, acquired PatentCore. PatentCore is an innovator of patent data analytics. PatentAdvisor is a user-friendly suite, delivering information to assist with a more effective patent prosecution and management. Its web-based patent analytic tools will help IP-driven companies and law firms by making patent prosecution a more strategic and probable process.

The future of the LexisNexis Group should include more acquisitions, expansion, and increased capabilities for the company. According to its website, the markets for their companies have grown over the last three decades, servicing professionals in academic institutes, corporations, governments, and business. LexisNexis Group provides critical information, in easy-to-use electronic products, to the benefit of subscribed customers. The company has a long history of fulfilling its mission statement "to enable its customers to spend less time searching for critical information and more time using LexisNexis knowledge and management tools to guide critical decisions." For more than a century, legal professionals have trusted the LexisNexis Group. It appears that the company will continue to maintain this status and remain one of the leading providers in the data brokerage marketplace.

Cross-References

▶ American Bar Association
▶ Big Data Quality
▶ Data Brokers and Data Services
▶ Data Center
▶ Ethical and Legal Issues

Further Reading

Acohido, B. LexisNexis, Dunn & Bradstreet, Kroll Hacked. http://www.usatoday.com/story/cybertruth/2013/09/26/lexisnexis-dunn–bradstreet-altegrity-hacked/2878769/. Accessed July 2014.
Hall, A. LexisNexis verified data on more than 6.5 million providers strengthens identity transparency and reduces costs for life science organizations. http://www.benzinga.com/pressreleases/14/07/b4674537/lexisnexis-verified-data-on-more-than-6-5-million-providers-strengthens. Accessed July 2014.
Krebs, B. Data broker giants hacked by ID theft service. http://krebsonsecurity.com/2013/09/data-broker-giants-hacked-by-id-theft-service/. Accessed July 2014.
LexisNexis. http://www.lexisnexis.com. Accessed July 2014.
Rodgers, E. Adding multimedia reed tech strengthens line of LexisNexis intellectual property solutions by acquiring PatentCore, an innovator in patent data analytics. http://in.reuters.com/article/2014/07/08/supp-pa-reed-technology-idUSnBw015873a+100+BSW20140708. Accessed July 2014.

Lightnet

▶ Surface Web vs Deep Web vs Dark Web

Link Prediction in Networks

Anamaria Berea
Department of Computational and Data Sciences, George Mason University, Fairfax, VA, USA
Center for Complexity in Business, University of Maryland, College Park, MD, USA

Link prediction is an important methodology in social network analysis that aims to predict existing links between the nodes of a network when there is incomplete or partial information about the network. Link prediction is also a very important method for assessing the development and evolution of dynamic networks. While link prediction is not a method specific only to big data, as it can be used with smaller datasets as well, its importance for big data arises from the complexity of large networks with varied

topologies and the importance of pattern identification that is specific only to large, complex datasets (Wei et al. 2017).

A number of algorithms have been developed that predict missing information in networks or reconstruct networks. Hasan and Zaki (2011) give an overview of the current techniques used in link prediction and classify them into three types of algorithms:

1. The first class of algorithms computes a similarity score between the nodes and employs a training/learning method; these models are considered as having a classic approach.
2. The second class of algorithms is based on Bayesian probabilistic inference and on probabilistic relational methods.
3. The third class of algorithms is based on graph evolution models or linear algebraic formulations.

Besides proposing this taxonomy for link prediction, Hasan and Zaki (2011) also identify the current problems and research gaps in this field. Specifically, they show that time-aware link prediction (or predicting the evolution of a network topology), scalability of proposed solutions (particularly in the case of probabilistic algorithms), and game-theoretic approaches to link prediction are areas where more research is needed.

Liben-Nowell and Kleinberg (2007) proposed some of the earliest link prediction methods for social networks, based on node proximity, and compared them with other algorithms by ranking them on their accuracy and performance. They looked at various similarity measurements between pairs of nodes. Their proposed methodology is part of the first class of algorithms, which uses similarity between known nodes in order to train the model for future nodes. They compared the performance of algorithms such as Adamic/Adar, weighted Katz, Katz clustering, low-rank approximation (inner product), Jaccard's coefficient, graph distance, common neighbors, hitting time, rooted PageRank, and SimRank on five networks of co-authorship from arXiv and concluded that the best performance was given by Katz clustering, although Katz, Adamic/Adar, and the low-rank inner product are similar in their predictions. They also found that the most different method from all others was the hitting time. Nonetheless, all algorithms perform quite poorly, with only a 16% accuracy as the maximum best prediction from Katz on only one data set.

Sharma et al. (2014) also review the current techniques used in link prediction and make an experimental comparison between them. They classify these techniques into three groups:

1. Node-based techniques.
2. Link-based techniques.
3. Path-based techniques.

They also classify the link prediction techniques as graph-theoretic, statistical, supervised learning, and clustering approaches, and they choose 12 of the most used techniques, which they classify and test experimentally. These techniques are the following: Node Neighborhood, Jaccard's Coefficient, Adamic/Adar, Hitting Time, Preferential Attachment, Katz (β = 0.01), Katz (β = 0.001), Katz (β = 0.0001), SimRank, Commute Time, Normalized Commute Time, LRW, SRW, Rooted PageRank (α = 0.01), Rooted PageRank (α = 0.1), and Rooted PageRank (α = 0.5). They compared the precision of the link prediction from these techniques on a real dataset and concluded that the Local Random Walk (LRW) technique has the best performance.
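The neighborhood-based similarity scores named above (common neighbors, Jaccard's coefficient, and Adamic/Adar) can be written down compactly. The following is a minimal illustrative sketch in Python, using a small made-up undirected network stored as an adjacency dictionary; it is not the evaluation code used in any of the cited studies.

```python
import math

# Toy undirected network as an adjacency dictionary (hypothetical example data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "d": {"a"},
    "e": {"c"},
}

def common_neighbors(g, u, v):
    """Number of neighbors shared by u and v."""
    return len(g[u] & g[v])

def jaccard(g, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

def adamic_adar(g, u, v):
    """Shared neighbors weighted by 1/log(degree): rarer neighbors count more."""
    return sum(1.0 / math.log(len(g[z])) for z in g[u] & g[v] if len(g[z]) > 1)

# Score every currently unconnected pair; the highest-scoring pairs are the
# predicted links.
nodes = sorted(graph)
candidates = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
              if v not in graph[u]]
ranked = sorted(candidates, key=lambda p: adamic_adar(graph, *p), reverse=True)
print(ranked[:3])
```

In practice such scores are computed for all unconnected pairs and the top-ranked pairs are returned as predicted links, which is essentially how the similarity-based (first) class of algorithms described above operates.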

Lü and Zhou (2011) also surveyed a series of techniques used in link prediction, from the statistical physics point of view. They classify the algorithms similarly to Hasan and Zaki (2011), as:

1. Similarity-based algorithms
2. Maximum likelihood methods
3. Probabilistic models

They also emphasize that most of the current techniques are focused on unweighted, undirected networks and that directed networks add another layer of complexity to the problem. Also, another difficult problem is to predict not only the existence of a link but also the weight of that link. They also show that further challenges in link prediction come from multi-dimensional networks. Specifically, a big challenge is link prediction in multi-dimensional networks, where links could have different meanings or where the network consists of several classes of nodes.

Link prediction becomes a particular problem in the case of sparse networks (Lichtenwalter et al. 2010). The authors address it using a "supervised approach" through training classifiers. They also dismiss the unsupervised approaches, which are based on node neighborhoods or path information, as too simplistic, since they are based on a single metric (Lichtenwalter et al. 2010). On the contrary, the supervised approaches use training classifiers.

Other research shows that link prediction can be effectively done by using a spatial proximity approach rather than network-based measures (Wang et al. 2011).

Particularly in very large datasets or very large complex networks, link prediction is a critical algorithm for understanding the evolution of such networks and their dynamic topology, especially in social media data, where the links can be sparse or missing and there is an abundance of nodes and information exchanged through these nodes.

Additionally, link prediction algorithms have more recently been used to improve the performance of graph neural networks (Zhang and Chen 2018) and show great potential for the refinement of current neural networks and AI algorithms.

Further Reading

Al Hasan, M., & Zaki, M. J. (2011). A survey of link prediction in social networks. In Social network data analytics (pp. 243–275). Boston: Springer.
Liben-Nowell, D., & Kleinberg, J. (2007). The link prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019–1031.
Lichtenwalter, R. N., Lussier, J. T., & Chawla, N. V. (2010). New perspectives and methods in link prediction. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM.
Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6), 1150–1170.
Sharma, D., Sharma, U., & Khatri, S. K. (2014). An experimental comparison of the link prediction techniques in social networks. International Journal of Modeling and Optimization, 4(1), 21.
Wang, D., et al. (2011). Human mobility, social ties, and link prediction. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM.
Wei, X., Xu, L., Cao, B., & Yu, P. S. (2017). Cross view link prediction by learning noise-resilient representation consensus. In Proceedings of the 26th International Conference on World Wide Web, 1611–1619.
Zhang, M., & Chen, Y. (2018). Link prediction based on graph neural networks. In Advances in neural information processing systems.
Zhou, M.-Y., Liao, H., Xiong, W.-M., Wu, X.-Y., & Wei, Z.-W. (2017). Connecting patterns inspire link prediction in complex networks. Complexity, 2017, Article ID 8581365, 12 p. https://doi.org/10.1155/2017/8581365.

Link/Graph Mining

Derek Doran
Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA

Synonyms

Network analysis; Network science; Relational data analytics

Definition/Introduction

Link/graph mining is defined as the extraction of information within a collection of interrelated objects. Whereas conventional data mining imagines a database as a collection of "flat" tables, where entities are rows and attributes of these entities are columns, link/graph mining imagines entities as nodes or vertices in a network, with attributes attached to the nodes themselves.

Relationships among datums in a "flat" database may be seen by primary key relationships or by common values across a set of attributes. In the link/graph mining view of a database, these relationships are made explicit by defining links or edges between vertices. The edges may be homogeneous, where a single kind of relationship defines the edges that are formed, or heterogeneous, where multiple kinds of data are used to develop a vertex set, and relationships define edges among network vertices. For example, a relation from vertex A to B and a relation from vertex C to D in a homogeneous graph means that A is related to B in the same way that C is related to D. An example of a homogeneous graph may be one where nodes represent individuals and connections represent a friendship relationship. An example of a heterogeneous graph is one where different types of network devices connect to each other to form a corporate intranet. Different node types correspond to different device types, and different relationships may correspond to the type of network protocol that two devices use to communicate with each other. Networks may be directed (e.g., a link may be present from A to B but not vice versa) or undirected (e.g., a link from A to B exists if and only if a link from B to A exists). Link/graph mining is intimately related to network science, which is the scientific study of the structure of complex systems. Common link/graph mining tasks include discovering shortest or expected paths in the network, an importance ranking of nodes or vertices, understanding relationship patterns, identifying common clusters or regions of a graph, and modeling propagation phenomena across the graph. Random graph models give researchers a way to identify whether a structural or interaction pattern seen within a dataset is statistically significant.

Network Representations of Data

While a traditional "tabular" representation of a dataset contains information necessary to understand a big dataset, a network representation makes explicit datum relations that may be implicit in a data table. For example, in a database of employee personnel and their meeting calendars, a network view may be constructed where employees are nodes and edges are present if two employees will participate in the same meeting. The network thus captures a "who works with who" relationship that is only implicit in the data table. Analytics over the network representation itself can answer queries such as "how did somebody at meeting C hear about information that was only discussed during meeting A?", or "which employee may have been exposed to the most amount of potential information, rumors, and views, as measured by participating in many meetings where few other participants overlap?"

The network representation of data has another important advantage: the network itself represents the structure of a complex system of interconnected participants. These participants could be people or even components of a physical system. There is some agreement in the scientific community that the complexity of most technological, social, biological, and natural systems is best captured by its representation as a network. The field of network science is devoted to the scientific application of link and graph mining techniques to quantitatively understand, model, and make predictions over complex systems. Network science defines two kinds of frameworks under which link/graph mining is performed: (i) exploratory analysis and (ii) hypothesis-driven analysis. In exploratory analysis, an analyst has no specific notion about why and how nodes in a complex system connect or are related to each other or why a complex network takes on a specific structure. Exploratory analysis leads to a hypothesis about an underlying mechanism of the system based on regularly occurring patterns or based on anomalous graph metrics. In hypothesis-driven analysis, the analyst has some at-hand evidence supporting an underlying mechanism about how a system operates and is interested in understanding how the structural qualities of the system speak in favor or in opposition to the mechanism. Under either setting, hypotheses may be tested by comparing observations against random network models to identify whether or not patterns in support or in opposition of a

hypothesis are significant or merely occurred by chance. Network science is intimately tied to link/graph mining: it defines an apparatus for analysts to use link/graph mining methods that can answer important questions about a complex system. Similarly, network science procedures and analyses are the primary purpose for the development of link/graph mining techniques. The utility of one would thus not nearly be as high without the other.

Representation
The mathematical representation of a graph is a basic preprocessing step for any link/graph mining task. One form may be as follows: every node in the graph is labeled with an integer i = 1 . . . n and a tuple (i, j) is defined for a relationship between nodes i and j. A network may then be defined by the value n and a list of all tuples. For example, let n = 5 and define the set {(1, 2), (3, 4), (2, 4), (4, 1), (2, 3)}. This specifies a graph with five vertices, one of which is disconnected (vertex 5) and the others having edges between them as defined by the set. Such a specification of a network is called an edge list. Another approach is to translate the edge list representation into an adjacency matrix A. This is defined as an n × n matrix where the element Aij, corresponding to the ith row and jth column of the matrix, is equal to 1 if the tuple (i, j) or (j, i) exists in the edge list. When edges are unlabeled or unweighted, A is simply a binary matrix. Alternatively, if the graph is heterogeneous or allows multiple relationships between the same pair of nodes, then Aij is equal to the number of edges between i and j. When A is not symmetric, the graph is directed rather than undirected.
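As a concrete illustration of the two representations just described, the short Python sketch below builds the adjacency matrix A for the five-vertex edge list used as the example above; the helper function name is ours and is not part of any particular graph library.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix (list of lists) from an edge list.

    Vertices are labeled 1..n, as in the example in the text, so the entry
    A[i-1][j-1] corresponds to the element A_ij.
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] += 1          # count parallel edges if they occur
        if not directed:
            A[j - 1][i - 1] += 1      # an undirected graph gives a symmetric A
    return A

# The example from the text: n = 5, and vertex 5 is disconnected.
edges = [(1, 2), (3, 4), (2, 4), (4, 1), (2, 3)]
for row in adjacency_matrix(5, edges):
    print(row)
```

Printing the rows shows a symmetric binary matrix with an all-zero fifth row and column, reflecting the disconnected vertex; passing directed=True would fill in only the (i, j) entries and generally produce an asymmetric matrix.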

Types of Link/Graph Mining Techniques
The discovery and analysis of algorithms for extracting knowledge from networks are ongoing. Common types of analyses, emphasizing those types often used in practice, are explained below.

Path analysis: A path p in a graph is a sequence of vertices p = (v1, v2, . . ., vm), vi ∈ V, such that each consecutive pair vi, vj of vertices in p is matched by an edge of the form (vj, vi) (if the network is undirected) or (vi, vj) (if the network is directed or undirected). If one were to draw a graph graphically, a path is any sequence of movements along the edges of the network that brings you from one vertex to another. Any path is valid, even ones that have loops or cross the same vertex many times. Paths that do not intersect with themselves (i.e., vi does not equal vj for any vi, vj ∈ p) are self-avoiding. The length of a path is defined by the total number of edges along it. A geodesic path between vertices i and j is a minimum-length path of size k where p1 = i and pk = j. A breadth-first search starting from node d, which iterates over all paths of length 1, and then 2 and 3, and so on up to the largest path that originates at d, is one way to compute geodesic paths.

Network interactions: Whereas path analysis considers the global structure of a graph, the interactions among nodes are a concept related to subgraphs or microstructures. Microstructural measures consider a single node, members of its nth degree neighborhood (the set of nodes no more than n hops from it), and the collection of interactions that run between them. If macro-measures study an entire system as a whole (the "forest"), micro-measures such as interactions try to get at the heart of the individual conditions that cause nodes to bind together locally (the "trees"). Three popular features for microstructural analysis are reciprocity, transitivity, and balance.

Reciprocity measures the degree to which two nodes are mutually connected to each other in a directed graph. In other words, if one observes that a node A connects to B, what is the chance that B will also connect to A? The term reciprocity comes from the field of social network analysis, which describes a particular set of link/graph mining techniques designed to operate over graphs where nodes represent people and edges represent the social relationships among them. For example, if A does a favor for B, will B also do a favor for A? If A sends a friend request to B on an online social system, will B reply? On the World Wide Web, if website A has a hyperlink to B, will B link to A?

Transitivity refers to the degree to which two nodes in a network have a mutual connection in common. In other words, if there is an edge between nodes A and B and from B to C, graphs that are highly transitive indicate a tendency for an edge to also exist between A and C. In the context of social network analysis, transitivity carries an intuitive interpretation based on the old adage "a friend of my friend is also my friend." Transitivity is an important measure in other contexts, as well. For example, in a graph where edges correspond to paths of energy as in a power grid, highly transitive graphs correspond to more efficient systems compared to less transitive ones: rather than having energy take the path A to B to C, a transitive relation would allow a transmission from A to C directly. The transitivity of a graph is measured by counting the total number of closed triangles in the graph (i.e., counting all subgraphs that are complete graphs of three nodes) multiplied by three and divided by the total number of connected triples in the graph (e.g., all sets of three vertices A, B, and C where at least the edges (A,B) and (B,C) exist).

Balance is defined for networks where edges carry a binary variable that, without loss of generality, is either "positive" (i.e., a "+," "1," "Yes," "True," etc.) or "negative" (i.e., a "−," "0," "No," "False," etc.). Vertices incident to positive edges are harmonious or non-conflicting entities in a system, whereas vertices incident to negative edges may be competitive or introduce a tension in the system. Subgraphs over three nodes that are complete are balanced or imbalanced depending on the assignment of + and − labels to the edges of the triangle as follows:

• Three positive: Balanced. All edges are "positive" and in harmony with each other.
• One positive, two negative: Balanced. In this triangle, two nodes exhibit a harmony, and both are in conflict with the same other. The state of this triangle is "balanced" in the sense that every node is either in harmony or in conflict with all others in kind.
• Two positive, one negative: Imbalanced. In this triangle, node A is harmonious with B, and B is harmonious with C, yet A and C are in conflict. This is an imbalanced disagreement since, if A does not conflict with B, and B does not conflict with C, one would expect A to also not conflict with C. For example, in a social context where positive means friend and negative means enemy, B can fall into a conflicting situation when friends A and C disagree.
• Three negative: Imbalanced. In this triangle, all vertices are in conflict with one another. This is a dangerous scenario in systems of almost any context. For example, in a dataset of nations, mutual disagreements among three states have consequences for the world community. In a dataset of computer network components, three routers that are interconnected but in "conflict" (e.g., a down connection or a disagreement among routing tables) may lead to a system outage.

Datasets drawn from social processes always tend toward balanced states because people do not like tension or conflict. It is thus interesting to use link/graph mining to study social systems where balance may actually not hold. If a graph where most triangles are not balanced comes from a social system, one may surmise that there exist latent factors pushing the system toward imbalanced states. A labeled complete graph is balanced if every one of its triangles is balanced.
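The triangle-based transitivity measure defined above (three times the number of closed triangles divided by the number of connected triples) lends itself to a direct computation. The sketch below is a simplified illustration in Python, not production code, operating on an undirected adjacency dictionary.

```python
from itertools import combinations

def transitivity(g):
    """g maps each vertex to the set of its neighbors (undirected)."""
    closed = 0   # connected triples whose end points are also joined by an edge
    triples = 0  # connected triples: paths a-b-c centered on some vertex b
    for b in g:
        for a, c in combinations(g[b], 2):
            triples += 1
            if c in g[a]:
                closed += 1          # a-b-c closes into a triangle
    # Each distinct triangle is found once per vertex, i.e., three times in
    # total, so 'closed' already equals 3 x (number of distinct triangles).
    return closed / triples if triples else 0.0

example = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
print(transitivity(example))   # 3 closed out of 5 connected triples -> 0.6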

Quantifying node importance: The importance of a node is related to its ability to reach out or connect to other nodes. A node may also be important if it carries a strong degree of "flow," that is, if the values of relationships connected to it are very high (so that it acts as a strong conduit for the passage of information). Nodes may be important if they are vital to maintaining network connectivity, so that if an important node was removed, the graph may suddenly fragment or become disconnected. Importance may be measured recursively: a node is important if it is connected to other nodes that themselves are important. For example, people who work in the United States White House or serve as Senior Aides to the President are powerful people, not necessarily because of their job title but because they have a direct and strong relationship with the Commander in Chief. Importance is measured by calculating the centrality of a node in a graph. Different centrality measures that encode different interpretations of node importance exist and should thus be selected according to the analysis at hand. Degree centrality defines importance as being proportional to the number of connections a node has. Closeness centrality defines importance as having a small average distance to all other nodes in the graph. Betweenness centrality defines importance as being part of as many shortest paths in the graph between other pairs of nodes as possible. Eigenvector centrality defines importance as being connected not only to many other nodes but also to many other nodes that are themselves important.

Graph partitioning: In the same way that clusters of datums in a dataset correspond to groups of points that are similar, interesting, or signify some other demarcation, vertices in graphs may also be divided into groups that correspond to a common affiliation, property, or connectivity structure. Graph partitioning takes as an input the number and size of the groups and then searches for the "best" partitioning under these constraints. Community detection algorithms are similar to graph partitioning methods except that they do not require the number and size of groups to be specified a priori. But this is not necessarily a disadvantage to graph partitioning methods; if a graph miner understands the domain from where the graph came well, or if for her application she requires a partitioning into exactly k groups, graph partitioning methods should be used.
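For readers who want to see what detection without a fixed number of groups can look like, the following Python sketch implements a simple label-propagation heuristic. It is an illustration of the community-detection idea contrasted with fixed-k partitioning above, not one of the specific algorithms named in this entry, and the example graph is a made-up toy.

```python
import random

def label_propagation(g, max_rounds=100, seed=0):
    """Tiny community-detection heuristic: every vertex repeatedly adopts the
    label most common among its neighbors until nothing changes.
    g maps each vertex to a set of neighbors (undirected)."""
    rng = random.Random(seed)
    labels = {v: v for v in g}            # start with one community per vertex
    nodes = list(g)
    for _ in range(max_rounds):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not g[v]:
                continue
            counts = {}
            for u in g[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts, key=counts.get)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels

toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(label_propagation(toy))   # vertices 1-3 and 4-6 end up with shared labels
```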

Conclusion

As systems that our society relies on become ever more complex, and as technological advances continue to help us capture the structure of this complexity at high definition, link/graph mining methods will continue to rise in prevalence. As the primary means to understand and extract knowledge from complex systems, link/graph mining methods need to be included in the toolkit of any big data analyst.

Cross-References

▶ Computer Science
▶ Computational Social Sciences
▶ Graph-Theoretic Computations/Graph Databases
▶ Mathematics
▶ Statistics

Further Reading

Cook, D. J., & Holder, L. B. (2006). Mining graph data. Wiley.
Getoor, L., & Diehl, C. P. (2005). Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2), 3–12.
Lewis, T. G. (2011). Network science: Theory and applications. Wiley.
Newman, M. (2010). Networks: An introduction. New York: Oxford University Press.
Philip, S. Y., Han, J., & Faloutsos, C. (2010). Link mining: Models, algorithms, and applications. Berlin: Springer.

LinkedIn

Jennifer J. Summary-Smith
Florida SouthWestern State College, Fort Myers, FL, USA
Culver-Stockton College, Canton, MO, USA

According to its website, LinkedIn is the largest professional network in the world servicing over 300 million members in over 200 territories and countries. Their mission statement is to "connect the world's professionals to make them more productive and successful. When you join LinkedIn, you get access to people, jobs, news, updates, and insights that help you be great at what you do." Through its online service, LinkedIn earns around $473.2 million from premium subscriptions, marketing solutions, and talent solutions. It offers free and premium memberships allowing people to network, obtain knowledge, and locate potential job opportunities. The greatest asset of LinkedIn is its data, making a significant impact in the job industry.

Company Information

Cofounder Reid Hoffman conceptualized the company in his living room in 2002, launching LinkedIn on May 5, 2003. Hoffman, a Stanford graduate, became one of PayPal's earliest executives. After PayPal was sold to eBay, he cofounded LinkedIn. The company had one million members by 2004. Today, the company is run by chief executive Jeff Weiner, who is also the former CEO of Yahoo! Inc. LinkedIn's headquarters are located in Mountain View, California, with US offices in Chicago, Los Angeles, New York, Omaha, and San Francisco. LinkedIn also has international offices in 21 locations, and its online content is available in 23 languages. LinkedIn currently employs 5,400 full-time employees with offices in 27 cities globally. LinkedIn states that professionals are signing up to join the service at the rate of two new members per second, with 67% of its membership located outside of the United States. The fastest growing demographic using LinkedIn is students and recent college graduates, accounting for around 39 million users. LinkedIn's corporate talent solutions product lines and its memberships include all executives from the 2013 Fortune 500 companies and 89 Fortune 100 companies. In 2012, its members conducted over 5.7 billion professionally oriented searches, with three million companies utilizing LinkedIn company pages.

As noted on cofounder Reid Hoffman's LinkedIn account, a person's network is how one stays competitive as a professional, keeping up-to-date on one's industry. LinkedIn provides a space where professionals learn about key trends, information, and transformations of their industry. It provides opportunities for people to find jobs, clients, and other business connections.

Relevance of Data

MIT Sloan Management Review contributing editor Renee Boucher Ferguson interviewed LinkedIn's director of relevance science, Deepak Agarwal, who states that relevance science at LinkedIn plays the role of improving the relevancy of its products by extracting information from LinkedIn data. In other words, LinkedIn provides recommendations using its data to predict user responses to different items.

To achieve this difficult task, LinkedIn has relevance scientists who provide an interdisciplinary approach with backgrounds in computer science, economics, information retrieval, machine learning, optimization, software engineering, and statistics. Relevance scientists work to improve the relevancy of LinkedIn's products. According to Deepak Agarwal, LinkedIn relevance scientists significantly enhance products such as advertising, job recommendations, news, the LinkedIn feed, people recommendations, and much more. He further points out that most of the company's products are based upon its use of data.

Impact on the Recruiting Industry

As it states on LinkedIn's website, the company's free membership allows its members the opportunity to upload resumes and/or curricula vitae, join groups, follow companies, establish connections, view and/or search for jobs, endorse connections, and update profiles. It also suggests to its members several people that they may know, based on their connections. LinkedIn's premium service provides members with additional benefits, allowing access to hiring managers and recruiters. Members can send personalized messages to any person on LinkedIn. Additionally, members can also find out who has viewed their profile, detailing how others found them for up to 90 days. There are four premium search filters, permitting premium members to find decision makers at target companies. The membership also provides individuals the opportunity to get noticed by potential employers. When one applies as a featured applicant, it raises his or her rank to the top of the application list. OpenLink is a network that also lets any member on LinkedIn view another member's full profile to make a connection.

The premium LinkedIn membership assists with drawing attention to members' profiles, adding an optional premium or job seeker badge. When viewing the job listings, members have the option to sort by salary range, comparing salary estimates for all jobs in the United States, Australia, Canada, and the United Kingdom. LinkedIn's premium membership also allows

users to see more profile data in one’s extended company has made a $27 billion impact on the
network, including first-, second-, and third- recruiting industry. Jeff Weiner also states that
degree connections. A member’s first-level con- every time LinkedIn expands its sales team for
nections are people that have either received an hiring solutions, the payoff increases “off the
invitation from the member or the member sent an charts.” He also talks about how sales keep rising
invitation to connect. Second-level connections and its customers are spreading enthusiasm for
are people who are connected to first-level con- LinkedIn’s products. Jeff Weiner further states
nections but are not connected to the actual mem- that once sales are made, LinkedIn customers are
ber. Third-level connections are only connected to loyal, reoccurring, and low maintenance. This
the second-level members. Moreover, members trend is reflected in current stock market prices in
can receive advice and support from a private the job-hunting sector. George Anders writes that
group of LinkedIn experts, assisting with job older search firm companies, such as Heidrick &
searches. Struggles that recruits candidates the old fashion
In a recent article by George Anders, he notes way, have slumped 67%. Monster Worldwide has
the impact that LinkedIn has made on the experienced a more dramatic drop, tumbling 81%.
recruiting industry. He spoke with the chief exec- As noted on its website, “LinkedIn operates the
utive of LinkedIn, Jeff Weiner, who brushes off world’s largest professional network on the Inter-
comparisons between LinkedIn and Facebook. net.” This company has made billions of dollars,
While both companies connect a vast amount of hosting a massive amount of data with a member-
people via the Internet, each social media platform ship of 300 million people worldwide. The social
occupies a different niche within the social net- network for professionals is growing at a fast pace
working marketplace. Facebook generates 85% of under the tenure of Chief Executive Jeff Weiner.
its revenue from advertisements, whereas In a July 2014 article by David Gelles, he reports L
LinkedIn focuses its efforts on monetizing mem- that LinkedIn has made its second acquisition in
bers’ information. Furthermore, LinkedIn’s the last several weeks buying Bizo for $175 mil-
mobile media experience is growing significantly, lion dollars. A week prior, it purchased Newsle,
changing the face of job searching, career net- which is a service that combs the web for articles
working, and online profiles. George Anders that are relevant to members. It quickly notifies a
also interviewed the National Public Radio head person whenever friends, family members,
of talent acquisition, Lars Schmidt, who notes that coworkers, and so forth are mentioned online in
recruiters no longer remain chiefly in their offices the news, blogs, and/or articles.
but are becoming more externally focused. The LinkedIn continues to make great strides by
days of exchanging business cards is quickly leveraging its large data archives, to carve out a
being replaced by smartphone applications such niche in the social media sector specifically
as CardMunch. CardMunch is an iPhone app that targeting the needs of online professionals. It is
captures business card photos, transferring them evident that, through the use of big data, LinkedIn
into digital contacts. In 2011, LinkedIn bought the is changing and significantly influencing the job-
company, retooling it to pull up existing LinkedIn hunting process. This company provides a service
profiles from each card improving the ability of that allows its member to connect and network
members to make connections. A significant part with professionals. LinkedIn is the world’s largest
of LinkedIn’s success comes from its dedication professional network, proving to be an innovator
to selling services to people who purchase talent. in the employment service industry.
The chief executive of LinkedIn, Jeff Weiner,
has created an intense sales-focused culture. The
company celebrates new account wins during its Cross-References
biweekly meetings. According to George Anders,
LinkedIn has doubled the number of sales ▶ Facebook
employees in the past year. In addition, the ▶ Information Society

▶ Online Identity
▶ Social Media

Further Reading

Anders, G. How LinkedIn has turned your resume into a cash machine. http://www.forbes.com/sites/georgeanders/2012/06/27/how-linkedin-strategy/. Accessed July 2014.
Boucher Ferguson, R. The relevance of data: Behind the scenes at LinkedIn. http://sloanreview.mit.edu/article/the-relevance-of-data-going-behind-the-scenes-at-linkedin/. Accessed July 2014.
Gelles, D. LinkedIn makes another deal, buying Bizo. http://dealbook.nytimes.com/2014/07/22/linkedin-does-another-deal-buying-bizo/?_php=true&_type=blogs&_r=2. Accessed July 2014.
LinkedIn. https://www.linkedin.com. Accessed July 2014.

Machine Intelligence

▶ Artificial Intelligence

Machine Learning

Ashrf Althbiti and Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Introduction

Machine learning (ML) is a fast-evolving scientific field that effectively copes with the big data explosion and forms a core infrastructure for artificial intelligence and data science. ML bridges the research fields of computer science and statistics and builds computational algorithms and statistical model-based theories from those fields of study. These algorithms and models are utilized by automated systems and computer applications to perform specific tasks, with the desire of high prediction performance and generalization capabilities (Jordan and Mitchell 2015). Sometimes, ML is also referred to as predictive analytics or statistical learning. The general workflow of an ML system is that it receives inputs (aka training sets), trains predictive models, performs specific prediction tasks, and eventually generates outputs. Then, the ML system evaluates the performance of predictive models and optimizes the model parameters in order to obtain better predictions. In practice, ML systems also learn from prior experiences and generate solutions for given problems with specific requirements.

Machine Learning Approaches

Jordan and Mitchell (2015) discussed that the main paradigms of ML methods are (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. ML approaches are categorized based on two criteria: (1) the data type of a dependent variable and (2) the availability of labels of a dependent variable. The former criterion is categorized into two classes: (1) continuous and (2) discrete. The latter criterion is utilized to determine the type of ML algorithm. If a dependent variable is given and labeled, it would be a supervised learning approach. Otherwise, if a dependent variable is not given or unlabeled, it would be an unsupervised learning approach.

Supervised learning algorithms are often utilized for prediction tasks and for building a mathematical model of a set of data that includes both inputs and desired outputs. These algorithms learn a prediction model that approximates a function f(x) to predict an output y (Hastie et al. 2009). For instance, a fraud classifier of credit-card transactions, a spam classifier of emails, and medical diagnosis systems (e.g., breast cancer diagnosis) each
represents a function of approximation that super- Machine Learning, Table 1 Classification of machine
vised learning algorithms perform. learning models
Unsupervised learning algorithms are often Data are labeled Data are unlabeled
used to study and analyze a dataset and learn a (supervised (unsupervised
learning) learning)
model that finds and discovers a useful structure
Continuous Regression Dimensionality
of the inputs without the need of labeled outputs. reduction
They are also used to address two major problems Discrete Classification Clustering
that researchers encounter in the ML workflow:
(1) data sparsity where missing values can affect a
model’s accuracy and performance and (2) curse
of dimensionality which means data is organized
y and one or more independent variables X. For-
in high-dimensional spaces (e.g., thousands of
dimensions). mula (1) is a linear model for several explanatory
variables represented by a hyperplane in higher
Reinforcement learning forms a major ML
dimensions:
paradigm where it sits at the crossroad of super-
vised and unsupervised learning (Mnih et al.
2015). For reinforcement learning, the avail- yb ¼ w½0  x½0 þ w½1  x½1 þ . . . þ w½p
ability of information in training examples is  x ½ p þ b ð1Þ
intermediate between supervised and
unsupervised learning. In another words, the where x[0] to x[p] signify features of a single
training examples provide indications about an instance and w and b are learned parameters by
output inferred by the correctness of an action. minimizing the mean squared error between pre-
Yet, if an action is not correct, the challenge of dicted values yb and true values of y on the training
finding a correct action endures (Jordan and set. Linear regression also forms other models
Mitchell 2015). such as ridge and LASSO. Moreover, a regression
Other ML approaches emerge when researchers model, namely, logistic regression, can be applied
develop combinations across the three main para- for classification where target values are trans-
digms, such as semi-supervised learning, discrim- formed into two classes as the prediction formula
inative training, active learning, and causal (2) shows:
modeling (Jordan and Mitchell 2015).
yb ¼ w½0  x½0 þ w½1  x½1 þ . . . þ w½p  x½p þ b > 0
ð2Þ
Machine Learning Models
where the threshold of a predicted value is zero.
As a result of utilizing ML algorithms on a train- Thus, if a predicted value is greater than zero, a
ing dataset, a model is learned to make predictions predicted class is +1; otherwise it is 1.
on new datasets. Table 1 lists different ML algo-
rithms based on the availability of labeled output K-Nearest Neighbors (KNNs)
variables and their data types. Table 2 gives a K-nearest neighbor models are utilized to make a
longer list of the state-of-the-art models in ML prediction for a new single point (aka, instance).
algorithms (Amatriain et al. 2011). It is utilized for classification and regression
problems and is known as lazy learner because
Classification and Regression it needs to memorize the training sets to make a
The following algorithms are briefly introduced. new prediction (aka, instance-based learning).
This model makes a prediction for a new
Linear Regression instance based on the values of the nearest
Linear regression is a statistical model for model- neighbors. It finds those nearest neighbors by
ing the relationship between a dependent variable calculating similarities and distances between a

Machine Learning, Table 2 Different categories of ML models

ML paradigm   Task                Type of algorithm              Model
Supervised    Prediction          Regression and classification  Linear regression; ridge regression; least absolute shrinkage and selection operator (LASSO); k-nearest neighbors for regression; k-nearest neighbors for classification; logistic regression; one-vs.-rest linear model for multi-label classification; decision trees (DSs); Bayesian classifiers; support vector machines (SVMs); artificial neural networks (ANNs)
Unsupervised  Feature extraction  Dimensionality reduction       Principal component analysis (PCA); singular value decomposition (SVD)
              Description         Clustering                     k-means; density-based spatial clustering of applications with noise (DBSCAN); message passing; hierarchical
                                  Association rule mining        A priori

K-Nearest Neighbors (KNNs)
K-nearest neighbor models are utilized to make a prediction for a new single point (aka instance). They are utilized for classification and regression problems and are known as lazy learners because they need to memorize the training sets to make a new prediction (aka instance-based learning). This model makes a prediction for a new instance based on the values of the nearest neighbors. It finds those nearest neighbors by calculating similarities and distances between a single point and its neighbors. The similarity is calculated using Pearson correlation, cosine similarity, Euclidean distance, or other similarity measures.

Decision Trees (DSs)
Decision trees classify a target variable in the form of a tree structure. The nodes of a tree can be (1) decision nodes, where their values are tested to determine to which branch a subtree moves, or (2) leaf nodes, where a class of a data point is determined. Decision nodes must be carefully selected to enhance the accuracy of prediction. DSs can be used in regression and classification applications.

Bayesian Classifiers
Bayesian classifiers are a probabilistic framework to address classification and regression needs. They are based on applying Bayes' theorem and the definition of conditional probability. The main assumption of applying Bayes' theorem is that features should maintain strong (naïve) independence.

Support Vector Machines (SVMs)
Support vector machines are classifiers that strive to separate data points by finding linear hyperplanes that maximize margins between data points in an input space. It is noteworthy that SVMs can be applied to address regression and classification needs. The support vectors are data points that fit on the maximized margins.

Artificial Neural Networks (ANNs)
Artificial neural networks are models inspired by the biological neural networks of the brain. An ANN develops a network of interconnected neurons which work together to perform prediction tasks. Numerical weights are assigned to the links between nodes and are tuned based on experience. The simplest representation of the network consists of three main layers: (1) an input layer, (2) a hidden layer, and (3) an output layer. Handwriting recognition is a typical application where ANNs are used.
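To make the three-layer picture concrete, the following sketch runs a single forward pass through a tiny one-hidden-layer network in Python. The weights are arbitrary illustrative numbers, and the training step (the tuning of weights "based on experience") is deliberately omitted.

```python
import math

def sigmoid(z):
    """Smooth squashing function used for the hidden units."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Input layer -> hidden layer (sigmoid units) -> output layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return output

# Two inputs, three hidden units, one output; all weights are made up.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.5, 0.7]]
b2 = [0.2]
print(forward([0.9, 0.3], W1, b1, W2, b2))
```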

space. Hence, an optimal solution is to reduce the predictions on test datasets. On the other hand,
number of dimensions, while the maximum if a model is not sufficiently trained on a training
amount of information is retained. PCA and dataset, this model most likely will do badly even
SVD are the main ML algorithms that offer a on a training dataset. Hence, the goal is to select a
solution to the issue of dimensionality. model that maintains an optimal complexity of
Clustering
Clustering is a popular ML technique that falls in the unsupervised learning category. It groups data points based on their similarity, so that the data points that fall in one cluster or class differ from the data points in another cluster. A common clustering technique is k-means, where k indicates the total number of clusters. The k-means clustering algorithm randomly selects k data points and treats them as the initial centroids of the clusters; the remaining data points are then assigned to the best (nearest) centroid. A process of recomputing the centroids and reassigning points is repeated until there are no more changes in the set of k centroids. Alternative clustering algorithms to k-means include DBSCAN, message-passing clustering, hierarchical clustering, etc.
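The procedure just described can be sketched as follows, again assuming scikit-learn; the two synthetic point clouds and the choice k = 2 are invented for illustration.

    # k-means: centroids are re-estimated and points reassigned until stable.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # first synthetic group
                   rng.normal(5, 1, size=(100, 2))])  # second synthetic group

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)

    print(kmeans.cluster_centers_)   # final centroid of each cluster
    print(labels[:10])               # cluster assignment of the first points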
Association Rule Mining
Association rule mining algorithms are mostly used in marketing when predicting the co-occurrence of items in a transaction. They are widely utilized to identify co-occurrence relationship patterns in large-scale data (e.g., items or products).

Model Fitness and Evaluation

The main objective of adopting ML algorithms on a training dataset is to generalize a learned model so that it makes accurate predictions on new data points. Hence, if a model makes accurate predictions on new data points, the model has generalized from the training dataset to test datasets. However, extensive training of a model increases its complexity, and the overfitting problem may appear. The overfitting problem means that a model memorizes the training dataset and performs well on it, but is not able to make accurate predictions on test datasets. On the other hand, if a model is not sufficiently trained on a training dataset, it will most likely do badly even on the training dataset. Hence, the goal is to select a model that maintains an optimal complexity of training.

Learning a model requires a set of data points as inputs to train the model, a set of data points to tune and optimize the model's parameters, and a set of data points to evaluate its performance. Therefore, a dataset is divided into three sets, namely, the training set, the evaluation set, and the testing set. The way of dividing these sets depends on the algorithm developers, and there are different techniques to be followed when dividing datasets. One basic technique is to utilize a 90/10 rule of thumb, which means that 90% of a dataset is used to learn a model and the other 10% is used to evaluate and adjust it. Other methods for dataset splitting include k-fold cross-validation and hold-out cross-validation (Picard and Berk 2010). Furthermore, there are other sophisticated statistical evaluation techniques applicable to different types of datasets, such as bootstrapping methods, which depend on random sampling with replacement, or grid search.

A wide range of evaluation criteria can be used for evaluating ML algorithms. For example, accuracy is an extensively utilized property to gauge the performance of model predictions. Typical examples of accuracy measurements are R² and the root-mean-square error (RMSE). Other metrics used to measure prediction accuracy include precision, recall, support, the F-score, etc.
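The splitting and evaluation ideas above can be illustrated with a short, hedged sketch (assumed scikit-learn usage; the 90/10 split and the metrics mirror the text rather than any prescribed procedure).

    # Hold-out split, R^2 and RMSE on the held-out data, and k-fold cross-validation.
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import mean_squared_error, r2_score

    X, y = load_diabetes(return_X_y=True)

    # 90/10 rule of thumb: 90% for learning, 10% held out for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)

    print("R^2 :", r2_score(y_test, pred))
    print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)

    # k-fold cross-validation as an alternative to a single hold-out split
    print("5-fold R^2:", cross_val_score(LinearRegression(), X, y, cv=5).mean())

For classifiers, precision, recall, and the F-score are computed analogously from the held-out predictions.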
Applications

Various applications have utilized ML techniques to automate inference and decision-making under uncertainty, including, but not limited to, health care, education, aerospace, neuroscience, genetics and genomics, cloud computing, business, e-commerce, finance, and supply chains. Within artificial intelligence (AI), the dramatic emergence of ML has made it the method of choice for building systems for computer vision, speech recognition, facial
recognition, natural language processing, and other applications (Jordan and Mitchell 2015).

Conclusion

ML provides models that learn automatically through experience. The explosion of big data is a main motivation behind the evolution of ML approaches. A survey of the current state of ML algorithms is introduced in a coherent fashion in this entry to simplify its rich and detailed content. The discussion has also been extended to cover topics that demonstrate how machine learning algorithms can alleviate the issue of dimensionality and offer solutions to automate prediction and detection. Model evaluation and machine learning applications are also briefly introduced.

Cross-References

▶ Financial Data and Trend Prediction

Further Reading

Amatriain, X., Jaimes, A., Oliver, N., & Pujol, J. M. (2011). Data mining methods for recommender systems. In Recommender systems handbook (pp. 39–71). Boston: Springer.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science. https://doi.org/10.1126/science.aaa8415.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
Picard, R. R., & Berk, K. N. (2010). Data splitting. American Statistician, 44(2), 140–147. https://doi.org/10.1080/00031305.1990.10475704.

Maritime Data

▶ Maritime Transport

Maritime Shipping

▶ Maritime Transport

Maritime Transport

Camilla B. Bosanquet
Schar School of Policy and Government, George Mason University, Arlington, VA, USA

Synonyms

Maritime Data; Maritime Shipping

Ancient Origins of Maritime Transport Data

Maritime transport data collection and dissemination have their roots in antiquity. Mercantile enterprise first necessitated such information, the imperative for which deepened over time with the emergence of passenger movement by sea, coastal defense, and naval operations. Early data took a variety of forms, e.g., Ancient Egyptian tombs and papyri recorded maritime commerce activities as early as 1500 BCE; Zenon of Kaunos, private secretary to the finance minister of Ptolemy II, documented vessel cargo manifests in 250 BCE; and “sailing directions” dating to the third century BCE provided ship captains of the Roman Empire with critical weather, ship routing, and harbor guidance for Indian coastal markets (Casson 1954).

While modern maritime shipping has benefitted from remarkable technological advancements during the intervening millennia, the business of maritime transport relies upon timeless concepts, even as the capture and conveyance of relevant data has jumped from paper to digital form. While global positioning satellites replaced the nautical sextant and compass, electronic logbooks supplanted handwritten voyage diaries, and touch-screen navigation displaced
plotting routes and position fixes in pencil on paper charts, surface vessel operation and management still bear a strong resemblance to activities of yesteryear. It is similarly the case with the commercial movement of goods and people between ports; passenger and cargo manifests serve the same essential needs as those of past ages.

Maritime Transport and Big Data

What is new, in modernity, is the immeasurable quantity of data related to maritime transport and the lightning speed at which such data can proliferate and travel. Ship owners, fleet managers, cargo shipping agents and firms, vessel operators and engineers, government officials, business logisticians, port management, economists, market traders, commodity and energy brokers, financial investors, insurance firms, maritime organizations, vessel traffic service watch standers, classification societies, admiralty lawyers, navies, coast guards, and others all benefit from access to maritime information.

Maritime data can, inter alia, be used by the aforementioned actors in myriad endeavors to:

• Ascertain and predict sea and meteorological conditions
• Obtain the global locations of vessels in real time
• Make geospatial transit calculations based upon port locations
• Optimize ship routing given weather, port options, available berthing, fuel costs, etc.
• Monitor the status of shipboard engineering plant machinery
• Translate hydrographic survey data to undersea cartographic charts
• Track shipments of containerized, bulk, break-bulk, liquid, and roll-on/roll-off cargoes
• Document endangered whale and sea turtle sightings for ship strike avoidance
• Calculate search areas and human survival likelihood given currents, winds, and temperatures
• Record transit temperatures of refrigerated perishables (e.g., food, flowers, medicine)
• Estimate tsunami wave probabilities resulting from seismic activity
• Enable law enforcement to intercept actors engaged in illicit activity at sea
• Enforce exclusive economic zones, marine sanctuaries, and closed fishing grounds
• Communicate relocated harbor buoy and navigation hazard information promptly
• Contribute to vessels' timely inspection, certification, maintenance, and overhaul
• Study economic flows of goods and people by sea
• Facilitate coastal incident response (e.g., oil spill mitigation, plane crash recovery)
• Develop strategies to avoid, defend against, counter, and withstand maritime piracy
• Evaluate the performance of vessels, equipment, captains, crews, ports, etc.
• Identify vessels and operators violating cabotage laws or embargoes
• Monitor adherence to flag state and port state control requirements
• Determine financial and insurance risks associated with ships and vessel fleets
• Inform strategies for coastal defense, military sealift, and naval projection of power

As of 2005, the United Nations' International Maritime Organization mandated the worldwide operation of automatic identification systems (AIS) onboard all passenger vessels and certain ships specified by tonnage and/or purpose. Accomplished under the International Convention for the Safety of Life at Sea, the new regulation required the use of each vessel's AIS transponder to transmit ship identification, vessel type, and navigational information to other ships, shore-based receivers, and aircraft, and to receive such data from other ships. Geosynchronous satellites can now collect AIS transponder data, enabling the near-instantaneous monitoring of
ships that have exceeded terrestrial receivers’ marine bunker fueling, anti-piracy voyage plan-
tracking ranges. ning, ship charter pricing, and freight fees. Afore-
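As a purely hypothetical illustration of how decoded AIS position reports might support one of the uses listed above (here, monitoring a closed area), the following sketch assumes the reports have already been decoded into a table; the vessel identifiers, coordinates, and bounding box are invented, and a real AIS feed would require a dedicated decoder.

    # Hypothetical AIS position reports, already decoded into tabular form.
    import pandas as pd

    reports = pd.DataFrame({
        "mmsi":      [366999001, 366999001, 219000002],   # made-up vessel identifiers
        "timestamp": pd.to_datetime(["2021-06-01 10:00", "2021-06-01 10:06", "2021-06-01 10:03"]),
        "lat":       [48.45, 48.52, 48.10],
        "lon":       [-124.90, -124.80, -125.40],
    })

    # Hypothetical closed fishing ground defined by a simple bounding box
    closed = reports[reports.lat.between(48.4, 48.6) & reports.lon.between(-125.0, -124.7)]
    print(closed[["mmsi", "timestamp"]])   # reports transmitted from inside the closed area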
While a plethora of global maritime intelligence providers boast AIS-informed vessel tracking and associated analytics, Lloyd's of London arguably offers the most comprehensive maritime data, having been collected from the greatest number of sources. Beyond providing vessel tracking, the Lloyd's List Intelligence Seasearcher online user interface provides clients with searchable bills of lading, individual container tracking, vessel sanctions information, registered owner details, shipping companies' credit ratings, identification of dry bulk vessels by products carried, liquid natural gas and oil tanker movements and port calls, multilevel vessel and fleet ownership structures, detailed port characteristics, vessel risk detection, assessment of suspicious activity (e.g., breaking sanctions, fishing illegally, illicit trafficking) during AIS gap periods, incident notifications (e.g., cargo seizures, ship casualties, crew arrests, vessel detentions), and more. At the macro-level, such data and analytics enable clients to forecast market trends, manage risk, evaluate global trade, develop strategy, design unique algorithmic applications, and conduct an unlimited variety of tailored analyses.

The Future of Maritime Transport Data

Increased port and vessel automation, alongside further development of autonomous vessel technologies, should enhance the quality of maritime transport data. Prolific use of sensors throughout the maritime transport industry will strengthen data on weather, vessels, ports, cargoes, human operators, and the performance of automated machines. AI and machine learning applications should improve terrestrial and satellite imagery analyses, providing stakeholders with a greater understanding of vessels' activities and vulnerabilities. More timely data should facilitate financial gains or savings as executives and vessel captains make informed decisions concerning efficient ship routing, maintenance availabilities, marine bunker fueling, anti-piracy voyage planning, ship charter pricing, and freight fees. Aforementioned public-sector entities will likewise experience benefits, e.g., a reduction in marine casualties, fewer marine environmental incidents, and greater intelligence concerning transnational organized crime activity in the maritime domain. A feasible scenario emerges from this vision: a virtuous feedback loop in which improved maritime technologies refine maritime transport data and multiply end-user benefits, leading to greater investment in maritime technologies, further refining data and boosting benefits. In many respects, big data analytics of maritime transport information is the latest manifestation of an urge that once prompted the creation of “sailing directions” to India during the days of the Roman Empire. It is a human desire to collect, analyze, and disseminate critical information to facilitate the safe and profitable transport of people and goods over the sea.

Cross-References

▶ Business Intelligence
▶ Intelligent Transportation Systems (ITS)
▶ Spatiotemporal Analytics

Further Reading

Casson, L. (1954). Trade in the ancient world. Scientific American, 191(5), 98–104.
Fruth, M., & Teuteberg, F. (2017). Digitization in maritime logistics – What is there and what is missing? Cogent Business & Management, 4. https://doi.org/10.1080/23311975.2017.1411066.
Jović, M., Tijan, E., Marx, R., & Gebhard, B. (2019). Big data management in maritime transport. Pomorski Zbornik, 57(1), 123–141. https://doi.org/10.18048/2019.57.09.
Notteboom, T., & Haralambides, H. (2020). Port management and governance in a post-COVID-19 era: Quo vadis? Maritime Economics & Logistics, 22(3), 329–352. https://doi.org/10.1057/s41278-020-00162-7.
Stopford, M. (2009). Maritime economics. London: Routledge.
Mathematics

Daniele C. Struppa
Donald Bren Presidential Chair in Mathematics, Chapman University, Orange, CA, USA

Introduction

The term big data refers to data sets that are so large and complex as to make traditional data management insufficient. Because of the rapid increase in data gathering, in data storage, as well as in the techniques used to manage data, the notion of big data is somewhat dependent on the capability and technology of the user. Traditionally, the challenge of big data has been described in terms of the 3Vs: high Volume (the size and amount of data to be managed), high Velocity (the speed at which data need to be acquired and handled), and high Variety (the range of data types to be managed). Specifically, “Big Data is the Information asset characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value” (De Mauro et al. 2016). More recently, some authors have added two more dimensions to the notion of big data, by introducing high Variability (inconsistency in the data set) and high Veracity (the quality and truthfulness of the data, which present high variability, thus making their management more delicate).

The Growth of Data

Big data (its collection, management, and manipulation) has become a topic of significant interest because of the rapid growth of available data due to the ease of access to and collection of data. While until a couple of decades ago big data used to appear mostly through scientific data collection, we are now at a point in which each and every digital interaction (through mobile devices, home computers, cameras in public areas, etc.) generates a digital footprint, which quickly creates extremely large amounts of data. One could say that information and digitalization are the fuels that allow data to grow into big data. If the storage of such data is an immediate challenge, even more challenging is to find efficient ways to process such data and, finally, ways to transmit them. Despite these challenges, big data are now considered a fundamental instrument in a variety of scientific fields (e.g., to forecast the path of weather events such as hurricanes, to model climate changes, to simulate the behavior of blood flow when new medical devices such as stents are inserted in arteries, or finally to make predictions on the outcome of a disease through the use of genomics and proteomics assays), as well as in many business applications (data analytics is currently one of the areas of fastest growth, because of its implied success in determining customer preferences, as well as the risk of financial decisions).

The Power of Data and Their Impact on Science and Mathematics

A few authors (Weinberger 2011) have argued that it will be increasingly possible to answer any question of interest, as long as we have enough raw data about the question being posed. This philosophical assumption has been called the microarray paradigm in Napoletani et al. (2014), and it has important implications for the way in which both mathematics and science develop and interact. While in the past scientific theories were seen as offering a theoretical model that would describe reality (think, e.g., of Newtonian mechanics, which explains the behavior of bodies on the basis of three relatively simple laws), the advent of the use of big data seems to herald an era of agnostic science (Napoletani et al. 2014), in which mathematical techniques are used to allow the scientist (or the social scientist) to make predictions as to the outcome of a process, even in the absence of a model that
would explain the behavior of the system at hand. The consequence of this viewpoint is the development of new techniques, originating in the field one could call computational mathematics, whose validity is demonstrated not through the traditional methods of demonstration and proof but rather by their applicability to a given problem.

Mathematical Techniques

Mathematicians have employed a wide array of mathematical techniques to work with large data sets and to use them to make inferences.

One of the most successful theories that allow the utilization of large data sets in a variety of different disciplines, in the spirit of the agnostic science we referred to, goes under the name of statistical learning theory. With this terminology, we mean that particular approach to machine learning that takes its tools from the fields of statistics and functional analysis. In particular, statistical learning theory embraces the approach of supervised learning, namely, learning from a training set of data.

Supervised learning can really be seen as an interpolation problem. Imagine every point in the data set to be an input-output pair (e.g., a genomic structure and a disease): the process of learning then consists in finding a function that interpolates these data and then using it to make predictions when new data are added to the system. While the theory of statistical learning is very developed, among the specific techniques that are used, we will mention clustering techniques such as the affinity propagation method, as described, for example, in Frey and Duek (2007). In this case the idea is to split the data into clusters by passing information locally among the various data points in order to determine the split. Once that is done, the method extracts a best representative from each cluster. The interpolation process is then used on those particular representatives.
Another classical example of a mathematical technique that is employed to study large data sets goes under the name of boosting (Schapire 1990), where a large number of mediocre (slightly better than random) classifiers are combined to provide a much more robust classifier.
Another classical example of mathematical Weinberger, D. (2011). The machine that would predict the
technique that is employed to study large data future. Scientific American, 305, 52–57.
Media

Colin Porlezza
IPMZ - Institute of Mass Communication and Media Research, University of Zurich, Zürich, Switzerland

Synonyms

Computer-assisted reporting; Data journalism; Media ethics

Definition/Introduction

Big data can be understood as “the capacity to search, aggregate and cross-reference large data sets” (Boyd and Crawford 2012, p. 663). The proliferation of large amounts of data concerns the media in at least three different ways. First, large-scale data collections are becoming an important resource for journalism. As a result, practices such as data journalism are increasingly gaining attention in newsrooms and become relevant resources as the data collected and published on the Internet expand and legal frameworks to access public data, such as Freedom of Information Acts, come into effect. Recent success stories of data journalism, such as uncovering the MPs' expenses scandal in the UK or the giant data leak in the case of the Panama Papers, have contributed to further improving the capacity to deal with large amounts of data in newsrooms. Second, big data are not only important in reference to the practice of reporting. They also play a decisive role with regard to what kind of content finally gets published. Many newsrooms are no longer using the judgment of human editors alone to decide what content ends up on their websites; instead they use real-time data analytics generated by the clicks of their users to identify trends, to see how content is performing, and to boost virality and user engagement. Data is also used in order to improve product development in entertainment formats. Social media like Facebook have perfected this technique by using the personal preferences, tastes, and moods of their users to offer personalized content and targeted advertising. This datafication means that social media transform intangible elements such as relationships into a valuable resource or an economic asset on which to build entire business models. Third, datafication and the use of large amounts of data also give rise to risks with regard to ethics, privacy, transparency, and surveillance. Big data can have huge benefits because it allows organizations to personalize and target products and services. But at the same time, it requires clear and transparent information handling governance and data protection. Handling big data increases the risk of paralyzing privacy, because (social) media or internet-based services require a lot of personal information in order to use them. Moreover, analyzing big data entails higher risks of incurring errors, for instance, when it comes to statistical calculations or visualizations of big data.

Big Data in the Media Context

Within media, big data mainly refers to huge amounts of structured (e.g., sales, clicks) or unstructured (e.g., videos, posts, or tweets) data generated, collected, and aggregated by private business activities, governments, public administrations, or online-based organizations such as social media. In addition, the term big data usually includes references to the analysis of huge bulks of data, too. These large-scale data collections are difficult to analyze using traditional software or database techniques and require new methods in order to identify patterns in such a massive and often incomprehensible amount of data. The media ecosystem has therefore developed specialized practices and tools not only to generate big data but also to analyze it in turn. One of these practices to analyze data is called data or data-driven journalism.

Data Journalism
We live in an age of information abundance. One of the biggest challenges for the media industry, and journalism in particular, is to bring order in
this data deluge. It is therefore not surprising that the relationship between big data and journalism is becoming stronger, especially because large amounts of data need new and better tools that are able to provide specific context, to explain the data in a clear way, and to verify the information they contain. Data journalism is thus not entirely different from more classic forms of journalism. However, what makes it somewhat special are the new opportunities given by the combination of traditional journalistic skills, like research and innovative forms of investigation, with key information sets, key data, and new processing, analytics, and visualization software that allows journalists to peer through the massive amounts of data available in a digital environment and to show them in a clear and simple way to the public. The importance of data journalism is given by its ability to gather, interrogate, visualize, and mash up data from different sources or services, and it requires an amalgamation of a journalist's “nose for news” and tech-savvy competences.

However, data journalism is not as new as it seems to be. Ever since organizations and public administrations collected information or built up archives, journalism has been dealing with large amounts of data. As long as journalism has been practiced, journalists were keen to collect data and to report them accurately. When data displaying techniques got better in the late eighteenth century, newspapers started to use this know-how to present information in a more sophisticated way. The first example of data journalism can be traced back to 1821 and involved The Guardian, at the time based in Manchester, UK. The newspaper published a leaked table listing the number of students and the costs for each school in the British city. For the first time, it was publicly shown that the number of students receiving free education was higher than what was expected in the population. Another example of early data journalism dates back to 1858, when Florence Nightingale, the social reformer and founder of modern nursing, published a report to the British Parliament about the deaths of soldiers. In her report she revealed, with the help of visual graphics, that the main cause of mortality was preventable diseases contracted during care rather than battle.

By the middle of the twentieth century, newsrooms started to systematically use computers to collect and analyze data in order to find and enrich news stories. In the 1950s this procedure was called computer-assisted reporting (CAR) and is perhaps the evolutionary ancestor of what we call data journalism today. Computer-assisted reporting was, for instance, used by the television network CBS in 1952 to predict the outcome of the US presidential election. CBS used a then famous Universal Automatic Computer (UNIVAC) and programmed it with statistical models based on voting behavior from earlier elections. With just 5% of votes in, the computer correctly predicted the landslide win of former World War II general Dwight D. Eisenhower with a margin of error of less than 1%. After this remarkable success of computer-assisted reporting at CBS, other networks started to use computers in their newsrooms as well, particularly for voting prediction. Not one election has since passed without a computer-assisted prediction. However, computers were only slowly introduced in newsrooms, and only in the late 1960s did they start to be regularly used in news production as well.

In 1967, a journalism professor from the University of North Carolina, Philip Meyer, used for the first time a quicker and better equipped IBM 360 mainframe computer to do statistical analyses on survey data collected during the Detroit riots. Meyer was able to show that not only less educated Southerners were participating in the riots but also people who had attended college. This story, published in the Detroit Free Press, won him a Pulitzer Prize together with other journalists and marked a paradigm shift in computer-assisted reporting. On the grounds of this success, Meyer not only supported the use of computers in journalistic practices but developed a whole new approach to investigative reporting by introducing and using social science research methods in journalism for data gathering, sampling, analysis, and presentation. In 1973 he published his thoughts in the seminal book entitled “Precision Journalism.” The fact that computer-assisted reporting entered newsrooms especially in the USA was also
revealed through the increased use of computers in news organizations. In 1986, Time magazine wrote that computers were revolutionizing investigative journalism. By trying to analyze larger databases, journalists were able to offer a broader perspective and much more information about the context of specific events.

The practice of computer-assisted reporting spread further until, at the beginning of the 1990s, it became a standard routine, particularly in bigger newsrooms. The use of computers, together with the application of social science methods, has helped – according to Philip Meyer – to make journalism scientific. Besides, Meyer's approach also tried to tackle some of the common shortcomings of journalism, like the increasing dependence on press releases, shrinking accuracy and trust, or the critique of political bias. An important factor of precision journalism was therefore the introduction and use of statistical software. These programs enabled journalists for the first time to analyze bigger databases such as surveys or public records. This new approach might also be seen as a reaction to alternative journalistic trends that came up in the 1990s, for instance, the concept of new journalism. While precision journalism stood for scientific rigor in data analysis and reporting, new journalism used techniques from fiction to enhance the reading experience.

There are some similarities between data journalism and computer-assisted reporting: both rely on specific software programs that enable journalists to transform raw data into news stories. However, there are also differences between computer-assisted reporting and data journalism, which are due to the context in which the two practices were developed. Computer-assisted reporting tried to introduce both informatics and scientific methods into journalism, given that at the time data was scarce, and many journalists had to generate their own data. The rise of the Internet and new media contributed to the massive expansion of archives and databases and to the creation of big data. There is no longer a poverty of information; data is now available in abundance. Therefore, data journalism is less about the creation of new databases and more about data gathering, analysis, and visualization, which means that journalists have to look for specific patterns within the data rather than merely seeking information – although recent discussions call for journalists to create their own databases due to an overreliance on public databases. Either way, the success of data journalism also led to new practices, routines, and mixed teams of journalists working together with programmers, developers, and designers within the same newsrooms, allowing them to tell stories in a different and visually engaging way.

Media Organizations and Big Data
Big data is not only a valuable resource for data journalism. Media organizations are data gatherers as well. Many media products, whether news or entertainment, are financed through advertising. In order to satisfy the advertisers' interests in the site's audience, penetration, and visits, media organizations track user behavior on their webpages. Very often, media organizations share this data with external research bodies, which then try to use the data on their behalf. Gathering information about their customers is therefore not only an issue when it comes to the use of social media. Traditional media organizations are also collecting data about their clients.

However, media organizations track user behavior on news websites not only to provide data to their advertisers. Through user data, they also adapt the website's content to the audience's demand, with dysfunctional consequences for journalism and its democratic function within society. Due to web analytics and the generation of large-scale data collections, the audience exerts an increasing influence over the news selection process. This means that journalists – particularly in the online realm – are at risk of increasingly adapting their news selections to the audience's feedback through data generated via web analytics. Due to the grim financial situation and their shrinking advertising revenue, some print media organizations, especially in western societies, try to apply strategies to compensate for these deficits through a dominant market-driven discourse, manufacturing cheaper content that appeals to broader masses – publishing more soft news, sensationalism, and articles of human interest without
any connection to public policy issues. This is also due to the different competitive environment: while there are fewer competitors in traditional newspaper or broadcast markets, in the online world the next competitor is just one click away. Legacy media organizations, particularly newspapers and their online webpages, offer more soft news to increase traffic, to attract the attention of more readers, and thus to keep their advertisers at it. A growing body of literature about the consequences of this behavior shows that journalists, in general, are becoming much more aware of the audiences' preferences. At the same time, however, there is also a growing concern among journalists with regard to their professional ethics and the consequences for the function of journalism in society if they base their editorial decision-making processes on real-time data. The results of web analytics not only influence the placement of news on the websites; they also have an impact on the journalists' beliefs about what the audience wants. Particularly in online journalism, news selection is carried out by grounding the decisions on data generated by web analytics and no longer on intrinsic notions such as news values or personal beliefs. Consequently, online journalism becomes highly responsive to the audiences' preferences – serving less what would be in the public interest. As many news outlets are integrated organizations, which means that they apply a crossmedia strategy by joining previously separated newsrooms such as the online and the print staff, it might be possible that factors like data-based audience feedback will also affect print newsrooms. As Tandoc Jr. and Thomas state, if journalism continues to view itself as a sort of “conduit through which transient audience preferences are satisfied, then it is no journalism worth bearing the name” (Tandoc and Thomas 2015, p. 253).

While news organizations still struggle with self-gathered data due to the conflicts that can arise in journalism, media organizations active in the entertainment industry rely much more strongly on data about their audiences. Through large amounts of data, entertainment media can collect significant information about the audience's preferences for a TV series or a movie – even before it is broadcast. Particularly for big production companies or film studios, it is essential to observe structured data like ratings, market share, and box office stats. But unstructured data like comments or videos in social media are equally important in order to understand consumer habits, given that they provide information about the potential success or failure of a (new) product.

An example of such use of big data is the launch of the TV show “House of Cards” by the Internet-based on-demand streaming provider Netflix. Before launching this first original content with the political drama, Netflix was already collecting huge amounts of data about the streaming habits of its customers. From more than 25 million users, it tracked around 30 million views a day (recording also when people pause, rewind, or fast-forward the videos), about four million ratings, and three million searches (Carr 2013). On top of that, Netflix also tries to gather unstructured data from social media, and it looks at how customers tag the selected videos with metadata descriptors and whether they recommend the content. Based on these data, Netflix predicted possible preferences and decided to buy “House of Cards.” It was a major success for the online-based company.

There are also potential risks associated with the collection of such huge amounts of data: Netflix recommends specific movies or TV shows to its customers based on what they liked or what they have watched before. These recommendation algorithms might well guide the user toward more of Netflix's original content, without taking into account the consumers' actual preferences. In addition, consumers might not be able to discover new TV shows that transcend their usual taste. Given that services like Netflix know so much about their users' habits, another concern with regard to privacy arises.

Big Data Between Social Media, Ethics, and Surveillance
Social media are a main source of big data. Since the first major social media webpages were launched in the 2000s, they began to collect and store massive amounts of data. These sites started
to gather information about the behavior, preferences, and interests of their users in order to know how their users would both think and act. In general, this process of datafication is used to target and tailor the services better to the users' interests. At the same time, social media use these large-scale data collections to help advertisers target users. Big data in social media therefore also have a strong commercial connotation. Facebook's business model, for instance, is entirely based on hyper-targeted display ads. While display ads are a relatively old-fashioned way of addressing customers, Facebook can make up for it with its incredible precision about the customers' interests and its ability to target advertising more effectively.

Big data are an integrative part of social media's business model: they possess far more information on their customers given that they have access not only to their surfing behavior but above all to their tastes, interests, and networks. This might not only bear the potential to predict the users' behavior but also to influence it, particularly as social media such as Facebook and Twitter also adapt their noncommercial content to the individual users: the news streams we see on our personal pages are balanced by various variables (differing between social media) such as interactions, posting habits, popularity, the number of friends, user engagement, and others, which are, however, constantly recombined. Through such opaque algorithms, social media might well use their own data to model voters: in 2010, for example, 61 million users in the USA were shown a banner message on their pages about how many of their friends had already voted in the US Congressional Elections. The study showed that the banner convinced more than 340,000 additional people to cast their vote (Bond et al. 2012). The individually tailored and modeled messaging not only bears the potential to harm the civic discourse; it also enhances the negative effects deriving from the “asymmetry and secrecy built into this mode of computational politics” (Tufekci 2014).

The amount of data stored on social media will continue to rise, and already today social media are among the largest data repositories in the world. Since the data collecting mania of social media will not decrease, which is also due to the
explorative focus of big data, it raises issues with regard to the specific purpose of the data collection. Particularly if the data usage, storage, and transfer remain opaque and are not made transparent, the data collection might be disproportionate. Yet certain social media allow third parties to access their data, particularly as the trade of data increases because of its economic potential. This policy raises ethical issues with regard to transparency about data protection and privacy.

Particularly in the wake of the Snowden revelations, it has been shown that opaque algorithms and big data practices are increasingly important to surveillance: “[...] Big Data practices are skewing surveillance even more towards a reliance on technological ‘solutions,’ and that these both privileges organizations, large and small, whether public or private, reinforce the shift in emphasis toward control rather than discipline and rely increasingly on predictive analytics to anticipate and preempt” (Lyon 2014, p. 10). Overall, the Snowden disclosures have demonstrated that surveillance is no longer limited to traditional instruments in the Orwellian sense but has become ubiquitous and overly reliant on practices of big data – as governmental agencies such as the NSA and GCHQ are allowed not only to access the data of social media and search giants but also to track and monitor the telecommunications of almost every individual in the world. However, the big issue even with the collect-all approach is that data are subject to limitations and bias, particularly if they rely on automated data analysis: “Without those biases and limitations being understood and outlined, misinterpretation is the result” (Boyd and Crawford 2012, p. 668). This might well lead to false accusations or failures of predictive surveillance, as could be seen in the Boston Marathon bombing case: first, a picture of the wrong suspect was massively shared on social media, and second, the predictive radar grounded on data gathering was ineffective.

In addition, the use of big data generated by social media also entails ethical issues in reference to scientific research. Normally, when human beings are involved in research, strict ethical rules, such as the informed consent of the people participating in the study, have to be observed. Moreover, in social media there are “public” and “private,” which can be accessed. An example of such a controversial use of big data is a study carried out by Kramer et al. (2014). The authors deliberately changed the newsfeed of Facebook users: some got more happy news, others more sad ones, because the goal of the study was to investigate whether emotional shifts in those surrounding us – in this case virtually – can change our own moods as well. The issue with the study was that the users in the sample were not aware that their newsfeed was altered. This study shows that the use of big data generated in social media can entail ethical issues, not least because the constructed reality within Facebook can be distorted. Ethical questions with regard to media and big data are thus highly relevant in our society, given that both the privacy of citizens and the protection of their data are at stake.

Conclusion

Big data plays a crucial role in the context of the media. The instruments of computer-assisted reporting and data journalism allow news organizations to engage in new forms of investigation and storytelling. Big data also allow media organizations to better adapt their services to the preferences of their users. While in the news business this may lead to an increase in soft news, the entertainment industry benefits from such data in order to predict the audience's taste with regard to potential TV shows or movies. One of the biggest issues with regard to media and big data are its ethical implications, particularly with regard to data collection, storage, transfer, and surveillance. As long as the urge to collect large amounts of data and the use of opaque algorithms continue to prevail in many already powerful (social) media organizations, the risks of data manipulation and modeling will increase, particularly as big data are becoming even more important in many different aspects of our lives. Furthermore, as the Snowden revelations showed, collect-it-all surveillance already relies heavily on big data practices. It is therefore necessary to increase both the research into and the awareness about the ethical implications of big data in the media context. Only thanks to a critical discourse about the use of big data in our society will we be able to determine “our agency with respect to big data that is generated by us and about us, but is increasingly being used at us” (Tufekci 2014). Being more transparent, accountable, and less opaque about the use and, in particular, the purpose of data collection might be a good starting point.

Cross-References

▶ Crowdsourcing
▶ Digital Storytelling, Big Data Storytelling
▶ Online Advertising
▶ Transparency

References

Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489, 295–298.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
Carr, D. (2013, February 24). Giving readers what they want. New York Times. http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html. Accessed 11 July 2016.
Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America, 111(24), 8788–8790.
Lyon, D. (2014, July–December). Surveillance, Snowden, and Big Data: Capacities, consequences, critique. Big Data & Society, 1–13.
Tandoc Jr., E. C., & Thomas, R. J. (2015). The ethics of web analytics. Implications of using audience metrics in news construction. Digital Journalism, 3(2), 243–258.
Tufekci, Z. (2014). Engineering the public: Big data, surveillance and computational politics. First Monday, 19(7). http://journals.uic.edu/ojs/index.php/fm/article/view/4901/4097. Accessed 12 July 2016.

Media Ethics

▶ Media
Medicaid

Kim Lorber1 and Adele Weiner2
1Social Work Convening Group, Ramapo College of New Jersey, Mahwah, NJ, USA
2Audrey Cohen School For Human Services and Education, Metropolitan College of New York, New York, NY, USA

Introduction

Medicaid provides medical care to low-income individuals and is the US federal government's most costly welfare program. Funds are provided to states that wish to participate, and programs are managed differently within each; all states have participated since 1982. As of January 2017, 32 states participate in the Medicaid expansion under the Affordable Care Act (ACA). In 2015, according to the Kaiser Family Foundation (KFF), 20% of Americans with medical insurance, or over 62 million people, were covered by Medicaid. By 2017, with the ACA expansion, Medicaid has become the nation's largest insurer, covering over 74.5 million people, or 1 in 5 individuals, and has financed over 16% of all personal health care. A poll in 2017 found that 50% of respondents said Medicaid is important to their family (Kaiser Family Foundation 2017). It is estimated that the proposed alternatives to the ACA will throw 24 million people off Medicaid in the next 10 years.

Medicaid data is big by nature of the number of individuals served and the information collected. The potential for understanding the population served regarding medical, hospital, acute, and long-term care is seemingly limitless. And, with such potential comes huge responsibility to be comprehensive and accurate.

Privacy and Health Care Coordination

The Health Insurance Portability and Accountability Act of 1996 (HIPAA) provides safeguards to ensure the privacy of medical records, including any information generated for those covered by Medicaid. It provides national standards for processing electronic healthcare transactions, including secure electronic access to health data. Electronic health networks or health information exchanges (HIE) facilitate the availability of medical information electronically across organizations within states, regions, communities, or hospital systems. Such systems include medical records for all patients, including their health insurance information.

Data for Medicaid enrollment, service utilization, and expenditures is collected by the Medicaid Statistical Information System (MSIS) used by the Centers for Medicare and Medicaid Services (CMS). Ongoing efforts continue to ensure the availability of necessary medical data for providers, reduce duplication of services, and ensure timely payment. In 2014, CMS implemented the Medicaid Innovation Accelerator Program (IAP) to improve health care for Medicaid beneficiaries by supporting individual states' efforts in reducing costs while improving payment and delivery systems. In 2016, CMS created an interoperability initiative to connect a wider variety of Medicaid providers and improve health information exchanges.

Ultimately, it is possible to track Medicaid beneficiaries in real time as they utilize the health care system in any environment, ensuring appropriate treatment in the most cost-effective venues. These real-time details allow Medicaid providers throughout the country to access detailed reports about patient treatment as well as program spending to managed care plans, which do not necessarily use a fee-for-service system but, instead, make the provider fully responsible for potentially high-quality and costly treatments.

Reducing Medicaid Costs

A variety of methods are being utilized throughout the country in order to reduce Medicaid costs, increase efficiency, demonstrate billing accountability, and find cases of Medicaid fraud. In a volatile and politically charged healthcare environment, Medicaid services and eligibility are
constantly changing. Sorting through double billing, patients' repeat visits, and the absence of required follow-ups has demonstrated benefits in cost-saving measures such as preventing potential hospitalizations while ensuring proper patient treatment and follow-up. Individual states are responsible for developing electronic systems for managing their Medicaid costs.

Washington State, facing a fiscal Medicaid crisis, implemented a statewide database of ER visits, available across hospitals, to document and reduce the use of hospitals for non-emergency care. While hospitals cannot turn away a patient, reimbursement for treatment may not be provided due to limited Medicaid funds. The state had tried different approaches, from limiting the number of ER visits per year to identifying 500 ailments that would no longer be reimbursable as emergency care. The solution to these protested and rejected "solutions" was to create a database that, within minutes of arrival, allows an attending doctor to see a patient's complete medical history, potentially reducing duplication of diagnostic tests. Referrals to alternative, more appropriate, and less expensive treatment resources resulted in a Medicaid cost reduction of $33.7 million in the course of 1 year, in part attributable to the database. Oregon will follow, and inquiries from other states seeking to reduce ER costs have been received, as Washington has shown how impactful data can be.

IBM developed text analysis software, which was successful in reducing Medicaid readmissions in North Carolina. Other uses have resulted in systems that provide alerts to case managers and others to remind patients to follow up with specialists or to complete necessary medical tests in order to complete the treatment begun in the hospital. Such efforts allow patients to truly address their malady and reduce hospital readmissions while resulting in reduced Medicaid costs.

The Medicaid Management Information System (MMIS), developed by Xerox and approved by the Centers for Medicare and Medicaid Services (CMS), is used in at least 31 states. Its sophisticated algorithms review drug prescriptions and repeated rule violations by pharmacies and doctors' offices, and find duplicate billing and reused prescriptions that result in multiple and fraudulent payments. Advances now allow predictive detection of fraud, rather than only finding it with the rules-based system after payments have been made. This helps defeat criminal activities, which evolve in sync with regularly improved preventive methods.

Limitations of Medicaid Big Data

With Big Data comes big responsibility. All data collection networks must be HIPAA compliant and protect patient medical information, and yet must be accessible to service providers. Biola et al. (2014) analyzed Medicaid information from North Carolina. The study was of non-cancer adults on Medicaid who had received at least 10 computed tomography (CT) scans, to inform them of their radiation exposure. Most interesting, as relevant to this entry, is that scan information was only available for Medicaid patients, and even that was not comprehensive, as some patients with high exposure were unintentionally excluded because care providers' claim filing differed by setting. Thus Medicaid information can be incomplete, suggesting the need for future alignment of billing and claims systems.

Conclusion

Big data, as related to Medicaid, can significantly improve patient safety and care while providing cost-saving measures. As political challenges are mounted to the Affordable Care Act, Medicaid data may help to inform the national discussion about health and insurance. It clearly demonstrates how access to health care can reduce more costly emergency room visits. Demographic information about those enrolled in Medicaid and their advocates can present a sizeable voting block in the political process to protect Medicaid funding levels and eligibility for enrollment by highlighting efficiencies in Medicaid payment and services.

Further Reading

Biola, H., Best, R. M., Lahlou, R. M., Burke, L. M., Dward, C., Jackson, C. T., Broder, J., Grey, L., Semelka, R. C., & Dobson, A. (2014). With "big data" comes big responsibility: Outreach to North Carolina Medicaid patients with 10 or more computed tomography scans in 12 months. North Carolina Medical Journal, 75(2), 102–109.
Kaiser Family Foundation. (2017). Medicaid. Retrieved on May 13, 2017 from http://kff.org/medicaid/.
Medicaid.gov: Keeping America Healthy. (2014). Retrieved on September 20, 2014 from http://www.medicaid.gov/.
Weise, K. (2014, March 25). How big data helped cut emergency room visits by 10 percent. Retrieved on September 10, 2014 from http://www.businessweek.com/articles/2014-03-25/how-big-data-helped-cut-emergency-room-visits-by-10-percent.

Metadata

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Metadata are data about data, or in a more general sense, they are data about resources. They provide a snapshot about a resource, such as information about the creator, date, subject, location, time and methods used, etc. There are high-level metadata standards that can provide a general description of a resource. In recent years, community efforts have been taken to develop domain-specific metadata schemas and encode the schemas with machine readable formats for the World Wide Web. Those schemas can be reused and extended to fit the requirements of specific applications. Compared with the long-term archiving of data and metadata in traditional data management and analysis, the velocity of Big Data leads to short-term and quick applications addressing scientific and business issues. Accordingly, there is a metadata data life cycle in Big Data applications. Community metadata standards and machine readable formats will be a big advantage to facilitate the metadata data life cycle on the Web.

Know Before Use

Few people are able to use a piece of data before knowing its subject, origin, structure, and meaning. A primary functionality of metadata is to help people to obtain an overview of some data, and this functionality can be understood through a few real-world examples. If data are comparable with goods in a grocery, then metadata are like the information on the package of an item. A consumer may care more about the ingredients due to allergies to some substances, the nutrition facts due to dietary needs, and/or the manufacturer and date of expiration due to personal preferences. Most people want to know the information about a grocery item before purchasing and consuming it. The information on the package provides a concise and essential introduction about the item inside. Such nutrition and ingredient information of grocery items is mandatory for manufacturers in many countries. Similarly, an ideal situation for data users is that they can receive clear metadata from data providers. However, compared to the food industry, the rules and guidelines for metadata are still less developed.

Another comparable subject is the 5W1H method for storytelling or context description, especially in journalism. The 5W1H represents the question words who, what, when, where, why, and how, which can be used to organize a number of questions about a certain object or event, such as: Who is responsible for a research project? What are the planned output data? Where will the data be archived? When will the data be open access? Why is a specific instrument needed for data collection? How will the data be maintained and updated? In journalism, the 5W1H is often used to evaluate whether the information covered in a news article is complete or not. Normally, the first paragraph of a news article gives a brief overview of the article and provides concise information to answer the 5W1H questions. By reading the first paragraph, a reader can grasp the key information of an article even before reading through the full text. Metadata is data about data; such functionality is similar to the role the first paragraph plays for a news article, and
the metadata items used for describing a dataset are comparable to the 5W1H question words.

Metadata Hierarchy

Metadata are used for describing resources. The description can be general or detailed according to the actual needs. Accordingly, there is a hierarchy of metadata items corresponding to the actual needs of describing an object. For instance, the abovementioned 5W1H question words can be regarded as a list of general metadata items, and they can also be used to describe datasets. However, the six question words only offer a starting point, and there may be various derived metadata items in actual works. In the early days, this led to a heterogeneous situation among the metadata provided by different stakeholders. To promote standardization of metadata items, a number of international standards have been developed.
The most well-known standard is the Dublin Core Metadata Element Set (DCMI Usage Board 2012). The name "Dublin" originates from a 1995 workshop at Dublin, OH, USA. The word "Core" means that the elements are generic and broad. The 15 core elements are contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type. Those elements are more specific than the 5W1H question words and can be used for describing a wide range of resources, including datasets. The Dublin Core Metadata Element Set was published as a standard by the International Organization for Standardization (ISO) in 2003 and later revised in 2009. It has also been endorsed by a number of other national or international organizations such as the American National Standards Institute and the Internet Engineering Task Force.
The 15 core elements are part of an enriched specification of metadata terms maintained by the Dublin Core Metadata Initiative (DCMI). The specification includes properties in the core elements, properties in an enriched list of terms, vocabulary encoding schemes, syntax encoding schemes, and classes (including the DCMI Type Vocabulary). The enriched terms include all the 15 core elements and cover a number of more specific properties, such as abstract, access rights, has part, has version, medium, modified, spatial, temporal, valid, etc. In practice, the metadata terms in the DCMI specification can be further extended by combining with other compatible vocabularies to support various application profiles. With the 15 core elements, one is able to provide rich metadata for a certain resource, and by using the enriched DCMI metadata terms and external vocabularies, one can create an even more specific metadata description for the same object. This can be done in a few ways. For example, one way is to use terms that are not included in the core elements, such as spatial and temporal. Another possible way is to use a refined metadata term that is more appropriate for describing an object. For instance, the term "description" in the core elements has a broad meaning, and it may include an abstract, a table of contents, a graphical representation, or a free-text account of a resource. In the enriched DCMI terms, there is a more specific term "abstract," which means a summary of a resource. Compared to "description," the term "abstract" is more specific and appropriate if one wants to collect a literal summary of an academic article.
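To make the preceding discussion concrete, the following short Python sketch (not part of the DCMI specification itself; the element names follow the 15 core elements listed above, and the sample values are invented) serializes a minimal Dublin Core description of a dataset as XML:

# A minimal sketch of a Dublin Core record for a dataset, serialized as XML with the
# Python standard library. The sample values are hypothetical.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = {
    "title": "Monthly Streamflow Observations, 2010-2015",
    "creator": "Example Research Group",
    "date": "2016-03-01",
    "subject": "hydrology",
    "description": "Monthly averaged streamflow measurements for one gauging station.",
    "format": "text/csv",
    "type": "Dataset",
}

root = ET.Element("metadata")
for element, value in record.items():
    child = ET.SubElement(root, "{%s}%s" % (DC_NS, element))
    child.text = value

print(ET.tostring(root, encoding="unicode"))

Such a record could be refined further, for example by replacing the generic "description" element with the more specific "abstract" term from the enriched DCMI vocabulary.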
Domain-Specific Metadata Schemas

High-level metadata terms such as those in the Dublin Core Metadata Element Set have broad meaning and are applicable to various resources. However, those metadata elements are too general in meaning and sometimes are implicit. If one wants a more specific and detailed description of the resources, a domain-specific metadata schema is needed. Such a metadata schema is a list of organized metadata items for describing a certain type of resource. For example, there could be a metadata schema for each type defined in the DCMI Type Vocabulary, such as dataset, event, image, physical object, service, etc. There have been various national and international community efforts for building domain-specific metadata schemas. In particular, many schemas developed in recent years address data management and exchange on the Web. A few recent works are introduced below.
The data catalog vocabulary (DCAT) (Erickson and Maali 2014) was approved as a World Wide Web Consortium (W3C) recommendation in January 2014. It was designed to facilitate interoperability among data catalogs published on the Web. DCAT defines a metadata schema and provides a number of examples on how to use it. DCAT reuses a number of DCMI metadata terms in combination with terms from other schemas such as the W3C Simple Knowledge Organization System (SKOS). It also defines a few new terms to make the resulting schema more appropriate for describing datasets in data catalogs.
The Darwin Core is a group of standards for biodiversity applications. By extending the Dublin Core metadata elements, the Darwin Core establishes a vocabulary of terms to facilitate the description and exchange of data about the geographic occurrence of organisms and the physical existence of biotic specimens. The Darwin Core itself is also extensible, which provides a mechanism for describing and sharing additional information.
The ecological metadata language (EML) is a metadata standard developed for the non-geospatial datasets in the field of ecology. It is a set of schemas encoded in the format of extensible markup language (XML) and thus allows structured expression of metadata. EML can be used to describe digital resources and also nondigital resources such as paper maps.
The international geo sample number (IGSN), initiated in 2004, is a sample identification code for the geoscience community. Each registered IGSN identifier is accompanied by a group of metadata providing detailed background information about that sample. Top concepts in the current IGSN metadata schema are sample number, registrant, related resource identifiers, and log. A top concept may include a few child concepts. For example, there are two child concepts for "registrant": registrant name and name identifier.
The ISO 19115 and ISO 19115-2 geographic information metadata standards are regarded as a best practice of metadata schemas for geospatial data. Geospatial data are about objects with some position on the surface of the Earth. The ISO 19115 standards provide guidelines on how to describe geographical information and services. Detailed metadata items cover topics about contents, spatiotemporal extents, data quality, channels for access and rights to use, etc. Another standard, ISO 19139, provides an XML schema implementation for ISO 19115. The catalog service for the Web (CSW) is an Open Geospatial Consortium (OGC) standard for describing online geospatial data and services. It adopts ISO 19139, the Dublin Core elements, and items from other metadata efforts. Core elements in CSW include title, format, type, bounding box, coordinate reference system, and association.

Annotating a Web of Data

Recent efforts on metadata standards and schemas, such as the abovementioned Dublin Core, DCAT, Darwin Core, EML, IGSN metadata, ISO 19139, and CSW, show a trend of publishing metadata on the Web. More importantly, by using standard encoding formats, such as XML and the W3C resource description framework (RDF), they are making metadata machine discoverable and readable. This mechanism moves the burden of searching, evaluating, and integrating massive datasets from humans to computers, and for computers such a burden is not a real burden because they can find ways to access various data sources through standardized metadata on the Web. For example, the project OneGeology aims to enable online access to geological maps across the world. By the end of 2014, OneGeology had 119 participating nations, and most of them share national or regional geological maps through OGC geospatial data service standards. Those map services are maintained by their corresponding organizations, and they also enable standardized metadata services, such as CSW. On the one hand, OneGeology provides technical support to organizations who want to set up geologic map services using common standards. On the other hand, it also provides a central data portal for end users to access various distributed
metadata and data services. The OneGeology project presents a successful example of how to rescue legacy data, update them with well-organized metadata, and make them discoverable, accessible, and usable on the Web.
Compared with domain-specific structured datasets, such as those in OneGeology, many other datasets in Big Data are not structured, such as webpages and data streams on social media. In 2011, the search engines Bing, Google, Yahoo!, and Yandex launched an initiative called schema.org, which aims at creating and supporting a common set of schemas for structured data markup on web pages. The schemas are presented as lists of tags in hypertext markup language (HTML). Webmasters can use those tags to mark up their web pages, and search engine spiders and other parsers can recognize those tags and record what a web page is about. This makes it easier for search engine users to find the right web pages. Schema.org adopts a hierarchy to organize the schemas and vocabularies of terms. The concept on the top is thing, which is very generic and is divided into schemas of a number of child concepts, including creative work, event, intangible, medical entity, organization, person, place, product, and review. These schemas are further divided into smaller schemas with specific properties. A child concept inherits characteristics from a parent concept. For example, book is a child concept of creative work. The hierarchy of concepts and properties does not intend to be a comprehensive model that covers everything in the world. The current version of schema.org only represents those entities that the search engines can handle in the short term. Schema.org provides a mechanism for extending the scope of concepts, properties, and schemas. Webmasters and developers can define their own specific concepts, properties, and schemas. Once those extensions are commonly used on the Web, they can also be included as a part of the schema.org schemas.
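As a concrete illustration, the following Python sketch (the values are hypothetical; in practice such markup is usually embedded directly in a web page, for example as JSON-LD inside a script tag) generates a small schema.org description of a book, the child concept of creative work mentioned above:

# A minimal sketch of schema.org markup generated as JSON-LD with the standard library.
# The sample values are invented and only illustrate the concept/property hierarchy.
import json

book = {
    "@context": "https://schema.org",
    "@type": "Book",  # Book is a child concept of CreativeWork
    "name": "An Example Field Guide",
    "author": {"@type": "Person", "name": "A. Researcher"},
    "datePublished": "2013",
    "publisher": {"@type": "Organization", "name": "Example Press"},
}

print(json.dumps(book, indent=2))

A search engine parser that recognizes the schema.org vocabulary can read such a block and record that the page describes a book, its author, and its publisher.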
Linking for Tracking

If the recognition of domain-specific topics is a work to identify resource types, then the definition of metadata items is a work of annotating those types. The work in schema.org is an excellent reflection of those two works. Various structured and unstructured resources can be categorized and annotated by using metadata and are ready to be discovered and accessed. In a scientific or business procedure, various resources are retrieved and used, and outputs are generated, archived, and perhaps reused elsewhere. In recent years, people have taken a further step to make links among those resources, their types, and properties, as well as the people and activities involved in the generation of those outputs. The work of categorization, annotation, and linking as a whole can be used to describe the origin of a resource, which is called provenance. There have been community efforts developing specifications of commonly usable provenance models.
The Open Provenance Model was initiated in 2006. It includes three top classes, artifact, process, and agent, and their subclasses, as well as a group of properties, such as was generated by, was controlled by, was derived from, and used, for describing the classes and the interrelationships among them. Another earlier effort is the proof markup language, which was used to represent knowledge about how information on the Web was asserted or inferred from other information sources by intelligent agents. Information, inference step/inference rule, and inference engine are the three key building blocks in the proof markup language.
Works on the Open Provenance Model and the proof markup language have set up the basis for community actions. Most recently, the W3C approved the PROV Data Model as a recommendation in 2013. The PROV Data Model is a generic model for provenance, which allows specific representations of provenance in research domains or applications to be translated into the model and be interchangeable among systems (Moreau and Missier 2013). There are intelligent knowledge systems that can import the provenance information from multiple sources, process it, and reason over it to generate clues for potential new findings. The PROV Data Model includes three core classes, entity, activity, and agent, which are comparable to the Open Provenance Model and the proof markup language. W3C also approved the PROV Ontology as a recommendation for the expression of the PROV Data Model with semantic Web languages. It can be used to represent machine readable provenance information and can also be specialized to create new classes and properties to represent provenance information of specific applications and domains. The extension and specification here are similar to the idea of a metadata hierarchy. A typical application of the PROV Ontology is the Global Change Information System for the US Global Change Research Program (Ma et al. 2014), which captures and presents provenance of global change research, and links to the publications, datasets, instruments, models, algorithms, and workflows that support key research findings. The provenance information in the system increases understanding, credibility, and trust in the works of the US Global Change Research Program and aids in fostering reproducibility of results and conclusions.

A Metadata Life Cycle

Velocity is a unique feature that differentiates Big Data from traditional data. Traditional data can also be big, but they have a relatively longer life cycle compared to social media data streams in Big Data. Big Data life cycles are featured by short-term and quick deployments to solve specific scientific or business issues. In traditional data management, especially for a single data center or data repository, the metadata life cycle is less addressed. Now, facing the short-lived and quick Big Data life cycles, attention should also be paid to the metadata life cycle.
In general, a data life cycle covers steps of context recognition, data discovery, data access, data management, data archive, and data distribution. Correspondingly, a metadata life cycle covers similar steps, but they focus on the description of data rather than the data themselves. The context recognition allows people to study a specific domain or application and reuse any existing metadata standards and schemas. Then, in the metadata discovery step, it is possible to develop applications to automatically harvest machine readable metadata from multiple sources and harmonize them. Commonly used domain-specific metadata standards and machine readable formats will significantly facilitate the metadata life cycle in applications using Big Data, because most of such applications will be on the Web and interchangeable schemas and formats will be an advantage.

Cross-References

▶ Data Brokers and Data Services
▶ Data Profiling
▶ Data Provenance
▶ Data Sharing
▶ Open Data

Further Reading

DCMI Usage Board. (2012). DCMI metadata terms. http://dublincore.org/documents/dcmi-terms.
Erickson, J., & Maali, F. (2014). Data catalog vocabulary (DCAT). http://www.w3.org/TR/vocab-dcat.
Ma, X., Fox, P., Tilmes, C., Jacobs, K., & Waple, A. (2014). Capturing provenance of global change information. Nature Climate Change, 4(6), 409–413.
Moreau, L., & Missier, P. (2013). PROV-DM: The PROV data model. http://www.w3.org/TR/prov-dm.

Middle East

Feras A. Batarseh
College of Science, George Mason University, Fairfax, VA, USA

Synonyms

Mid-East; Middle East and North Africa (MENA)

Definition

The Middle East is a transcontinental region in Western Asia and North Africa. Countries of the
Middle East are ones extending from the shores of the Mediterranean Sea, south towards Africa, and east towards Asia, and sometimes beyond, depending on the context (political, geographical, etc.). The majority of the countries of the region speak Arabic.

Introduction

The term "Middle East" evolved with time. It originally referred to the countries of the Ottoman Empire, but by the mid-twentieth century, a more common definition of the Middle East included the following states (countries): Turkey, Jordan, Cyprus, Lebanon, Iraq, Syria, Israel, Iran, the West Bank and the Gaza Strip (Palestine), Egypt, Sudan, Libya, Saudi Arabia, Kuwait, Yemen, Oman, Bahrain, Qatar, and United Arab Emirates (UAE). Subsequent political and historical events have tended to include more countries into the mix (such as: Tunisia, Algeria, Morocco, Afghanistan, and Pakistan).
The Middle East is often referred to as the cradle of civilization. By studying the history of the region, it is clear why the first human civilizations were established in this part of the world (particularly the Mesopotamia region around the Tigris and Euphrates rivers). The Middle East is where humans made their first transition from nomadic life to agriculture, invented the wheel, and where the beginnings of the written word first existed. It is well known that this region is an active political, economic, historic, and religious part of the world (Encyclopedia Britannica 2017). For the purposes of this encyclopedia, the focus of this entry is on technology, data, and software of the Middle East.

The Digital Age in the Middle East

Since the beginning of the 2000s, the Middle East has been one of the highest regions in the world in terms of adoption of social media; certain countries (such as the United Arab Emirates, Qatar, and Bahrain) have adopted social technologies among 70% of their populations (which is a higher percentage than the United States). While citizens are jumping on the bandwagon of social media, governments still struggle to manage, define, or guide the usage of such technologies.
The McKinsey Middle East Digitization Index is one of the main metrics to assess the level and impact of digitization across the Middle East. Only 6% of the Middle Eastern public lives under a digitized smart or electronic government (the UAE, Jordan, Israel, and Saudi Arabia are among the few countries that have some form of e-government) (Elmasri et al. 2016). However, many new technology startups are coming from the Middle East with great success. The most famous technology startup companies coming out of the Middle East include: (1) Maktoob (from Jordan) is one that stands out. The company represents a major trophy on the list of Middle Eastern tech achievements. It made global headlines when it was bought by Yahoo, Inc. for $80 million in 2009, symbolizing a worldwide important step by a purely Middle Eastern company. (2) Yamli (from Lebanon): One of the most popular web apps for Arabic speakers today. (3) GetYou (from Israel): A famous social media application. (4) Digikala (from Iran): An online retailer application. (5) ElWafeyat (from Egypt): An Arabic language social media site for honoring deceased friends and family. (6) Project X (from Jordan): A mobile application that allows for 3D printing of prosthetics, inspired by wars in the region. These examples are assembled from multiple sources; many other exciting projects exist as well (such as Souq, which was acquired by Amazon in 2017, Masdar, Namshi, Sukar, and many others).

Software Arabization: The Next Frontier

The first step towards invoking more technology in a region is to localize the software, content, and its data. Localizing a software system is accomplished by supporting a new spoken language (the Arabic language in this context, hence the name, Arabization). A new term is presented in
this entry of the Encyclopedia, Arabization: it is the overall concept that includes the process of making the software available and reliable across the geographical borders of the Arab states. Different spoken languages have different orientations and fall into different groups. Dealing with these groups is accomplished by using different code pages and Unicode fonts. Languages fall into two main families, single-byte (such as: French, German, and Polish) and double-byte (such as: Japanese, Chinese, and Korean). Another categorization that is more relevant to Middle Eastern languages is based on their orientation. Most Middle Eastern languages are right-to-left (RTL) (such as: Arabic and Hebrew), while other world languages are left-to-right (LTR) (such as: English and Spanish). For all languages, however, a set of translated strings should be saved in a bundle file that indexes all the strings and assigns them IDs so the software program can locate them and display the right string in the language of the user. Furthermore, to accomplish software Arabization, character encoding should be enabled. The default encoding for a given system is determined by the runtime locale set on the machine's operating system. The most commonplace character encoding format is UTF (UCS transformation format); UCS is the universal character set. UTF is designed to be compatible with ASCII. UTF has three types: UTF-8, UTF-16, and UTF-32, and it is specified as part of the international standard ISO/IEC 10646. It is important to note that the process of Arabization is not a trivial process; engineers cannot merely inject translated language strings into the system or hardcode cultural, date, or numerical settings into the software; rather, the process is done by obtaining different files based on the settings of the machine and the desires of the user, and applying the right locales. An Arabization package needs to be developed to further the digital, software, and technological evolution in the Middle East.
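As a simple illustration of the string bundle and encoding issues described above, the following Python sketch (the locale codes, string IDs, and file name are hypothetical, not part of any standard Arabization package) stores translated strings in a UTF-8 encoded bundle and looks them up by locale:

# A minimal sketch of a locale-keyed string bundle. Strings are looked up by ID, and the
# bundle is written as UTF-8 so the right-to-left Arabic text survives the round trip.
import json

bundles = {
    "en-US": {"GREETING": "Welcome", "SAVE": "Save"},
    "ar-JO": {"GREETING": "\u0623\u0647\u0644\u0627\u064b \u0648\u0633\u0647\u0644\u0627\u064b",
              "SAVE": "\u062d\u0641\u0638"},
}

def get_string(locale, string_id):
    # Fall back to English if the locale or the ID is missing.
    return bundles.get(locale, bundles["en-US"]).get(string_id, bundles["en-US"][string_id])

with open("strings.json", "w", encoding="utf-8") as f:
    json.dump(bundles, f, ensure_ascii=False, indent=2)

print(get_string("ar-JO", "GREETING"))

A real localization layer would also handle text direction, date, numeral, and calendar settings per locale, as noted above.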
Bridging the Digital Divide

Information presented in this entry shows how the Middle East is speeding towards catching up with industrialized nations in terms of software technology adoption and utilization (i.e., bridging the digital divide between third world and first world countries). Figure 1 below shows which countries are investing towards leading that transformation; numbers in the figure illustrate venture capital funding as a share of GDP (Elmasri et al. 2016). However, according to Cisco's 2015 visual networking index (VNI), the world is looking towards a new digital divide, beyond software and mobile apps. By 2019, the number of people connecting to the Internet is going to rise to 3.9 billion users, reaching over 50% of the global population. That will accelerate the new wave of big data, machine learning, and the Internet of Things (IoT). That will be the main new challenge for technology innovators in the Middle East. Middle Eastern countries need to first lay the "data" infrastructure (such as the principle of software Arabization presented above) that would enable the peoples of the Middle East towards higher adoption rates of future trends (big data and IoT). Such a shift would greatly influence economic growth in countries all across the region; however, the impacts of technology require minimum adoption thresholds before those impacts begin to materialize; the wider the intensity and use of big data, the Internet of Things (IoT), and machine learning, the greater the impacts.

Middle East, Fig. 1 Middle Eastern investments in technology (Elmasri et al. 2016)

Conclusion

The Middle East is known for many historical and political events, conflicts, and controversies; however, it is not often referred to as a technological and software-startup hub. This entry of the Encyclopedia presents a brief introduction to the Middle East, draws a simple picture of its digitization, and claims that Arabization of software could lead to many advancements across the region and eventually the world – for startups and creativity, the Middle East is an area worth watching (Forbes 2017).

References

Elmasri, T., Benni, E., Patel, J., & Moore, J. (2016). Digital Middle East: Transforming the region into a leading digital economy. McKinsey and Company. https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwiG2J2e55LTAhXoiVQKHfD8CxAQFggfMAE&url=http%3A%2F%2Fwww.mckinsey.com%2F~%2Fmedia%2Fmckinsey%2Fglobal%2520themes%2Fmiddle%2520east%2520and%2520africa%2Fdigital%2520middle%2520east%2520transforming%2520the%2520region%2520into%2520a%2520leading%2520digital%2520economy%2Fdigital-middle-east-finalupdated.ashx&usg=AFQjCNHioXhFY692mS_Qwa6hkBT6UiXYVg&sig2=6udbc7EP-bPs-ygQ18KSLA&cad=rja.
Encyclopedia Britannica. (2017). Available at https://www.britannica.com/place/Middle-East.
Forbes reports on the Middle East. (2017). Available at http://www.forbes.com/sites/natalierobehmed/2013/08/22/forget-oil-tech-could-be-the-next-middle-east-goldmine/.

Mid-East

▶ Middle East

Middle East and North Africa (MENA)

▶ Middle East

Mixture-of-Experts

▶ Ensemble Methods

Mobile Analytics

Ryan S. Eanes
Department of Business Management, Washington College, Chestertown, MD, USA

Analytics, broadly defined, refers to a series of quantitative measures that allow marketers, vendors, business owners, advertisers, and interested
parties the ability to gauge consumer engagement and interaction with a property. When properly deployed and astutely analyzed, analytics can help to inform a range of business decisions related to user experience, advertising, budgets, marketing, product development, and more. Mobile analytics, then, refers to the measurement of consumer engagement with a brand, property, or product via a mobile platform, such as a smartphone or tablet computer.
Despite the fact that the mobile Internet and app markets have exploded in growth over the past decade, and despite the fact that more than half of all American adults now own at least one smartphone, according to the Pew Research Center, marketers have been relatively slow to jump into mobile marketing. In fact, American adults spend at least 20% of their time online via mobile devices; the advertising industry has been playing "catch-up" over the past few years in an attempt to chase this market. Even so, analyst Mary Meeker notes that advertising budgets still devote only about a tenth of their expenditures to mobile – though this is a fourfold increase from just a few years ago.
Any entity that is considering the deployment of a mobile strategy must understand consumer behavior as it occurs via mobile devices. Web usability experts have known for years that online browsing behavior can be casual, with people quickly clicking from one site to another and making judgments about content encountered in mere seconds. Mobile users, on the other hand, are far more deliberate in their efforts – generally speaking, a mobile user has a specific task in mind when he or she pulls out his or her phone. Browsing is far less likely to occur in a mobile context. This is due to a number of factors, including screen size, connection speed, and the environmental context in which mobile activity takes place – the middle of the grocery store dairy case, for example, is not the ideal place for one to contemplate the purchase of an eight-person spa for the backyard.
The appropriate route to the consumer must be considered, as well. This can be a daunting prospect, particularly for small businesses, businesses with limited IT resources, or businesses with little previous web or tech experience. If a complete end-user experience is desired, there are two primary strategies that a company can employ: an all-in-one web-based solution or a stand-alone app.
All-in-one web-based solutions allow the same HTML5/CSS3-based site to appear elegant and functional in a full-fledged computer-based browser while simultaneously "degrading" on a mobile device in such a way that no functionality is lost. In other words, the same underlying code provides the user experience regardless of what technological platform one uses to visit a site. There are several advantages to this approach, including singularity of platform (that is, no need to duplicate properties, logos, databases, etc.), ease of update, unified user experience, and relative ease of deployment. However, there are downsides: full implementations of HTML5 and CSS3 are relatively new. As a result, it can be costly to find a developer who is sufficiently knowledgeable to make the solution as seamless as desired, and who can articulate the solution in such a way that non-developers will understand the full vision of the end product. Furthermore, development of a polished finished product can be time-consuming and will likely involve a great deal of compromise from a design perspective.
Mobile analytics tools are relatively easy to deploy when a marketer chooses to take this route, as most modern smartphone web browsers are built on the same technologies that drive computer-based web browsers – in other words, most mobile browsers support both JavaScript and web "cookies," both of which are typically requisites for analytics tools. Web pages can be "tagged" in such a way that mobile analytics can be measured, which will allow for the collection of a variety of information on visitors. This might include device type, browser identification, operating system, GPS location, screen resolution/size, and screen orientation, all of which can provide clues as to the contexts in which users are visiting the website on a mobile device. Some mainstream web analytics tools, such as Google Analytics, already include a certain degree of information pertaining to mobile users (i.e., it is possible to drill down into reports and determine how many mobile users have visited and what types of devices they were using); however, marketing entities that want a greater degree of insight into the success of their mobile sites will likely need to seek out a third-party solution to monitor performance.
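As an illustration, the following Python sketch (a hypothetical back-end routine, not tied to any particular analytics product; the field names are invented) classifies the kind of information such a page tag collects – the user agent string and screen dimensions – into device type and orientation:

# A minimal sketch of classifying a single analytics "hit" on the server side.
# The hit dictionary stands in for fields a page tag might report.
import re

def classify_hit(hit):
    """Label one hit as 'tablet', 'phone', or 'desktop', plus screen orientation."""
    ua = hit.get("user_agent", "")
    if re.search(r"iPad|Tablet", ua, re.IGNORECASE):
        device = "tablet"
    elif re.search(r"Mobile|iPhone|Android", ua, re.IGNORECASE):
        device = "phone"
    else:
        device = "desktop"
    orientation = "portrait" if hit["screen_w"] < hit["screen_h"] else "landscape"
    return {"device": device, "orientation": orientation}

sample = {"user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) Mobile",
          "screen_w": 375, "screen_h": 667}
print(classify_hit(sample))

Aggregating such classifications over many visits is what allows a report to break traffic down by device type and context.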
There are a number of providers of web-based analytics solutions that cover mobile web use. These include, but are not limited to, ClickTale, which offers mobile website optimization tools; comScore, which is known for its audience measurement metrics; Flurry, which focuses on use and engagement metrics; Google, which offers both free and enterprise-level services; IBM, which offers the ability to record user sessions and perform deep analysis on customer actions; Localytics, which offers real-time user tracking and messaging options; Medio, which touts "predictive" solutions that allow for custom content creation; and Webtrends, which incorporates other third-party (e.g., social media) data.
The other primary mobile option is the development of a stand-alone smartphone or tablet app. Stand-alone apps are undeniably popular, given that 50 billion apps were downloaded from the Apple App Store between July 2008 and June 2014. A number of retailers have had great success with their apps, including Amazon, Target, Zappos, Groupon, and Walgreens, which speaks to the potential power of the app as a marketing tool. However, consider that there are more than one million apps in the Apple App Store alone, as of this writing – those odds greatly reduce the chances that an individual will simply "stumble across" a company's app in the absence of some sort of viral advertising, breakout product, or buzzworthy word-of-mouth. Furthermore, developing a successful and enduring app can be quite expensive, particularly considering that a marketer will likely want to make versions of the app available for both Apple iOS and Google Android (the two platforms are incompatible with each other). Estimates for app development vary widely, from a few thousand dollars at the low end all the way up to six figures for a complex app, according to Mark Stetler of AppMuse – and these figures do not include ongoing updates, bug fixes, or recurring content updates, all of which require staff with specialized training and know-how.
If a full-fledged app or redesigned website proves too daunting or beyond the scope of what a marketer needs or desires, there are a number of other techniques that can be used to reach consumers, including text and multimedia messaging, email messaging, mobile advertising, and so forth. Each of these techniques can reveal a wealth of data about consumers, so long as the appropriate analytic tools are deployed in advance of the launch of any particular campaign.
Mobile app analytics are quite different from web analytics in a number of ways, including the vocabulary. For example, there are no page views in the world of app analytics – instead, "screen views" are referenced. Likewise, an app "session" is analogous to a web "visit." App analytics often have the ability to access and gauge the use of various features built into a phone or tablet, including the accelerometer, GPS, and gyroscope, which can provide interesting kinesthetic aspects to user experience considerations. App analytics tools are also typically able to record and retain data related to offline usage for transmission when a device has reconnected to the network, which can provide a breadth of environmentally contextual information to developers and marketers alike. Finally, multiple versions of a mobile app can exist "in the wild" simultaneously because users' proclivities differ when it comes to updating apps. Most app analytic packages have the ability to determine which version of an app is in use so that a development team can track interactional differences between versions and confirm that bugs have been "squashed."
As mentioned previously, marketers who choose to forego app development and develop a mobile version of their web page often choose to stick with their existing web analytics provider, and oftentimes these providers do not provide a level of detail regarding mobile engagement that would prove particularly useful to marketers who want to capture a snapshot of mobile users. In many cases, companies simply have not given adequate consideration to mobile engagement, despite the fact that it is a growing segment of
online interaction that is only going to grow, particularly as smartphone saturation continues. However, for those entities that wish to delve further into mobile analytics, there are a growing number of options available, with a few key differences between the major offerings. There are both free and paid mobile analytics platforms available; the key differentiator between these offerings seems to come down to data ownership. A third-party provider that shares the data with you, like Google, is more likely to come at a bargain price, whereas a provider that grants you exclusive ownership of the data is going to come at a premium. Finally, implementation will make a difference in costs: SaaS (software-as-a-service) solutions, which are typically web based, run on the third-party service's own servers, and are relatively easy to install, tend to be less expensive, whereas "on-premises" solutions are both rare and quite expensive.
There are a small but growing number of companies that provide app-specific analytic tools, typically deployed as SDKs (software development kits) that can be "hooked" into apps. These companies include, but are by no means limited to, Adobe Analytics, which has been noted for its scalability and depth of analysis; Artisan Mobile, an iOS-focused analytics firm that allows customers to conduct experiments with live users in real time; Bango, which focuses on ad-based monetization of apps; Capptain, which allows specific user segments to be identified and targeted with marketing campaigns; Crittercism, which is positioned as a transaction-monitoring service; Distimo, which aggregates data from a variety of platforms and app stores to create a fuller position of an app in the larger marketplace; ForeSee, which has the ability to record customer interactions with apps; and Kontagent, which touts itself as a tool for maintaining customer retention and loyalty.
As mobile devices and the mobile web grow increasingly sophisticated, there is no doubt that mobile analytics tools will also grow in sophistication. Nevertheless, it would seem that there is a wide range of promising toolkits already available to the marketer who is interested in better understanding customer behaviors and increasing customer retention, loyalty, and satisfaction.

Cross-References

▶ Data Aggregation
▶ Network Data

Further Reading

Meeker, M. Internet trends 2014. http://www.kpcb.com/insights/2014-internet-trends. Accessed September 2014.
Smith, A. Smartphone ownership 2013. Pew Research Center. http://www.pewinternet.org/2013/06/05/smartphone-ownership-2013/. Accessed September 2014.
Stetler, M. How much does it cost to develop a mobile app? AppMuse. http://appmuse.com/appmusing/how-much-does-it-cost-to-develop-a-mobile-app/. Accessed September 2014.

Multiprocessing

Joshua Lee
Schar School of Policy and Government, George Mason University, Fairfax, VA, USA

Synonyms

Parallel processing

Introduction

Multiprocessing is the utilization of separate processors to complete a given task on a computer. For example, on modern computers, Microsoft Word (or just about any executable program) would be a single process. By contrast, multi-threading is done within a single process such that a single process can have multiple threads. A multiprocessing approach to computation uses more physical hardware (i.e., additional processors) to improve speed, whereas a multi-threading
approach uses more threads within a single processor to improve speed. Both are meant to optimize performance, but the conditions in which they thrive are different. When utilized correctly, multiprocessing increases overall data throughput, whereas multi-threading increases the efficiency and minimizes the idleness of each process. While the two concepts are closely related, it is important to note that they have significant differences, particularly when it comes to dealing with Big Data. In this entry, we will focus on the core differences that make multiprocessing distinct from multi-threading and what those differences mean for Big Data.

Concurrent Versus Parallel Processing
The terms concurrent and parallel appear frequently in any discussion of multiprocessing, particularly when it is compared with multi-threading. While there is significant overlap between them, they are nevertheless distinct terms. In short, multiple threads are run concurrently, whereas multiple processes are run in parallel. What does this difference actually mean, though?
As an example, imagine that to win a competition you must do 50 push-ups and 50 sit-ups. The concurrent approach to doing this task would be to switch back and forth between them – for example, do one push-up, one sit-up, one push-up, one sit-up, etc. Whereas that would be extraordinarily inefficient for a human being to do, that is what multi-threading does, and it is quite efficient at it. By contrast, the parallel approach to the task would be to have you do all 50 push-ups while having your friend join you in the competition and do 50 sit-ups simultaneously. In this case, the parallel approach would undoubtedly be faster.

Programming for Multiprocessing
In terms of computer programming, multiprocessing offers multiple advantages over multi-threading. First, there are no race conditions with multiprocessing, since each process has a distinct area of memory it operates in (unlike multi-threading, which shares the same memory space). There is no reason to worry about which process finishes first and thus might modify some bit of memory that another process requires.
Second, it avoids problems with the "global interpreter lock" (GIL) utilized by several programming languages such as Python and Ruby. The GIL generally prevents multiple threads from running at once on a single processor when using that programming language. As such, depending on how it is implemented, it can almost negate performance improvements from multi-threading. At the same time, the GIL is used because it can significantly increase single-threaded performance and allow the programmer not to have to worry about utilizing C-based libraries that are not thread-safe. However, that doesn't help us when we want to use multi-threading.
Third, child processes are usually easy and safe to interrupt and/or kill when they are done with their task. By contrast, since threads share memory, it can be significantly more complex to safely kill a thread that is done with its task, potentially leaving it wasting resources until the remaining threads are done.
Fourth, there isn't the issue of deadlock with multiprocessing (as compared to multi-threading). Deadlock occurs when Thread A needs a resource that Thread B has while Thread B needs a resource that Thread A has. When this happens, both threads wait indefinitely for the other thread to drop their resource but never end up doing so. Because of separate memory spaces (and significantly less communication between processes), deadlock doesn't occur with multiprocessing.
All of this means that, in general, writing efficient code for a multiprocessing environment is much simpler than writing efficient code in a multi-threaded environment. At the same time, the multiprocessing code will require more physical capabilities (i.e., more processing cores) to be able to run.

Performance when Multiprocessing
A multiprocessing approach will often be faster than a multi-threading approach, albeit with some important caveats. This is because there are several features of a given task that can make multiprocessing the slower solution.
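The following Python sketch (the work function and numbers are illustrative only, not from the original entry) shows the parallel approach described above using the standard multiprocessing module, in which each worker process receives its own memory space:

# A minimal sketch of multiprocessing: four worker processes run a CPU-heavy task in parallel.
from multiprocessing import Pool

def cpu_heavy(n):
    # A stand-in for a computational task: sum of squares up to n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000] * 8
    with Pool(processes=4) as pool:   # four worker processes, each with its own memory
        results = pool.map(cpu_heavy, inputs)
    print(len(results), "tasks completed")

Because each worker is a separate process, this sketch sidesteps the global interpreter lock discussed above, at the cost of the extra overhead of starting the processes.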
First, whereas threads exist in the same memory space and can communicate between one another quite easily, communication between processes requires inter-process communication (IPC), a requirement with vastly greater overhead. If there is a significant requirement for communication between two separate parallel processes, then this additional overhead may slow it down enough for a multi-threading approach to have better performance.
Second, multiple processes require both separate memory spaces and more memory for each process, whereas multiple threads share memory spaces. This can add additional hardware requirements in terms of memory to run successfully. If the processes themselves are comparatively lightweight, the additional overhead of creating the processes in the first place may outweigh their ability to run faster when comparing performance.
Finally, the required overhead of each process means that having a process lay idle for any significant period becomes a waste of resources. If a given task is only run once during a program's execution, and that run only occurs for 10% of the overall time of the program's execution, it is questionable whether the task should be given its own process by itself due to the wasted resources.

Multiprocessing Versus Multi-Threading Usage with Big Data
With a review of programming and performance issues in hand, what does this mean when dealing with Big Data? While there are endless kinds of programming tasks that might need to be done, in general, we can abstract these tasks out to either I/O tasks or raw computational tasks. An I/O task is one where the inputting and outputting of data is central to completing the task. This would include frequently writing to or reading from files, scraping large quantities of web pages off the internet, or accepting user input from the keyboard, among other tasks. By contrast, a more computational task in this context could involve sorting/searching through massive troves of data, performing machine learning training or inference, or applying any kind of en masse mathematical transformation to one's data. While there may be tasks that don't easily fit into either one, this basic typology should provide a starting point for when to use multiprocessing.
In general, multi-threading should be utilized for I/O tasks, whereas multiprocessing should be utilized for computational tasks. This split mainly derives from the issues of overhead, idleness, and IPC. Multiprocessing requires significantly more overhead, which means that I/O tasks being idle and/or requiring communication between separate processes during computation would be a significant slowdown. If your processes are lying idle for any significant stretch of time, it is simply a waste of resources. When it comes to I/O tasks, these idle times can be significantly more common because input and output (usually) are not constantly occurring, whereas, say, training a machine learning model is almost never idle. Consider the case of a web page: a web server can have multiple tasks going on simultaneously. To name a few, this includes tracking user clicks, sending information to back-end databases, and displaying the proper HTML in the users' browser. However, none of these individual tasks are being run constantly – the user is not continually making clicks, back-end databases don't need to be constantly appended every millisecond, and new HTML doesn't need to be constantly displayed to the user. Were we to assign three separate processes to each of these three tasks, many of those processes would remain idle for significant amounts of time, wasting valuable computing resources. By contrast, a concurrent multi-threaded approach would be much more fitting on a web server because all the tasks on a server don't need to be executed all the time; even if one or more threads is idle, the others in the process can take up the slack and make the most efficient use of a single process.
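A minimal Python sketch of this division of labor is shown below (the task functions are stand-ins, not from the original entry): threads are used for the I/O-style task that spends most of its time waiting, and processes are used for the computational task:

# Threads for I/O-bound work, processes for CPU-bound work.
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool
import time

def io_task(name):
    # Stand-in for an I/O-bound task (e.g., a network request or file read):
    # the thread mostly waits, so other threads can run in the meantime.
    time.sleep(0.5)
    return name

def cpu_task(n):
    # Stand-in for a computational task: constant work with no waiting.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as ex:   # threads for I/O-bound tasks
        fetched = list(ex.map(io_task, ["item%d" % i for i in range(8)]))
    with Pool(processes=4) as pool:                  # processes for CPU-bound tasks
        computed = pool.map(cpu_task, [1_000_000] * 4)
    print(len(fetched), len(computed))

Conclusion
In conclusion, multi-threading and multiprocessing both have their place at the Big Data table. Neither is a perfect solution for all occasions, but each has circumstances in which it is superior. While using multiprocessing at the wrong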
time can lead to massive wastes of computational resources, using multi-threading at the wrong time can lead to no improvement (or even decreases) in performance. Indeed, the purposeful and careful combination of multi-threading together with multiprocessing (that is, multiple processes each with multiple threads) will ensure optimal performance for a wide array of Big Data-oriented tasks.

Cross-References

▶ Multi-Threading
▶ Parallel Processing

Further Reading

Bellairs, R. (2019, April 10). How to take advantage of multithreaded programming and parallel programming in C/C++. Retrieved from PERFORCE: https://www.perforce.com/blog/qac/multithreading-parallel-programming-c-cpp#:~:text=Parallel%20programming%20is%20a%20broad,set%20(thread)%20of%20instructions.&text=These%20threads%20could%20run%20on%20a%20single%20processor.
Nagarajan, M. (2019, December 2). Concurrency vs. parallelism — A brief view. Retrieved from Medium: https://medium.com/@itIsMadhavan/concurrency-vs-parallelism-a-brief-review-b337c8dac350.
Rodrigues, G. S. (2020, September 27). Multithreading vs. multiprocessing in Python. Retrieved from Towards Data Science: https://towardsdatascience.com/multithreading-vs-multiprocessing-in-python-3afeb73e105f.

Multi-threading

Joshua Lee
Schar School of Policy and Government, George Mason University, Fairfax, VA, USA

Introduction

Multi-threading is the utilization of multiple threads to complete a given task (i.e., process) in parallel. It is one of the fundamental mechanisms through which data processing and execution can be substantially sped up. However, utilizing multi-threading properly is more complicated than simply hitting an "on" switch. In fact, if it is poorly implemented, multi-threading may produce no discernable impact or even decrease the performance of certain tasks (https://brooker.co.za/blog/2014/12/06/random.html). Therefore, any project working with Big Data should thoroughly study multi-threading before implementing it.

A Basic Example
Let process A need to divide 500 integers by 5 and print the results for each. Without multi-threading (i.e., with a single thread), the integers are divided serially (i.e., one at a time) until all 500 operations are complete. By contrast, with multi-threading, the integers are divided in parallel by different threads. If the process uses 2 threads, it would split the 500 integers into 2 lists of 250 each. Then, each thread would work on its assigned list. Theoretically, this could double the completion speed. However, realistically this speedup will be lower due to the additional processing overhead from the additional thread and if there are sections of code that can't be parallelized.
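A minimal Python sketch of this two-thread example is shown below, using the standard threading module (note that in CPython the global interpreter lock means a CPU-bound example like this one may not actually run faster, as discussed under Multiprocessing):

# Two threads each divide half of the 500 integers by 5, as described above.
import threading

numbers = list(range(1, 501))            # 500 integers to divide by 5
halves = [numbers[:250], numbers[250:]]  # one list of 250 integers per thread

def divide_all(chunk):
    for n in chunk:
        print(n / 5)

threads = [threading.Thread(target=divide_all, args=(chunk,)) for chunk in halves]
for t in threads:
    t.start()
for t in threads:
    t.join()                             # wait for both threads to finish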

more restricted process that has less overhead and forth between different processes (each of which
can’t run processes inside it. has at least one thread), the machine creates the
Overhead – Indirect or excess computation illusion to the user that all processes are being run
time. Excessive overhead will substantially slow simultaneously.
down the performance. It also increases linearly – Because of this, it is sometimes possible to
four threads will have roughly four times the efficiently use more threads than there are cores.
overhead as one thread. For example, assume there is a dual core machine
with core A and core B. In addition, there is
Process Versus Thread thread A, B, C, and D; all four of which are
While theoretically, one process and one thread contained within process A. Threads A and
could be instructed to perform the same task, they B are running on core A, and threads C and
have some key distinctions that affect how they’re D are running on core B. If threads A and C are
utilized. Specifically: frequently sleeping or waiting, then their respec-
tive cores can run threads B and D during this
• Processes are generally used for “heavy- period, allowing for the efficient use of more
weight,” major tasks, whereas threads are gen- threads than there are cores.
erally used for “lightweight,” minor tasks. This
is because a process can have one or more
threads inside it, but a thread cannot have a Threads and Shared Memory Access
process inside.
• A process has much larger overhead than a One of the greatest benefits of multi-threading is
thread – starting up and managing a new pro- that the threads share memory access and can thus
cess is itself a computationally intensive task work together to complete the same task. How-
that can slow down the performance. ever, this benefit also has a substantial drawback
• Different threads within a single process share that must be taken into consideration. What hap-
the same address space (memory), whereas pens if two threads running in a single process
different processes on an operating system do determine that they need to modify the same
not. Sharing the same address space allows addressed memory (i.e., variable) at the same
different threads to access the same variables time? What determines which thread should get
in memory and to communicate with one priority, and how is this conflict managed? Solu-
another quickly and easily. By contrast, shar- tions to this problem are classified under the term
ing information between processes (known as thread synchronization.
inter-process communication, or IPC) is far Different programming languages utilize dif-
more computationally intensive. ferent solutions to the issue of shared memory
• However, using multiple processes (versus access. One of the most common solutions is via
multiple threads) allows for more isolation for mutual exclusion (mutex). With mutex, an object
each process – processes cannot directly inter- (i.e., memory/variable/address space) is “locked”
act with each other’s memory/variables, which by one thread. Any other thread which attempts to
can be useful for some tasks. This adds an access the locked object is refused access. Other
inherent layer of security between processes methods of synchronization, beyond the scope of
that don’t exist between threads. this guide, include barriers, semaphores, and
spinlocks (https://msdn.microsoft.com/en-us/
Core Versus Thread library/ms228964(v¼vs.110).aspx).
One potentially confusing aspect of understand- There are many varieties of mutual exclusion,
ing multi-threading is the relationship between but two of the most common are queuing mutex
threads and cores. A normal single core can run and read/write mutex, also known as shared
only a single thread at a time. However, even on a mutex. A queuing mutex creates a FIFO frame-
single-core machine, by swiftly moving back and work for threads requesting a locked object. For
Common Multi-threading Design Patterns

There are many common design patterns for multi-threaded programming, which are naturally efficient. Generally, pre-existing design patterns should be considered before attempting to invent a new one.
Boss-Worker Thread Pattern: In the boss-worker thread pattern, there is one thread, which is the "boss," and all other threads are "workers." When new tasks need to be completed, the boss thread assigns the task to a given worker thread, creating a new thread on the spot if none are available. This is one of the simplest and most common patterns – it allows for ease of use and ease of debugging. However, it can also create problems of contention between threads if they require interdependent resources.
Pipeline Pattern: In the pipeline pattern, each thread completes a portion of a given task and then passes it on to the next thread. This is also a simple pattern and can be most useful when there are discrete steps that need to be completed that are sequential in nature. However, it can also require substantial fine-tuning to ensure that each stage of the pipeline doesn't cause a bottleneck. Additionally, the parallelization that can occur is limited by the number of pipeline stages.
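As a sketch of the boss-worker pattern in Python (the queue and task names below are illustrative assumptions, not part of the original entry), the "boss" places work items on a shared queue, and a fixed pool of worker threads repeatedly takes the next item and processes it; a sentinel value tells each worker when to stop. The pipeline pattern can be sketched the same way by chaining one queue per stage, with each stage consuming from one queue and producing into the next.

import queue
import threading

NUM_WORKERS = 4
STOP = object()                  # sentinel placed on the queue to shut a worker down
tasks = queue.Queue()            # thread-safe FIFO shared by the boss and the workers

def worker(worker_id):
    """Worker loop: repeatedly take the next task from the queue and process it."""
    while True:
        item = tasks.get()
        if item is STOP:
            break
        print("worker", worker_id, "squares", item, "->", item * item)

# The "boss": start the workers, hand out tasks, then signal shutdown.
workers = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for w in workers:
    w.start()
for n in range(20):
    tasks.put(n)                 # new tasks are assigned simply by enqueueing them
for _ in workers:
    tasks.put(STOP)              # one sentinel per worker
for w in workers:
    w.join()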
Common Pitfalls in Multi-threading

Implementing a multi-threaded design also comes with certain pitfalls to avoid. Different types of mutex locking and different design patterns must deal with these pitfalls to varying degrees, but they always need to be taken into design consideration.
Race Conditions: A race condition is where thread B's processing interferes with thread A's processing due to them both being run simultaneously. For example, consider functions X and Y below:

Function X(integer C){
    if C == 5:
        return True;
    else:
        return False;
}
Function Y(integer C){
    C = C + 1;
    return C;
}

Next, let thread A call X, thread B call Y, and integer C start at 5. If the threads are set to run their functions at the same time, will X return true or false? The answer is, it depends – it's an unstable race between the two threads.
Thus, if we run this experiment 100 times, there will not be consistency in the result. Threads A and B are racing against one another to decide the result. If thread A happens to finish processing fast enough on one run, it will return true. But sometimes, thread B will run fast enough that thread A will return false. This kind of inconsistent result, given the same conditions and starting point, can cause difficult-to-spot bugs in a multi-threaded program.
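The following sketch translates the X and Y race into runnable Python (the helper names and the trial count are illustrative assumptions). Each trial starts one thread that checks whether the shared value is still 5 while another thread increments it; how many trials report True depends on how the scheduler happens to interleave the two threads, so the count is not guaranteed to be the same from run to run or from machine to machine.

import threading

def run_trial():
    """One trial of the race: thread A checks the shared value while thread B increments it."""
    state = {"c": 5}
    outcome = {}

    def x():                               # thread A's work: is C still equal to 5?
        outcome["x"] = (state["c"] == 5)

    def y():                               # thread B's work: C = C + 1
        state["c"] = state["c"] + 1

    a = threading.Thread(target=x)
    b = threading.Thread(target=y)
    b.start()
    a.start()
    a.join()
    b.join()
    return outcome["x"]

results = [run_trial() for _ in range(100)]
print("X returned True in", sum(results), "of", len(results), "trials")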
Deadlocks: Deadlocks occur when no single thread can execute. Let thread A have a lock for object Z and thread B have the lock for object Y. Thread A will only give up object Z when it can grab the lock for object Y, and thread B will only give up object Y when it can grab the lock for object Z. In this situation, both threads will wait for eternity because neither of their conditions can be fulfilled.

Multi-threaded Design Optimization

Even after shared memory access issues are resolved, pitfalls are compensated for, and a design pattern is chosen, performance can still be further optimized. Especially when dealing with Big Data, even a minor performance increase can substantially impact processing time. Below are some of the most important optimizations to consider:
Granularity: Granularity is a measurement for how much real work is done in each thread. Threads that are sleeping or waiting are not performing real work. For example, we need a program to square 800 integers. Fine granularity would be if more threads each accomplished less work. Thus, the maximum fine granularity would be if 800 threads each performed one squaring operation. Needless to say, this isn't an efficient design. By contrast, coarse granularity would be if few threads each accomplished more work. Thus, maximum coarse granularity would be not to use multi-threading at all.
If the threads are too fine, it creates unnecessary overhead from handling the threads themselves. However, if granularity is too coarse, threads can suffer from a load imbalance – for example, one thread can take 1 h to complete its tasks, whereas another thread only takes 10 min. In that case, the application itself would still take an hour to complete, even though one thread is sitting idly by for most of that period. Therefore, granularity optimization aims to find the proper balance between the two extremes, both in terms of load balancing and overhead minimization.
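A sketch of this trade-off for the 800-integer example, in Python (the chunk sizes and pool size are illustrative assumptions): the same squaring job is split either into 800 single-item tasks (maximum fine granularity) or into a few large chunks (coarse granularity), with the chunk size acting as the tuning knob between scheduling overhead and load balance.

from concurrent.futures import ThreadPoolExecutor

numbers = list(range(800))

def square_chunk(chunk):
    """The real work performed by one task: square every integer in its chunk."""
    return [n * n for n in chunk]

def run(chunk_size, workers=8):
    """Split the job into tasks of chunk_size integers and run them on a thread pool."""
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(square_chunk, chunks))
    return [n for chunk in mapped for n in chunk]

fine = run(chunk_size=1)       # 800 tiny tasks: maximum thread-handling overhead
coarse = run(chunk_size=200)   # 4 larger tasks: little overhead, but risks load imbalance
assert fine == coarse == [n * n for n in numbers]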
Lock Ordering: One method of avoiding deadlock is lock ordering. With lock ordering, locks should be obtained in a fixed order throughout the program. This order is determined by what other threads will need those locks and when they will need them.
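A minimal sketch of lock ordering in Python (the lock names mirror the objects Y and Z used in the Deadlocks example above and are otherwise illustrative): because both threads always acquire the locks in the same fixed order, neither can end up holding one lock while waiting forever for the other.

import threading

lock_y = threading.Lock()      # guards "object Y"
lock_z = threading.Lock()      # guards "object Z"

def worker(name):
    """Every thread takes the locks in the same fixed order: Y first, then Z.
    If one thread instead took Z first, the circular wait described above could occur."""
    for _ in range(10000):
        with lock_y:
            with lock_z:
                pass           # work on the resources guarded by Y and Z goes here

a = threading.Thread(target=worker, args=("A",))
b = threading.Thread(target=worker, args=("B",))
a.start()
b.start()
a.join()
b.join()
print("finished without deadlock")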
Lock Frequency: The act of locking and unlocking itself adds overhead. Analyze your program to see if there are perhaps ways you can minimize this frequency.
Critical Sections: A critical section is a part of the code which must be accomplished serially (i.e., in order and without multiple threads). These sections are naturally time-consuming in any multi-threaded algorithm. Minimizing the size and computational complexity of these critical sections is vital for optimizing performance.

Conclusion

The most important point to remember about utilizing a multi-threaded design is that it's not the solution for every problem. Furthermore, there are numerous structural factors that can inhibit its effectiveness, and the decision on whether to utilize a multi-threaded design should be handled with care. Even after the decision to use a multi-threaded design is made, it may require substantial optimization to obtain the desired performance enhancements.

Further Reading

Lewis, B., & Berg, D. J. (1995). Threads primer: A guide to multithreaded programming. Upper Saddle River, NJ: Prentice Hall Press.
Protopopov, B. V. (1996). Concurrency, multi-threading, and message passing. Master's thesis, Department of Computer Science, Mississippi State University.
Ungerer, T., Robič, B., & Šilc, J. (2003). A survey of processors with explicit multithreading. ACM Computing Surveys (CSUR), 35(1), 29–63.

National Association for the ways that can directly impact disadvantaged
Advancement of Colored minority groups.
People With a membership of over 425,000,
the NAACP is the nation’s largest civil rights
Steven J. Campbell organization. Administered by a 64-member
University of South Carolina, Lancaster, board headed by a chairperson, various depart-
Lancaster, SC, USA ments within the NAACP govern particular areas
of action. The Legal Department tracks court
cases with potentially extensive implications for
The National Association for the Advancement of minorities, including recurring discrimination in
Colored People (NAACP) is an African- areas such as education and employment. The
American civil rights organization headquartered Washington, D.C., office lobbies Congress and
in Baltimore, MD. Founded in 1909, its member- the Presidency on a wide range of policies and
ship advocates civil rights by engaging in activi- issues, while the Education Department seeks
ties such as mobilizing voters and tracking equal improvements in the sphere of public education.
opportunity in government, industry, and commu- Overall, the NAACP’s mission is to bolster equal
nities. Over the past few years, the NAACP has rights for all people in political, educational, and
shifted its attention to digital advocacy and the economic terms as well as stamp out racial biases
utilization of datasets to better mobilize activists and discrimination.
online. In the process, the NAACP has become a In order to extend this mission into the twenty-
leading organization in how it harnesses big data first century, the NAACP launched a digital media
for digital advocacy and related campaigns. The department in 2011. This entailed a mobile sub-
NAACP’s application of specially tailored data to scriber project that led to 423,000 contacts,
its digital approach, from rapid response to 233,000 Facebook supporters, and 1.3 million
targeted messaging to understanding recipients’ email subscribers, due in large part to greater
interests, has become an example for other groups social media outreach. The NAACP’s “This is
to follow. At the same time, the NAACP has my Vote!” campaign, launched prior to the 2012
challenged other big data (both in the public and presidential election, dramatically advanced the
private sectors), highlighting abuse of such data in organization’s voter registration and mobilization

programs. As a result, the NAACP registered Controversy


twice the number of individuals – over
374,000 – than it did in 2008 and mobilized over While government and commercial surveillance
1.2 million voters. In addition, the NAACP potentially affect all Americans, minorities face
conducted an election eve poll that surveyed these risks at disproportionate rates. Thus, the
1,600 African-American voters. This was done NAACP has raised concerns about whether big
in order to assess their potential influence as well data needs to provide greater protections for
as key issue areas prior to the election results and minorities in addition to the general privacy pro-
in looking forward to 2016. Data from the poll tections commonly granted. Such controversy
highlighted the predominant role played by surrounding civil rights and big data may not be
African-Americans in major battleground states self-evident; however, big data often involves the
and divulged openings for the Republican Party targeting and segmenting of one type of individual
in building rapport with the African-American from another. This serves as a threat to basic civil
community. In addition, the data signaled to Dem- rights –which are protected by law – in ways that
ocrats a message not to assume levels of Black were inconceivable in recent decades. For
support in 2016 on par with that realized in the instance, the NAACP has expressed alarm regard-
2008 and 2012 elections. ing the collection of information by credit
By tailoring its outreach to individuals, the reporting agencies. Such collections can result in
NAACP has been successful in achieving rela- the making of demographic profiles and stereo-
tively high rates of engagement. The organization typical categories, leading to the marketing of
segments supporters based on their actions, such predatory financial instruments to minority
as whether they support a particular issue based on groups.
past involvement. For instance, many NAACP The US government’s collection of massive
members view gun violence as a serious problem phone records for purposes of intelligence has
in today’s society. If such a member connects with also drawn harsh criticism from the NAACP as
NAACP’s online community via a particular well as other civil rights organizations. They
webpage or internet advertisement, s/he will be have vented warnings regarding such big
recognized as one espousing stronger gun control data by highlighting how abuses can uniquely
laws. Future outreach will entail tailored mes- affect disadvantaged minorities. The NAACP
sages expressing attributes that resonate on a per- supports principles aimed at curtailing the
sonal level with the supporter, not unlike that from pervasive use of data in areas such as law
a friend or colleague. enforcement and employment. Increasing
The NAACP also takes advantage of major collections of data are viewed by the NAACP
events that reflect aspects of the organization’s as a threat since such big data could allow
mission statement. Preparation for such moments for unjust targeting of, and discrimination
entails much advance work, as evidenced in the against, African-Americans. Thus, the NAACP
George Zimmerman trial involving the fatal strongly advocates measures such as a stop to
shooting of 17-year-old Trayvon Martin. As the “high-tech profiling,” greater pressure on private
trial was concluding in 2013, the NAACP formed industry for more open and transparent data,
contingency plans in advance of the court’s deci- and greater protections for individuals from
sion. Website landing pages and prewritten emails inaccurate data.
were set in place, adapted for whatever result may
come. Once the verdict was read, the NAACP sent
out emails within 5 min that detailed specific Cross-References
actions for supporters to take. This resulted in
over a million petition signatures demanding ▶ Demographic Data
action on the part of the US Justice Department, ▶ Facebook
which it eventually took. ▶ Pattern Recognition

Further Reading the oceans to the state of the sun, and to better
safeguard and preserve seashores and marine life.
Fung, Brian (27 Feb 2014). Why civil rights groups are NOAA provides alerts to dangerous weather,
warning against ‘big data’. Washington Post. http://
maps the oceans and atmosphere, and directs the
www.washingtonpost.com/blogs/the-switch/wp/2014/
02/27/why-civil-rights-groups-are-warning-against- responsible handling and safeguarding of the seas
big-data/. Accessed Sept 2014. and coastal assets. One key way NOAA pursues
Murray, Ben (3 Dec 2013). What brands can learn about its mission is by conducting research in order to
data from the NAACP: Some advocacy groups are
further awareness and better management of envi-
ahead of the curve, making smarter data decisions.
Advertising Age. http://adage.com/article/datadriven- ronmental resources. With a workforce of over
marketing/brands-learn-data-advocacy-groups/245 12,000, NOAA consists of six major line offices,
498/. Accessed Sept 2014). including the National Weather Service (NWS), in
NAACP. http://www.NAACP.org. Accessed Sept 2014.
addition to over a dozen staff offices.
NOAA’s collection and dissemination of vast
sums of data on the climate and environment
contribute to a multibillion-dollar weather enter-
National Oceanic and prise in the private sector. The agency has sought
Atmospheric Administration ways to release extensive new troves of this data,
an effort that could be of great service to industry
Steven J. Campbell and those engaged in research. NOAA
University of South Carolina Lancaster, announced a call in early 2014 for ideas from
Lancaster, SC, USA the private sector to assist the agency’s efforts in
freeing up a large amount of the 20 terabytes of
data that it collects on a daily basis pertaining to
The National Oceanic and Atmospheric Adminis- the environment and climate change. In
tration (NOAA) is an agency housed within the exchange, researchers stand to gain critical
US Commerce Department that monitors the sta- access to important information about the planet, N
tus and conditions of the oceans and the atmo- and private companies can receive help and
sphere. NOAA oversees a diverse array of assistance in advancing new climate tools and
satellites, buoys, ships, aircraft, tide gauges, and assessments.
supercomputers in order to closely track environ- This request by NOAA shows that it is plan-
mental changes and conditions. This network ning to place large amounts of its data into the
yields valuable and critical data that is crucial for cloud, benefitting both the private and public sec-
alerting the public to potential harm and pro- tors in a number of ways. For instance, climate
tecting the environment nationwide. The vast data collected by NOAA is currently employed
sums of data collected daily have served as a for forecasting the weather over a week in
challenge to NOAA in storing as well as making advance. In addition, marine navigation and off-
the information readily accessible and meaningful shore oil and gas drilling operations are very
to the public and interested organizations. In the interested in related data. NOAA has pursued
future, as demand grows for ever-greater amounts unleashing ever-greater amounts of its ocean and
and types of climate data, NOAA must be atmospheric data by partnering with groups out-
resourceful in meeting the demands of public side government. This is seen as paramount to
officials and other interested parties. NOAA’s data management, where tens of
First proposed by President Richard Nixon, petabytes of information are recorded in various
who wanted a new department in order to better ways, engendering over 15 million results daily –
protect citizens and their property from natural from weather forecasts for US cities to coastal tide
dangers, NOAA was founded in October 1970. monitoring – which totals twice the amount of all
Its mission is to comprehend and foresee varia- the printed collections of the US Library of
tions in the environment, from the conditions of Congress.

Maneuvering through NOAA’s mountain of NOAA’s efforts to build a Weather-Ready


weather and climate data has proved to be a Nation have evolved from a foundation of super-
great challenge over the years. To help address computer advancements that have permitted
this issue, NOAA made available, in late 2013, an more accurate storm-tracking algorithms for
instrument that helped further open up the data to weather prediction. First launched in 2011, this
the public. With a few clicks of a mouse, individ- initiative on the part of NOAA has resulted in
uals can create interactive maps illustrating natu- advanced services, particularly in ways that data
ral and manmade changes in the environment and information can be made available to the
worldwide. For the most part, the data is free to public, government agencies, and private
the public, but much of the information has not industry.
always been organized in a user-friendly format.
NOAA’s objective was to bypass that issue and
allow public exploration of environmental condi- Cross-References
tions from hurricane occurrences to coastal tides
to cloud formations. The new instrument, named ▶ Climate Change, Hurricanes/Typhoons/Cyclones
NOAA View, allows ready access to many of ▶ Cloud Computing
NOAA’s databases, including simulations of ▶ Data Storage
future climate models. These datasets grant users ▶ Environment
the ability to browse various maps and informa- ▶ Predictive Analytics
tion by subject and time frame. Behind the scenes,
numerous computer programs manipulate
datasets into maps that can demonstrate environ- Further Reading
mental attributes and climate change over time.
NOAA View’s origins were rooted in data visual- Freedman, A. (2014, February 24). U.S. readies big-data
dump on climate and weather. http://mashable.com/
ization instruments present on the web, and it is
2014/02/24/NOAA-data-cloud/. Accessed September
operational on tablets and smartphones that 2014.
account for 44% of all hours spent online by the Kahn, B. (2013). NOAA’s new cool tool puts climate on
US public. view for all. http://www.climatecentral.org/news/
noaas-new-cool-tool-puts-climate-on-view-for-all-
Advances to NOAA’s National Weather Ser-
16703. Accessed September 2014.
vice supercomputers have allowed for much faster National Oceanic and Atmospheric Administration
calculations of complex computer models, (NOAA). www.noaa.gov. Accessed September 2014.
resulting in more accurate weather forecasts. The
ability of these enhanced supercomputers to ana-
lyze mounds of scientific data proves vital in
helping public officials, communities, and indus- National Organization for
trial groups to better comprehend and prepare for Women
perils linked with turbulent weather and climatic
occurrences. Located in Virginia, the supercom- Deborah Elizabeth Cohen
puters operate with 213 teraflops (TF) – up from Smithsonian Center for Learning and Digital
the 90 TF with the computers that came before Access, Washington, DC, USA
them. This has helped to produce an advanced
Hurricane Weather Research and Forecasting
(HWRF) model that the National Weather Service The National Organization for Women (NOW) is
can more effectively employ. By allowing more an American feminist organization that is the
effective monitoring of violent storms and more grassroots arm of the women’s movement and
accurate predictions regarding the time, place, and the largest organization of feminist activists in
intensity of their impact, the HWRF model can the United States. Since its founding in 1966,
result in saved lives. NOW has engaged in activity to bring about

equality for all women. NOW has been partici- • Metadata collection renders legal protection of
pating in recent dialogues to identify how com- civil rights and liberties less enforceable, undo-
mon big data working methods lead to ing civil rights law.
discriminatory practices against protected clas-
ses including women. This entry discusses Comprehensive US civil rights legislation in
NOW’s mission and issues related to big data the 1960s and 1970s resulted from social actions
and the activities NOW has been involved with organized to combat discrimination. A number of
to end discriminatory practices resulting from the current big data practices are in misalignment with
usage of big data. these laws and can lead to discriminatory
As written in its original statement of pur- outcomes.
pose, the purpose of NOW is to take action to NOW has been involved with several impor-
bring women into full participation in the main- tant actions in response to these recognized prob-
stream of American society, exercising privi- lems with big data. In January of 2014, the US
leges and responsibilities in completely equal White House engaged in a 90-day review of big
partnership with men. NOW strives to make data and privacy issues, to which NOW as a
change through a number of activities including participating stakeholder provided input.
lobbying, rallies, marches, and conferences. Numerous policy recommendations resulted
NOW’s six core issues are economic justice, from this process especially related to data pri-
promoting diversity and ending racism, lesbian vacy and the need for the federal government to
rights, ending violence against women, consti- develop technical expertise to stop
tutional equality, and access to abortion and discrimination.
reproductive health. The NOW Foundation also belongs to a coa-
NOW’s current president Terry O’Neill has lition of 200 progressive organizations named
stated that big data practices can render obsolete the Leadership Conference on Civil and
the USA’s landmark civil rights and anti- Human Rights whose mission is to promote the
discrimination laws with special challenges for civil and human right of all persons in the United N
women, the poor, people of color, trans-people, States. NOW President Terry O’Neill serves on
and the LGBT community. While the technolo- the Coalition’s Board of Directors. In February
gies of automated decision-making are hidden and 2014, The Leadership Conference released five
largely not understood by average people, they are “Civil Rights Principles for the Era of Big Data”
being conducted with an increasing level of per- and in August 2014 provided testimony based
vasiveness and used in contexts that affect indi- on their work to the US National Telecommuni-
viduals’ access to health, education, employment, cations and Information Administration’s
credit, and products. Problems with big data prac- Request for Public Comment related to Big
tices include the following: Data and Consumer Privacy. The five civil rights
principles to ensure that big data is designed and
• Big data technology is increasingly being used used in ways that respect the values of equal
to assign people to ideologically or culturally opportunity and equal justice include the
segregated clusters, profiling them and in following:
doing so leaving room for discrimination.
• Through the practice of data fusion, big data 1. Stop high tech profiling – ensure that clear
tools can reveal intimate personal details, erod- limits and audit mechanisms are in place to
ing personal privacy. make sure that data gathering and surveillance
• As people are often unaware of this “scoring” tools that can assemble detailed information
activity, it can be hard for individuals to break about a person or group are used in a respon-
out of being mislabeled. sible and fair way.
• Employment decisions made through data 2. Ensure fairness in automated decisions –
mining have the potential to be discriminatory. require through independent review and

other measures that computerized decision- Further Reading


making systems in areas such as employment,
health, education, and lending operate fairly Big data: Seizing opportunities, preserving values. (2014).
Washington, DC: The White House. www.whitehouse-
for all people and protect the interests of
gov/sites/default/files/docs/big-data-privacy-report-5.1.1.
those that are disadvantaged and have histor- 14-final-print.pdf. Accessed 7 Sep 2014.
ically been discriminated against. Systems Eubanks, V. (2014). How big data could undo our civil-
that are blind to preexisting disparities can rights laws. The American Prospect. www.prospect.
org/article/how-big-data-could-undo-our-civil-rights-
easily reach decisions that reinforce existing
laws. Accessed 7 Sep 2014.
inequities. Gangadharan, S. P. (2014). The dangers of high-tech profil-
3. Preserve constitutional principles – govern- ing, using big data. The New York Times. www.nytimes.
ment databases must not be allowed to under- com/roomfordebate/204/08/06/Is-big-data-spreading-
inequality/the-dangers-of-high-tech-profiling-using-
mine core legal protections, including those of
big-data. Accessed 5 Sep 2014.
privacy and freedom of association. Indepen- NOW website. (2014). Who we are. National Organization
dent oversight of law enforcement is particu- for Women. http://now.org/about/who-we-are/. Accessed
larly important for minorities who often 2 Sep 2014.
The Leadership Conference on Civil and Human Rights.
receive disproportionate scrutiny.
(2014). Civil rights principles for the era of big data.
4. Enhance individual control of personal infor- www.civilrights.org/press/2014/civil-rights-principles-
mation – individuals, and in particular those in big-data.html. Accessed 7 Sep 2014.
vulnerable populations including women and
the LGBT community, should have meaningful
and flexible control over how a corporation
gathers data from them and how it uses and National Security
shares that data. Nonpublic information should Administration (NSA)
not be shared with the government without
judicial process. ▶ Data Mining
5. Protect people from inaccurate data – Govern-
ment and corporate databases must allow
everyone to appropriately ensure the accuracy
of personal information used to make impor- National Security Agency
tant decisions about them. This requires disclo- (NSA)
sure of the data and the right to correct it when
inaccurate. Doug Tewksbury
Communication Studies Department, Niagara
Big data has been called the civil rights battle University, Niagara, NY, USA
of our time. Consistent with its mission, NOW is
engaged in this battle, protecting civil rights of
women and others against discriminatory prac- The National Security Agency (NSA) is the US
tices that can result from current big data governmental agency responsible for collecting,
practices. processing, analyzing, and distributing signal-
based intelligence information to support military
and national security operations, as well as pro-
Cross-References viding information security for US governmental
agencies and its allies. Alongside the Central
▶ Data Fusion Security Service (CSS), which serves as a liaison
▶ Data Mining between the NSA and military intelligence-
▶ National Oceanic and Atmospheric gathering agencies, the NSA/CSS serves as 1 of
Administration 17 intelligence agencies in the American govern-
▶ White House Big Data Initiative ment, reporting equally to the Department of

Defense and the Director of National Intelligence. Agency History and Operations
Its central mission is to use information gathered
through surveillance and codebreaking to support The National Security Agency was created in
the interests of the United States and its allies. 1952, evolving out of the Cipher Bureau and
The NSA has become the center of a larger Military Intelligence Branch, a World War I-era
debate over the proper extent of state surveillance cryptanalytic agency, and later, the Armed Forces
powers in balancing both national security and Security Agency, both of which dealt with the
civil liberties. As the world has become increas- encryption of the messages of American forces
ingly globalized, and as cultural expression has and its allies through the end of the Second
increasingly become mediated through informa- World War. The mandate of the organization con-
tion flows and new technological developments, tinues to be one of signal intelligence – mediated,
the NSA has seen its importance in the national signal-based information sources such as textual,
intelligence-gathering landscape rise in tandem radio, broadcast, or telephonic communications –
with its ability to collect, store, and analyze infor- rather than human intelligence, which is the
mation through mass surveillance of electronic domain of the Central Intelligence Agency (CIA)
communications. and other governmental agencies. Though the
This tension became particularly fervent NSA’s existence was classified upon the agency’s
following former NSA contractor and whistle- creation, and its practices clandestine, it would
blower Edward Snowden’s 2013 revelation that become controversial in the 1960s and 1970s for
the agency had been secretly collecting the inter- its role in providing evidence for the Gulf of
net, telephone, mobile location, and other digital Tonkin incident, domestic wiretaps of anti-
records of over a billion people worldwide, Vietnam War protesters and civil rights leaders,
including tens of millions of domestically the agency’s involvement with the Watergate
based US citizens and dozens of heads of state scandal of the Nixon Administration, and numer-
of foreign governments. Many of the NSA’s ous military actions of the United States and eco-
surveillance practices require no court approval, nomic espionage instances during the 1980s and N
oversight, or warrant issuing: There is consider- 1990s. Both the NSA’s budget and number of
able legal disagreement on whether these war- employees are classified information, but in
rantless collections violate Fourth Amendment 2016 were estimated to be just under $10b and
protections against search and seizure. The between 35,000 and 45,000, respectively. Its
secret Foreign Intelligence Surveillance Court headquarters is in Fort Meade, Maryland.
(FISC) that oversees many of the NSA’s data-
collection strategies has repeatedly allowed these
practices. However, the rulings from FISC The NSA in the Twenty-First Century
courts are classified, neither available to the
public or most members of Congress, and there Technological Capabilities
have been contradictory rulings from lower and The per-bit cost of storage continues to decrease
appeals courts on the FISC’s interpretation of dramatically with each passing year while pro-
law. The US Supreme Court is expected to cessing speed increases exponentially. With
address these issues in the near future, but as access to both the deep pockets of the US Gov-
of this writing, it has not yet ruled on the con- ernment and the data infrastructure of American
stitutionality of most of the NSA’s surveillance ISPs, the technological and logistical capabilities
practices. Most of what is known about the of the NSA continue to lead to new programs of
NSA’s activities has thus far come from the surveillance and countersurveillance, often at the
Snowden leaks and subsequent interpretation of leading edge of technological and scientific
the leaked documents by media organizations discovery.
and the public. However, the full extent of the In terms of its global advantages in these terms
NSA’s practices continues to be unknown. of data collection and processing, what is known

about the NSA reads as a list of superlatives: It has surprised at the extent of the NSA’s data collec-
more combined computing power, more data stor- tion and retention, as the organization is pre-
age, the largest collection of supercomputers, and vented from knowingly surveilling US citizens
more taps on global telephone and internet con- on US soil. However, the mass collection of data
nections than any other governmental or private has often been indiscriminate, and an unknown
entity in the world. Particularly following the number of unintentional targets were regularly
2013 opening of its 1 million square foot Utah swept up in the collection. In March 2014, Pres-
Data Center outside of Salt Lake City, potentially ident Obama announced slight alterations to the
holding upward of a yottabyte of data, it is esti- NSA’s bulk telephone metadata collection prac-
mated that the NSA now has the ability to surveil tices, but these did little to quell the controversy
most of the world’s internet traffic, most notably or appease the public, a majority of whom con-
through the signals that run through public and tinued to oppose the agency’s domestic surveil-
private servers in the United States. The NSA has lance practices as recently as 2016.
numerous facilities throughout the United States, Beyond the telephone metadata collection, the
around the globe in allied nations, and at least four NSA’s data-collection and analysis activities
spy satellites dedicated for its exclusive use. It has are numerous and include such programs as
spent at least hundreds of millions of dollars to PRISM, MUSCULAR, Boundless Informant,
fund the development of quantum computing plat- XKEYSCORE, and several known others. These
forms that, if realized, will be able to decrypt the have produced similar massive databases of user
most complex algorithmic encryption available information for both foreign and non-foreign
today. Billions of the world’s emails, computer users and often with the collaboration between
data transfers, text messages, faxes, and phone the NSA and other foreign (primarily European)
calls flow through the NSA’s computing centers intelligence agencies. It has been documented that
every hour, many of which are logged and a large number of US service providers have given
indexed. the NSA information directly from their servers or
through direct access to their network lines,
Surveillance and Countersurveillance including Microsoft, Yahoo, Google, Facebook,
Activities PalTalk, AOL, Skype, YouTube, Apple, and
In June 2013, The Guardian reported that they AT&T.
had received documents leaked by former NSA The MYSTIC program collected metadata
contractor Edward Snowden that detailed that the from a number of nation-states’ territories, appar-
FISC had secretly ordered Verizon Communica- ently without the consent of the governments, and
tions to provide the NSA a daily report for all used in-house developed voice-recognition soft-
calls made in its system by its 120 million cus- ware under the subsequent SOMALGET program
tomers, both within the United States and to record both full-take audio and metadata for
between the United States and other countries every telephone conversation in Bermuda, Iraq,
and, in bulk, with no discrimination based on Syria, and others. The NSA also intentionally
suspicion of wrongdoing. While the content of weakened the security of a number of encryption
the calls was not included, the corporation protocols or influenced the production of a master
handed over the call’s metadata: the numbers encryption key in order to maintain a “back door”
involved, geographic location data, duration, through its BULLRUN program.
time, routing information, and other transac- The NSA regularly intercepts server and
tional data. These practices had existed in some routing hardware – most of which is built by US
form for over a decade under the Bush and corporations – after they are shipped via postal
Obama Administrations through the also- mail, but before they are delivered to government
controversial “warrantless wiretapping” provi- or private recipients in countries, implants hard-
sions of the USA PATRIOT Act. But in this ware or software surveillance tools and then
case, many in Congress and the public were repackages them with a factory seal and sends

them onward, allowing post-encryption access to domestically in the United States and worldwide.
the information sent through them. Most of the agency’s data-collection practices are
Edward Snowden revealed in 2014 that the clandestine and fall under the jurisdiction of the
NSA also routinely hacks foreign nations’ net- Federal Intelligence Surveillance Court, a secret,
works, not only military or governmental servers non-adversarial court that rules on the constitu-
but also academic, industrial, corporate, or medi- tionality of US governmental agencies’ surveil-
cal facilities. NSA hackers, for example, lance practices. The FISC has, itself, been
attempting to gain access to one of the core routers critiqued for its secrecy and lack of transparency
in a Syrian ISP in 2012, crashed the ISP’s routing and accountability, both from members of the
system, which in turn cascaded and blacked out public and from Congress, as well as its critique
the entire nation’s internet access for several days. as a “rubber stamp” court that approves nearly all
The SEXINT program has been monitoring of the requests that the government submits.
and indexing the sexual preferences and pornog- US citizens have constitutional protections that
raphy habits of internet users, political activists, are not granted to noncitizens, and many within
and dissidents in order to “call into question a the country have argued that the mass surveillance
radicalizer’s dedication” to a cause by releasing of Americans’ telephone, internet, and other activ-
the potentially embarrassing details. ities is a violation of the Fourth Amendment’s
The NSA has admitted that it monitored the prohibition against illegal search and seizure.
personal cell phones and electronic communica- Others have upheld the authority of the FISC’s
tion of at least 35 world leaders (including many rulings and need for secrecy in the name of
nations allied with the United States), as well as national security, particularly in an age where
attendees to the 2010 G20 Conference in Toronto, violent and cyber terrorism are prescient threats.
EU embassies in Washington, DC, visiting for- The NSA requires that its intelligence analysts
eign diplomats, and apparently many others, all have 51% confidence in their target’s “foreign-
without their knowledge. It has collected massive ness” for data collection, and many American
indiscriminate datasets of foreign citizens’ com- citizens are routinely swept up in massive intelli- N
munications, including 45 million Italian phone gence gathering. It was reported in 2013 that the
calls, 500 million German communications, agency shares its raw data with the FBI, CIA, IRS,
60 million Spanish phone calls, 70 million French the National Counterterrorism Center, local and
phone calls, 33 million Norwegian communica- state police agencies, and others without stripping
tions, and hundreds of millions of Brazilian com- names and personally identifying information, a
munications in 30-day increments in 2012 practice that was approved by the FISC.
and 2013. The tension, though, between the principles of
Furthermore, it was reveled in mid-2014 that civil rights transparency and effective public over-
the NSA had implemented its AI platform sight and of effective national security practices is
MonsterMind, which is designed to detect cyber not a new one, and the tendencies of the informa-
attacks, block them from entering the U.S., and tion age will continue to evolve in these terms. It
automatically counterattack with no human can be assured that the NSA will continue to be at
involvement, a problematic practice that, the forefront of many of these controversies as the
according to Snowden, requires the interception nation and the world decides where the appropri-
of all traffic flows in order to analyze threats. ate legal boundary lies.

Legal Oversight
There have been questions over the legality of Cross-References
many of the National Security Agency’s practices,
particularly in terms of the possibility of civil ▶ Ethical and Legal Issues
rights abuses that can occur without adequate ▶ Fourth Amendment
public transparency and oversight, both ▶ Privacy

Further Reading floods), geological (avalanches, coastal erosion,


landslides, earthquakes, lahars, volcanic erup-
Bamford, J. (2014, August). Edward Snowden: The untold tions), and wildfires and extraterrestrial events
story. WIRED. http://www.wired.com/2014/08/
(geomagnetic storms or impacts). These natural
edward-snowden/.
Greenwald, G. (2014). No place to hide: Edward Snowden, hazards, due to their location, severity, and fre-
the NSA, and the U.S. surveillance state. New York: quency, may adversely affect humans, their infra-
Metropolitan Books. structure, and their activities. Climate change may
Macaskill, E., & Dance, G. (2013, November 1). NSA
exacerbate natural disasters due to weather by
files: Decoded: What the revelations mean for you.
The Guardian. http://www.theguardian.com/world/ increasing the intensity and frequency of such
interactive/2013/nov/01/snowden-nsa-files-surveillanc disasters.
e-revelations-decoded. Research into natural hazards falls into four
National Security Administration. (2013). 60 years of
areas: mitigation, preparedness, response, and
defending our nation. www.nsa.gov/about/crypto
logic_heritage/60th/book/NSA_60th_Anniversary.pdf. recovery, and this categorization will be followed
here, even though the four areas often overlap. In
all four areas, big data intensifies the challenges in
carrying out these responses to natural hazards.
Natural Disasters Big data in these areas of research is impacted by
the “seven Vs”: volume, variety, velocity, verac-
▶ Natural Hazards ity, value, variability, and visualization (Akter and
Fosso Wamba 2017) to which may be added
vinculation, viscosity, and vicinity. All these
requirements of the big data that might be used
Natural Hazards in natural hazard research and operations would
impact the demands on computational resources.
Guido Cervone1, Yuzuru Tanaka2 and Cloud computing is an active big data research
Nigel Waters3 area for natural hazards because it provides elastic
1
Geography, and Meteorology and Atmospheric computing to respond to varying computational
Science, The Pennsylvania State University, loads that might occur in different geographical
University Park, PA, USA areas with differing probabilities of being affected
2
Graduate School of Information Science and during a disaster (Huang and Cervone 2016).
Technology, Hokkaido University, Sapporo, Major sources of research are the peer
Hokkaido, Japan reviewed journals such as International Journal
3
Department of Geography and Civil of Disaster Risk Reduction, Journal of Disaster
Engineering, University of Calgary, Calgary, AB, Research, Journal of Geography and Natural
Canada Disasters, Natural Hazards, Natural Hazards
and Earth Systems Sciences, Natural Hazards
Review, and Safety Science. In each of the four
Synonyms areas of natural hazard, research reference will be
made to recent case studies, although, as Akter
Disaster management; Natural disasters and Fosso Wamba (2017) have noted, these are
much less common than review/conceptual or
mathematical/analytical articles.
Introduction

The origins of natural hazard events may be atmo- Natural Hazard Mitigation
spheric/meteorological (droughts, heat waves,
and storms such as cyclonic, ice, blizzards, hail, Natural hazard mitigation measures are those
and tornados), hydrological (river and coastal undertaken by individuals or various levels of

government to reduce or eliminate the impacts of to the Red River Basin and will protect up to a 1-in-
hazards or to remove the risk of damage and 700-year flood. Lee and Kim (2017) in a study of
disaster. Various methodologies are used to assess flood mitigation in Seoul, South Korea, describe an
the effectiveness of mitigation including return on approach that combines structural and nonstructural
investment (ROI). Although all natural hazards flood prevention for decentralized reservoirs. They
require mitigation, here this process will be illus- also review the extant literature for structural, non-
trated by considering floods. Floods are among structural, and integrated approaches to flood
the most devastating of natural hazards inclu- mitigation.
ding the two most deadly natural disasters of all Big data approaches to flood mitigation have
time: the 1931 China floods that killed between 1 been pioneered by the Dutch in collaboration with
and four million people and the 1887 Yellow IBM using the Digital Delta software (Woodie
River Flood that killed between 900,000 and two 2013). This software allows the analysis of a
million people. They also cause some of the most huge variety of flood-related data. These data
devastating environmental impacts (e.g., the include water levels and water quality; sensors
Mozambique Flood of 2000 covered much of the embedded in levees; radar data; and weather pre-
country for about 3 weeks, an area of 1,400 sq. dictions and other historical flood-related infor-
km). Floods are thus extensive and occur across mation. To mitigate a flood, there is a need for
the globe and with great and increasing frequency. evidence-based approaches, and this may be facil-
All of this exacerbates the big data problems asso- itated with the use of crowdsourcing and social
ciated with their mitigation. Details of mitigation media (Huang and Cervone 2016). This is a big
strategies for other natural hazards may be found data problem since all aspects of big data noted
in the FEMA Comprehensive Preparedness Guide above add to the computational challenges faced
(FEMA 2018). by these researchers. Mitigation measures for all
Mitigation measures may be classed as struc- the other types of natural hazards may be found in
tural or nonstructural. For example, in the case of Wisner et al. (2012).
a flood, traditional nonstructural approaches N
such as early detection and warning measures,
zoning and building codes, emergency plans, Preparedness (Prevention and
and flood proofing and flood insurance may be Protection)
supplemented by newer nonstructural approaches,
to flood mitigation that include new computer FEMA (2018) provides a comprehensive guide to
architectures such as virtual databases and a deci- threat and hazard identification and risk assess-
sion support system to manage flood waters. ment (THIRA). Their document describes five
Other methodological approaches are benefit- core capabilities: prevention, protection, mitiga-
cost ratios (BCR) and cost-benefit analysis tion, response, and recovery. Mitigation has been
(CBA); a discussion of these can be found in considered above, while response and recovery
Wisner et al. (2012). are considered separately below. Prevention and
Structural approaches to flood mitigation include, protection of threats and hazards fall into three
for example, the building of floodways to take flood areas: technological, human caused, and natural.
waters away from residential areas. One of the most Here the concern is only with the natural hazards
successful of these was the Red River Floodway in listed in the introduction above. Preparedness is
Manitoba, Canada. This floodway was originally enhanced by developing and then consulting var-
built between 1962 and 1968 at a cost of Can$63 ious sources such as emergency laws, policies,
million. Starting in 2005 a further Can$627 million plans, and procedures, by checking existing
was spent to upgrade the capacity of the floodway THIRAs, and by planning emergency response
from 90,000 to 140,000 cubic feet per second. It is scenarios with all levels of government, stake-
estimated that it has prevented approximately holders, and first responders (fire, police, and
Can$12 billion worth of damage during major floods emergency medical services).

Ideally these activities are coordinated with an simulation of a big data warehouse for sharing
emergency operations center (EOC). In addition, the results of these simulations.
records and historical data from previous incidents Hultquist and Cervone (2018) describe damage
should be reviewed and critical infrastructure inter- assessment of the urban environment during a nat-
dependencies examined. Factors for selecting ural disaster using various sources of volunteered
threats and hazards include the likelihood/proba- geographic information (VGI). These sources
bility of the incident and its significance in terms of include social media (Twitter, Facebook, Instagram
impact. The complexity of these activities inevita- to provide text, videos, and photos), mobile
bly produces big data problems in all but the phones, collective mapping projects, and images
smallest communities. This is especially the case from unmanned aerial vehicles (UAVs). The need
if a methodology needs to be developed for an all- for high levels of granularity in both space and time
hazards approach as opposed to the less demanding plus the integration of these interrelated
but more commonly encountered single hazard (vinculation) VGI data with authoritative sources
methodologies. Preparedness is most effective if created big data demands for those analyzing and
the occurrence of a natural hazard can be pre- interpreting the information. The effectiveness of
dicted. Sala (2016) explains how hurricanes, these data sources was proven with an analysis of
minor seismic disturbances, and floods can be the September 2013 floods in Colorado and health
predicted using big data acquired from mobile hazard monitoring following the 2011 Fukushima
phone accelerometers and crowdsourced from Daiichi nuclear disaster. Flood detection, warning,
volunteers. These data may be collected using damage assessment, response, as well as disaster
cloud computing and amalgamated with data prevention and mitigation are all goals of the Dart-
from traditional seismic sources. Sala describes mouth Flood Observatory (Sala 2016). United
how researchers at the Quake-Catcher Network States Geological Survey (USGS) seismic data
have gathered these data into the globally distrib- and National Aeronautics and Space Administra-
uted Quake-Catcher Network that can be used as tion (NASA) Tropical Rainfall Measuring Mission
an early warning system for seismic disturbances, (TRMM) rainfall data have been integrated with
thus enhancing preparedness. social sensors including YouTube, Instagram, and
Twitter for landslide detection using the LITMUS
Response system described in Sala (2016).
Early detection systems blend into response sys- Tanaka et al. (2014) have addressed the prob-
tems. Koshimura (2017) has described a series of lem of snow management in the city of Sapporo
research initiatives under a Japan Science and on the island of Hokkaido, Japan. Each year Sap-
Technology Agency, CREST, and big data appli- poro with a population of almost two million
cations program. These include a framework for receives approximately 6 m of snow and has a
the real-time simulation of a tsunami inundation snow removal budget of almost $180 million
that incorporates an estimation of building and (US). Historic data from probe cars (private vehicles
other infrastructure damage; a study of the traffic and taxis), buses, and snow plows are combined
distribution following the 2016 Kumamoto with real-time data from each of these sources. In
earthquake permitting the simulation of future addition, the system integrates (a vinculation big
traffic disruptions following similar natural data problem) probe person data, traffic sensor
disasters; the use of synthetic aperture radar data, meteorological sensor data, plus snow plowing
(SAR) for damage detection following the 2016 and subway passenger records among other data
Kumamoto, 2011 Great East Japan, and 2015 sources. Visualization tools are integrated to mini-
Nepal earthquakes; emergency vehicle and mize the impact of the snow hazard.
wide-area evacuation simulation models; a big
data assimilation team to simulate the distribu- Recovery
tion of humans and cars assuming various sce- An initial concern in disaster recovery is data
narios following a natural disaster; and a restoration from an emergency operations center

or from affected businesses. Huang et al. (2017) Huang, Q., & Cervone, G. (2016). Usage of social media
review the literature on this and then describe how and cloud computing during natural hazards. In T. C.
Vance, N. Merati, C. Yang, & M. Yuan (Eds.), Cloud
cloud computing can be used to rapidly restore large volumes of data to multiple operations centers. Business continuity refers to the restoration of IT or technology systems and the physical infrastructure of the environment damaged during the natural disaster.
FEMA (2016) has developed a National Disaster Recovery Framework (NDRF) that is designed not only to restore the community's physical infrastructure to pre-disaster conditions but also to support the financial, emotional, and physical requirements of affected community members. The complexity of this task and the need for a rapid and integrated response to recovery ensure that this is a big data problem.

Conclusion

Natural hazards are the continuing source of disasters that impact communities around the world. Remediation of the threats that result from these hazards has been reviewed under the headings of mitigation, preparedness, response, and recovery. The complexity and interrelatedness of these tasks and the speed required for timely response ensure that they are "big data" problems. In the instances of atmospheric, meteorological, and hydrological events, these tasks continue to be exacerbated by climate change as extreme events become more frequent and of greater severity.

Further Reading

Akter, S., & Fosso Wamba, S. (2017). Big data and disaster management: A systematic review and agenda for future research. Annals of Operations Research. https://doi.org/10.1007/s10479-017-2584-2.
FEMA. (2016). National disaster recovery framework (2nd ed.). Washington, DC: Federal Emergency Management Agency. 53 p.
FEMA. (2018). Comprehensive preparedness guide (CPG) 201: Threat and hazard identification and risk assessment (THIRA) and stakeholder preparedness review (SPR) guide. https://www.fema.gov/media-library/assets/documents/165308.
… computing in ocean and atmospheric sciences (pp. 297–324). Amsterdam: Academic Press.
Huang, Q., Cervone, G., & Zhang, G. (2017). A cloud-enabled automatic disaster analysis system of multi-sourced data streams: An example synthesizing social media, remote sensing and Wikipedia data. Computers, Environment and Urban Systems, 66, 23–37. https://doi.org/10.1016/j.compenvurbsys.2017.06.004.
Hultquist, C., & Cervone, G. (2018). Citizen monitoring during hazards: Validation of Fukushima radiation measurements. GeoJournal, 83(2), 189–206. https://doi.org/10.1007/s10708-017-9767-x.
Koshimura, S. (2017). Fusion of real-time disaster simulation and big data assimilation – Recent progress. Journal of Disaster Research, 12(2), 226–232.
Lee, E. H., & Kim, J. H. (2017). Design and operation of decentralized reservoirs in urban drainage systems. Water, 9, 246. https://doi.org/10.3390/w9040246.
Sala, S. (2016). Using big data to detect and predict natural hazards better and faster: Lessons learned with hurricanes, earthquakes and floods. http://datapopalliance.org/using-big-data-to-detect-and-predict-natural-hazards-better-and-faster-lessons-learned-with-hurricanes-earthquakes-floods/.
Tanaka, Y., Sjöbergh, J., Moiseets, P., Kuwahara, M., Imura, H., & Yoshida, T. (2014). Geospatial visual analytics of traffic and weather data for better winter road management. In G. Cervone, J. Lin, & N. Waters (Eds.), Data mining for geoinformatics (pp. 105–126). New York: Springer.
Wisner, B., Gaillard, J. C., & Kelman, I. (Eds.). (2012). Handbook of hazards and disaster risk reduction and management. New York: Routledge.
Woodie, A. (2013). Dutch turn to big data for water management and flood control. https://www.datanami.com/2013/06/27/dutch_turn_to_big_data_for_water_management_flood_control/.

Natural Language Processing (NLP)

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Natural Language Processing (NLP) – with Machine Learning (ML) and Deep Learning (DL) – constitutes an important subdomain of Artificial Intelligence (AI). NLP operates on very large unstructured data sets – text-based big data sets – by employing information technology
(IT) capabilities and linguistics to support computer-enabled Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLP provides not only the basis for text analyses of massive corpora but also for such tools as virtual assistant AI technology (e.g., Siri and Alexa). Common applications of NLP include machine translation of human languages into machine languages for analysis, manipulation, and management; supporting search engines; extracting and summarizing information from diverse sources (e.g., financial information from newspaper articles); supporting human-machine vocal interactions (e.g., Alexa and Siri); and filtering spam. As a big data process, NLP can be framed and understood in basic terms referencing linguistic foundations, text analysis tasks, and text-based information extraction.

Linguistic Foundations

NLP begins with the application of linguistics – the scientific study of language – which comprises several disciplines. NLP uses linguistics to derive meaning from human speech and texts (referencing English examples):

Phonetics – the study of speech sounds and how they are made; for example, the sound m is articulated with the lips held closed; b starts with the lips held together, followed by a voiceless plosive.
Phonology – the study of the distinguishable units of speech that distinguish one word from another – phonemes; for example, b, p, h: bit, pit, hit. NLP uses phonemes and their combinations to identify comprehensible speech events based on predetermined lexica and usage conventions.
Morphology – the study of how words are formed from morphemes, units of language that cannot be meaningfully subdivided; for example, out, go, and -ing collectively form outgoing; morphemes provide the basis for lexicon and ontology development. NLP uses morphemes to determine the construction and potential roles of words. There are also lexical aspects: NLP examines how morphemes combine to make words and how minor differences can change the meaning of a word.
Syntax – the study of how sentences are formed and the rules that apply to their formulation; for example, a sentence may adopt a syntactic pattern of subject + verb + direct object: Louis hit the ball. NLP uses predetermined syntactic rules and norms to determine the meaning of a sentence based on word order and its dependencies.
Semantics – the study of the meaning of language (not to be confused with semiotics – the study of symbols and their interpretations); for example, the subtle difference between rubric and heading. Based on the semantics of words and their syntax in a sentence, NLP attempts to determine the most likely meaning of a sentence and what makes the most sense in a specific context or discourse.
Pragmatics – the study of language and the circumstances in which it is used; for example, how people take turns in conversation, how texts are organized, and how a word can take on a particular meaning based on tone or context; for example, guilt in a legal context differs from guilt in an ecclesiastical context. NLP uses pragmatics to determine how the contextual framework of a sentence helps determine the meaning of individual words.

NLP Text Analysis Tasks

Using NLP to perform text analysis usually takes the form of several basic activities:

Sentence segmentation – demarcating separate sentences in a text.
Word tokenization – assigning tokens to each word in a sentence to make them machine-readable so that sentences can be processed one at a time.
Parts of speech assignment – designating a part of speech for each token (noun, pronoun, verb, adjective, adverb, preposition, conjunction,
interjection) in a sentence, facilitating syntactic conformance and semantic cohesion.
Text lemmatization – determining the basic form – lemma – of each word in a sentence; for example, smok- in smoke, smoker, or smoking.
Stop words identification – many languages have stop words, such as and, the, and a in English; these are usually removed to facilitate NLP processing.
Dependency parsing – determining the relationships and dependencies of the words in a sentence; for example, noun phrases and verb phrases in a sentence.
Named entity recognition (NER) – examples of named entities that a typical system can identify are people's names, company names, physical and political geographic locations, product names, dates and times, currency amounts, and named events. NER is generally based on grammar rules and supervised models. However, there are NER platforms with built-in NER models.
Co-reference resolution – resolving the references of deictic pronouns such as he, she, it, they, them, etc.
Fact extraction – using a predefined knowledge base to extract facts (meaning) from a text.
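The tasks above map directly onto the interfaces of common NLP toolkits. The following is a minimal sketch, assuming Python and the open-source spaCy library with its small English model (en_core_web_sm); the sample sentence is purely illustrative, and the printed attributes correspond to the tasks just listed.

  import spacy

  # Assumes the small English model has been installed, e.g.:
  #   python -m spacy download en_core_web_sm
  nlp = spacy.load("en_core_web_sm")

  text = "Louis hit the ball. Netflix was founded in 1997 in California."
  doc = nlp(text)

  # Sentence segmentation
  for sent in doc.sents:
      print("Sentence:", sent.text)

  # Word tokenization, parts of speech, lemmatization, stop words,
  # and dependency parsing, token by token
  for token in doc:
      print(token.text, token.pos_, token.lemma_, token.is_stop,
            token.dep_, token.head.text)

  # Named entity recognition
  for ent in doc.ents:
      print("Entity:", ent.text, ent.label_)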
NLP Text-based Information Extraction

NLP provides important capabilities to support diverse analytical efforts:

Sentiment analysis – NLP can be useful in analyses of people's opinions or feedback, such as those contained in customer surveys, reviews, and social media.
Aspect mining – aspect mining identifies different points of view (aspects) in a text. Aspect mining may be used in conjunction with sentiment analysis to extract information from a text.
Text summarization and knowledge extraction – NLP can be applied to extract information from, for example, newspaper articles or research papers. NLP abstraction methods create a summary by generating fresh text that conveys the crux of the original text; an NLP extraction method, by contrast, creates a summary by extracting parts from the original text.
Topic modelling – topic modelling focuses on identifying topics in a text and can be quite complex. An important advantage of topic modelling is that it is an unsupervised technique that does not require model training or a labeled training set. Algorithms that support topic modelling include the following:

  Correlated Topic Model (CTM) – CTM is a topic model of a document collection that models the words of each document and correlates the different topics in the collection.
  Latent Semantic Analysis (LSA) – an NLP technique for distributional semantics based on analyzing relationships between a set of documents and the terms they contain to produce a set of concepts related to the documents and terms.
  Probabilistic Latent Semantic Analysis (PLSA) – a statistics-based technique for analyzing bimodal and co-occurrence data that can be applied to unstructured text data.
  Latent Dirichlet Allocation (LDA) – the premise of LDA is that a text document in a corpus comprises topics and that each topic comprises several words. The input required by LDA is the text documents and the number of topics that the LDA algorithm is expected to generate.
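As an illustration of the last of these, the following is a minimal sketch of LDA topic modelling, assuming Python with the scikit-learn library; the tiny corpus and the choice of two topics are purely hypothetical.

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.decomposition import LatentDirichletAllocation

  documents = [
      "floods and hurricanes damage coastal infrastructure",
      "streaming services recommend films to subscribers",
      "earthquake sensors stream data to emergency agencies",
      "viewers rate movies and series on the platform",
  ]

  # Bag-of-words representation of the corpus
  vectorizer = CountVectorizer(stop_words="english")
  term_matrix = vectorizer.fit_transform(documents)

  # LDA takes the documents and the number of topics to generate
  lda = LatentDirichletAllocation(n_components=2, random_state=0)
  lda.fit(term_matrix)

  terms = vectorizer.get_feature_names_out()
  for topic_idx, weights in enumerate(lda.components_):
      top_terms = [terms[i] for i in weights.argsort()[-5:]]
      print("Topic", topic_idx, ":", ", ".join(top_terms))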
Summary

NLP supports intellectual- and labour-intensive tasks, ranging from sentence segmentation and word tokenization to topic extraction and modelling. NLP is an important subdomain of AI, providing IT capabilities to analyze very large sets of unstructured data, such as text and speech data.
Further Reading

Bender, E. M. (2013). Linguistic fundamentals for natural language processing: 100 essentials from morphology and syntax. New York: Morgan & Claypool Publishers.
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. Sebastopol: O'Reilly Publishing.
Manning, C. D., & Schütze, H. (2002). Foundations of natural language processing. Cambridge, MA: MIT Press.
Mitkov, R. (2003). The Oxford book of computational linguistics. Oxford: Oxford University Press.

Netflix

J. Jacob Jenkins
California State University Channel Islands, Camarillo, CA, USA

Introduction

Netflix is a film and television provider headquartered in Los Gatos, California. Netflix was founded in 1997 as an online movie rental service, using Permit Reply Mail to deliver DVDs. In 2007, the company introduced streaming content, which allowed customers instant access to its online video library. Netflix has since continued its trend toward streaming services by developing a variety of original and award-winning programming. Due to its successful implementation of Big Data, Netflix has experienced exponential growth since its inception. It currently offers over 100,000 titles on DVD and is the world's largest on-demand streaming service, with more than 80 million subscribers in over 190 countries worldwide.

Netflix and Big Data

Software executives Marc Randolph and Reed Hastings founded Netflix in 1997. Randolph was a previous cofounder of MicroWarehouse, a mail-order computer company; Hastings was a previous math teacher and founder of Pure Soft, a software company he sold for $700 million. The idea for Netflix was prompted by Hastings' experience of paying $40 in overdue fees at a local Blockbuster. Using $2.5 million in start-up money from his sale of Pure Soft, Hastings envisioned a video provider whose content could be returned from the comfort of one's own home, void of due dates or late fees. Netflix's website was subsequently launched on August 29, 1997.
Netflix's original business model used a traditional pay-per-rental approach, charging $0.50 per film. Netflix introduced its monthly flat-fee subscription service in September 1999, which led to the termination of its pay-per-rental model by early 2000. Netflix has since built its global reputation on the flat-fee business model, as well as its lack of due dates, late fees, or shipping and handling charges. Netflix delivers DVDs directly to its subscribers using the United States Postal Service and a series of regional warehouses located throughout the United States. Based upon which subscription plan is chosen, users can keep between one and eight DVDs at a time, for as long as they desire. When subscribers return a disc to Netflix using one of its prepaid envelopes, the next DVD on their online rental queue is automatically mailed in its stead. DVD-by-mail subscribers can access and manage their online rental queue through Netflix's website in order to add and delete titles or rearrange their priority.
In 2007 Netflix introduced streaming content as part of its "Watch Instantly" initiative. When Netflix first introduced streaming video to its website, subscribers were allowed 1 h of access for every $1 spent on their monthly subscription. This restriction was later removed due to emerging competition from Hulu, Apple TV, Amazon Prime, and other on-demand services. There are substantially fewer titles available through Netflix's streaming service than its disc library. Despite this limitation, Netflix has become the most widely supported streaming service in the world by partnering with Sony, Nintendo, and Microsoft to allow access through Blu-ray DVD players, as well as the Wii, Xbox, and PlayStation gaming consoles. In subsequent years, Netflix has increasingly turned attention toward its streaming
services. In 2008 the company added 2,500 new "Watch Instantly" titles through a partnership with Starz Entertainment. In 2010 Netflix inked deals with Paramount Pictures, Metro-Goldwyn-Mayer, and Lions Gate Entertainment; in 2012 it inked a deal with DreamWorks Animation.
Netflix has also bolstered its online library by developing its own programming. In 2011 Netflix announced plans to acquire and produce original content for its streaming service. That same year it outbid HBO, AMC, and Showtime to acquire the production rights for House of Cards, a political drama based on the BBC miniseries of the same name. House of Cards was released on Netflix in its entirety in early 2013. Additional programming released during 2013 included Lilyhammer, Hemlock Grove, Orange is the New Black, and the fourth season of Arrested Development – a series that originally aired on Fox between 2003 and 2006. Netflix later received the first Emmy Award nomination for an exclusively online television series. House of Cards, Hemlock Grove, and Arrested Development received a total of 14 nominations at the 2013 Primetime Emmy Awards; House of Cards received an additional four nominations at the 2014 Golden Globe Awards. In the end, House of Cards won three Emmy Awards, for "Outstanding Casting for a Drama Series," "Outstanding Directing for a Drama Series," and "Outstanding Cinematography for a Single-Camera Series." It won one Golden Globe, for "Best Actress in a Television Series Drama."
Through its combination of DVD rentals, streaming services, and original programming, Netflix has grown exponentially since 1997. In 2000, the company had approximately 300,000 subscribers. By 2005 that number grew to nearly 4 million users, and by 2010 it grew to 20 million. During this time, Netflix's initial public offering (IPO) of $15 per share soared to nearly $500, with a reported annual revenue of more than $6.78 billion in 2015. Today, Netflix is the largest source of Internet traffic in all of North America. Its subscribers stream more than 1 billion hours of media content each month, approximating one-third of total downstream web traffic. Such success has resulted in several competitors for online streaming and DVD rentals. Wal-Mart began its own online rental service in 2002 before acquiring the Internet delivery network, Vudu, in 2010. Amazon Prime, Redbox Instant, Blockbuster @Home, and even "adult video" services like WantedList and SugarDVD have also entered the video streaming market. Competition from Blockbuster sparked a price war in 2004, yet Netflix remains the industry leader in online movie rentals and streaming.
Netflix owes much of its success to the innovative use of Big Data. Because it is an Internet-based company, Netflix has access to an unprecedented amount of viewer behavior. Broadcast networks have traditionally relied on approximated ratings and focus group feedback to make decisions about their content and airtime. In contrast, Netflix can aggregate specified data about customers' actual viewing habits in real time, allowing it to understand subscriber trends and tendencies at a much more sophisticated level. The type of information Netflix gathers is not limited to what viewers watch and the ratings they ascribe. Netflix also tracks the specific dates and times at which viewers watch particular programming, as well as their geographic locations, search histories, and scrolling patterns; when they use pause, rewind, or fast-forward; the types of streaming devices employed; and so on.
The information Netflix collects allows it to deliver unrivaled personalization to each individual customer. This customization not only results in better recommendations but also helps to inform what content the company should invest in. Once content has been acquired or developed, Netflix's algorithms also help to optimize its marketing and to increase renewal rates on original programming. As an example, Netflix created ten distinct trailers to promote its original series House of Cards. Each trailer was designed for a different audience and seen by various customers based on those customers' previous viewing behaviors. Meanwhile, the renewal rate for original programming on traditional broadcast television is approximately 35%; the current renewal rate for original programming on Netflix is nearly 70%.
As successful as Netflix's use of Big Data has been, the company strives to keep pace with changes in viewer habits, as well as changes in its own product. When the majority of subscribers used Netflix's DVD-by-mail service, for instance, those customers consciously added new titles to their queue. Streaming services demand a more instantaneous and intuitive process of generating future recommendations. In response to developments such as this, Netflix initiated the "Netflix Prize" in 2006: a $1 million payout to the first person or group of persons to formulate a superior algorithm for predicting viewer preferences. Over the next 3 years, more than 40,000 teams from 183 countries were given access to over 100 million user ratings. BellKor's Pragmatic Chaos was able to improve upon Netflix's existing algorithm by approximately 10% and was announced as the award winner in 2009.
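The prediction task posed by the Netflix Prize can be illustrated with a deliberately simplified sketch of item-based collaborative filtering, assuming Python with NumPy; the tiny rating matrix and the cosine-similarity approach are hypothetical illustrations, not the winning BellKor's Pragmatic Chaos method.

  import numpy as np

  # Rows are viewers, columns are titles; 0 means "not yet rated" (hypothetical data)
  ratings = np.array([
      [5, 4, 0, 1],
      [4, 5, 1, 0],
      [1, 0, 5, 4],
      [0, 1, 4, 5],
  ], dtype=float)

  def cosine_similarity(a, b):
      mask = (a > 0) & (b > 0)          # compare only co-rated titles
      if not mask.any():
          return 0.0
      return float(a[mask] @ b[mask] /
                   (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

  def predict(user, item):
      # Weight the user's other ratings by how similar each rated title is to the target title
      sims, vals = [], []
      for other in range(ratings.shape[1]):
          if other != item and ratings[user, other] > 0:
              sims.append(cosine_similarity(ratings[:, item], ratings[:, other]))
              vals.append(ratings[user, other])
      sims, vals = np.array(sims), np.array(vals)
      return float(sims @ vals / sims.sum()) if sims.sum() > 0 else float(vals.mean())

  print(round(predict(user=0, item=2), 2))   # predicted rating for an unseen title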
Conclusion

In summation, Netflix is presently the world's largest "Internet television network." Key turning points in the company's development have included a flat-rate subscription service, streaming content, and original programming. Much of the company's success has also been due to its innovative implementation of Big Data. An unprecedented level of information about customers' viewing habits has allowed Netflix to make informed decisions about programming development, promotion, and delivery. As a result, Netflix currently streams more than 1 billion hours of content per month to over 80 million subscribers in 190 countries and counting.

Cross-References

▶ Algorithm
▶ Apple
▶ Communications
▶ Data Streaming
▶ Entertainment
▶ Facebook
▶ Social Media

Further Reading

Keating, G. (2013). Netflixed: The epic battle for America's eyeballs. London: Portfolio Trade.
McCord, P. (2014). How Netflix reinvented HR. Harvard Business Review. http://static1.squarespace.com/static/5666931569492e8e1cdb5afa/t/56749ea457eb8de4eb2f2a8b/1450483364426/How+Netflix+Reinvented+HR.pdf. Accessed 5 Jan 2016.
McDonald, K., & Smith-Rowsey, D. (2016). The Netflix effect: Technology and entertainment in the 21st century. London: Bloomsbury Academic.
Simon, P. Big data lessons from Netflix. Wired. Retrieved from https://www.wired.com/insights/2014/03/big-data-lessons-netflix/.
Wingfield, N., & Stelter, B. (2011, October 24). How Netflix lost 800,000 members, and good will. The New York Times. http://faculty.ses.wsu.edu/rayb/econ301/Articles/Netflix%20Lost%20800,000%20Members%20.pdf. Accessed 5 Jan 2016.

Network Advertising Initiative

Siona Listokin
Schar School of Policy and Government, George Mason University, Fairfax, VA, USA

The Network Advertising Initiative (NAI) is a self-regulatory association in the United States (US), representing third parties in online advertising, and is one of the oldest industry-led efforts focused on consumer data privacy and security. It was initially formed in 1999 following industry engagement with the Federal Trade Commission (FTC) and consisted of ten firms that covered 90% of the network advertising industry. Membership rosters and rules have fluctuated significantly since the NAI's formation, and it is useful to evaluate the organization's evolution rather than its performance in any single year. Today, the Initiative has about 100 participating firms. The NAI has received praise from the FTC as a leader in the self-regulatory community. However, many
critics point to a history of lax enforcement, ineffective consumer choice tools, and insufficient industry representation.

Initial Evolution of NAI

The FTC invited online advertisers to consider self-regulating the online profiling industry in 1999, in advance of a workshop on the subject. At the time, the FTC was concerned with the lack of transparency to consumers as to the involvement of ad networks while using the Web. The initial NAI agreement with the FTC was founded on the four principles of notice, choice, access, and security. Over time, data use/limitation and data reliability were added to the foundational principles. Notably, consumer choice over online tracking was based on an "opt-out" model for non-personally identifying information. In 2001, the NAI launched a Web form that allowed consumers to opt out of participating firms' data collection in a single site, but it did not directly address the concern about lack of consumer knowledge.
While the NAI continued to grow its self-regulatory guidelines, within a few years many of the founding firms dropped out of the initiative, during a period that coincided with less FTC scrutiny and engagement in consumer privacy regulation. Only two companies, Avenue A and DoubleClick, were full participating members in 2002; five other founding firms were listed as associate members that did not engage in online preference marketing and were not part of the opt-out web form. The NAI added third-party enforcement through TRUSTe at this time to improve credibility through its Watchdog Reports, though the company was also a participating member of the Initiative. TRUSTe's public disclosure of complaints and resolutions became increasingly opaque, culminating in a total absence of public enforcement by the end of 2006. The lack of industry representation and credible enforcement led many privacy advocacy groups to declare the NAI a failed attempt at strong self-regulation.

Self-Regulatory Guidelines

In response to criticism over the NAI's membership, enforcement, and narrow definitions of consumer choice over advertising network data collection, along with a new FTC report on self-regulation in online behavioral advertising, the Initiative updated its self-regulatory guidelines at the end of 2008 and allowed for public comment. The new guidelines were notable for expanding the definition of online advertising as the industry evolved in the decade since its founding. In addition, the NAI supported a new effort in consumer education, addressing the transparency concerns that had persisted since its founding. A later update in 2013 added data transfer and retention restrictions to the core principles. In addition, the NAI joined other major advertising organizations and trade associations in the Digital Advertising Alliance, which offered its own mechanism for opting out of interest-based advertisements via its AdChoices tool. The NAI began regulating cross-app ads in 2016.
The NAI now includes about 100 companies as full members; associate memberships no longer exist. The Initiative emphasizes its industry coverage and notes that nearly all ads served on the Internet and seen in the USA involve the technology of NAI members. Compliance and enforcement are conducted by the NAI itself, using ongoing manual reviews of opt-out pages as well as an in-house scanner to check whether opt-out choices are honored or privacy policies have changed. In its 2018 Compliance Report, the NAI reported receiving almost 2,000 consumer and industry complaints, the vast majority of which were either outside of the NAI's mission or related to technical glitches in the opt-out tool. The NAI investigated one potential noncompliance case in 2018.

Assessment of NAI

There have been a number of outside assessments of the NAI following its 2008 update. It is worth noting that some of these evaluations are conducted by privacy advocacy groups that are
skeptical of self-regulation in general and supportive of comprehensive consumer privacy legislation in the USA. That said, the NAI is frequently criticized for inadequate technical innovation in its consumer choice tools and a lack of credible enforcement.
Despite general approval over the 2008 and 2013 updates, critiques of the NAI note that the main opt-out function has remained largely static, utilizing web-based cookies despite changing technologies and consumer behavior. In addition, the Initiative defines online behavioral advertising as that done by a third party, and its principles therefore do not apply to tracking and targeting by websites in general. A 2011 study found that more than a third of NAI members did not remove their tracking cookies after the opt-out choice was selected in the Initiative's web form. Other works have found that only about 10% of those studied could discern the functionality of the NAI opt-out tool, and that there was infrequent compliance with membership requirements in both privacy policies and opt-out mechanisms. These studies also note the variability in privacy policies and opt-out options, with many member firms going above and beyond the NAI code.

Further Reading

Dixon, P. (2007). The network advertising initiative: Failing at consumer protection and at self-regulation. World Privacy Forum, Fall 2007.
King, N. J., & Jessen, P. W. (2010). Profiling the mobile customer – Is industry self-regulation adequate to protect consumer privacy when behavioural advertisers target mobile phones? – Part II. Computer Law & Security Review, 26(6), 595–612.
Komanduri, S., Shay, R., Norcie, G., Ur, B., & Cranor, L. F. (2011). AdChoices? Compliance with online behavioral advertising notice and choice requirements. Carnegie Mellon University CyLab, March 30, 2011.
Mayer, J. (2011). Tracking the trackers: Early results. Stanford Law School Center for Internet and Society, July 12, 2011.

Network Analysis

▶ Link/Graph Mining

Network Analytics

Jürgen Pfeffer
Bavarian School of Public Policy, Technical University of Munich, Munich, Germany

Synonyms

Network science; Social network analysis

Much of big data comes with relational information. People are friends with or follow each other on social media platforms, send each other emails, or call each other. Researchers around the world copublish their work, and large-scale technology networks like power grids and the Internet are the basis for worldwide connectivity. Big data networks are ubiquitous and are more and more available for researchers and companies to extract knowledge about our society or to leverage new business models based on data analytics. These networks consist of millions of interconnected entities and form complex socio-technical systems that are the fundamental structures governing our world, yet defy easy understanding. Instead, we must turn to network analytics to understand the structure and dynamics of these large-scale networked systems, to identify important or critical elements, and to reveal groups. However, in the context of big data, network analytics is also faced with certain challenges.

Network Analytical Methods

Networks are defined as a set of nodes and a set of edges connecting the nodes. The major questions for network analytics, independent of network size, are "Who is important?" and "Where are the groups?" Stanley Wasserman and Katherine Faust have authored a seminal work on network analytical methods. Even though this work was published in the mid-1990s, it can still be seen as the standard book on methods for network analytics, and it also provides the foundation for many contemporary methods and metrics. With respect
to identifying the most important nodes in a given network, a diverse array of centrality metrics has been developed in the last decades. Marina Hennig and her coauthors classified centrality metrics into four groups. "Activity" metrics purely count the number or summarize the volume of connections. For "radial" metrics, a node is important if it is close to other nodes, and "medial" metrics account for being in the middle of flows in networks or for bridging different areas of the network. "Feedback" metrics are based on the idea that centrality can result from the fact that a node is connected (directly or even indirectly) to other central nodes. For the first three groups, Linton C. Freeman has defined "degree centrality," "closeness centrality," and "betweenness centrality" as the most intuitive metrics. These metrics are used in almost every network analytical research project nowadays. The fourth metric category comprises mathematically advanced methods based on eigenvector computation. Phillip Bonacich presented eigenvector centrality, which led to important developments of metrics for web analytics like Google's PageRank algorithm and the HITS algorithm by Jon Kleinberg, which is incorporated into several search engines to rank search results based on a website's structural importance on the Internet.
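The centrality metrics named above are available in standard network software. The following is a minimal sketch, assuming Python with the NetworkX library and a small toy graph; it simply computes the four classic scores for every node.

  import networkx as nx

  # Toy undirected network (hypothetical edge list)
  G = nx.Graph([(1, 2), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (5, 6)])

  degree      = nx.degree_centrality(G)       # activity: share of possible neighbors
  closeness   = nx.closeness_centrality(G)    # radial: inverse average distance
  betweenness = nx.betweenness_centrality(G)  # medial: share of shortest paths passing through
  eigenvector = nx.eigenvector_centrality(G)  # feedback: connections to central nodes count more

  for node in G.nodes:
      print(node, round(degree[node], 2), round(closeness[node], 2),
            round(betweenness[node], 2), round(eigenvector[node], 2))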
The second big set of research questions related to networks is about identifying groups. Groups can refer to a broad array of definitions, e.g., nodes sharing certain socioeconomic attributes, membership affiliations, or geographic proximity. When analyzing networks, we are often interested in structurally identifiable groups, i.e., sets of nodes of a network that are more densely connected among themselves and more sparsely connected to all other nodes. The most obvious group of nodes in a network would be a clique – a set of nodes where each node is connected to all other nodes. Other definitions of groups are more relaxed. K-cores are sets of nodes for which every node is connected to at least k other nodes in the set. It turns out that k-cores are more realistic for real-world data than cliques and much faster to calculate. For any form of group identification in networks, we are often interested in evaluating the "goodness" of the identified groups. The most common approach to assess the quality of grouping algorithms is to calculate the modularity index developed by Michelle Girvan and Mark Newman.
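A minimal sketch of these group-related ideas, again assuming Python with NetworkX and a hypothetical toy graph: it extracts a k-core, detects communities with a greedy modularity heuristic, and scores the resulting partition with the modularity index.

  import networkx as nx
  from networkx.algorithms.community import greedy_modularity_communities, modularity

  # Hypothetical toy graph with two loosely connected clusters
  G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4),
                (4, 5), (4, 6), (5, 6), (5, 7), (6, 7)])

  # Nodes in the 2-core: every node has at least two neighbors within the core
  core = nx.k_core(G, k=2)
  print("2-core nodes:", sorted(core.nodes))

  # Community detection and the modularity of the resulting grouping
  communities = greedy_modularity_communities(G)
  print("Communities:", [sorted(c) for c in communities])
  print("Modularity:", round(modularity(G, communities), 3))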
Algorithmic Challenges

The most widely used algorithms in network analytics were developed in the context of small groups of (less than 100) humans. When we study big networks with millions of nodes, several major challenges emerge. To begin with, most network algorithms run in Θ(n²) time or slower. This means that if we double the number of nodes, the calculation time is quadrupled. For instance, let us assume we have a network with 1,000 nodes and a second network with one million nodes (a thousandfold increase). If a certain centrality calculation with quadratic algorithmic complexity takes 1 min on the first network, the same calculation would take 1 million minutes (approximately 2 years) on the second network (a millionfold increase). This property of many network metrics makes it nearly impossible to apply them to big data networks within reasonable time. Consequently, optimization and approximation algorithms for traditional metrics are developed and used to speed up analysis for big data networks.
A straightforward approach to algorithmic optimization of network algorithms for big data is parallelization. The abovementioned algorithms, closeness and betweenness centrality, are based on all-pairs shortest path calculation. In other words, the algorithm starts at a node, follows its links, and visits all other nodes in concentric circles. The calculation for one node is independent from the calculation for all other nodes; thus, different processors or different computers can jointly calculate a metric with very little coordination overhead.
Approximation algorithms try to estimate a centrality metric based on a small part of the actual calculations. The all-pairs shortest path calculation can be restricted in two ways. First, we can limit the centrality calculation to the k-step neighborhood of nodes, i.e., instead of visiting all other nodes in concentric
circles, we stop at a distance k. Second, instead of all nodes, we just select a small proportion of nodes as starting points for the shortest path calculations. Both approaches can speed up calculation time tremendously, as just a small proportion of the calculations is needed to create these results. Surprisingly, these approximated results have very high accuracy. This is because real-world networks are far from random and have specific characteristics. For instance, networks created from social interactions among people often have core-periphery structure and are highly clustered. These characteristics facilitate the accuracy of centrality approximation calculations. In the context of optimizing and approximating traditional network metrics, a major future challenge will be to estimate time/fidelity trade-offs (e.g., to develop confidence intervals for network metrics) and to build systems that incorporate the constraints of user and infrastructure into the calculations. This is especially crucial as certain network metrics are very sensitive, and small changes in the data can lead to big changes in the results.
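Sampling-based approximation of this kind is exposed directly in common network libraries. A minimal sketch, assuming Python with NetworkX: betweenness centrality is estimated from a random sample of 100 starting nodes and compared with the exact values; the random graph used here is purely illustrative.

  import networkx as nx

  # Illustrative random network (1,000 nodes); real big data networks are far larger
  G = nx.gnm_random_graph(1000, 5000, seed=42)

  # Exact betweenness uses all nodes as sources; the approximation samples only 100 of them
  exact  = nx.betweenness_centrality(G)
  approx = nx.betweenness_centrality(G, k=100, seed=42)

  # Compare the two scores for a handful of nodes
  for node in list(G.nodes)[:5]:
      print(node, round(exact[node], 4), round(approx[node], 4))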
New algorithms are especially developed for very large networks. These algorithms have sub-quadratic complexity so that they are applicable to very large networks. Vladimir Batagelj and Andrej Mrvar have developed a broad array of new metrics and a network analytical tool called "Pajek" to analyze networks with tens of millions of nodes.
However, some networks are too big to fit into the memory of a single computer. Imagine a network with 1 billion nodes and 100 billion edges – social media networks have already reached this size. Such a network would require a computer with about 3,000 gigabytes of RAM to hold the pure network structure with no additional information. Even though supercomputer installations already exist that can cope with these requirements, they are rare and expensive. Instead, researchers make use of computer clusters and analytical software optimized for distributed systems, like Hadoop.

Streaming Data

Most modern big data networks come from streaming data of interactions. Messages are sent among nodes, people call each other, and data flows are measured among servers. The observed data consist of dyadic interactions. As the nodes of the dyads overlap over time, we can extract networks. Even though networks extracted from streaming data are inherently dynamic, the actual analysis of these networks is often done with static metrics, e.g., by comparing the networks created from daily aggregations of data. The most interesting research questions with respect to streaming data are related to change detection. Centrality metrics for every node, or network-level indices that describe the structure of the network, can be calculated for every time interval. Looking at these values as time series can help to identify structural change in the dynamically changing networks over time.

Visualizing Big Data Networks

Visualizing networks can be a very efficient analytical approach, as human perception is capable of identifying complex structures and patterns. To facilitate visual analytics, algorithms are needed that present network data in an interpretable way. One of the major challenges for network visualization algorithms is to calculate the positions of the nodes of the network in a way that reveals the structure of the network, i.e., shows communities and puts important nodes in the center of the figure. The algorithmic challenges for visualizing big networks are very similar to the ones discussed above. Most commonly used layout algorithms scale very poorly. Ulrich Brandes and Christian Pich developed a layout algorithm based on eigenvector analysis that can be used to visualize networks with millions of nodes. The method that they applied is similar to the beforementioned approximation approaches. As real-world networks normally have a certain topology that is far from random, calculating just a part of the actual layout algorithm can be a good enough approximation to reveal interesting aspects of a network.
Networks are often enriched with additional information about the nodes or the edges. We often know the gender or the location of people. Nodes might represent different types of
infrastructure elements. We can incorporate this information by mapping data to visual elements of our network visualization. Nodes can be visualized with different shapes (circles, boxes, etc.) and can be colored with different colors, resulting in multivariate network drawings. Adding contextual information to compelling network visualizations can make the difference between pretty pictures and valuable pieces of information visualization.

Methodological Challenges

Besides algorithmic issues, we also face serious conceptual challenges when analyzing big data networks. Many "traditional" network analytical metrics were developed for groups of tens of people. Applying the same metrics to very big networks raises the question whether the algorithmic assumptions or the interpretations of results are still valid. For instance, the abovementioned metrics, closeness and betweenness centrality, incorporate just the shortest paths between every pair of nodes, ignoring possible flows of information on non-shortest paths. Even more, these metrics do not take path length into account. In other words, whether a node is on a shortest path of length two or of length eight, it is treated identically. Most likely this does not reflect real-world assumptions about information flow. All these issues can be addressed by applying different metrics that incorporate all possible paths or a random selection of paths of length k. In general, when accomplishing network analytics, we need to ask which of the existing network algorithms are suitable, and under which assumptions, for use with very large networks. Moreover, what research questions are appropriate for very large networks? Does being a central actor in a group of high school kids have the same interpretation as being a central user of an online social network with millions of users?

Conclusions

Networks are everywhere in big data. Analyzing these networks can be challenging. Due to the very nature of network data and algorithms, many traditional approaches to handling and analyzing these networks are not scalable. Nonetheless, it is worthwhile coping with these challenges. Researchers from different academic areas have been optimizing existing metrics and methodologies and developing new ones, as network analytics can provide unique insights into big data.

Cross-References

▶ Algorithmic Complexity
▶ Complex Networks
▶ Data Streaming
▶ Data Visualization

Further Reading

Batagelj, V., Mrvar, A., & de Nooy, W. (2011). Exploratory social network analysis with Pajek (Expanded ed.). New York: Cambridge University Press.
Brandes, U., & Pich, C. (2007). Eigensolver methods for progressive multidimensional scaling of large data. Proceedings of the 14th International Symposium on Graph Drawing (GD'06), 42–53.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social Networks, 1(3), 215–239.
Hennig, M., Brandes, U., Pfeffer, J., & Mergel, I. (2012). Studying social networks: A guide to empirical research. Frankfurt: Campus Verlag.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge: Cambridge University Press.

Network Data

Meng-Hao Li
George Mason University, Fairfax, VA, USA

Network (graph) data consist of composition and structural variables. Composition variables measure actor attributes, where actor attributes could be gender or age for people, private or public status for organizations, country names, or locations. Structural variables measure connections between pairs of actors, where connections could be friendships between people, collaboration between organizations, trade between nations, or a transmission line
between two stations (Wasserman and Faust 1994, p. 29; Newman 2010, p. 110). In the mathematical literature, network data are called a graph G = (V, E), where V is the set of vertices (actors) and E is the set of edges (connections). Table 1 shows some examples of composition and structural variables in different types of networks.

Network Data, Table 1 Examples of networks

                         Composition variable                                 Structural variable
Network                  Vertex         Attribute                             Edge
Friendship network       Person         Age, gender, weight, or income        Friendship
Collaboration network    Organization   Public, private, or nonprofit         Collaboration
Citation network         Article        Biology, engineering, or sociology    Citation
World Wide Web           Web page       Government, education, or commerce    Hyperlink
Trade network            Nation         Developed or developing               Trade

Modes of Networks

The mode of a network expresses the number of sets of vertices on which the structural variables are defined. There is no limit on the number of modes that a network can have, but most networks are defined as either one-mode networks or two-mode networks. One-mode networks have one set of vertices that are similar to each other. For example, in a friendship network, the set of vertices is people connected by friendships. In Fig. 1, the friendship network has six vertices (1, 2, 3, 4, 5, 6) and seven edges: (1, 2), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), and (5, 6). This representation of vertices and edges is also called an edge list. Edge lists are commonly used to store network data on computers and are often efficient for computing on a large network.
Two-mode (affiliation; bipartite) networks consist of two sets of vertices. For example, a group of doctors work for several hospitals. Some doctors work for the same hospital, but some doctors work for different hospitals. In this case, one set of vertices is the doctors and another set of vertices is the hospitals. In Fig. 2, the doctor-hospital network consists of two sets of vertices: five doctors (a, b, c, d, e) and three hospitals (A, B, C). An edge represents that a doctor is affiliated to a hospital. For example, doctor b is affiliated to hospital A, and doctor c is affiliated to hospitals A, B, and C.

The Adjacency Matrix

The adjacency matrix is the most common form used to represent a network mathematically. For example, the adjacency matrix of the friendship network in Fig. 1 can be displayed as elements Aij in Table 2, where 1 represents that there is an edge between vertices i and j, and 0 represents that there is no edge between vertices i and j. This is a symmetric matrix with no self-edges, implying that the elements in the upper right and lower left triangles are identical and all diagonal matrix elements are zero.
The friendship network in Fig. 1, which has single edges and no self-edges, is also called a simple graph. In some situations, a network may have multiple edges between two vertices (a multiedge is also called multiplexity in sociology). Such a network is called a multigraph. Figure 3 is a representation of a multigraph. Suppose that a researcher is interested in understanding a group's friendship network, advice network, and gossip network. The researcher conducts a network survey to investigate how those people behave in those three networks. The survey data can be constructed as a multigraph network as in Fig. 3. In Fig. 3, a solid edge represents a friendship, a dotted edge represents an advice relation, and a dash-dot edge represents a gossip relation. For example, there are friendship, advice, and gossip relations between vertices 1 and 5. The vertices 5 and 3 are connected by friendship and advice relations. The vertices 2 and 4 are linked by friendship and gossip relations. The multigraph network can also be converted to an adjacency matrix Aij, shown in Table 3.
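How an edge list is turned into an adjacency matrix can be sketched in a few lines of Python, assuming the NetworkX library; the edge list below is the friendship network of Fig. 1, and the resulting matrix corresponds to Table 2.

  import networkx as nx

  # Edge list of the one-mode friendship network in Fig. 1
  edges = [(1, 2), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (5, 6)]
  G = nx.Graph(edges)

  # Adjacency matrix with vertices ordered 1..6 (symmetric, zero diagonal)
  A = nx.to_numpy_array(G, nodelist=[1, 2, 3, 4, 5, 6], dtype=int)
  print(A)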
The values between i and j represent the number of edges present between i and j.

Network Data, Fig. 1 One-mode friendship network

Network Data, Fig. 2 Two-mode doctor-hospital network

Network Data, Table 2 Adjacency matrix

     1  2  3  4  5  6
  1  0  1  0  0  1  0
  2  1  0  0  1  1  0
  3  0  0  0  1  1  0
  4  0  1  1  0  0  0
  5  1  1  1  0  0  1
  6  0  0  0  0  1  0

Network Data, Fig. 3 An example of the multigraph network

Network Data, Table 3 The adjacency matrix of the multigraph network

     1  2  3  4  5  6
  1  0  1  0  0  3  0
  2  1  0  0  2  1  0
  3  0  0  0  1  2  0
  4  0  2  1  0  0  0
  5  3  1  2  0  0  1
  6  0  0  0  0  1  0

The Incidence Matrix

The incidence matrix is used to represent a two-mode (affiliation; bipartite) network. In Fig. 2, the two-mode doctor-hospital network can be constructed as an incidence matrix with elements Bij, shown in Table 4, where 1 represents that doctor j belongs to hospital i, and 0 represents that doctor j does not belong to hospital i. Although an incidence matrix can completely represent a two-mode network, it is often convenient for computing to project a two-mode network to a one-mode network. Tables 5 and 6 are different ways to exhibit the doctor-hospital network as one-mode networks. Table 5 is a hospital adjacency matrix, where 1 represents that there is at least one shared doctor between two hospitals. Table 6 is a doctor adjacency matrix, where 1 represents that there is at least one shared hospital between two doctors and 0 represents that there is no shared hospital between two doctors.

Network Data, Table 4 Incidence matrix

     a  b  c  d  e
  A  1  1  1  0  0
  B  0  0  1  0  1
  C  0  0  1  1  1

Network Data, Table 5 Hospital adjacency matrix

     A  B  C
  A  0  1  1
  B  1  0  1
  C  1  1  0

Network Data, Table 6 Doctor adjacency matrix

     a  b  c  d  e
  a  0  1  1  0  0
  b  1  0  1  0  0
  c  1  1  0  1  1
  d  0  0  1  0  1
  e  0  0  1  1  0
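The one-mode projections in Tables 5 and 6 can be obtained directly from the incidence matrix: the product of B with its transpose counts shared doctors between hospitals, and the reverse product counts shared hospitals between doctors. A minimal sketch, assuming Python with NumPy:

  import numpy as np

  # Incidence matrix B of the doctor-hospital network (rows: hospitals A-C, columns: doctors a-e)
  B = np.array([
      [1, 1, 1, 0, 0],
      [0, 0, 1, 0, 1],
      [0, 0, 1, 1, 1],
  ])

  hospital_proj = B @ B.T   # shared doctors between hospitals
  doctor_proj   = B.T @ B   # shared hospitals between doctors

  # Dichotomize and zero the diagonal to reproduce the 0/1 matrices of Tables 5 and 6
  hospitals = (hospital_proj > 0).astype(int)
  np.fill_diagonal(hospitals, 0)
  doctors = (doctor_proj > 0).astype(int)
  np.fill_diagonal(doctors, 0)

  print(hospitals)
  print(doctors)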
Weighted Networks

The aforementioned examples assume that edges are weighted equally, but this may not be realistic in most network structures. Weighted (valued) networks relax this assumption and allow a researcher to assign a value to each edge in a network. For example, Fig. 1 has equal weights on each edge in the friendship network. The friendships can instead be weighted by the time that two people have known each other. Figure 4 shows the weighted friendship network of Fig. 1. The edge value 15 between vertices 1 and 2 represents that the two people have known each other for 15 years. Likewise, the edge value 1 between vertices 1 and 5 represents that the two people have known each other for 1 year. Table 7 converts the weighted network into an adjacency matrix. This is a symmetric matrix with no self-edges: the upper right and lower left triangles have the same elements, and the diagonal matrix elements are all zero.

Network Data, Fig. 4 An example of the weighted network

Network Data, Table 7 The adjacency matrix of the weighted network

     1   2   3   4   5   6
  1  0   15  0   0   1   0
  2  15  0   0   5   2   0
  3  0   0   0   8   3   0
  4  0   5   8   0   0   0
  5  1   2   3   0   0   2
  6  0   0   0   0   2   0

Signed Networks

Signed networks are used to represent a network with "positive" and "negative" edges. A negative edge does not refer to an absence of an edge. For example, Fig. 5 shows a network with positive friendship edges and negative animosity edges. A negative edge here represents that two enemies are connected by an animosity edge. A positive edge represents that two friends are connected by a friendship edge. If edges are absent between two people, it simply indicates that the two people do not interact with each other. Signed networks are commonly stored as two distinct networks, one with positive edges and the other with negative edges. The adjacency matrix in Table 8 shows the positive edges of the friendship network in Fig. 5, where 1 represents that there is a positive edge between two people and 0 represents that there is no edge between two people.

Network Data, Fig. 5 An example of the signed network

Network Data, Table 8 Positive signed adjacency matrix

     1  2  3  4  5  6
  1  0  1  0  0  1  0
  2  1  0  0  0  1  0
  3  0  0  0  1  0  0
  4  0  0  1  0  0  0
  5  1  1  0  0  0  0
  6  0  0  0  0  0  0

The adjacency matrix in Table 9 is constructed from the negative edges, where 1 represents that there is a negative edge between
two people and 0 represents that there is no edge between two people.

Network Data, Table 9 Negative signed adjacency matrix

     1  2  3  4  5  6
  1  0  0  0  0  0  0
  2  0  0  0  1  0  0
  3  0  0  0  0  1  0
  4  0  1  0  0  0  0
  5  0  0  1  0  0  1
  6  0  0  0  0  1  0

Directed Networks

A directed network is a network with directed edges, where an arrow points a direction from one vertex to another vertex. Suppose that Fig. 6 is a directed network consisting of directed friendships, indicating that some people recognize other people as their friends. For example, there is one directional edge from person 2 to person 1, indicating that person 2 recognizes person 1 as her friend. But it does not mean that person 1 also recognizes person 2 as her friend; the friendship between person 1 and person 2 is one-directional and asymmetric. In the friendship between person 1 and person 5, the bi-directional arrow edge represents that there is a mutual recognition of the friendship between person 1 and person 5. As an example, the friendship network in Fig. 6 can be presented as an adjacency matrix Aij in Table 10, where 1 represents that there is an edge from j to i, and 0 represents that there is no edge between j and i. It must be remembered that, by convention and for mathematical calculation, the direction goes from rows (j) to columns (i).

Network Data, Fig. 6 An example of the directed network

Network Data, Table 10 The adjacency matrix of the directed network

     1  2  3  4  5  6
  1  0  1  0  0  1  0
  2  0  0  0  1  1  0
  3  0  0  0  1  1  0
  4  0  0  0  0  0  0
  5  1  0  1  0  0  0
  6  0  0  0  0  1  0

Quality of Network Data

The quality of network data is principally determined by the methods of data collection and data cleansing. However, there is no universal method that can ensure a high quality of data. The only course that a researcher can pursue is to minimize the threat of data errors and to seek optimal methods to approach research questions. Some types of errors have significant impacts on the quality of data and data analysis. Those errors are summarized below (Borgatti et al. 2013, pp. 37–40).

1. Omission errors: this type of error describes missing edges or vertices in the data collection process. In Fig. 1, for example, vertex 5 has four connections with other vertices and seems to occupy an important position in the network. If information on vertex 5 was not collected, it would cause a significant bias in data analysis.
2. Commission errors: this type of error describes edges or vertices that should not be included in the network. In other words, network boundaries need to be set precisely to exclude unnecessary vertices and edges.
3. Edge/node attribution errors: this type of error describes attributes of edges or nodes that are incorrectly assigned. For example, a private organization is labeled as a public organization in Table 1.
4. Retrospective errors: this type of error occurs when informants/respondents are not capable of recalling the people and activities that they have been involved with. This is an important issue in network survey studies. Since network survey questions are not as intuitive as conventional survey questions, some respondents are burdened with identifying people and recognizing activities with those people (Marsden 2005, pp. 21–23).
5. Data management errors: this type of error describes coding errors, inappropriate survey instruments, or software issues.
6. Data aggregation errors: this type of error describes information lost during the data aggregation process. It sometimes occurs with omission errors. As an example, when a researcher needs to aggregate different data
sets into a master file, some vertices that do not match others may need to be excluded from the master file.
7. Errors in secondary data sources: this type of error describes data sources that have inherent errors. For example, Facebook only allows researchers to access individual accounts that are open to the public. Private accounts thus would be excluded from the data collection.
8. Formatting errors: this type of error describes errors introduced by the data formatting process. For example, a researcher collects data from different sources with various formats. When there is a need for the researcher to integrate various data formats for analysis, formatting errors are likely to occur.

Large-Scale Network Data

Storing, querying, and analyzing network data become extremely challenging when the scale of network data is large. In a conventional dataset, 1,000 data points generally represent 1,000 data points. In a network dataset, 1,000 homogenous vertices can be connected by up to 499,500 undirected edges (N x (N-1)/2). When network data come with heterogenous vertices, directed/weighted/signed edges, time points, and locations, the possible combinations of network structures grow exponentially (e.g., mobile cellular networks). Several graph database systems, graph processing systems, and graph dataflow systems have been developed to manage network data. Those systems typically require the allocation of distributed clusters and in-memory graph processing and are anticipated to offer flexible and efficient ways of querying and analyzing large-scale network data (Junghanns et al. 2017).

Cross-References

▶ Big Variety Data
▶ Collaborative Filtering
▶ Data Aggregation
▶ Data Brokers and Data Services
▶ Data Cleansing
▶ Data Integration
▶ Data Quality Management
▶ Database Management Systems (DBMS)

Further Reading

Borgatti, S. P., Everett, M. G., & Johnson, J. C. (2013). Analyzing social networks. London, England: SAGE.
Junghanns, M., Petermann, A., Neumann, M., & Rahm, E. (2017). Management and analysis of big graph data: Current systems and open challenges. In A. Y. Zomaya & S. Sakr (Eds.), Handbook of big data technologies (pp. 457–505). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-49340-4_14.
Marsden, P. V. (2005). Recent developments in network measurement. In P. J. Carrington, J. Scott, & S. Wasserman (Eds.), Models and methods in social network analysis. Cambridge: Cambridge University Press.
Newman, M. (2010). Networks: An introduction. Oxford, England: Oxford University Press.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge, England: Cambridge University Press.

Network Science

▶ Link/Graph Mining
▶ Network Analytics

Neural Networks

Alberto Luis García
Departamento de Ciencias de la Comunicación Aplicada, Facultad de Ciencias de la Información, Universidad Complutense de Madrid, Madrid, Spain

Neural networks are analytic techniques modeled on the learning processes of an animal's central nervous system and are capable of
predicting new observations (on specific variables) from other observations (on the same or other variables).
Neural networks have seen an explosion of interest over the last few years and are being applied across finance, medicine, research, classification, data processing, robotics, engineering, geology, and physics to obtain faster network processing, more efficiency, or fewer errors.
Neural networks have two main characteristics. They are nonlinear, i.e., it is possible to introduce a large number of variables, and they are easy to use because they work with training algorithms that automatically learn the structure of the data. Neural networks are also intuitive, being based on their similarity to biological neural systems.
Neural networks have grown out of research in artificial intelligence, in which their structure is based on research into the development of knowledge about brain functioning. The main branch of artificial intelligence research in the 1960s-1980s produced expert systems.
The brain is composed of 10,000,000,000 neurons, massively interconnected with one another. Each neuron is a specialized cell, composed of dendrites (the input structure) and an axon (the output structure, which connects with the dendrites of another neuron via a synapse).
Work with neural networks is structured around two main considerations: the size of the structure and the number of layers needed to accommodate all the variables specified in the model. In all models, the main form of work is through trial-and-error testing.
The new network is then subjected to a process of training and learning, in which an iterative process adjusts the input variables and the weights of the network in order to optimally predict the sample data. The network developed in this process is a pattern that can make predictions from real input data, but it can be modified through the different layers, adjusting the results to specific data.
One of the major advances in working with work that allows you to train and teach the
neural networks is to prevent initial working model to make predictions naturally very
very reliable. In this way, additional costs in repairs and maintenance of engines and machines can be avoided.
• Image processing. With proper training, a network is able to read a car license plate or recognize a person's face.

The main question is how to apply a neural network to solve a problem. The first step is to apply the model to a specific problem that can be addressed through historical data, repetitive situations, and the like; it is impossible, for example, to predict the lottery (in any normal way). Another important requirement is that there be a relationship between the input and output variables; the relationship may be strange, but it must exist. It is then possible to begin training the model through a learning process that can be supervised or unsupervised. In the supervised case, the training data contain examples of input data (such as historical records of events or historical fluctuations of stock prices) that are controlled by the researcher, together with examples of output data. The results are fitted to the model, and the final result can be checked to assess the success of the model.

If the model is "ready" to work, the first decision is which variables to use and how many cases to gather. The choice of variables is guided by intuition, and working with the chosen variables may lead to choosing different ones; but the first part of the process is the selection of the main influential variables. These data may be numeric values, which must be scaled into a range appropriate for the network, or other kinds of statistical values.
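As a minimal, illustrative sketch of the supervised workflow just described – selecting input variables, scaling numeric values into an appropriate range, and iteratively adjusting weights on historical examples – the following Python fragment uses the scikit-learn library. The synthetic data, variable names, and parameter choices are hypothetical and are not drawn from this entry.

    # Illustrative only: a small multilayer network trained on synthetic
    # "credit risk" records (inputs such as age, income, years employed).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))                    # hypothetical input variables
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # hypothetical output (repay / default)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler().fit(X_train)           # scale numeric inputs to a suitable range
    model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
    model.fit(scaler.transform(X_train), y_train)    # iterative adjustment of the weights

    print("held-out accuracy:", model.score(scaler.transform(X_test), y_test))

In practice, the network topology and learning method would be chosen from the families summarized below.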
Classification of neural networks according to network topology:

• Monolayer neural network: The simplest neural network, formed by an input neuron layer and an output neuron layer.
• Multilayer neural network: A neural network formed by several layers, in which there are hidden intermediate layers between the input layer and the output layer.
• Convolutional neural network (CNN): A multilayer network in which each part of the network is specialized to perform a task, thus reducing the number of hidden layers and allowing faster training.
• Recurrent neural network (RNN): A network without a layered structure that allows arbitrary connections between neurons.
• Radial basis function network (RBF): Calculates the output of the function according to the distance to a point called the center, avoiding local minima where back-propagation of information may be blocked.

Classification of neural networks according to the learning method:

• Supervised learning: Learning based on the supervision of a controller who evaluates and modifies the response according to its correctness or falsity.
• Error correction learning: Adjusts values according to the difference between expected and obtained values.
• Stochastic learning: Uses random changes that modify the weights of variables, keeping those that improve the results.
• Unsupervised or self-supervised learning: The algorithm itself and the internal rules of the neural network create consistent output patterns; the interpretation of the data depends on the learning algorithm used.
• Hebbian learning: Measures the familiarity and characteristics of the input data.
• Competitive and comparative learning: Consists of adding data and leaving out those that are more similar to the comparative input pattern, giving more weight to the data that meet this premise.
• Learning by reinforcement: Similar to supervised learning, but the only feedback provided is whether the output is acceptable or not.

The current predisposition and enthusiasm towards artificial intelligence is largely due to advances in deep learning, which is based on techniques that allow the algorithms that build and determine artificial neural networks to learn automatically. Deep learning draws on the functioning of the human brain, in which several layers of interconnected simulated neurons learn to understand more complex processes; deep learning networks can have more than ten layers with millions of neurons each. Deep learning is made possible by Big Data for training and teaching the systems and by storage capacity and system performance, both in terms of storage and in the development of cores, CPUs, and graphics cards.
Further Reading

Boonkiatpong, K., & Sinthupinyo, S. Applying multiple neural networks on large scale data. In 2011 International Conference on Information and Electronics Engineering, IPCSIT, Vol. 6. Singapore: Press.


NoSQL (Not Structured Query Language)

Rajeev Agrawal1 and Ashiq Imran2
1Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, USA
2Department of Computer Science & Engineering, University of Texas at Arlington, Arlington, TX, USA

Synonyms

Big Data; Big Data analytics; Column-based database; Document-oriented database; Key-value-based database

Introduction

A rapidly growing, humongous amount of data must be stored in databases, and NoSQL is increasingly used to store such Big Data. NoSQL systems are also called "not only SQL" or "not relational SQL" to emphasize that they may support SQL queries but go beyond them. Moreover, a NoSQL data store can accept all types of data – structured, semi-structured, and unstructured – much more easily than a relational database, so for applications with a mixture of datatypes a NoSQL database is a good option. Performance factors come into play with an RDBMS's data model, especially where "wide rows" are involved and update actions are frequent. A NoSQL data model such as Google's Bigtable, however, easily handles both situations and delivers very fast performance for both read and write operations.

NoSQL databases address the following opportunities (Cattell 2011):

• Large volumes of structured, semi-structured, and unstructured data
• Agile sprints, quick iteration, and frequent code pushes
• Flexible, easy-to-use object-oriented programming
• Efficient, scale-out architecture instead of expensive, monolithic architecture

Classification

NoSQL databases can be classified into four major categories (Han et al. 2011). The details are as follows:

Key-Value Stores Database: These systems typically store values and an index to find them, based on a user-defined key. Examples: FoundationDB, DynamoDB, MemcacheDB, Redis, Riak, LevelDB, RocksDB, BerkeleyDB, Oracle NoSQL Database, GenieDB, BangDB, Chordless, Scalaris, KAI, FairCom c-tree, LSM, KitaroDB, upscaleDB, STSDB, Maxtable, RaptorDB, etc.

Document Stores Database: These systems usually store documents as just defined. The documents are indexed, and a simple query mechanism is provided. Examples: Elastic, OrientDB, MongoDB, Cloud Datastore, Azure DocumentDB, Clusterpoint, CouchDB, Couchbase, MarkLogic, RethinkDB, SequoiaDB, RavenDB, JSON ODM, NeDB, Terrastore, AmisaDB, JasDB, SisoDB, DensoDB, SDB, iBoxDB, ThruDB, ReasonDB, IBM Cloudant, etc.

Graph Database: These systems are designed for data whose relations are well represented as a graph, such as social relations, public transport links, road maps, or network topologies. Examples: Neo4J, ArangoDB, Infinite Graph, Sparksee, TITAN, InfoGrid, HyperGraphDB, GraphBase, Trinity,
Bigdata, BrightstarDB, Onyx Database, VertexDB, FlockDB, Virtuoso, Stardog, Allegro, Weaver, Fallen 8, etc.

Column Database: These systems store extensible records that can be partitioned vertically and horizontally across nodes. Examples: Hadoop/HBase, Cassandra, Hortonworks, Scylla, HPCC, Accumulo, Hypertable, Amazon SimpleDB, Cloudata, MonetDB, Apache Flink, IBM Informix, Splice Machine, eXtremeDB Financial Edition, ConcourseDB, Druid, KUDU, Elassandra, etc.

There are also multimodel databases, which are designed to support multiple data models against a single, integrated back end. Some examples are Datomic, GunDB, CortexDB, AlchemyDB, WonderDB, RockallDB, and FoundationDB.

NoSQL Database: Traditional databases are primarily relational, but the NoSQL field has introduced several new types of databases. Each type is described below together with a representative example.

Key-Value Databases: Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection.

Redis

Redis is a key-value in-memory database: when Redis runs, the data are loaded entirely into memory, so all operations run in memory, and the data are periodically saved to the hard disk asynchronously. This pure in-memory operation gives it very good performance; it can handle more than 100,000 read or write operations per second. In addition, (1) Redis supports lists, sets, and various related operations; (2) the maximum size of a value is limited to 1 GB; and (3) the main drawback is that the capacity of the database is limited by physical memory, so Redis cannot be used for Big Data storage, and its scalability is poor.
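As a brief, hedged illustration of the key-value model just described, the following sketch uses the redis-py client; it assumes a locally running Redis server on the default port, and the key names and values are hypothetical.

    # Illustrative only: storing and retrieving values under user-defined keys.
    import redis  # assumes the redis-py package and a local Redis server

    r = redis.Redis(host="localhost", port=6379, db=0)

    r.set("user:1001:name", "Alice")                      # each key appears at most once
    r.lpush("user:1001:recent_items", "sku42", "sku17")   # Redis also supports lists and sets

    print(r.get("user:1001:name"))                        # b'Alice'
    print(r.lrange("user:1001:recent_items", 0, -1))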
Column-Oriented Databases

Although column-oriented databases have not displaced traditional row-oriented stores, in architectures with data compression, massively parallel processing, and shared-nothing designs, a column-oriented database can maintain high performance for data analysis and business intelligence processing. Column-oriented databases include HBase, HadoopDB, Cassandra, Hypertable, Bigtable, PNUTS, etc.

Cassandra: Cassandra is an open-source database originally developed at Facebook. Its features are as follows: (1) the schema is very flexible – there is no need to design the database schema up front, and adding or deleting fields is very convenient; (2) it supports range queries; and (3) it offers high scalability – a single point of failure does not affect the whole cluster, and it supports linear expansion. A Cassandra system is a distributed database made up of many database nodes; a write operation is replicated to other nodes, and a read request is routed to a particular node. For a Cassandra cluster, scalability is achieved simply by adding nodes. In addition, Cassandra supports rich data structures and a powerful query language.


Document-Oriented Database

A document-oriented database is a type of NoSQL database designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. In contrast to relational databases and their notions of "relations" (or "tables"), these systems are designed around an abstract notion of a "document." Unlike the key-value stores, these systems generally support secondary indexes and multiple types of documents (objects) per database, as well as nested documents or lists. Like other NoSQL systems, the document stores do not provide ACID transactional properties.


MongoDB

MongoDB is an open-source database used by companies of all sizes, across all industries, and for a wide variety of applications. It is an agile database that allows schemas to change quickly as applications evolve, while still providing the functionality developers expect from traditional databases, such as secondary indexes, a full query language, and strict consistency. MongoDB is built for scalability, performance, and high availability, scaling from single-server deployments to large, complex multisite architectures. By leveraging in-memory computing, MongoDB provides high performance for both reads and writes. MongoDB's native replication and automated failover enable enterprise-grade reliability and operational flexibility.
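As a brief, hedged illustration of the document model, the following sketch uses the PyMongo client; it assumes a locally running MongoDB server, and the database, collection, and field names are hypothetical.

    # Illustrative only: insert and query a schema-flexible document.
    from pymongo import MongoClient  # assumes the PyMongo package and a local MongoDB server

    client = MongoClient("mongodb://localhost:27017/")
    products = client["shop"]["products"]           # database "shop", collection "products"

    products.insert_one({"name": "trail shoe", "price": 89.0, "tags": ["outdoor", "running"]})
    products.create_index("name")                    # document stores support secondary indexes

    for doc in products.find({"tags": "running"}, {"_id": 0}):
        print(doc)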
Graph Database

A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data. A graph database is any storage system that provides index-free adjacency (Cattell 2011). This means that every element contains a direct pointer to its adjacent elements and no index lookups are necessary.


Neo4j

Neo4j is an open-source graph database, implemented in Java. The developers describe Neo4j as "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables." Neo4j is the most popular graph database.
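As a brief, hedged illustration of working with such a graph store, the following sketch uses the official Neo4j Python driver; the connection URI, credentials, and Cypher statements are placeholders and assume a locally running Neo4j instance.

    # Illustrative only: create two nodes and a relationship, then read them back.
    from neo4j import GraphDatabase  # assumes the neo4j Python driver package

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run(
            "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
            "MERGE (a)-[:FOLLOWS]->(b)",
            a="Alice", b="Bob",
        )
        result = session.run("MATCH (a:Person)-[:FOLLOWS]->(b:Person) RETURN a.name, b.name")
        for record in result:
            print(record["a.name"], "follows", record["b.name"])

    driver.close()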
NoSQL, a relatively new technology, has already attracted a significant amount of attention due to its use by massive websites like Amazon, Yahoo, and Facebook (Moniruzzaman and Hossain 2013).

NoSQL began within the domain of open source and a few small vendors, but continued growth in data has attracted many new players into the market. NoSQL solutions are attractive because they can handle huge quantities of data, relatively quickly, across a cluster of commodity servers that share resources. Additionally, most NoSQL solutions are open source, which gives them a price advantage over conventional commercial databases.


NoSQL Pros and Cons

Advantages
The major advantages of NoSQL databases are described below.

Open Source
Most NoSQL databases are open source, encouraging development across the whole software market. According to Basho chief technology officer Justin Sheehy, the open-source environment is healthy for NoSQL databases and lets users perform technical evaluations at low cost. SQL (relational) versus NoSQL scalability is a controversial topic, and there is some background to support this position. If new relational systems can do everything that a NoSQL system can, with analogous performance and scalability, and with the convenience of transactions and SQL, why would you choose a NoSQL system? Relational DBMSs have taken and retained majority market share over other competitors in the past 30 years: network, object, and XML DBMSs. Successful relational DBMSs have also been built to handle other specific application loads in the past: read-only or read-mostly data warehousing, distributed databases, and now horizontally scaled databases.

Fast Data Processing
NoSQL databases usually process data faster than relational databases. Relational databases are mostly used for business transactions that require great precision, and they generally subject all data to the same set of atomicity, consistency, isolation, and durability (ACID) constraints. As Uppsala University professor Tore Risch explains, atomicity means an update is performed completely or not at all; consistency means no part of a transaction is allowed to break a database's rules; isolation means each application runs transactions independently of other applications operating concurrently; and durability means that completed transactions will persist. Having to enforce these restraints makes the relational database slower.

Scalability
The NoSQL approach presents huge advantages over SQL databases because it allows one to scale
an application to new levels. The new data services are based on truly scalable structures and architectures, built for the cloud and built for distribution, and are very attractive to the application developer. There is no need for a DBA, no need for complicated SQL queries, and it is fast.

Disadvantages

There are some disadvantages to the NoSQL approach. These are less visible at the developer level but are highly visible at the system, architecture, and operational levels.

Lack of skilled authority at the system level: Not having a skilled authority to design a single, well-defined data model, regardless of the technology used, has its drawbacks. The data model may suffer from duplication of data objects (a non-normalized model). This can happen because of the different object models used by different developers and their mapping to the persistence model. At the system level, one must also understand the limitations of the chosen data service, whether in size, operations per second, concurrency model, etc.

Lack of interfaces and interoperability at the architecture level: Interfaces for NoSQL data services are yet to be standardized. Even DHT, which is one of the simpler interfaces, still has no standard semantics covering transactions, non-blocking APIs, and so on; each DHT service comes with its own set of interfaces. Another big issue is how different data structures, such as a DHT and a binary tree, to take one example, share data objects. There are no intrinsic semantics for pointers in all those services. In fact, there is usually not even strong typing in these services – it is the developer's responsibility to deal with that. Interoperability is an important point, especially when data needs to be accessed by multiple services. A simple example: if the back office works in Java and web serving works in PHP, can the data be accessed easily from both domains? Clearly one can use web services in front of the data as a data access layer, but that complicates things even more and reduces business agility, flexibility, and performance while increasing development overhead.

Less compliant with the operational realm: The operational environment requires a set of tools that is not only scalable but also manageable and stable, be it on the cloud or on a fixed set of servers. When something goes wrong, it should not require going through the whole chain and up to the developer level to diagnose the problem; in fact, that is exactly what operation managers regard as an operational nightmare. Operation needs to be systematic and self-contained. With the current NoSQL services available in the market, this is not easy to achieve, even in managed environments such as Amazon.

Conclusion

NoSQL is a big and expanding area, covering the classification of different types of NoSQL databases, performance measurement, advantages and disadvantages of NoSQL databases, and the current state of adoption of NoSQL databases. This article provides an independent understanding of the strengths and weaknesses of various NoSQL database approaches to supporting applications that process huge volumes of data, as well as a global overview of these non-relational NoSQL databases.

Further Reading

Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12–27.
Stonebraker, M. (2010). SQL databases v. NoSQL databases. Communications of the ACM, 53(4), 10–11.
Han, J., Haihong, E., Le, G., & Du, J. (2011, October). Survey on NoSQL database. In Pervasive computing and applications (ICPCA), 2011 6th international conference on (pp. 363–366). IEEE.
Moniruzzaman, A. B. M., & Hossain, S. A. (2013). NoSQL database: New era of databases for big data analytics – Classification, characteristics and comparison. International Journal of Database Theory and Application, 6(4), 1–14.


NSF

▶ Big Data Research and Development Initiative (Federal, U.S.)
Nutrition

Qinghua Yang1 and Yixin Chen2
1Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA
2Department of Communication Studies, Sam Houston State University, Huntsville, TX, USA

Nutrition is a science that helps people to make good choices of foods to keep healthy, by identifying the amount of nutrients they need and the amount of nutrients each food contains. Nutrients are chemicals obtained from the diet and are indispensable to people's health. Keeping a balanced diet containing all essential nutrients can prevent people from diseases caused by nutritional deficiencies such as scurvy and pellagra.

Although the United States has one of the most advanced nutrition sciences in the world, the nutrition status of the U.S. population is not optimistic. While nutritional deficiencies as a result of dietary inadequacies are not very common, many Americans are suffering from overconsumption-related diseases. Due to the excessive intake of sugar and fat, the prevalence of overweight and obesity in the American adult population increased from 47% to over 65% over the past three decades; currently two-thirds of American adults are overweight, and among them 36% are obese. Overweight and obesity are concerns not only for the adult population but also for children, with one third of American children being overweight or obese. Obesity kills more than 2.8 million Americans every year, and obesity-related health problems cost American taxpayers more than $147 billion every year. Thus, reducing the obesity prevalence in the United States has become a national health priority.

Big data research on nutrition holds tremendous promise for preventing obesity and improving population health. Recently, researchers have been trying to apply big data to nutritional research, taking advantage of the increasing amount of nutritional data and the accumulation of nutritional studies. Big data is a collection of data sets which are large in volume and complex in structure. For instance, the data managed by America's leading health care provider Kaiser is more than 4,000 times the amount of information stored in the Library of Congress. As to data structure, nutritional data and ingredients are very difficult to normalize. The volume and complexity of nutritional big data make it difficult to process them using traditional data analytic techniques.

Big data analyses can provide more valuable information than traditional data sets and reveal hidden patterns among variables. In a big data study sponsored by the National Bureau of Economic Research, economists Matthew Harding and Michael Lovenheim analyzed data on over 123 million purchasing decisions on food and beverages made in the U.S. between 2002 and 2007 and simulated the effects of various taxes on Americans' buying habits. Their model predicted that a 20% tax increase on sugar would reduce Americans' total caloric intake by 18% and reduce sugar consumption by over 16%. Based on their findings, they proposed a new policy of implementing a broad-based tax on sugar to improve public health. In another big-data study on human nutrition, two researchers at West Virginia University tried to understand and monitor the nutrition status of a population. They designed intelligent data collection strategies and examined the effects of food availability on obesity occurrence. They concluded that modifying environmental factors (e.g., availability of healthy food) could be the key to obesity prevention.

Big data can be applied to self-tracking, that is, monitoring one's nutrition status. An emerging trend in big data studies is the quantified self (QS), which refers to keeping track of one's nutritional, biological, and physical information, such as calories consumed, glycemic index, and specific ingredients of food intake. By pairing a self-tracking device with a web interface, QS solutions can provide users with nutrient-data aggregation, infographic visualization, and personal recommendations for diet.

Big data can also enable researchers to monitor global food consumption. One pioneering project is the Global Food Monitoring Group
conducted by The George Institute for Global Health with participation from 26 countries. With the support of these countries, the Group is able to monitor the nutrition composition of various foods consumed around the world, identify the most effective food reformulation strategies, and explore effective approaches to food production and distribution by food companies in different countries.

Thanks to the development of modern data collection and analytic technologies, the amount of nutritional, dietary, and biochemical data continues to increase at a rapid pace, along with a growing accumulation of nutritional epidemiologic studies. The field of nutritional epidemiology has witnessed a substantial increase in systematic reviews and meta-analyses over the past two decades: there were 523 meta-analyses and systematic reviews within the field of nutritional epidemiology in 2013 versus just 1 in 1985. However, in the era of "big data" there is an urgent need to translate big-data nutrition research into practice, so that doctors and policymakers can utilize this knowledge to improve individual and population health.

Controversy

Despite the exciting progress of big-data application in nutrition research, several challenges are equally noteworthy. First, to conduct big-data nutrition research, researchers often need access to a complete inventory of foods purchased in all retail outlets. This type of data, however, is not readily available, and gathering such information site by site is a time-consuming and complicated process. Second, information provided by nutrition big data may be incomplete or incorrect. For example, when doing self-tracking of nutrition status, many people fail to keep consistent daily documentation or suffer from poor recall of food intake. Also, big data analyses may be subject to systematic biases and generate misleading research findings. Lastly, since an increasing amount of personal data is being generated through quantified self-tracking devices, it is important to consider privacy rights in personal data. That individuals' personal nutritional data should be well protected, and that data shared and posted publicly should be used appropriately, are key ethical issues for nutrition researchers and practitioners. In light of these challenges, technical, methodological, and educational interventions are needed to deal with issues related to big-data accessibility, errors, and abuses.

Cross-References

▶ Biomedical Data
▶ Data Mining
▶ Health Informatics

Further Reading

Harding, M., & Lovenheim, M. (2017). The effect of prices on nutrition: Comparing the impact of product- and nutrient-specific taxes. Journal of Health Economics, 53.
Insel, P., et al. (2013). Nutrition. Boston: Jones and Bartlett Publishers.
Satija, A., & Hu, F. (2014). Big data and systematic reviews in nutritional epidemiology. Nutrition Reviews, 72(12).
Swan, M. (2013). The quantified self: Fundamental disruption in big data science and biological discovery. Big Data, 1(2).
WVU Today. WVU researchers work to track nutritional habits using 'Big Data'. http://wvutoday.wvu.edu/n/2013/01/11/wvu-researchers-workto-track-nutritional-habits-using-big-data. Accessed Dec 2014.
O

Online Advertising

Yulia A. Strekalova
College of Journalism and Communications, University of Florida, Gainesville, FL, USA

In a broad sense, online advertising means advertising through cross-referencing on a business's own web portal or on the websites of other online businesses. The goal of online advertising is to attract attention to advertised websites and products and, potentially, lead to an enquiry about a product, a mailing list subscription, or a purchase. Online advertising creates new cost-saving opportunities for businesses by reducing some of the risks of ineffective advertising spending. Online advertising types include banners, targeted ads, and social media community interactions, and each type requires careful planning and consideration of potential ethical challenges.

Online advertising analytics and measurement are necessary to assess the effectiveness of advertising efforts and the return on the funds invested. However, measurement is challenged by the fact that advertising across media platforms is increasingly interactive. For example, a TV commercial may lead to an online search, which will result in a relevant online ad, which may lead to a sale. Vast amounts of data and powerful analytics are necessary to allow advertisers to perform high-definition cross-channel analyses of the public and its behaviors, evaluate the return on investments across media, generate predictive models, and modify their campaigns in near-real time. The proliferation of data collection has given rise to increased concerns among Internet users and advocacy groups: as user data are collected and shared among multiple parties, they may become personally identifiable to a particular person.

Types of Online Advertising

Online advertising, a multibillion-dollar industry today, started from a single marketing email offering a new computer system, sent in 1978 to 400 users of the Advanced Research Projects Agency Network (ARPAnet). While the reactions to this first online advertising campaign were negative and identified the message as spam, email and forum-based advertising continued to develop and grow. In 1993, a company called Global Network Navigator sold the first clickable online ad. AT&T, one of the early adopters of this advertising innovation, received clicks from almost half of the Internet users who were exposed to its "Have you ever clicked your mouse right HERE? – You will." banner ad. In the 1990s the online advertising industry was largely fragmented, but the first ad networks started to appear and offer their customers opportunities to develop advertising campaigns that would place ads across a diverse set of websites and reach particular audience segments. An advertising banner may be placed on
high-traffic sites statically for a predefined period of time. While this method may be the least costly and can be targeted to a niche audience, it does not allow for rich data collection; banner advertising is a less sophisticated form of online advertising. Banner advertising can also be priced on a cost-per-mille (CPM, or cost-per-thousand) basis, an option that delivers an ad to website users and is usually priced per 1,000 impressions (the number of times an ad is shown), with an additional cost for clicks. This option allows businesses to assess how many times an ad was shown, but the method is limited in its ability to measure whether the return on the advertising investment covered the costs. Moreover, the proliferation of banners and the overall volume of information on sites lead to "banner blindness" among Internet users, and with the rapid rise of mobile phones as Internet connection devices, the average effectiveness of banners has become even lower. The use of banner and pop-up ads increased in the late 1990s and early 2000s, but Internet users started to block these ads with pop-up blockers, and clicks on banner ads dropped to about 0.1%.

The next innovation in online advertising is tied to the growing sophistication of search engines, which began to allow advertisers to place ads relevant to particular keywords. Tying advertising to relevant search keywords gave rise to pay-per-click (PPC) advertising, which provides advertisers with the most robust data for assessing whether the costs expended generated a sufficient return. PPC advertising means that advertisers are charged per click on an ad. This method ties exposure to advertising to an action by a potential consumer, thus providing advertisers with data on which sites are more effective. Google AdWords is an example of pay-per-click advertising linked to the keywords and phrases used in search: AdWords ads are correlated with these keywords and shown only to Internet users with relevant searches. By using PPC in conjunction with a search engine, like Google, Bing, or Yahoo, advertisers can also obtain insights on the environment or search terms that led a consumer to the ad in the first place.
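As a small illustration of the pricing arithmetic behind CPM and PPC campaigns, the following hedged sketch compares the cost of a hypothetical banner buy with a hypothetical pay-per-click buy; all figures are invented for the example.

    # Illustrative only: comparing hypothetical CPM (cost-per-thousand) and PPC campaigns.
    impressions = 200_000        # times the banner is shown
    cpm = 2.50                   # dollars per 1,000 impressions
    click_through_rate = 0.001   # roughly the 0.1% banner click rate noted above
    cost_per_click = 0.75        # dollars charged per click in the PPC campaign

    banner_cost = impressions / 1000 * cpm
    clicks = impressions * click_through_rate
    ppc_cost = clicks * cost_per_click

    print(f"CPM campaign: ${banner_cost:.2f} for {clicks:.0f} expected clicks")
    print(f"PPC campaign: ${ppc_cost:.2f} for the same {clicks:.0f} clicks")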
Online advertising may also include direct newsletter advertising delivered to potential customers who have purchased before. However, the decision to use this form of advertising should be coupled with an ethical way of employing it. Email addresses have become a commodity and can be bought, but a newsletter sent to users who have never bought from a company may backfire and lead to unintended negative consequences. Overall, this low-cost advertising method can be effective in keeping past customers informed about new products and other campaigns run by the company.

Social media is another advertising channel, and one that is rapidly growing in popularity. Social media networks have created repositories of psychographic data, which include user-reported demographic information, hobbies, travel destinations, life events, and topics of interest. Social media can be used as a more traditional advertising channel for PPC ad placements, but they can also serve as a base for customer engagement. Social media, although they require a commitment and time investment from advertisers, may generate brand loyalty. Social media efforts therefore require careful evaluation, as they can be costly both in direct advertising costs and in the time spent by company employees on developing and executing social media campaigns and keeping the flow of communication active. Data collected from social media channels can be analyzed at the individual level, which was nearly impossible with earlier online advertising methods. Companies can collect information about specific user communication and engagement behavior, track the communication activities of individual users, and analyze comments shared by social media users. At the same time, aggregate data may allow for general sentiment analysis, to assess whether overall comments about a brand are positive or negative and to seek out product-related signals shared by users. Social media evaluation, however, is challenged by the absence of a deep understanding of audience engagement
metrics and the lack of industry-wide benchmarks and evaluation standards. As a fairly new area of advertising, social media evaluation of likes, comments, and shares may be interpreted in a number of ways. Social media networks provide a framework for a new type of advertising, community exchange, but they are also channels of online advertising through real-time advertising targeting. It is likely that focused targeting will continue to be the focus of advertisers, as it increases the effectiveness of advertising efforts. At the same time, tracking of user behavior throughout the Web creates privacy concerns and policy challenges.

Targeting

Innovations in online advertising introduced targeting techniques that base advertising on the past browsing and purchase behaviors of Internet users. The proliferation of data collection has enabled advertisers to target potential clients based on a multitude of web activities, such as site browsing, keyword searches, and past purchases across different merchants. These targeting techniques led to the development of data collection systems that track user activity in real time and decide whether or not to advertise right as the user is browsing a particular page. Online advertising lacks rigorous standardization, and several targeting typologies have recently been proposed. Reviewing strategies for online advertising, Gabriela Taylor identifies nine distinct targeting methods, which overlap with or complement the targeting methods proposed by other authors. In general, targeting refers to situations in which the ads shown to an Internet user are relevant to that user's interests, as determined by the keywords used in searches, the pages visited, or the online purchases made.

Contextual targeting delivers ads to web users based on the content of the sites those users visit. In other words, contextually targeted advertising matches ads to the content of the webpage an Internet user is browsing. Systems managing contextual advertising scan websites for keywords and place ads that match these keywords most closely. For example, a user viewing a website about gardening may see ads for gardening and house-keeping magazines or home improvement stores.
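As a toy, hedged illustration of the keyword matching that contextual targeting systems perform, the following sketch scores candidate ads by their keyword overlap with the words on a page; the ads, keywords, and scoring rule are invented for the example and are far simpler than production ad-serving systems.

    # Illustrative only: pick the ad whose keywords best overlap the page content.
    import re

    page_text = "Tips for spring gardening: soil, seeds, and watering schedules"
    candidate_ads = {
        "gardening_magazine": {"gardening", "seeds", "soil", "plants"},
        "home_improvement_store": {"tools", "gardening", "paint", "lumber"},
        "sports_streaming": {"football", "scores", "league"},
    }

    page_words = set(re.findall(r"[a-z]+", page_text.lower()))
    scores = {ad: len(keywords & page_words) for ad, keywords in candidate_ads.items()}
    best_ad = max(scores, key=scores.get)

    print(scores)     # {'gardening_magazine': 3, 'home_improvement_store': 1, 'sports_streaming': 0}
    print(best_ad)    # 'gardening_magazine'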
Geo, or local, targeting focuses on determining the geographical location of a website visitor. This information, in turn, is used to deliver ads that are specific to a particular location, country, region or state, city, or metro area; in some cases, targeting can go as deep as the organizational level. The Internet protocol (IP) address assigned to each device participating in a computer network is used as the primary data point in this targeting method. The method may prevent the delivery of ads to users where a product or service is not available – for example, a content restriction for Internet television or region-specific advertising that complies with regional regulations.

Demographic targeting, as implied by its name, tailors ads based on website users' demographic information, such as gender, age, income and education level, marital status, ethnicity, language preferences, and other data points. Users may supply this information during social networking site registration, and the sites may additionally encourage users to "complete" their profiles after the initial registration in order to gain access to the fullest set of data.

Behavioral targeting looks at users' declared or expressed interests to tailor the content of delivered ads. Web-browsing information, data on the pages visited, the amount of time spent on particular pages, metadata for the links that were clicked, the searches conducted recently, and information about recent purchases are collected and analyzed by advertisement delivery systems to select and display the most relevant ads. In a sense, website publishers can create user profiles based on the collected data and use them to predict future browsing behavior and potential products of interest. This approach, using rich past data, allows advertisers to target their ads more effectively to the page visitors who are more likely to have an interest in those products or services. Combined with other strategies, including contextual,
geographic, and demographic targeting, this approach may lead to finely tuned and interest-tailored ads. The approach has proved effective: several studies have shown that although Internet users prefer to have no ads on the web pages they visit, they favor relevant ads over random ones.

DayPart, or time-based, targeting runs during specific times of the day or week, for example, 10 am to 10 pm local time Monday through Friday. Ads targeted with this method are displayed only during those days and times and go off during the off-times. Ads run through DayPart campaigns may focus on time-limited offers and create a sense of urgency among audience members. At the same time, such ads may create an increased sense of monitoring and lack of privacy among the users exposed to them.

Real-time targeting allows ad placement systems to place bids for advertisement placement in real time. Additionally, this advertising method allows advertisers to track every unique site user and collect real-time data to assess the likelihood of each visitor making a purchase.

Affinity targeting creates a partnership between a product producer and an interest-based organization to promote the use of a third-party product. This method targets customers who share an interest in a particular topic. These customers are assumed to have a positive attitude toward a website they visit and therefore a positive attitude toward more relevant advertising. The method is akin to niche advertising, and its success depends on the close match between the advertising content and the passions and interests of website users.

Look-alike targeting aims to identify prospective customers who are similar to the advertiser's customer base. Original customer profiles are determined based on the website use and previous behaviors of active customers. These profiles are then matched against a pool of independent Internet users who share common attributes and behaviors and are the likely targets for an advertised product. Identifying these look-alike audiences is challenged by the large number of possible input data points, which may or may not be defining for a particular behavior or user group.

Act-alike targeting is an outcome of predictive analytics. Advertisers using this method define profiles of customers based on their information consumption and spending habits. Customers and their past behaviors are identified, and they are segmented into groups to predict their future purchase behavior. The goal of this method is to identify the most loyal group of customers, who generate revenue for the company, and to engage with this group in the most effective and supportive way.

Privacy Concerns

Technology is developing at a speed too rapid for policy-making to catch up. Whichever advertising targeting method is used, each is based on an extensive collection and analysis of personal and behavioral data for each user, and this ongoing and potentially pervasive data collection raises important privacy questions and concerns. Omer Tene and Jules Polonetsky identify several privacy risks associated with big data. First is an incremental adverse effect on privacy from the ongoing accumulation of information. More and more data points are collected about individual Internet users, and once information about a real identity has been linked to a user's virtual identity, anonymity is lost; furthermore, disassociation of a user from a particular service may be insufficient to break a previously existing link, as other networks and online resources may have already harvested the missing data points. The second area of privacy risk is automated decision-making. These automated algorithms may lead to discrimination and threats to self-determination, and the targeting and profiling used in online advertising give ground to potential threats to free access to information and to an open, democratic society. The third area of privacy concern is predictive analysis, which may identify and predict stigmatizing behaviors or characteristics, like susceptibility to disease or undisclosed sexual orientation. In addition, predictive analysis may give ground to social
Online Identity 707

stratification by putting users in like-behaving clusters and ignoring outliers and minority groups. Finally, the fourth area of concern is the lack of access to information and the exclusion of smaller organizations and individuals from the benefits of big data. Large organizations are able to collect and use big data to price products close to an individual's reservation price or to corner an individual with a deal impossible to resist. At the same time, large organizations are seldom forthcoming in sharing individuals' information with those individuals in an accessible and understandable format.

Cross-References

▶ Content Management System (CMS)
▶ Data-Information-Knowledge-Wisdom (DIKW) Pyramid, Framework, Continuum
▶ Predictive Analytics
▶ Social Media

Further Reading

Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley.
Taylor, G. (2013). Advertising in a digital age: Best practices & tips for paid search and social media advertising. Global & Digital.
Tene, O., & Polonetsky, J. (2013). Privacy in the age of big data: A time for big decisions. Stanford Law Review Online, 11/5.
Turow, J. (2012). The daily you: How the advertising industry is defining your identity and your worth. New Haven: Yale University Press.


Online Analytical Processing

▶ Data Mining


Online Commerce

▶ E-Commerce


Online Identity

Catalina L. Toma
Communication Science, University of Wisconsin-Madison, Madison, WI, USA

Identity refers to the stable ways in which individuals or organizations think of and express themselves. The availability of big data has enabled researchers to examine online communicators' identity using generalizable samples. Empirical research to date has focused on personal, rather than organizational, identity, and on social media platforms, particularly Facebook and Twitter, given that these platforms require users to present themselves and their daily reflections to audiences. Research to date has investigated the following aspects of online identity: (1) expression, or how users express who they are, especially their personality traits and demographics (e.g., gender, age), through social media activity; (2) censorship, or how users suppress their urges to reveal aspects of themselves on social media; (3) detection, or the extent to which it is possible to use computational tools to infer users' identity from their social media activity; (4) audiences, or who users believe accesses their social media postings and whether these beliefs are accurate; (5) families, or the extent to which users include family ties as part of their identity portrayals; and (6) culture, or how users express their identities in culturally determined ways. Each of these areas of research is described in detail below.

Identity Expression

In its early days, the Internet appealed to many users because it allowed them to engage with one another anonymously. However, in recent years, users have overwhelmingly migrated toward personalized interaction environments, where they reveal their real identities and often connect with members of their offline networks. Such is the case with social media platforms. Therefore,
research has taken great interest in how users communicate various aspects of their identities to their audiences in these personalized environments.

One important aspect of people's identities is their personality. Big data has been used to examine how personality traits are reflected in people's social media activity. How do people possessing various personality traits talk, connect, and present themselves online? The development of the myPersonality Facebook application was instrumental in addressing these questions. myPersonality administers personality questionnaires to Facebook users and then informs them of their personality typology in exchange for access to all their Facebook data. The application has attracted millions of volunteers on Facebook and has enabled researchers to correlate Facebook activities with personality traits. The application, used in all the studies summarized below, measures personality using the Big Five Model, which specifies five basic personality traits: (1) extraversion, or an individual's tendency to be outgoing, talkative, and socially active; (2) agreeableness, or an individual's tendency to be compassionate, cooperative, trusting, and focused on maintaining positive social relations; (3) openness to experience, or an individual's tendency to be curious, imaginative, and interested in new experiences and ideas; (4) conscientiousness, or an individual's tendency to be organized, reliable, consistent, and focused on long-term goals and achievement; and (5) neuroticism, or an individual's tendency to experience negative emotions, stress, and mood swings.

One study conducted by Yoram Bachrach and his colleagues investigated the relationship between Big Five personality traits and Facebook activity for a sample of 180,000 users. Results show that individuals high in extraversion had more friends, posted more status updates, participated in more groups, and "liked" more pages on Facebook; individuals high in agreeableness appeared in more photographs with other Facebook users but "liked" fewer Facebook pages; individuals high in openness to experience posted more status updates, participated in more groups, and "liked" more Facebook pages; individuals high in conscientiousness posted more photographs but participated in fewer groups and "liked" fewer Facebook pages; and individuals high in neuroticism had fewer friends but participated in more groups and "liked" more Facebook pages. A related study, conducted by Michal Kosinski and his colleagues, replicated these findings on a sample of 350,000 American Facebook users, the largest dataset to date on the relationship between personality and Internet behavior.

Another study examined the relationship between personality traits and word usage in the status updates of over 69,000 English-speaking Facebook users. Results show that personality traits were indeed reflected in natural word use. For instance, extroverted users used words reflecting their sociable nature, such as "party," whereas introverted users used words reflecting their more solitary interests, such as "reading" and "Internet." Similarly, highly conscientious users expressed their achievement orientation through words such as "success," "busy," and "work," whereas users high in openness to experience expressed their artistic and intellectual pursuits through words like "dreams," "universe," and "music."

In sum, this body of work shows that people's identity, operationalized as personality traits, is illustrated in the actions they undertake and the words they use on Facebook. Given social media platforms' controllable nature, which allows users time to ponder their claims and the ability to edit them, researchers argue that these digital traces likely illustrate users' intentional efforts to communicate their identity to their audience, rather than being unintentionally produced.

Identity Censorship

While identity expression is frequent in social media and, as discussed above, illustrated by behavioral traces, sometimes users suppress identity claims despite their initial impulse to divulge them. This process, labeled "last-minute self-censorship," was investigated by Sauvik Das and Adam Kramer using data from 3.9 million
Facebook users over a period of 17 days. Censorship was measured as instances when users entered text in the status update or comment boxes on Facebook but did not post it in the next 10 min. The results show that 71% of the participants censored at least one post or comment during the time frame of the study. On average, participants censored 4.52 posts and 3.20 comments. Notably, 33% of all posts and 13% of all comments written by the sample were censored, indicating that self-censorship is a fairly prevalent phenomenon. Men censored more than women, presumably because they are less comfortable with self-disclosure. This study suggests that Facebook users take advantage of controllable media affordances, such as editability and unlimited composition time, in order to manage their identity claims. These self-regulatory efforts are perhaps a response to the challenging nature of addressing large and diverse audiences, whose interpretation of the poster's identity claims may be difficult to predict.

Identity Detection

Given that users leave digital traces of their personal characteristics on social media platforms, research has been concerned with whether it is possible to infer these characteristics from social media activity. For instance, can we deduce users' gender, sexual orientation, or personality from their explicit statements and patterns of activity? Is their identity implicit in their social media activity, even though they might not disclose it explicitly?

One well-publicized study by Michal Kosinski and his colleagues sought to predict Facebook users' personal characteristics from their "likes" – that is, Facebook pages dedicated to products, sports, music, books, restaurants, and interests – that users can endorse and with which they can associate by clicking the "like" button. The study used a sample of 58,000 volunteers recruited through the myPersonality application. Results show that, based on Facebook "likes," it is possible to predict a user's ethnic identity (African-American vs. Caucasian) with 95% accuracy, gender with 93% accuracy, religion (Christian vs. Muslim) with 82% accuracy, political orientation (Democrat vs. Republican) with 85% accuracy, sexual orientation among men with 88% accuracy and among women with 75% accuracy, and relationship status with 65% accuracy. Certain "likes" stood out as having particularly high predictive ability for Facebook users' personal characteristics. For instance, the best predictors of high intelligence were "The Colbert Report," "Science," and, unexpectedly, "curly fries." Conversely, low intelligence was indicated by "Sephora," "I Love Being a Mom," "Harley Davidson," and "Lady Antebellum."

In the area of personality, two studies found that users' extraversion can be most accurately inferred from Facebook profile activity (e.g., group membership, number of friends, number of status updates); neuroticism, conscientiousness, and openness to experience can be reasonably inferred; and agreeableness cannot be inferred at all. In other words, Facebook activity renders extraversion highly visible and agreeableness opaque.

Language can also be used to predict online communicators' identity, as shown by Andrew Schwartz and his colleagues in a study of 15.4 million Facebook status updates, totaling over 700 million words. Language choice, including words, phrases, and topics of conversation, was used to predict users' gender, age, and Big Five personality traits with high accuracy.

In sum, this body of research suggests that it is possible to infer many facets of Facebook users' identity through automated analysis of their online activity, regardless of whether they explicitly choose to divulge this identity. While users typically choose to reveal their gender and ethnicity, they can be more reticent in disclosing their relational status or sexual orientation and might themselves be unaware of their personality traits or intelligence quotient. This line of research raises important questions about users' privacy and the extent to which this information, once automatically extracted from Facebook activity, should be used by corporations for marketing or product optimization purposes.
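To make the general logic of such identity detection concrete, the following hedged sketch trains a simple classifier on a synthetic user-by-"like" matrix. It is a toy illustration of the approach, not the actual model, features, or data used in the studies described above.

    # Illustrative only: predicting a binary trait from which pages users "like".
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    n_users, n_likes = 2000, 50
    likes = rng.integers(0, 2, size=(n_users, n_likes))       # 1 = user liked page j
    trait = (likes[:, 0] | likes[:, 3]).astype(int)            # synthetic trait tied to two pages

    X_train, X_test, y_train, y_test = train_test_split(likes, trait, random_state=1)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    print("held-out accuracy:", clf.score(X_test, y_test))
    print("most predictive pages:", np.argsort(clf.coef_[0])[-3:])   # indices of top "likes"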
Real and Imagined Audience for Identity Claims

The purpose of many online identity claims is to communicate a desired image to an audience. Therefore, the process of identity construction involves understanding the audience and targeting messages to them. Social media, such as Facebook and Twitter, where identity claims are posted very frequently, pose a conundrum in this regard, because audiences tend to be unprecedentedly large, sometimes reaching hundreds and thousands of members, and diverse. Indeed, "friends" and "followers" are accrued over time and often belong to different social circles (e.g., high school, college, employment). How do users conceptualize their audiences on social media platforms? Are users' mental models of their audiences accurate?

These questions were addressed by Michael Bernstein and his colleagues in a study focusing specifically on Facebook users. The study used a survey methodology, in which Facebook users indicated their beliefs about how many of their "friends" viewed their Facebook postings, coupled with large-scale log data for 220,000 Facebook users, in which researchers captured the actual number of "friends" who viewed users' postings. Results show that, by and large, Facebook users underestimated their audiences. First, they believed that any specific status update they posted was viewed, on average, by 20 "friends," when in fact it was viewed by 78 "friends." The median estimate of the audience size for any specific post was only 27% of the actual audience size, meaning that participants underestimated the size of their audience by a factor of 4. Second, when asked how many total audience members they had for their profile postings during the past month, Facebook users believed it was 50, when in fact it was 180. The median perceived audience for the Facebook profile, in general, was only 32% of the actual audience, indicating that users underestimated their cumulative audience by a factor of 3. Slightly less than half of Facebook users indicated they wanted a larger audience for their identity claims than they thought they had, ironically failing to understand that they did in fact have this larger audience. About half of Facebook users indicated that they were satisfied with the audience they thought they had, even though their audience was actually much greater than they perceived it to be. Overall, this study highlights a substantial mismatch between users' beliefs about their audiences and their actual audiences, suggesting that social media environments are translucent, rather than transparent, when it comes to audiences. That is, actual audiences are somewhat opaque to users, who as a result may fail to properly target their identity claims to their audiences.

Family Identity

One critical aspect of personal identity is family ties. To what extent do social media users reveal their family connections to their audience, and how do family members publicly talk to one another on these platforms? Moira Burke and her colleagues addressed these questions in the context of parent-child interactions on Facebook. Results show that 37.1% of English-speaking US Facebook users specified either a parent or child relationship on the site. About 40% of teenagers specified at least one parent on their profile, and almost half of users age 50 or above specified a child on their profile. The most common family ties were between mothers and daughters (41.4% of all parent-child ties), followed by mothers and sons (26.8%), fathers and daughters (18.9%), and least of all fathers and sons (13.1%). However, Facebook communication between parents and children was limited, accounting for only 1–4% of users' public Facebook postings. When communication did happen, it illustrated family identities: parents gave advice to children, expressed affection, and referenced extended family members, particularly grandchildren.

Cultural Identity

Another critical aspect of personal identity is cultural identity. Is online communicators' cultural
Is online communicators' cultural identity revealed by their communication patterns? Jaram Park and colleagues show that Twitter users create emoticons that reflect an individualistic or collectivistic cultural orientation. Specifically, users from individualistic cultures preferred horizontal and mouth-oriented emoticons, such as :), whereas users from collectivistic cultures preferred vertical and eye-oriented emoticons, such as ^_^. Similarly, a study of self-expression using a sample of four million Facebook users from several English-speaking countries (USA, Canada, UK, Australia) shows that members of these cultures can be differentiated through their use of formal or informal speech, the extent to which they discuss positive personal events, and the extent to which they discuss school. In sum, this research shows that cultural identity is evident in linguistic self-expression on social media platforms.

Cross-References

▶ Anonymity
▶ Behavioral Analytics
▶ Facebook
▶ Privacy
▶ Profiling
▶ Psychology

Further Reading

Bachrach, Y., et al. (2012). Personality and patterns of Facebook usage. In Proceedings of the 3rd Annual Web Science Conference (pp. 24–32). Association for Computing Machinery.
Bernstein, M., et al. (2013). Quantifying the invisible audience in social networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 21–30). Association for Computing Machinery.
Burke, M., et al. (2013). Families on Facebook. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM) (pp. 41–50). Association for the Advancement of Artificial Intelligence.
Das, S., & Kramer, A. (2013). Self-censorship on Facebook. In Proceedings of the 2013 Conference on Computer-Supported Cooperative Work (pp. 793–802). Association for Computing Machinery.
Kern, M., et al. (2014). The online social self: An open vocabulary approach to personality. Assessment, 21, 158–169.
Kosinski, M., et al. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110, 5802–5805.
Kramer, A., & Chung, C. (2011). Dimensions of self-expression in Facebook status updates. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM) (pp. 169–176). Association for the Advancement of Artificial Intelligence.
Park, J., et al. (2014). Cross-cultural comparison of nonverbal cues in emoticons on Twitter: Evidence from big data analysis. Journal of Communication, 64, 333–354.
Schwartz, A., et al. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS One, 8, e73791.

Ontologies

Anirudh Prabhu
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA

Synonyms

Computational ontology; Knowledge graph; Semantic data model; Taxonomy; Vocabulary

Definition

Ontology provides a rich description of the:

• Terminology, concepts, nomenclature
• Relationships among and between concepts and individuals
• Sentences distinguishing concepts, refining definitions and relationships (constraints, restrictions, regular expressions)

relevant to a particular domain or area of interest (Kendall and McGuinness 2019).
An ontology defines a common vocabulary for researchers who need to exchange information in a domain. It can include machine-interpretable definitions of basic concepts in the domain and relations among them (Noy and McGuinness 2001).
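As a hedged illustration of such a machine-interpretable vocabulary, the sketch below encodes two concepts and one relation in RDF using the Python rdflib library and then queries them with SPARQL. The ex: namespace, the class names, and the property are hypothetical examples, not terms from any published ontology.

```python
# Minimal sketch: a shared vocabulary expressed in Turtle and loaded with rdflib.
from rdflib import Graph

turtle = """
@prefix ex:   <http://example.org/vocab#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

ex:Instrument  a owl:Class ; rdfs:comment "A device that produces observations." .
ex:Observation a owl:Class ; rdfs:comment "A single recorded measurement." .
ex:producedBy  a owl:ObjectProperty ;
               rdfs:domain ex:Observation ;
               rdfs:range  ex:Instrument .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# Any program that understands this vocabulary can ask the same question.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?term ?definition WHERE { ?term rdfs:comment ?definition . }
"""
for term, definition in g.query(query):
    print(term, "-", definition)
```

Because both the terms and their definitions are machine readable, a second application could load the same graph and interpret ex:producedBy in exactly the same way, which is the sense in which an ontology provides a common vocabulary for exchanging information.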
The specification of an ontology can vary depending on the reason for developing an ontology, the domain of the ontology, and its intended use.

History

The first ever written use of the word ontology comes from the Latin word “ontologia,” coined in 1613, independently, by two philosophers, Rudolf Göckel in his “Lexicon Philosophicum” and Jacob Lorhard in his “Theatrum Philosophicum” (Smith and Welty 2001). In English, the first recorded use of the word was seen in Bailey's dictionary of 1721, where ontology is defined as “an Account of being in the Abstract” (Smith and Welty 2001).
Artificial intelligence researchers in the 1980s were the first to adopt this word in computer science. These researchers recognized that one could create ontologies (information models) which could leverage automated reasoning capabilities.
In the 1990s, Thomas Gruber wrote two influential papers titled “Toward Principles for the Design of Ontologies Used for Knowledge Sharing” and “A Translation Approach to Portable Ontology Specifications.” In the first paper, he introduced the notion of ontologies as designed artifacts. He also provides a guide for designing formal ontologies based on five design criteria. These criteria will be described in more detail later in the entry.
In 2001, Tim Berners-Lee, Jim Hendler, and Ora Lassila described the evolution of the then existing Web (World Wide Web) to the Semantic Web. In that article, Berners-Lee informally defines the semantic web as “an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation” (Kendall and McGuinness 2019).
As the web of data grows, the amount of machine-readable data that can be used with ontologies also increases. And with the World Wide Web being highly ubiquitous, data from a variety of disciplines (or domains) are thus increasingly available for discovery, access, and use.

Ontology Components

Most ontologies are structurally similar. Common components of ontologies include:

• Individuals (or instances) are the basic ground-level components of an ontology. An ontology together with a set of individual instances is commonly called a knowledge base.
• Classes describe concepts in the domain. A class can have subclasses that represent concepts more specific than their superclass. Classes are the focus of most ontologies.
• Attributes (or annotations) are assigned to classes or subclasses to help describe classes in an ontology.
• Relations (or property relations) help specify how the entities in an ontology are related to other entities.
• Restrictions are formal conditions that affect the acceptance of assertions as input.

Ontology Engineering

Ontology engineering encompasses the study of the ontology development process, including methodologies, tools, and languages for building ontologies.

Why Develop Ontologies?

There are many reasons to develop an ontology. For example, ontologies help share a common understanding of the structure of information among people or software agents. Ontologies are also developed to make domain assumptions explicit and to enable the analysis and reuse of domain knowledge (Noy and McGuinness 2001).
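To make the components listed above concrete, the following hedged sketch expresses each of them (classes, a subclass, an annotation, a relation, an individual, and a restriction) as RDF/OWL statements with rdflib. All ex: names are hypothetical and chosen only for illustration.

```python
# Sketch of the common ontology components as RDF/OWL triples built with rdflib.
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/onto#")
g = Graph()
g.bind("ex", EX)

# Classes and a subclass (concepts in the domain)
g.add((EX.Vehicle, RDF.type, OWL.Class))
g.add((EX.Truck, RDF.type, OWL.Class))
g.add((EX.Truck, RDFS.subClassOf, EX.Vehicle))

# An attribute (annotation) describing a class
g.add((EX.Truck, RDFS.comment, Literal("A vehicle built to transport cargo.")))

# A relation (object property) between entities
g.add((EX.hasPart, RDF.type, OWL.ObjectProperty))

# An individual (instance); classes plus individuals form a knowledge base
g.add((EX.truck42, RDF.type, EX.Truck))

# A restriction: every Truck must have some part that is an Engine
g.add((EX.Engine, RDF.type, OWL.Class))
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.hasPart))
g.add((restriction, OWL.someValuesFrom, EX.Engine))
g.add((EX.Truck, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))
```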
Ontology Development Process

There isn't a single “correct” way to develop an ontology. This entry will be following a combination of the methodologies described in Kendall and McGuinness (2019) and Noy and McGuinness (2001).
Development of an ontology usually starts from its use case(s). Most ontology development projects are driven by the research questions that are derived from a specific use case. These questions would typically be the starting point for developing the ontology.
Once the use case has been completed, it is time to determine the domain and scope of the ontology. The domain and scope of the ontology can be defined by answering several basic questions like (1) What is the domain that the ontology will cover? (2) For what are we going to use this ontology? (3) For what types of questions will the information in the ontology provide answers? (4) Who will use and maintain the ontology? (Noy and McGuinness 2001).
These answers may change and evolve over the course of the development process, but they help limit the scope of the ontology. Another way of determining the scope of the ontology is to use competency questions. Competency questions are ones that the users would want to use the ontology to answer. Competency questions can also be used for evaluating the success of the ontology later in the development process.
The next step in the ontology development process is to enumerate the terms required in the ontology. As seen in the flowchart, there are many sources for the list of terms. There is a plethora of well-developed ontologies and vocabularies available for use. So, if parts (or all) of the other available ontologies fit into the ontology being developed, they should without question be imported and used. Reusing existing ontologies and vocabularies is one of the best practices in ontology development. Some other sources include database schemas, data dictionaries, text documents, and terms obtained from domain experts both within and outside the primary development team.
Once a final “term list” (or concept spreadsheet) has been compiled, it is time to build the ontology. This ontology can be built in one of two ways. One method is to build the ontology manually using domain-specific concepts, related domain concepts, authoritative vocabularies, vetted definitions, and supporting citations from the literature. The other method is to use automated tools and scripts to generate the ontology from concept spreadsheets. Human evaluation of the results in the automated ontology generation process remains important to ensure the ontology generated accurately represents the expert's view of the domain.
After the ontology has been generated, it should be available to the users to explore or use in different applications. Using an interactive ontology browser enables discovery, review, and commentary of concepts.
Ontology development is an ongoing process and by no means is finished once the ontology has been generated. The ontology is maintained and curated by the ontology development team, domain collaborators, invited experts, and the consumers. The user community can also engage in commentary and recommend changes to the ontology.

Design Criteria for Ontologies

Numerous design decisions need to be made while developing an ontology. It is important to follow a guide or a set of objective criteria in order to make a “well designed” ontology. Generally, the design criteria suggested by Thomas Gruber are used for this purpose. Gruber proposed a preliminary set of criteria for ontologies whose purpose is knowledge sharing and interoperation among programs (and applications) based on a shared conceptualization (Gruber 1995).

Clarity: An ontology should be clear in communicating the meaning of its terms. Definitions should be objective and, where possible, definitions should be stated in logical axioms.
“A complete definition (a predicate defined by necessary and sufficient conditions) is preferred over a partial definition (defined by only necessary or sufficient conditions). All definitions should be documented with natural language” (Gruber 1995).
Coherence: All the inferences in an ontology should be logically consistent with its definitions (both formal and informal) of its concepts. “If a sentence that can be inferred from the axioms contradicts a definition or example, then the ontology is incoherent” (Gruber 1995).
Extendibility: The ontology should be designed by taking potential future extensions into account. When a user extends the ontology in the future, there should not be a need to revise existing definitions.
Minimal encoding bias: An ontology should be independent of the issues of the implementing language. An encoding bias occurs when the developers make design choices based on ease of implementation.
Minimal ontological commitments: An ontology should make very few claims about the domain being modeled. This gives the user the freedom to specialize or generalize the ontology depending on their need.

Ontology Languages

An ontology language is a formal language used to encode the ontology. Some of the commonly used ontology languages are RDF+RDF(S), OWL, SKOS, KIF, and DAML+OIL (Kalibatiene and Vasilecas 2011).
RDF + RDFS: “The Resource Description Framework (RDF) is a recommendation for describing resources on the web developed by the World Wide Web Consortium (W3C). It is designed to be read and understood by computers, not displayed to people” (Kalibatiene and Vasilecas 2011). “An RDF Schema (RDFS) is an RDF vocabulary that provides identification of classes, inheritance relations for classes, inheritance relations for properties, and domain and range properties” (Kendall and McGuinness 2019).
OWL: The Web Ontology Language (OWL) is a standard ontology language for the semantic web. “OWL includes conjunction, disjunction, existentially, and universally quantified variables, which can be used to carry out logical inferences and derive knowledge” (Kalibatiene and Vasilecas 2011). “Version 2 for OWL was adopted in October 2009 by W3C with minor revisions in December 2012” (Kendall and McGuinness 2019). OWL 2 introduced three sublanguages (called Profiles): OWL-EL, OWL-QL, and OWL-RL. The OWL 2 profiles are trimmed-down versions of OWL 2, since they trade expressive power for the efficiency of reasoning.
OWL 2 EL captures the expressive power of ontologies with a large number of properties and/or classes. OWL 2 EL performs reasoning in polynomial time with respect to the size of the ontology (Motik et al. 2009).
In applications with large instance data, query answering is the most important reasoning task. For such use cases, OWL 2 QL is used because QL implements conjunctive query answering using conventional relational database systems. In OWL 2 QL, reasoning can be performed in LOGSPACE with respect to the size of the assertions (Motik et al. 2009).
OWL 2 RL systems can be implemented using rule-based reasoning engines and are aimed at applications where scalable reasoning is required without sacrificing too much expressive power. The ontology consistency, class expression satisfiability, class expression subsumption, instance checking, and conjunctive query answering problems can be solved in time that is polynomial with respect to the size of the ontology (Motik et al. 2009).
SKOS: The Simple Knowledge Organization System (SKOS) data model is a W3C recommendation for sharing and linking knowledge organization systems via the web. SKOS is particularly useful for encoding knowledge organization systems like thesauri, classification schemes,
subject heading systems, and taxonomies (Isaac and Summers 2009). The SKOS data model, which is formally defined as an OWL ontology, represents a knowledge organization system as a concept scheme, consisting of a set of concepts. SKOS data are expressed as RDF triples and can be encoded using any RDF syntax (Isaac and Summers 2009).
KIF: The Knowledge Interchange Format is a computer-oriented language for the interchange of knowledge among disparate programs. KIF is logically comprehensive and has declarative semantics (i.e., the meaning of expressions in the representation can be understood without appeal to an interpreter for manipulating those expressions). It also provides for representation of meta-knowledge and nonmonotonic reasoning rules. Lastly, it also provides for definitions of objects, functions, and relations.
DAML + OIL: DARPA Agent Markup Language + Ontology Inference Layer is a semantic markup language for Web resources. DAML + OIL provides a rich set of constructs with which to create machine-readable and understandable ontologies. It is a precursor to OWL and is rarely used in contemporary ontology engineering.

Ontology Engineering and Big Data

Ontology engineering faces the same scalability problems faced by most symbolic artificial intelligence systems. Ontologies are mostly built by humans (specifically a team of ontology and domain experts who work together). The increase in the number of instances in the ontology, or the broadening of the scope of the ontology, i.e., adding parts of the domain previously unaddressed in the ontology (which results in additions and changes to the classes and properties of the ontology), is difficult to implement for ontologies that are continuously developed and improved. Ontologies have the capability to add new instances to the knowledge base without concern for the volume of data, as long as the correct concepts are identified. The same cannot be said for restrictions or axioms in an ontology, with the possible exception of RDF, where descriptions can be translated into first-order predicate calculus logical axioms. In practice though, large volumes of data need to be processed from their original form (either unstructured text or XML, HTML documents, etc.) into instances in the knowledge base. With the amount of data increasing rapidly, processing data from multiple sources to populate the instances in a knowledge base requires the application of automated methods in order to aid the development of an ontology, i.e., ontology engineering.

Ontology Learning

The scaling up of an ontology to include large volumes of data (processed into instances) remains an ongoing problem, which is being fervently researched. Researchers often seek to automate either parts or all of the ontology engineering process in order to eliminate, or at least reduce, the increasing workload. This process of automation is called ontology learning. Cimiano et al. (2009) and Asim et al. (2018) provide a comprehensive introduction to ontology learning, a generic architecture of such systems, and a discussion of current problems and challenges in the field. The general trend in ontology learning is focused on the acquisition of either taxonomic hierarchies or the construction of knowledge bases for existing ontologies. Such approaches do help resolve some of the scalability problems in ontology engineering.
OWL allows for modeling expressive axiomatizations. The field of automated axiom generation becomes a very important branch of ontology learning, because reasoning using axioms and restrictions is critical in inferring knowledge from an ontology, and as additional classes or instances are added to the ontology, there is a need to add new axioms that address the newly added data.
Automated axiom generation, however, is a nascent and underexplored area of research. There are some systems that can automatically generate disjointness axioms or use inductive logic programming to generate general axioms from schematic axioms (Cimiano et al. 2009; Asim et al. 2018). There are also methods that can extract rules from text documents (Dragoni et al. 2016). However, these methods require the rule to be explicitly stated (in natural language) in the text document. Truly automated axiom generation methods, which can learn to generate axioms based on the data in the knowledge base, still remain an unsolved and fervently researched problem.

Further Reading

Asim, M. N., Wasim, M., Khan, M. U. G., Mahmood, W., & Abbasi, H. M. (2018). A survey of ontology learning techniques and applications. Database (Oxford), 2018, bay101. https://doi.org/10.1093/database/bay101.
Cimiano, P., Mädche, A., Staab, S., & Völker, J. (2009). Ontology learning. In Handbook on ontologies (pp. 245–267). Berlin/Heidelberg: Springer.
Dragoni, M., Villata, S., Rizzi, W., & Governatori, G. (2016, December). Combining NLP approaches for rule extraction from legal documents. In 1st workshop on mining and reasoning with legal texts (MIREL 2016).
Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5–6), 907–928.
Isaac, A., & Summers, E. (2009). SKOS simple knowledge organization system primer. World Wide Web Consortium (W3C).
Kalibatiene, D., & Vasilecas, O. (2011). Survey on ontology languages. In Perspectives in business informatics research (pp. 124–141). Cham: Springer.
Kendall, E., & McGuinness, D. (2019). Ontology engineering (Synthesis lectures on the semantic web: Theory and technology). Morgan & Claypool.
Motik, B., Grau, B. C., Horrocks, I., Wu, Z., Fokoue, A., & Lutz, C. (2009). OWL 2 web ontology language profiles. W3C recommendation, 27, 61.
Noy, N. F., & McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology. http://ftp.ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness.pdf.
Smith, B., & Welty, C. (2001, October). Ontology: Towards a new synthesis. In Formal ontology in information systems (Vol. 10(3), pp. 3–9). Ongunquit: ACM Press.

Open Data

Alberto Luis García
Departamento de Ciencias de la Comunicación Aplicada, Facultad de Ciencias de la Información, Universidad Complutense de Madrid, Madrid, Spain

The link between Open Data and Big Data is articulated in relation to the need of public administrations to manage and analyze large volumes of data. The management of these data is based on the same technology used for Big Data, but the difference with Open Data lies in the origin of these data.
Today's democratic societies demand transparency in the management of their resources, so they demand open governments. At the same time, the European Union points to Open Data as a tool for innovation and growth.
Therefore, they have legislated the use of Open Data through portals where data repositories are hosted, following the workflow established with the use of Big Data: classification and quality control of information, processing architectures, and usability in the presentation and analysis of data.
Open Data refers to documents held by the public sector that may be used by individuals or legal entities, for commercial or noncommercial purposes, provided that this use does not constitute a public administrative activity. In any case, for all this information to be considered Open Data, it must be published in standard, open, and interoperable formats, allowing easier access and reuse. Open Data must complete the following sequence in order to realize its main purposes and to convert the information into a public service that can be reused for strategic purposes:

1. Complete: It is necessary and reasonable to attempt to cover the entire spectrum of data available on the subject that the data comes from. This step is crucial because it determines the success of the rest of the process.
The data must also be structured so that they can expand permanently and be constantly updated, so that they can achieve the overall purpose of open access.
2. Accessible: This characteristic is key to the relevance of Open Data. Accessibility should be universal, and the data should be treated in a format, both for search and for output, that enables all citizens, individual or commercial, to access and use them directly as a source of information.
3. Free: The system of free access and use should be provided for in the legislation itself that regulates public access to Open Data. However, the ultimate goal is to reuse data for commercial, social, economic, political, and other purposes, to help integrate them into strategies and decision making by the executive staff of companies in the private and public sector. This feature underlies the legal possibility of allowing access to data that otherwise would violate privacy rights.
4. Nondiscriminatory: In this respect, nondiscrimination should occur in two ways. First, there should be the possibility of universal access through systems and web pages that meet accessibility standards for anyone; on the other hand, the data must respect the particularities of gender, age, and religion and must meet the information needs of all social, religious, and ethnic groups that request them.
5. Nonproprietary: Data must be public and must not belong to any organization or private institution. The management must be controlled by government agencies in response to the regulations of each country in this regard. In any case, the defense of the individual rights and personal freedoms of individuals and institutions should be honored and respected. This use of documents held by the public sector can be performed by individuals or legal entities, for commercial or noncommercial purposes, provided such use does not constitute a public administrative activity. The ownership of the data, therefore, remains with the public sector, although their use is regulated so that economic activities can be established and developed for private gain. In any case, data reuse activities cannot be assumed to override public decision making or to present it as already settled for all citizens. In Spain, for example, use is regulated by Law 37/2007, of 16 November, on the reuse of public sector information, which adopts the philosophy of Directive 2003/98/EC, which states that “the use of such documents for other reasons set out either commercial or non-commercial purposes is reuse.”

In this sense, the principal Open Data agents (Users, Facilitators, and Infomediaries) must be attentive that all processes respect these basic principles.
The Users (citizens or professionals) who demand information or new services initiate the process of action for Open Data access, this being the ultimate goal around which to structure the organization and management of data access. Data suppliers are only public administrations; other kinds of suppliers must be integrated into the public regulation organism to ensure the correct use of the data.
The Facilitators promote legal schemes and technical mechanisms making reuse possible.
The Infomediaries are the creators of the products and services based on the sources (students, professionals, or public, private, or third-sector entities). The ultimate goal of Open Data is the added value extracted from it through the generation of applications and services that help solve specific demands from users; without this principle, there would be no value action on Open Data. Also, how to monetize this information service is an emerging economic value for businesses that need access to the peculiarities and needs of each individual in order to promote every product and brand message more effectively.
Thus, the advantages of Open Data act in this direction and are distributed in two main parts in response to social criteria: public and private benefits.
As far as public benefits are concerned, we could say that the main one relates to Open Government and transparency. There is a new trend based on transparency, participation, and collaboration, advocating a model of participatory democracy. Open Data provides the necessary and sufficient information to create broader democratic models based on transparency in the access and visibility of data synergy and citizen participation in decision-making from the Open Data.
In the same direction, we find public participation and integration. In this line of work, the use of Open Data enables improvement in the quality of monitoring of public policies, thus allowing greater participation and collaboration of citizens. The cooperation of citizens in the governance of cities through Open Data runs through two fundamental ways: access to public places of information (mainly web pages and apps) and the creation of civilian institutions organizing data management in response to specific areas of interest. In this sense, citizens are called to greater participation in the governance of their cities, and this means better management of Open Data in order not to limit the fundamental rights of access to public information.
Another property is data quality and interoperability. The characteristics of the data must meet the following stipulations in order to achieve optimization and the basic characteristics of information that is systematic and accessible for any user. The data must be complete: there should be no limitation or prior checking when introducing them, apart from those issues that the legislation itself limits in defense of the rights of personal privacy. They must be primary, that is, the data must be unprocessed and unfiltered, as this is the core functionality that the infomediary should contrast against the specific objectives proposed. Data must also be accessible, avoiding any kind of restriction, nor should there be a compulsory registration of users in order to have control over access to the data. A line that contains the data that are available on the Web (in any format and according to the criteria of accessibility of the Web) must clearly exist. Another characteristic is that they must be provided on time or, what is the same, there must be a continuous input that allows the constant updating of the data. Success in Open Data rests on the principle of immediate updating, in order to monetize the information immediately; success is always obtained in real time. Open Data must be processable, or structured so that it is possible to work with them without requiring special tools. Access to the data should be universal and not limited by specifications that prevent the normal processing of the data. For example, it would be mandatory to have data ready to work with in any spreadsheet, rather than placing them in an image of a table that prevents operability with such data. The last two features are that Open Data must be nondiscriminatory, not limited by technical constraints or the need for expensive high-quality connections or technical access, so it is also convenient to introduce URLs identifying data from other websites and to try to connect with data originating from other websites; and nonproprietary and license-free in format, as public information should not be delivered in formats that benefit a few software companies over others. For example, data can be delivered in CSV format instead of Excel.
Economic and employment growth: There is a clear tendency to use Open Data as a way to develop effective new business models, which directly affects two fundamental aspects: a trend of increased employment of skilled labor and reduced costs by consolidating assets that affect the structural organization of companies, delegating work through data at no charge. Contrasting examples of these benefits can be found in the following: the MEPSIR 2006 report prepared by the European Commission estimates a profit of up to €47,000 million per year from working with Open Data; in Spain, in particular, according to the Characterization Study of the Infomediary Sector within the Provides Project, dated June 2012, there has been a profit of between €330 and 550 million and between 3600 and 4400 direct jobs. But since 2016, sector staffs have grown significantly: the total number of employees is estimated to be 14.3% higher (14,000–16,000 employees, in the most positive estimate). The collaboration between the public and private sectors in generating companies, services, and applications from Open Data is leading to a very evident change in the production model. Part of the internal current expenditure of the public sector provides liquidity for public access technology infrastructure in the form of products, services, and therefore employment. The result is a multiplier effect on the collaboration between the public and private sectors due to a recurring public contribution that gives value to public investment and generates profits in the companies that request the data.
In this sense, and within the European Union, a harmonization and standardization of Open Data metadata for Open Apps proposals from IEEE and JoinUP is being carried out in the Digital Agenda of the EU.
Social benefits, such as transparency and civil government: In this sense, there is a historical background and legislative framework that in Europe, for example, began in 2003, with Directive 2003/98/EC of the European Parliament and the Council, of 17 November 2003, on the reuse of public sector information. Each country has developed its own regulation in order to define the Technical Interoperability Standard for the Re-use of information resources.
The forms of reuse of data have changed from consultation services to a personalized manner, through public agencies willing to provide RSS content services that act as continuous indicators through scheduled alerts filtered according to specific criteria. There is also the possibility of going through Web services that can presently be offered in regulated manners by all public bodies, in this case being the main form of reuse through raw data.
The basic scheme for accessing reusable Open Data documents involves: (a) a general basic mode (open access data); (b) license types of two kinds: “free” licenses, for freely available information, and specific licenses, in which conditions are established for reusing the information; in any case, limitations that make the involved data inaccessible should never exist; and (c) re-request, a general method for the request of documents, reserved for application data and subject to more specific characteristics which may be confronted with some aspect of the general regulatory standards conditions.
In any case, one should take into account all the international standards for Industrial Property inquiry, including ST 36, ST 66, and ST 86. In addition, we must take into account the Data Protection Act and have control over the use of the legal basis for the reuse obligations, which should aim to improve the quality of the data and of consultation at all times. To do this, regulators should give accessibility and data entry in the catalogue to both the user and the infomediary.
The general conditions applicable to all re-users go through basic, but essential, concepts: not to distort the data, to cite the source, and to update the date of consultation at all times in order to comply with the reliability thereof.
Moreover, all agents must retain re-user metadata to certify, at all times, the source thereof. All these principles need to be articulated and regulated urgently, as they could produce conflicts in the use of Open Data. The legislation must overcome geographical barriers since the use of Open Data has global effects, so international organizations are getting involved in legislation.
Definitively, Open Data influences the way that people will relate to institutions in the future. However, there are currently a number of barriers that need to be solved for the sake of process consistency. Among the main barriers we are currently facing, there is a lack of commitment from the public sector to promote the reuse of Open Data, which could be improved with proper training of public employees. In that direction, there are difficulties for public entities in carrying out Open Data strategies, and it is necessary to disseminate the benefits of Open Data. It is also necessary to prepare data for opening up, through identification, cataloguing, and classification, so it is very important to promote citizens' and private organizations' participation in the reuse of data.

Further Reading

Estudio de Caracterización del Sector Infomediario en España. http://www.ontsi.red.es/ontsi/es/estudios-informes/estudio-de-caracterizaci%C3%B3n-del-sector-infomediario-en-espa%C3%B1-edici%C3%B3n-2012. Accessed Aug 2014.
Estudio de Caracterización del Sector Infomediario en España. http://datos.gob.es/sites/default/files/Info_sector%20infomediario_2012_vfr.pdf. Accessed Aug 2014.
Estudio de Caracterización del Sector Infomediario en España. https://www.ontsi.red.es/sites/ontsi/files/2020-06/PresentationCharacterizationInfomediarySector2020.pdf. Accessed Aug 2020.
EU Data Protection Directive 95/46/EC. http://europa.eu/legislation_summaries/information_society/data_protection/l14012_es.htm. Accessed Aug 2014.
Industrial Property, including the ST 36, ST 66 and ST 86. http://www.wipo.int/export/sites/www/standards/en/pdf/03-96-01.pdf. Accessed Aug 2014.
Law 37/2007, of 16 November, on the reuse of public sector information. http://www.boe.es/diario_boe/txt.php?id=BOE-A-2007-19814. Accessed Aug 2014.
MEPSIR Report. (2006). http://www.cotec.es/index.php/pagina/publications/new-additions/show/id/952/titulo/reutilizacion-de-la-informacion-del-sector-publico%2D%2D2011. Accessed Aug 2014.
Open-Source Software

Marc-David L. Seidel
Sauder School of Business, University of British Columbia, Vancouver, BC, Canada

Open-source software refers to computer software where the copyright holder provides anybody the right to edit, modify, and distribute the software free of charge. The initial creation of such software spawned the open-source movement. Frequently the only limitation on the intellectual property rights is that any subsequent changes made by others are required to be made with similarly open intellectual property rights. Such software is often developed in an open collaborative manner by a Community Form (C-form) organization. A large percentage of the internet infrastructure is operated utilizing such software, which handles the majority of networking, web serving, e-mail, and network diagnostics. With the spread of the internet, the volume of user-generated data has expanded exponentially, and open-source software to manage and analyze big data has flourished through open-source big data projects. This entry explains the history of open-source software, the typical C-form organizational structure used to create such software, prominent project examples of software focused on managing and analyzing big data, and the future evolution suggested by current research on the topic.

History of Open-Source Software

Two early software projects leading to the modern-day open-source software growth were at the Massachusetts Institute of Technology (MIT) and the University of California at Berkeley. The Free Software Foundation, created by Richard Stallman of the MIT Artificial Intelligence Lab, was launched as a nonprofit organization to promote the development of free software. Stallman is credited with creating the term “copyleft” and created the GNU operating system as an operating system composed entirely of free software. The free BSD Unix operating system was developed by Bill Jolitz of the University of California at Berkeley Computer Science Research Group and served as the basis for many later Unix operating system releases. Many open-source software projects were unknown outside of the highly technical computer science community. Stallman's GNU was later popularized by Linus Torvalds, a Finnish computer science student, who released a Linux kernel based upon the earlier work. The release of Linux triggered substantial media attention for the open-source movement when an internal Microsoft strategy document, dubbed the Halloween Documents, was leaked. It outlined Microsoft's perception of the threat of Linux to Microsoft's dominance of the operating system market. Linux was portrayed in the mass media as a free alternative to the Microsoft Windows operating system. Eric S. Raymond and Bruce Perens further formalized open source as a development method by creating the Open Source Initiative in 1998. By 1998, open-source software routed 80% of the e-mail on the internet. It has continued to flourish to the modern day, being responsible for a large number of software and information-based products today produced by the open-source movement.

Organizational Architecture

The C-form organizational architecture is the primary organizational structure for open-source development projects. A typical C-form has four common organizing principles. First, there are informal peripheral boundaries for developers. Contributors can participate as much or as little as they like and join or leave a project on their own. Second, many contributors receive no financial compensation at all for their work, yet some may have employment relationships with more traditional organizations which encourage their participation in the C-form as part of their regular job duties. Third, C-forms focus on information-based products, of which software is a major subset.
Since the product of a typical C-form is information based, it can be replicated with minimal effort and cost. Fourth, typical C-forms operate with a norm of open transparent communication. The primary intellectual property of an open-source C-form is the software code. This, by definition, is made available for any and all to see, use, and edit.

Prominent Examples of Open-Source Big Data Projects

Apache Cassandra is a distributed database management system originally developed by Avinash Lakshman and Prashant Malik at Facebook as a solution to handle searching an inbox. It is now developed by the Apache Software Foundation, a distributed community of developers. It is designed to handle large amounts of data distributed across multiple datacenters. It has been recognized by University of Toronto researchers as having leading scalability capabilities.
Apache CouchDB is a web-focused database system originally developed by Damien Katz, a former IBM developer. Similar to Apache Cassandra, it is now developed by the Apache Software Foundation. It is designed to deal with large amounts of data through multi-master replication across multiple locations.
Apache Hadoop is designed to store and process large-scale datasets using multiple clusters of standardized low-level hardware. This technique allows for parallel processing similar to a supercomputer but using mass market off-the-shelf commodity computing systems. It was originally developed by Doug Cutting and Mike Cafarella. Cutting was employed at Yahoo, and Cafarella was a master's student at the University of Washington at the time. It is now developed by the Apache Software Foundation. It serves a similar purpose as Storm.
Apache HCatalog is a table and storage management layer for Apache Hadoop. It is focused on assisting grid administrators with managing large volumes of data without knowing exactly where the data is stored. It provides relational views of the data, regardless of what the source storage location is. It is developed by the Apache Software Foundation.
Apache Lucene is an information retrieval software library which tightly integrates with search engine projects such as ElasticSearch. It provides full-text indexing and searching capabilities. It treats all document formats similarly by extracting textual components and as such is independent of file format. It is developed by the Apache Software Foundation and released under the Apache Software License.
D3.js is a data visualization package originally created by Mike Bostock, Jeff Heer, and Vadim Ogievetsky, who worked together at Stanford University. It is now licensed under the Berkeley Software Distribution (BSD) open-source license. It is designed to graphically represent large amounts of data and is frequently used to generate rich graphs and for map making.
Drill is a framework to support distributed applications for data-intensive analysis of large-scale datasets in a self-serve manner. It is inspired by Google's BigQuery infrastructure service. The stated goal for the project is to scale to 10,000 or more servers to make low-latency queries of petabytes of data in seconds in a self-service manner. It is currently being incubated by Apache. It is similar to Impala.
ElasticSearch is a search server that provides near real-time full-text search engine capabilities for large volumes of documents using a distributed infrastructure. It is based upon Apache Lucene and is released under the Apache Software License. It spawned a venture-funded company in 2012 created by the people responsible for ElasticSearch and Apache Lucene to provide support and professional services around the software.
Impala is an SQL query engine which enables massively parallel processing of search queries on Apache Hadoop. It was announced in 2012 and moved out of beta testing in 2013 to public availability. It is targeted at data analysts and scientists who need to conduct analysis on large-scale data without reformatting and transferring the data to a specialized system or proprietary format. It is released under the Apache Software License and has professional support available from the venture-funded Cloudera. It is similar to Drill.
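Several of the systems above (Hadoop, Drill, Impala) rest on the same basic pattern: map a computation independently over chunks of data held on many machines, then reduce the partial results into an answer. The following is a toy, single-machine sketch of that pattern in plain Python, not the API of any of the projects described here; the word-count task and all names are hypothetical.

```python
# Toy illustration of the map/reduce pattern behind Hadoop-style processing.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Count words in one chunk of the input (the 'map' step)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Merge per-chunk counts into a global result (the 'reduce' step)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    corpus = ["big data needs scalable tools",
              "open source tools handle big data",
              "data pipelines process data in parallel"]
    chunks = [corpus[i::2] for i in range(2)]   # pretend these live on different nodes
    with Pool(2) as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials).most_common(3))
```

In a real cluster the chunks would be blocks of a distributed file system and the map and reduce steps would run on many servers, which is what allows these projects to scale to very large datasets.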
Julia is a technical computing high-performance dynamic programming language with a focus on distributed parallel execution with high numerical accuracy using an extensive mathematical function library. It is designed to use a simple syntax familiar to many developers of older programming languages while being updated to be more effective with big data. The aim is to speed development time by simplifying coding for parallel processing support. It was first released in 2012 under the MIT open-source license after being originally developed starting in 2009 by Alan Edelman (MIT), Jeff Bezanson (MIT), Stefan Karpinski (UCSB), and Viral Shah (UCSB).
Kafka is a distributed, partitioned, replicated message broker targeted on commit logs. It can be used for messaging, website activity tracking, operational data monitoring, and stream processing. It was originally developed by LinkedIn and released open source in 2011. It was subsequently incubated by the Apache Incubator and as of 2012 is developed by the Apache Software Foundation.
Lumify is a big data analysis and visualization platform originally targeted to investigative work in the national security space. It provides real-time graphical visualizations of large volumes of data and automatically searches for connections between entities. It was originally created by Altamira Technologies Corporation and then released under the Apache License in 2014.
MongoDB is a NoSQL document-focused database focused on handling large volumes of data. The software was first developed in 2007 by 10gen. In 2009, the company made the software open source and focused on providing professional services for the integration and use of the software. It utilizes a distributed file storage, load balancing, and replication system to allow quick ad hoc queries of large volumes of data. It is released under the GNU Affero General Public License and uses drivers released under the Apache License.
R is a technical computing high-performance programming language focused on statistical analysis and graphical representations of large datasets. It is an implementation of the S programming language created by Bell Labs' John Chambers. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland. It is designed to allow multiple processors to work on large datasets. It is released under the GNU License.
Scribe is a log server designed to aggregate large volumes of server data streamed in real time from a high volume of servers. It is commonly described as a scaling tool. It was originally developed by Facebook and then released in 2008 using the open-source Apache License.
Spark is a data analytic cluster computing framework designed to integrate with Apache Hadoop. It has the capability to cache large datasets in memory to interactively analyze the data and then extract a working analysis set to further analyze quickly. It was originally developed at the University of California at Berkeley AMPLab and released under the BSD License. Later it was incubated in 2013 at the Apache Incubator and released under the Apache License. Major contributors to the project include Yahoo and Intel.
Storm is a programming library focused on real-time storage and retrieval of dynamic object information. It allows complex querying across multiple database tables. It handles unbound streams of data in an instantaneous manner allowing real-time analytics of big data and continuous computation. The software was originally developed by Canonical Ltd., also known for the Ubuntu Linux operating system, and is released under the GNU Lesser General Public License. It is similar to Apache Hadoop but with a more real-time and less batch-focused nature.

The Future

The majority of open-source software focused on big data applications has primarily been targeting web-based big data sources and corporate data analytics. Current developments suggest a shift toward more analysis of real-world data as sensors spread more widely into everyday use by mass market consumers.
As consumers provide more and more data passively through pervasive sensors, the open-source software used to manage and understand big data appears to be shifting toward analyzing a wider variety of big data sources. It appears likely that the near future will provide more open-source software tools to analyze real-world big data such as physical movements, biological data, consumer behavior, health metrics, and voice content.

Cross-References

▶ Crowdsourcing
▶ Google Flu
▶ Wikipedia

Further Reading

Bretthauer, D. (2002). Open source software: A history. Information Technology and Libraries, 21(1), 3–11.
Lakhani, K. R., & von Hippel, E. (2003). How open source software works: ‘Free’ user-to-user assistance. Research Policy, 32(6), 923–943.
Marx, V. (2013). Biology: The big challenges of big data. Nature, 498, 255–260.
McHugh, J. (1998, August). For the love of hacking. Forbes.
O’Mahony, S., & Ferraro, F. (2007). The emergence of governance on an open source project. Academy of Management Journal, 50(5), 1079–1106.
Seidel, M.-D. L., & Stewart, K. (2011). An initial description of the C-form. Research in the Sociology of Organizations, 33, 37–72.
Shah, S. K. (2006). Motivation, governance, and the viability of hybrid forms in open source software development. Management Science, 52(7), 1000–1014.


Parallel Processing

▶ Multiprocessing

Participatory Health and Big Data

Muhiuddin Haider, Yessenia Gomez and Salma Sharaf
School of Public Health Institute for Applied Environmental Health, University of Maryland, College Park, MD, USA

The personal data landscape has changed drastically with the rise of social networking sites and the Internet. The Internet and social media sites have allowed for the collection of large amounts of personal data. Every keystroke typed, website visited, Facebook post liked, Tweet posted, or video shared becomes part of a user's digital history. A large net is cast collecting all the personal data into big data sets that may be subsequently analyzed. This type of data has been analyzed for years by marketing firms through the use of algorithms that analyze and predict consumer purchasing behavior. The digital history of an individual paints a clear picture about their influence in the community and their mental, emotional, and financial state, and much about an individual can be learned through the tracking of his or her data. When big data is fine-tuned, it can benefit the people and community at large. Big data can be used to track epidemics, and its analysis can be used in the support of patient education, treatment of at-risk individuals, and encouragement of participatory community health. However, with the rise of big data comes concern about the security of health information and privacy.
There are advantages and disadvantages to casting large data nets. Collecting data can help organizations learn about individuals and communities at large. Following online search trends and collecting big data can help researchers understand health problems currently facing the studied communities and can similarly be used to track epidemics. For example, increases in Google searches for the term flu have been correlated with an increase in flu patient visits to emergency rooms. In addition, a 2008 Pew study revealed that 80% of Internet users use the Internet to search for health information. Today, many patients visit doctors after having already searched their symptoms online. Furthermore, more patients are now using the Internet to search health information, seek medical advice, and make important medical decisions. The rise of the Internet has led to more patient engagement and participation in health.
Technology has also encouraged participatory health through an increase in interconnectedness. Internet technology has allowed for constant access to medical specialists and support groups for people suffering from diseases or those searching for health information. The use of technology has allowed individuals to take control of their own health, through the use of online searches and the constant access to online health records and tailored medical information. In the United States, hospitals are connecting individuals to their doctors through the use of online applications that allow patients to email their doctors, check prescriptions, and look at visit summaries from anywhere where they have an Internet connection. The increase in patient engagement has been seen to play a major role in promotion of health and improvement in quality of healthcare.
Technology has also helped those at risk of disease seek treatment early or be followed carefully before contracting a disease. Collection of big data has helped providers see health trends in their communities, and technology has allowed them to reach more people with targeted health information. A United Nations International Children's Emergency Fund (UNICEF) project in Uganda asked community members to sign up for U-report, a text-based system that allows individuals to participate in health discussions through weekly polls. This system was implemented to connect and increase communication between the community and the government and health officials. The success of the program helped UNICEF prevent disease outbreaks in the communities and encouraged healthy behaviors. U-report is now used in other countries to help mobilize communities to play active roles in their personal health.
Advances in technology have also created wearable technology that is revolutionizing participatory health. Wearable technology is a category of devices that are worn by individuals and are used to track data about the individuals, such as health information. Examples of wearable technology are wrist bands that collect information about the individual's global positioning system (GPS) location, amount of daily exercise, sleep patterns, and heart rate. Wearable technology enables users to track their health information, and some wearable technology even allows the individual to save their health information and share it with their medical providers. Wearable technology encourages participatory health, and the constant tracking of health information and sharing with medical providers allow for more accurate health data collection and tailored care. The increase in health technology and collection and analysis of big data has led to an increase in participatory health, better communication between individuals and healthcare providers, and more tailored care.
Big data collected from these various sources, whether Internet searches, social media sites, or participatory health through applications and technology, strongly influences our modern health system. The analysis of big data has helped medical providers and researchers understand health problems facing their communities and develop tailored programs to address health concerns, prevent disease, and increase community participatory health. Through the use of big data technology, providers are now able to study health trends in their communities and communicate with their patients without scheduling any medical visits. However, big data also creates concern for the security of health information.
There are several disadvantages to the collection of big data. One is that not all the data collected is significant, and much of the information collected may be meaningless. Additionally, computers lack the ability to interpret information the way humans do, so something that may have multiple interpretations may be misinterpreted by a computer. Therefore, data may be flawed if simply interpreted based on algorithms, and any decisions regarding the health of the communities that were made based on this inaccurate data would also be flawed. Of greater concern is the issue of privacy with regard to big data. Much of the data is collected automatically based on people's online searches and Internet activities, so the question arises as to whether people have the right to choose what data is collected about them.

Questions that arise regarding big data and health include the following: How long is personal health data saved? Will the data collected be used against individuals? How will the Health Insurance Portability and Accountability Act (HIPAA) change with the incorporation of big data in medicine? Will data collected determine insurance premiums? Privacy concerns need to be addressed before big health data, health applications, and wearable technology become a security issue.

Today, big data can help health providers better understand their target populations and can lead to an increase in participatory health. However, concerns arise about the safety of health information that is automatically collected in big data sets. With this in mind, targeted data collection may be a more beneficial approach with regard to health. All of these concerns need to be addressed today as the use of big data in health becomes more commonplace.

Cross-References

▶ Epidemiology
▶ Online Advertising
▶ Patient-Centered (Personalized) Health
▶ PatientsLikeMe
▶ Prevention

Further Reading

Eysenbach, G. (2008). Medicine 2.0: Social networking, collaboration, participation, apomediation, and openness. Journal of Medical Internet Research, 10(3), e22. https://doi.org/10.2196/jmir.1030.
Gallant, L. M., Irizarry, C., Boone, G., & Kreps, G. (2011). Promoting participatory medicine with social media: New media applications on hospital websites that enhance health education and e-patients' voices. Journal of Participatory Medicine, 3, e49.
Gallivan, J., Kovacs Burns, K. A., Bellows, M., & Eigenseher, C. (2012). The many faces of patient engagement. Journal of Participatory Medicine, 4, e32.
Lohr, S. (2012). The age of big data. The New York Times.
Revolutionizing social mobilization, monitoring and response efforts. (2012). UNICEF [video file]. Retrieved from https://www.youtube.com/watch?v=gRczMq1Dn10.
The promise of personalized medicine. (2007, Winter). NIH Medline Plus, pp. 2–3.

Patient Records

Barbara Cook Overton
Communication Studies, Louisiana State University, Baton Rouge, LA, USA
Communication Studies, Southeastern Louisiana University, Hammond, LA, USA

Patient records have existed since the first hospitals were opened. Early handwritten accounts of patients' hospitalizations were recorded for educational purposes, but most records were simply tallies of admissions and discharges used to justify expenditures. Standardized forms would eventually change how patient care was documented. Content shifted from narrative to numerical descriptions, largely in the form of test results. Records became unwieldy as professional guidelines and malpractice concerns required more and more data be recorded. Patient records are owned and maintained by individual providers, meaning multiple records exist for most patients. Nonetheless, the patient record is a document meant to ensure continuity of care and is a communication tool for all providers engaged in a patient's current and future care. Electronic health records may facilitate information sharing, but that goal is largely unrealized.

Modern patient records evolved with two primary goals: facilitating fiscal justification and improving medical education. Early hospitals established basic rules to track patient admissions, diagnoses, and outcomes. The purpose was largely bureaucratic: administrators used patient tallies to justify expenditures. As far back as 1737, Berlin surgeons were required to note patients' conditions each morning and prescribe lunches accordingly (e.g., soup was prescribed for patients too weak to chew). The purpose, according to Volker Hess and Sophie Ledebur, was helping administrators track the hospital's food costs and had little bearing on actual patient care. In 1791, according to Eugenia Siegler in her analysis of early medical recordkeeping, the New York Board of Governors required complete patient logs along with lists of prescribed medications, but no descriptions of the patients' conditions.

Formally documenting the care that individual patients received was fairly uncommon in American hospitals at that time. It was not until the end of the nineteenth century that American physicians began recording the specifics of daily patient care for all patients. Documentation in European hospitals, by contrast, was much more complete. From the mid-eighteenth century on, standardized medical forms were widely used to record patients' demographic data, their symptoms, treatments, daily events, and outcomes. By 1820, these forms were collected in preprinted folders with multiple graphs and tables (by contrast, American hospitals would not begin using such forms until the mid-1860s). Each day, physicians in training were tasked with transcribing medical data into meaningful narratives, describing patterns of disease progression. The resulting texts became valuable learning tools. Similar narratives were compiled by American physicians and used for medical training as well. In 1805, Dr. David Hosack had suggested recording the specifics of particularly interesting cases, especially those holding the greatest educational value for medical students. The New York Board of Governors agreed and mandated compiling summary reports in casebooks. As Siegler noted, there were very few reports written at first: the first casebook spanned 1810–1834. Later, as physicians in training were required to write case reports in order to be admitted to their respective specialties, the number of documented cases grew. Eventually, reports were required for all patients. The reports, however, were usually written retrospectively and in widely varying narrative styles.

Widespread use of templates in American hospitals helped standardize patient records, but the resulting quantitative data superseded narrative content. By the start of the twentieth century, forms guaranteed documentation of specific tasks like physical exams, histories, orders, and test results. Graphs and tables dominated patient records, and physicians' narrative summaries began disappearing. The freestyle narrative form that had previously comprised the bulk of the patient record allowed physicians to write as much or as little as they wished. Templates left little room for lengthy narratives, no more than a few inches, so summary reports gave way to brief descriptions of pertinent findings. As medical technology advanced, according to Siegler, the medical record became more complicated and cumbersome with the addition of yet more forms for reporting each new type of test (e.g., chemistry, hematology, and pathology tests). While most physicians kept working notes on active patients, these scraps of paper notating observations, daily tasks, and physicians' thoughts seldom made their way into the official patient record. The official record emphasized tests and numbers, as Siegler noted, and this changed medical discourse: interactions and care became more data driven, and care focused less on the totality of the patient's experience and the physician's perception of it. Nonetheless, patient records had become a mainstay, and they did help ensure continuity of care. Despite early efforts at a unifying style, however, the content of patient records still varied considerably.

Although standardized forms ensured certain events would be documented, there were no methods to ensure consistency across documentations or between providers. Dr. Larry Weed proposed a framework in 1964 to help standardize the recording of medical care: SOAP notes. SOAP notes are organized around four key areas: subjective (what patients say), objective (what providers observe, including vital signs and lab results), assessment (diagnosis), and plan (prescribed treatments). Other standardized approaches have been developed since then. The most common charting formats today, in addition to SOAP notes, include narrative charting, APIE charting, focus charting, and charting by exception. Narrative charting, much as in the early days of patient recordkeeping, involves written accounts of patients' conditions, treatments, and responses and is documented in chronological order. Charts include progress notes and flow sheets, which are multi-column forms for recording dates, times, and observations that are updated every few hours for inpatients and upon each subsequent outpatient visit. They provide an easy-to-read record of change over time; however, their limited space cannot take the place of more complete assessments, which should appear elsewhere in the patient record.

APIE charting, similar to SOAP notes, involves clustering patient notes around assessment (both subjective and objective findings), planning, implementation, and evaluation. Focus charting is a more concise method of inpatient recording and is organized by keywords listed in columns. Providers note their actions and patients' responses under each keyword heading. Charting by exception involves documenting only significant changes or events using specially formatted flow sheets. Computerized charting, or electronic health records (EHR), combines several of the above approaches, but proprietary systems vary widely. Most hospitals and private practices are migrating to EHRs, but the transition has been expensive, difficult, and slower than expected. The biggest challenges include interoperability issues impeding data sharing, difficult-to-use EHRs, and perceptions that EHRs interfere with provider-patient relationships.
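Because SOAP and the formats that followed it are, at heart, fixed data structures, they translate naturally into the structured records that electronic systems store. The following is a minimal, hypothetical sketch in Python of how a single SOAP-style progress note might be represented as a data record; the class, field names, and sample values are illustrative assumptions rather than the schema of any actual charting product.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SOAPNote:
    # Minimal, hypothetical representation of one SOAP-style progress note.
    patient_id: str            # unique patient identification number
    provider: str              # signing provider, required on every entry
    timestamp: datetime        # date/time stamp for the entry
    subjective: str            # what the patient says
    objective: dict = field(default_factory=dict)  # vital signs, lab results
    assessment: str = ""       # working diagnosis
    plan: str = ""             # prescribed treatments and follow-up

note = SOAPNote(
    patient_id="MRN-000123",
    provider="A. Provider, MD",
    timestamp=datetime(2014, 10, 1, 9, 30),
    subjective="Reports three days of cough and fatigue.",
    objective={"temp_f": 101.2, "bp": "128/84"},
    assessment="Acute bronchitis",
    plan="Supportive care; follow up in one week if symptoms persist.",
)

A charting-by-exception system would store only the fields that depart from expected findings, while an APIE-style record would cluster the same information under assessment, planning, implementation, and evaluation instead.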
Today, irrespective of the charting format used, patient records are maintained according to strict guidelines. Several agencies publish recommended guidelines, including the American Nurses Association, the American Medical Association (AMA), the Joint Commission on Accreditation of Healthcare Organizations (JCAHO), and the Centers for Medicare and Medicaid Services (CMS). Each regards the medical record as a communication tool for everyone involved in the patient's current and future care. The primary purpose of the medical record is to identify the patient, justify treatment, document the course of treatment and results, and facilitate continuity of care among providers. Data stored in patient records have other functions; aside from ensuring continuity of care, data can be extracted for evaluating the quality of care administered, released to third-party payers for reimbursement, and analyzed for clinical research and/or epidemiological studies. Each agency's charting guidelines require certain fixed elements in the patient record: the patient's name, address, birthdate, attending physician, diagnosis, next of kin, and insurance provider. The patient record also contains physicians' orders and progress notes, as well as medication lists, X-ray records, laboratory tests, and surgical records. Several agencies require that the patient's full name, birthdate, and a unique patient identification number appear on each page of the record, along with the name of the attending physician, the date of visit or admission, and the treating facility's contact information. Every entry must be legibly signed or initialed and date/time stamped by the provider.

The medical record is a protected legal document, and because it could be used in a malpractice case, charting takes on added significance. Incomplete, confusing, or sloppy patient records could signal poor medical care to a jury, even in the absence of medical incompetence. For that reason, many malpractice insurers require additional documentation above and beyond what professional agencies recommend. For example, providers are urged to write legibly in permanent ink, avoid using abbreviations, write only objective/quantifiable observations and use quotation marks to set apart patients' statements, note communication between all members of the care team while documenting the corresponding dates and times, document informed consent and patient education, record every step of every procedure and medication administration, and chart instances of patients' noncompliance or lack of cooperation. Providers should avoid writing over, whiting out, or attempting to erase entries, even if made in error; mistakes should be crossed through with a single line, dated, and signed. Altering a patient chart after the fact is illegal in many states, so corrections should be made in a timely fashion and dated/signed. Leaving blank spaces on medical forms should be avoided as well; if space is not needed for documenting patient care, providers are instructed to draw a line through the space or write "N/A." The following should also be documented to ensure both good patient care and malpractice defense: the reason for each visit, chief complaint, symptoms, onset and duration of symptoms, medical and social history, family history, both positive and negative test results, justifications for diagnostic tests, current medications and doses, over-the-counter and/or recreational drug use, drug allergies, any discontinued medications and reactions, medication renewals or dosage changes,

treatment recommendations and suggested follow-up or specialty care, a list of other treating physicians, a "rule-out" list of considered but rejected diagnoses, final definitive diagnoses, and canceled or missed appointments.

Patient records contain more data than ever before because of professional guidelines, malpractice-avoidance strategies, and the ease of data entry many EHRs make possible. The result is that providers are experiencing data overload. Many have difficulty wading through mounds of data, in either paper or electronic form, to discern important information from insignificant attestations and results. While EHRs are supposed to make searching for data easier, many providers lack the skills and time needed to search for and review patients' medical records. Researchers have found that some physicians rely on their own memories or ask patients about previous visits instead of searching for the information themselves. Other researchers have found providers have trouble quickly processing the amount of quantitative data and graphs in most medical records. Donia Scott and colleagues, for example, found that providers given narrative summaries of patient records culled from both quantitative and qualitative data performed better on questions about patients' conditions than those providers given complete medical records, and did so in half the time. Their findings highlight the importance of narrative summaries that should be included in patients' records. There is a clear need for balancing numbers with words in ensuring optimal patient care.

Another important issue is ownership of and access to patient records. For each healthcare provider and/or medical facility involved in a patient's care, there is a unique patient record owned by that provider. With patients' permission, those records are frequently shared among providers. The Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data, but patients, guardians or conservators of minor or incompetent patients, and legal representatives of deceased patients may request access to records. Providers in some states can withhold records if, in the providers' judgment, releasing information could be detrimental to patients' well-being or cause emotional or mental distress. In addition to HIPAA mandates, many states have strict confidentiality laws restricting the release of HIV test results, drug and alcohol abuse treatment, and inpatient mental health records. While HIPAA guarantees patients access to their medical records, providers can charge copying fees. Withholding records because a patient cannot afford to pay for them is prohibited in many states because it could disrupt the continuity of care. HIPAA also allows patients the right to amend their medical records if they believe mistakes have been made. While providers are encouraged to maintain records in perpetuity, there are no requirements that they do so. Given the costs associated with data storage, both on paper and electronically, many providers will only maintain charts on active patients. Many inactive patients, those who have not seen a given provider in 8 years, will likely have their records destroyed. Additionally, many retiring physicians typically only maintain records for 10 years. Better data management capabilities will inevitably change these practices in years to come.

While patient records have evolved to ensure continuity of patient care, many claim the current form that records have taken facilitates billing over communication concerns. Many EHRs, for instance, are modeled after accounting systems: providers' checkbox choices of diagnoses and tests are typically categorized and notated in billing codes. Standardized forms are also designed with billing codes in mind. Diagnosis codes are reported in the International Statistical Classification of Diseases and Related Health Problems terminology, commonly referred to as ICD. The World Health Organization maintains this coding system for epidemiological, health management, and research purposes. Billable procedures and treatments administered in the United States are reported in Current Procedural Terminology (CPT) codes. The AMA owns this coding schema, and users must pay a yearly licensing fee for the CPT codes and codebooks, which are updated annually. Critics claim this amounts to a monopoly, especially given that HIPAA, CMS, and most insurance companies require CPT-coded data to satisfy reporting requirements and for reimbursement.

CPT-coded data may impact patients' ability to decipher and comprehend their medical records, but the AMA does have a limited search function on its website for non-commercial use allowing patients to look up certain codes.

Patient records are an important tool for ensuring continuity of care, but data-heavy records are cumbersome and often lack narrative summaries, which have been shown to enhance providers' understanding of patients' histories and inform better medical decision-making. Strict guidelines and malpractice concerns produce thorough records that, while ensuring complete documentation, sometimes impede providers' ability to discern important from less significant past findings. Better search and analytical tools are needed for managing patient records and data.

Cross-References

▶ Electronic Health Records (EHR)
▶ Health Care Delivery
▶ Health Informatics
▶ Patient-Centered (Personalized) Health

Further Reading

American Medical Association. CPT – current procedural terminology. http://www.ama-assn.org/ama/pub/physician-resources/solutions-managing-your-practice/coding-billing-insurance/cpt.page. Accessed Oct 2014.
Christensen, T., & Grimsmo, A. (2008). Instant availability of patient records, but diminished availability of patient information: A multi-method study of GPs' use of electronic health records. BMC Medical Informatics and Decision Making, 8(12). https://doi.org/10.1186/1472-6947-8-12.
Hess, V., & Ledebur, S. (2011). Taking and keeping: A note on the emergence and function of hospital patient records. Journal of the Society of Archivists, 32, 1.
Lee, J. Interview with Lawrence Weed, MD – The father of the problem-oriented medical record looks ahead. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2911807/. Accessed Oct 2014.
Medical Insurance Exchange of California. Medical record documentation for patient safety and physician defensibility. http://www.miec.com/Portals/0/pubs/MedicalRec.pdf. Accessed Oct 2014.
Scott, D., et al. (2013). Data-to-text summarisation of patient records: Using computer-generated summaries to access patient histories. Patient Education and Counseling, 92, 153–159.
Siegler, E. (2010). The evolving medical record. Annals of Internal Medicine, 153, 671–677.

Patient-Centered (Personalized) Health

Barbara Cook Overton
Communication Studies, Louisiana State University, Baton Rouge, LA, USA
Communication Studies, Southeastern Louisiana University, Hammond, LA, USA

Patient-centered health privileges patient participation and results in tailored interventions incorporating patients' needs, values, and preferences. Although this model of care is preferred by patients and encouraged by policy makers, many healthcare providers persist in using a biomedical approach, which prioritizes providers' expertise and downplays patients' involvement. Patient-centered care demands collaborative partnerships and quality communication, both requiring more time than is generally available during medical exams. While big data may not necessarily improve patient-provider communication, it can facilitate individualized care in several important ways.

The concept of patient-centered health, although defined in innumerable ways, has gained momentum in recent years. In 2001, the Institute of Medicine (IOM) issued a report recommending healthcare institutions and providers adopt six basic tenets: safety, effectiveness, timeliness, efficiency, equity, and patient-centeredness. Patient-centeredness, according to the IOM, entails delivering quality health care driven by patients' needs, values, and preferences. The Institute for Patient- and Family-Centered Care expands the IOM definition by including provisions for shared decision-making, planning, delivery, and evaluation of health care that is situated in partnerships comprising patients, their families, and providers.

The concept is further elucidated in terms of four main principles: respect, information sharing, participation, and collaboration. According to the Picker Institute, patient-centered care encompasses seven basic components: respect, coordination, information and education, physical comfort, emotional support, family involvement, and continuity of care. All of the definitions basically center on two essential elements: patient participation in the care process and individualized care.

The goal of patient-centered care, put forth by the IOM, is arguably a return to old-fashioned medicine. Dr. Abraham Flexner, instrumental in revamping physician training during the 1910s and 1920s, promoted medical interactions that were guided by both clinical reasoning and compassion. He encouraged a biopsychosocial approach to patient communication, which incorporates patients' feelings, thoughts, and expectations. Scientific and technological advances throughout the twentieth century, however, gradually shifted medical inquiry away from the whole person and towards an ever-narrowing focus on symptoms and diseases. Once the medical interview became constricted, scientific, and objective, collaborative care gave way to a provider-driven approach. The growth of medical specialties (like cardiology and gastroenterology) further compounded the problem by reducing patients to collections of interrelated systems (such as circulatory and digestive). This shift to specialty care coincided with fewer providers pursuing careers in primary care, the specialty most inclined to adopt a patient-centered perspective. The resulting biomedical model downplays patient participation while privileging provider control and expertise. Although a return to patient-centered care is being encouraged, many providers persist in using a biomedical approach. Some researchers fault patients for not actively co-constructing the medical encounter, while others blame medical training that de-emphasizes relationship development and communication skills.

Several studies posit quality communication as the single most important component necessary for delivering patient-centered care. Researchers find patient dissatisfaction is associated with providers who are insensitive to or misinterpret patients' socio-emotional needs, fail to express empathy, do not give adequate feedback or information regarding diagnoses and treatment protocols, and disregard patients' input in decision-making. Patients who are dissatisfied with providers' communication are less likely to comply with treatment plans and typically suffer poorer outcomes. Conversely, patients satisfied with the quality of their providers' communication are more likely to take medications as prescribed and adhere to recommended treatments. Satisfied patients also have lower blood pressure and better overall health. Providers, however, routinely sacrifice satisfaction for efficiency, especially in managed care contexts.

Many medical interactions proceed according to a succinct pattern that does not prioritize patients' needs, values, and preferences. The asymmetrical nature of the provider-patient relationship privileges providers' goals and discourages patient participation. Although patients expect to have all or most of their concerns addressed, providers usually pressure them to focus on one complaint per visit. Providers also encourage patients to get to the point quickly, which means patients rarely speak without interruption or redirection. While some studies note patients are becoming more involved in their health care by offering opinions and asking questions, others find ever-decreasing rates of participation during medical encounters. Studies show physicians invite patients to ask questions in fewer than half of exams. Even when patients do have concerns, they rarely speak up because they report feeling inhibited by asymmetrical relationships: many patients simply do not feel empowered to express opinions, ask questions, or assert goals. Understandably, communication problems stem from these hierarchical differences and competing goals, thereby making patient-centered care difficult.

There are several other obstacles deterring patient-centered communication and care. While medical training prioritizes the development of clinical skills over communication skills, lack of time and insufficient financial reimbursement are the biggest impediments to patient-centered care.

The "one complaint per visit" approach to health care means most conversations are symptom specific, with little time left for discussing patients' overall health goals. Visits should encompass much broader health issues, moving away from the problem presentation/treatment model while taking each patient's unique goals into account. The goal of patient-centered care is further compromised by payment structures incentivizing quick patient turnaround over quality communication, which takes more time than is currently available in a typical medical encounter. Some studies, however, suggest that patient-centered communication strategies, like encouraging questions, co-constructing diagnoses, and mutually deciding treatment regimens, do not necessarily lengthen the overall medical encounter. Furthermore, collaboratively decided treatment plans are associated with decreased rates of hospitalization and emergency room use. Despite the challenges that exist, providers are implored to attempt patient-centered communication.

Big data has helped facilitate asynchronous communication between medical providers, namely through electronic health records which ensure continuity of care, but big data's real promise lies elsewhere. Using the power of predictive analytics, big data can play an important role in advancing patient-centered health by helping shape tailored wellness programs. The provider-driven, disease-focused approach to health care has, heretofore, impacted the kind of health data that exist: data that are largely focused on patients' symptoms and diseases. However, diseases do not develop in isolation. Most conditions develop through a complicated interplay of hereditary, environmental, and lifestyle factors. Expanding health data to include social and behavioral data, elicited via a biopsychosocial/patient-centered approach, can help medical providers build better predictive models. By examining comprehensive rather than disease-focused data, providers can, for example, leverage health data to predict which patients will participate in wellness programs, their level of commitment, and their potential for success. This can be done using data mining techniques, like collaborative filtering. In much the same way Amazon makes purchase recommendations for its users, providers may similarly recommend wellness programs by taking into account patients' past behavior and health outcomes. Comprehensive data could also be useful for tailoring different types of programs based on patients' preferences, thereby facilitating increased participation and retention. For example, programs could be customized for patients in ways that go beyond traditional racial, ethnic, or sociodemographic markers and include characteristics such as social media use and shopping habits. By designing analytics aimed at understanding individual patients and not just their diseases, providers may better grasp how to motivate and support the behavioral changes required for improved health.

The International Olympic Committee (IOC), in a consensus meeting on noncommunicable disease prevention, has called for an expansion of the health data collected and a subsequent conversion of that data into information providers and patients may use to achieve better health outcomes. Noncommunicable/chronic diseases, such as diabetes and high blood pressure, are largely preventable. These conditions are related to lifestyle choices: too little exercise, an unhealthy diet, smoking, and alcohol abuse. The IOC recommends capturing data from pedometers and sensors in smartphones, which provide details about patients' physical activity, and combining that with data from interactive smartphone applications (such as calorie counters and food logs) to customize behavior counseling. This approach individualizes not only patient care but also education, prevention, and treatment interventions and advances patient-centered care with respect to information sharing, participation, and collaboration. The IOC also identifies several other potential sources of health data: social media profiles, electronic medical records, and purchase histories. Collectively, these data can yield a "mass customization" of prevention programs. Given that chronic diseases are responsible for 60 percent of deaths and 80 percent of healthcare spending is dedicated to chronic disease management, customizable programs have the potential to save lives and money.

Despite the potential, big data's impact is largely unrealized in patient-centered care efforts. Although merging social, behavioral, and medical data to improve health outcomes has not happened on a widespread basis, there is still a lot that can be done analyzing medical data alone. There is, however, a clear need for computational/analytical tools that can aid providers in recognizing disease patterns, predicting individual patients' susceptibility, and developing personalized interventions. Nitesh Chawla and Darcy Davis propose aggregating and integrating big data derived from millions of electronic health records to uncover patients' similarities and connections with respect to numerous diseases. This makes a proactive medical model possible, as opposed to the current treatment-based approach. Chawla and Davis suggest that leveraging clinically reported symptoms from a multitude of patients, along with their health histories, prescribed treatments, and wellness strategies, can provide a summary report of possible risk factors, underlying causes, and anticipated concomitant conditions for individual patients. They developed an analytical framework called the Collaborative Assessment and Recommendation Engine (CARE), which applies collaborative filtering using inverse frequency and vector similarity to generate predictions based on data from similar patients. The model was validated using a Medicare database of 13 million patients with two million hospital visits over a 4-year period by comparing diagnosis codes, patient histories, and health outcomes. CARE generates a short list that includes high-risk diseases and early warning signs that a patient may develop in the future, enabling a collaborative prevention strategy and better health outcomes. Using this framework, providers can improve the quality of care through prevention and early detection and also advance patient-centered health care.
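Chawla and Davis's implementation is not reproduced here, but the general mechanism they describe, collaborative filtering with inverse-frequency weighting and vector similarity, can be illustrated with a short, self-contained sketch. The patient identifiers and diagnoses below are invented for illustration only; the idea is that rarely shared diagnoses count more toward patient similarity, and the diagnoses of the most similar patients become candidate risks for the target patient.

import math
from collections import Counter

# Toy diagnosis histories keyed by hypothetical patient IDs.
histories = {
    "p1": {"hypertension", "type2_diabetes", "neuropathy"},
    "p2": {"hypertension", "type2_diabetes", "retinopathy"},
    "p3": {"asthma", "allergic_rhinitis"},
    "target": {"hypertension", "type2_diabetes"},
}

# Inverse frequency: diagnoses shared by fewer patients get higher weight.
counts = Counter(d for h in histories.values() for d in h)
n = len(histories)
idf = {d: math.log(n / c) for d, c in counts.items()}

def similarity(a, b):
    # Cosine similarity over inverse-frequency-weighted diagnosis vectors.
    shared = a & b
    num = sum(idf[d] ** 2 for d in shared)
    den = math.sqrt(sum(idf[d] ** 2 for d in a)) * math.sqrt(sum(idf[d] ** 2 for d in b))
    return num / den if den else 0.0

target = histories["target"]
scores = {p: similarity(target, h) for p, h in histories.items() if p != "target"}

# Candidate future diagnoses, weighted by how similar their "donor" patients are.
risk = Counter()
for p, s in scores.items():
    for d in histories[p] - target:
        risk[d] += s

print(risk.most_common())  # neuropathy and retinopathy rank above asthma here

In the framework described above, analogous scoring over coded diagnosis histories at Medicare scale is what feeds the short list of high-risk conditions and early warning signs.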
Data security is a factor that merits discussion. Presently, healthcare systems and individual providers exclusively manage patients' health data. Healthcare systems must comply with security mandates set forth by the Health Insurance Portability and Accountability Act of 1996 (HIPAA). HIPAA demands that data servers be firewall and password protected and that they use encrypted data transmission. Information sharing is an important component of patient-centered care. Some proponents of the patient-centered care model advocate transferring control of health data to patients, who may then use and share it as they see fit. Regardless of who maintains control of health data, storing and electronically transferring that data pose potential security and privacy risks.

Patient-centered care requires collaborative partnerships and wellness strategies that incorporate patients' thoughts, feelings, and preferences. It also requires individualized care, tailored to meet patients' unique needs. Big data facilitates patient-centered/individualized care in several ways. First, it ensures continuity of care and enhanced information sharing through integrated electronic health records. Second, analyzing patterns embedded in big data can help predict disease. APACHE III, for example, is a prognostic program that predicts hospital inpatient mortality. Similar programs help predict the likelihood of heart disease, Alzheimer's, cancer, and digestive disorders. Lastly, big data accrued not only from patients' health records but also from their social media profiles, purchase histories, and smartphone applications has the potential to predict enrollment in wellness programs and improve behavioral modification strategies, thereby improving health outcomes.

Cross-References

▶ Biomedical Data
▶ Electronic Health Records (EHR)
▶ Epidemiology
▶ Health Care Delivery
▶ Health Informatics
▶ HIPAA
▶ Predictive Analytics

Further Reading

Chawla, N. V., & Davis, D. A. (2013). Bringing big data to personalized healthcare: A patient-centered framework. Journal of General Internal Medicine, 28(3), 660–665.

Duffy, T. P. (2011). The Flexner report: 100 years later. Yale communities targeted on a specific disease, or
Journal of Biology and Medicine, 84(3), 269–276. kind of patient experience. In the context of a
Institute of Medicine. (2001). Crossing the quality chasm.
Washington, DC: National Academies Press. sponsored project, PatientsLikeMe staff develop
Institute for Patient- and Family-Centered Care. FAQs. disease-specific tools required for patient health
http://www.ipfcc.org/faq.html. Accessed Oct 2014. self-reporting (Patient-reported outcome mea-
Matheson, G., et al. (2013). Prevention and management of sures – PROMs) on a web-based platform, then
non-communicable disease: The IOC consensus state-
ment, Lausanne 2013. Sports Medicine, 43, collect and analyze the patient data, and produce
1075–1088. research outputs, either commercial research
Picker Institute. Principles of patient-centered care. http:// reports or peer-reviewed studies. Research has
pickerinstitute.org/about/picker principles/. Accessed regarded a wide range of issues, from drug efficacy
Oct 2014.
discovery for neurodegenerative diseases, or
symptom distribution across patient populations,
to sociopsychological issues like compulsive
gambling.
PatientsLikeMe While the network has produced much of its
research in occasion of sponsored research pro-
Niccolò Tempini
jects, this has mostly been discounted from criti-
Department of Sociology, Philosophy and
cism. This because, for its widespread involvement
Anthropology and Egenis, Centre for the Study
of patients in medical research, PatientsLikeMe is
of the Life Sciences, University of Exeter,
often seen as a champion of the so-called partici-
Exeter, UK
patory turn in medicine, the issue of patient
empowerment and more generally of the forces of
democratization that several writers argued to be
Introduction
promise of the social web. While sustaining its
operations through partnerships with commercial
PatientsLikeMe is a for-profit organization based
corporations, PatientsLikeMe also gathers on the
in Cambridge, Massachusetts, managing a social
platform a number of patient-activism NGOs. The
media-based health network that supports patients
system provides them customized profiles and
in activities of health data self-reporting and
communication tools, with which these organiza-
socialization. As of January 2015, the network
tions can try to improve the reach with the patient P
counts more than 300,000 members and 2,300+
population of reference, while the network in
associated conditions and it is one of the most
return gains a prominent position as the center, or
established networks in the health social media
enabler, of health community life.
space. The web-based system is designed and
managed to encourage and enable patients to
share data about their health situation and
Patient Members
experience.
PatientsLikeMe attracts patient members because
the system is designed to allow patients to find
Business Model others and socialize. This can be particularly use-
ful for patients of rare, chronic, or life-changing
Differently from most prominent social media diseases: patient experiences for which an indi-
sites, the network is not ad-supported. Instead, the vidual might feel helpful to learn from the expe-
business model centers on the sale of anonymized rience of others, whom however might be not easy
data access and medical research services to com- to find through traditional, “offline” socialization
mercial organizations (mostly pharmaceutical opportunities. The system is also designed to
companies). The organization has been partnering enable self-tracking of a number of health dimen-
with clients, in order to develop patient sions. The patients record both structured data,

Patients record both structured data, about diagnoses, treatments, symptoms, disease-specific patient-reported questionnaires (PROs), or results of specific lab tests, and semi-structured or unstructured data, in the form of comments, messages, and forum posts. All of these data are at the disposal of the researchers that have access to the data. A paradigmatic characteristic of PatientsLikeMe as a social media research network is that the researchers do not learn about the patients in any other way than through the data that the patients share.

Big Data and PatientsLikeMe

As such, it is the approach to data and to research that defines PatientsLikeMe as a representative "Big Data" research network – one that, however, does not manage staggeringly huge quantities of data nor employ extremely complex technological solutions for data storage and analysis. PatientsLikeMe is a big data enterprise because, first, it approaches medical research through an open (to data sharing by anyone and about user-defined medical entities), distributed (relative to availability of a broadband connection, from anywhere and at any time), and data-based (data are all that is transacted between the participating parties) research approach. Second, the data used by PatientsLikeMe researchers are highly varied (including social data, social media user-generated content, browsing session data, and most importantly structured and unstructured health data) and relatively fast, as they are updated, parsed, and visualized dynamically in real time through the website or other data-management technologies. The research process involves practices of pattern detection, analysis of correlations, and investigation of hypotheses through regression and other statistical techniques.
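As a toy illustration of the correlational side of such work, consider two self-reported series of the kind a platform like this collects. The numbers below are invented, and the functions shown are generic statistics rather than anything specific to PatientsLikeMe's own analysis pipeline.

from statistics import mean

dose = [10, 10, 20, 20, 30, 30, 40, 40]              # hypothetical self-reported doses
severity = [7.1, 6.8, 6.0, 6.2, 5.1, 5.4, 4.2, 4.5]  # hypothetical symptom scores (0-10)

def pearson(x, y):
    # Pearson correlation: does reported symptom severity move with reported dose?
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ols_slope(x, y):
    # Least-squares slope, the building block of the regression analyses noted above.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

print(round(pearson(dose, severity), 2))    # close to -1 for these toy numbers
print(round(ols_slope(dose, severity), 3))  # estimated change in severity per unit dose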
The vision of scientific discovery underlying the PatientsLikeMe project is based on the assumption that, given a broad enough base of users and a granular, frequent, and longitudinal exercise of data collection, new, small patterns ought to emerge from the data and invite further investigation and explanation. This assumption implies that, for medical matters to be discovered further, the development of an open, distributed, and data-based socio-technical system that is more sensitive to their forms and differences is a necessary step. But the hope is also that important lessons can be learned by opening the medical framework to measure and represent a broader collection of entities and events than traditional, profession-bound medical practice accepted. The PatientsLikeMe database includes symptoms and medical entities as described in the terms used by the patients themselves. This involves sensitive and innovative processes of translation from the patient language to expert terminology. Questions about the epistemological consequences of translating the patient voice (until now a neglected form of medical information) into data fields and categories, and the associated concerns about the reliability of patient-generated data, cannot have a simple answer. In any case, from a practice-based point of view these data are nonetheless being mobilized for research through innovative technological solutions for coordinating the patient user base. The data can then be analyzed in multiple ways, all of which involve the use of computational resources and databases, given the digital nature of the data.

As ethnographic research of the organization has pointed out (see the further reading section below), social media companies that try to develop knowledge from the aggregation and analysis of the data contributed by their patients are involved in complex efforts to "cultivate" the information lying in the database, as they have to come to grips with the dynamics and trade-offs that are specific to understanding health through social media. Social media organizations try to develop meaningful and actionable information from their database by trying to make data structures more precise in differentiating between phenomena and reporting about them in data records, and to make the system easier and more flexible to use in order to generate more data. Often these demands work at cross-purposes.

The development of social media for producing new knowledge through distributed publics involves the engineering of social environments where sociality and information production are inextricably intertwined. Users need to be steered towards information-productive behaviors as they engage in social interaction of sorts, for information is the worth upon which social media businesses depend. In this respect, it has been argued that PatientsLikeMe is representative of the construction of sociality that takes place in all social media sites, where social interaction unfolds along paths that the technology continuously and dynamically draws based on the data that the users are sharing.

As such, many see PatientsLikeMe as incarnating an important dimension of the much-expected revolution of personalized medicine. Improvements in healthcare will not be limited to a capillary application of genetic sequencing and other micro- and molecular-biology tests that try to open up the workings of individual human physiology at unprecedented scale; instead, the information produced by these tests will often be related to information about the subjective patient experience and expectations that new information technology capabilities are increasingly making possible.

Other Issues

Much of the public debate about the PatientsLikeMe network involves issues of privacy and confidentiality of the patient users. The network is a "walled garden," with patient profiles remaining inaccessible to unregistered users by default. However, once logged in, every user can browse all patient profiles and forum conversations. On more than one occasion, unauthorized intruders (including journalists and academics) were detected and found screen-scraping data from the website. Despite the organization employing state-of-the-art techniques to protect patient data from unauthorized exporting, any sensitive data shared on a website remain at risk, given the widespread belief – and public record on other websites and systems – that skilled intruders could always execute similar exploits unnoticed. Patients can have a lot to be concerned about, especially if they have conditions that carry a social stigma or if they have shared explicit political or personal views in the virtual comfort of a forum room. In this respect, even if the commercial projects that the organization has undertaken with industry partners implied the exchange of user data that had been pseudonymised before being handed over, the limits of user profile anonymization are well known. In the case of profiles of patients living with rare diseases, which are a consistent portion of the users in PatientsLikeMe, it can arguably be not too difficult to reidentify individuals, upon determined effort. These issues of privacy and confidentiality remain a highly sensitive topic, as society does not dispose of standard and reliable solutions against the various forms that data misuse can take. As both news and scholars have often reported, the malleability of digital data makes it impossible to stop the diffusion of sensitive data once function creep happens.

Moreover, as is often discussed in the social media and big data public debate, data networks increasingly put pressure on the notion of informed consent as an ethically sufficient device for conducting research with user and patient data. The need for moral frameworks of operation that go beyond strict compliance with the law has often been called for, recently by the report on data in biomedical research by the Nuffield Council on Bioethics. In the report, PatientsLikeMe was held up as a paramount example of new kinds of research networks that rely on extensive patient involvement and social (medical) data – networks often dubbed citizen science or participatory research.

On another note, some have argued that PatientsLikeMe, like many other prominent social media organizations, has been exploiting the rhetoric of sharing (one's life with a network and its members) to encourage data-productive behaviors. The business model of the network is built around a traditional, proprietary model of data ownership. The network facilitates the data flow inbound and makes it less easy for the data to flow outbound, controlling their commercial application. In this respect, we must notice that the current practice in social media management in general is often characterized by data-sharing evangelism by the managing organization, which at the same time requires a monopoly over the most important data resources that the network generates.

In the general public debate, this kind of social media business model has been cited as a factor contributing to the erosion of user privacy.

On a different level, one can notice how the kind of patient-reported data collection and medical research that the network makes possible is a much cheaper and, in many respects, more efficient model than what professional-laden institutions such as the clinical research hospital, with their specific work loci and customs, could put in place. This way of organising the collection of valuable data operates by including large numbers of end users who are not remunerated. Despite this, running and organizing such an enterprise is expensive and labor-intensive, and as such, critical analysis of this kind of "crowdsourcing" enterprise needs to look beyond the more superficial issue of the absence of a contract to sanction the exchange of a monetary reward for distributed, small task performances. One connected problem in this respect is that since data express their value only when they are re-situated through use, no data have a distinct, intrinsic value upon generation; not all data generated will ever be equal.

Finally, the abundance of medical data that this network makes available can have important consequences for the therapy or lifestyle decisions that a patient might take. Of course, patients can make up their minds and take critical decisions without appropriate consultation at any time, as they have always done. Nonetheless, the sheer amount of information that networks such as PatientsLikeMe or search engines such as Google make available at a click's distance is without precedent, and what this implies for healthcare must still be fully understood. Autonomous decisions by patients do not necessarily happen for the worst. As healthcare often falls short of providing appropriate information and counseling, especially about everything that is not strictly therapeutic, patients can eventually devise improved courses of action through consultation of appropriate information-rich web resources. At the same time, risks and harms are not fully appreciated, and there is a pressing need to understand more about the consequences of these networks for individual health and the future of healthcare and health research.

There are other issues besides these more evident and established topics of discussion. As has been pointed out, questions of knowledge translation (from the patient vocabulary to the clinical-professional one) remain open, and it is also unclear whether these distributed and participative networks can consistently represent and organize the patient populations that they are deemed to serve, as the involvement of patients is limited and relative to specific tasks, most often of a data-productive character. The aforementioned issues are neither exhaustive nor exhausted in this essay. They require in-depth treatment; with this introduction the aim has been to give a few coordinates on how to think about the subject.

Further Reading

Angwin, J. (2014). Dragnet nation: A quest for privacy, security, and freedom in a world of relentless surveillance. New York: Henry Holt and Company.
Arnott-Smith, C., & Wicks, P. (2008). PatientsLikeMe: Consumer health vocabulary as a folksonomy. American Medical Informatics Association Annual Symposium Proceedings, 2008, 682–686.
Kallinikos, J., & Tempini, N. (2014). Patient data as medical facts: Social media practices as a foundation for medical knowledge creation. Information Systems Research, 25, 817–833. https://doi.org/10.1287/isre.2014.0544.
Lunshof, J. E., Church, G. M., & Prainsack, B. (2014). Raw personal data: Providing access. Science, 343, 373–374. https://doi.org/10.1126/science.1249382.
Prainsack, B. (2013). Let's get real about virtual: Online health is here to stay. Genetical Research, 95, 111–113. https://doi.org/10.1017/S001667231300013X.
Richards, M., Anderson, R., Hinde, S., Kaye, J., Lucassen, A., Matthews, P., Parker, M., Shotter, M., Watts, G., Wallace, S., & Wise, J. (2015). The collection, linking and use of data in biomedical research and health care: Ethical issues. London: Nuffield Council on Bioethics.
Tempini, N. (2014). Governing social media: Organising information production and sociality through open, distributed and data-based systems (Doctoral dissertation). London School of Economics and Political Science, London.
Tempini, N. (2015). Governing PatientsLikeMe: Information production and research through an open, distributed and data-based social media network. The Information Society, 31, 193–211.

Wicks, P., Vaughan, T. E., Massagli, M. P., & Heywood, J. (2011). Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nature Biotechnology, 29, 411–414. https://doi.org/10.1038/nbt.1837.
Wyatt, S., Harris, A., Adams, S., & Kelly, S. E. (2013). Illness online: Self-reported data and questions of trust in medical and social research. Theory, Culture & Society, 30, 131–150. https://doi.org/10.1177/0263276413485900.
Zuboff, S. (2015). Big other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30, 75–89.

Pattern Recognition

▶ Financial Data and Trend Prediction

Persistent Identifiers (PIDs) for Cultural Heritage

Jong-On Hahm
Department of Chemistry, Georgetown University, Washington, DC, USA

A persistent identifier (PID) is a long-lasting reference to a digital resource. Examples include digital object identifiers (DOIs) for publications and datasets, and ORCID iDs for individual authors. A PID can provide access to large amounts of data and metadata about an object, offering a diverse array of information previously unavailable to the public.
Cultural heritage science, specifically the con- billion enterprise of which 42% of total sale
servation and characterization of artworks and values were objects priced more than $1 million.
antiquities, would greatly benefit from the estab- According to the US Department of Justice, art
lishment of a system of persistent identifiers. Cul- crime is the third highest-grossing criminal trade.
tural heritage objects are treasured as historical It is also one of the least prosecuted, primarily
and cultural assets that can iconify the national because data on art objects are scarce. The estab-
identity of many societies. They are also resources lishment of a persistent identifier system could be
for education, drivers of economic activity, and a disruptive force in art crime, a global enterprise
represent significant financial assets. A system of of which the vast majority of sales and transac-
persistent identifiers for cultural heritage could tions goes undetected.
Further Reading

Art scandal threatens to expose mass fraud in global art market. https://www.cnbc.com/2015/03/13/art-scandal-threatens-to-expose-mass-fraud-in-global-art-market.html.
Persistent Identifiers as IRO Infrastructure. https://bl.iro.bl.uk/work/ns/14d713d7-72d3-4f60-8583-91669758ab41.
Protecting Cultural Heritage from Art Theft. https://leb.fbi.gov/articles/featured-articles/protecting-cultural-heritage-from-art-theft-international-challenge-local-opportunity.
The Art Market 2020. https://theartmarket.foleon.com/2020/artbasel/the-global-art-market.
Towards a National Collection. https://tanc-ahrc.github.io/HeritagePIDs/index.html.

Personally Identifiable Information

▶ Anonymization Techniques

Pharmaceutical Industry

Janelle Applequist
The Zimmerman School of Advertising and Mass Communications, University of South Florida, Tampa, FL, USA

Globally, the pharmaceutical industry is worth more than $1 trillion, encompassing one of the world’s most profitable industries, focused on the development, production, and marketing of prescription drugs for use by patients. Over one-third of the pharmaceutical industry is controlled by just ten companies, with six of these companies in the United States alone. The World Health Organization has reported an inherent conflict of interest between the pharmaceutical industry’s business goals and the medical needs of the public, attributable to the fact that twice as much is spent on promotion (including advertisements, marketing, and sales representation) as on the research and development of future prescription drugs needed for public health efforts. The average pharmaceutical company in the United States sees a profit of greater than $10 billion annually, while pharmaceutical companies spend 50 times more on promoting and advertising their own products than on public health information initiatives.

Big data can be described as the collection, manipulation, and analysis of massive amounts of data – and the decisions made from that analysis. Able to be described as both a problem and an opportunity, big data and its techniques continue to be utilized in business by thousands of major institutions. The health-care sector is not immune to massive data collection efforts, and pharmaceuticals in particular comprise an industry that relies on aggregating information.

Literature on data mining in the pharmaceutical industry generally points to a disagreement regarding the intended use of health-care information. On the one hand, historically, data mining techniques have proved useful for the research and development (R&D) of current and future prescription drugs. Alternatively, continuing consumerist discourses in health care that position the pharmaceutical industry as a massive and successful corporate entity have acknowledged how these data are used to increase business sales, potentially at the cost of patient confidentiality and trust.

History of Data Mining Used for Pharmaceutical R&D

Proponents of data mining in the pharmaceutical industry have cited its ability to aid in: organizing information pertaining to genes, proteins, diseases, organisms, and chemical substances, allowing predictive models to be built for analyzing the stages of drug development; keeping track of adverse effects of drugs in a neural network during clinical trial stages; listing warnings and known reactions reported during the post-drug production stage; forecasting new drugs needed in the marketplace; providing inventory control and supply chain management information; and managing inventories.
Data mining was first used in the pharmaceutical industry as early as the 1960s, alongside the increase in prescription drug patenting. With over 1,000 drug patents a year being introduced at that time, data collection assisted pharmaceutical scientists in keeping up with the patents being proposed. At this time, information was collected and published in an editorial-style bulletin categorized according to areas of interest, in an effort to make relevant issues easier for scientists to navigate. Early in the 1980s, technologies allowed biological sequences to be identified and stored, such as the Human Genome Project, which led to the increased use and publishing of databanks. Occurring alongside the growing popularity of personal computer usage, bioinformatics was born, which allowed biological sequence data to be used for discovering and studying new prescription drug targets. Ten years later, in the 1990s, microarray technology developed, posing a problem for data collection, as this technology permitted the simultaneous measurement of large numbers of genes and the collection of experimental data on a large scale. As the ability to sequence a genome arrived in the 2000s, the ability to manage large volumes of raw data was still maturing, creating a continued problem for data mining in the pharmaceutical industry. As the challenges presented for data mining in relation to R&D have continued to increase since the 1990s, the opportunities for data mining in order to increase prescription drug sales have steadily grown.

Data Mining in the Pharmaceutical Industry as a Form of Controversy

Since the early 1990s, health-care information companies have been purchasing the electronic records of prescriptions from pharmacies and other data collection resources in order to strategically link this information with specific physicians.

Prescription tracking refers to the collection of data from prescriptions as they are filled at pharmacies. When a prescription gets filled, data miners are able to collect: the name of the drug, the date of the prescription, and the name or licensing number of the prescribing physician. Yet, it is simple for the prescription drug industry to identify specific physicians through protocols put in place by the American Medical Association (AMA). The AMA has a “Physician Masterfile” that includes all US physicians, whether or not they belong to the AMA, and this file allows the physician licensing numbers collected by data miners to be connected to a name. Information distribution companies (such as IMS Health, Dendrite, Verispan, Wolters Kluwer, etc.) purchase records from pharmacies. What many consumers do not realize is that most pharmacies have these records for sale and are able to sell them legally by not including patient names and only providing a physician’s state licensing number and/or name. While pharmacies cannot release a patient’s name, they can provide data miners with a patient’s age, sex, geographic location, medical conditions, hospitalizations, laboratory tests, insurance copays, and medication use. This has caused a significant area of concern on behalf of patients, as it not only may increase instances of prescription detailing, but it may compromise the interests of patients. Data miners do not have access to patient names when collecting prescription data; however, data miners assign unique numbers to individuals so that future prescriptions for the patient can be tracked and analyzed together. This means that data miners can determine: how long a patient remains on a drug, whether the drug treatment is continued, and which new drugs become prescribed for the patient.
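A small, hypothetical sketch of this kind of longitudinal analysis is shown below. The table layout, column names, and sample rows are illustrative assumptions, not the format used by any actual information distribution company; the point is only that once prescriptions are keyed to a stable pseudonymous patient number, persistence and switching can be computed with routine data manipulation.

```python
# Hypothetical sketch: measuring persistence and switching from de-identified
# prescription records keyed by a pseudonymous patient ID. Column names and the
# sample rows are invented for illustration.
import pandas as pd

fills = pd.DataFrame(
    {
        "patient_pseudo_id": [101, 101, 101, 102, 102],
        "drug": ["DrugA", "DrugA", "DrugB", "DrugA", "DrugA"],
        "fill_date": pd.to_datetime(
            ["2020-01-05", "2020-02-04", "2020-03-10", "2020-01-20", "2020-04-15"]
        ),
    }
)

# Days a patient stays on each drug: span between first and last observed fill.
persistence = (
    fills.groupby(["patient_pseudo_id", "drug"])["fill_date"]
    .agg(first_fill="min", last_fill="max")
    .assign(days_on_drug=lambda d: (d["last_fill"] - d["first_fill"]).dt.days)
)
print(persistence)

# Switching: for each patient, did a different drug follow the first one?
ordered = fills.sort_values("fill_date")
first_drug = ordered.groupby("patient_pseudo_id")["drug"].first()
last_drug = ordered.groupby("patient_pseudo_id")["drug"].last()
switched = first_drug != last_drug
print(switched)  # patient 101 switched from DrugA to DrugB; patient 102 did not
```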
As information concerning a patient’s health is highly sensitive, the data mining techniques used by the pharmaceutical industry have perpetuated the notion that personal information carries a substantial economic value. When data mining companies pay pharmacies to extract prescription drug information, the relationships between patients and their physicians and/or pharmacists are being exploited. The American Medical Association (AMA) established the Physician Data Restriction Program in 2006, giving any physician the opportunity to opt out of data mining initiatives. To date, no such program exists for patients that would give them the opportunity to have their records removed from data collection procedures and subsequent analyses. Three states have enacted statutes that do not permit data mining of prescription records. With the Prescription Confidentiality Act of 2006, New Hampshire became the first state to decide that prescription information could not be sold or used for any advertising, marketing, or promotional purposes. However, if the information is de-identified, meaning that the physician and patient names cannot be accessed, then the data can be aggregated by geographical region or zip code, meaning that data mining companies could still provide an overall, more generalized report for small geographic areas but could not target specific physicians. Maine and Vermont have statutes that limit the presence of data mining. Physicians in Maine can register with the state to prevent data mining companies from obtaining their prescribing records. Data miners in Vermont must obtain consent from the physicians whose records they are analyzing prior to using “prescriber-identifiable” information for marketing or promotional purposes.

The number one customer for information distribution companies is the pharmaceutical industry, which purchases the prescribing data to identify the highest prescribers and also to track the effects of its promotional efforts. Physicians are given a value, a ranking from one to ten, which identifies how often they prescribe drugs. A sales training guide for Merck even states that this value is used to identify which products are currently in favor with the physician in order to develop a strategy to change those prescriptions into Merck prescriptions. The empirical evidence provided by information distribution companies offers a glimpse into the personality, behaviors, and beliefs of a physician, which is why these numbers are so valued by the drug industry.

By collecting and analyzing this data, pharmaceutical sales representatives are able to better target their marketing activities toward physicians. For example, as a result of data mining in the pharmaceutical industry, pharmaceutical sales representatives could determine which physicians are already prescribing specific drugs in order to reinforce already-existent preferences, or could learn when a physician switches from a drug to a competing drug, so that the representative can attempt to encourage the physician to switch back to the original prescription.

The Future of Data Mining in the Pharmaceutical Industry

As of 2013, only 18% of pharmaceutical companies worked directly with social media to promote their prescription drugs, but this number is expected to increase substantially in the next year. As more individuals tweet about their medical concerns, symptoms, the drugs they take, and respective side effects, pharmaceutical companies have noticed that social media has become an integrated part of personalized medicine for individuals. Pharmaceutical companies are already in the process of hiring data miners to collect and analyze various forms of public social media in an effort to: discover unmet needs, recognize new adverse events, and determine what types of drugs consumers would like to enter the market.

Based on the history of data mining used by pharmaceutical corporations, it is evident that the lucrative nature of prescription drugs serves as a catalyst for data collection and analysis. By having the ability to generalize what should be very private information about patients for the prescription drug industry, the use of data allows prescription drugs to make more profit than ever, as individual information can be commoditized to benefit the bottom line of a corporation. Although there are evident problems associated with prescription drug data mining, the US Supreme Court has continued to recognize that the pharmaceutical industry has a First Amendment right to advertise and solicit clients for goods and future services. The Court has argued that legal safeguards, such as the Health Insurance Portability and Accountability Act (HIPAA), are put in place to combat the very concerns posed by practices such as pharmaceutical industry data mining. Additionally, the Court has found that by stripping pharmaceutical records of patient information that could lead to personal identification (e.g., name, address, etc.), patients have their confidentiality adequately protected.
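The de-identified, region-level reporting described above (for example, under the New Hampshire statute) can be illustrated with a short, hypothetical sketch: prescriber and patient identities are dropped, records are grouped by a coarse geographic key, and small groups are suppressed. The column names, the three-digit zip prefix, and the suppression threshold are illustrative assumptions rather than the rules of any specific statute or vendor.

```python
# Hypothetical sketch: aggregating de-identified prescription records by a
# coarse geographic key and suppressing small cells. All names and thresholds
# are illustrative only.
import pandas as pd

records = pd.DataFrame(
    {
        "zip_code": ["03301", "03302", "03301", "04401", "04401", "05601"],
        "drug": ["DrugA", "DrugA", "DrugB", "DrugA", "DrugA", "DrugB"],
    }
)

# Coarsen geography to a 3-digit zip prefix (no direct identifiers are kept).
records["zip3"] = records["zip_code"].str[:3]

report = (
    records.groupby(["zip3", "drug"])
    .size()
    .reset_index(name="n_prescriptions")
)

# Suppress cells below a minimum count so small areas cannot single out a prescriber.
MIN_CELL = 2  # illustrative threshold
report = report[report["n_prescriptions"] >= MIN_CELL]
print(report)
```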
The law, therefore, leaves it to the discretion of the physician to decide whether they will associate with pharmaceutical sales representatives and various data collection procedures.

An ongoing element to address in analyzing the pharmaceutical industry’s use of data mining techniques will be the level of transparency used with patients while utilizing the information collected. Research shows that the majority of patients in the United States are not only unfamiliar with data mining use by the pharmaceutical industry, but that they are against any personal information (e.g., prescription usage information and personal diagnoses) being sold and shared with outside entities, namely, corporations. As health care continues to change in the United States, it will be important for patients to understand the ways in which their personal information is being shared and used, in an effort to increase national understanding of how privacy laws are connected to the pharmaceutical industry.

Cross-References

▶ Electronic Health Records (EHR)
▶ Health Care Delivery
▶ Patient Records
▶ Privacy

Further Reading

Altan, S., et al. (2010). Statistical considerations in design space development. Pharmaceutical Technology, 34(7), 66–70.
Fugh-Berman, A. (2008). Prescription tracking and public health. Journal of General Internal Medicine, 23(8), 1277–1280.
Greene, J. A. (2007). Pharmaceutical marketing research and the prescribing physician. Annals of Internal Medicine, 146(10), 742–747.
Klocke, J. L. (2008). Comment: Prescription records for sale: Privacy and free speech issues arising from the sale of de-identified medical data. Idaho Law Review, 44(2), 511–536.
Orentlicher, D. (2010). Prescription data mining and the protection of patients’ interests. The Journal of Law, Medicine & Ethics, 38(1), 74–84.
Steinbrook, R. (2006). For sale: Physicians’ prescribing data. The New England Journal of Medicine, 354(26), 2745–2747.
Wang, J., et al. (2011). Applications of data mining in pharmaceutical industry. The Journal of Management and Engineering Integration, 4(1), 120–128.
White paper: Big Data and the needs of the Pharmaceutical Industry. (2013). Philadelphia: Thomson Reuters.
World Health Organization. (2013). Pharmaceutical Industry. Retrieved online from http://www.who.int/trade/glossary/story073/en/.

Policy

▶ Regulation

Policy Analytics

Laurie A. Schintler
George Mason University, Fairfax, VA, USA

Overview

Over the last half century, the policymaking process has undergone a digital transformation (Pencheva et al. 2020). Information technology such as computers and the Internet – artifacts of the “digital revolution” – helped usher in data-driven public policy analysis and decision-making in the 1980s (Gil-Garcia et al. 2018). Now big data, coupled with new and advancing computational tools and analytics (e.g., machine learning), are digitalizing the process even further. While the origins of the term are murky, policy analytics encapsulates this changing context, referring specifically to the use of big data resources and tools for policy analysis (Daniell et al. 2016). Although policy analytics can benefit the policymaking process in various ways, it also comes with a set of issues, challenges, and downsides that must be managed simultaneously.

Prospects and Potentialities
Policymaking involves developing, analyzing, evaluating, and implementing laws, regulations, and other courses of action to solve real-world problems for improving societal welfare. The policy cycle is a framework for understanding this process and the “complex strategic activities, actors, and drivers” it affects and is affected by (Pencheva et al. 2020). Critical steps in this cycle include:

1. Problem identification and agenda setting
2. Development of possible policy options (or policy instruments)
3. Evaluation of the feasibility and impact of each policy option
4. Selection and implementation of a policy or set of guidelines

There is also often an ongoing assessment of policies and their impacts after they have been implemented (i.e., ex-post evaluation), which may, in turn, result in the modification or termination of policies.

Data and methods have long played a critical role in all phases of the policy life cycle. In this regard, various types and forms of qualitative and quantitative data (e.g., census records, household surveys), along with models and tools (e.g., cost-benefit analysis, statistical inference, and mathematical optimization), are used for analyzing and assessing policy problems and their potential solutions (Daniell et al. 2016). Big data and data-driven computational and analytical tools (e.g., machine learning) provide a new “toolbox” for policymakers and policy analysts, which can help address the growing complexities of the policymaking process while overcoming the limitations of conventional methods and data sources for policy analysis.

First, big data provide a rich means for identifying, characterizing, and tracking problems for which there may be a need for policy solutions. Indeed, such tasks are fraught with a growing number of complications and challenges, as public issues and public policies have become increasingly dynamic, interconnected, and unpredictable (Renteria and Gil-Garcia 2017). In this regard, conventional sources of data (e.g., government censuses) tend to fall short, especially given that the information is described in fixed and aggregate forms, spatially and temporally. Accordingly, such data lack the level of resolution required to understand the details and nuances of public problems, such as how particular individuals, neighborhoods, and groups are negatively impacted by dynamic situations and circumstances. Big data, such as that produced by video surveillance cameras, the Internet of Things (IoT), mobile phones, and social media, provide the granularity to address such gaps.

Second, data-driven analytics such as deep neural learning – a powerful form of machine learning that attempts to mimic how the human brain processes information – have enormous potential for policy analysis. Specifically, such approaches enable insight to be gleaned from massive amounts of streaming data in a capacity not possible with traditional models and frameworks. Moreover, supervised machine learning techniques for prediction and classification can help anticipate trends and evaluate policy options on the fly. They also give policymakers the ability to test potential solutions in advance. “Nowcasting,” an approach developed in the field of economics, enables the evaluation of policies in the present, the imminent future, and the recent past (Bańbura et al. 2010). Such methods can supplement and inform models used for longer-term forecasting.
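As a rough illustration of the nowcasting logic cited above (Bańbura et al. 2010), the sketch below fits a simple bridge-style regression: a quarterly policy-relevant target is regressed on monthly indicators that are already observed within the current quarter, and the fitted relationship is used to estimate the not-yet-published current-quarter value. The data are synthetic, and the single-equation least-squares setup is a deliberate simplification of the dynamic factor models used in the nowcasting literature.

```python
# Minimal sketch of a bridge-equation nowcast on synthetic data: estimate the
# current quarter's target from indicators observed so far. A simplification of
# the dynamic factor models used in the nowcasting literature.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic history: 40 past quarters of two quarterly-averaged monthly indicators.
X_hist = rng.normal(size=(40, 2))
true_beta = np.array([0.8, -0.3])
y_hist = X_hist @ true_beta + rng.normal(scale=0.1, size=40)  # target series

# Fit the bridge regression y_t = X_t @ beta + e_t by ordinary least squares.
X_design = np.column_stack([np.ones(len(X_hist)), X_hist])  # add an intercept
beta_hat, *_ = np.linalg.lstsq(X_design, y_hist, rcond=None)

# Current quarter: the indicators are already observed, the target is not.
x_now = np.array([1.0, 0.5, -0.2])  # intercept + latest indicator readings
nowcast = x_now @ beta_hat
print(f"Nowcast of current-quarter target: {nowcast:.3f}")
```

The same fitted relationship can be re-evaluated each time a new indicator release arrives, which is what gives nowcasting its "evaluate the present" character relative to conventional forecasting.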
Third, big data produced by crowdsourcing mechanisms and platforms provide a valuable resource for addressing the difficulties associated with agenda setting, problem identification, and policy prioritization (Schintler and Kulkarni 2014). Such activities pose an array of challenges. One issue is that the policymaking process involves multiple stakeholders, each of which has its own set of values, objectives, expectations, interests, preferences, and motivations. Complicating matters is that positions on “hot-button” policy issues, such as climate change and vaccinations, have become increasingly polarized and politicized. While surveys and interviews provide a means for sensing the opinions, attitudes, and needs of citizens and other stakeholders, they are costly and time-consuming to implement. Moreover, they do not allow for real-time situational awareness and intelligence. Crowdsensing data, combined with tools such as sentiment analysis, provide a potential means for understanding, tracking, and accounting for ongoing and fluctuating views on policy problems and public policy solutions. Thus, as a lever for increasing participation in the policy cycle, crowdsensed big data can promote social and civic empowerment, ultimately engendering trust within and between stakeholder groups (Brabham 2009).

Downsides, Dilemmas, and Challenges

Despite the actual and potential benefits of big data and data-driven methods for policy analysis, i.e., policy analytics, the public sector has yet to make systematic and aggressive use of such tools, resources, and approaches (Daniell et al. 2016; Sun and Medaglia 2019). While robust methods, techniques, and platforms for big data have been developed for business (i.e., business intelligence), they cannot (and should not) simply be transferred to a public policy context. One significant issue is that the framings, interests, biases, motivations – and values – of public and private entities tend to be incongruent (Sun and Medaglia 2019). Whereas companies generally strive to maximize profit and the rate of return on investment, the government is more concerned with equitably allocating public resources to promote societal well-being (Daniell et al. 2016) (of course, there are some exceptions, e.g., “socially conscious” corporations or corrupt governments). As values get embedded into the architecture and design of computational models and drive the selection and use of data in the first place, the blind application of business analytics to public policy can have dangerous consequences. More to the point, while the use of business intelligence for policy analysis may yield efficient and cost-saving policy solutions, it may come at the expense of broader societal interests, such as human rights and social justice. On top of all this, there are technical and ethical issues and considerations that come into play in applying big data and data-driven methods in the public sphere, which create an additional set of barriers to their use and application. One challenge in this regard is integrating traditional sources of data (e.g., census records) with big data, especially given that they tend to have different levels of resolution and coverage. Issues related to privacy, data integrity, data provenance, and algorithmic bias and discrimination complicate matters further (Schintler 2020; Schintler and Fischer 2018).

Conclusion

In sum, while the use of big data and data-driven methods for policy analysis, i.e., policy analytics, can improve the efficiency and effectiveness of the policymaking process in various ways, it also comes with an array of downsides and dilemmas, as highlighted. Thus, a grand challenge is balancing the need for “robust and convincing analysis” with the need to satisfy public expectations about the transparency, fairness, and integrity of the policy process and its outcomes (Daniell et al. 2016). In this regard, public policy itself has a crucial role to play.

Cross-References

▶ Business Intelligence Analytics
▶ Crowdsourcing
▶ Ethics
▶ Governance

Further Reading

Bańbura, M., Giannone, D., & Reichlin, L. (2010). Nowcasting. ECB working paper, no. 1275. Frankfurt a. M.: European Central Bank (ECB).
Brabham, D. C. (2009). Crowdsourcing the public participation process for planning projects. Planning Theory, 8(3), 242–262.
Daniell, K. A., Morton, A., & Insua, D. R. (2016). Policy analysis and policy analytics. Annals of Operations Research, 236(1), 1–13.
Gil-Garcia, J. R., Pardo, T. A., & Luna-Reyes, L. F. (2018). Policy analytics: Definitions, components, methods, and illustrative examples. In Policy analytics, modelling, and informatics (pp. 1–16). Cham: Springer.
Pencheva, I., Esteve, M., & Mikhaylov, S. J. (2020). Big data and AI – A transformational shift for government: So, what next for research? Public Policy and Administration, 35(1), 24–44.
Renteria, C., & Gil-Garcia, J. R. (2017). A systematic literature review of the relationships between policy analysis and information technologies: Understanding and integrating multiple conceptualizations. In International conference on electronic participation (pp. 112–124). Cham: Springer.
Schintler, L. A. (2020). Regional policy analysis in the era of spatial big data. In Development studies in regional science (pp. 93–109). Singapore: Springer.
Schintler, L. A., & Kulkarni, R. (2014). Big data for policy analysis: The good, the bad, and the ugly. Review of Policy Research, 31(4), 343–348.
Schintler, L. A., & Fischer, M. M. (2018). Big data and regional science: Opportunities, challenges and directions for future research (Working Papers in Regional Science). WU Vienna University of Economics and Business, Vienna. https://epub.wu.ac.at/6122/1/Fischer_etal_2018_Big-data.pdf.
Sun, T. Q., & Medaglia, R. (2019). Mapping the challenges of artificial intelligence in the public sector: Evidence from public healthcare. Government Information Quarterly, 36(2), 368–383.

Political Science

Marco Morini
Dipartimento di Comunicazione e Ricerca Sociale, Universita’ degli Studi “La Sapienza”, Roma, Italy

Political science is a social science discipline focused on the study of the state, nation, government, and public policies. As a separate field, it is a relatively late arrival, and it is commonly divided into distinct sub-disciplines which together constitute the field: political theory, comparative politics, public administration, and political methodology. Although political science has been using machine learning methods for decades, nowadays political scientists are encountering larger datasets with increasingly complex structures and are using innovative new big data techniques and methods to collect data and test hypotheses.

Political science deals extensively with the allocation and transfer of power in decision-making and with the roles and systems of governance, including governments, international organizations, political behavior, and public policies. It is methodologically diverse and employs many methods originating in social science research. Approaches include positivism, rational choice theory, behavioralism, structuralism, post-structuralism, realism, institutionalism, and pluralism.

Although it was codified in the nineteenth century, political science originated in Ancient Greece with the works of Plato and Aristotle. During the Italian Renaissance, the Florentine philosopher Niccolò Machiavelli established the emphasis of modern political science on direct empirical observation of political institutions and actors. Later, the expansion of the scientific paradigm during the Enlightenment further pushed the study of politics beyond normative determinations. Because political science is essentially a study of human behavior, on all sides of politics, observations in controlled environments are often challenging to reproduce or duplicate, though experimental methods are increasingly common. Because of this, political scientists have historically observed political elites, institutions, and individual or group behavior in order to identify patterns, draw generalizations, and build social and political theories.

Like all social sciences, political science faces the difficulty of observing human actors that can only be partially observed and who have the capacity for making conscious choices. Despite the complexities, contemporary political science has progressed by adopting a variety of methods and theoretical approaches to understanding politics, and methodological pluralism is a defining feature of contemporary political science. Often in contrast with national media, political science scholars seek to compile long-term data and research on the impact of political issues, producing in-depth articles and breaking down the issues.

Several scholars have long been using machine learning methods to develop and analyze relatively large datasets of political events, such as using multidimensional scaling methods to study roll-call votes from the US Congress. For decades, therefore, mainstream political methodology has already dealt with the exact attributes that characterize big data – the use of computationally intensive techniques to analyze what to social scientists are large and complex datasets. In 1998, Yale Professors Don Green and Alan Gerber conducted the first randomized controlled trial in modern political science, assigning New Haven voters to receive nonpartisan election reminders by mail, phone, or an in-person visit from a canvasser and measuring which group saw the greatest increase in turnout. The subsequent wave of field experiments by Green, Gerber, and their followers focused on mobilization, testing competing modes of contact and get-out-the-vote language to see which were most successful.
But while there has been this long tradition of big data-like research in political science, political scientists are now using innovative new big data techniques and methods to collect data and test hypotheses. They employ automated analytical methods to create new knowledge from the unstructured and overwhelming amount of data streaming in from a variety of sources. As field experiments are an important methodology that many social scientists use to test many different behavioral theories, large-scale field experiments can now be accomplished at low cost. Political scientists are seeing interesting new research opportunities with social media data, with large aggregations of field experiment and polling data, and with other large-scale datasets that just a few years ago could not be easily analyzed with available computational resources. In particular, recent advances in text mining, automatic coding, and analysis are bringing major changes in two interrelated research subfields: social media and politics, and election campaigns. Social media analytics and tools such as the Twitter Political Index – which measures Twitter users’ sentiments about candidates – allow researchers to track posts of candidates and to study the social media habits of politicians and governments. Scholars can now gather, manage, and analyze huge amounts of data. On the other hand, recent election cycles showed how campaigners count on big data in order to win elections.

The most well-known example is how the Democratic National Committee leveraged big data analytics to better understand and predict voter behavior in the 2012 US elections. The Obama campaign used data analytics and the experimental method to assemble a winning coalition vote by vote. In doing so, it overturned the long dominance of TV advertising in US politics and created something new in the world: a national campaign run like a local ward election, where the interests of individual voters were known and addressed.

The 2012 Obama campaign used big data to rally individual voters. Its approach amounted to a decisive break with twentieth-century tools for tracking public opinion, which consisted of identifying small samples that could be treated as representative of the whole. The electorate could be seen as a collection of individual citizens who could each be measured and assessed on their own terms. This campaign became celebrated for its use of technology – much of it developed by an unusual team of coders and engineers – that redefined how individuals could use the Web, social media, and smartphones to participate in the political process.

Cross-References

▶ Curriculum, Higher Education, Humanities
▶ Data Mining
▶ Social Sciences

Further Reading

Issenberg, S. (2012). How President Obama’s campaign used big data to rally individual voters. http://www.technologyreview.com/featuredstory/509026/how-obamas-team-used-big-data-to-rally-voters/. Accessed 28 May 2014.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. London: Eamon Dolan/Mariner Books.
McGann, A. (2006). The logic of democracy. Ann Arbor: University of Michigan Press.
Pollution, Air

Zerrin Savaşan
Department of International Relations, Sub-Department of International Law, Selçuk University, Konya, Turkey

The air contains many different substances: gases, aerosols, particulate matter, trace metals, and a variety of other compounds. If these are no longer at their natural concentrations, and change in space and over time to an extent that the air quality deteriorates, contaminants or pollutant substances are present in the air. The release of these air pollutants causes harmful effects to both the environment and humans – to all organisms. This is regarded as air pollution.

The air is a common, shared resource of all human beings. Once released, air pollutants can be carried by natural processes such as winds and rains. Some pollutants, e.g., lead or chloroform, therefore often contaminate more than one environmental medium, so many air pollutants can also be water or land pollutants. They can combine with other pollutants, undergo chemical transformations, and eventually be deposited in different locations. Their effects can emerge in locations far from their main sources. Thus, they can detrimentally affect all organisms on local or regional scales and also the climate on a global scale.

Hence, concern for air pollution and its influence on the earth, and efforts to prevent and mitigate it, have increased greatly on a global scale. However, today it still stands as one of the primary challenges that should be addressed globally on the basis of international cooperation. It therefore becomes necessary to promote widespread understanding of air pollution, its pollutants, sources, and impacts.

Sources of Air Pollution

Air pollutants can be produced by natural causes (e.g., fires from burning vegetation, forest fires, volcanic eruptions, etc.) or by anthropogenic (human-caused) activities. For outdoor pollution – referring to the pollutants found outdoors – smokestacks of industrial plants can be given as an example of human-made sources; however, natural processes, e.g., volcanic eruptions, also produce outdoor air pollution. The main causes of indoor air pollution, on the other hand, again arise largely from human-driven activities, e.g., technologies used for cooking, heating, and lighting. Nonetheless, there are also natural indoor air pollutants, like radon, as well as chemical pollutants from building materials and cleaning products.

Among these, human-based causes, especially after industrialization, have produced a variety of sources of air pollution and thus have contributed more to global air pollution. Emissions can emanate from point and nonpoint sources or from mobile and stationary sources. A point source describes a specific location from which large quantities of pollutants are discharged, e.g., coal-fired power plants. A nonpoint source, on the other hand, is more diffuse, often involving many small contributions spread across a wide area, e.g., automobiles. Automobiles are also known as mobile sources, and the combustion of gasoline is responsible for the emissions released from mobile sources. Industrial activities are also known as stationary sources, and the combustion of fossil fuels (coal) is accountable for their emissions.

The pollutants produced from these distinct sources may cause harm directly or indirectly. If they are emitted from the source directly into the atmosphere, and so cause harm directly, they are called primary pollutants, e.g., carbon oxides, carbon monoxide, hydrocarbons, nitrogen oxides, sulfur dioxide, particulate matter, and so on. If they are produced from chemical reactions involving primary pollutants in the atmosphere, they are known as secondary pollutants, e.g., ozone and sulfuric acid.
The Impacts of Air Pollution

Air pollutants result in a wide range of impacts upon both humans and the environment. Their detrimental effects upon humans can be briefly summarized as follows: health problems resulting particularly from toxicological stress, like respiratory diseases such as emphysema and chronic bronchitis, chronic lung diseases, pneumonia, cardiovascular troubles, and cancer, and immune system disorders increasing susceptibility to infection, and so on. Their adverse effects upon the environment, on the other hand, include the following: acid deposition, climate change resulting from the emission of greenhouse gases, degradation of air resources, deterioration of air quality, noise, photooxidant formation (smog), reduction in the overall productivity of crop plants, stratospheric ozone (O3) depletion, threats to the survival of biological species, etc.

In determining the extent and degree of harm caused by these pollutants, it becomes necessary to know enough about the features of each pollutant. This is because some pollutants can be the cause of environmental or health problems in the air yet be essential in the soil or water; e.g., nitrogen is harmful in the air, as it can form ozone, but it is necessary in the soil, where it can act beneficially as fertilizer. Additionally, if toxic substances exist below a certain threshold, they are not necessarily harmful.

New Technologies for Air Pollution: Big Data

Before the industrialization period, the components of pollution were thought to be primarily smoke and soot; with industrialization, however, they have expanded to include a broad range of emissions, including toxic chemicals and biological or radioactive materials. Even today, there are six conventional pollutants (or criteria air pollutants) identified by the US Environmental Protection Agency (EPA): carbon monoxide, lead, nitrogen oxides, ozone, particulate matter, and sulfur oxides. Hence, it is to be expected that there can be new sources of air pollution, and so new threats for the earth, soon. Indeed, very recently, through the Kigali (Rwanda) Amendment (14 October 2016) to the Montreal Protocol, adopted at the Meeting of the Parties (MOP 28), it was accepted to address hydrofluorocarbons (HFCs) – greenhouse gases having a very high global warming potential, even if not as harmful as CFCs and HCFCs for the ozone layer – under the Protocol, in addition to chlorofluorocarbons (CFCs) and hydrochlorofluorocarbons (HCFCs).

Air pollution first became an international issue with the Trail Smelter Arbitration (1941) between Canada and the United States. Indeed, prior to the decision made by the Tribunal, disputes over air pollution between two countries had never been settled through arbitration. Since this arbitration case – and specifically with increasing efforts since the early 1990s – attempts to measure, reduce, and address the rapidly growing impacts of air pollution have been continuing. Developing new technologies, like Big Data, arises as one of those attempts.

Big Data has no uniform definition (ELI 2014; Keeso 2014; Simon 2013; Sowe and Zettsu 2014). In fact, it is defined and understood in diverse ways by different researchers (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Gogia 2012; Mayer-Schönberger and Cukier 2013; Manyika et al. 2011) and by interested companies like Experian, Forrester, Forte Wares, Gartner, and IBM. It was initially identified by 3Vs – volume (data amount), velocity (data speed), and variety (data types and sources) (Laney 2001). Over time, it has come to include a fourth V, such as veracity (data accuracy) (IBM) or variability (data being subject to structural variation) (Gogia 2012); a fifth V, value (the capability of data to be turned into value), together with veracity (Marr); and a sixth one, vulnerability (data security and privacy) (Experian 2016). It can also be defined by veracity, value, and visualization (visual representation of data) as additional 3Vs (Sowe and Zettsu 2014), and also by volume, velocity, and variety requiring specific technology and analytical methods for its transformation into value (De Mauro et al. 2016). However, it is generally taken to refer to data processing sets/applications so large and complex that conventional systems are not able to cope with them.

Because air pollution has many aspects that should be measured, as mentioned above, it requires massive data collected at different spatial and temporal levels. Therefore, it is observed in practice that Big Data sets and analytics are increasingly used in the field of air pollution – for monitoring, predicting its possible consequences, responding to them in a timely way, controlling and reducing its impacts, and mitigating the pollution itself.
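As a small illustration of the monitoring use just described, the sketch below takes a stream of hourly fine-particulate (PM2.5) sensor readings, computes a rolling 24-hour average, and flags exceedances of a threshold. The synthetic readings and the 35 µg/m³ cutoff (chosen with the US 24-hour PM2.5 standard in mind) are assumptions made purely for the example, not an implementation of any agency's compliance procedure.

```python
# Illustrative sketch: rolling 24-hour PM2.5 averages from hourly sensor data,
# flagging hours whose trailing average exceeds a threshold. Readings and the
# 35 µg/m³ cutoff are assumptions for illustration only.
import numpy as np
import pandas as pd

hours = pd.date_range("2021-06-01", periods=72, freq="h")
rng = np.random.default_rng(1)
pm25 = pd.Series(20 + 15 * rng.random(72), index=hours, name="pm25_ugm3")
pm25.iloc[30:40] += 40  # synthetic pollution episode

daily_avg = pm25.rolling(window=24, min_periods=18).mean()
THRESHOLD = 35.0  # illustrative 24-hour fine-particulate benchmark (µg/m³)
exceedances = daily_avg[daily_avg > THRESHOLD]

print(f"Hours whose trailing 24-h average exceeds {THRESHOLD} µg/m³: {len(exceedances)}")
print(exceedances.head())
```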
Such data and analytics can be used by different kinds of organizations, such as governmental agencies, private firms, and nongovernmental organizations (NGOs). To illustrate, under the US Environmental Protection Agency (EPA), examples of Big Data use include:

• Air Quality Monitoring (collaborating with NASA on the DISCOVER-AQ initiative, this involves research on Apps and Sensors for Air Pollution (ASAP), National Ambient Air Quality Standards (NAAQS) compliance, and data fusion methods)
• The Village Green Project (on improving air quality monitoring and awareness in communities)
• The Environmental Quality Index (EQI) (a dataset consisting of an index of environmental quality based on air, water, land, the built environment, and sociodemographic space)

There are also examples generated by local governments like “E-Enterprise for the Environment,” by environmental organizations like “Personal Air Quality Monitoring,” by citizen science like “Danger Maps,” and by private firms like “Aircraft Emissions Reductions” (ELI 2014) or the Green Horizons Project (IBM 2015).

The Environmental Performance Index (EPI) is another platform – using Big Data compiled from a great number of sensors and models – providing a country and issue ranking of how each country manages environmental issues, as well as a Data Explorer allowing users to investigate the global data, comparing environmental performance with GDP, population, land area, or other variables.

Despite all this, as the potential benefits and costs of the use of Big Data are still under discussion (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014), various concerns can be raised about the use of Big Data to monitor, measure, and forecast air pollution as well. Therefore, further research is required to identify gaps, challenges, and solutions for “making the right data (not just higher volume) available to the right people (not just higher variety) at the right time (not just higher velocity)” (Forte Wares).

Cross-References

▶ Environment
▶ Pollution, Land
▶ Pollution, Water

References

Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878. Accessed 3 Feb 2017.
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Retrieved from https://www.researchgate.net/publication/299379163_A_formal_definition_of_Big_Data_based_on_its_essential_features. Accessed 3 Feb 2017.
Environmental Law Institute (ELI). (2014). Big data and environmental protection: An initial survey of public and private initiatives. Washington, DC: Environmental Law Institute. Retrieved from https://www.eli.org/sites/default/files/eli-pubs/big-data-and-environmental-protection.pdf. Accessed 3 Feb 2017.
Environmental Performance Index (EPI). (n.d.). Available at: http://epi.yale.edu/. Accessed 3 Feb 2017.
Experian. (2016). A data powered future. White Paper. Retrieved from http://www.experian.co.uk/assets/resources/white-papers/data-powered-future-2016.pdf. Accessed 3 Feb 2017.
Gartner. (2011). Gartner says solving ‘big data’ challenge involves more than just managing volumes of data. June 27, 2011. Retrieved from http://www.gartner.com/newsroom/id/1731916. Accessed 3 Feb 2017.
Gogia, S. (2012). The big deal about big data for customer engagement. June 1, 2012. Retrieved from http://www.iab.fi/media/tutkimus-matskut/130822_forrester_the_big_deal_about_big_data.pdf. Accessed 3 Feb 2017.
IBM. (2015). IBM expands green horizons initiative globally to address pressing environmental and pollution challenges. Retrieved from http://www-03.ibm.com/press/us/en/pressrelease/48255.wss. Accessed 3 Feb 2017.
IBM. (n.d.). What is big data? Retrieved from https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html. Accessed 3 Feb 2017.
Keeso, A. (2014). Big data and environmental sustainability: A conversation starter. Smith School Working Paper Series, December 2014, Working paper 14-04. Retrieved from http://www.smithschool.ox.ac.uk/library/working-papers/workingpaper%2014-04.pdf. Accessed 3 Feb 2017.
Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. Meta Group. Retrieved from https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Accessed 3 Feb 2017.
Manyika, J., et al. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Marr, B. (n.d.). Big data: The 5 Vs everyone must know. Retrieved from https://www.linkedin.com/pulse/20140306073407-64875646-big-data-the-5-vs-everyone-must-know. Accessed 3 Feb 2017.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
Wares, F. (n.d.). Failure to launch: From big data to big decisions: Why velocity, variety and volume is not improving decision making and how to fix it. White Paper. A Forte Consultancy Group Company. Retrieved from http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares–pro-active-reporting_EN.pdf. Accessed 3 Feb 2017.

Further Reading

Gillespie, A. (2006). Climate change, ozone depletion and air pollution. Leiden: Martinus Nijhoff Publishers.
Gurjar, B. R., et al. (Eds.). (2010). Air pollution, health and environmental impacts. Boca Raton: CRC Press.
Jacobson, M. Z. (2012). Air pollution and global warming. New York: Cambridge University Press.
Louka, E. (2006). International environmental law: Fairness, effectiveness, and world order. New York: Cambridge University Press.
Raven, P. H., & Berg, L. R. (2006). Environment. Danvers: Wiley.
The Open University. (2007). T210 – Environmental control and public health. Milton Keynes: The Open University.
Vallero, D. A. (2008). Fundamentals of air pollution. Amsterdam: Elsevier.
Vaughn, J. (2007). Environmental politics. Belmont: Thomson Wadsworth.
Withgott, J., & Brennan, S. (2011). Environment. San Francisco: Pearson.

Pollution, Land

Zerrin Savaşan
Department of International Relations, Sub-Department of International Law, Selçuk University, Konya, Turkey

Pollution, in all its types (air, water, land), means the entrance of substances into the natural environment, beyond a threshold concentration level, which do not naturally belong or occur there, resulting in its degradation and causing harmful effects on humans and all living organisms as well as on the environment. So, in land pollution as well, solid or liquid waste materials get deposited on land and further degrade and deteriorate the quality and the productive capacity of the land surface. The term is sometimes used as a substitute for, or together with, soil pollution, in which the upper layer of the soil is damaged. In fact, however, soil pollution is just one of the causes of land pollution.

Like the other types, land pollution arises as a global environmental problem, specifically associated with urbanization and industrialization, that should be dealt with through globally concerted environmental policies. However, as a first and foremost step, it needs to be understood well in all its dimensions by all humankind, but particularly by the researchers studying it.

What Causes Land Pollution?
The degradation of land surfaces is caused directly or indirectly by human (anthropogenic) activities. It is possible to mention several causes that temporarily or permanently change the land structure and so cause land pollution. Three main causes are generally identified – industrialization, overpopulation, and urbanization – and the others are counted as stemming from these. Some of them are as follows: improper waste disposal (agricultural/domestic/industrial/solid/radioactive waste) and littering; mining, which pollutes the land by removing the topsoil that forms the fertile layer of soil, or by leaving behind waste products and the chemicals used in the process; misuse of land (deforestation, land conversion, desertification); soil pollution (pollution of the topmost layer of the land); soil erosion (loss of the upper, most fertile layer of the soil); and the chemicals (pesticides, insecticides, and fertilizers) applied for crop enhancement on the land.

Regarding these chemicals used for crop enhancement, it should be underlined that, while they enhance the crop yield, they can also kill insects, mosquitoes, and other small animals, and so they can harm the bigger animals that feed on these tiny animals. In addition, most of these chemicals can remain in the soil or accumulate there for many years. To illustrate, DDT (dichlorodiphenyltrichloroethane) is one of these pesticides. It is now widely banned, owing in great part to the effect of Rachel Carson’s famous book, Silent Spring (1962), which documents the detrimental effects of pesticides on the environment, particularly on birds. Nonetheless, as it is not ordinarily biodegradable – and is therefore known as a persistent organic pollutant – it has remained in the environment ever since it was first used.

Consequences of Land Pollution

All types of pollution are interrelated, and their consequences cannot be restricted to the place where the pollution is first discharged. This is particularly because of atmospheric deposition, in which existing pollution in the air (atmosphere) creates pollution in water or on land as well. Since they are interrelated, their impacts are also similar to each other. Like the others, land pollution has serious consequences for humans, for animals and other living organisms, and for the environment. First of all, all living things depend on the resources of the earth and on the plants growing from the land to survive, so anything that damages or destroys the land ultimately has an impact on the survival of humankind itself and of all other living things on the earth. Damage to the land also leads to health problems such as respiratory problems, skin problems, and various kinds of cancers.

Its effects on the environment also require attention, as land pollution is one of the important contributors to global warming, which has become a very popular but still not adequately understood phenomenon. This emerges from a natural circulation: land pollution leads to deforestation, which leads to less rain, and eventually to problems such as the greenhouse effect and global warming/climate change. Biomagnification is the other major concern stemming from land pollution. It occurs when certain substances, such as pesticides or heavy metals, are taken up through feeding by aquatic organisms such as fish, which in turn are eaten by large birds, animals, or humans. The substances become concentrated in internal organs as they move up the food chain, and the concentration of these toxic compounds tends to increase. This process threatens both these particular species and all the other species above and below them in the food chain. All of this, combined with the massive extinction of certain species – primarily because of the disturbance of their habitat – also induces massive reductions in biodiversity.

Control Measures for Land Pollution

Land pollution, along with the other types of pollution, poses a threat to the sustainability of world resources. However, while the others have opportunities for self-purification through the help of natural events, land can stay polluted until it is cleaned up. Given the time that must pass for the disappearance of plastics in nature (hundreds of years) and of radioactive waste (almost forever), this fact can be understood better.
Land pollution thus becomes one of the serious concerns of humankind. When the question is asked what should be done to deal with it, it is first essential to remember that it is a global problem having no boundaries, and so it must be handled collectively. While working collectively, it is necessary first of all to set serious environmental objectives and best-practice measures. A wide range of measures – varying according to the cause of the pollution – can be considered to prevent, reduce, or stop land pollution, such as adopting and encouraging organic farming instead of using chemical herbicides and pesticides, restricting or forbidding their usage, developing effective methods for recycling and reusing waste materials, constructing proper disposal of all wastes (domestic, industrial, etc.) into secured landfill sites, and creating public awareness of and support for all environmental issues.

Apart from all these measures, the use of Big Data technologies can also be considered as a way of addressing the rapidly increasing and wide-ranging consequences of land pollution.

Some of the cases in which Big Data technologies are used in relation to one or more aspects of land pollution can be illustrated as follows (ELI 2014):

• Located under the US Department of the Interior (DOI), the National Integrated Land System (NILS) aims to provide the principal data source for land surveys and status by combining Bureau of Land Management (BLM) and Forest Service data into a joint system.
• The New York City Open Accessible Space Information System (OASIS) is another sample case; as an online open space mapping tool, it involves a huge amount of data concerning public lands, parks, community gardens, coastal storm impact areas, and zoning and land use patterns.
• By providing the state Departments of Natural Resources (DNRs) and other agencies with online access to Geographic Information Systems (GIS) data on environmental concerns, and thereby contributing to the effective management of land, water, forest, and wildlife, this kind of initiative essentially requires the use of Big Data to make its contribution.
• Alabama’s State Water Program is another example, providing geospatial data related to hydrologic, soil, geological, land use, and land cover issues.
• The National Ecological Observatory Network (NEON) is an environmental organization providing the collection of site-based data related to the effects of climate change and invasive species from 160 sites, and also land use, throughout the USA.
• The Tropical Ecology Assessment and Monitoring Network (TEAM) is a global network facilitating the collection and integration of publicly shared data related to patterns of biodiversity, climate, ecosystems, and also land use.
• The Danger Maps project is another sample case for the use of Big Data, as it provides the mapping of government-collected data on over 13,000 polluting facilities in China, allowing users to search by area or type of pollution (water, air, radiation, soil).

The US Environmental Protection Agency (EPA) and the Environmental Performance Index (EPI) are also platforms using Big Data compiled from a great number of sensors regarding environmental issues, on land pollution as well as on other types of pollution. That is, Big Data technologies can be considered as a way of addressing the consequences of all types of pollution, not just land pollution. This is particularly because all types of pollution are deeply interconnected with one another, so their consequences cannot be restricted to the place where the pollution is first discharged, as mentioned above. Therefore, for all types of pollution, relying on satellite technology, data, and data visualization is essentially required to monitor them regularly, to forecast and reduce their possible impacts, and to mitigate the pollution itself.
754 Pollution, Water

2014). So, further investigation and analysis are Mayer-Schönberger, V., & Cukier, K. (2013). Big data:
needed to clarify the relevant gaps and challenges A revolution that will transform how we live, work and
think. London: John Murray.
regarding the use of Big Data for specifically land Mirsal, I. A. (2008). Soil pollution, origin, monitoring &
pollution. remediation. Berlin/Heidelberg: Springer.
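The kind of query these platforms support can be illustrated with a small sketch. The facility records, field names, and values below are hypothetical – they are not drawn from Danger Maps or any of the systems mentioned above – and the sketch only shows the basic operation of searching facility-level pollution data by area or by type and summarizing the result:

```python
from collections import Counter

# Hypothetical facility records; real platforms draw on government-collected
# data with far richer attributes and many more records.
facilities = [
    {"name": "Plant A", "province": "Hebei",   "pollution_type": "soil",  "emissions_t": 120.0},
    {"name": "Plant B", "province": "Hebei",   "pollution_type": "water", "emissions_t": 45.5},
    {"name": "Plant C", "province": "Jiangsu", "pollution_type": "air",   "emissions_t": 310.2},
    {"name": "Plant D", "province": "Hebei",   "pollution_type": "soil",  "emissions_t": 15.8},
]

def search(records, province=None, pollution_type=None):
    """Return facilities matching an area and/or a pollution type."""
    return [
        r for r in records
        if (province is None or r["province"] == province)
        and (pollution_type is None or r["pollution_type"] == pollution_type)
    ]

# "Search by area or type of pollution": all soil polluters in one province.
hits = search(facilities, province="Hebei", pollution_type="soil")
print([r["name"] for r in hits])                       # ['Plant A', 'Plant D']

# Simple aggregation for a map legend: facility counts per pollution type.
print(Counter(r["pollution_type"] for r in facilities))
```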
Cross-References

▶ Earth Science
▶ Environment
▶ Pollution, Air
▶ Pollution, Water

Further Reading

Alloway, B. J. (2001). Soil pollution and land contamination. In R. M. Harrison (Ed.), Pollution: Causes, effects and control (pp. 352–377). Cambridge: The Royal Society of Chemistry.
Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh, 29 Apr 2010. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878. Accessed 3 Feb 2017.
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Retrieved from https://www.researchgate.net/publication/299379163_A_formal_definition_of_Big_Data_based_on_its_essential_features. Accessed 3 Feb 2017.
Environmental Law Institute (ELI). (2014). Big data and environmental protection: An initial survey of public and private initiatives. Washington, DC: Environmental Law Institute. Retrieved from https://www.eli.org/sites/default/files/eli-pubs/big-data-and-environmental-protection.pdf. Accessed 3 Feb 2017.
Environmental Performance Index (EPI). Available at: http://epi.yale.edu/. Accessed 3 Feb 2017.
Forte Wares. Failure to launch: From big data to big decisions – why velocity, variety and volume is not improving decision making and how to fix it. White Paper. A Forte Consultancy Group Company. Retrieved from http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares–pro-active-reporting_EN.pdf. Accessed 3 Feb 2017.
Hill, M. K. (2004). Understanding environmental pollution. New York: Cambridge University Press.
Keeso, A. (2014). Big data and environmental sustainability: A conversation starter. Smith School Working Paper Series, Dec 2014, Working paper 14-04. Retrieved from http://www.smithschool.ox.ac.uk/library/working-papers/workingpaper%2014-04.pdf. Accessed 3 Feb 2017.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
Mirsal, I. A. (2008). Soil pollution, origin, monitoring & remediation. Berlin/Heidelberg: Springer.
Raven, P. H., & Berg, L. R. (2006). Environment. Danvers: Wiley.
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
Withgott, J., & Brennan, S. (2011). Environment. Cornell University: Pearson.

Pollution, Water

Zerrin Savaşan
Department of International Relations, Sub-Department of International Law, Selçuk University, Konya, Turkey

Water pollution can be defined as the contamination of water bodies by the entrance of large amounts of materials/substances into those bodies, resulting in physical or chemical change in the water, modifying its natural features, degrading the water quality, and adversely affecting humans and the environment.

Particularly in recent decades, it has become widely accepted that water pollution is a global environmental problem which is interrelated to all other environmental challenges. Water pollution control, at the national level, generally should involve financial resources, technology improvement, policy measures, the necessary legal and administrative framework, and institutional/staff capacity for implementing these policy measures in practice. However, more importantly, at the global level, it should involve cooperation of all related sides at all levels. Despite the efforts at both national and global levels, reducing pollution substantially still continues to pose a challenge. This is particularly because, even though the world is becoming increasingly globalized, it is still mostly regarded as having unlimited resources.
Hence, it becomes essential to explain that it is limited and that its resources should not be polluted. Here, it also becomes essential to have adequate information on all types of pollution resulting in environmental deterioration, and on water pollution in particular.

What Causes Water Pollution?

This question has many responses, but basically it is possible to mention two main reasons: natural reasons and human-driven reasons. All waters are subject to some degree of natural (or ecological) pollution caused by nature rather than by human activity, through algal blooms, forest fires, floods, sedimentation stemming from rainfalls, volcanic eruptions, and other natural events. However, a greater part of the instances of water pollution arises from human activities, particularly from massive industrialization. Accidental spills (e.g., a disaster like the wreck of an oil tanker, which, unlike the others, is unpredictable); domestic discharges; industrial discharges; the usage of large amounts of herbicides, pesticides, and chemical fertilizers; sediments in waterways from agricultural fields; improper disposal of hazardous chemicals down the sewers; and the failure to construct adequate waste disposal systems can be cited as some, but not all, of the human-made reasons for water pollution.

The causes mentioned above vary greatly because a complex variety of pollutants, lying suspended in the water or depositing beneath the earth's surface, get involved in water bodies and result in water quality degradation. Indeed, there are many different types of water pollutants spilling into waterways and causing water pollution. They can all be divided into various categories: chemical, physical, and pathogenic pollutants, radioactive substances, organic pollutants, inorganic fertilizers, metals, toxic pollutants, biological pollutants, and so on. Conventional, nonconventional, and toxic pollutants are some of the divisions regulated by the US Clean Water Act. The conventional pollutants are as follows: dissolved oxygen, biochemical oxygen demand (BOD), temperature, pH (acid deposition), sewage, pathogenic agents, animal wastes, bacteria, nutrients, turbidity, sediment, total suspended solids (TSS), fecal coliform, oil, and grease. Nonconventional (or nontoxic) pollutants are those not identified as either conventional or priority, like aluminum, ammonia, chloride, colored effluents, exotic species, instream flow, iron, radioactive materials, and total phenols. Toxic pollutants, metals, dioxin, and lead can be counted as examples of priority pollutants. Each group of these pollutants has its own specific ways of entering the water bodies and its own specific risks.

Water Pollution Control

In order to control all these pollutants, it is beneficial to determine from where they are discharged. So, the following categories can be identified to find out where they originate from: point and nonpoint sources of pollution. If the sources causing pollution come from single identifiable points of discharge, they are point sources of pollution, e.g., domestic discharges, ditches, pipes of industrial facilities, and ships discharging toxic substances directly into a water body. Nonpoint sources of pollution are characterized by dispersed, not easily identifiable discharge points, e.g., runoff of pollutants into a waterway, like agricultural runoff or stormwater runoff. As it is harder to identify them, it is nearly impossible to collect, trace, and control them precisely, whereas point sources can be controlled easily.

Water pollution, like other types of pollution, has serious widespread effects. In fact, adverse alteration of water quality produces costs both to humans (e.g., large-scale diseases and deaths) and to the environment (e.g., biodiversity reduction, species mortality). Its impact differs depending on the type of water body affected (groundwater, lakes, rivers, streams, and wetlands). However, it can be prevented, lessened, and even eliminated in many different ways. Some of these different treatment methods, aiming to keep the pollutants from damaging the waterways, can rely on the use of techniques reducing water use, reducing the usage and amounts of highly water-soluble pesticide and herbicide compounds, controlling rapid water runoff, and physically separating pollutants from the water, or on management practices in the field of urban design and sanitation.

There are also some other attempts to measure, reduce, and address the rapidly growing impacts of water pollution, such as the use of Big Data. Big Data technologies can provide ways of achieving better solutions for the challenges of water pollution. To illustrate, EPA databases can be accessed and maps can be generated from them, including information on environmental activities affecting water and also air and land, in the context of EnviroMapper. Under the US Department of the Interior (DOI), the National Water Information System (NWIS) monitors surface and underground water quantity, quality, distribution, and movement. Under the National Oceanic and Atmospheric Administration (NOAA), the California Seafloor Mapping Program (CSMP) works on creating a comprehensive base map series of coastal/marine geology and habitat for all waters of the USA. Additionally, the Hudson River Environmental Conditions Observing System comprises 15 monitoring stations – located between Albany and the New York Harbor – automatically collecting samples every 15 min that are used to monitor water quality, assess flood risk, and assist in pollution cleanup and fisheries management. The Contamination Warning System Project, conducted by the Philadelphia Water Department, is a combination of new data technologies with existing management systems. It provides a visual representation of data streams containing geospatial, water quality, customer concern, operations, and public health information. Creek Watch is another example of the use of Big Data in the field of water pollution. It was developed by IBM and the California State Water Resources Control Board's Clean Water Team as a free app allowing users to rate a waterway on three criteria: amount of water, rate of flow, and amount of trash. The collected data is large enough to track pollution and manage water resources. The Danger Maps is another project mapping government-collected data on over 13,000 polluting facilities in China. It allows users to search by area or type of pollution (water, air, radiation, soil). Developing technology on farm performance can also be cited as another example of the use of Big Data – compiled from yield information, sensors, high-resolution maps, and databases – for the water pollution issue. For example, machine-to-machine (M2M) agricultural technology produced by the Canadian startup company Semios allows farmers to improve yields and their farm operations' efficiency, but it also provides information for reducing polluted runoff through increasing the efficient use of water, pesticides, and fertilizers (ELI 2014).

The Environmental Performance Index (EPI) is another platform using Big Data to display how each country manages environmental issues and to allow users to investigate data by comparing environmental performance with GDP, population, land area, or other variables.

As shown by the example cases above, the use of Big Data technologies is increasingly applied in the water field, in its different aspects from management to pollution. However, further research is still required for their effective use in order to eliminate related concerns. This is particularly because there is still debate on the use of Big Data even regarding its general scope and terms (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014).
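Automated monitoring networks like the Hudson River system described above produce exactly the kind of high-frequency streams that lend themselves to simple screening rules. The sketch below is only an illustration – the station readings, the variable (turbidity), and the threshold are invented rather than taken from any of the systems mentioned – of how 15-minute samples might be flagged for follow-up when they deviate strongly from the recent norm:

```python
from statistics import mean, stdev

# Hypothetical 15-minute turbidity readings (NTU) from one monitoring station.
readings = [4.1, 4.3, 3.9, 4.0, 4.2, 9.8, 4.1, 4.0, 10.5, 4.2]

def flag_anomalies(values, z_threshold=1.5):
    """Flag readings more than z_threshold standard deviations above the mean.

    The threshold is arbitrary here; an operational system would tune it to
    the station and the variable being monitored.
    """
    mu, sigma = mean(values), stdev(values)
    return [(i, v) for i, v in enumerate(values) if sigma > 0 and (v - mu) / sigma > z_threshold]

for index, value in flag_anomalies(readings):
    print(f"sample {index}: turbidity {value} NTU flagged for follow-up")
```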
Cross-References

▶ Earth Science
▶ Environment
▶ Pollution, Land

Further Reading

Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh, 29 Apr 2010. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878. Accessed 3 Feb 2017.
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Retrieved from https://www.researchgate.net/publication/299379163_A_formal_definition_of_Big_Data_based_on_its_essential_features. Accessed 3 Feb 2017.
Environmental Law Institute (ELI). (2014). Big data and environmental protection: An initial survey of public and private initiatives. Washington, DC: Environmental Law Institute. Retrieved from https://www.eli.org/sites/default/files/eli-pubs/big-data-and-environmental-protection.pdf. Accessed 3 Feb 2017.
Environmental Performance Index (EPI). Available at: http://epi.yale.edu/. Accessed 3 Feb 2017.
Forte Wares. Failure to launch: From big data to big decisions – why velocity, variety and volume is not improving decision making and how to fix it. White Paper. A Forte Consultancy Group Company. Retrieved from http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares–pro-active-reporting_EN.pdf. Accessed 3 Feb 2017.
Hill, M. K. (2004). Understanding environmental pollution. New York: Cambridge University Press.
Keeso, A. (2014). Big data and environmental sustainability: A conversation starter. Smith School Working Paper Series, Dec 2014, Working paper 14-04. Retrieved from http://www.smithschool.ox.ac.uk/library/working-papers/workingpaper%2014-04.pdf. Accessed 3 Feb 2017.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
Raven, P. H., & Berg, L. R. (2006). Environment. Danvers: Wiley.
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
The Open University. (2007). T210 – Environmental control and public health. The Open University.
Vaughn, J. (2007). Environmental politics. Thomson Wadsworth.
Vigil, K. M. (2003). Clean water: An introduction to water quality and water pollution control. Oregon State University Press.
Withgott, J., & Brennan, S. (2011). Environment. Pearson.

Precision Agriculture

▶ AgInformatics

Precision Farming

▶ AgInformatics

Precision Population Health

Emilie Bruzelius1,2 and James H. Faghmous1
1Arnhold Institute for Global Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
2Department of Epidemiology, Joseph L. Mailman School of Public Health, Columbia University, New York, NY, USA

Synonyms

Precision public health

Definition

Precision population health refers to the emerging use of big data to improve the health of populations. In contrast to precision medicine, which focuses on detecting and treating disease in individuals, precision population health instead focuses on identifying and intervening on the determinants of health within and across populations. Though the application of the term "precision" is relatively new, the concept of applying the right intervention at the right time in the right setting is well established. Recent advances in the volume, accessibility, and ability to process massive datasets offer enhanced opportunities to do just this, holding the potential to monitor population health progress in real time and allowing for more agile health programs and policies. As big data and machine learning are increasingly incorporated within population health, future work will continue to focus on the core tasks of improving population-wide prevention strategies, addressing social determinants of health, and reducing health disparities.

What Is Precision Population Health?

Precision population health is a complementary movement to precision medicine that emphasizes the use of big data and other emerging technologies in advancing population health. These parallel trends in medicine and public health are distinguished by their focus on employing data-driven strategies to predict which health interventions are most likely to be effective for a given individual or population. However, while precision medicine emphasizes clinical applications, often highlighting the explanatory role of genetic differences between individuals, precision population health is oriented towards using data to identify effective interventions for entire populations, often highlighting prevention strategies.

The concept of precision population health is rooted in the precision medicine approach to medical treatment, first successfully pioneered in the context of cancer treatments. Motivated by advances in genetic sequencing, precision medicine tries to take into account an individual's variability in genetics, environment, and behaviors to better optimize therapeutic benefits. The goal of precision medicine is to use data to more accurately predict which treatment is most likely to be effective for a given patient at a given time, rather than treating all patients with the therapy that has been shown to be most effective on average. Central to the notion of precision medicine is the use of large-scale data to enhance this process of personalized prediction.

Precision population health, on the other hand, emphasizes the treatment of populations as opposed to individuals, using data-driven approaches to better account for the complex social, environmental, and economic factors that are known to shape patterns of health and illness (Keyes and Galea 2016). In the same way that precision medicine aims to tailor medical treatments to a specific individual's genetics, precision population health aims to tailor public health programs and policies to the needs of specific populations or communities. While not overtly part of the definition, precision population health is understood as contextual – resulting from multiple complex and interacting factors determined not only at the level of the individual, by his or her genetics, health behaviors, and medical care, but also by the broader set of macrosocial forces that accrue over the life-course. In the past several decades, there has been renewed interest in investigating how these "upstream" factors shape health distributions, and in using this information to better develop appropriate interventions to prevent disease, promote health, and reduce health disparities. In this context, precision is derived from the use of big data to accomplish population health goals. Enhanced precision is also derived from the use of scalable machine learning algorithms and affordable computing infrastructure to measure exposures, outcomes, and context with granularity and in real time. From this perspective, big data is typically described in the context of the 3 Vs – variety, volume, and velocity – and also includes the broader incorporation of the tools and methods needed to manage, store, process, and gain insight from such data.

Precision Population Health Opportunities

Recent advances in the volume, variety, and velocity of new data sources provide a unique opportunity to understand and intervene on broad-scale health determinants within and across populations. High-volume data refers to the exponential increases, in terms of both rows and columns, of current datasets. This increasing data quantity can provide utility to population health researchers by making new measures available and by increasing sample sizes so that more complex interactions can be evaluated, particularly in terms of rare exposures, outcomes, or subgroups. Further, collecting data at finer spatial and time-scale resolution than has previously been feasible can help to improve population health program targeting and continuous feedback and program adaptation.

High-variety data refers to the increasing diversity of data types that may be applicable to population health science. These include both traditional sources of epidemiologic data as well as expanding access to newer sources of clinical, administrative, and contextual information. Improved computing power has already facilitated access to rich sources of novel medical data including massive repositories of medical records, imaging, and genetic information. Other opportunities include administrative sources of information on critical health determinants, such as housing, transportation systems, or land-use patterns, that are increasingly available. Remotely sensed data products and weather data may also prove to be of high utility to population health researchers, especially in the context of environmental exposures and infectious disease patterns. Finally, social media content, GPS and other continuous location data, as well as purchasing and transaction data, microfinance and mobile banking information, and wearable technologies offer unique opportunities to study how social and economic factors shape opportunities and barriers to engaging in health-promoting behaviors.

Finally, increasing technological usage is beginning to provide opportunities for high-velocity precision population health – close to real-time collection, storage, and analysis of population health data. Instantaneous data collection and analysis, often through the use of algorithms operating without human intervention, holds immense promise for population health monitoring and improvement. These advances may be especially important for population health in the context of continuous monitoring and surveillance activities. For example, the penetration of wearable apps and mobile phone networks over the past decade has expedited the collection of health data, also reducing data collection costs by orders of magnitude. These new technologies may prove to be especially important in global settings where national and subnational data on health indicators may not be updated on a routine basis, yet are critical to effective program planning and implementation. Along with mobile data, the use of satellite image analysis is uniquely salient in the context of global population health. For example, recent research has leveraged satellite images of nighttime lights to predict updatable poverty and population density estimates for remote regions (Doupe et al. 2016). Mobile and web technology may also prove to be useful in early detection of anomalies such as disease outbreaks, enabling faster response in times of crisis. More precise disease surveillance can also generate hypotheses about the causes of emerging disease patterns and identify early opportunities for prevention. In under-resourced settings, where traditional sources of population data are suboptimal, such methods may provide a useful complement or alternative method for measuring needed population health characteristics for targeted intervention planning and implementation.
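The nighttime-lights example can be made concrete with a deliberately small sketch. The grid-cell values below are synthetic and the model is a one-variable least-squares fit, nothing like the deep-learning pipeline of Doupe et al. (2016), but it shows the basic idea of calibrating a remotely sensed signal against surveyed cells and then estimating density where surveys are missing. It uses statistics.linear_regression, available in Python 3.10 and later:

```python
from statistics import linear_regression

# Synthetic calibration data: (mean nighttime radiance, persons per km^2)
# for grid cells that do have recent survey or census figures.
surveyed_cells = [(0.5, 12), (1.0, 30), (2.2, 75), (3.8, 140), (5.0, 190), (7.5, 310)]

radiance = [r for r, _ in surveyed_cells]
density = [d for _, d in surveyed_cells]
slope, intercept = linear_regression(radiance, density)

def estimate_density(cell_radiance):
    """Rough density estimate for a cell without survey data."""
    return intercept + slope * cell_radiance

# Estimate density for two unsurveyed cells observed only from satellite.
for r in (1.6, 4.2):
    print(f"radiance {r}: ~{estimate_density(r):.0f} persons per km^2")
```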
Precision Population Health Challenges

The integration of complex population health data poses numerous challenges, many of which, including noise, high dimensionality, and nonlinear relationships, are common in most data-driven explorations (Faghmous 2015). With respect to precision population health, however, there are also several unique challenges that should be highlighted. First, though increased access to novel data sources presents new opportunities, working with secondary data can create or reinforce validity challenges, as systematic bias due to measurement error cannot be overcome simply with greater volumes of data (Mooney et al. 2015). Though novel data sources are already beginning to provide insights for global population health programs, these data must be complemented by efforts to expand the collection of population-sampled, representative health and demographic data for designing, implementing, and monitoring the effectiveness of health-based policies and programs. In particular, numerous authors have highlighted the insufficiency of global health data, even with regard to basic metrics such as mortality (Desmond-Hellman 2016).

A second important challenge for precision population health is that a greater emphasis on precision raises potential ethical, social, and legal implications, particularly in terms of privacy. As greater volumes of health data are collected, it will be critical to find ways to protect individual privacy and confidentiality, especially as more data is collected passively through the use of digital services like mobile phones and web searches. Traditional notions of health data privacy, such as those guaranteed under the Health Insurance Portability and Accountability Act (HIPAA), provide data privacy and security provisions for safeguarding medical information and rely on informed consent for the disclosure and use of an individual's private data. However, regulations regarding the use of nonmedical data are less established, especially with respect to other types of potentially sensitive information and data owned by private sector entities. As discussed, these sources of information may be highly salient to population health researchers. There are serious privacy concerns regarding the use of large-scale patient-level data, as the sheer size of these datasets increases the risk of potential data breaches by orders of magnitude. At the same time, de-identified datasets may be of limited practical use to clinicians and public health practitioners, especially in the context of health programs that attempt to target high-risk individuals for prevention. The complexity of these issues has led to extensive discussions around the privacy-utility tradeoff of precision population health, yet further work is needed, especially as greater emphasis on scientific collaboration, data sharing, and scientific reproducibility becomes the norm.

Finally, while precision population health holds great promise to improve our ability to predict which health programs and policies are most likely to work, where and for whom, it will be important to continue to focus on core population health tasks, prioritizing population prevention strategies, the role of social and environmental context, and addressing health inequity. Much of the current focus on precision has centered too narrowly on genetic and pharmacological factors, rather than on the intersection of precision medicine and precision population health tasks. A better integration of these two themes is critical in order to develop more precise approaches to targeted interventions for both populations and individual patients.

Cross-References

▶ Electronic Health Records (EHR)
▶ Health Informatics
▶ Patient-Centered (Personalized) Health

References

Desmond-Hellman, S. (2016). Progress lies in precision. Science, 353(6301), 731.
Doupe, P., Bruzelius, E., Faghmous, J., & Ruchman, S. G. (2016). Equitable development through deep learning: The case of sub-national population density estimation. In Proceedings of the 7th Annual Symposium on Computing for Development. ACM DEV '16 (pp. 6:1–6:10). New York: ACM.
Faghmous, J. H. (2015). Machine learning. In A. El-Sayed & S. Galea (Eds.), Systems science and population health. Epidemiology. Oxford, UK: Oxford University Press.
Keyes, K. M., & Galea, S. (2016). Setting the agenda for a new discipline: Population health science. American Journal of Public Health, 106(4), 633–634. https://doi.org/10.2105/AJPH.2016.303101.
Mooney, S. J., Westreich, D. J., & El-Sayed, A. M. (2015). Epidemiology in the era of big data. Epidemiology (Cambridge, Mass.), 26(3), 390–394. https://doi.org/10.1097/EDE.0000000000000274.

Precision Public Health

▶ Precision Population Health

Predictive Analytics

Anamaria Berea
Department of Computational and Data Sciences, George Mason University, Fairfax, VA, USA
Center for Complexity in Business, University of Maryland, College Park, MD, USA

Predictive analytics is a methodology in data mining that uses a set of computational and statistical techniques to extract information from data with the purpose of predicting trends and behavior patterns. Often, the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown data, whether it is in the past, present, or future (Siegel 2013). In other words, predictive analytics can be applied not only to time series data but to any data where there is some unknown that can be inferred.
Therefore predictive analytics is a powerful set of tools for inferring lost past data as well.

The core of predictive analytics in data science relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict the unknown outcome. It is important to note, however, that the accuracy and usability of results will depend greatly on the level of data analysis and the quality of assumptions (Tukey 1977).

Predictive Analytics and Forecasting

Prediction, in general, is about forecasting the future or forecasting the unknown. In the past, before the scientific method was invented, predictions were based on astrological observations, witchcraft, foretelling, oral history folklore, and, in general, on random observations or associations of observations that happened at the same time. For example, if a conflict happened during an eclipse, then all eclipses would become "omens" of wars and, in general, bad things. For a long period of time in our civilization, events were merely separated into two classes: good or bad. Thus the associations of events that would lead to a major conflict or epidemic or natural catastrophe would be categorized as "bad" omens from there on, while any associations of events that would lead to peace, prosperity, and, in general, "good" major events would be categorized as "good" omens or good predictors from there on.

The idea of associations of events as predictive of another event is actually at the core of some of the statistical methods we use today, such as correlation. But the fallacy of using these methods metaphorically instead of in a quantitative systematic analysis is that only one set of observations cannot be predictive for the future. That was true in the past and it is true now as well, no matter how sophisticated the techniques we are using. Predictive analytics uses a series of events or associations of events, and the longer the series, the more informative the predictive analysis can be.

Unlike past good or bad omens, the results of predictive analytics are probabilistic. This means that predictive analytics informs the probability of a certain data point or the probability of a hypothesis being true. While true prediction can be achieved only by determining clearly the cause and the effect in a set of data, a task that is usually hard to do, most predictive analytics techniques output probabilistic values and error term analyses.

Predictive Modeling Methods

Predictive modeling statistically shows the underlying relationships in historical, time series data in order to explain the data and make predictions, forecasts, or classifications about future events.

In general, predictive analytics uses a series of statistical and computational techniques in order to forecast future outcomes from past data. Traditionally, the most used method has been linear regression, but lately, with the emergence of the Big Data phenomenon, many other techniques have been developed to support businesses and forecasters, such as machine learning algorithms or probabilistic methods.

Some classes of techniques include:

1. Applications of both linear and nonlinear mathematical programming algorithms, in which one objective is optimized within a set of constraints.
2. Advanced "neural" systems, which learn complex patterns from large datasets to predict the probability that a new individual will exhibit certain behaviors of business interest. Neural networks (also known as deep learning) are biologically inspired machine learning models that are being used to achieve the recent record-breaking performance on speech recognition and visual object recognition.
3. Statistical techniques for analysis and pattern detection within large datasets.

Some techniques in predictive analytics are borrowed from traditional forecasting techniques, such as moving averages, linear regressions, logistic regressions, probit regressions, multinomial regressions, time series models, or random forest techniques. Other techniques, such as supervised learning, A|B testing, correlation ranking, and the k-nearest neighbor algorithm, are closer to machine learning and newer computational methods.
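As a minimal illustration of the traditional end of this toolbox, the sketch below applies two of the techniques named above, a simple moving average and a least-squares linear trend, to an invented six-period series and produces a one-step-ahead forecast from each. The numbers are made up; the point is only the mechanics:

```python
# Two traditional forecasting techniques applied to an invented series.
series = [102, 108, 115, 119, 127, 131]   # e.g., monthly demand

def moving_average_forecast(values, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(values[-window:]) / window

def linear_trend_forecast(values):
    """Fit y = a + b*t by ordinary least squares and extrapolate one step."""
    t = range(len(values))
    n = len(values)
    mt, my = sum(t) / n, sum(values) / n
    b = sum((ti - mt) * (yi - my) for ti, yi in zip(t, values)) / sum((ti - mt) ** 2 for ti in t)
    a = my - b * mt
    return a + b * n

print(round(moving_average_forecast(series), 1))   # 125.7
print(round(linear_trend_forecast(series), 1))     # 137.6
```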
One of the most used techniques in predictive analytics today, though, is supervised learning or supervised segmentation (Provost and Fawcett 2013). Supervised segmentation includes the following steps:

– Selection of informative attributes – particularly in large datasets, the selection of the variables that are more likely to be informative for the goal of prediction is crucial; otherwise the prediction can render spurious results.
– Information gain and entropy reduction – these two techniques measure the information in the selected attributes.
– Selection is done based on tree induction, which fundamentally represents subsetting the data and searching for these informative attributes.
– The resulting tree-structured model partitions the space of all data into possible segments with different predicted values.

Supervised learning/segmentation has been popular because it is computationally and algorithmically simple.
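The attribute-selection step can be illustrated with a small entropy and information-gain calculation. The customer records and attribute names below are hypothetical, and the sketch covers only the first split of a tree rather than full recursive tree induction:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, target="churned"):
    """Entropy reduction obtained by segmenting the records on one attribute."""
    parent = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return parent - remainder

# Hypothetical customer records; the informative attribute is the one with the
# largest gain, and tree induction applies the same test recursively per segment.
data = [
    {"plan": "basic",   "region": "east", "churned": True},
    {"plan": "basic",   "region": "west", "churned": True},
    {"plan": "premium", "region": "east", "churned": False},
    {"plan": "premium", "region": "west", "churned": False},
    {"plan": "basic",   "region": "east", "churned": False},
]

for attr in ("plan", "region"):
    print(attr, round(information_gain(data, attr), 3))   # plan 0.42, region 0.02
```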
but also what the data is missing.
3. Build the model/s – in this step, several tech-
Visual Predictive Analytics niques can be explored and used comparatively
and based on their results; the best one should
Data visualization and predictive analytics com- be chosen. For example, both a general regres-
plement each other nicely and together they are an sion and a random forest can be used and
even more powerful methodology for the analysis compared, or supervised segmentation based
and forecasting of complex datasets that comprise on demographics and then the segments
a variety of data types and data formats. compared.
Visual predictive analytics is a specific set of 4. Performance and accuracy estimation – in this
techniques of predictive analytics that is applied final step, the probabilities or measurements of
to visual and image data. Just as in the case of forecasting accuracy are computed and
predictive analytics in general, temporal data is interpreted.
Predictive Analytics Example

A good example of using predictive analytics is in healthcare: the problem of understanding the probability of an upcoming epidemic or the probability of an increase in the incidence of various diseases, from flu to heart disease and cancer.

For example, given a dataset that contains data with respect to the past incidence of heart disease in the USA, demographic data (gender, average income, age, etc.), exercise habits, eating habits, traveling habits, and other variables, a predictive model would follow these steps:

1. Descriptive statistics – the first step in doing predictive analytics or building a predictive model is always an understanding of the data with respect to what the variables represent, what ranges they fall into, how long the time series is, and so on – essentially, summary statistics of the data.
2. Data cleaning and treatment – it is very important to understand not only what the data is or has but also what the data is missing.
3. Build the model(s) – in this step, several techniques can be explored, used comparatively, and, based on their results, the best one should be chosen. For example, both a general regression and a random forest can be used and compared, or supervised segmentation can be based on demographics and then the segments compared.
4. Performance and accuracy estimation – in this final step, the probabilities or measurements of forecasting accuracy are computed and interpreted.

In any predictive model or analytics technique, the model can do only what the data is. In other words, it is impossible to assess a predictive model of heart disease incidence based on travel habits if no data regarding travel is included.

Another important point to remember is that the accuracy of the model also depends on the accuracy measure, and using multiple accuracy measures is desirable (i.e., mean squared error, p-value, R-squared).

In general, any predictive analytic technique will output a dataset of created variables, called predictive values, as a newly created dataset. Therefore a good technique for verification and validation of the methods used is to partition the real dataset into two sets and use one to "train" the model and the second one to validate the model's results.

The success of the model ultimately depends on how real events will unfold, and that is one of the reasons why longer time series are better at informing predictive modeling and giving better accuracy for the same set of techniques.
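A compressed version of steps 3 and 4, together with the train/validate partition described above, is sketched below. The incidence figures are synthetic and the "model" is deliberately simple (each age group's mean rate observed in the training partition), so the emphasis is on holding data out and reporting more than one accuracy measure rather than on the model itself:

```python
from statistics import mean

# Synthetic records: (age_group, disease_rate_per_1000). Real heart-disease data
# would carry many more variables (income, diet, travel, and so on).
records = [
    ("40-49", 3.1), ("40-49", 5.0), ("50-59", 6.2), ("50-59", 9.1),
    ("60-69", 11.5), ("60-69", 14.8), ("40-49", 4.2), ("50-59", 7.0),
    ("60-69", 10.1), ("40-49", 5.8), ("50-59", 8.3), ("60-69", 13.0),
]

train, validate = records[:8], records[8:]      # fit on one partition, check on the other

# Step 3 (a deliberately simple model): predict each group's mean rate seen in training.
group_mean = {g: mean(r for gg, r in train if gg == g) for g in {g for g, _ in train}}
predictions = [group_mean[g] for g, _ in validate]
actuals = [r for _, r in validate]

# Step 4: report more than one accuracy measure.
mse = mean((p - a) ** 2 for p, a in zip(predictions, actuals))
ss_res = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
ss_tot = sum((a - mean(actuals)) ** 2 for a in actuals)
r_squared = 1 - ss_res / ss_tot
print(f"validation MSE = {mse:.2f}, R^2 = {r_squared:.2f}")
```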
Predictive Analytics Fallacies

Cases of "spurious correlations" tend to be quite famous, such as the correlation between the number of people who die tangled in their bed sheets and the consumption of cheese per capita (http://www.tylervigen.com/spurious-correlations). These examples fall into the same fallacy as the "bad"/"good" omen one, as the observation of two events at the same time does not imply that there is a causal relationship between them.

Another classic mistake is to think, in general, that correlations show a causal relationship; therefore predictions based on correlation analyses alone tend to fail often.

Some other fallacies of predictive analytics techniques include an insufficient analysis of the errors, relying on the p-value alone, relying on a Poisson distribution of the current data, and many more.

Predictive/Descriptive/Prescriptive

There is a clear distinction between descriptive vs. predictive vs. prescriptive analytics in Big Data (Shmueli 2010). Descriptive analytics shows how past or current data can be analyzed in order to determine patterns and extract meaningful observations from the data. Predictive analytics is generally based on a model that is informed by descriptive analytics and gives various outcomes based on past data and the model. Prescriptive analytics is closely related to predictive analytics, as it takes the predictive values, puts them into a decision model, and informs the decision-makers about the future course of action (Shmueli and Koppius 2010).

Predictive Analytics Applications

In practice, predictive analytics can be applied to almost all disciplines – from predicting the failure of mechanical engines in the hard sciences, to predicting customers' buying power in the social sciences and business (Gandomi and Haider 2015).

Predictive analytics is especially used in business and marketing forecasting. Hair Jr. (2007) shows the importance of predictive analytics for marketing and how it has become more relevant with the emergence of the Big Data phenomenon. He argues that survival in a knowledge-based economy is derived from the ability to convert information to knowledge. Data mining identifies and confirms relationships between explanatory and criterion variables. Predictive analytics uses confirmed relationships between variables to predict future outcomes. The predictions are most often values suggesting the likelihood that a particular behavior or event will take place in the future.

Hair also argues that, in the future, we can expect predictive analytics to be applied increasingly to databases in all fields and to revolutionize the ability to identify, understand, and predict future developments; data analysts will increasingly rely on mixed-data models that examine both structured (numbers) and unstructured (text and images) data; statistical tools will be more powerful and easier to use; future applications will be global and real time; demand for data analysts will increase, as will the need for students to learn data analysis methods; and scholarly researchers will need to improve their quantitative skills so the large amounts of information available can be used to create knowledge instead of information overload.

Predictive Modeling and Other Forecasting Techniques

Some predictive modeling techniques do not necessarily involve Big Data. For example, Bayesian networks and Bayesian inference methods, while they can be informed by Big Data, cannot be applied granularly to each data point due to the computational complexity that can arise from calculating thousands of conditional probability tables. But Bayesian models and inferences can certainly be used in combination with statistical predictive modeling techniques in order to bring the analysis closer to a cause-effect type of inference (Pearl 2009).

Another forecasting technique that does not rely on Big Data, but harnesses the power of the crowds, is the prediction market. Just like Bayesian modeling, prediction markets can be used as a complement to Big Data and predictive modeling in order to augment the likelihood value of the predictions (Arrow et al. 2008).
in order to augment the likelihood value of the provide for the health and safety of its constitu-
predictions (Arrow et al. 2008). ents. From humanitarian, economic, and public
health perspectives, prevention is the most effec-
tive and efficient approach towards achieving this
Cross-References goal. Research and program evaluation studies
repeatedly demonstrate that prevention activities
▶ Business Intelligence Analytics improve health and safety outcomes. Health and
safety problem prevention are much safer, effi-
cient, and cost-effective than health and safety
References problem treatment.
Since its development, the discipline of public
Arrow, K.J., et al. (2008). The promise of prediction mar- health has had disease, accident, illness, and harm
kets. Science-New York then Washington-320.5878: prevention as one of its primary goals. Program
877.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big
data concepts, methods, and analytics. International
Journal of Information Management, 35(2), 137–144. S. W. Brown: deceased.
Prevention 765

Prevention

David Brown1,2 and Stephen W. Brown3
1Southern New Hampshire University, University of Central Florida College of Medicine, Huntington Beach, CA, USA
2University of Wyoming, Laramie, WY, USA
3Alliant International University, San Diego, CA, USA

S. W. Brown: deceased.

One of the primary purposes of government is to provide for the health and safety of its constituents. From humanitarian, economic, and public health perspectives, prevention is the most effective and efficient approach towards achieving this goal. Research and program evaluation studies repeatedly demonstrate that prevention activities improve health and safety outcomes. Preventing health and safety problems is much safer, more efficient, and more cost-effective than treating them.

Since its development, the discipline of public health has had disease, accident, illness, and harm prevention as one of its primary goals. Program evaluation studies have demonstrated that effective prevention efforts improve health outcomes, and they lower the cost of health care for both program participants and nonparticipants.

All aspects of the public health prevention model have been advanced through the use of big data. Big data greatly enhances the ability to identify environmental, genetic, and lifestyle factors that might increase or decrease the risk of diseases, illness, and accidents. Big data increases the speed at which new vaccines and other prevention programs can be developed and evaluated. Big data dramatically improves the ability to identify large geographic areas and highly specific locations at risk of illnesses, accidents, crimes, and epidemics.

In public health activities, the adage is that more information makes for better programs. Thus, the belief of public health scientists that more information has been generated in the last 5 years than in the entire history of mankind leads to the conclusion that big data has the potential of leading to great strides in the advancement of public health prevention programs.

Big data is enhancing epidemiologists' abilities to identify risk factors that increase the probability of diseases, illnesses, accidents, and other types of harm. This information is being used to develop programs designed to decrease or eliminate the identified risks. Big data helps clinical researchers track the efficacy of their treatments; this knowledge is used to develop new interventions designed to prevent other clinical problems. Accident and crime prevention experts use big data to predict areas likely to suffer accidents and/or criminal behavior. This information is being used to lower crime rates and improve accident statistics. Big data mechanisms can be used to track and map the spread of infectious diseases. This information has significant implications for worldwide disease prevention and health improvement.

Electronic medical records and online patient charting systems are frequently used sources of prevention big data. As an example, anonymous aggregate data from these systems help identify gaps, disparities, and unnecessary duplications in healthcare delivery. This information has a cost-saving function, and the data can be used to evaluate people's responses to prevention programs and activities.

Global Positioning System (GPS) big data information is being used to bring emergency care to areas in need of first responder services. This system has led to significant decreases in emergency response time. Fast response time often prolongs life and prevents further complications from emergency situations. Communities that have installed GPS systems have also seen a significant decrease in accidents involving their first responders and their vehicles.

Big data is also being used to help people who suffer from chronic diseases. As an example, in the case of asthma, a wireless sensor is attached to the patient's medication inhaler. This sensor provides information about the amount of medication being administered, the time of administration, and the location of the patient using the inhaler. A smartphone is then used to transmit the information to care providers and researchers. This program has led to significant decreases in the incidence of uncontrolled asthma attacks.

Big data is being used by community police departments to track the time and place of incidents of accidents and crimes. The use of such data has led to significant decreases in accidents and the incidence of violent crime in many communities.

Big data facilitates the elimination of risk factors that contribute to the development of chronic diseases such as diabetes, obesity, and heart disease. Wearable monitors can assess physical activity, diet, tobacco use, drug use, and exposure to pollution. These data then lead to the discovery and prevention of risk factors for public health problems at the population, subpopulation, and individual levels. They can improve people's quality of life by monitoring intervention effectiveness and by helping people live healthier lives in healthier environments.
Conclusion

As technology continues to advance, additional opportunities will present themselves to utilize big data techniques to prevent disease and disability around the world. As additional big data sources and technologies develop, it is reasonable to predict a decrease in their cost and an increase in their effectiveness. However, as in all systems, while the data itself is highly valuable, it is not the data that is the primary source of improved prevention activities and programs. Rather, it is the information gleaned from the data and the questions that the data answer that are of most value. The effective use of big data has great potential to prevent illness, accidents, diseases, and crimes that cause harm to the public good on a worldwide scale. Big data and big improvements in disease, accident, illness, and harm prevention would definitely seem to go hand in hand.

Cross-References

▶ Biomedical Data
▶ Electronic Health Records (EHR)
▶ Evidence-Based Medicine
▶ Health Care Delivery
▶ Participatory Health and Big Data
▶ Patient-Centered (Personalized) Health

Further Reading

Barrett, M., Humblet, O., Hiatt, R. A., et al. (2013). Big data and disease prevention. Big Data, September 2013.
Chawla, N. V., & Davis, D. A. (2013). Bringing big data to personalized healthcare: A patient-centered framework. Journal of General Internal Medicine, 28(Suppl 3), 660–665.
Hay, S. I., George, D. B., Moyes, C. L., & Brownstein, J. S. (2013). Big data opportunities for global infectious disease surveillance. PLoS Medicine, 10(4), 1–4. https://doi.org/10.1371/journal.pmed.1001413.
Michael, K., & Miller, K. W. (2013). Big data: New opportunities and new challenges. Computer, 46(6), 22–24.
Van Sickle, D., Maenner, M., Barrett, M., et al. (2013). Monitoring and improving compliance and asthma control: Mapping inhaler use for feedback to patients, physicians and payers. Respiratory Drug Delivery Europe, 1, 1–12.

Privacy

Joanna Kulesza
Department of International Law and International Relations, University of Lodz, Lodz, Poland

Origins and Definition

Privacy is a universally recognized human right, subject to state protection from arbitrary or unlawful interference and unlawful attacks. The age of Big Data has brought it to the foreground of all technology-related debates, as the amount of information aggregated online, generated by various sources, together with the computing capabilities of modern networks, makes it easy to connect an individual to a particular piece of information about them, possibly causing a direct threat to their privacy. Yet international law grants every person the right to legal safeguards against any interference with one's right or attacks upon it. The right to privacy covers, although it is not limited to, one's identity, integrity, intimacy, autonomy, communication, and sexuality, and results in legal protection for one's physical integrity; health information, including sexual orientation and gender; reputation; image; personal development; personal autonomy; and self-determination, as well as family, home, and correspondence, which are to be protected by the state from arbitrary or unlawful interferences by its organs or third parties. This catalogue is meant to remain an open one, enabling protection of ever new categories of data, such as geographical location data or, arguably, a "virtual personality." As such, the term covers also information about an individual that is produced, generated, or needed for the purpose of rendering electronic services, such as a telephone, an IMEI or an IP number, an e-mail address, a website address, geolocation data, or search terms, as long as such information may be linked to an individual and allows for their identification.
Privacy is not an absolute right and may be limited for reasons considered necessary in a democratic society. While there is no numerus clausus of such limitative grounds, they usually include reasons of state security and public order or the rights of others, such as their freedom of expression. States are free to introduce certain limitations on the individual right to privacy as long as those are introduced by specific provisions of law, communicated to the individuals whose privacy is impacted, and applied solely when necessary in particular circumstances. This seemingly clear and precise concept suffers practical limitations, as states differ in their interpretations of the "necessity" of interference as well as the "specificity" of legal norms required and the scope of their application. As a consequence, the concept of privacy varies strongly throughout the world's regions and countries. This is a particular challenge at the time of Big Data, as various national and regional perceptions of privacy need to be applied to the very same vast catalogue of online information.

This inconsistency in privacy perceptions results from the varied cultural and historical backgrounds of individual states as well as their differing political and economic situations. In countries recognizing values reflected in universal human rights treaties, including Europe, large parts of the Americas, and some Asian states, the right to privacy covers numerous elements of individual autonomy and is strongly protected by comprehensive legal safeguards. On the other hand, in rapidly developing countries, as well as in ones with an unstable political or economic situation, primarily located in Asia and Africa, the significance of the right to one's private life subsides to the urgent needs of protecting life and personal or public security. As a consequence, the undisputed right to privacy, subject to numerous international treaties and rich international law jurisprudence, remains highly ambiguous, an object of conflicting interpretations by national authorities and their agents. This is one of the key challenges to finding the appropriate legal norms governing Big Data. In the unique Big Data environment, it is not only the traditional jurisdictional challenges, specific to all online interactions, that must be faced but also the tremendously varying perceptions of privacy, all finding their application to the vast and varied Big Data resource.

History

The idea of privacy arose simultaneously in various cultures. Contemporary authors most often refer to the works of American and European legal writers of the late nineteenth century to identify its origins. In US doctrine it was Warren and Brandeis who introduced in their writings "the right to be let alone," a notion still often used to describe the essential content of privacy. Yet at roughly the same time, the German legal scholar Kohler published a paper covering a similar concept. It was also in the mid nineteenth century that French courts issued their first decisions protecting the right to private life. The right to privacy was introduced to grant individuals protection from undesired intrusions into their private affairs and home life, be it by nosy journalists or governmental agents. Initially the right was used to limit the rapidly evolving press industry; with time, as individual awareness and recognition of the right increased, the right to privacy primarily introduced limits on the individual information that state or local authorities may obtain and process. As with any new idea, the right to privacy initially provoked much skepticism, yet by the mid twentieth century it had become a necessary element of the rising human rights law. In the twenty-first century, it gained increased attention as a side effect of the growing, global information society. International online communications allowed for easy and cheap mass collection of data, creating the greatest threat to privacy so far. What followed was an eager debate on the limits of allowed privacy intrusions and the actions required from states aimed at safeguarding the rights of the individual. A satisfactory compromise is not easy to find, as states and communities view privacy differently, based on their history, culture, and mentality. The existing consensus on human rights seems to be the only starting point of a successful search for an effective privacy compromise, much needed in the era of transnational companies operating on Big Data.
Big Data. With the modern notions of "the right to be forgotten" or "data portability" referring to new facets of the right to protect one's privacy, the Big Data phenomenon is one of the deciding factors of this ongoing evolution.

Privacy as a Human Right

The first document of international human rights law recognizing the right to privacy was the 1948 Universal Declaration of Human Rights (UDHR). The nonbinding political middle ground was not too difficult to find with the greatest horrors in human history of World War II still vivid in the minds of the world's politicians and citizens alike. With horrid memories fading away and the Iron Curtain drawing a clear line between differing values and interests, a binding treaty on the very issue took almost 20 more years. Irreconcilable differences between communist and capitalist countries covered the scope and implementation of individual property, free speech, or privacy. The eventual 1966 compromise in the form of the two fundamental human rights treaties, the International Covenant on Civil and Political Rights (ICCPR) and the International Covenant on Economic, Social and Cultural Rights (ICESCR), allowed for a conciliatory wording on hard law obligations for different categories of human rights, yet left the crucial details to future state practice and international jurisprudence. Among the rights to be put into detail by future state practice, international courts, and organizations was the right to privacy, established as a human right in Article 12 UDHR and Article 17 ICCPR. They both granted every individual freedom from "arbitrary interference" with their "privacy, family, home, or correspondence" as well as from any attacks upon their honor and reputation. While neither document defines "privacy," the UN Human Rights Committee (HRC) has gone into much detail on delimiting its scope for the international community. All 168 ICCPR state parties are obliged per the Covenant to reflect HRC recommendations on the scope and enforcement of the treaty in general and privacy in particular. Over time the HRC produced detailed instruction on the scope of privacy protected by international law, discussing the thin line with state sovereignty, security, and surveillance.

According to Article 12 UDHR and Article 17 ICCPR, privacy must be protected against "arbitrary or unlawful" intrusions or attacks through national laws and their enforcement. Those laws are to detail the limits for any justified privacy invasions. Those limits of the individual privacy right are generally described in Article 29 para. 2 of the UDHR, which allows for limitations of all human rights determined by law solely for the purpose of securing due recognition and respect for the rights and freedoms of others and of meeting the just requirements of morality, public order, and the general welfare in a democratic society. Although proposals for including a similar restraint in the text of the ICCPR were rejected by the negotiating parties, the right to privacy is not an absolute one. Following HRC guidelines and state practice surrounding the ICCPR, privacy may be restrained according to national laws which meet the general standards present in human rights law. The HRC confirmed this interpretation in its 1988 General Comment No. 16 as well as in recommendations and observations issued thereafter. Before Big Data became, among its other functions, an effective tool for mass surveillance, the HRC took a clear stand on the question of the legally permissible limits of state inspection. It clearly stated that any surveillance, whether electronic or otherwise; interceptions of telephonic, telegraphic, and other forms of communication; wiretapping; and recording of conversations should be prohibited. It confirmed that individual limitation upon privacy must be assessed on a case-by-case basis and follow a detailed legal guideline, containing precise circumstances when privacy may be restricted by actions of local authorities or third parties. The HRC specified that even interference provided for by law should be in accordance with the provisions, aims, and objectives of the Covenant and reasonable in the particular circumstances, where "reasonable" means justified by those particular circumstances. Moreover, as per the HRC interpretation, states must take effective measures to guarantee that information about an individual's life
does not reach those not authorized by law to obtain, store, or process it. Those general guidelines are to be considered the international standard of protecting the human right to privacy and need to be respected regardless of the ease that Big Data services offer in connecting pieces of information available online with the individuals they relate to. Governments must ensure that Big Data is not used in a way that infringes individual privacy, regardless of the economic benefits and technical accessibility of Big Data services.

The provisions of Article 17 ICCPR resulted in similar stipulations in other international treaties. Those include Article 8 of the European Convention on Human Rights (ECHR), binding upon its 48 member states, or Article 11 of the American Convention on Human Rights (ACHR), agreed upon by 23 parties to the treaty. The African Charter on Human and Peoples' Rights (Banjul Charter) does not contain a specific stipulation regarding privacy, yet its provisions of Article 4 on the inviolability of human rights, Article 5 on human dignity, and Article 16 on the right to health serve as a basis to grant individuals within the jurisdiction of its 53 state parties the protection recognized by European or American states as inherent to the right of privacy. While no general human rights document exists among Australasian states, the general guidelines provided by the HRC and the work of the OECD are often reflected in national laws on privacy, personal rights, and personal data protection.

Privacy and Personal Data

The notion of personal data is closely related to that of privacy, yet their scopes differ. While personal data is a term relatively well defined, privacy is a broader and more ambiguous notion. As Kuner rightfully notes, the concept of privacy protection is a broader one than personal data regulations, where the latter provides a more detailed framework for individual claims. The influential Organization for Economic Co-operation and Development (OECD) Forum identified personal data as a component of the individual right to privacy, yet its 34 members differ on the effective methods of privacy protection and the extent to which such protection should be granted. Nevertheless, the nonbinding yet influential 1980 OECD Guidelines on the Protection of Privacy and Transborder Flow of Personal Data (Guidelines), together with their 2013 update, have so far encouraged data protection laws in over 100 countries, justifying the claim that, thanks to its detailed yet unified character and national enforceability, personal data protection is the most common and effective legal instrument safeguarding individual privacy. The Guidelines identify universal privacy protection through eight personal data processing principles. The definition of "personal data" contained in the Guidelines is usually directly adopted by national legislations, which cover any information relating to an identified or identifiable individual, referred to as the "data subject." The basic eight principles of privacy and data protection include (1) the collection limitation principle, (2) the data quality principle, (3) the individual participation principle, (4) the purpose specification principle, (5) the use limitation principle, (6) the security safeguards principle, (7) the openness principle, and (8) the accountability principle. They introduce certain obligations upon "data controllers," that is, parties "who, according to domestic law, are competent to decide about the contents and use of personal data regardless of whether or not such data are collected, stored, processed or disseminated by that party or by an agent on their behalf." They oblige data controllers to respect limits made by national laws pertaining to the collection of personal data. As already noted, this is of particular importance to Big Data operators, who must be aware of and abide by the varying national regimes. Personal data must be obtained by "lawful and fair" means and with the knowledge or consent of the data subject, unless otherwise provided by relevant law. Collecting or processing personal data may only be done when it is relevant to the purposes for which it will be used. Data must be accurate, complete, and up to date. The purposes for data collection ought to be specified no later than at the time of data collection. The use of the data must be
limited to the purposes so identified. Data controllers, including those operating on Big Data, are not to disclose personal data at their disposal for purposes other than those initially specified and agreed upon by the data subject, unless such use or disclosure is permitted by law. All data processors are to show due diligence in protecting their collected data, by introducing reasonable security safeguards against the loss of or unauthorized access to data and its destruction, use, modification, or disclosure. This last obligation may prove particularly challenging for Big Data operators, with regard to the multiple locations of data storage and their continuous changeability. Consequently, each data subject enjoys the right to obtain information on the fact of the data controller having data relating to him, to have any such data communicated within a reasonable time, to be given reasons if a request for such information is denied, as well as to be able to challenge such denial and any data relating to him. Each data subject further enjoys the right to have their data erased, rectified, completed, or amended, and the data controller is to be held accountable under national laws for the lack of effective measures ensuring all of those personal data rights.

Therewith the OECD principles form a practical standard for privacy protection represented in the human rights catalogue, applicable also to Big Data operators, given that the data at their disposal relates directly or indirectly to an individual. While their effectiveness may come to depend upon jurisdictional issues, the criteria for identification of data subjects and the obligations of data processors are clear.

Privacy as a Personal Right

Privacy is recognized not only by international law treaties and international organizations but also by national laws, from constitutions to civil and criminal law codes and acts. Those regulations hold great practical significance, as they allow for direct remedies against privacy infractions from private parties, rather than those enacted by state authorities. Usually privacy is considered an element of the larger catalogue of personal rights and granted equal protection. It allows individuals whose privacy is under threat to have the threatening activity ceased (e.g., infringing information deleted or a press release stopped). It also allows for pecuniary compensation or damages should a privacy infringement already have taken place.

Originating from German-language civil law doctrine, privacy protection may be well described by the theory of concentric spheres. Those include the public, private, and intimate spheres, with different degrees of protection from interference granted to each of them. The strongest protection is granted to intimate information; activities falling within the public sphere are not protected by law and may be freely collected and used. All individual information may be qualified as falling into one of the three spheres, with the activities performed in the public sphere being those performed by an individual as a part of their public or professional duties and obligations and deprived of privacy protection. This sphere would differ per individual, with "public figures" enjoying the least protection. An assessment of the limits of one's privacy when compared with their public function would always be made on a case-by-case basis. Any information that may not be considered public is to be granted privacy protection and may only be collected or processed with permission granted by the one it concerns. The need to obtain consent from the individual the information concerns is also required for the intimate sphere, where the protection is even stronger. Some authors argue that information on one's health, religious beliefs, sexual orientation, or history should only be distributed in pursuit of a legitimate aim, even when permission for its distribution was granted by the one it concerns.

With the civil law scheme for privacy protection being relatively simple, its practical application relies on case-by-case assessment and may therefore prove challenging and unpredictable in practice, especially when international court practice is at issue.
Privacy and Big Data

Big Data is a term that directly refers to information about individuals. It may be defined as gathering, compiling, and using large amounts of information enabling marketing or policy decisions. With large amounts of data being collected by international service providers, in particular ones offering telecommunication services such as Internet access, the scope of data they may collect and the use to which they may put it is of crucial concern to all their clients but also to their competitors and to state authorities interested in participating in this valuable resource. In the light of the analysis presented above, any information falling within the scope of Big Data that is collected and processed while rendering online services may be considered subject to privacy protection when it refers to an identified or identifiable individual, that is, a physical person who may either be directly identified or whose identification is possible. When determining whether a particular category or piece of information constitutes private data, account must be taken of the means likely reasonably to be used by any person to identify the individual, in particular the costs, time, and labor needed to identify such a person. When private information has been identified, the procedures required for privacy protection described above ought to be applied by entities dealing with such information. In particular the guidelines described by the HRC in their comments and observations may serve as a guideline for handling personal data falling within the Big Data resource. Initiatives such as the Global Network Initiative, a bottom-up initiative of the biggest online service providers aimed at identifying and applying universal human rights standards for online services, or the UN Protect, Respect and Remedy Framework for business, defining the human rights obligations of private parties, present a useful tool for introducing enhanced privacy safeguards for all Big Data resources. With the users' growing awareness of the value of their privacy, company privacy policies prove to be a significant element of the marketing game, inciting Big Data operators to convince ever more users to choose their privacy-oriented services.

Summary

Privacy recognized as a human right requires certain precautions to be taken by state authorities and private business alike. Any information that may allow for the identification of an individual ought to be subjected to particular safeguards allowing for its collection or processing solely based on the consent of the individual in question or a particular norm of law applicable in a case where the inherent privacy invasion is reasonable and necessary to achieve a justifiable aim. In no case may private information be collected or processed in bulk, with no judicial supervision or without the consent of the individual it refers to. Big Data offers new possibilities for collecting and processing personal data. When designing Big Data services or using the information they provide, all business entities must address the international standards of privacy protection, as identified by international organizations and good business practice.

Cross-References

▶ Data Processing
▶ Data Profiling
▶ Data Provenance
▶ Data Quality Management
▶ Data Security
▶ Data Security Management

Further Reading

Kuner, C. (2009). An international legal framework for data protection: Issues and prospects. Computer Law and Security Review, 25(263), 307.
Kuner, C. (2013). Transborder data flows and data privacy law. Oxford: Oxford University Press.
UN Human Rights Committee. General Comment No. 16: Article 17 (Right to Privacy), The Right to Respect of
Privacy, Family, Home and Correspondence, and Protection of Honour and Reputation. 8 Apr 1988. http://www.refworld.org/docid/453883f922.html.
UN Human Rights Council. Report of the Special Rapporteur on the promotion and protection of human rights and fundamental freedoms while countering terrorism, Martin Scheinin. U.N. Doc. A/HRC/13/37.
Warren, S. D., & Brandeis, L. D. (1890). The right to privacy. Harvard Law Review, 4(5), 193–220.
Weber, R. H. (2013). Transborder data transfers: Concepts, regulatory approaches and new legislative initiatives. International Data Privacy Law, v. 1/3–4.

Probabilistic Matching

Ting Zhang
Department of Accounting, Finance and Economics, Merrick School of Business, University of Baltimore, Baltimore, MD, USA

Definition/Introduction

Probabilistic matching differs from the simplest data matching technique, deterministic matching. For deterministic matching, two records are said to match if one or more identifiers are identical. Deterministic record linkage is a good option when the entities in the data sets have identified common identifiers with a relatively high quality of data. Probabilistic matching is a statistical approach that measures the probability that two records represent the same subject or individual based on whether they agree or disagree on the various identifiers (Dusetzina et al. 2014).

It calculates composite linkage weights based on likeness scores for identifier values and uses thresholds to determine a match, nonmatch, or possible match. The quality of the resulting matches can depend upon one's confidence in the specification of the matching rules (Zhang and Stevens 2012). It is designed to work using a wider set of data elements and all available identifiers for matching and does not require identical identifiers or exact matches. Instead, it compares the probability of a match to a chosen threshold.

Why Probabilistic Matching?

Although deterministic matching is important in the big data world, it works well only with high quality data. Often the data we have contain no known or identical identifiers, with missing, incomplete, erroneous, or inaccurate values. Some data may change over time, such as address changes due to relocation or name changes due to marriage or divorce. Sometimes there could be typos, words out of order, split words, or extraneous, missing, or wrong information in an identification number (see Zhang and Stevens 2012).

In the big data world, larger data sets have more attributes involved and more complex rules-based matching routines. In that case, implementing deterministic matching can involve many man hours of processing, testing, customization, and revision time and longer deployment times than probabilistic matching. As Schumacher (2007) mentioned, unlike probabilistic matching, which has the scalability and capability to perform lookups in real time, deterministic matching does not have speed advantages.

As Schumacher (2007) suggested, because probabilistic matching assigns a probability to the quality of a match, allowing variation and nuances, it is better suited for complex data systems with multiple databases. Larger databases often mean greater potential for duplicates, human error, and discrepancies; this makes a matching technique designed to determine links between records with complex error patterns more effective. For probabilistic matching, users decide the tolerance level of their choice for a match.

Steps for Probabilistic Matching

This matching technique typically includes three stages: pre-matching data cleaning, the matching stage, and post-matching manual data review. For the match stage, Dusetzina et al. (2014) summarize the probabilistic matching steps as follows:

1. Estimate the match and non-match probabilities for each linking variable using the
observed frequency of agreement and disagreement patterns among all pairs, commonly generated using the expectation-maximization algorithm described by Fellegi and Sunter (1969). The match probability is the probability of an agreed identifier, and the non-match probability is the probability that false matches randomly agree on the identifier.
2. Calculate agreement and disagreement weights using the match and non-match probabilities. The weight assigned to agreement or disagreement on each identifier is assessed as a likelihood ratio, comparing the match probability to the non-match probability.
3. Calculate a total linking weight for each pair by summing the individual linking weights for each linkage variable.
4. Compare the total linkage weight to a chosen threshold above which pairs are considered a link. The threshold is set using information generated in Step 1.
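To make these steps concrete, the following is a minimal Python sketch of the weight calculation; the field names, match/non-match probabilities, and threshold are illustrative assumptions rather than values prescribed by the method itself.

```python
import math

# Illustrative match (m) and non-match (u) probabilities per linking variable,
# e.g., as estimated in Step 1 (often via the expectation-maximization algorithm).
m_prob = {"last_name": 0.95, "birth_year": 0.90, "zip": 0.85}
u_prob = {"last_name": 0.01, "birth_year": 0.05, "zip": 0.10}

def agreement_weight(field):
    # Step 2: log-likelihood ratio when the field agrees.
    return math.log2(m_prob[field] / u_prob[field])

def disagreement_weight(field):
    # Step 2: log-likelihood ratio when the field disagrees.
    return math.log2((1 - m_prob[field]) / (1 - u_prob[field]))

def total_linking_weight(rec_a, rec_b):
    # Step 3: sum the individual weights over all linking variables.
    return sum(
        agreement_weight(f) if rec_a.get(f) == rec_b.get(f) else disagreement_weight(f)
        for f in m_prob
    )

# Step 4: compare the total weight to a chosen threshold (illustrative value).
THRESHOLD = 5.0
a = {"last_name": "smith", "birth_year": 1970, "zip": "21201"}
b = {"last_name": "smith", "birth_year": 1970, "zip": "21230"}
print(total_linking_weight(a, b) >= THRESHOLD)
```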
Applications

Data Management
Probabilistic matching is used to create and manage databases. It helps to clean and reconcile data and remove duplicates.

Data Warehousing and Business Intelligence
Probabilistic matching plays a key role in data warehousing. This method can help merge multiple datasets from various sources into one.

Medical History and Practice
A medical data warehouse put together using probabilistic matching can help quickly extract a patient's medical history for better medical practice.

Longitudinal Study
A data warehouse based on probabilistic matching can be used to put together longitudinal datasets for longitudinal studies.

Software

Link Plus
One often used free software package is Link Plus, developed by the Centers for Disease Control and Prevention. Link Plus is a probabilistic record linkage software product originally designed to be used by cancer registries. However, Link Plus can be used with any type of data and has been used extensively across diverse research disciplines.

The Link King
The Link King is another free software package, but it requires a license for base SAS. It is developed by Washington State's Division of Alcohol and Substance Abuse. Like Link Plus, the software provides a straightforward user interface using information including first and last names.

Other Public Software
ChoiceMaker and Freely Extensible Biomedical Record Linkage (FEBRL) are two publicly available software packages that health services researchers have used frequently in recent years (Dusetzina et al. 2014). Record Linkage At IStat (RELAIS) is a JAVA, R, and MySQL based open source software package.

Known Commercial Software
Selected commercial software packages include LinkageWiz, G-Link developed by Statistics Canada based on Winkler (1999), LinkSolv, StrategicMatching, and IBM InfoSphere Master Data Management for enterprise data.

Conclusion

Probabilistic matching is a statistical approach that measures the probability that two records represent the same subject or individual based on whether they agree or disagree on the various identifiers. It has advantages over simplistic deterministic matching. The method itself follows several steps. Its applications include data management, data warehousing, medical practice, and longitudinal research. A variety of public and
commercial software to conduct probabilistic matching is available.

Further Readings

Dusetzina, S. B., Tyree, S., Meyer, A. M., et al. (2014). Linking data for health services research: A framework and instructional guide. Rockville: Agency for Healthcare Research and Quality (US).
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210.
Schumacher, S. (2007). Probabilistic versus deterministic data matching: Making an accurate decision, information management special reports. Washington, DC: The Office of the National Coordinator for Health Information Technology (ONC).
Winkler, W. E. (1999). The state of record linkage and current research problems. Washington, DC: Statistical Research Division, US Census Bureau.
Zhang, T., & Stevens, D. W. (2012). Integrated data system person identification: Accuracy requirements and methods. https://ssrn.com/abstract=2512590; https://doi.org/10.2139/ssrn.2512590.

Profiling

Patrick Juola
Department of Mathematics and Computer Science, McAnulty College and Graduate School of Liberal Arts, Duquesne University, Pittsburgh, PA, USA

Profiling is the analysis of data to determine features of the data source that are not explicitly present in the data. For example, by examining information related to a specific criminal act, investigators may be able to determine the psychology and the background of the perpetrator. Similarly, advertisers may look at public behavior to identify psychological traits, with an eye to targeting ads to more effectively influence individual consumers' behavior. This has proven to be a controversial application of big data, both for ethical reasons and because the effectiveness of profiling techniques has been questioned.

Profiling is sometimes distinguished from identification (see De-identification/Re-identification) because what is produced is not a specific individual identity, but a set of characteristics that can apply to many people, but is still useful. One application is in criminal investigations. Investigators use profiling to identify characteristics of offenders based on what is known of their actions (Douglas and Burgess 1986). For example, the use of specific words by anonymous letter writers can help link different letters to the same person and in some cases can provide deeper information. In one case (Shuy 2001), an analysis of a ransom note turned up an unusual phrase indicating that the writer of the note was from the Akron, Ohio area; this knowledge made it easy to identify the actual kidnapper from among the suspects. Unfortunately, this kind of specific clue is not always present at the crime scene and may require specialist knowledge to interpret. Big data provides one method to fill this gap by treating profiling as a data classification/machine learning problem and analyzing large data sets to learn differences among classes, then applying this to specific data of interest.

For example, the existence of gender differences in language is well-known (Coates 2015). By collecting large samples of writing by both women and men, a computer can be trained to learn these differences and then determine the gender of the unknown author of a new work (Argamon et al. 2009). Similar analyses can determine gender, age, native language, and even personality traits (Argamon et al. 2009). Other types of analysis, such as looking at Facebook "likes," can evaluate a person's traits more accurately than the person's close friends (Andrews 2018).

This kind of knowledge can be used in many ways beyond law enforcement. Advertisements, for example, can be more effective when tailored to the recipient's traits (Andrews 2018). However, this lends itself to data abuses, such as Cambridge Analytica's attempt to manipulate elections, including the 2016 US Presidential election and the 2016 UK Brexit referendum. Using personality-based microtargeting, the company suggested different advertisements to persuade individual voters to vote in the desired way (Rathi 2019). This has been described as an "ethical grey area" and an "[attempt] to manipulate voters by latching onto their vulnerabilities" (Rathi 2019). However, it is also not clear whether or not the
models used were accurate enough to be effective, or how many voters were actually persuaded to cast their votes in the intended way (Rathi 2019).

As with any active research area, the performance and effectiveness of profiling are likely to progress over time. Debates on the ethics, effectiveness, and even legality of this sort of profile-based microtargeting are likely to continue for the foreseeable future.

Cross-References

▶ De-identification/Re-identification

Further Reading

Andrews, E. L. (2018). The science behind Cambridge Analytica: Does psychological profiling work? Insights by Stanford Business. https://www.gsb.stanford.edu/insights/science-behind-cambridge-analytica-does-psychological-profiling-work.
Argamon, S., Koppel, M., Pennebaker, J. W., & Schler, J. (2009). Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2), 119–123.
Coates, J. (2015). Women, men and language: A sociolinguistic account of gender differences in language. New York: Routledge.
Douglas, J. E., & Burgess, A. E. (1986). Criminal profiling: A viable investigative tool against violent crime. FBI Law Enforcement Bulletin, 55(12), 9–13. https://www.ncjrs.gov/pdffiles1/Digitization/103722-103724NCJRS.pdf.
Rathi, R. (2019). Effect of Cambridge Analytica's Facebook ads on the 2016 US Presidential election. Towards Data Science. https://towardsdatascience.com/effect-of-cambridge-analyticas-facebook-ads-on-the-2016-us-presidential-election-dacb5462155d.
Shuy, R. W. (2001). DARE's role in linguistic profiling. DARE Newsletter, 4(1).

Psychology

Daniel N. Cassenti and Katherine R. Gamble
U.S. Army Research Laboratory, Adelphi, MD, USA

Wikipedia introduces big data as "a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications." The field of psychology is interested in big data in two ways: (1) at the level of the data, that is, how much data there are to be processed and understood, and (2) at the level of the user, or how the researcher analyzes and interprets the data. Thus, psychology can serve the role of helping to improve how researchers analyze big data and provide data sets that can be examined or analyzed using big data principles and tools.

Psychology

Psychology may be divided into two overarching areas: clinical psychology, with a focus on individuals, and the fields of experimental psychology, with foci on the more general characteristics that apply to the majority of people. Allen Newell classifies the fields of experimental psychology by time scale, to include biological at the smallest time scale, cognitive (the study of mental processes) at the scale of hundreds of milliseconds to tens of seconds, rational (the study of decision making and problem solving) at minutes to hours, and social at days to months. The cognitive, rational, and social bands can all be related to big data in terms of both the researcher analyzing data and the data itself. Here, we describe how psychological principles can be applied to the researcher to handle data in the cognitive and rational fields and demonstrate how psychological data in the social field can be big data.

Cognitive and Rational Fields

One of the greatest challenges of big data is its analysis. The principles of cognitive and rational psychology can be applied to improve how the big data researcher evaluates and makes decisions about the data. The first step in analysis is attention to the data, which often involves filtering out irrelevant from relevant data. While many software programs can provide an automated filtering of data, the researcher must still give attention and critical analysis to the data as a check on the
automated system, which operates within rigid criteria preset by the researcher that are not sensitive to the context of the data. At this early level of analysis, the researcher's perception of the data, ability to attend and retain attention, and working memory capacity (i.e., the quantity of information that an individual can store while working on a task) are all important to success. That is, the researcher must efficiently process and highlight the most important information, stay attentive enough to do this for a long period of time, and, because of limited working memory capacity and a lot of data to be processed, effectively manage the data, such as by chunking information, so that it is easier to filter and store in memory.

The goal of analysis is to lead to decisions or conclusions about data, the scope of the rational field. If all principles from cognitive psychology have been applied correctly (e.g., only the most relevant data are presented and only the most useful information stored in memory), tenets of rational psychology must next be applied to make good decisions about the data. Decision making may be aided by programming the analysis software to present decision options to the researcher. For example, in examining educational outcomes of children who come from low income families, the researcher's options may be to include children who are or are not part of a state-sponsored program, or are of a certain race. Statistical software could be designed to present these options to the researcher, which may reveal results or relationships in the data that the researcher may not have otherwise discovered. Option presentation may not be enough, however, as researchers must also be aware of the consequences of their decisions. One possible solution is the implementation of associate systems for big data software. An associate system is automation that attempts to advise the user, in this case to aid decision making. Because these systems are knowledge based, they have situational awareness and are able to recommend courses of action and the reasoning behind those recommendations. Associate systems do not make decisions themselves, but instead work semiautonomously, with the user imposing supervisory control. If the researcher deems recommended options to be unsuitable, then the associate system can present what it judges to be the next best options.

Social Field

The field of social psychology provides good examples of methods of analysis that can be used with big data, especially with big data sets that include groups of individuals and their relationships with one another, the scope of social psychology. The field of social psychology is able to ask questions and collect large amounts of data that can be examined and understood using these big data-type analyses, including, but not limited to, the following types of analyses.

Linguistic analysis offers the ability to process transcripts of communications between individuals, or to groups as in social media applications, such as tweets from a Twitter data set. A linguistic analysis may be applied in a multitude of ways, including analyzing the qualities of the relationship between individuals or how communications to groups may differ based on the group. These analyses can determine qualities of these communications, which may include trust, attribution of personal characteristics, or dependencies, among other considerations.

Sentiment analysis is a type of linguistic analysis that takes communications and produces ratings of the emotional valence individuals direct to the topic. This is of value for social data researchers who must find those with whom alliances may be formed and who to avoid. A famous example is the strategy shift taken by United States Armed Forces commanders to ally with Iraqi residents. Sentiment analysis indicated which residential leaders would give their cooperation for short-term goals of mutual interest.

The final social psychological big data analysis technique under consideration here is social-network analysis or SNA. With SNA, special emphasis is not on the words spoken as in linguistic and sentiment analysis but on the directionality and frequency of communication
between individuals. SNA creates a type of network map that uses nodes and ties to connect members of groups or organizations to one another. This visualization tool allows a researcher to see how individuals are connected to one another, with factors like the thickness of a line determining the frequency of communication, or the number of lines coming from a node determining the number of nodes to which it is connected.

Psychological Data as Big Data

Each field of psychology potentially includes big data sets for analysis by a psychological researcher. Traditionally, psychologists have collected data on a smaller scale using controlled methods and manipulations analyzable with traditional statistical analyses. However, with the advent of big data principles and analysis techniques, psychologists can expand the scope of data collection to examine larger data sets that may lead to new and interesting discoveries. The following section discusses each of the aforementioned fields.

In clinical psychology, big data may be used to diagnose an individual. In understanding an individual or attempting to make a diagnosis, the person's writings and interview transcripts may be analyzed in order to provide insight into his or her state of mind. To thoroughly analyze and treat a person, a clinical psychologist's most valuable tool may be this type of big data set.

Biological psychology includes the subfields of psychophysiology and neuropsychology. Psychophysiological data may include hormone collection (typically salivary), blood flow, heart rate, skin conductance, and other physiological responses. Neuropsychology includes multiple technologies for collecting information about the brain, including electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and functional near infrared spectroscopy (fNIRS), among other lesser used technologies. Measures in biological psychology are generally taken near-continuously across a certain time range, so much of the data collected in this field could be considered big data.

Cognitive psychology covers all mental processing. That is, this field includes the initiation of mental processing from internal or external stimuli (e.g., seeing a stoplight turn yellow), the actual processing of this information (e.g., understanding that a yellow light means to slow down), and the initiation of an action (e.g., knowing that you must step on the brake in order to slow your car). For each action that we take, and even actions that may be involuntary (e.g., turning your head toward an approaching police siren as you begin to slow your car), cognitive processing must take place at the levels of perception, information processing, and initiation of action. Therefore, any behavior or thought process that is measured in cognitive psychology will yield a large amount of data for even the simplest of these, such that complex processes or behaviors measured for their cognitive process will yield data sets of the magnitude of big data.

Another clear case of a field with big data sets is rational psychology. In rational psychological paradigms, researchers who limit experimental participants to a predefined set of options often find themselves limiting their studies to the point of not capturing naturalistic rational processing. The rational psychologist instead typically confronts big data as imaginative solutions to problems, and many forms of data, such as verbal protocols (i.e., transcripts of participants explaining their reasoning), require big data analysis techniques.

Finally, with the large time band under consideration, social psychologists must often consider days' worth of data in their studies. One popular technique is to have participants use wearable technology to periodically remind them to record how they are doing, thinking, and feeling during the day. These types of studies lead to big data sets not just because of the frequency with which the data is collected, but also due to the enormous number of possible activities, thoughts, and feelings that participants may have experienced and recorded at each prompted time point.
The Unique Role of Psychology in Big Data

As described above, big data plays a large role in the field of psychology, and psychology can play an important role in how big data are analyzed and used. One aspect of this relationship is the necessity of the role of the psychology researcher on both ends of big data. That is, psychology is a theory-driven field, where data are collected in light of a set of hypotheses, and analyzed as either supporting or rejecting those hypotheses. Big data offers endless opportunities for exploration and discovery in other fields, such as creating word clouds from various forms of social media to determine what topics are trending, but solid psychological experiments are driven by a priori ideas, rather than data exploration. Thus, psychology is important to help big data researchers learn how to best process their data, and many types of psychological data can be big data, but the importance of theory, hypotheses, and the role of the researcher will always be integral in how psychology and big data interact.

Cross-References

▶ Artificial Intelligence
▶ Communications
▶ Decision Theory
▶ Social Media
▶ Social Network Analysis
▶ Social Sciences
▶ Socio-spatial Analytics
▶ Visualization

Further Reading

Cowan, N. (2004). Working memory capacity. New York: Psychology Press.
Endsley, M. R. (2000). Theoretical underpinnings of situation awareness: A critical review. In Situation awareness analysis and measurement. Mahwah, NJ: Lawrence Erlbaum Associates.
Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis. Cambridge, MA: MIT Press.
Lewis, T. G. (2011). Network science: Theory and applications. Hoboken: Wiley.
Neisser, U. (1976). Cognition and reality: Principles and implications of cognitive psychology. San Francisco: W.H. Freeman and Co.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
Newell, A., & Simon, H. (1972). Human problem solving. Englewood Cliffs: Prentice-Hall.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.
Pentland, A. (2014). Social physics: How good ideas spread – The lessons from a new science. New York: Penguin Press.
Yarkoni, T. (2012). Psychoinformatics: New horizons at the interface of the psychological and computing sciences. Current Directions in Psychological Science, 21(6), 391–397.
Recommender Systems

Julian McAuley
Computer Science Department, UCSD, San Diego, USA

Introduction

Every day we interact with predictive systems that seek to model our behavior, monitor our activities, and make recommendations: Whom will we befriend? What articles will we like? What products will we purchase? Who influences us in our social network? And do our activities change over time? Models that answer such questions drive important real-world systems, and at the same time are of basic scientific interest to economists, linguists, and social scientists, among others. Recommender Systems aim to solve tasks such as those above, by learning from large volumes of historical activities to describe the dynamics of user preferences and the properties of the content users interact with. Recommender systems can take many forms (Table 1), though in essence all boil down to modeling the interactions between users and content, in order to predict future actions and preferences. In this chapter, we investigate a few of the most common models and paradigms, starting with item-to-item recommendation (e.g., "people who like x also like y"), followed by systems that model user preferences and item properties, and finally systems that make use of rich content, such as temporal information, text, or social networks.

Scalability issues are a major consideration when applying recommender systems in industrial or other "big data" settings. The systems we describe below are those specifically designed to address such concerns, through use of sparse data structures and efficient approximation schemes, and have been successfully applied to real-world applications including recommendation on Netflix (Bennett and Lanning 2007), Amazon (Linden et al. 2003), etc.

Preliminaries & Notation. We consider the scenario where users (U) interact with items (I), where "interactions" might describe purchases, clicks, likes, etc. (In certain instances (like friend recommendation) the "user" and "item" sets may be the same.) In this setting, we can describe users' interactions with items in terms of a (sparse) matrix:

A = \begin{pmatrix} 1 & 0 & \cdots & 1 \\ 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 1 & 0 & \cdots & 1 \end{pmatrix},   (1)

whose rows are indexed by users and whose columns are indexed by items, and where A_{ui} = 1 if and only if the user u interacted with the item i.
Recommender Systems, Table 1 Different types of recommender system (for a hypothetical fashion recommendation scenario). (U = user; I = item; F = feature space; I* = sequence of items)

Output type | Example/applications | Example input/output
f: U × I → I | Item-to-item recommendation & collaborative filtering | a user and a candidate item (shown pictorially in the original)
f: U × I → ℝ | Model-based recommendation (rating prediction) | a user, an item, and a predicted score (shown pictorially in the original)
f: U × I × F → ℝ | Content/context-aware recommendation | a user and an item together with features such as [color:red, size:12, price:$80] and [gender:f, location:billings-MT]
f: U × I* × I → ℝ | Temporal/sequence-aware recommendation | a user, recent purchases with timestamps (1/5/17, 4/7/17, 8/11/17), and a candidate item

Equivalently, we can describe interactions in terms of sets:

I_u = \{ i \mid A_{u,i} = 1 \}   (2)

U_i = \{ u \mid A_{u,i} = 1 \}.   (3)

Such data are referred to as implicit feedback, in the sense that we observe only what items users interacted with, rather than their preferences toward those items. In many cases, interactions may be associated with explicit feedback signals, e.g., numerical scores such as star ratings, which we can again describe using a matrix:

R = \begin{pmatrix} 4 & ? & \cdots & 3 \\ ? & ? & \cdots & ? \\ \vdots & & \ddots & \vdots \\ 2 & ? & \cdots & 1 \end{pmatrix}.   (4)

Note that the above matrix is partially observed, that is, we only observe ratings for those items the users interacted with. We can now describe recommender systems in terms of the above matrices, e.g., by estimating interactions A_{ui} that are likely to occur, or by predicting ratings R_{ui}.

Models

Item-to-Item Recommendation and Collaborative Filtering. Identifying relationships among items is a fundamental part of many real-world recommender systems, e.g., to generate recommendations of the form "people who like x also like y." To do so, a system must identify which items i and j are similar to each other.

In the simplest case, "similarity" might be measured by counting the overlap between the sets of users who interacted with the two items, e.g., via the Jaccard Similarity:

Jaccard(i, j) = \frac{|U_i \cap U_j|}{|U_i \cup U_j|}.   (5)

Note that this measure takes a value between 0 (if no users interacted with both items) and 1 (if exactly the same set of users interacted with both items). Where explicit feedback is available, we might instead measure the similarity between users' rating scores, e.g., via the Pearson Correlation:

Cor(i, j) = \frac{\sum_{u \in U_i \cap U_j} (R_{u,i} - \bar{R}_{\cdot,i})(R_{u,j} - \bar{R}_{\cdot,j})}{\sqrt{\sum_{u \in U_i \cap U_j} (R_{u,i} - \bar{R}_{\cdot,i})^2} \sqrt{\sum_{u \in U_i \cap U_j} (R_{u,j} - \bar{R}_{\cdot,j})^2}},   (6)

which takes a value from 1 (both items were rated by the same set of users, and those users had the same opinion polarity about them) to -1 (both items were rated by the same users, but those users had the opposite opinion polarity about them).
recommender systems, e.g., to generate identifying the items j (from some candidate set)
Recommender Systems 781

that are most similar to the item i currently being user's item's
considered: preferences properties

argmax Corði, jÞ ð7Þ


j

‘Model-Based’ Recommendation. Model- compatibility

based recommender systems attempt to estimate


user “preferences” and item “properties” so as to γu γi
directly optimize some objective, such as the error
incurred when predicting the rating r(u, i) when Recommender Systems, Fig. 1 Latent-factor models
describe users’ preferences and items’ properties in terms
the true rating is Ru,i, e.g., via the Mean Squared
of low-dimensional factors
Error (MSE):

1 X
ðr ðu, iÞ  Ru,i Þ2 : ð8Þ described as latent factor models or factorization-
j R j u, i  R based approaches.
Finally, the parameters must be optimized so as
A trivial form of model-based recommender to minimize the MSE:
might simply associate each item with a bias
term βi (how good is the item?) and each user a, b, g
with a bias term βu (how generous is the user 1 X
¼ argmin ða þ bu þ bi þ gu  gi  Ru,i Þ2 :
with their ratings?), so that ratings would be pre- a, b, g j R j u, i  R
dicted by ð11Þ

r ðu, iÞ ¼ a þ bu þ bi , ð9Þ This can be achieved via gradient descent, i.e.,


by computing the partial derivatives of Eq. (11)
where α is a global offset. A more complex system with respect to α, β, and γ, and updating the
might capture interactions between a user and an parameters iteratively.
item via multidimensional user and item terms:

r ðu, iÞ ¼ a þ bu þ bi þ gu  gi , ð10Þ Variants and Extensions

where γu and γi are low-rank matrices that describe Temporal Dynamics and Sequential Recom- R
interactions between the user u and the item i in mendation. Several works extend recommenda-
terms of the user’s preferences and the item’s tion models to make use of timestamps associated
properties. This idea is depicted in Fig. 1. The with feedback. For example, early similarity-
dimensions or “factors” that describe an item’s based methods (e.g., Ding and Li 2005) used
properties (γi) might include (for example) time-weighting schemes that assign decaying
whether a movie has good special effects, and weights to previously rated items when comput-
the corresponding user factor (γu) would capture ing similarities. More recent efforts are frequently
whether the user cares about special effects; their based on matrix factorization, where the goal is to
inner product then describes whether the user’s model and understand the historical evolution of
preferences are “compatible” with the item’s users and items, via temporally evolving offsets,
properties (and will thus give the movie a high biases, and latent factors (e.g., parameters βu(t)
rating). However, no “labels” are assigned to the and γu(t) become functions of the timestamp t).
factors; rather the dimensions are discovered sim- For example, the winning solution to the Netflix
ply by factorizing the matrix R in terms of the low- prize (Bennett and Lanning 2007) was largely
rank factors γi and γu. Thus, such models are based on a series of insights that extended matrix
782 Recommender Systems

factorization approaches to be temporally aware (Koren et al. 2009). Variants of temporal recommenders have been proposed that account for short-term bursts and long-term "drift," user evolution, etc.

Similarly, the order or sequence of activities that users perform can provide informative signals; for example, knowing what action was performed most recently provides context that can be used to predict the next action. This type of "first-order" relationship can be captured via a Markov relationship, which can be combined with factorization-based approaches (Rendle et al. 2010).

One-Class Collaborative Filtering. In many practical situations, explicit feedback (like ratings) is not observed, and instead only implicit feedback instances (like clicks, purchases, etc.) are available. Simply training factorization-based approaches on an implicit feedback matrix (A) proves ineffective, as doing so treats "missing" instances as being inherently negative, whereas these may simply be items that a user is unaware of, rather than items they explicitly dislike. The concept of One-Class Collaborative Filtering (OCCF) was introduced to deal with this scenario (Pan et al. 2008). Several variants exist, though a popular approach consists of sampling pairs of items i and i′ for each user u (where i was clicked/purchased and i′ was not) and maximizing an objective of the form

Σ_{(u, i, i′) ∈ sample} ln σ(r(u, i) − r(u, i′))   (12)

Optimizing such an objective encourages items i (with which the user is likely to interact) to have larger scores compared to items i′ (that they are unlikely to interact with) (Rendle et al. 2009).
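To make the pairwise objective in Eq. (12) concrete, the following R sketch evaluates it for a few sampled (u, i, i′) triples under a simple latent-factor score r(u, i) = γu · γi; the factor matrices and sampled triples are hypothetical and not part of the original entry.

# Illustrative sketch (hypothetical data): pairwise implicit-feedback objective, Eq. (12)
set.seed(1)
K <- 5                                   # number of latent dimensions
gamma_u <- matrix(rnorm(10 * K), 10, K)  # user factors (10 users)
gamma_i <- matrix(rnorm(20 * K), 20, K)  # item factors (20 items)

score <- function(u, i) sum(gamma_u[u, ] * gamma_i[i, ])  # r(u, i)

# Sampled triples: item i was clicked/purchased by user u, item i_neg was not
triples <- data.frame(u = c(1, 2, 3), i = c(4, 7, 2), i_neg = c(11, 15, 9))

objective <- sum(apply(triples, 1, function(t) {
  log(plogis(score(t["u"], t["i"]) - score(t["u"], t["i_neg"])))  # ln sigma(r(u,i) - r(u,i'))
}))
objective  # maximized (e.g., by gradient ascent) when clicked items outscore unclicked ones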
Content-Aware Recommendation. So far, the systems we have considered only make use of interaction data, but ignore features associated with the users and items being considered. Content-aware recommenders can improve the performance of traditional approaches, especially in "cold-start" situations where few interactions are associated with users and items. For example, suppose we are given binary features associated with a user (or equivalently an item), A(u). Then, we might fit a model of the form

r(u, i) = α + βu + βi + (γu + Σ_{a ∈ A(u)} ρa) · γi   (13)

where ρa is a vector of parameters associated with the ath attribute (Koren et al. 2009). Essentially, ρa in this setting determines how our estimate of the user's preference vector (γu) changes as a result of having observed the attribute a (which might correspond to a feature like age or location). Variants of such models exist that make use of rich and varied notions of "content," ranging from locations to text and images.
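As a rough illustration of Eq. (13), the R sketch below offsets a user's latent preference vector by attribute vectors before taking the inner product with the item factors; all names and values are hypothetical.

# Illustrative sketch (hypothetical data): content-aware score, Eq. (13)
K <- 5
alpha <- 0.1; beta_u <- 0.2; beta_i <- -0.1     # global, user, and item offsets
gamma_u <- rnorm(K); gamma_i <- rnorm(K)        # latent factors for one user and one item
rho <- list(age_25_34 = rnorm(K),               # one parameter vector per binary attribute
            location_US = rnorm(K))
A_u <- c("age_25_34", "location_US")            # attributes observed for this user

r_ui <- alpha + beta_u + beta_i +
  sum((gamma_u + Reduce(`+`, rho[A_u])) * gamma_i)
r_ui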
Rendle, S., Freudenthaler, C., & Schmidt-Thieme,
performance of traditional approaches, especially
L. (2010). Factorizing personalized Markov chains for
in “cold-start” situations where few interactions next-basket recommendation. In WWW. ACM. https://
are associated with users and items. dl.acm.org/citation.cfm?id=1772773.
Regression

Qinghua Yang
Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA

Regression is a statistical tool to estimate the relationship(s) between a dependent variable (y or outcome variable) and one or more independent variables (x or predicting variables; Fox 2008). More specifically, regression analysis helps in understanding the variation in a dependent variable using the variation in independent variables with other confounding variable(s) controlled. Regression analysis is widely used to make prediction and estimation of the conditional expectation of the dependent variable given the independent variables, where its use overlaps with the field of machine learning. Figure 1 shows how crime rate is related to residents' poverty level and predicts the crime rate of a specific community. We know from this regression that there is a positive linear relationship between the crime rate (y axis) and residents' poverty level (x axis). Given the poverty index of a specific community, we are able to make a prediction of the crime rate in that area.

Regression, Fig. 1 Linear regression of crime rate and residents' poverty level (y-axis: crime; x-axis: poverty_sqrt)

Linear Regression

The estimation target of regression is a function that predicts the dependent variable based upon values of the independent variables, which is called the regression function. For simple linear regressions, the function can be represented as yi = α + βxi + εi. The function of multiple linear regressions is yi = β0 + β1x1 + β2x2 + … + βkxk + εi, where k is the number of independent variables. The regression estimation using ordinary least squares (OLS) selects the line with the lowest total sum of squared residuals. The proportion of total variation (SST) that is explained by the regression (SSR) is known as the coefficient of determination, often referred to as R2, a value ranging between 0 and 1 with a higher value indicating a better regression model (Keith 2015).
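A minimal R illustration of an OLS fit, R2, and prediction, using simulated data whose variable names mirror Fig. 1 (the numbers themselves are made up):

# Simulated data loosely mirroring Fig. 1 (values are hypothetical)
set.seed(42)
poverty <- rnorm(100)
crime   <- 10 + 25 * poverty + rnorm(100, sd = 8)

fit <- lm(crime ~ poverty)          # OLS: minimizes the total sum of squared residuals
summary(fit)$r.squared              # R^2 = SSR / SST, between 0 and 1
predict(fit, newdata = data.frame(poverty = 0.5))  # predicted crime rate for a given poverty index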
Nonlinear Regression

In the real world, there are many more nonlinear functions than linear ones. For example, the relationship between x and y can be fitted in a quadratic function shown in Figure 2. There are in general two ways to deal with nonlinear models. First, nonlinear models can be approximated with linear functions. Both nonlinear functions in Figure 2 can be approximated by two linear functions according to the slope: the first linear regression function is from the beginning of the semester to the final exam, and the second function is from the final to the end of the semester. Similarly, regarding cubic, quartic, and more complicated regressions, they can also be approximated with a sequence of linear functions. However, analyzing nonlinear models in this way can produce much residual and leave considerable variance unexplained. The second way is considered better than the first one from this aspect, by including nonlinear terms in the regression function as ŷ = α + β1x + β2x². As the graph of a quadratic function is a parabola, if β2 < 0, the parabola opens downward, and if β2 > 0, the parabola opens upward. Instead of having x² in the model, the nonlinearity can also be presented in many other ways, such as √x, ln(x), sin(x), cos(x), and so on. However, which nonlinear model to choose should be based on both theory or former research and the R2.
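The second approach, adding nonlinear terms to the regression function, can be written directly in R's model formulas; the data below are simulated purely for illustration:

# Simulated quadratic relationship; data are hypothetical
set.seed(7)
x <- seq(0, 10, length.out = 100)
y <- 3 + 2 * x - 0.4 * x^2 + rnorm(100)

fit_linear    <- lm(y ~ x)              # first approach: a single linear approximation
fit_quadratic <- lm(y ~ x + I(x^2))     # second approach: y-hat = a + b1*x + b2*x^2
c(linear = summary(fit_linear)$r.squared,
  quadratic = summary(fit_quadratic)$r.squared)  # compare the R^2 of the two models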

Regression, Anxiety
Fig. 2 Nonlinear
regression models

Semester Mid-term Final Semester


begins ends

Confidence in
the Subject

Semester Mid-term Final Semester


begins ends
regression, we predict the odds or log-odds (logit) that a certain condition will or will not happen. Odds range from 0 to infinity and are a ratio of the chance of an event (p) divided by the chance of the event not happening, that is, p/(1 − p). Log-odds (logits) are transformed odds, ln[p/(1 − p)], and range from negative to positive infinity. The relationship predicting probability using x follows an S-shaped curve as shown in Figure 3. The shape of this curve is called a "logistic curve." It is defined as p(yi) = exp(β0 + β1xi + εi) / (1 + exp(β0 + β1xi + εi)). In this logistic regression, the value predicted by the equation is a log-odds or logit. This means when we run logistic regression and get coefficients, the values the equation produces are logits. Odds are computed as exp(logit), and probability is computed as exp(logit) / (1 + exp(logit)). Another model used to predict binary outcomes is the probit model, with the difference between logistic and probit models lying in the assumption about the distribution of errors: while the logit model assumes a standard logistic distribution of errors, the probit model assumes a normal distribution of errors (Chumney & Simpson 2006). Despite the difference in assumption, the predictive results using these two models are very similar. When the outcome variable has multiple categories, multinomial logistic regression or ordered logistic regression should be implemented depending on whether the dependent variable is nominal or ordinal.
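In R, a logistic model of this kind can be fit with glm(); the pass/fail data below are simulated to echo Fig. 3:

# Simulated dichotomous outcome (pass/fail); data are hypothetical
set.seed(3)
x <- runif(200, 0, 10)
pass <- rbinom(200, 1, plogis(-4 + 0.9 * x))

fit <- glm(pass ~ x, family = binomial)    # coefficients are on the log-odds (logit) scale
logit <- predict(fit, newdata = data.frame(x = 6))
exp(logit)                                 # odds
plogis(logit)                              # probability = exp(logit) / (1 + exp(logit))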

Regression, Fig. 3 Logistic regression models (y-axis: pass; x-axis: X)

Regression in Big Data

Due to the advanced technologies that have been increasingly used in data collection and the vast amount of user-generated data, the amount of data will continue to increase at a rapid pace, along with a growing accumulation of scholarly works. The explosion of knowledge makes big data one of the new research frontiers, with an extensive number of application areas affected by big data, such as public health, social science, finance, geography, and so on. The high volume and complex structure of big data bring statisticians both opportunities and challenges. Generally speaking, big data is a collection of large-scale and complex data sets that are difficult to process and analyze using traditional data analytic tools. Inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, including supervised and unsupervised statistical
learning (James, Witten, Hastie, & Tibshirani, 2013). Supervised statistical learning refers to a set of approaches for estimating the function f based on the observed data points, to understand the relationship between Y and X = (X1, X2, . . ., XP), which can be represented as Y = f(X) + ε. Since the two main purposes for the estimation are to make prediction and inference, which regression modeling is widely used for, many classical statistical learning methods use regression models, such as linear, nonlinear, and logistic regression, with the selection of the specific regression model based on research question and data structure. In contrast, for unsupervised statistical learning, there is no response variable to predict for every observation that can supervise our analysis (James et al. 2013). Additionally, more methods have been recently developed, such as Bayesian approaches and Markov chain Monte Carlo (MCMC). The Bayesian approach, distinct from the frequentist approach, treats model parameters as random and models them via distributions. MCMC refers to statistical sampling investigations that involve sample data generation to obtain empirical sampling distributions based on constructing a Markov chain that has the desired distribution (Bandalos & Leite 2013).
Cross-References

▶ Data Mining
▶ Data Mining Algorithms
▶ Machine Learning
▶ Statistics

Further Reading

Bandalos, D. L., & Leite, W. (2013). Use of Monte Carlo studies in structural equation modeling research. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 625–666). Charlotte, NC: Information Age Publishing.
Chumney, E. C., & Simpson, K. N. (2006). Methods and designs for outcomes research. Bethesda, MD: ASHP.
Fox, J. (2008). Applied regression analysis and generalized linear models. Thousand Oaks, CA: Sage.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). New York, NY: Springer.
Keith, T. Z. (2015). Multiple regression and beyond: An introduction to multiple regression and structural equation modeling. New York, NY: Routledge.

Regulation

Christopher Round
George Mason University, Fairfax, VA, USA
Booz Allen Hamilton, Inc., McLean, VA, USA

Synonyms

Governance instrument; Policy; Rule

Regulations may be issued for different reasons. Regulations may be issued to address collective desires, diversify or limit social experiences, perform interest group transfers, or address market failures. Government regulations can be used to address market failures such as negative externalities. This allows for the validation of the base assumptions economists believe are necessary to establish for a free market to operate. Regulations may also be used to codify behavior or norms that are deemed by the organization issuing the regulation as beneficial.

Regulations are designed by their issuing body to address a target issue. There is a wide variety in their potential design. Regulations can be direct or indirect, downstream or upstream, and can be influenced by outside concerns such as who will be indirectly impacted. Direct regulations aim to address the issue at hand by fitting the regulation as closely as possible to the issue. For example, a direct regulation on pollution emissions would issue some form of limit on the pollution release (e.g., limiting greenhouse gas emissions to level X from a power plant). An indirect regulation seeks to address an issue by impacting a related issue. For example, a regulation improving gas mileage for vehicles would indirectly reduce greenhouse gas emissions. Regulations can be
downstream or upstream (Kolstad 2010). An upstream regulation seeks to influence decision-making in relation to an issue by affecting the source of the issue (typically the producer). A downstream regulation seeks to influence an issue by changing the behavior of individuals or organizations who have influence on the originator of the issue (typically the consumers). For example, an upstream regulation on greenhouse gas emissions may impact fossil fuel production. A downstream regulation could be a limitation on fossil fuel purchases by consumers. Indirect factors such as the burden of cost of the regulation and who bears it will influence questions of regulation design (Kolstad 2010).

Regulations can take different forms based on the philosophical approach and direct and indirect considerations of decision-makers (Cole and Grossman 1999; Kolstad 2010). Command and control regulations provide a prescription for choices by the regulated community, such as a limit on the number of taxi medallions or a nightly curfew (Cole and Grossman 1999; Kolstad 2010). Technical specification regulations are a form of command and control regulation dictating what technology may be used for a product (Cole and Grossman 1999; Kolstad 2010). Regulations may take the form of market mechanisms, such as a penalty or subsidy to influence the behavior of actors in a market (Cole and Grossman 1999; Kolstad 2010).

Regulations may be issued with an ulterior agenda rather than to serve the general population represented by a governing body. Regulatory capture is a diagnosis of a regulating body in which the regulating body is serving a special interest over the interests of the wider population it impacts and is a form of corruption (Carpenter and Moss 2014a; Levine and Forrence 1990). This can be done to entrench the power or economic interests of a specific group, to manipulate markets, or to weaken or strengthen regulations to benefit a specific interest.

Regulatory capture can take two forms: cultural and material capture. Cultural capture occurs when the norms and preferences of the regulated community over time permeate into the regulating body and influence it to make decisions considered friendly by special interests within the regulated community (Carpenter & Moss, 2014a). Material capture is a form of principal–agent interaction involving a special interest; material regulatory capture can only be diagnosed if there is demonstrable proof that a regulation issued originated from a third party (Carpenter and Moss 2014a, b; Levine and Forrence 1990; Susan Webb Yackee 2014).

Big data itself is subject to multiple regulations depending on the information it contains and the location of the entity responsible for it. Data containing personally identifiable information (PII) is of particular concern, especially if it contains information that individuals may wish to keep private such as their medical history. In Europe, big data is regulated under the General Data Protection Regulation (GDPR) (European Parliament and Council 2018). Within the USA, data is regulated by entities at different levels of governance with no single overarching legal overview (Chabinsky and Pittman 2019). Thus, individuals and organizations utilizing big data in the USA will need to consult with local rules and subject matter–based regulations in order to ensure compliance. At the federal level, the US Federal Trade Commission is tasked with enforcing federal privacy and data protection regulations. Specific types of data are regulated under different legal authorities such as medical data, which is regulated under the Health Insurance Portability and Accountability Act. Major state-level laws include the California Consumer Privacy Act.

Further Reading

Carpenter, D. P., & Moss, D. A. (2014a). Introduction. In D. P. Carpenter & D. A. Moss (Eds.), Preventing regulatory capture: Special interest influence and how to limit it (pp. 1–22). Cambridge: Cambridge University Press.
Carpenter, D. P., & Moss, D. A. (Eds.). (2014b). Preventing regulatory capture: Special interest influence and how to limit it. Cambridge: Cambridge University Press.
Chabinsky, S., & Pittman, F. P. (2019, March 7). USA Data Protection 2019 (United Kingdom) [Text]. International Comparative Legal Guides International
Business Reports; Global Legal Group. https://iclg.com/practice-areas/data-protection-laws-and-regulations/usa.
Cole, D. H., & Grossman, P. Z. (1999). When is command-and-control efficient? Institutions, technology, and the comparative efficiency of alternative regulatory regimes for environmental protection. Articles by Maurer Faculty, Paper 590.
European Parliament and Council. (2018). Regulation (EU) 2016/679 of the European Parliament and of the Council – of 27 April 2016 – On the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union Law, 119, 1–80.
Kolstad, C. (2010). Environmental economics (2nd ed.). New York: Oxford University Press.
Levine, M. E., & Forrence, J. L. (1990). Regulatory capture, public interest, and the public agenda: Toward a synthesis. Journal of Law, Economics, and Organization, 6, 167–198.
Merriam-Webster. (2018). Definition of REGULATION. Merriam-Webster.Com. Retrieved July 31, 2018, from https://www.merriam-webster.com/dictionary/regulation.
Susan Webb Yackee. (2014). Reconsidering agency capture during regulatory policymaking. In D. P. Carpenter & D. A. Moss (Eds.), Preventing regulatory capture: Special interest influence and how to limit it (pp. 292–325). Cambridge University Press. https://www.tobinproject.org/sites/tobinproject.org/files/assets/Kwak%20-%20Cultural%20Capture%20and%20the%20Financial%20Crisis.pdf.
Visseren-Hamakers, I. J. (2015). Integrative environmental governance: Enhancing governance in the era of synergies. Current Opinion in Environmental Sustainability, 14, 136–143. https://doi.org/10.1016/j.cosust.2015.05.008.

Relational Data Analytics

▶ Link/Graph Mining

Religion

Matthew Pittman and Kim Sheehan
School of Journalism & Communication, University of Oregon, Eugene, OR, USA

In his work on the changing nature of religion in our modern mediated age, Stewart Hoover notes that religion today is much more commodified, therapeutic, public, and personalized than it has been for most of history. He also notes that, because media are coming together to create an environment in which our personal projects of identity, meaning, and self are worked out, religion and media are actually converging. As more people around the globe obtain devices capable of accessing the Internet, their everyday religious practices are leaving digital traces for interested companies and institutions to pick up on. The age of big data is usually thought to affect institutions like education, mass media, or law, but religion is undergoing dynamic shifts as well.

Though religious practice was thought to be in decline through the end of the twentieth century, there has been a resurgence of interest through the beginning of the 21st. A Google NGram viewer (which keeps track of a word's frequency in published books and general literature over time) shows that "data" surpassed "God" for the first time in 1973. Yet, by about 2004, God once again overtook data (and its synonym "information"), indicating that despite incredible scientific and technological advances, people still wrestle with spiritual or existential matters.

While the term "big data" seems commonplace now, it is a fairly recent development. Several researchers and authors claim to have coined the term, but its modern usage took off in the mid-1990s and only really became mainstream in 2012 when the White House and the Davos World Economic Forum identified it as a serious issue worth tackling. Big data is a broad term, but generally has two main precepts: humans are now producing information at an unprecedented rate, and new methods of analysis are needed to make sense of that information. Religious practices are changing in both of these areas. Faith-based activity is creating new data streams even as churches, temples, and mosques are figuring out what to do with all that data. On an institutional level, the age of big data is giving religious groups new ways to learn about the individuals who adhere to their teachings. On an individual level, technology is changing how people across the globe learn about, discuss, and practice their faiths.
Institutional Religion

It is now common for religious institutions to use digital technology to reach their believers. Like any other business or group that needs members to survive, most seek to utilize or leverage new devices and trends into opportunities to strengthen existing members or recruit potential new ones. Of course, depending on a religion's stance toward culture, they may (like the Amish) eschew some technology. However, for most mosques, churches, and synagogues, it has become standard for each to have its own website or Facebook page. Email newsletters and Twitter feeds have replaced traditional newsletters and event reminders.

New opportunities are constantly emerging that create novel space for leaders to engage practitioners. Religious leaders can communicate directly with followers through social media, adding a personal touch to digital messages, which can sometimes feel distant or cold. Rabbi Shmuley Boteach, "America's Rabbi," has 29 best-selling books but often communicates daily through his Twitter account, which has over a hundred thousand followers. On the flip side, people can thoroughly vet potential religious leaders or organizations before committing to them. If concerned that a particular group's ideology might not align with one's own, a quick Internet search or trip to the group's website should identify any potential conflicts. In this way, providing data about their identity and beliefs helps religious groups differentiate themselves.

In a sense, big data makes it possible for religious institutions to function more like – and take their cues from – commercial enterprises. Tracking streams of information about its followers can help religious groups be more in tune with the wants and needs of these "customers." Some religious organizations implement the retail practice of "tweets and seats": by ensuring that members always have available places to sit, rest, or hang out, and that wifi (wireless Internet connectivity) is always accessible, they hope to keep people present and engaged. Not all congregations embrace this change, but the clear cultural trend is toward ubiquitous smart phone connectivity. Religious groups that take advantage of this may provide several benefits to their followers: members could immediately identify and download any worship music being played; interested members could look up information about a local religious leader; members could sign up for events and groups as they are announced in the service; or those using online scripture software could access texts and take notes. These are just a few possibilities.

There are other ways religious groups can harness big data. Some churches have begun analyzing liturgies to assess and track length and content over time. For example, a dip in attendance during a given month might be linked to the sermons being 40% longer in that same time frame. Many churches make their budgets available to members for the sake of transparency, and in a digital age it is not difficult to create financial records that are clear and accessible to laypeople. Finally, learning from a congregant's social media profiles and personal information, a church might remind a parishioner of her daughter's upcoming birthday, the approaching deadline for an application to a family retreat, or when other congregants are attending a sporting event of which she is a fan. The risk of overstepping boundaries is real and, just like with Facebook or similar entities, privacy settings should be negotiated beforehand.

As with other commercial entities, religious institutions utilizing big data must learn to differentiate information they need from information they don't. The sheer volume of available data makes distinguishing desired signal from irrelevant noise an increasingly important task. Random correlations may lead to false positive causation. A mosque may benefit from learning that members with the highest income are not actually its biggest givers, or from testing for a relationship between how far away its members live and how often they attend. Each religious group must determine how big data may or may not benefit its operation in any given endeavor, and the opportunities are growing.
Individual Religion

The everyday practice of religion is becoming easier to track as it increasingly utilizes digital technology. A religious individual's personal blog, Twitter feed, or Facebook profile keeps a record of his or her activity or beliefs, making it relatively easy for any interested entity to track online behavior over time. Producers and advertisers use this data to promote products, events, or websites to people who might be interested. Currently companies like Amazon have more incentive than, say, a local synagogue in keeping tabs on what websites one visits, but the potential exists for religious groups to access the same data that Facebook, Amazon, Google, etc. already possess.

Culturally progressive religious groups anticipate mutually beneficial scenarios: they provide a data service that benefits personal spiritual growth, and in turn the members generate fields of data that are of great value to the group. A Sikh coalition created the FlyRights app in 2012 to help with quick reporting of discriminatory TSA profiling while travelling. The Muslim's Prayer Times app provides a compass, calendar (with moon phases), and reminders for Muslims about when and in what direction to pray. Apple's app store has also had to ban other apps from fringe religious groups or individuals for being too irreverent or offensive.

The most popular religious app to date simply provides access to scripture. In 2008 LifeChurch.tv launched "the Bible app," also called YouVersion, and it currently has over 151 million installations worldwide on smartphones and tablets. Users can access scripture (in over 90 different translations) while online or download it for access offline. An audio recording of each chapter being read aloud can also be downloaded for some of the translations. A user can search through scripture by keyword, phrase, or book of the Bible, or there are reading plans of varying levels of intensity and access to related videos or movies. A "live" option lets users search out churches and events in surrounding geographic areas, and a sharing option lets users promote the app, post to social media what they have read, or share personal notes directly to friends. The digital highlights or notes made, even when using the app offline, will later upload to one's account and remain in one's digital "bible" permanently.

All this activity has generated copious amounts of data for YouVersion's producers. In addition to using the data to improve their product, they also released it to the public. This kind of insight into the personal religious behavior of so many individuals is unprecedented. With over a billion opens and/or uses, YouVersion statistically proved several phenomena. The data demonstrated the most frequent activity for users is looking up a favorite verse for encouragement. Despite the stereotype of shirtless men at football games, the most popular verse was not John 3:16, but Philippians 4:13: "I can do all things through him who gives me strength." Religious adherents have always claimed that their faith gives them strength and hope, but big data has now provided a brief insight into one concrete way this actually happens.

The YouVersion data also reveal that people used the bible to make a point in social media. Verses were sought out and shared in an attempt to support views on marriage equality, gender roles, or other divisive topics. Tracking how individuals claim to have their beliefs supported by scripture may help religious leaders learn more about how these beliefs are formed, how they change over time, and which interpretations of scripture are most influential. Finally, YouVersion data reveal that Christian users like verses with simple messages, but chapters with profound ideas. Verses are easier to memorize when they are short and unique, but when engaging in sustained reading, believers prefer chapters with more depth. Whether large data sets confirm suspicions or shatter expectations, they continue to change the way religion is practiced and understood.

Numerous or Numinous

In the past, spiritual individuals had a few religions to choose from, but the globalizing force of technology has dramatically increased the
available options. While the three big monotheisms (Christianity, Judaism, and Islam) and pan/polytheisms (Hinduism and Buddhism) are still the most popular, the Internet has made it possible for people of any faith, sect, or belief to find each other and validate their practice. Though pluralism is not embraced in every culture, there is at least increasing awareness of the many ways religion is practiced across the globe.

Additionally, more and more people are identifying themselves as "spiritual but not religious," indicating a desire to seek out spiritual experiences and questions outside the confines of a traditional religion. Thus for discursive activities centered on religion, Daniel Stout advocates the use of another term in addition to "religion": numinous. Because "religious" can have negative or limiting connotations, looking for the "numinous" in cultural texts or trends can broaden the search for and dialogue about a given topic. To be numinous, something must meet several criteria: stir deep feeling (affect), spark belief (cognition), include ritual (behavior), and be done with fellow believers (community). This four-part framework is a helpful tool for identification of numinous activity in a society where it once might have been labeled "religious."

By this definition, the Internet (in general) and entertainment media (in particular) all contain numinous potential. The flexibility of the Internet makes it relevant to the needs of most; while the authority of some of its sources can be dubious, the ease of social networking and multi-mediated experiences provides all the elements of traditional religion (community, ritual, belief, feeling). Entertainment media, which produce at least as much data as – and may be indistinguishable from – religious media, emphasize universal truths through storytelling. The growing opportunities of big data (and its practical analysis) will continue to offer new possibilities for those who engage in numinous and religious behavior.

Cross-References

▶ Data Monetization
▶ Entertainment

Further Reading

Campbell, H. A. (Ed.). (2012). Digital religion: Understanding religious practice in new media worlds. Abingdon: Routledge.
Hjarvard, S. (2008). The mediatization of religion: A theory of the media as agents of religious change. Northern Lights: Film & Media Studies Yearbook, 6(1), 9–26.
Hoover, S. M., & Lundby, K. (Eds.). (1997). Rethinking media, religion, and culture (Vol. 23). Thousand Oaks: Sage.
Kuruvilla, C. Religious mobile apps changing the faith-based landscape in America. Retrieved from http://www.nydailynews.com/news/national/gutenberg-moment-mobile-apps-changing-america-religious-landscape-article-1.1527004. Accessed Sep 2014.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
Taylor, B. (2008). Entertainment theology (cultural exegesis): New-edge spirituality in a digital democracy. Baker Books.

Risk Analysis

Jonathan Z. Bakdash
Human Research and Engineering Directorate, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD, USA

Definition and Introduction

Society is becoming increasingly interconnected with networks linking people, the environment, information, and technology. This rising complexity is a challenge for risk analysis. Risk analysis is the identification and evaluation of the probability of an adverse outcome, its associated risk factors, and the potential impact if that outcome occurs. Successfully modeling risk within interdependent and complex systems requires access to considerably more data than traditional, simple risk models. The increasing availability of big data offers enormous promise for improving risk analysis through more detailed, comprehensive, faster, and accurate predictions of risks and their impacts than small data alone.
However, risk analysis is not purely a computational challenge that can be solved by more data. Big data does not eliminate the importance of data quality and modeling assumptions; it is not necessarily a replacement for small data. Furthermore, traditional risk analysis methods typically underestimate the probability and impact of risks (e.g., terrorist attacks, power failures, and natural disasters such as hurricanes) because normal data and independent observations are assumed. Traditional methods also typically do not account for cascading failures, which are not uncommon in complex systems. For example, a hurricane may cause a power failure, which in turn results in flooding.

The blessing and curse of risk analysis with big data are illustrated by the example of Google Flu Trends (GFT). Initially, it was highly successful in estimating flu rates in real time, but over time it became inaccurate due to external factors, lack of continued validation, and incorrect modeling assumptions.

Interdependencies

Globalization and advances in technology have led to highly networked and interdependent social, economic, political, natural, and technological systems (Helbing 2013). Strong interdependencies are potentially dangerous because small or gradual changes in a single system can cause cascading failures throughout multiple systems. For example, climate change is associated with food availability, food availability with economic disparity, and economic disparity with war. In interconnected systems, risks often spread quickly in a cascading process, so early detection and mitigation of risks is critical to stopping failures before they become uncontrollable. Helbing (2013) contends that big data is necessary to model risks in interconnected and complex systems: Capturing interdependent dynamics and other properties of systems requires vast amounts of heterogeneous data over space and time.

Interdependencies are also critical to risk analysis because even when risks are mitigated, they may still cause amplifying negative effects because of human risk perception. Perceived risk is the public social, political, and economic impacts of unrealized (and realized) risks. An example of the impact of a perceived risk is the nuclear power accident at Three-Mile Island. In this accident, minimal radiation was released so the real risk was mitigated. Nevertheless, the near miss of a nuclear meltdown had immense social and political consequences that continue to negatively impact the nuclear power industry in the United States. The realized consequences of perceived risk mean that "real" risk should not necessarily be separated from "perceived" risk.

Data: Quality and Sources

Many of the analysis challenges for big data are not unique but are pertinent to analysis of all data (Lazer et al. 2014). Regardless of the size of the dataset, it is important for analysts and policymakers to understand how, why, when, and where the data were collected and what the data contain and do not contain. Big data may be "poor data" because rules, causality, and outcomes are far less clear compared to small data.

More specifically, Vose (2008) describes the quality of data characteristics for risk analysis. The highest quality data are obtained using a large sample of direct and independent measurements collected and analyzed using established best practices over a long period of time and continually validated to correct data for errors. The second highest quality data use proxy measures, a widely used method for collection, analysis, and some validation. Other characteristics of decreasing data quality are a smaller sample of objective data, agreement among multiple experts, and a single expert opinion, with speculation being the weakest. While there may be some situations in which expert opinions are the only data source, general findings indicate this type of data has poor predictive accuracy. Additional reasons to question experts are situations or systems with a large number of unknown factors and potentially catastrophic impacts for erroneous estimations. Big data can be an improvement over small data
and one or several expert opinions. However, volume is not necessarily the same as quality. Multidimensional aspects of data quality, whether the data are big or small, should always be considered.

Risk Analysis Methods

Vose (2008) explains the general techniques for conducting risk analysis. A common, descriptive method for risk analysis is Probability-Impact (P-I). P-I is the probability of a risk occurring multiplied by the impact of the risk if it materializes: Probability × Impact = Weighted Risk. All values may be either qualitative (e.g., low, medium, and high likelihood or severity) or quantitative (e.g., 10% or one million dollars). The Probability may be a single value or multiple values such as a distribution of probabilities. The Impact may also be a single value or multiple values and is usually expressed as money. A similar weighted model to P-I, Threat × Vulnerability × Consequence = Risk, is frequently used in risk analysis. However, a significant weakness with P-I and related models with fixed values is that they tend to systematically underestimate the probability and impact of rare events that are interconnected, such as natural hazards (e.g., floods), protection of infrastructure (e.g., power grid), and terrorist attacks. Nevertheless, the P-I method can be effective for quick risk assessments.
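The P-I calculation is simple enough to sketch directly in R; the hazards, probabilities, and dollar impacts below are invented purely for illustration:

# Hypothetical Probability-Impact table
risks <- data.frame(
  hazard      = c("power failure", "flood", "data breach"),
  probability = c(0.10, 0.02, 0.05),          # chance of the risk occurring
  impact      = c(2e5, 5e6, 1e6)              # cost in dollars if it materializes
)
risks$weighted_risk <- risks$probability * risks$impact   # Probability x Impact = Weighted Risk
risks[order(-risks$weighted_risk), ]                      # rank hazards for a quick assessment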
Probabilistic Risk Assessment
P-I is a foundation for Probabilistic Risk Assessment (PRA), an evaluation of the probabilities for multiple potential risks and their respective impacts. The US Army's standardized risk matrix is an example of qualitative PRA, see Fig. 1 (also see Level 5 of risk analysis below).

The risk matrix is constructed by:

Step 1: Identifying possible hazards (i.e., potential risks)
Step 2: Estimating the probabilities and impacts of each risk and using the P-Is to categorize weighted risk

Risk Analysis, Fig. 1 Risk analysis (Source: Safety Risk Management, Pamphlet 385-30 (Headquarters, Department of the Army, 2014, p. 8): www.apd.army.mil/pdffiles/p385_30.pdf)

Risk analysis informs risk reduction, but they are not one and the same. After the risk matrix is constructed, appropriate risk tolerance and mitigation strategies are considered. The last step is ongoing supervision and evaluation of risk as conditions and information change, updating the risk matrix as needed, and providing feedback to improve the accuracy of future risk matrices.

Other widely used techniques include inferential statistical tests (e.g., regression) and the more comprehensive approach of what-if data simulations, which are also used in catastrophe modeling. Big data may improve the accuracy of probability and impact estimates, particularly the upper bounds in catastrophe modeling, leading to more accurate risk analysis.

From a statistical perspective, uncertainty and variability tend to be interchangeable. If uncertainty can be attributed to random variability, there is no distinction. However, in risk analysis, uncertainty can arise from incomplete knowledge (Paté-Cornell 1996). Uncertainty in risk may be due to a lack of data (particularly for rare events), not knowing relevant risks and/or impacts, and unknown interdependencies among risks and/or impacts.

Levels of Risk Analysis
There are six levels for understanding uncertainty, ranging from qualitative identification of risk factors (Level 0) to multiple risk curves constructed using different PRAs (Level 5) (Paté-Cornell 1996). Big data are relevant to Level 2 and beyond. The specific levels are as follows (adapted from Paté-Cornell 1996):

Level 0: Identification of a hazard or failure modes. Level 0 is primarily qualitative. For example, does exposure to a chemical increase the risk of cancer?
Level 1: Worst case. Level 1 is also qualitative, with no explicit probability. For example, if individuals are exposed to a cancer-causing chemical, what is the highest number that could develop cancer?
Level 2: Quasi-worst case (probabilistic upper-bound). Level 2 introduces subjective estimation of probability based on reasonable expectation(s). Using the example from Level 1, this
could be the 95th percentile for the number of individuals developing cancer.
Level 3: Best and central estimates. Rather than a worst case, Level 3 aims to model the most likely impact using central values (e.g., mean or median).
Level 4: Single-curve PRA. Previous levels were point estimates of risk; Level 4 is a type of PRA. For example, what is the number of individuals that will develop cancer across a probability distribution?
Level 5: Multiple-curve PRA. Level 5 has more than one probabilistic risk curve. Using the cancer risk example, different probabilities from distinct data can be represented using multiple curves, which are then combined using the average or another measure. A generic example of Level 5, for qualitative values, was illustrated with the above risk matrix. When implemented quantitatively, Level 5 is similar to what-if simulations in catastrophe modeling.
Catastrophe Modeling

Big data may improve risk analysis at Level 2 and above but may be particularly informative for modeling multiple risks at Level 5. Using catastrophe modeling, big data can allow for a more comprehensive analysis of the combinations of P-Is while taking into account interdependences among systems. Catastrophe modeling involves running a large number of simulations to construct a landscape of risk probabilities and their impacts for events such as terrorist attacks, natural disasters, and economic failures. Insurance, finance, other industries, and governments are increasingly relying on big data to identify and mitigate interconnected risks using catastrophe modeling.

Beiser (2008) describes the high level of data detail in catastrophe modeling. For risk analysis of a terrorist attack in a particular location, interconnected variables taken into account may include the proximity to high-profile targets (e.g., government buildings, airports, and landmarks), the city, and details of the surrounding buildings (e.g., construction materials), as well as the potential size and impact of an attack. Simulations are run under different assumptions, including the likelihood of acquiring materials to carry out a particular type of attack (e.g., a conventional bomb versus a biological weapon) and the probability of detecting the acquisition of such materials. Big data is informative for the wide range of possible outcomes and their impacts in terms of projected loss of life and property damage. However, risk analysis methods are only as good as their assumptions, regardless of the amount of data.
data. Lazer et al. (2014) called the inaccuracy of
GFT a parable for big data, highlighting several
Assumptions: Cascading Failures key points. First, a key cause for the misestimates
Even with big data, risk analysis can be flawed was that the algorithm assumed that influences on
due to inappropriate model assumptions. In the search patterns were the same over time and pri-
case of Hurricane Katrina, the model assumptions marily driven by the onset of flu symptoms. In
for a Category 3 hurricane did specify a large, reality, searches were likely influenced by exter-
slow-moving storm system with heavy rainfall nal events such as media reporting of a possible
nor did they account for the interdependencies in flu pandemic, seasonal increases in searches for
infrastructure systems. This storm caused early cold symptoms that were similar to flu symptoms,
loss of electrical power, so many of the pumping and the introduction of suggestions in Google
stations for levees could not operate. Conse- Search. Therefore, GFT wrongly assumed the
quently, water overflowed, causing breaches, data were stationary (i.e., no trends or changes in
resulting in widespread flooding. Because of the mean and variance of data over time). Second,
Google did not provide sufficient information for understanding the analysis, such as all selected search terms and access to the raw data and algorithms. Third, big data is not necessarily a replacement for small data. Critically, the increased volume of data does not necessarily make it the highest quality source. Despite these issues, GFT was at the second highest level of data quality using criteria from Vose (2008) because GFT initially used:

1. Proxy measures: search terms originally correlated with local flu reports over a finite period of time
2. A common method: search terms used for Internet advertising, while disease surveillance was novel (with limited validation)

In the case of GFT, the combination of big and small data, by continuously recalibrating the algorithms for the big data using the small (surveillance) data, would have been much more accurate than either alone. Moreover, big data can make powerful predictions that are impossible with small data alone. For example, GFT could provide estimates of flu prevalence in local geographic areas using detailed spatial and temporal information from searches; this would be impossible with only the aggregated traditional surveillance data.

Conclusions

Similar to GFT, many popular techniques for analyzing big data use data mining to automatically uncover hidden structures. Data mining techniques are valuable for identifying patterns in big data but should be interpreted with caution. The dimensions of big data do not obviate considerations of data quality, the need for continuous validation, and the importance of modeling assumptions (e.g., non-normality, non-stationarity, and non-independence). While big data has enormous potential to improve the accuracy and insights of risk analysis, particularly for interdependent systems, it is not necessarily a replacement for small data.

Cross-References

▶ Complex Networks
▶ Financial Data and Trend Prediction
▶ Google Flu
▶ "Small" Data

References

Beiser, V. (2008). Pricing terrorism: Insurers gauge risks, costs. Wired. Permanent link: http://web.archive.org/save/_embed/http://www.wired.com/2008/06/pb-terrorism/.
Helbing, D. (2013). Globally networked risks and how to respond. Nature, 497(7447), 51–59. doi:10.1038/nature12047.
Lazer, D. M., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google flu: Traps in big data analysis. Science, 343(6176), 1203–1206. doi:10.1126/science.1248506.
Paté-Cornell, M. E. (1996). Uncertainties in risk analysis: Six levels of treatment. Reliability Engineering & System Safety, 54(2), 95–111. doi:10.1016/S0951-8320(96)00067-1.
Vose, D. (2008). Risk analysis: A quantitative guide (3rd ed.). West Sussex: Wiley.

R-Programming

Anamaria Berea
Department of Computational and Data Sciences, George Mason University, Fairfax, VA, USA
Center for Complexity in Business, University of Maryland, College Park, MD, USA

R is an open-source software programming language and software environment for statistical computing and graphics that is based on object-oriented programming (R Core Team 2016). Originally, R was an implementation of the S-programming language and it has been extended with various packages, functions, and extensions. There is a large R-community of users and developers who are continuously contributing to the development of R (Muenchen 2012). R is available under the GNU General Public License. One of the most used online forums of the R-community is
replacement for small data. most used online forums of the R-community is
R-Programming 797

Stack Overflow, and one of the most used online processes than batch data, R can be used for data
blogs is r-bloggers (http://www.r-bloggers.com). analytics and visualizations as R-server. The
As of May 2017, there were included more R-server can be connected to other databases and
than 10,500 additional packages and 120,000 run the analytics and visualizations either through
functions with the installation of R. These are direct ODBC connections or through reading
available at the Comprehensive R Archive Net- APIs. The R-server can be set up either on AWS
work (CRAN). (Amazon web Services) or can be bought as an
Arguably, R-language has become one of the enterprise solution from Microsoft.
most important tools for computational statistics,
visualization, and data science. Worldwide, mil-
lions of statisticians and data scientists use R to Comparison with Other Statistical
solve their most challenging problems in fields Software
ranging from computational biology to quantita-
tive marketing (Matloff 2011). SPSS
This software is easily accessible, and anyone SPSS is a well-known statistical software that has
can use it as it is open source (there is no purchas- been used as a business solution for companies.
ing fee). R can be used with textual code scripts as SPSS user interface looks quite similar to Micro-
well as inside an environment (RStudio). R code soft Excel, which is widely known by most pro-
and scripts can be written to analyze the data or to fessionals. They can therefore easily apply their
fully implement simulations. In other words, knowledge to this new program. Additionally, the
R can handle computational jobs from the sim- graphs and visualizations can be easily custom-
plest data analyses, such as showing ranges and ized and are more visually appealing. The trade-
simple statistics of the data (minimum and maxi- off is that SPSS cannot do complex analyses and
mum values), to complex models, such as there is a limitation to the size of the data that can
ARIMA, Bayesian networks, Monte Carlo simu- be analyzed in one batch. This is a commercial
lations, and agent-based simulations. solution, not open source.
Once the program is created and used with
data, various graphic displays can be created SAS
quite easily. Once you become familiar with this For advanced analytics, SAS program has been
programming language, R is an easy tool to use, one of the most widely used. It is quite similar
but if not, it takes a short time to learn how to use, to R, yet it is not open source and not open to the
as currently there are also many online tutorials. public. SAS is more difficult to learn than both
Additonally, there are many books that exist that SPSS and Stata, but can run more complicated R
can explain how to use this software. analyses than both of them. On another hand, it
is easier to use than R, but just like SPSS, it is not
suitable for Big Data or complex, noisy data. SAS
R and Big Data is also hard to implement in data streaming
environments.
R is a very powerful tool for analyzing large
datasets. In one code run of R, datasets as large Stata
as tens of millions of data points can be analyzed This is a command-based software in which the
and crunched within a reasonable time on a per- user writes code to produce analytical results sim-
sonal computer. For truly Big Data, that requires ilar to R. It is widely used by researchers and
parallel or distributed computing, R can be used professionals, as it creates impressive-looking
with a series of packages called pdbR (Raim output tables. Different versions of Stata can be
2013). In this case, data is analyzed in batches. purchased for different needs and budgets. It is
For streaming data, which requires different easier to learn than R and SAS, but much more
data architectures, cleaning, and collection complicated than SPSS. There is a journal called
The Stata Journal which releases information about work that has been done with Stata and how to use the program more efficiently. Additionally, Stata holds an annual conference in which developers meet and present. On the other hand, Stata is not suitable for large and noisy datasets either, as the cleaning of the data is much more difficult to do using Stata than using R.

Python
Python is considered the closest competitor to R regarding the analysis and visualization of Big Data and of complex datasets. Python was developed for programmers, while R was developed with statisticians and statistics in mind. While Python is a general purpose language and has an easier syntax than R, R is more often praised for its features on data visualization and complex statistical analyses. Python is more focused on code readability and transferability, while R is specific for graphical models and data analysis. Both languages can be used to perform more complex analyses, such as natural language processing or geospatial analyses, but Python scales up better than R for large, complex data architectures. R is used more by statisticians and researchers, while Python is used more by engineers and computer programmers. Due to their different syntax styles, R is more difficult to learn in the beginning than Python, but after the learning curve is crossed, R can be easier to use than Python.

R Syntax and Use

R uses command-line scripting, and one of the "trademarks" of the R syntax is the use of the inverse arrow for defining objects inside the code – the assignment operator (example: x <- sample(1:10, 1)). For programmers from other languages, the syntax may look peculiar at first, but it is an easy-to-learn syntax, with plenty of tutorials and support online (Cook 2017). Another peculiarity is the way R uses the "$" operator to call variables inside a data set, similar to the way other languages use "." (the dot). On the other hand, functions are defined in a similar way to JavaScript (Cook 2017).

More than Statistics Programming

R programming is not only a statistical or Big Data type of programming language. Due to the development of many packages and the versatility given by functional programming in general, R can be successfully used for text mining, geospatial visualizations, artistic visualizations, and even agent-based modeling. For example, R has a package {tm} that can be used for text mining and the analysis of literary corpuses. Another package, {topicmodeling}, can be used as a natural language processing technique to discover topics in texts based on various probabilistic samples and metrics. And packages such as {maps} or {maptools} or {ggplot2} can be used for geospatial maps, where geographical and quantitative data can be analyzed and overlaid in the same visualization.

R was also successfully used to develop computer simulations such as agent-based modeling and dynamic social network analysis. Some examples are structurally cohesive network blocks (Padgett 2006) or the hypercycles model for economic production as chemistry (Padgett et al. 2003).

R can also be used to do machine learning or Bayesian networks or Bayesian analyses, thus extending the power of the software beyond its original goal of statistical software.

But, in general, R is a very versatile and widely used software for a multitude of analyses and data types.

Further Reading

Cook, J. D. (2017). R programming for those coming from other languages. Web resource: https://www.johndcook.com/R_language_for_programmers.html. Retrieved 12 May 2017.
Data Science Wars. https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis#gs.KOD6_nA. Retrieved 12 May 2017.
Matloff, N. (2011). The art of R programming: A tour of statistical software design. New York: No Starch Press.
Muenchen, R. A. (2012). The popularity of data analysis software. http://r4stats.com/popularity.
Padgett, J. F. (2006). Organizational genesis in Florentine history: Four multiple-network processes (unpublished). Available at: https://www.chicagobooth.edu/socialorg/docs/padgett-organizationalgenesis.pdf.
Padgett, J. F., Lee, D., & Collier, N. (2003). Economic production as chemistry. Industrial and Corporate Change, 12(4), 843–877.
R Core Team. (2016). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org/.
Raim, A. M. (2013). Introduction to distributed computing with pbdR at the UMBC High Performance Computing Facility (PDF). Technical report. UMBC High Performance Computing Facility, University of Maryland, Baltimore. HPCF-2013-2.

Rule

▶ Regulation


Salesforce

Jason Schmitt
Communication and Media, Clarkson University, Potsdam, NY, USA

Salesforce is a global enterprise software company, with Fortune 100 standing, most well-known for its role in linking cloud computing to on-demand customer relationship management (CRM) products. Salesforce CRM and marketing products work together to make corporations more functional and ultimately more efficient. Founded in 1999 by Marc Benioff, Parker Harris, Dave Moellenhoff, and Frank Domingues, Salesforce's varied platforms allow organizations to understand the consumer and the varied media conversations revolving around a business or brand. According to Forbes (April 2011), which conducted an assessment of businesses focused on value to shareholders, Marc Benioff of Salesforce was the most effective CEO in the world.
Salesforce provides a cloud-based centralized location to track data. Contacts, accounts, sales deals, and documents as well as corporate messaging and the varied social media conversations are all archived and retrievable within the Salesforce architecture from any web or mobile device without the use of any tangible software. Salesforce's quickly accessible information has an end goal to optimize profitability, revenue, and customer satisfaction by orienting the organization around the customer. This ability to track and message correctly highlights Salesforce's unique approach to management practice, known in software development as Scrum.
Scrum is an incremental software development framework for managing product development by a development team that works as a unit to reach a common goal. A key principle of Salesforce's Scrum direction is the recognition that during a project the customers can change their minds about what they want and need, often called churn, and predictive understanding is hard to accomplish. As such, Salesforce takes an empirical approach in accepting that an organization's problem cannot be fully understood or defined and instead focuses on maximizing the team's ability to deliver messaging quickly and respond to emerging requirements.
Salesforce provides a fully customizable user interface for custom adoption and access for a diverse array of organization employees. Further, Salesforce has the ability to integrate into existing websites and allows for building additional web pages through the cloud-based service. Salesforce has the ability to link with Outlook and other mail clients to sync calendars and associate emails with the proper contact and provides the functionality to keep a record every time a contact or data entry is accessed or amended.


Similarly, Salesforce keeps track and organizes customer support issues and tracks them through to resolution, with the ability to escalate individual cases based on time sensitivity and the hierarchy of various clients. Extensive reporting is a value of Salesforce's offerings, which provides management an ability to track problem areas within an organization to a distinct department, area, or tangible product offering.
Salesforce has been a key leader in evolving marketing within this digital era through the use of specific marketing strategy aimed at creating and tracking marketing campaigns as well as measuring the success of online campaigns. These services are part of another growing segment available within Salesforce offerings in addition to the CRM packaging. Marketing departments leveraging Salesforce's Buddy Media, Radian6, or ExactTarget obtain the ability for users to conduct demographic, regional, or national searches on keywords and themes across all social networks, which create a more informed and accurate marketing direction. Further, Salesforce's dashboard, which is the main user interactive page, allows the creation of specific marketing directed tasks that can be customized and shared for differing organizational roles or personal preferences.
The Salesforce marketing dashboard utilizes widgets that are custom, reusable page elements, which can be housed on individual users' pages. When a widget is created, it is added to a widgets view where all team members can easily be assigned access. This allows companies and organizations to share appropriate widgets defined and created to serve the target market or industry-specific groups. The shareability of widgets allows the most pertinent and useful tasks to be replicated by many users within a single organization.

Types of Widgets

The Salesforce Marketing Cloud "River of News" is a widget that allows users to scroll through specific search results, within all social media conversations, and utilizes user-defined keywords. Users have the ability to see original posts that were targeted from keyword searches and are provided a source link to the social media platform the post or message originated from. The "River of News" displays posts with many different priorities, such as newest post first, number of Twitter followers, social media platform used, physical location, and Klout score. This tool provides strong functionality for marketers or corporations wishing to hone in on, or take part in, industry, customer, or competitor conversations.
"Topic analysis" is a widget that is most often used to show share of voice, or the percentage of conversation happening about your brand or organization in relation to competitor brands. It is displayed as a pie chart and can be segmented multiple ways based on user configuration. Many use this feature as a quick visual assessment to see the conversations and interest revolving around specific initiatives or product launches.
"Topic trends" is a widget that provides the ability to display the volume of conversation over time through graphs and charts. This feature can be used to understand macro day, week, or month data. This widget is useful when tracking crisis management or brand sentiment. With a line graph display, users can see spikes of activity and conversation around critical areas. Further, users can then click and hone in on spikes, which can open a "Conversation Cloud" or "River of News" that allows users to see the catalyst behind the spike of social media activity. This tool is used as a way to better understand reasons for increased interest or conversation across broad social media platforms.

Salesforce Uses

Salesforce offers wide-ranging data inference from its varied and evolving products.

As CRM integration within the web and mobile has increased, the broad interest to better understand and leverage social media marketing campaigns has risen as well, allowing Salesforce a leading push within this industry's market share. The diverse array of businesses, nonprofits, municipalities, and other organizations that utilize Salesforce illustrates the importance of this software within daily business and marketing strategy. Salesforce clients include the American Red Cross, the City of San Francisco, Philadelphia's 311 system, Burberry, H&R Block, Volvo, and Wiley Publishing.

Salesforce Service Offerings

Salesforce is a leader among other CRM and media marketing-oriented companies such as Oracle, SAP, Microsoft Dynamics CRM, Sage CRM, Goldmine, Zoho, Nimble, Highrise, Insight.ly, and Hootsuite. Salesforce's offerings can be purchased individually or as a complete bundle. It offers current breakdowns of services and access in its varied options that are referred to as Sales Cloud, Service Cloud, ExactTarget Marketing Cloud, Salesforce1 Platform, Chatter, and Work.com.
Sales Cloud allows businesses to track customer inquiries, escalate issues requiring specialized support, and monitor employee productivity. This product provides customer service teams with the answers to customers' questions and the ability to make the answers available on the web so consumers can find answers for themselves.
Service Cloud offers active and real-time information directed toward customer service. This service provides functionality such as Agent Console, which offers relevant information about customers and their media profiles. This service also provides businesses the ability to give customers access to live agent web chats from the web to ensure customers can have access to information without a phone call.
ExactTarget Marketing Cloud focuses on creating closer relationships with customers through directed email campaigns, in-depth social marketing, data analytics, mobile campaigns, and marketing automation.
The Salesforce1 Platform is geared toward mobile app creation. The Salesforce1 Platform gives access to create and promote mobile apps, with over four million apps created utilizing this service.
Chatter is a social and collaborative function that relates to the Salesforce platform. Similar to Facebook and Twitter, Chatter allows users to form a community within their business that can be used for secure collaboration and knowledge sharing.
Work.com is a corporate performance management platform for sales representatives. The platform targets employee engagement in three areas: alignment of team and personal goals with business goals, motivation through public recognition, and real-time performance feedback.
Salesforce has more than 5,500 employees, revenues of approximately $1.7 billion, and a market value of approximately $17 billion. The company regularly conducts over 100 million transactions a day and has over 3 million subscribers.
Headquartered in San Francisco, California, Salesforce also maintains regional offices in Dublin, Singapore, and Tokyo, with secondary locations in Toronto, New York, London, Sydney, and San Mateo, California. Salesforce operates with over 170,000 companies and 17,000 nonprofit organizations. In June 2004, Salesforce was offered on the New York Stock Exchange under the symbol CRM.

Cross-References

▶ Data Aggregation
▶ Data Streaming
▶ Social Media

Further Reading

Denning, S. (2011). Successfully implementing radical management at Salesforce.com. Strategy & Leadership, 39(6), 4.

Satellite Imagery/Remote Sensing

Carolynne Hultquist
Geoinformatics and Earth Observation Laboratory, Department of Geography and Institute for CyberScience, The Pennsylvania State University, University Park, PA, USA

Definition

Remote sensing is a technological approach used to acquire observations of the surface of the Earth and the atmosphere. Remote sensing data is stored in diverse collections on a massive scale, from a variety of platforms and sensors, and at varying spatial and temporal resolutions. The term is often used interchangeably with satellite imagery, which uses sensors deployed on satellite platforms to collect observations, but remote sensing imagery can also be collected by manned and unmanned aircraft as well as ground-based sensors. One of the fundamental computational problems in remote sensing is dividing imagery into meaningful groups of features. Methods have developed and been adopted to classify and cluster features in images based on pixel values. In the face of increased imagery resolution and big data, recent approaches involve object-oriented segmentation and machine learning algorithms.

Introduction

The basic principles of remote sensing are related to the sensor itself, the digital products it creates, and the methods used to extract information from these data. The concept of remote sensing is that data are collected without being in contact with the features observed. Typically, imagery is collected from sensors that are on satellite platforms, manned aircraft, and unmanned aerial vehicles (UAVs). The sensors record digital imagery within a grid of pixels that have values from 0 to 255 (Campbell 2011). Remote sensing instruments are calibrated to collect measurements at different wavelengths in the electromagnetic spectrum, which are recorded as bands. Each band records the magnitude of the radiation as the brightness of a pixel in the scene. Using a combination of these bands, imagery analysis can be used to identify features on the surface of the Earth. Remote sensing classification techniques are typically used to extract features into classes based on spectral characteristics of imagery of the Earth's surface.
The data mining field of image processing is conceptually similar to remote sensing imagery classification. In image processing, automated processing of visible spectrum RGB (red-green-blue) images uses patterns based on identified features in images in order to recognize an overall class to which an image belongs. For example, finding all the dogs in a set of images is learned by picking out features of a dog and then identifying dogs by these characteristic features. At even a glance, humans are very good at visual recognition by quickly putting together information perceived to identify objects in images. In these fields of image classification, individual features are used to identify the overall pattern by moving windows that consider the neighboring pixels in order to pick out spatially close parts of features. Standard image classification may look for the features that make up a face, whereas in remote sensing, the entire landscape is described by classifying the features that make it up based on learned characteristics of those features. For example, we know that vegetation and urban features register as particular spectral signatures in certain bands, which allows for the characterization of those features. When those features are extracted with classification, it is based off these learned characteristics of the imagery.
Remote sensing analysis employs recognition of the visual scene; however, it is different from what is normally experienced by humans, as the perspective of the image is observed as if above the Earth and characteristics of features not visible to the human eye are made useful by digital observation. Remote sensing classification can take advantage of sensor capabilities to go beyond the visible spectrum by incorporating available spatial data about the surface of the Earth from bands of infrared, radar, LIDAR (Light Detection and Ranging), etc.

Hyperspectral sensors record bands at many small sections of the electromagnetic spectrum. These additional features are not necessary for many applications but can improve classification accuracy, as it is more challenging to accurately classify the complexity of features on the Earth's surface with only visible bands.

Historical Background
Remote sensing is a term often used interchangeably with satellite imagery, which uses sensors deployed on satellite platforms to collect observations. Yet, remote sensing had its roots in less stable airborne sensors such as pigeons and balloons. As aircraft capabilities advanced over the twentieth century, both manned and unmanned planes were often used for remote sensing applications. Modern satellite programs began to develop for research purposes to observe the environment. For example, the prominent Landsat program was developed in the 1970s by the U.S. government to provide satellite remote sensing of the Earth's surface. UAVs are becoming increasingly popular for high-resolution collection as the cost to buy into the technology has decreased and performance capabilities have significantly developed.
Classification of remote sensing imagery to extract features has been developed as a technique since the 1960s. Traditionally, image classification was performed manually by image interpreters who went over imagery section by section to produce useful classes such as land use and land cover. Skilled human interpreters rely on eight elements of imagery interpretation: image tone, texture, shadow, pattern, association, shape, size, and site (Olson 1960). These elements guide human interpreters to assign each pixel to a class.
Today, remote sensing imagery is available for download in large quantities online. There is more imagery available and at a higher quality than ever before, so automated methods of classification are essential for processing. Automated classification techniques continue to make use of these elements of imagery interpretation as variables to computationally determine the resulting classification.

Methods
Dividing the image into meaningful groups is a fundamental computational problem in remote sensing analysis. Bands from imagery are stored as pixel values collected by the sensor at a particular wavelength. These spectral values, referred to as digital numbers, can be used to identify features in imagery by classifying the pixels individually or by moving windows that consider the neighboring pixels. Many computational methods were developed over the years to classify and cluster features in images based on pixel values. In light of big data and high spatial resolution, recent approaches involve object-oriented segmentation and machine learning algorithms.
For years, the pixel-based approach was standard, and it classified features only at the scale at which features are able to be observed, which was very coarse compared to high spatial resolution modern day sensors. The spatial resolution (measured as the square meter area covered by each pixel) of the imagery from the Landsat satellite system, which has been operational since the 1970s, has traditionally been 30 m by 30 m. So, if a building covers less than half of the surface area of 30 by 30 m, then it would not be identified as a building. A feature will likely not be classified correctly until it is twice the size of the image resolution, as the pixels do not perfectly align with features. An important concept from the discipline of cartography is that coarser grids are used at larger scales with less information content, while finer grids are traditionally used only at smaller scales and make details of features identifiable (Hengl 2006). Therefore, Landsat is mostly used for large-scale classification of general land cover categories and not as often used for detailed land use classes that break down types of classes into hierarchies, such as urban classes being made up of impervious surfaces and buildings. Some fuzzy classification methods have been used to improve the pixel-based method.
Modern sensors provide imagery that is at a much higher spatial resolution than previously available.

At a spatial resolution of 1.5 m, such as SPOT 6, features to be extracted are made up of many pixels. This increase in spatial resolution has caused a major shift in the field from using a pixel-based approach to an object-oriented approach, as there are many pixels which make up features. Objects can group together many similar pixels that form a feature, whereas pixel-based approaches leave "edge effects" of misclassified pixels (Campbell 2011). In a pixel-based approach, pixels are individually assigned to a class, so small variations in spectral properties can create areas with non-contextual class combinations.
Object-oriented approaches solve this pixel misclassification problem by first segmenting the image pixels into meaningful objects before classification. An object-oriented method segments homogeneous regions of pixels that consider characteristics of at least spatial configuration and spectral characteristics, but often shape, texture, size, and topography as well. These segmentation parameters are set by the user based on an understanding of the resolution of the imagery being used and the size of the features that are to be identified. Borna et al. (2016) address the issue of subjective scale for object segmentation, as the parameters set by trial and error for the size of the objects are shown to affect the resulting classification. In addition, multiple segmentations can be run to construct features that exist at different typical sizes and shapes, such as rivers, streams, and lakes.
Machine learning in remote sensing is a growing field, and many researchers are turning to automated methods to extract features. Automated classification methods are needed due to the increase in imagery available over large coverage areas at better spatiotemporal resolution. This means that there are more images of high quality (spatial resolution) at more places (coverage) more often (temporal resolution). Using image interpretation techniques from a manual interpreter is a huge time investment and involves human error. Automated methods can provide quick classification with consistent computational accuracy so that we can take advantage of the high spatio-temporal resolution to detect patterns and changes. The key then becomes accuracy assessment to show the end users that the results of the machine learning techniques are reliable.
Machine learning for image classification traditionally takes advantage of commonly used techniques such as basic Decision Trees to more advanced techniques of Random Forest and boosted trees, Support Vector Machines (SVM), and Gaussian Processes. For hyperspectral imagery, SVM is shown to outperform traditional statistical and nearest neighbor (NN)-based classification, with only ensemble methods sometimes having better accuracy when using spatial and textural-based morphological parameters (Chutia et al. 2016). Naive Bayesian classifiers are less used in remote sensing; while the method is fast, they perform poorly as they assume independence between attributes, which is not true of imagery.
Feature reduction is necessary in some cases in which many imagery bands are available. Multispectral imagery typically has at the minimum three bands of RGB (red-green-blue). Often other bands are available in the infrared, which can be helpful in distinguishing between vegetation and human constructed features. Multispectral imagery is traditionally used for image classification, but recently hyperspectral imagery is becoming available over some areas for many applications that try to observe specific signatures. As Chutia et al. (2016) describe, using hyperspectral imagery creates challenges for classification due to having many bands of imagery; typical systems have 220–400 bands collected at many wavelengths, which are often autocorrelated features. Many methods are used for classification feature reduction, like decision fusion, mixture modelling, and discriminant analysis (Chutia et al. 2016). The predictive power of methods is increased by reducing the high dimensionality of the hyperspectral bands and having low linear correlation between bands using techniques such as principal components (Chutia et al. 2016), smart band selection (SBS), and minimum redundancy maximum relevance (mRMR).

Instead of reducing the features for traditional classification techniques, other methods can be used, such as deep learning, which takes advantage of the high dimensionality of the data available.
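As a concrete illustration of the machine learning workflow sketched above, the example below trains a Random Forest classifier on a handful of labeled pixels and predicts a land-cover class for every pixel in a scene. It is a minimal, hypothetical sketch: the four-band array stands in for real imagery (which would normally be read from a satellite product with a library such as rasterio), and the training labels are invented for illustration.

```python
# Minimal sketch of pixel-based classification with a Random Forest.
# The imagery and training labels below are synthetic stand-ins for real data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
bands, height, width = 4, 100, 100          # e.g., blue, green, red, near-infrared
image = rng.random((bands, height, width))  # synthetic reflectance values

# Reshape to a (pixels x bands) feature matrix: one row per pixel.
X_all = image.reshape(bands, -1).T

# A few hand-labeled training pixels (indices and classes are illustrative).
train_idx = np.array([10, 500, 2500, 7000, 9500])
train_labels = np.array([0, 0, 1, 1, 2])    # 0 = water, 1 = vegetation, 2 = built-up

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_all[train_idx], train_labels)

# Predict a class for every pixel and restore the image shape.
class_map = clf.predict(X_all).reshape(height, width)
print(np.unique(class_map, return_counts=True))
```

In practice, accuracy assessment against an independent set of reference pixels would follow this step, for the reasons discussed above.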
Applications
Remote sensing is used for a variety of applications in order to identify or measure features and detect changes. The classes to be identified can be land use/land cover types, binary detection of change, levels of damage assessment, specific mineral types, etc. Some application areas are measuring the extent of impact from disasters, the melting of glaciers, mapping geology, or urban land use planning. There is a growing field of interest in using remote sensing to analyze and optimize agriculture. Archeological applications are using technologies such as UAVs, photogrammetry for 3D modeling, and radar for buried sites. Environmental change can be monitored remotely over large areas. Meteorological conditions are constantly monitored with satellite imagery. As the technology advances, many new application fields are developing.

Conclusion

Remote sensing is a growing technological field with methodological advancements to meet the computational need for processing big data. The use of remote sensing for specialized applications is becoming publicly accessible with growing interest in how to use the data and decreasing costs to buy into the technology. Hopefully the field will continue to develop to meet needs in critical application areas. Remote sensing draws users in as it enables us to look beyond what we can naturally see to identify features of interest and recognize measurable changes in the environment.

Cross-References

▶ Environment
▶ Sensor Technologies

Further Reading

Borna, K., Moore, A. B., & Sirguey, P. (2016). An intelligent geospatial processing unit for image classification based on geographic vector agents (GVAs). Transactions in GIS, 20(3), 368–381. http://doi.org/10.1111/tgis.12226.
Campbell, J. B. (2011). Introduction to remote sensing (5th ed.). New York: The Guilford Press. ISBN 978-1609181765.
Chutia, D., Bhattacharyya, D. K., Sarma, K. K., Kalita, R., & Sudhakar, S. (2016). Hyperspectral remote sensing classifications: A perspective survey. Transactions in GIS, 20(4), 463–490. http://doi.org/10.1111/tgis.12164.
Hengl, T. (2006). Finding the right pixel size. Computers & Geosciences, 32(9), 1283–1298.
Olson, C. E. (1960). Elements of photographic interpretation common to several sensors. Photogrammetric Engineering, 26(4), 651–656.

Scientometrics

Jon Schmid
Georgia Institute of Technology, Atlanta, GA, USA

Scientometrics refers to the study of science through the measurement and analysis of researchers' productive outputs. These outputs include journal articles, citations, books, patents, data, and conference proceedings. The impact of big data analytics on the field of scientometrics has primarily been driven by two factors: the emergence of large online bibliographic databases and a recent push to broaden the evaluation of research impact beyond citation-based measures. Large online databases of articles, conference proceedings, and books allow researchers to study the manner in which scholarship develops and measure the impact of researchers, institutions, and even countries on a field of scientific knowledge. Using data on social media activity, article views, downloads, social bookmarking, and the text posted on blogs and other websites, researchers are attempting to broaden the manner in which scientific output is measured.

Bibliometrics, a subdiscipline of scientometrics that focuses specifically on the study of scientific publications, witnessed a boon in research due to the emergence of large digital bibliographic databases such as Web of Science, Scopus, Google Scholar, and PubMed. The utility of increased digital indexing is enhanced by the recent surge in total scientific output. Lutz Bornmann and Ruediger Mutz find that global scientific output has grown at a rate of 8–9% per year since World War II (equivalent to a doubling every 9 years) (Bornmann and Mutz 2015).
Bibliometric analysis using large data sets has been particularly useful in research that seeks to understand the nature of research collaboration. Because large bibliographic databases contain information on coauthorships, the institutions that host authors, journals, and publication dates, text mining software can be used in combination with social network analysis to understand the nature of collaborative networks. Visualizations of these networks are increasingly used to show patterns of collaboration, ties between scientific disciplines, and the impact of scientific ideas. For example, Hanjun Xian and Krishna Madhavan analyzed over 24,000 journal articles and conference proceedings from the field of engineering education in an effort to understand how the literature was produced (Xian and Madhavan 2014). These data were used to map the network of collaborative ties in the discipline. The study found that cross-disciplinary scholars played a critical role in linking isolated network segments.
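A small sketch of how such a coauthorship network can be assembled is shown below. It assumes bibliographic records have already been reduced to author lists (the records and names here are invented) and uses the networkx library; a real study would parse thousands of records from a bibliographic database.

```python
# Sketch: building a coauthorship network from (hypothetical) bibliographic records.
from itertools import combinations
import networkx as nx

papers = [                                  # each record is a list of authors
    ["Xian", "Madhavan"],
    ["Madhavan", "Lee", "Ortiz"],
    ["Lee", "Ortiz"],
]

G = nx.Graph()
for authors in papers:
    for a, b in combinations(authors, 2):   # every coauthor pair shares an edge
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# Simple network measures of collaboration patterns.
print(nx.degree_centrality(G))
print(list(nx.connected_components(G)))
```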
Besides studying authorship and collaboration, big data analytics have been used to analyze citations to measure the impact of research, researchers, and research institutions. Citations are a common proxy for the quality of research. Important papers will generally be highly cited, as subsequent research relies on them to advance knowledge.
One prominent metric used in scientometrics is the h-index, which was proposed by Jorge Hirsch in 2005. The h-index considers the number of publications produced by an individual or organization and the number of citations these publications receive. An individual can be said to have an h-index of h when she produces h publications, each of which receives at least h citations, and no other publication receives more than h citations.
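This definition lends itself to a very short computation once a scholar's citation counts are available; the sketch below is a straightforward implementation (the citation counts are invented for illustration).

```python
# Compute the h-index from a list of per-publication citation counts.
def h_index(citations):
    # Sort citation counts from highest to lowest; h is the largest rank k
    # such that the k-th most cited paper has at least k citations.
    ranked = sorted(citations, reverse=True)
    h = 0
    for k, c in enumerate(ranked, start=1):
        if c >= k:
            h = k
        else:
            break
    return h

print(h_index([25, 8, 5, 3, 3, 1, 0]))  # prints 3: three papers have at least 3 citations each
```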
The advent of large databases and big data analytics has greatly facilitated the calculation of the h-index and similar impact metrics. For example, in a 2013 study, Filippo Radicchi and Claudio Castellano utilized the Google Scholar Citations data set to evaluate the individual scholarly contribution of over 35,000 scholars (Radicchi and Castellano 2013). The researchers found that the number of citations received by a scientist is a strong proxy for that scientist's h-index, whereas the number of publications is a less precise proxy.
The same principles behind citation analysis can be applied to measure the impact or quality of patents. Large patent databases such as PATSTAT allow researchers to measure the importance of individual patents using forward citations. Forward citations come from the "prior art" section of the patent documents, which describes the technologies that were deemed critical to their innovation by the patent applicants. Scholars use patent counts, weighted by forward citations, to derive measures of national innovative productivity.
Until recently, measurement of research impact has been almost exclusively based on citation-based measures. However, citations are slow to accumulate and ignore the influence of research on the broader public. Recently there has been a push to include novel data sources in the evaluation of research impact. Gunther Eysenbach has found that tweets about a journal article within the first 3 days of publication are a strong predictor of eventual citations for highly cited research articles (Eysenbach 2011). The direction of causality in this relationship – i.e., whether strong papers lead to a high volume of tweets or whether the tweets themselves cause subsequent citations – is unclear. However, the author suggests that the most promising use of social media data lies not in its use as a predictor of traditional impact measures but as a means of creating novel metrics of the social impact of research.
Indeed, the development of an alternative set of measurements – often referred to as "altmetrics" – based on data gleaned from the social web represents a particularly active field of scientometrics research.

Toward this end, services such as PLOS Article-Level Metrics use big data techniques to develop metrics of research impact that consider factors other than citations. PLOS Article-Level Metrics pulls in data on article downloads, commenting, and sharing via services such as CiteULike, Connotea, and Facebook, to broaden the way in which a scholar's contribution is measured.
Certain academic fields, such as the humanities, that rely on under-indexed forms of scholarship such as book chapters and monographs have proven difficult to study using traditional scientometrics techniques. Because they do not depend on online bibliographic databases, altmetrics may prove useful in studying such fields. Björn Hammarfelt uses data from Twitter and Mendeley – a web-based citation manager that has a social networking component – to study scholarship in the humanities (Hammarfelt 2014). While his study suggests that coverage gaps still exist using altmetrics, as these applications become more widely used, they will likely become a useful means of studying neglected scientific fields.

Cross-References

▶ Bibliometrics/Scientometrics
▶ Social Media

Further Reading

Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222. arXiv:1402.4578 [Physics, Stat].
Eysenbach, G. (2011). Can tweets predict citations? Metrics of social impact based on Twitter and correlation with traditional metrics of scientific impact. Journal of Medical Internet Research, 13, e123.
Hammarfelt, B. (2014). Using altmetrics for assessing research impact in the humanities. Scientometrics, 101, 1419–1430.
Radicchi, F., & Castellano, C. (2013). Analysis of bibliometric indicators for individual scholars in a large data set. Scientometrics, 97(3), 627–637. https://doi.org/10.1007/s11192-013-1027-3.
Xian, H., & Madhavan, K. (2014). Anatomy of scholarly collaboration in engineering education: A big-data bibliometric analysis. Journal of Engineering Education, 103, 486–514.

Semantic Data Model

▶ Ontologies

Semantic/Content Analysis/Natural Language Processing

Paul Nulty
Centre for Research in Arts Social Science and Humanities, University of Cambridge, Cambridge, United Kingdom

Introduction

One of the most difficult aspects of working with big data is the prevalence of unstructured data, and perhaps the most widespread source of unstructured data is the information contained in text files in the form of natural language. Human language is in fact highly structured, but although major advances have been made in automated methods for symbolic processing and parsing of language, full computational language understanding has yet to be achieved, and so a combination of symbolic and statistical approaches to machine understanding of language is commonly used. Extracting meaning or achieving understanding from human language through statistical or computational processing is one of the most fundamental and challenging problems of artificial intelligence. From a practical point of view, the dramatic increase in availability of text in electronic form means that reliable automated analysis of natural language is an extremely useful source of data for many disciplines.
Big data is an interdisciplinary field, of which natural language processing (NLP) is a fragmented and interdisciplinary subfield. Broadly speaking, researchers use approaches somewhere on a continuum between representing and parsing the structures of human language in a symbolic, rule-based fashion, or feeding large amounts of minimally preprocessed text into more sophisticated statistical machine learning systems.

systems. In addition, various substantive research machine translation, question answering, and
areas have developed overlapping but distinct summarization.
methods for computational analysis of text. In the social sciences, the terms quantitative
The question of whether NLP tasks are best content analysis, quantitative text analysis, or
approached with statistical, data-driven methods “text as data” are all used. Content analysis may
or symbolic, theory-driven models is an old be performed by human coders, who read and
debate. In 1957, Noam Chomsky wrote: mark-up documents. This process can be stream-
lined with software. Fully automated content anal-
it must be recognized that the notion of “probability
ysis, or quantitative text analysis, typically
of a sentence” is an entirely useless one, under any
known interpretation of this term. employs statistical word-frequency analysis to
discover latent traits from text, or scale documents
However, at present the best methods we have of interest on a particular dimension of interest in
for translating, searching, and classifying natural social science or political science.
language text use flexible machine-learning algo-
rithms that learn parameters probabilistically from
relatively unprocessed text. On the other hand, Tools and Resources
some applications, such as the IBM Watson ques-
tion answering system (Ferruci et al. 2010), make Text data does not immediately challenge compu-
good use of a combination of probabilistic learn- tational resources to the same extent as other big
ing and modules informed by linguistic theory to data sources such as video or sensor data. For
disambiguate nuanced queries. example, the entire proceedings of the European
The field of computational linguistics origi- parliament from 1996 to 2005, in 21 languages,
nally had the goal of improving understanding of can be stored in 5.4 gigabytes – enough to load
human language using computational methods. into main memory on most modern machines.
Historically, this meant implementing rules and While techniques such as parallel and distributed
structures inspired by the cognitive structures pro- processing may be necessary in some cases, for
posed by Chomskyan generative linguistics. Over example, global streams of social media text or
time, computational linguistics has broadened to applying machine learning techniques for classi-
include diverse methods for machine processing fication, typically the challenge of text data is to
of language irrespective of whether the computa- parse and extract useful information from the idi-
tional models are plausible cognitive models of osyncratic and opaque structures of natural lan-
human language processing. As practiced today, guage, rather than overcoming computational
computational linguistics is closer to a branch of difficulties simply to store and manipulate the
computer science than a branch of linguistics. The text. The unpredictable structure of text files
branch of linguistics that uses quantitative analy- means that general purpose programming lan-
sis of large text corpora is known as corpus guages are commonly used, unlike in other appli-
linguistics. cations where the tabular format of the data allows
Research in computational linguistics and nat- the use of specialized statistical software.
ural language processing involves finding solu- Original Unix command line tools such as
tions for the many subproblems associated with grep, sed, and awk are still extremely useful for
understanding language, and combining advances batch processing of text documents. Historically,
in these modules to improve performance on gen- Perl has been the programming language of
eral tasks. Some of the most important NLP sub- choice for text processing, but recently Ruby and
problems include part-of-speech tagging, Python have become more widely used. These are
syntactic parsing, identifying the semantic roles scripting languages, designed for ease of use and
played by verb arguments, recognizing named flexibility rather than speed. For more computa-
entities, and resolving references. These feed tionally intensive tasks, NLP tools are
into performance on more general tasks like implemented in Java or C/Cþþ.

The Python libraries spaCy and gensim and the Java-based Stanford CoreNLP software are widely used in industry and academia. They provide implementations and guides for the most widely used text processing and statistical document analysis methods.

Preprocessing

The first step in approaching a text analysis dataset is to successfully read the document formats and file encodings used. Most programming languages provide libraries for interfacing with Microsoft Word and pdf documents. The ASCII coding system represents unaccented English upper and lowercase letters, numbers, and punctuation, using one byte per character. This is no longer sufficient for most purposes, and modern documents are encoded in a diverse set of character encodings. The Unicode system defines code points which can represent characters and symbols from all writing systems. The UTF-8 and UTF-16 encodings implement these code points in 8-bit or 16-bit encoded files.
Words are the most apparent units of written text, and most text processing tasks begin with tokenization – dividing the text into words. In many languages, this is relatively uncomplicated: whitespace delimits words, with a few ambiguous cases such as hyphenation, contraction, and the possessive marker. Within languages written in the Roman alphabet there is some variance; for example, agglutinative languages like Finnish and Hungarian tend to use long compound terms disambiguated by case markers, which can make the connection between space-separated words and dictionary-entry meanings tenuous. For languages with a different orthographic system, such as Chinese, Japanese, and Arabic, it is necessary to use a customized tokenizer to split text into units suitable for quantitative analysis.
Even in English, the correspondence between space-separated word and semantic unit is not exact. The fundamental unit of vocabulary – sometimes called the lexeme – may be modified or inflected by the addition of morphemes indicating tense, gender, or number. For many applications, it is not desirable to distinguish between the inflected forms of words; rather, we want to sum together counts of equivalent words. Therefore, it is common to remove the inflected endings of words and count only the root, or stem. For example, a system to judge the sentiment of a movie review need not distinguish between the words "excite," "exciting," "excites," and "excited." Typically the word ending is removed and the terms are treated equivalently.
The Porter stemmer (Porter 1980) is one of the most frequently used algorithms for this purpose. A slightly more sophisticated method is lemmatization, which also normalizes inflected words, but uses a dictionary to match irregular forms such as "be"/"is"/"are". In addition to stemming and tokenizing, it may be useful to remove very common words that are unlikely to have semantic content related to the task. In English, the most common words are function words such as "of," "in," and "the." These "stopwords" largely serve a grammatical rather than semantic function, and some NLP systems simply remove them before proceeding with a statistical analysis.
After the initial text preprocessing, there are several simple metrics that may be used to assess the complexity of language used in the documents. The type-token ratio, a measure of lexical diversity, gives an estimate of the complexity of the document by comparing the total number of words in the document to the number of unique words (i.e., the size of the vocabulary). The Flesch-Kincaid readability metric uses the average sentence length and the average number of syllables per word combined with coefficients calibrated with data from students to give an estimate of the grade-level reading difficulty of a text.
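A minimal preprocessing sketch using the NLTK library is shown below; it tokenizes a short text, removes stopwords, applies the Porter stemmer, and reports the type-token ratio. The example sentence is invented, and NLTK's tokenizer models and stopword list must be downloaded once before use.

```python
# Sketch: tokenization, stopword removal, Porter stemming, and type-token ratio.
# Requires NLTK plus its "punkt" tokenizer models and "stopwords" corpus.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The excited reviewers were excitedly reviewing the exciting film."

tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
stops = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t not in stops]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_tokens]

type_token_ratio = len(set(tokens)) / len(tokens)  # lexical diversity of the raw tokens
print(stems)               # inflected forms collapse toward shared stems
print(type_token_ratio)
```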
Document-Term Matrices

After tokenization and other preprocessing steps, most text analysis methods work with a matrix that stores the frequency with which each word in the vocabulary occurs in each document. This is the simplest case, known as the "bag-of-words" model, and no information about the ordering of the words in the original texts is retained.

More sophisticated analysis might involve extracting counts of complex features from the documents. For example, the text may be parsed and tagged with part-of-speech information as part of the preprocessing stage, which would allow words with identical spellings but different part-of-speech categories or grammatical roles to be counted as separate features.
Often, rather than using only single words, counts of phrases are used. These are known as n-grams, where n is the number of words in the phrase; for example, trigrams are three-word sequences. N-gram models are especially important for language modeling, used to predict the probability of a word or phrase given the preceding sequence of words. Language modeling is particularly important for natural language generation and speech recognition problems.
Once each document has been converted to a row of counts of terms or features, a wide range of automated document analysis methods can be employed. The document-term matrix is usually sparse and uneven – a small number of words occur very frequently in many documents, while a large number of words occur rarely, and most words do not occur at all in a given document. Therefore, it is common practice to smooth or weight the matrix, either using the log of the term frequency or with a measure of term importance like tf-idf (term frequency x inverse document frequency) or mutual information.
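The sketch below builds a raw document-term matrix and a tf-idf-weighted version with scikit-learn; the three toy documents are invented for illustration.

```python
# Sketch: building a document-term matrix and a tf-idf weighted matrix.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the economy is growing and jobs are growing",
    "the election campaign focused on the economy",
    "the team won the football match",
]

counts = CountVectorizer().fit_transform(docs)      # raw term frequencies
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vectorizer.fit_transform(docs)        # weighted by inverse document frequency

print(counts.shape)                                 # documents x vocabulary terms
print(tfidf_vectorizer.get_feature_names_out())     # the vocabulary after stopword removal
```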
Matrix Analysis

Supervised classification methods attempt to automatically categorize documents based on the document-term matrix. One of the most familiar of such tasks is the email spam detection problem. Based on the frequencies of words in a corpus of emails, the system must decide if an email is spam or not. Such a system is supervised in the sense that it requires as a starting point a set of documents that have been correctly labeled with the appropriate category, in order to build a statistical model of which terms are associated with each category. One simple and effective algorithm for supervised document classification is Naive Bayes, which gives a new document the class that has the maximum a posteriori probability given the term counts and the independent associations between the terms and the categories in the training documents. In political science, a similar algorithm – "wordscores" – is widely used, which sums Naive-Bayes-like word parameters to scale new documents based on reference scores assigned to training texts with extreme positions (Laver et al. 2003).
Other widely used supervised classifiers include support vector machines, logistic regression, and nearest neighbor models. If the task is to predict a continuous variable rather than a class label, then a regression model may be used. Statistical learning and prediction systems that operate on text data very often face the typical big data problem of having more features (word types) than observed or labeled documents. This is a high dimensional learning problem, where p (the number of parameters) is much larger than n (the number of observed examples).
In addition, word frequencies are extremely unevenly distributed (an observation known as Zipf's law) and are highly correlated with one another, resulting in parameter vectors that make less than ideal examples for regression models. It may therefore be necessary to use regression methods designed to mitigate this problem, such as lasso and ridge regression, or to prune the feature space to avoid overtraining, using feature subset selection or a dimensionality reduction technique like principal components analysis or singular value decomposition. With recent advances in neural network research, it has become more common to use unprocessed counts of n-grams, tokens, or even characters as input to a neural network with many intermediate layers. With sufficient training data, such a network can learn the feature extraction process better than hand-curated feature extraction systems, and these "deep learning" networks have improved the state of the art in machine translation and image labeling.
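A minimal version of such a supervised classifier can be assembled from the pieces already described: term counts as features feeding a multinomial Naive Bayes model. The labeled training texts below are invented, and a real system would of course require a much larger labeled corpus.

```python
# Sketch: a bag-of-words Naive Bayes classifier in the spirit of spam detection.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "win a free prize claim your money now",
    "cheap loans win cash instantly",
    "meeting agenda attached for tomorrow",
    "please review the draft report before the meeting",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["claim your free cash prize", "report attached for review"]))
```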
Unsupervised methods can cluster documents or reveal the distribution of topics in documents in a data-driven fashion.

For unsupervised scaling and clustering of documents, methods include k-means clustering or the Wordfish algorithm, a multinomial Poisson scaling model for political documents (Slapin and Proksch 2008).
Another goal of unsupervised analysis is to measure what topics comprise the text corpus, and how these topics are distributed across documents. Topic modeling (Blei 2012) is a widely used generative technique to discover a set of topics that influence the generation of the texts, and to explore how they are associated with other variables of interest.
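As an illustration of this generative approach, the sketch below fits a small latent Dirichlet allocation model with scikit-learn (gensim is another common choice) and prints the top words per topic; the toy corpus and the choice of two topics are purely illustrative.

```python
# Sketch: discovering topics with latent Dirichlet allocation (LDA).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the central bank raised interest rates to curb inflation",
    "inflation and unemployment dominated the budget debate",
    "the striker scored twice as the team won the league match",
    "the goalkeeper saved a penalty in the cup final",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-5:][::-1]]  # five highest-weight words
    print(f"topic {i}: {top}")
```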
Vector Space Semantics and Machine Learning

In addition to retrieving or labeling documents, it can be useful to represent the meaning of terms found in the documents. Vector space semantics, or distributional semantics, aims to represent the meaning of words using counts of their co-occurrences with other words. The "distributional hypothesis," as described by J. R. Firth (Firth 1957), is the idea that "you shall know a word by the company it keeps." The co-occurrence vectors of words have been shown to be useful for noun phrase disambiguation, semantic relation extraction, and analogy resolution. Many systems now use the factorization of the co-occurrence matrices as the initial input to statistical learners, allowing a fine-grained representation of lexical semantics. Vector semantics also allows for word sense disambiguation – it is possible to distinguish the different senses of a word by clustering the vector representations of its occurrences.
These vectors may count instances of words co-occurring with the same context (syntagmatic relations) or compare the similarity of the contexts of words as a measure of their substitutability (paradigmatic relations) (Turney and Pantel 2010). The use of neural networks or dimensionality reduction techniques allows researchers to produce a relatively low dimensional space in which to compare word vectors, sometimes called word embeddings.
ing Watson: An overview of the deep QA project. AI
word embeddings.
Magazine, 31(3), 59–79.
Machine learning has long been used to per- Firth, J. R. (1957). A synopsis of linguistic theory. In
form classification of documents or to aid the Studies in linguistic analysis. Blackwell: Oxford.

Laver, M., Benoit, K., & Garry, J. (2003). Extracting policy positions from political texts using words as data. American Political Science Review, 97(02), 311–331.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Slapin, J. B., & Proksch, S.-O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705–722.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.

Semiotics

Erik W. Kuiler
George Mason University, Arlington, VA, USA

Background

Semiotics, as an intellectual discipline, focuses on the relationships between signs, the objects to which they refer, and the interpreters – human individuals or information and communications technology (ICT) intelligent agents – who assign meaning to the conceptualizations as well as instantiations of such signs and objects based on those relationships. By focusing on diverse kinds of communications as well as their users, semiotics supports the development of a conceptual framework to explore various aspects of knowledge transformation and dissemination that include the reception and manipulation of signs, symbols, and signals received from diverse sources, such as signals received from medical Internet of Things (IoT) devices, and the algorithms and analytics required to transform signals and data into information and knowledge. Semiotics encompasses different aspects of knowledge formulation – the epistemically constrained process of extrapolating signals from noise, transforming signals into data, and, subsequently, into knowledge that can be operationalized using large datasets to support strategic planning, decision-making, and data analytics.
From the perspective of an ICT-focused semiotics, a signal is an anomaly discovered in the context of a perceived indiscriminate, undifferentiated field of noise and is, in effect, recognition of a pattern out of noise. Frequently, signals provide an impetus to action. For example, in an IoT network, signals reflect properties of frequency, duration, and strength that indicate a change in an object's state or environment to elicit a response from an interpreting agent (be it an ICT or human agent) or to transfer meaning.
Signals become data – the creation of facts (something given or admitted, especially as a basis for reasoning or inference) – by imposing syntactic and symbolic norms on signals. Data become signs when semantics are applied. Signs reflect cultural, epistemic norms and function as morphemes of meaning by representing objects in the mind of the interpreter. In Aristotelian terms, signs and their associated symbols are figures of thought that allow an individual to think about an object without its immediate presence. In this sense, signs have multiple aspects: a designative aspect, which points an interpreter to a specific object (an index); an appraisive aspect, which draws attention to the object's ontological properties; and a prescriptive aspect, which instructs the interpreter to respond in a specific way, such as in response to a transmission stop signal. A symbol is a mark, established by convention, to represent an object state, process state, or situation. For example, a flashing red light is usually used to indicate a state of danger or failure. Information comprises at least one datum. Assuming that the data are syntactically congruent, information constitutes the transformation of data by attaching meaning to the collection in which the individual data are grouped, as the result of analyzing, calculating, or otherwise exploring them (e.g., by aggregation, combination, decomposition, transformation, correlation, mapping, etc.), usually for assessing, calculating, or planning a course of action. Knowledge constitutes the addition of purpose and conation to the understanding gained from analyzing information.

Semiotics: Overview

Semiotics comprises three interrelated disciplines: semantics, syntagmatics (including syntactics), and pragmatics.
and pragmatics. Semantics focuses on the sign–object relationships, i.e., the signification of signs and the perception of meaning, for example, by implication, logic, or reference. Syntagmatics focuses on sign-to-sign relationships, i.e., the manner in which signs may be combined to form well-formed composite signs (e.g., well-formed predicates). Pragmatics focuses on sign–interpreter relationships, i.e., methods by which meaning is derived from a sign or combination of signs in a specific context.

Semantics: Lexica and Ontologies

Lexica and ontologies provide the semantics component for a semiotics-focused approach to information derivation and knowledge formulation. Lexica and ontologies reflect social constructions of reality, defined in the context of specific epistemic cultures as sets of norms, symbols, human interactions, and processes that collectively facilitate the transformation of data into information and knowledge. A lexicon functions as a controlled vocabulary and contains the terms and their definitions that collectively constitute the epistemic domain. The terms and their definitions that constitute the lexicon provide the basis for the ontology, which delineates the interdependencies among categories and their properties, usually in the form of similes, meronymies, and metonymies.

Ontologies define and represent the concepts that inform epistemic domains, their properties, and their interdependencies. An ontology, when populated with valid data, provides a base for knowledge formulation that supports the analytics of those data that collectively operationalize that domain. An ontology informs a named perspective defined over a set of categories (or classes) that collectively delimit a domain of knowledge. In this context, a category delineates a named perspective defined over a set of properties. A property constitutes an attribute or characteristic common to these instances that constitute a category; for example, length, diameter, and mode of ingestion. A taxonomy is a directed acyclic perspective defined over a set of categories; for example, a hierarchical tree structure depicting the various superordinate, ordinate, and subordinate categories of an ontology.

Ontologies provide the semantic congruity, consistency, and clarity to support different algorithmic-based aggregations, correlations, and regressions. From an ICT perspective, ontologies enable the development of interoperable information systems.

Syntagmatics: Relationships and Rules

As morphemes of meaning, signs participate in complex relationships. In paradigmatic relations, signs obtain their meaning from their association with other signs based on substitution so that other signs (terms and objects) may be substituted for signs in the predicate, provided that the signs belong to the same ontological category (i.e., paradigmatic relations support lexical alternatives and semantic likenesses). Indeed, the notion of paradigmatic relations is foundational for the development of ontologies by providing the means to develop categories based on properties shared by individual instances.

Syntactics: Metadata

Whereas the lexicon and ontology support the semantic and interpretive aspects of data analytics, metadata support the semantic and syntagmatic operational aspects of data analytics. Metadata are generally considered to be information about data and are usually formulated and managed to comply with predetermined standards. Operational metadata reflect the management requirements for data security and safeguarding personally identifiable information; data ingestion, federation, and integration; data anonymization; data distribution; and analytical data storage. Structural (syntactic) metadata provide information about data structures (e.g., file layouts or database table and column specifications). Bibliographical metadata provide information about the dataset's producer, such as the author, title, table of contents, and applicable keywords of a document; data lineage metadata provide information about the chain of custody of a data item with respect to its provenance – the chronology of data ownership, stewardship, and transformations.
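The distinctions drawn above among operational, structural, bibliographical, and lineage metadata can be sketched with a small, hypothetical record. The Python snippet below is only an illustration under assumed conventions; the field names and values are invented for the example and do not follow any particular metadata standard.

    # Hypothetical metadata for a single dataset. Field names are
    # illustrative only and do not reflect a specific metadata standard.
    dataset_metadata = {
        "operational": {
            "contains_pii": True,              # drives anonymization and access control
            "ingestion_source": "clinical-sensor-feed",
            "storage_tier": "analytical-warehouse",
        },
        "structural": {
            "format": "CSV",
            "columns": [
                {"name": "patient_id", "type": "string"},
                {"name": "heart_rate", "type": "integer", "unit": "beats per minute"},
            ],
        },
        "bibliographical": {
            "title": "Remote cardiac monitoring extract",
            "producer": "Example Health System",
            "keywords": ["IoT", "heart rate", "telemetry"],
        },
        "lineage": [
            {"event": "collected", "agent": "wearable device", "date": "2021-03-01"},
            {"event": "anonymized", "agent": "ETL pipeline", "date": "2021-03-02"},
        ],
    }

    # The lineage list records the chain of custody described above;
    # the last entry is the most recent transformation applied to the data.
    print(dataset_metadata["lineage"][-1]["event"])

In such a sketch, a lexicon or ontology would supply the meaning of terms such as "heart_rate," while the metadata record supports the operational and syntactic handling of the data.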
Pragmatics

From the perspective of ICT, pragmatics has two complementary, closely linked components: (1) operationalization support and (2) analytics support. Operationalization pragmatics focuses on the development, management, and governance of ontologies and lexica, metadata, interoperability, etc. Analytical pragmatics focuses on how meaning is derived by engaging rhetoric, hermeneutics, logic, and heuristics and their attendant methods to discern meaning in data, create information, and develop knowledge.

Summary

Semiotics is normative, bounded by epistemic and cultural contexts, and provides the foundation for lexicon and ontology development. Knowledge formulation paradigms depend on lexica and ontologies to provide repositories for formal specifications of the meanings of symbols delineated by the application of semiotics.

Further Reading

Morris, C. W. (1938). The foundations of the theory of signs. Chicago: Chicago University Press.
Morris, C. W. (1946). Signs, language, and behavior. In C. W. Morris (Ed.), Writings on the general theory of signs (pp. 73–398). The Hague: Mouton.
Morris, C. W. (1964). Signification and significance: A study of the relations of signs and values. Cambridge, MA: MIT Press.
Nöth, W. (1995). Handbook of semiotics. Bloomington: Indiana University Press.
Ogden, C. K., & Richards, I. A. (2013). The meaning of meaning: A study of language upon thought and the science of symbolism. CT: Mansfield Centre.
Peirce, C. S. (1958). Collected papers of C.S. Peirce. Cambridge, MA: Harvard University Press.
Sowa, J. F. (2000). Ontology, metadata, and semiotics. In B. Ganter & G. Mineau (Eds.), Conceptual structures: Logical, linguistics, and computational issues (pp. 55–81). Berlin: Springer Verlag.
Sowa, J. F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove: Brooks/Cole.

Semi-structured Data

Yulia A. Strekalova1 and Mustapha Bouakkaz2
1 College of Journalism and Communications, University of Florida, Gainesville, FL, USA
2 University Amar Telidji Laghouat, Laghouat, Algeria

More and more data become available electronically every day, and they may be stored in a variety of data systems. Some data entries may reside in unstructured document file systems, and some data may be collected and stored in highly structured relational databases. The data itself may represent raw images and sounds or come with a rigid structure as strictly entered entities. However, a lot of data currently available through public and proprietary data systems is semi-structured.

Definition

Semi-structured data is data that resembles structured data by its format but is not organized with the same restrictive rules. This flexibility allows collecting data even if some data points are missing or contain information that is not easily translated in a relational database format. Semi-structured data carries the richness of human information exchange, but most of it cannot be automatically processed and used. Developments in markup languages and software applications allow the collection and evaluation of semi-structured data, but the richness of natural text contained in semi-structured data still presents challenges for analysts.

Structured data has been organized into a format that makes it easier to access and process, such as databases where data is stored in columns, which represent the attributes of the database. In reality, very little data is completely structured. Conversely, unstructured data has not been reformatted, and its elements are not organized into a data structure. Semi-structured data combines some elements of both data types. It is not
organized in a complex manner that supports immediate analyses; however, it may have information associated with it, such as metadata tagging, that allows elements contained to be addressed through more sophisticated access queries. For example, a word document is generally considered to be unstructured data. However, when metadata tags in the form of keywords that represent the document content are added, the data becomes semi-structured.

Data Analysis

The volume and unpredictable structure of the available data present challenges in analysis. To get meaningful insights from semi-structured data, analysts need to pre-analyze it to ask questions that can be answered with the data. The fact that a large number of correlations can be found does not necessarily mean that analysis is reliable and complete. One of the preparation measures before the actual data analysis is data reduction. While a large number of data points may be available for collection, not all these data points should be included in the analysis of every question. Instead, a careful consideration of data points is likely to produce a more reliable and explainable interpretation of observed data. In other words, just because the data is available, it does not mean it needs to be included in the analysis. Some elements may be random and will not add substantively to the answer to a particular question. Some other elements may be redundant and not add any new information compared to the one already provided by other data points.

Jules Berman suggests nine steps to the analysis of semi-structured data. Step 1 includes formulation of a question which can and will be subsequently answered with data. A Big Data approach may not be the best strategy for questions that can be answered with other traditional research methods. Step 2 evaluates data resources available for collection. Data repositories may have "blind spots" or data points that are systematically excluded or restricted for public access. At step 3, a question is reformulated to adjust for the resources identified in step 2. Available data may be insufficient to answer the original question despite the access to large amounts of data. Step 4 involves evaluation of possible query outputs. Data mining may return a large number of data points, but these data points most frequently need to be filtered to focus on the analysis of the question at hand. At step 5, data should be reviewed and evaluated for its structure and characteristics. Returned data may be quantitative or qualitative, or it may have data points which are missing for a substantial number of records, which will impact future data analysis. Step 6 requires a strategic and systematic data reduction. Although it may sound counterintuitive, Big Data analysis can provide the most powerful insights when the data set is condensed to bare essentials to answer a focused question. Some collected data may be irrelevant or redundant to the problem at hand and will not be needed for the analysis. Step 7 calls for the identification of analytic algorithms, should they be deemed necessary. Algorithms are analytic approaches to data, which may be very sophisticated. However, establishing a reliable set of meaningful metrics to answer a question may be a more reliable strategy. Step 8 looks at the results and conclusions of the analysis and calls for conservative assessment of possible explanations and models suggested by the data, assertions for causality, and possible biases. Finally, step 9 calls for validation of results in step 8 using comparable data sets. Invalidation of predictions may suggest necessary adjustments to any of the steps in the data analysis and make conclusions more robust.

Data Management

Semi-structured data includes both database characteristics and incorporates documents and other file types, which cannot be fully described by a standard database entry. Data entries in structured data sets follow the same order; all entries in a group have the same descriptions, defined format, and predefined length. In contrast, semi-structured data entries are organized in semantic entities, similar to the structured data, which may not have the same attributes in the same order or of the same length. Early digital databases were
organized based on the relational model of data, where data is recorded into one or more tables with a unique identifier for each entry. The data for such databases needs to be structured uniformly for each record. Semi-structured data, by contrast, relies on tags or other markers to separate data elements. Semi-structured data may miss data elements or have more than one data point in an element. Overall, while semi-structured data has a predefined structure, the data within this structure is not entered with the same rigor as in the traditional relational databases. This data management situation arises from the practical necessity to handle user-generated and widely interactional data brought up by the Web 2.0. The data contained in emails, blog posts, PowerPoint presentation files, images, and videos may have very different sets of attributes, but they also offer a possibility to assign metadata systematically. Metadata may include information about author and time and may create the structure to assign the data to semantic groups. Unstructured data, on the other hand, is the data that cannot be readily organized in tables to capture the full extent of it. Semi-structured data, as the name suggests, carries some elements of structured data. These elements are metadata tags that may list the author or sender, entry creation and modification times, the length of a document, or the number of slides in a presentation. Yet, these data also have elements that cannot be described in a traditional relational database. For example, a traditional database structure, which would require an initial infrastructure design, cannot readily handle information such as a sent email and all of the responses that were received, because it is unknown whether email respondents will use one or all names in a response, whether anyone will be added or omitted, whether the original message will be modified, whether attachments will be added to subsequent messages, etc. Semi-structured data allows programmers to nest data or create hierarchies that represent complex data models and relationships among entries. However, robustness of the traditional relational data model forces more thoughtful implementation of data applications and possible subsequent ease in analysis. Handling of semi-structured data is associated with some challenges. The data itself may present a problem by being embedded in natural text, which cannot always be extracted automatically with precision. Natural text is based on sentences that may not have easily identifiable relationships and entities, which are necessary for data collection; a further challenge is the lack of widely accepted standards for vocabularies. A communication process may involve different models to transfer the same information or require richer data transfer available through natural text and not through a structured exchange of keywords. For example, email exchange can capture the data about senders and recipients, but automated filtering and analysis of the body of email are limited.

Two main types of semi-structured data formats are Extensible Markup Language (XML) and JavaScript Object Notation (JSON). XML, developed in the mid-1990s, is a markup language that sets rules for the data interchange. XML, although being an improvement to earlier markup languages, has been critiqued for being bulky and cumbersome in implementation. JSON is viewed as a possible successor format for digital architecture and database technologies. JSON is an open standard format that transmits data between an application and a server. Data objects in JSON format consist of attribute-value pairs stored in databases like MongoDB and Couchbase. The data, which is stored in a database like MongoDB, can be pulled with a software network for more efficient and faster processing. Apache Hadoop is an example of an open-source framework that provides both storage and processing support. Other multi-platform query processing applications suitable for enterprise-level use are Apache Spark and Presto. A brief illustration of the two formats appears after the cross-references below.

Cross-References

▶ Data Integration
▶ Digital Storytelling, Big Data Storytelling
▶ Discovery Analytics, Discovery Informatics
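As a minimal sketch of the XML and JSON formats discussed above, the following Python snippet encodes the same hypothetical email record in both formats and parses each with the standard library; the element and key names are invented for the example and are not drawn from any particular schema.

    import json
    import xml.etree.ElementTree as ET

    # The same hypothetical email record in the two common
    # semi-structured formats; tag and key names are invented.
    xml_record = """<email>
      <sender>alice@example.com</sender>
      <recipient>bob@example.com</recipient>
      <subject>Quarterly report</subject>
    </email>"""

    json_record = """{
      "sender": "alice@example.com",
      "recipient": "bob@example.com",
      "subject": "Quarterly report",
      "attachments": []
    }"""

    # Both formats mark data elements with tags or attribute-value pairs
    # rather than forcing the record into a fixed relational schema.
    root = ET.fromstring(xml_record)
    print(root.find("sender").text)   # alice@example.com

    data = json.loads(json_record)
    print(data["recipient"])          # bob@example.com

Document stores such as MongoDB, mentioned above, persist records in essentially this attribute-value form.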
Further Reading

Abiteboul, S., et al. (2012). Web data management. New York: Cambridge University Press.
Foreman, J. W. (2013). Data smart: Using data science to transform information into insight. Indianapolis: Wiley.
Miner, G., et al. (2012). Practical text mining and statistical analysis for non-structured text data applications. Waltham: Academic.

Sensor Technologies

Carolynne Hultquist
Geoinformatics and Earth Observation Laboratory, Department of Geography and Institute for CyberScience, The Pennsylvania State University, University Park, PA, USA

Definition/Introduction

Sensor technologies are developed to detect specific phenomena, behavior, or actions. The origin of the word sensor comes from the Latin root "sentire," a verb defined as "to perceive" (Kalantar-zadeh 2013). Sensors are designed to identify certain phenomena as a signal but not record anything else, as it would create noise in the data. Sensors are specified by purpose to identify or measure the presence or intensity of different types of energy: mechanical, gravitational, thermal, electromagnetic, chemical, and nuclear. Sensors have become part of everyday life and continue to grow in importance in modern applications.

Prevalence of Sensors

Sensors are used in everyday life to detect phenomena, behavior, or actions such as force, temperature, pressure, flow, etc. The type of sensor utilized is based on the type of energy that is being sensed, be it gravitational, mechanical, thermal, electromagnetic, chemical, or nuclear. The activity of interest is typically measured by a sensor and converted by a transducer into a signal as a quantity (McGrath and Scanaill 2013). Sensors have been integrated into daily life so that we use them without considering tactile sensors such as elevator buttons, touchscreen devices, and touch sensing lamps. Typical vehicles contain numerous sensors for driving functions, safety, and the comfort of the passengers. Mechanical sensors measure motion, velocity, acceleration, and displacement through such sensors as strain gauges, pressure, force, ultrasonic, acoustic wave, flow, displacement, accelerometers, and gyroscopes (McGrath and Scanaill 2013). Chemical and thermal biometric sensors are often used for healthcare, from traditional forms like monitoring temperature and blood pressure cuffs to glucose meters, pacemakers, defibrillators, and HIV testing.

New sensor applications are developing which produce individual, home, and environmental data. There are many sensor types that were developed years ago but are finding new applications. Navigational aids, such sensors as gyroscopes, accelerometers, and magnetometers, have existed for many years in flight instruments for aircraft and more modernly for smartphones. Sensors internal to smartphone devices are intended to monitor the device but can be repurposed to monitor many things such as extreme exposure to heat or movement for health applications. The interconnected network of devices to promote automation and efficiency is often referred to as the Internet of things (IoT). Sensors are becoming more prevalent and cheap enough that the public can make use of personal sensors that already exist in their daily lives or can be easily acquired.

Personal Health Monitoring

Health-monitoring applications are becoming increasingly common and produce very large volumes of data. Biophysical processes such as heart rate, breathing rate, sleep patterns, and restlessness can be recorded continuously using devices kept in contact with the body. Health-conscious and athletic communities, such as runners, have particularly taken to personal monitoring by using technology to track their current condition and progress. Pedometers, weight scales, and thermometers are commonplace. Heart rate, blood pressure, and muscle fatigue are now monitored
by affordable devices in the form of bracelets, rings, adhesive strips, and even clothing. Brands of smart clothing are offering built-in sensors for heart rate, respiration, skin temperature and moisture, and electrophysiological signals that are sometimes even recharged by solar panels. There are even wireless sensors for the insole of shoes to automatically adjust for the movements of the user in addition to providing health and training analysis.

Wearable health technologies are often used to provide individuals with private personal information; however, certain circumstances call for system-wide monitoring for medical or emergency purposes. Medical patients, such as those with diabetes or hypertension, can use continuously testing glucose meters or blood pressure monitors (Kalantar-zadeh 2013). Bluetooth-enabled devices can transmit data from monitoring sensors and contact the appropriate parties automatically if there are health concerns. Collective health information can be used to have a better understanding of such health concerns as cardiac issues, extreme temperatures, and even crisis information.

Smart Home

Sensors have long been a part of modern households from smoke and carbon monoxide detectors to security systems and motion sensors. Increasingly, smart home sensors are being used for everyday monitoring in order to have more efficient energy consumption with smart lighting fixtures and temperature controls. Sensors are often placed to inform on activities in the house such as a door or window being opened. This integrated network of house monitoring promises efficiency, automation, and safety based on personal preferences. There is significant investment in smart home technologies, and big data analysis can play a major role in determining appropriate settings based on feedback.

Environmental Monitoring

Monitoring of the environment from the surface to the atmosphere is traditionally a function performed by the government through remotely sensed observations and broad surveys. Remote sensing imagery from satellites and airborne flights can create large datasets on global environmental changes for use in such applications as agriculture, pollution, water, climatic conditions, etc. Government agencies also employ static sensors and make on-site visits to check sensors which monitor environmental conditions. These sensors are sometimes integrated into networks which can communicate observations to form real-time monitoring systems.

In addition to traditional government sources of environmental data, there are growing collections of citizen science data that are focused primarily on areas of community concern such as air quality, water quality, and natural hazards. Air quality and water quality have long been monitored by communities concerned about pollution in their environment, but a recent development after the 2011 Fukushima nuclear disaster is radiation sensing. Safecast is a radiation monitoring project that seeks to empower people with information on environmental safety and openly distributes measurements under creative commons rights (McGrath and Scanaill 2013). Radiation is not visibly observable so it is considered a "silent" environmental harm, and the risk needs to be considered in light of validated data (Hultquist and Cervone 2017). Citizen science projects for sensing natural hazards from flooding, landslides, earthquakes, wildfires, etc. have come online with support from both governments and communities. Open-source environmental data is a growing movement as people get engaged with their environment and become more educated about their health.

Conclusion

The development and availability of sensor technologies is a part of the big data paradigm. Sensors are able to produce an enormous amount of data, very quickly with real-time uploads, and from diverse types of sensors. Many questions still remain of how to use this data and if
connected sensors will lead to smart environments that will be a part of everyday modern life. The Internet of things (IoT) is envisioned to connect communication across domains and applications in order to enable the development of smart cities. Sensor data can provide useful information for individuals and generalized information from collective monitoring. Services often offer personalized analysis in order to keep people engaged using the application. Yet, most analysis and interest from researchers in sensor data is at a generalized level. Despite mostly generalized data analysis, there is public concern related to data privacy from individual and home sensors. The privacy level of the data is highly dependent on the system used and the terms of service agreement if a service is being provided related to the sensor data.

Analysis of sensor data is often complex, messy, and hard to verify. Nonpersonal data can often be checked or referenced to a comparable dataset to see if it makes sense. However, large datasets produced by personal sensors for such applications as health are difficult to independently verify at an individual level. For example, an environmental condition could have caused a natural reaction of a rapid heartbeat which is medically safe given the condition that the user awoke with a quick increase in heart rate due to an earthquake. Individual inspection of data for such noise is fraught with problems as it is complicated to identify causes in the raw data from an individual, but at a generalized level, such data can be valuable for research and can appropriately take into account variations in the data.

Sensor technologies are integrated into everyday life and are used in numerous applications to monitor conditions. The usefulness of technological sensors should be no surprise as every living organism has biological sensors which serve similar purposes to indicate the regulation of internal functions and conditions of the external environment. The integration of sensor technologies is a natural step that goes from individual measurements to collective monitoring which highlights the need for big data analysis and validation.

Cross-References

▶ AgInformatics
▶ Biometrics
▶ Biosurveillance
▶ Crowdsourcing
▶ Drones
▶ Environment
▶ Health Informatics
▶ Participatory Health and Big Data
▶ Patient-Centered (Personalized) Health
▶ Pollution, Air
▶ Pollution, Land
▶ Pollution, Water
▶ Satellite Imagery/Remote Sensing

Further Reading

Hultquist, C., & Cervone, G. (2017). Citizen monitoring during hazards: Validation of Fukushima radiation measurements. Geo Journal. http://doi.org/10.1007/s10708-017-9767-x.
Kalantar-zadeh, K. (2013). Sensors: An introductory course (1st ed.). Boston: Springer US.
McGrath, M. J., & Scanaill, C. N. (2013). Sensor technologies: Healthcare, wellness, and environmental applications. New York: Apress Open.

Sentic Computing

Erik Cambria
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore

With the recent development of deep learning, research in artificial intelligence (AI) has gained new vigor and prominence. Machine learning, however, suffers from three big issues, namely:

1. Dependency issue: it requires (a lot of) training data and it is domain-dependent.
2. Consistency issue: different training and/or tweaking lead to different results.
3. Transparency issue: the reasoning process is uninterpretable (black-box algorithms).

Sentic computing (Cambria and Hussain 2015) addresses these issues in the context of natural language processing (NLP) by coupling machine learning with linguistics and commonsense reasoning. In particular, we apply an ensemble of commonsense-driven linguistic patterns and statistical NLP: the former are triggered when prior knowledge is available, the latter is used as a backup plan when both semantics and sentence structure are unknown. Machine learning, in fact, is only useful to make a good guess because it only encodes correlation and its decision-making process is merely probabilistic. To use Noam Chomsky's words, "you do not get discoveries in the sciences by taking huge amounts of data, throwing them into a computer and doing statistical analysis of them: that's not the way you understand things, you have to have theoretical insights."

Sentic computing is a multidisciplinary approach to natural language understanding that aims to bridge the gap between statistical NLP and many other disciplines that are necessary for understanding human language, such as linguistics, commonsense reasoning, affective computing, and more. Sentic computing, whose term derives from the Latin "sensus" (as in commonsense) and "sentire" (root of words such as sentiment and sentience), enables the analysis of text not only at document, page, or paragraph level, but also at sentence, clause, and concept level (Fig. 1).

Sentic Computing, Fig. 1 Sentic computing flowchart

Sentic computing positions itself as a horizontal technology that serves as a back-end to many different applications in the areas of e-business, e-commerce, e-governance, e-security, e-health, e-learning, e-tourism, e-mobility, e-entertainment, and more. Some examples of such applications include financial forecasting (Xing et al. 2018) and healthcare quality assessment (Cambria et al. 2012a), community detection (Cavallari et al. 2017) and cyber issue detection (Cambria et al. 2010), human communication comprehension (Zadeh et al. 2018) and dialogue systems (Young et al. 2018). State-of-the-art performance is ensured in all these sentiment analysis applications, thanks to sentic computing's new approach to NLP, whose novelty gravitates around three key shifts:

1. Shift from mono- to multidisciplinarity – evidenced by the concomitant use of AI and Semantic Web techniques, for knowledge
representation and inference; mathematics, for carrying out tasks such as graph mining and multidimensionality reduction; linguistics, for discourse analysis and pragmatics; psychology, for cognitive and affective modeling; sociology, for understanding social network dynamics and social influence; finally ethics, for understanding related issues about the nature of mind and the creation of emotional machines.
2. Shift from syntax to semantics – enabled by the adoption of the bag-of-concepts model instead of simply counting word co-occurrence frequencies in text. Working at concept-level entails preserving the meaning carried by multiword expressions such as cloud computing, which represent "semantic atoms" that should never be broken down into single words. In the bag-of-words model, for example, the concept cloud computing would be split into computing and cloud, which may wrongly activate concepts related to the weather and, hence, compromise categorization accuracy.
3. Shift from statistics to linguistics – implemented by allowing sentiments to flow from concept to concept based on the dependency relation between clauses. The sentence "iPhoneX is expensive but nice", for example, is equal to "iPhoneX is nice but expensive" from a bag-of-words perspective. However, the two sentences bear opposite polarity: the former is positive as the user seems to be willing to make the effort to buy the product despite its high price; the latter is negative as the user complains about the price of iPhoneX although he/she likes it (Fig. 2).

Sentic Computing, Fig. 2 Jumping NLP curves

Sentic computing takes a holistic approach to natural language understanding by handling the many subproblems involved in extracting meaning and polarity from text. While most works approach it as a simple categorization problem, in fact, sentiment analysis is actually a suitcase research problem (Cambria et al. 2017b) that requires tackling many NLP tasks (Fig. 3). As Marvin Minsky would say, the expression "sentiment analysis" itself is a big suitcase (like many others related to affective computing (Cambria et al. 2017a), e.g., emotion recognition or opinion mining) that all of us use to encapsulate our jumbled idea about how our minds convey emotions and opinions through natural language. Sentic computing addresses the composite nature of the problem via a three-layer structure that concomitantly handles tasks such as subjectivity detection (Chaturvedi et al. 2018), to filter out neutral content, named-entity recognition (Ma et al. 2016), to locate and classify named entities into pre-defined categories, personality recognition (Majumder et al. 2017), for distinguishing between different
Sentic Computing, Fig. 3 Sentiment analysis suitcase
Sentic Computing, Fig. 4 SenticNet
personality types of the users, sarcasm detection (Poria et al. 2016), to detect and handle sarcasm in opinions, aspect extraction (Ma et al. 2018), for enabling aspect-based sentiment analysis, and more.

Sentic Computing, Fig. 5 Sentic patterns

The core element of sentic computing is SenticNet (Cambria et al. 2020), a knowledge base of 200,000 commonsense concepts (Fig. 4). Unlike many other sentiment analysis resources, SenticNet is not built by manually labeling pieces of knowledge coming from general NLP resources such as WordNet or DBPedia. Instead, it is automatically constructed by applying graph mining and multidimensional scaling techniques on the affective commonsense knowledge collected from three different sources, namely: WordNet-Affect, Open Mind Common Sense, and a game engine for commonsense knowledge acquisition (GECKA) (Cambria et al. 2015b). This knowledge is represented redundantly at three levels (following Minsky's panalogy principle): semantic network, matrix, and vector space (Cambria et al. 2015a). Subsequently, semantics and sentics are calculated through the ensemble application of spreading activation (Cambria et al. 2012c), neural networks (Ma et al. 2018),
and an emotion categorization model (Susanto et al. 2020).

While SenticNet can be used as any other sentiment lexicon, e.g., concept matching or bag-of-concepts model, the right way to use the knowledge base for the task of polarity detection is in conjunction with sentic patterns (Poria et al. 2014). Sentic patterns are sentiment-specific linguistic patterns that infer polarity by allowing affective information to flow from concept to concept based on the dependency relation between clauses. The main idea behind such patterns can be best illustrated by analogy with an electronic circuit, in which few "elements" are "sources" of the charge or signal, while many elements operate on the signal by transforming it or combining different signals. This implements a rudimentary type of semantic processing, where the "meaning" of a sentence is reduced to only one value: its polarity.

Sentic patterns are applied to the dependency syntactic tree of a sentence, as shown in Fig. 5a. The only two words that have intrinsic polarity are shown in yellow color; the words that modify the meaning of other words in the manner similar to contextual valence shifters are shown in blue. A baseline that completely ignores sentence structure, as well as words that have no intrinsic polarity, is shown in Fig. 5b: the only two words left are negative and, hence, the total polarity is negative. However, the syntactic tree can be reinterpreted in the form of a "circuit" where the "signal" flows from one element (or subtree) to another, as shown in Fig. 5c. After removing the words not used for polarity calculation (in white), a circuit with elements resembling electronic amplifiers, logical complements, and resistors is obtained, as shown in Fig. 5d.

Figure 5e illustrates the idea at work: the sentiment flows from polarity words through shifters and combining words. The two polarity-bearing words in this example are negative. The negative effect of the word "old" is amplified by the intensifier "very". However, the negative effect of the word "expensive" is inverted by the negation, and the resulting positive value is decreased by the "resistor". Finally, the values of the two phrases are combined by the conjunction "but", so that the overall polarity has the same sign as that of the second component (positive).

Further Reading

Cambria, E., & Hussain, A. (2015). Sentic computing: A common-sense-based framework for concept-level sentiment analysis. Cham: Springer.
Cambria, E., Chandra, P., Sharma, A., & Hussain, A. (2010). Do not feel the trolls. In ISWC. Shanghai.
Cambria, E., Benson, T., Eckl, C., & Hussain, A. (2012a). Sentic PROMs: Application of sentic computing to the development of a novel unified framework for measuring health-care quality. Expert Systems with Applications, 39(12), 10533–10543.
Cambria, E., Livingstone, A., & Hussain, A. (2012b). The hourglass of emotions. In A. Esposito, A. Vinciarelli, R. Hoffmann, & V. Muller (Eds.), Cognitive behavioral systems, Lecture notes in computer science (Vol. 7403, pp. 144–157). Berlin/Heidelberg: Springer.
Cambria, E., Olsher, D., & Kwok, K. (2012c). Sentic activation: A two-level affective common sense reasoning framework. In AAAI (pp. 186–192). Toronto.
Cambria, E., Fu, J., Bisio, F., & Poria, S. (2015a). AffectiveSpace 2: Enabling affective intuition for concept-level sentiment analysis. In AAAI (pp. 508–514). Austin.
Cambria, E., Rajagopal, D., Kwok, K., & Sepulveda, J. (2015b). GECKA: Game engine for commonsense knowledge acquisition. In FLAIRS (pp. 282–287).
Cambria, E., Das, D., Bandyopadhyay, S., & Feraco, A. (2017a). A practical guide to sentiment analysis. Cham: Springer.
Cambria, E., Poria, S., Gelbukh, A., & Thelwall, M. (2017b). Sentiment analysis is a big suitcase. IEEE Intelligent Systems, 32(6), 74–80.
Cambria, E., Li, Y., Xing, Z., Poria, S., & Kwok, K. (2020). SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis. In CIKM. Ireland.
Cavallari, S., Zheng, V., Cai, H., Chang, K., & Cambria, E. (2017). Learning community embedding with community detection and node embedding on graphs. In CIKM (pp. 377–386). Singapore.
Chaturvedi, I., Ragusa, E., Gastaldo, P., Zunino, R., & Cambria, E. (2018). Bayesian network based extreme learning machine for subjectivity detection. Journal of The Franklin Institute, 355(4), 1780–1797.
Ma, Y., Cambria, E., & Gao, S. (2016). Label embedding for zero-shot fine-grained named entity typing. In COLING (pp. 171–180). Osaka.
Ma, Y., Peng, H., & Cambria, E. (2018). Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In AAAI (pp. 5876–5883). New Orleans.
Majumder, N., Poria, S., Gelbukh, A., & Cambria, E. (2017). Deep learning-based document modeling for
personality detection from text. IEEE Intelligent Systems, 32(2), 74–79.
Poria, S., Cambria, E., Winterstein, G., & Huang, G.-B. (2014). Sentic patterns: Dependency-based rules for concept-level sentiment analysis. Knowledge-Based Systems, 69, 45–63.
Poria, S., Cambria, E., Hazarika, D., & Vij, P. (2016). A deeper look into sarcastic tweets using deep convolutional neural networks. In COLING (pp. 1601–1612). Osaka.
Susanto, Y., Livingstone, A., Ng, B.C., & Cambria, E. (2020). The Hourglass model revisited. IEEE Intelligent Systems, 35(5).
Xing, F., Cambria, E., & Welsch, R. (2018). Natural language based financial forecasting: A survey. Artificial Intelligence Review. https://doi.org/10.1007/s10462-017-9588-9.
Young, T., Cambria, E., Chaturvedi, I., Zhou, H., Biswas, S., & Huang, M. (2018). Augmenting end-to-end dialog systems with commonsense knowledge. In AAAI (pp. 4970–4977). New Orleans.
Zadeh, A., Liang, P. P., Poria, S., Vij, P., Cambria, E., & Morency, L.-P. (2018). Multi-attention recurrent network for human communication comprehension. In AAAI (pp. 5642–5649). New Orleans.

Sentiment Analysis

Francis Dalisay1, Matthew J. Kushin2 and Masahiro Yamamoto3
1 Communication & Fine Arts, College of Liberal Arts & Social Sciences, University of Guam, Mangilao, GU, USA
2 Department of Communication, Shepherd University, Shepherdstown, WV, USA
3 Department of Communication, University at Albany – SUNY, Albany, NY, USA

Francis Dalisay, Matthew Kushin, and Masahiro Yamamoto contributed equally to the writing of this entry.

Sentiment analysis is defined as the computational study of opinions, or sentiment, in text. Sentiment analysis typically intends to capture an opinion holder's evaluative response (e.g., positive, negative, or neutral, or a more fine-grained classification scheme) toward an object. The evaluative response reflects an opinion holder's attitudes, or affective feelings, beliefs, thoughts, and appraisals.

According to scholars Erik Cambria, Bjorn Schuller, Yunqing Xia, and Catherine Havasi, sentiment analysis is a term typically used interchangeably with opinion mining to refer to the same field of study. The scholars note, however, that opinion mining generally involves the detection of the polarity of opinion, also referred to as the sentiment orientation of a given text (i.e., whether the expressed opinion is positive, negative, or neutral). Sentiment analysis focuses on the recognition of emotion (e.g., emotional states such as "sad" or "happy"), but also typically involves some form of opinion mining. For this reason, and since both fields rely on natural language processing (NLP) to analyze opinions from text, sentiment analysis is often couched under the same umbrella as opinion mining.

Sentiment analysis has gained popularity as a social data analytics tool. Recent years have witnessed the widespread adoption of social media platforms as outlets to publicly express opinions on nearly any subject, including those relating to political and social issues, sporting and entertainment events, weather, and brand and consumer experiences. Much of the content posted on sites such as Twitter, Facebook, YouTube, customer review pages, and news article comment boards is public. As such, businesses, political campaigns, universities, and government entities, among others, can collect and analyze this information to gain insight into the thoughts of key publics.

The ability of sentiment analysis to measure individuals' thoughts and feelings has a wide range of practical applications. For example, sentiment analysis can be used to analyze online news content and to examine the polarity of news coverage of particular issues. Also, businesses are able to collect and analyze the sentiment of comments posted online to assess consumers' opinions toward their products and services, evaluate the effectiveness of advertising and PR campaigns, and identify customer complaints. Gathering such market intelligence helps guide decision-making in the realms of product research and development, marketing and public relations, crisis management, and
customer relations. Although businesses have traditionally relied on surveys and focus groups, sentiment analysis offers several unique advantages over such conventional data collection methods. These advantages include reduced cost and time, increased access to much larger samples and hard-to-reach populations, and real-time intelligence. Thus, sentiment analysis can be a useful market research tool. Indeed, sentiment analysis is now commonly offered by many commercial social data analysis services.

Approaches

Broadly speaking, there exist two approaches in the automatic extraction of sentiment from textual material: the lexicon-based approach and the machine learning-based approach. In the lexicon-based approach, a sentiment orientation score is calculated for a given text unit based on a predetermined set of opinion words with positive (e.g., good, fun, exciting) and negative (e.g., bad, boring, poor) sentiments. In a simple form, a list of words, phrases, and idioms with known sentiment orientations is built into a dictionary, or an opinion lexicon. Each word is assigned specific sentiment orientation scores. Using the lexicon, each opinion word extracted receives a predefined sentiment orientation score, which is then aggregated for a text unit.

The machine learning-based approach, also called the text classification approach, builds a sentiment classifier to determine whether a given text about an object is positive, negative, or neutral. Using the ability of machines to learn, this approach trains a sentiment classifier to use a large set of examples, or training corpus, that have sentiment categories (e.g., positive, negative, or neutral). The sentiment categories are manually annotated by humans according to predefined rules. The classifier then applies the properties of the training corpus to classify data into sentiment categories.

Levels of Analysis

The classification of an opinion in text as positive, negative, or neutral (or a more fine-grained classification scheme) is impacted by and thus requires consideration of the level at which the analysis is conducted. There are three levels of analysis: document, sentence, and aspect and/or entity. First, the document-level sentiment classification addresses a whole document as the unit of analysis. The task of this level of analysis is to determine whether an entire document (e.g., a product review, a blog post, an email, etc.) is positive, negative, or neutral about an object. This level of analysis assumes that the opinions expressed on the document are targeted toward a single entity (e.g., a single product). As such, this level is not particularly useful for documents that discuss multiple entities.

The second, sentence-level sentiment classification, focuses on the sentiment orientation of individual sentences. This level of analysis is also referred to as subjectivity classification and is comprised of two tasks: subjective classification and sentence-level classification. In the first task, the system determines whether a sentence is subjective or objective. If it is determined that the sentence expresses a subjective opinion, the analysis moves to the second task, sentence-level classification. This second task involves determining whether the sentence is positive, negative, or neutral.

The third type of classification is referred to as entity and aspect-level sentiment analysis. Also called feature-based opinion mining, this level of analysis focuses on sentiments directed at entities and/or their aspects. An entity can include a product, service, person, issue, or event. An aspect is a feature of the entity, such as its color or weight. For example, in the sentence "the design of this laptop is bad, but its processing speed is excellent," there are two aspects stated – "design" and "processing speed." This sentence is negative about one aspect, "design," and positive about the other aspect, "processing speed." Entity- and aspect-level sentiment analysis is not limited to
analyzing documents or sentences alone. Indeed, although a document or sentence may contain opinions regarding multiple entities and their aspects, the entity- and aspect-level sentiment analysis has the ability to identify the specific entities and/or aspects that the opinions on the document or sentence are referring to and then determine whether the opinions are positive, negative, or neutral.

Challenges and Limitations

Extracting opinions from texts is a daunting task. It requires a thorough understanding of the semantic, syntactic, explicit, and implicit rules of a language. Also, because sentiment analysis is carried out by a computer system with a typical focus on analyzing documents on a particular topic, off-topic passages containing irrelevant information may also be included in the analyses (e.g., a document may contain information on multiple topics). This could result in creating inaccurate global sentiment polarities about the main topic being analyzed. Therefore, the computer system must be able to adequately screen and distinguish opinions that are not relevant to the topic being analyzed. Relatedly, for the machine learning-based approach, a sentiment classifier trained on a certain domain (e.g., car reviews) may perform well on the particular topic, but may not when applied to another domain (e.g., computer reviews). The issue of domain independence is another important challenge.

Also, the complexities of human communication limit the capacity of sentiment analysis to capture nuanced, contextual meanings that opinion holders actually intend to communicate in their messages. Examples include the use of sarcasm, irony, and humor in which context plays a key role in conveying the intended message, particularly in cases when an individual says one thing but means the opposite. For example, someone may say "nice shirt," which implies positive sentiment if said sincerely but implies negative sentiment if said sarcastically. Similarly, words such as "sick," "bad," and "nasty" may have reversed sentiment orientation depending on context and how they are used. For example, "My new car is sick!" implies positive sentiment toward the car. These issues can also contribute to inaccuracies in sentiment analysis.

Altogether, despite these limitations, the computational study of opinions provided by sentiment analysis can be beneficial for practical purposes. So long as individuals continue to share their opinions through online user-generated media, the possibilities for entities seeking to gain meaningful insights into the opinions of key publics will remain. Yet, challenges to sentiment analysis, such as those discussed above, pose significant limitations to its accuracy and thus its usefulness in decision-making.

Cross-References

▶ Brand Monitoring
▶ Data Mining
▶ Facebook
▶ LinkedIn
▶ Online Advertising
▶ Online Identity
▶ SalesForce
▶ Social Media
▶ Time Series Analytics

Further Reading

Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28, 15–21.
Liu, B. (2011). Sentiment analysis and opinion mining. San Rafael: Morgan & Claypool.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 79–86).
Zezima, K. The secret service wants software that detects sarcasm (Yeah, good luck.) The Washington Post. Retrieved 11 Aug 2014 from http://www.washingtonpost.com/politics/the-secret-service-wants-software-that-detects-sarcasm-yeah-good-luck/2014/06/03/35bb8bd0-eb41-11e3-9f5c-9075d5508f0a_story.html.

Server Farm

▶ Data Center

Silviculture

▶ Forestry

"Small" Data

Rochelle E. Tractenberg1,2 and Kimberly F. Sellers3
1 Collaborative for Research on Outcomes and Metrics, Washington, DC, USA
2 Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics; and Rehabilitation Medicine, Georgetown University, Washington, DC, USA
3 Department of Mathematics and Statistics, Georgetown University, Washington, DC, USA

Synonyms

Data; Statistics

Introduction

Big data are often characterized by "the 3 Vs": volume, velocity, and variety. This implies that "small data" lack these qualities, but that is an incorrect conclusion about what defines "small" data. Instead, we define "small data" to be simply "data" – specifically, data that are finite but not necessarily "small" in scope, dimension, or rate of accumulation. The characterization of data as "small" is essentially dependent on the context and use for which the data are intended. In fact, disciplinary perspectives vary on how large "big data" need to be to merit this label, but small data are not characterized effectively by the absence of one or more of these "3 Vs." Most statistical analyses require some amount of vector and matrix manipulation for efficient computation in the modern context. Data sets may be considered "big" if they are so large, multidimensional, and/or quickly accumulating in size that the typical linear algebraic manipulations cannot converge or yield true summaries of the full data set. The fundamental statistical analyses, however, are the same for data that are "big" or "small"; the true distinction arises from the extent to which computational manipulation is required to map and reduce the data (Dean and Ghemawat 2004) such that a coherent result can be derived. All analyses share common features, irrespective of the size, complexity, or completeness of the data – the relationship between statistics and the underlying population; the association between inference, estimation, and prediction; and the dependence of interpretation and decision-making on statistical inference. To expand on the lack of distinguishability between "small" data and "big" data, we explore each of these features in turn. By doing so, we expound on the assertion that a characterization of a dataset as "small" depends on the users' intention and the context in which the data, and results from its analysis, will be used.

Understanding "Big Data" as "Data"

An understanding of why some datasets are characterized as "big" and/or "small" requires some juxtaposition of these two descriptors. "Big data" are thought to expand the boundary of data science because innovation has been ongoing to promote ever-increasing capacity to collect and analyze data with high volume, velocity, and/or variety (i.e., the 3 Vs). In this era of technological advances, computers are able to maintain and
process terabytes of information, including records, transactions, tables, files, etc. However, the ability to analyze data has always depended on the methodologies, tools, and technology available at the time; thus the reliance on computational power to collect or process data is not new or specific to the current era and cannot be considered to delimit "big" from "small" data.

Data collection and analyses date back to ancient Egyptian civilizations that collected census information; the earliest Confucian societies collected this population-spanning data as well. These efforts were conducted by hand for centuries, until a "tabulating machine" was used to complete the analyses required for the 1890 United States Census; this is possibly the first time so large a dataset was analyzed with a non-human "computer." Investigations that previously took years to achieve were suddenly completed in a fraction of the time (months!). Since then, technology continues to be harnessed to facilitate data collection, management, and analysis. In fact, when it was suggested to add "data science" to the field of statistics (Bickel 2000; Rao 2001), "big data" may have referred to a data set of up to several gigabytes in size; today, petabytes of data are not uncommon. Therefore, neither the size nor the need for technological advancements are inherent properties of either "big" or "small" data.

Data are sometimes called "big" if the data collection process is fast(-er), not finite in time or amount, and/or inclusive of a wide range of formats and quality. These features may be contrasted with experimental, survey, epidemiologic, or census data where the data structure, timing, and format are fixed and typically finite. Technological advances allow investigators to collect batches of experimental, survey, or other traditional types of data in near-real or real time, or in online or streaming fashion; such information has been incorporated to ask and answer experimental and epidemiologic questions, including testing hypotheses in physics, climate, chemistry, and both social and biomedical sciences, since the technology was developed. It is inappropriate to distinguish "big" from "small" data along these characteristics; in fact, two analysts simultaneously considering the same data set may each perceive it to be "big" or "small"; these labels must be considered to be relative.

Analysis and Interpretation of "Big Data" Is Based on Methods for "Small Data"

Considering analysis, manipulation, and interpretation of data can support a deeper appreciation for the differences and similarities of "big" and "small" data. Large(r) and higher-dimensional data sets may require computational manipulation (e.g., Dean and Ghemawat 2004), including grouping and dimension reduction, to derive an interpretable result from the full data set. Further, whenever a larger/higher dimension dataset is partitioned for analysis, the partitions or subsets are analyzed using standard statistical methods. The following sections explicate how standard statistical analytic methods (i.e., for "small" data) are applied to a dataset whether it is described as "small" or "big". These methods are selected, employed, and interpreted specifically to support the user's intention for the results and do not depend inherently on the size or complexity of the data itself. This underscores the difficulty of articulating any specific criterion/a for characterizing data as "big" or "small."
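To make the partition-and-aggregate idea concrete, the following minimal sketch (an illustration added here, not part of the original entry) groups a synthetic table into partitions, applies ordinary "small data" summary statistics to each partition, and then recombines the partition summaries, in the spirit of the MapReduce-style computational manipulation cited above. The column names and the data are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative "large" dataset: one numeric measurement per record,
# tagged with the partition (e.g., region or shard) it belongs to.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "partition": rng.integers(0, 100, size=1_000_000),   # hypothetical shard id
    "value": rng.normal(loc=50.0, scale=10.0, size=1_000_000),
})

# "Map" step: apply standard small-data statistics within each partition.
per_partition = big.groupby("partition")["value"].agg(["count", "mean", "var"])

# "Reduce" step: recombine the partition summaries into a global estimate.
n_total = per_partition["count"].sum()
grand_mean = (per_partition["count"] * per_partition["mean"]).sum() / n_total

print(f"global mean reconstructed from {len(per_partition)} partitions: {grand_mean:.3f}")
```

The same statistic could, of course, be computed on the full table directly; the point of the sketch is that nothing in the per-partition analysis is specific to "big" data.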
Sample Versus Population

Statistical analysis and summarization of "big" data are the same as for data generally; the description, confidence/uncertainty, and coherence of the results may vary with the size and completeness of the data set. Even the largest and most multidimensional dataset is presumably an incomplete (albeit massive) representation of the entire universe of values – the "population." Thus, the field of statistics has historically been based on long-run frequencies or computed estimates of the true population parameters. For example, in some current massive data collection and warehousing enterprises, the full population can never be obtained because the data are continuously streaming in and collected. In other massive data sets, however, the entire population is captured; examples include the medical records for a health insurance company, sales on Amazon.com, or weather data for the detection of an evolving storm or other significant weather pattern. The fundamental statistical analyses would be the same for either of these data types; however, they would result in estimates for the (essentially) infinite data set, while actual population-descriptive values are possible whenever finite/population data are obtained. Importantly, it is not the size or complexity of the data that results in either estimation or population description – it is whether or not the data are finite. This underscores the reliance of any and all data analysis procedures on statistical methodologies; assumptions about the data are required for the correct use and interpretation of these methodologies for data of any size and complexity. It further blurs qualifications of a given data set as "big" or "small."

Inference, Estimation, and Prediction

Statistical methods are generally used for two purposes: (1) to estimate "true" population parameters when only sample information is available, and (2) to make or test predictions about either future results or about relationships among variables. These methods are used to infer "the truth" from incomplete data and are the foundations of nearly all experimental designs and tests of quantitative hypotheses in applied disciplines (e.g., science, engineering, and business). Modern statistical analysis generates results (i.e., parameter estimates and tests of inferences) that can be characterized with respect to how rare they are given the random variability inherent in the data set. In frequentist statistical analysis (based on long-run results), this characterization typically describes how likely the observed result would be if there were, in truth, no relationship between (any) variables, or if the true parameter value was a specific value (e.g., zero). In Bayesian statistical analysis (based on current data and prior knowledge), this characterization describes how likely it is that there is truly no relationship given the data that were observed and prior knowledge about whether such a relationship exists.

Whenever inferences are made about estimates and predictions about future events, relationships, or other unknown/unobserved events or results, corrections must be made for the multitude of inferences that are made, for both frequentist and Bayesian methods. Confidence and uncertainty about every inference and estimate must accommodate the fact that more than one has been made; these "multiple comparisons corrections" protect against decisions that some outcome or result is rare/statistically significant when, in fact, the variability inherent in the data makes that result far less rare than it appears. Numerous correction methods exist, with modern (since the mid-1990s) approaches focusing not on controlling for "multiple comparisons" (which are closely tied to experimental design and formal hypothesis testing), but on controlling the "false discovery rate" (which is the rate at which relationships or estimates will be declared "rare given the inherent variability of the data" when they are not, in fact, rare). Decisions made about inferences, estimates, and predictions are classified as correct (i.e., the event is rare and is declared rare, or the event is not rare and is declared not rare) or incorrect (i.e., the event is rare but is declared not rare – a false negative/Type II error; or the event is not rare but is declared rare – a false positive/Type I error); controls for multiple comparisons or false discoveries seek to limit Type I errors.

Decisions are made based on the data analysis, which holds for "big" or "small" data. While multiple comparisons corrections and false discovery rate controls have long been accepted as representing competent scientific practice, they are also essential features of the analysis of big data, whether or not these data are analyzed for scientific or research purposes.
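As a concrete illustration of the contrast drawn above between controlling for multiple comparisons and controlling the false discovery rate, the short sketch below (not part of the original entry) applies a Bonferroni correction and the Benjamini-Hochberg step-up procedure to a set of p-values; the p-values themselves are invented for illustration only.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 wherever the Bonferroni-adjusted p-value falls below alpha."""
    pvals = np.asarray(pvals)
    return pvals * len(pvals) <= alpha

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate at level q."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)                      # ranks of p-values, smallest first
    thresholds = q * (np.arange(1, m + 1) / m)     # rank-dependent cutoffs
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its cutoff
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject

# Hypothetical p-values from many simultaneous tests on the same data set.
p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.9])
print("Bonferroni rejections:", bonferroni(p).sum())              # conservative control of Type I errors
print("Benjamini-Hochberg rejections:", benjamini_hochberg(p).sum())  # controls the false discovery rate
```

With these invented values the Bonferroni correction admits fewer "rare" results than the false discovery rate control, which is the trade-off described above.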
Analysis, Interpretation, and Decision Making

Analyses of data are either motivated by theory or prior evidence ("theory-driven"), or they are unplanned and motivated by the data themselves ("data-driven"). Both types of investigations can be executed on data of any size, complexity, or completeness. While the motivations for data analysis vary across disciplines, evidence that supports decisions is always important. Statistical methods have been developed, validated, and utilized to support the most appropriate analysis, given the data and its properties, so that defensible and reproducible interpretations and inferences result. Thus, decisions that are made based on the analysis of data, whether "big" or "small," are inherently dependent on the quality of the analysis and associated interpretations.

Conclusion

As has been the case for centuries, today's "big" data will eventually be perceived as "small"; however, the statistical methodologies for analyzing and interpreting all data will also continue to evolve, and these will become increasingly interdependent with the methods for collecting, manipulating, and storing the data. Because of the constant evolution and advancement in technology and computation, the notion of "big data" may be best conceptualized as representing the processes of data collection, storage, and manipulation for interpretable analysis, and not the size, utility, or complexity of the data itself. Therefore, the characterization of data as "small" depends critically on the context and use for which the data are intended.

Further Reading

Bickel, P. J. (2000). Statistics as the information science. Opportunities for the mathematical sciences, 9, 11.
Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified data processing on large clusters. In OSDI'04: Sixth symposium on operating system design and implementation, San Francisco. Downloaded from https://research.google.com/archive/mapreduce.html on 21 Dec 2016.
Rao, C. R. (2001). Statistics: Reflections on the past and visions for the future. Communications in Statistics – Theory and Methods, 30(11), 2235–2257.

Smart Agriculture

▶ Agriculture

Smart Cities

Jan Lauren Boyles
Greenlee School of Journalism and Communication, Iowa State University, Ames, IA, USA

Definition/Introduction

Smart cities are built upon aggregated, data-driven insights that are obtained directly from the urban infrastructure. These data points translate into actionable information that can guide municipal development and policy (Albino et al. 2015). Building on the emergent Internet of Things movement, networked sensors (often physically embedded into the built environment) create rich data streams that uncover how city resources are used (Townsend 2013; Komninos 2015; Sadowski and Pasquale 2015). Such intelligent systems, for instance, can send alerts to city residents when demand for urban resources outpaces supply or when emergency conditions exist within city limits. By analyzing these data flows (often in real time), elected officials, city staff, civic leaders, and average citizens can more fully understand resource use and allocation, thereby optimizing the full potential of municipal services (Hollands 2008; de Lange and de Waal 2013; Campbell 2013; Komninos 2015). Over time, the integration of such intelligent systems into metropolitan life acts to better inform urban policy making and better direct long-term municipal planning efforts (Batty 2013; Komninos 2015; Goldsmith and Crawford 2014). Despite this promise of more effective and responsive governance, however, achieving a truly smart city often requires the redesign (and in many cases, the physical rebuilding) of structures to harvest and process big data from the urban environment (Campbell 2013). As a result, global metropolitan leaders continue to experiment with cost-effective approaches to constructing smart cities in the late-2010s.
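As a minimal, hedged sketch of the kind of real-time analysis described above, the following example smooths a hypothetical demand-sensor stream and flags the time steps where demand outpaces a fixed supply capacity; the sensor readings, units, and threshold are assumptions made for illustration only and are not drawn from any city's data.

```python
from collections import deque
from statistics import mean

def demand_alerts(readings, supply_capacity, window=4):
    """Flag time steps where smoothed demand from a sensor stream exceeds supply.

    readings: iterable of (timestamp, demand) tuples from a hypothetical utility sensor.
    supply_capacity: available supply, in the same (assumed) units as demand.
    """
    recent = deque(maxlen=window)      # rolling window of recent demand readings
    alerts = []
    for timestamp, demand in readings:
        recent.append(demand)
        smoothed = mean(recent)        # simple smoothing to avoid reacting to one-off spikes
        if smoothed > supply_capacity:
            alerts.append((timestamp, smoothed))
    return alerts

# Hypothetical hourly water-demand readings against a fixed supply capacity.
stream = [("06:00", 80), ("07:00", 95), ("08:00", 120), ("09:00", 140), ("10:00", 135)]
print(demand_alerts(stream, supply_capacity=110))
```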

Heralded as potentially revolutionizing citizen-government interactions within cities, the initial integration of information and communication technologies (ICTs) into the physical city in the late 1990s was viewed as the first step toward today's smart cities (Caragliu et al. 2011; Albino et al. 2015). In the early 2000s, the burgeoning population growth of global cities mandated the use of more sophisticated computational tools to effectively monitor and manage metropolitan resources (Campbell 2013; Meijer and Bolívar 2015). The rise of smart cities in the early 2010s can, in fact, be traced to a trio of technological advances: the adoption of cloud computing, the expansion of wireless networks, and the acceleration of processing power. At the same time, the societal uptick in mobile computing by everyday citizens enables more data to be collected on the habits and behaviors of urban residents (Batty 2013). The most significant advance in smart city adoption rests, however, in geolocation – the concept that data can be linked to physical space (Batty 2013; Townsend 2013). European metropolises, in particular, have been early adopters of intelligent systems (Vanolo 2013).

The Challenges of Intelligent Governance

Tactically, most smart cities attempt to tackle wicked problems – the types of dilemmas that have historically puzzled city planners (Campbell 2013; Komninos 2015). The integration of intelligent systems into the urban environment has accelerated the time horizon for policymaking for these issues (Batty 2013). Data that once took years to gather and assess can now be accumulated and analyzed in mere hours, or in some cases, in real time (Batty 2013). Within smart cities, crowdsourcing efforts often also enlist residents, who voluntarily provide data to fuel collective and collaborative solutions (Batty 2013). Operating in this environment of heightened responsiveness, municipal leaders within smart cities are increasingly expected to integrate open data initiatives that provide public access to the information gathered by the data-driven municipal networks (Schrock 2016). City planners, civic activists, and urban technologists must also jointly consider the needs of city dwellers throughout the process of designing smart cities, directly engaging residents in the building of smart systems (de Lange and de Waal 2013). At the same time, urban officials must be increasingly cognizant that as more user behaviors within city limits are tracked with data, the surveillance required to power smart systems may also concurrently challenge citizen notions of privacy and security (Goldsmith and Crawford 2014; Sadowski and Pasquale 2015). Local governments must also ensure that the data collected will be safe and secure from hackers, who may wish to disrupt essential smart systems within cities (Schrock 2016).

Conclusion

The successful integration of intelligent systems into the city is centrally predicated upon financial investment in overhauling aging urban infrastructure (Townsend 2013; Sadowski and Pasquale 2015). Politically, investment decisions are further complicated by fragmented municipal leadership, whose priorities for smart city implementation may shift between election cycles and administrations (Campbell 2013). Rather than encountering these challenges in isolation, municipal leaders are beginning to work together to develop global solutions to shared wicked problems. Intelligent system advocates argue that developing collaborative approaches to building smart cities will drive the growth of smart cities into the next decade (Goldsmith and Crawford 2014).

Cross-References

▶ Internet of Things (IoT)
▶ Open Data
Further Reading

Albino, V., Berardi, U., & Dangelico, R. M. (2015). Smart cities: Definitions, dimensions, performance, and initiatives. Journal of Urban Technology, 22(1), 3–21.
Batty, M. (2013). Big data, smart cities and city planning. Dialogues in Human Geography, 3(3), 274–279.
Campbell, T. (2013). Beyond smart cities: How cities network, learn and innovate. New York: Routledge.
Caragliu, A., Del Bo, C., & Nijkamp, P. (2011). Smart cities in Europe. Journal of Urban Technology, 18(2), 65–82.
de Lange, M., & de Waal, M. (2013). Owning the city: New media and citizen engagement in urban design. First Monday, 18(11). doi:10.5210/fm.v18i11.4954.
Goldsmith, S., & Crawford, S. (2014). The responsive city: Engaging communities through data-smart governance. San Francisco: Jossey-Bass.
Hollands, R. G. (2008). Will the real smart city please stand up? Intelligent, progressive or entrepreneurial? City, 12(3), 303–320.
Komninos, N. (2015). The age of intelligent cities: Smart environments and innovation-for-all strategies. New York: Routledge.
Meijer, A., & Bolívar, M. P. R. (2015). Governing the smart city: A review of the literature on smart urban governance. International Review of Administrative Sciences. doi:10.1177/0020852314564308.
Sadowski, J., & Pasquale, F. A. (2015). The spectrum of control: A social theory of the smart city. First Monday, 20(7). doi:10.5210/fm.v20i7.5903.
Schrock, A. R. (2016). Civic hacking as data activism and advocacy: A history from publicity to open government data. New Media & Society, 18(4), 581–599.
Townsend, A. (2013). Smart cities: Big data, civic hackers, and the quest for a new utopia. New York: W.W. Norton.
Vanolo, A. (2013). Smartmentality: The smart city as disciplinary strategy. Urban Studies, 51(5), 883–898.

Social Media

Dimitra Dimitrakopoulou
School of Journalism and Mass Communication, Aristotle University of Thessaloniki, Thessaloniki, Greece

Social media and networks are based on the technological tools and the ideological foundations of Web 2.0 and enable the production, distribution, and exchange of user-generated content. They transform the global media landscape by transposing the power of information and communication to the public that had until recently a passive role in the mass communication process. Web 2.0 tools refer to the sites and services that emerged during the early 2000s, such as blogs (e.g., Blogspot, Wordpress), wikis (e.g., Wikipedia), microblogs (e.g., Twitter), social networking sites (e.g., Facebook, LinkedIn), video (e.g., YouTube), image (e.g., Flickr), file-sharing platforms (e.g., We, Dropbox), and related tools that allow participants to create and share their own content. Though the term was originally used to identify the second coming of the Web after the dotcom burst and restore confidence in the industry, it became inherent in the new WWW applications through its widespread use.

The popularity of Web 2.0 applications demonstrates that, regardless of their levels of technical expertise, users can wield technologies in more active ways than had been apparent previously to traditional media producers and technology innovators. In addition to referring to various communication tools and platforms, including social networking sites, social media also hint at a cultural mindset that emerged in the mid-2000s as part of the technical and business phenomenon referred to as Web 2.0.

It is important to distinguish between social media and social networks. Whereas often both terms are used interchangeably, it is important to understand that social media are based on user-generated content produced by the active users who now can act as producers as well. Social media have been defined on multiple levels, starting from more operational definitions that underline that social media indicate a shift from HTML-based linking practices of the open Web to linking and recommendation, which happen inside closed systems. Web 2.0 has three distinguishing features: it is easy to use, it facilitates sociality, and it provides users with free publishing and production platforms that allow them to upload content in any form, be it pictures, videos, or text. Social media are often contrasted to traditional media by highlighting their distinguishing features, as they refer to a set of online tools that supports social interaction between users. The term is often used to contrast
with more traditional media such as television and books that deliver content to mass populations but do not facilitate the creation or sharing of content by users, as well as their ability to blur the distinction between personal communication and the broadcast model of messages.

Theoretical Foundations of Social Media

Looking into the role of the new interactive and empowering media, it is important to study their development as techno-social systems, focusing on the dialectic relation of structure and agency. As Fuchs (2014) describes, media are techno-social systems, in which information and communication technologies enable and constrain human activities that create knowledge that is produced, distributed, and consumed with the help of technologies in a dynamic and reflexive process that connects technological structures and human agency. The network infrastructure of the Internet allows multiple and multi-way communication and information flow between agents, combining interpersonal (one-to-one), mass (one-to-many), and complex, yet dynamically equal communication (many-to-many).

The discussion on the role of social media and networks finds its roots in the emergence of the network society and the evolution of the Internet as a result of the convergence of the audiovisual, information technology, and telecommunications sectors. Contemporary society is characterized by what can be defined as convergence culture (Jenkins 2006), in which old and new media collide, where grassroots and corporate media intersect, and where the power of the media producer and the power of the media consumer interact in unpredictable ways.

The work of Manuel Castells (2000) on the network society is central, emphasizing that the dominant functions and processes in the Information Age are increasingly organized around networks. Networks constitute the new social morphology of our societies, and the diffusion of networking logic substantially modifies the operation and outcomes in processes of production, experience, power, and culture. Castells (2000) introduces the concept of "flows of information," underlining the crucial role of information flows in networks for economic and social organization.

In the development of the flows of information, the Internet holds the key role as the catalyst of a novel platform for public discourse and public communication. The Internet consists of both a technological infrastructure and (inter)acting humans, in a technological system and a social subsystem that both have a networked character. Together these parts form a techno-social system. The technological structure is a network that produces and reproduces human actions and social networks and is itself produced and reproduced by such practices.

The specification of the online platforms, such as Web 1.0, Web 2.0, or Web 3.0, marks distinctively the social dynamics that define the evolution of the Internet. Fuchs (2014) provides a comprehensive approach for the three "generations" of the Internet, founding them on the idea of knowledge as a threefold dynamic process of cognition, communication, and cooperation. The (analytical) distinction indicates that all Web 3.0 applications (cooperation) and processes also include aspects of communication and cognition, and that all Web 2.0 applications (communication) also include cognition. The distinction is based on the insight that knowledge is a threefold process: all communication processes require cognition, but not all cognition processes result in communication; and all cooperation processes require communication and cognition, but not all cognition and communication processes result in cooperation.

In many definitions, the notions of collaboration and collective action are central, stressing that social media are tools that increase our ability to share, to cooperate with one another, and to take collective action, all outside the framework of traditional institutions and organizations. Social media enable users to create their own content and decide on the range of its dissemination through the various available and easily accessible platforms. Social media can serve as online facilitators or enhancers of human
networks – webs of people that promote connectedness as a social value.

Social network sites (SNS) are built on the pattern of online communities of people who are connected and share similar interests and activities. Boyd and Ellison (2007) provide a robust and articulated definition of SNS, describing them as Web-based services that allow individuals to (1) construct a public or semipublic profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site. As the social media and user-generated content phenomena grew, websites focused on media sharing began implementing and integrating SNS features and becoming SNSs themselves.

The emancipatory power of social media is crucial to understanding the importance of networking, collaboration, and participation. These concepts, directly linked to social media, are key to understanding the real impact and dimensions of contemporary participatory media culture. According to Jenkins (2006), the term participatory culture contrasts with older notions of passive media consumption. Rather than talking about media producers and consumers occupying separate roles, we might now see them as participants who interact with each other and contribute actively, and potentially equally, to social media production.

Participation is a key concept that addresses the main differences between the traditional (old) media and the social (new) media and focuses mainly on the empowerment of the audience/users of media toward a more active information and communication role. The changes transform the relation between the main actors in political communication, namely, political actors, journalists, and citizens. Social media and networks enable any user to participate in the mediation process by actively searching, sharing, and commenting on available content. The distributed, dynamic, and fluid structure of social media enables them to circumvent professional and political restrictions on news production and has given rise to new forms of journalism defined as citizen, alternative, or participatory journalism, but also new forms of propaganda and misinformation.

The Emergence of Citizen Journalism

The rise of social media and networks has a direct impact on the types and values of journalism and the structures of the public sphere. The transformation of interactions between political actors, journalists, and citizens through the new technologies has created the conditions for the emergence of a form distinct from professional journalism, often called citizen, participatory, or alternative journalism. The terms used to identify the new journalistic practices on the Web range from interactive or online journalism to alternative journalism, participatory journalism, citizen journalism, or public journalism. The level and form of the public's participation in the journalistic process determine whether it is a synergy between journalists and the public or an exclusively journalistic activity of the citizens.

However, the phenomenon of alternative journalism is not new. Already in the nineteenth century, the first forms of alternative journalism made their appearance with the development of the radical British press. The radical socialist press in the USA in the early twentieth century followed, as did the marginal and feminist press between 1960 and 1970. Fanzines and zines appeared in the 1970s and were succeeded by pirate radio stations. At the end of the twentieth century, however, the attention moved to new media and Web 2.0 technologies.

The evolution of social networks with the new paradigm shift is currently defining to a great extent the type, the impact, and the dynamics of action, reaction, and interaction of the involved participants in a social network. According to Atton (2003), alternative journalism is an ongoing effort to review and challenge the dominant approaches to journalism. The structure of this alternative journalistic practice appears as the counterbalance to traditional and conventional media production and disrupts its dominant
forms, namely, the institutional dimension of mainstream media, the phenomena of capitalization and commercialization, and the growing concentration of ownership.

Citizen journalism is based on the assumption that the public space is in crisis (institutions, politics, journalism, political parties). It appears as an effort to democratize journalism and thereby questions the added value of objectivity, which is upheld by professional journalism.

The debate on a counterweight to professional, conventional, mainstream journalism intensified around 1993, when the signs of fatigue and the loss of the public's confidence in journalism became visible and overlapped with the innovative potentials of the new interactive technologies. The term public journalism appeared in the USA in 1993 as part of a movement that expressed concerns about the detachment of journalists and news organizations from citizens and communities, as well as of US citizens from public life. However, the term citizen journalism has been defined on various levels. If both its supporters and critics agree on one core thing, it is that it means different things to different people.

The developments that Web 2.0 has introduced and the subsequent explosive growth of social media and networks mark the third phase of public journalism and its transformation into alternative journalism. The field of information and communication is transformed into a more participatory media ecosystem, which turns news into a social experience. News is transformed into a participatory activity to which people contribute their own stories and experiences and their reactions to events.

Citizen journalism proposes a different model of selection and use of sources and of news practices, and a redefinition of journalistic values. Atton (2003) traces the conflict with traditional, mainstream journalism to three key points: (a) power does not come exclusively from the official institutions and the professional category of journalists, (b) reliability and validity can derive from descriptions of lived experience and not only from objectively detached reporting, and (c) it is not mandatory to separate the facts from subjective opinion. Although Atton (2003) does not consider lived experience an absolute value, he believes it can constitute the added value of alternative journalism when combined with the capability of recording it through documented reports.

The purpose of citizen journalism is to reverse the "hierarchy of access" as it was identified by the Glasgow University Media Group, giving voice to the ones marginalized by the mainstream media. While mainstream media rely extensively on elite groups, alternative media can offer a wider range of "voices" that wait to be heard. The practices of alternative journalism provide "first-hand" evidence, as well as collective and anti-hierarchical forms of organization and a participatory, radical approach to citizen journalism. This form of journalism is identified by Atton as native reporting.

To describe the moving boundary between news producers and the public, Bruns (2005) used the term produsers, combining the words and concepts of producers and users. These changes determine the way in which power relations in the media industry and journalism are changing, shifting power from journalists to the public.

Social Movements

In the last few years, we have witnessed a growing heated debate among scholars, politicians, and journalists regarding the role of the Internet in contemporary social movements. Social media tools such as Facebook, Twitter, and YouTube, which facilitate and support user-generated content, have taken up a leading role in the development and coordination of a series of recent social movements, such as the student protests in Britain at the end of 2010 as well as the outbreak of revolution in the Arab world, the so-called Arab Spring.

The open and decentralized character of the Internet has inspired many scholars to envisage a rejuvenation of democracy, focusing on the (latent) democratic potentials of the new media as interactive platforms that can motivate and fulfill the active participation of the citizens in
the political process. On the other hand, Internet skeptics suggest that the Internet will not itself alter traditional politics. On the contrary, it can generate a very fragmented public sphere based on isolated private discussions, while the abundance of information, in combination with the vast amounts of offered entertainment and the options for personal socializing, can lead people to withdraw from public life. The Internet actually offers a new venue for information provision to the citizen-consumer. At the same time, it allows politicians to establish direct communication with the citizens, free from the norms and structural constraints of traditional journalism.

Social media aspire to create new opportunities for social movements. Web 2.0 platforms allow protestors to collaborate so that they can quickly organize and disseminate a message across the globe. By enabling the fast, easy, and low-cost diffusion of protest ideas, tactics, and strategies, social media and networks allow social movements to overcome problems historically associated with collective mobilization.

Over the last years, the center of attention was not the Western societies, which used to be regarded as the technology-literate and information-rich part of the world, but the Middle Eastern ones. Especially after 2009, there is considerable evidence advocating in favor of the empowering, liberating, and yet engaging potentials of online social media and networks, as in the case of the protesters in Iran who actively used Web services like Facebook, Twitter, Flickr, and YouTube to organize, attract support, and share information about street protests after the June 2009 presidential elections. More recently, a revolutionary wave of demonstrations has swept the Arab countries as the so-called Arab Spring, using again the social media as means for raising awareness, communication, and organization, while facing at the same time strong Internet censorship. Though this neglects the complexity of these transformations, the uprisings were widely described as "the Facebook revolution," demonstrating the power of networks.

On the European continent, we have witnessed the recent development of the Indignant Citizens Movement, whose origin was largely attributed to the social movements that started in Spain and then spread to Portugal, the Netherlands, the UK, and Greece. In these cases, the digital social networks have proved powerful means to convey demands for a radical renewal of politics based on a stronger and more direct role of citizens and on a critique of the functioning of Western democratic systems.

Cross-References

▶ Digital Literacy
▶ Open Data
▶ Social Network Analysis

Further Reading

Atton, C. (2003). What is 'alternative' journalism? Journalism: Theory, Practice and Criticism, 4(3), 267–272.
Boyd, D. M., & Ellison, N. B. (2007). Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1), 210–230.
Bruns, A. (2005). Gatewatching: Collaborative online news production. New York: Peter Lang.
Castells, M. (2000). The rise of the network society, the information age: Economy, society and culture, vol. I. Oxford: Blackwell.
Fuchs, C. (2014). Social media: A critical introduction. London: Sage.
Jenkins, H. (2006). Convergence culture: Where old and new media collide. New York: New York University Press.

Social Media and Security

Samer Al-khateeb1 and Nitin Agarwal2
1Creighton University, Omaha, NE, USA
2University of Arkansas at Little Rock, Little Rock, AR, USA

Introduction

In a relatively short period of time, online social networks (OSNs) such as Twitter, Facebook, YouTube, and blogs have revolutionized how societies interact. While this new phenomenon in
online socialization has brought the world closer, OSNs have also led to new vectors to facilitate cybercrime, cyberterrorism, cyberwarfare, and other deviant behaviors perpetrated by state/non-state actors (Agarwal et al. 2017; Agarwal and Bandeli 2018; Galeano et al. 2018; Al-khateeb and Agarwal 2019c).

Since OSNs are continuously producing data with heightened volume, variety, veracity, and velocity, traditional methods of forensic investigation would be insufficient, as this data would be real time, constantly expanding, and simply not found in traditional sources of forensic evidence (Huber et al. 2011; Al-khateeb et al. 2016). These newer forms of data, such as the communications of hacker groups on OSNs, would offer insights into, for example, coordination and planning (Al-khateeb et al. 2016, 2018). Social media is growing as a data source for cyber forensics, providing new types of artifacts that can be relevant to investigations (Baggili and Breitinger 2015). Al-khateeb and Agarwal (2019c) identified key social media data types (e.g., text posts, friends/groups, images, geolocation data, demographic information, videos, dates/times), as well as their corresponding applications to cyber forensics (author attribution, social network identification, facial/object recognition, personality profiling, location finding, cyber-profiling, deception detection, event reconstruction, etc.). Practitioners must embrace the idea of using real-time intelligence to assist in cyber forensic investigations, and not just postmortem data.

Due to the anonymity and the lower perceived personal risk of connecting and acting online, deviant groups are becoming increasingly common among socio-technically competent "hacktivist" groups that provoke hysteria, coordinate (cyber)attacks, or even effect civil conflicts. Such deviant groups are categorized as the new face of transnational crime organizations (TCOs) that could pose significant risks to social, political, and economic stability. Online deviant groups have grown in parallel with OSNs, whether it is:

• Black hat hackers who use Twitter to recruit and arm attackers, announce operational details, coordinate cyberattacks (Al-khateeb et al. 2016), and post instructional or recruitment videos on YouTube targeting certain demographics
• State/non-state actors' and extremist groups' (such as ISIS') savvy use of social communication platforms to make their message viral by using social bots (Al-khateeb and Agarwal 2015c)
• Actors who conduct phishing operations, such as virally retweeting a message containing an image that, if clicked, unleashes malware (Calabresi 2017)

The threat these deviant groups pose is real and can manifest in several forms of deviance, such as the disabling of critical infrastructure (e.g., the Ukraine power outage caused by Russian-sponsored hackers that coordinated a cyberattack in December 2015) (Volz and Finkle 2016). All this necessitates expanding the traditional definitions of cyber threats from hardware attacks and malware infections to include such insidious threats that influence behaviors and actions, using social engineering and influence operations (Carley et al. 2018). Observable malicious behaviors in OSNs, similar to the aforementioned ones, continue to negatively impact society, warranting scientific inquiry. It would benefit the information assurance (IA) domain, and its respective subdomains, to conduct novel research on the phenomenon of deviant behavior in OSNs and especially the communications on social platforms pertaining to the online deviant groups.

Definitions

Below are some of the terms that are used in topics related to social media and security and that are also frequently used in this entry. Online deviant groups (ODGs) refer to groups of individuals that are connected online using social media platforms or the dark web and have an interest in conducting deviant acts or events (e.g., disseminating false information, hacking). These events or acts are unusual, unaccepted, and illegal and can have significant harmful effects on society and the public in general. ODGs conduct their activities for various financial or ideological purposes, because these ODGs could include state and
non-state actors, e.g., the so-called Islamic State in Iraq and Levant (ISIL), anti-NATO propagandists (Al-khateeb et al. 2016), Deviant Hackers Networks (DHNs) (Al-khateeb et al. 2016), and Internet trolls (Sindelar 2014). ODGs can conduct various deviant acts such as Deviant Cyber Flash Mobs (DCFM); online propaganda, misinformation, or disinformation dissemination; and, recently, deepfakes.

Flash mob (FM) is a form of public engagement, which, according to Oxford dictionaries, is defined as "a large public gathering at which people perform an unusual or seemingly random act and then quickly disperse" (Oxford-Dictionary 2004). Recent observations pertaining to the deviant aspect of flash mobs have added a highly debated perspective concerning the nature of the flash mob, i.e., whether it is for entertainment, satire, and artistic expression (e.g., a group of people gathering and dancing in a shopping mall), or it is a deviant act that can lead to robberies and thefts, such as the "bash mob" that happened in Long Beach, California on July 9, 2013 (Holbrook 2013). Deviant Cyber Flash Mobs (DCFM) are defined as the cyber manifestation of flash mobs (FM). They are known to be coordinated via social media, telecommunication devices, or emails and have a harmful effect on one or many entities such as government(s), organization(s), society(ies), and country(ies). These DCFMs can affect the physical space, cyberspace, or both, i.e., the "cybernetic space" (Al-khateeb and Agarwal 2015b).

Organized propaganda, misinformation, or disinformation campaigns by groups of individuals using social media, e.g., Twitter, are considered an instance of a DCFM (Al-khateeb and Agarwal 2015a). Examples include the dissemination of ISIL's beheading video-based propaganda of the Egyptian Copts in Libya (Staff 2015), the Arab-Israeli "Spy" in Syria (editorial 2015), and the Ethiopian Christians in Libya (Shaheen 2015). ISIL's Internet recruitment propaganda, or the E-Jihad, is very effective in attracting new group members (News 2014). For example, a study conducted by Quiggle (2015) on the effects of ISIL members developing high-production-value beheading videos and releasing them on social media shows that ISIL's disseminators are excellent narrators and choose their symbols very carefully to give the members of the groups a feeling of pride as well as cohesion. The beheading of civilians has been studied in the literature by Regina Janes (2005). In her study, Janes categorized the reasons why beheading is done into four main categories, viz., judicial, sacrificial, presentational, and trophy. ISIL's communicators designed the beheading videos to serve all four categories.

In addition to the aforementioned acts, ODGs are increasingly disseminating deepfake videos. Deepfake is defined as a technology that uses a specific type of artificial intelligence algorithm called a "generative adversarial network" to alter or produce fake videos. This technique has been used in the past, probably by technically savvy hobbyists, to create pornographic videos of various celebrities; however, in many recent cases, these videos targeted political figures such as the current president Donald Trump, ex-president Barack Obama, Nancy Pelosi, etc. The sophistication and ease of use of these algorithms gave anyone the capability to produce high-quality deepfaked videos that are nearly impossible for humans or machines to distinguish from real ones. These videos are very dangerous, as they can mislead citizens into believing various lies (imagine a deepfaked video showing the president of a specific nation saying that they have just launched a nuclear attack on another nation; if this video is taken seriously by the adversary nation, it can lead to a war or an international catastrophe) and can also lead citizens to distrust real videos (Purdue 2019; "What is deepfake (deep fake AI)?" 2019).
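The following minimal sketch, which is not part of the original entry, illustrates the adversarial training idea behind a generative adversarial network on one-dimensional toy data rather than video frames. The network sizes, the toy data, and the training settings are assumptions made purely for illustration; real deepfake systems use far larger models operating on images and video.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" data: samples from a 1-D Gaussian standing in for real media.
def real_batch(n):
    return torch.randn(n, 1) * 1.5 + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    # Discriminator step: learn to separate real samples from generated ones.
    real = real_batch(64)
    fake = generator(torch.randn(64, 8)).detach()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: produce samples the discriminator labels as real.
    fake = generator(torch.randn(64, 8))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

with torch.no_grad():
    samples = generator(torch.randn(1000, 8))
# With enough steps, the generated distribution should drift toward the "real" one.
print("generated mean/std:", samples.mean().item(), samples.std().item())
```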
Many of the ODGs conduct their deviant activities using deviant actors, who can be real humans (e.g., Internet trolls) or nonhuman (e.g., social bots). Internet trolls are deviant groups who flourished as the Internet became more social, i.e., with the advent of social media. These groups disseminate provocative posts on social media for the troll's amusement or for financial incentives (Davis 2009; Indiana University 2013; Moreau n.d.). Their provocative posts, e.g., insulting a specific person or group, or posting false information or propaganda on popular social media sites, result in a flood of angry responses and often hijack the discussion (Davis 2009; Moreau n.d.; Sindelar 2014). Such "troll armies" (or "web brigades") piggyback on the popularity of social media to disseminate fake pictures and videos, coordinating effective disinformation campaigns to which even legitimate news organizations sometimes fall prey (Sindelar 2014).

In addition to the human actors, ODGs use nonhuman actors such as social bots to conduct their deviant acts. Social bots are computer programs that can be designed and scheduled to perform various tasks on behalf of the bot's owner or creator. Research shows that most Internet traffic, especially on social media, is generated by bots (Cheng and Evans 2009). Bots can have benign intentions, such as Woebot, a chatbot that helps people track their mood and gives them therapeutic advice; however, they can also have malicious intents, such as hacker bots, spambots, and deviant social bots, which are designed to act like humans in social spaces, e.g., social media, and can influence people's opinions by disseminating propaganda, disinformation, etc. (@botnerds 2017).

State of the Art

Very little scientific treatment has been given to the topic of social cyber forensics (Carley et al. 2018; Al-khateeb and Agarwal 2019b). Most cyber security research in this direction to date (e.g., Al Mutawa et al. 2012; Mulazzani et al. 2012; Walnycky et al. 2015) has focused on the acquisition of social data from digital devices and the applications installed on them. However, this data would be based on the analysis of the more traditional sources of evidence found on systems and devices, such as file systems and captured network traffic. Since online social networks (OSNs) are continuously creating and storing data on multiple servers across the Internet, traditional methods of forensic investigation would be insufficient (Huber et al. 2011). As OSNs continuously replace traditional means of digital storage, sharing, and communication (Galeano et al. 2019), collecting this ever-growing volume of data is becoming a challenge. Within the past decade, data collected from OSNs has already played a major role as evidence in criminal cases, either as incriminating evidence or to confirm alibis. Interestingly, despite the growing importance of data that can be extracted from OSNs, there has been little academic research aimed at developing and enhancing techniques to effectively collect and analyze this data (Baggili and Breitinger 2015).

In this entry, the aim is to take steps toward bridging the gap between cyber security, big data analytics, and social computing. For instance, in one such study, Al-khateeb et al. (2016) collected the Twitter communications network of known hacker groups and analyzed their messages and network for several weeks (Fig. 1). After applying advanced text analysis and social network analysis techniques, it was observed that the hacktivist groups @OpAnonDown and @CypherLulz communicated with each other far more than the rest of the nodes. Similarly, members of the "think tank" group and the "cult of the dead cow" group are very powerful/effective in coordination strategies. Furthermore, these groups use Twitter highly effectively to spread their messages via hashtags such as #TangoDown (indicating a successful attack), #OpNimr (calling for DDOS attacks on Saudi Arabian Government websites), #OpBeast (calling for DDOS attacks on animal rights groups' websites), among others.

Social Media and Security, Fig. 1 Communication network of black hat hacker accounts on Twitter

Based on work conducted by some of the researchers in this domain, it is clear that OSNs contain vast amounts of important and often publicly accessible data that can service cyber forensics and related disciplines. A progression must thus be made toward developing and/or adopting methodologies to effectively collect and analyze evidentiary data extracted from OSNs and leverage them in relevant domains outside of classical information sciences.
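As a hedged illustration of the kind of social network analysis described above, the sketch below builds a small directed communication network and ranks accounts by simple centrality measures. The account handles echo those mentioned in the study, but every edge, weight, and additional account in this example is synthetic and assumed only for demonstration.

```python
import networkx as nx

# Synthetic directed mention/retweet edges among hacker-group accounts; the two named
# handles echo the discussion above, but all ties and weights here are invented.
edges = [
    ("@OpAnonDown", "@CypherLulz", 9), ("@CypherLulz", "@OpAnonDown", 7),
    ("@OpAnonDown", "@member_a", 2), ("@member_a", "@CypherLulz", 1),
    ("@member_b", "@OpAnonDown", 3), ("@member_c", "@member_b", 1),
]
g = nx.DiGraph()
g.add_weighted_edges_from(edges)

# Weighted in-degree and betweenness centrality highlight accounts that dominate the flow.
in_strength = dict(g.in_degree(weight="weight"))
betweenness = nx.betweenness_centrality(g)

for account in sorted(g.nodes, key=lambda n: in_strength.get(n, 0), reverse=True):
    print(f"{account:15s} in-strength={in_strength.get(account, 0):2d} "
          f"betweenness={betweenness[account]:.2f}")
```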
Research Methods

To accomplish the aforementioned research thrusts, socio-computational models are developed to advance our understanding of online deviant groups' (ODGs) networks (Al-khateeb 2017; Al-khateeb and Agarwal 2019a, b), grounded in the dynamics of various social and communication processes such as group formation, activation, decentralized decision-making, and collective action. Leveraging cyber forensics and deep web search-based methodologies, the study extracts relevant open-source information in a guided snowball data collection manner. Further, existing research helps in identifying key actors (Agarwal et al. 2012) and key groups (Sen et al. 2016) responsible for coordinating cyberattacks. At a more fundamental level, embracing the theories of collective action and collective identity formation, the research identifies the necessary conditions that lead to the success or failure of coordinated cyberattacks (e.g., phishing campaigns), explains the risk and motivation trade-off governing the sustenance of such coordinated acts, and develops predictive models of ODGs.

Social Media and Security, Fig. 2 Social media data collection and curation methodology

The methodology can be separated into two main phases: (1) data acquisition and (2) data analysis and model development. The main tasks of phase 1 (Fig. 2) include identifying keywords, events, and cyber incidents, selecting reliable and optimal social media sources, and collecting the information from relevant social media sources and metadata using social cyber forensics. The data is largely unstructured and noisy, which warrants cleaning, standardization, normalization, and curation before proceeding to phase 2. Phase 2 entails categorizing cyber incidents, analyzing incident reports with geolocations to identify geospatial diffusion patterns, correlating the identified incidents to current news articles, and examining ODGs' social and communication networks to identify prominent actors and groups, their tactics, techniques, procedures (TTPs), and coordination strategies. A multilayered network analysis approach (Fig. 3) is adopted to model multisource, supra-dyadic relations, and shared affiliations among DGNs.

Social Media and Security, Fig. 3 Multilayer network analysis of deviant groups' social media communications
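A minimal sketch of the multilayer idea follows: the same set of actors is connected by edges of different types (a communication layer and a shared-affiliation layer), and activity is then summarized per layer. The actors, groups, and ties below are assumptions invented for illustration; they are not drawn from the study's data.

```python
import networkx as nx
from collections import defaultdict

# Toy multilayer network: the same actors connected by "mention" communication ties
# and "affiliation" ties to shared groups/hashtags. All nodes and edges are invented.
g = nx.MultiGraph()
g.add_edge("actor_1", "actor_2", layer="mention")
g.add_edge("actor_2", "actor_3", layer="mention")
g.add_edge("actor_1", "group_x", layer="affiliation")
g.add_edge("actor_2", "group_x", layer="affiliation")
g.add_edge("actor_3", "group_y", layer="affiliation")

# Per-layer degree: how active each node is within each relation type.
per_layer_degree = defaultdict(lambda: defaultdict(int))
for u, v, data in g.edges(data=True):
    layer = data["layer"]
    per_layer_degree[layer][u] += 1
    per_layer_degree[layer][v] += 1

for layer, degrees in per_layer_degree.items():
    top = max(degrees, key=degrees.get)
    print(f"layer={layer:12s} most connected node: {top} ({degrees[top]} ties)")
```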
Research Challenges

Prominent challenges that the research is confronted with include unstructured data, data quality issues/noisy social media data, privacy, and data ethics. Collecting data from social media platforms poses certain limitations, including sample bias and missing and noisy information, along with other data quality issues, data collection restrictions, and privacy-related issues. Agarwal and Liu (2009) present existing state-of-the-art solutions to the problems mentioned above, in particular, unstructured data mining, noise filtering, and data collection from social media platforms. Due to privacy concerns, data collection can only be observational in nature. Furthermore, only publicly available information can be collected, and all personally identifiable information needs to be stripped before publishing the data, results, or studies. Collecting cyber incidents from social media is prone to sample bias due to the inherent demographic bias among social media users. The research needs to evaluate this bias by comparing the social media accounts with mainstream media reports on a longitudinal basis and developing corrective measures for reliable analysis.

Conclusion

In conclusion, as good as social media is at connecting people around the globe, forming communities and partnerships, getting customer feedback on various products, getting real-time news updates, marketing, and advertising, it also poses security risks to society, intrudes on privacy, helps in radicalizing citizens, provokes hysteria, and provides a fertile ground for ODGs to conduct various deviant events. In the era of big data and exascale computing (i.e., the fastest supercomputers in the world), the government, industry, academia, and the public should work together to dismantle any risk that can be posed by social media. Although there are currently various initiatives trying to address the many security risks posed by social media, e.g., the HONR Network, a company that helps the families of people affected by mass shootings deal with the aftermath of possible misinformation (Thompson 2019), and the Facebook "Deepfake Detection Challenge," which aims at creating deepfake videos that can be used by the artificial intelligence (AI) research community to help their algorithms detect fake
videos (Metz 2019), more efforts should be invested.

Further Reading

Social media and security is an inherently multidisciplinary and multi-methodological area of computational social science. Researchers in this area employ multi-technology computational social science tool chains (Benigni and Carley 2016) that combine network analysis and visualization (Carley et al. 2016), language technologies (Hu and Liu 2012), data mining and statistics (Agarwal et al. 2012a), spatial analytics (Cervone et al. 2016), and machine learning (Wei et al. 2016). The theoretical results and analytics are often multilevel, focusing simultaneously on change at the community and conversation level, change at the individual and group level, and so forth.

Acknowledgments This research is funded in part by the US National Science Foundation (OIA-1920920, IIS-1636933, ACI-1429160, and IIS-1110868), US Office of Naval Research (N00014-10-1-0091, N00014-14-1-0489, N00014-15-P-1187, N00014-16-1-2016, N00014-16-1-2412, N00014-17-1-2605, N00014-17-1-2675, N00014-19-1-2336), US Air Force Research Lab, US Army Research Office (W911NF-16-1-0189), US Defense Advanced Research Projects Agency (W31P4Q-17-C-0059), Arkansas Research Alliance, and the Jerry L. Maulden/Entergy Endowment at the University of Arkansas at Little Rock. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations. The researchers gratefully acknowledge the support.

References

@botnerds. (2017). Types of bots: An overview of chatbot diversity | botnerds.com. Retrieved September 7, 2019, from Botnerds website: http://botnerds.com/types-of-bots/.
Agarwal, N., & Bandeli, K. (2018). Examining strategic integration of social media platforms in disinformation campaign coordination. Journal of NATO Defence Strategic Communications, 4, 173–206.
Agarwal, N., & Liu, H. (2009). Modeling and data mining in blogosphere. San Rafael, California (USA): Morgan & Claypool.
Agarwal, N., Liu, H., Tang, L., & Yu, P. (2012a). Modeling blogger influence in a community. Social Network Analysis and Mining, 2(2), 139–162. Springer.
Agarwal, N., Kumar, S., Gao, H., Zafarani, R., & Liu, H. (2012b). Analyzing behavior of the influentials across social media. In L. Cao & P. Yu (Eds.), Behavior computing (pp. 3–19). London: Springer. https://doi.org/10.1007/978-1-4471-2969-1_1.
Agarwal, N., Al-khateeb, S., Galeano, R., & Goolsby, R. (2017). Examining the use of botnets and their evolution in propaganda dissemination. Journal of NATO Defence Strategic Communications, 2, 87–112.
Al Mutawa, N., Baggili, I., & Marrington, A. (2012). Forensic analysis of social networking applications on mobile devices. Digital Investigation, 9, S24–S33.
Al-khateeb, S. (2017). Studying online deviant groups (ODGs): A socio-technical approach leveraging social network analysis (SNA) & social cyber forensics (SCF) techniques – ProQuest. Ph.D. dissertation, University of Arkansas at Little Rock. Retrieved from https://search.proquest.com/openview/fd4ee2e2719ccf1327e03749bf450a96/1?pq-origsite=gscholar&cbl=18750&diss=y.
Al-khateeb, S., & Agarwal, N. (2015a). Analyzing deviant cyber flash mobs of ISIL on Twitter. In Social computing, behavioral-cultural modeling, and prediction (pp. 251–257). UCDC Center, Washington DC, USA: Springer.
Al-khateeb, S., & Agarwal, N. (2015b). Analyzing flash mobs in cybernetic space and the imminent security threats: A collective action based theoretical perspective on emerging sociotechnical behaviors. In 2015 AAAI spring symposium series. Palo Alto, California: Association for the Advancement of Artificial Intelligence.
Al-khateeb, S., & Agarwal, N. (2015c). Examining botnet behaviors for propaganda dissemination: A case study of ISIL's beheading videos-based propaganda (pp. 51–57). Atlantic City, New Jersey, USA: IEEE.
Al-khateeb, S., & Agarwal, N. (2019a). Deviance in social media. In S. Al-khateeb & N. Agarwal (Eds.), Deviance in social media and social cyber forensics: Uncovering hidden relations using open source information (OSINF) (pp. 1–26). Cham: Springer. https://doi.org/10.1007/978-3-030-13690-1_1.
Al-khateeb, S., & Agarwal, N. (2019b). Deviance in social media and social cyber forensics: Uncovering hidden relations using open source information (OSINF). Cham: Springer.
Al-khateeb, S., & Agarwal, N. (2019c). Social cyber forensics: Leveraging open source information and social network analysis to advance cyber security informatics. Computational and Mathematical Organization Theory. https://doi.org/10.1007/s10588-019-09296-3.
Al-khateeb, S., Conlan, K. J., Agarwal, N., Baggili, I., & Breitinger, F. (2016). Exploring deviant hacker
Al-khateeb, S., Hussain, M. N., & Agarwal, N. (2017a). Social cyber forensics approach to study Twitter's and blogs' influence on propaganda campaigns. In D. Lee, Y.-R. Lin, N. Osgood, & R. Thomson (Eds.), Social, cultural, and behavioral modeling (pp. 108–113). Washington, D.C., USA: Springer International Publishing.
Al-khateeb, S., Hussain, M., & Agarwal, N. (2017b). Chapter 12: Analyzing deviant socio-technical behaviors using social network analysis and cyber forensics-based methodologies. In O. Savas & J. Deng (Eds.), Big data analytics in cybersecurity and IT management. New York: CRC Press, Taylor & Francis.
Al-khateeb, S., Hussain, M., & Agarwal, N. (2018). Chapter 2: Leveraging social network analysis & cyber forensics approaches to study cyber propaganda campaigns. In T. Ozyer, S. Bakshi, & R. Alhajj (Eds.), Social network and surveillance for society (Lecture notes in social networks) (pp. 19–42). Springer International Publishing AG, part of Springer Nature: Springer.
Baggili, I., & Breitinger, F. (2015). Data sources for advancing cyber forensics: What the social world has to offer. In 2015 AAAI spring symposium series. Stanford University, CA.
Benigni, M., & Carley, K. M. (2016). From tweets to intelligence: Understanding the Islamic jihad supporting community on Twitter. In K. Xu, D. Reitter, D. Lee, & N. Osgood (Eds.), SBP-BRiMS 2016 (Lecture notes in computer science) (Vol. 9708, pp. 346–355). Cham: Springer. https://doi.org/10.1007/978-3-319-39931-7_33.
Calabresi, M. (2017). Inside Russia's social media war on America. Time. http://time.com/4783932/inside-russia-social-media-war-america/. Last accessed 26 Dec 2018.
Carley, K. M., Wei, W., & Joseph, K. (2016). High dimensional network analytics: Mapping topic networks in Twitter data during the Arab spring. In S. Cui, A. Hero, Z.-Q. Luo, & J. Moura (Eds.), Big data over networks. Boston: Cambridge University Press.
Carley, K., Cervone, G., Agarwal, N., & Liu, H. (2018). Social cyber-security. International conference on social computing, behavioral-cultural modeling and prediction – Behavioral representation in modeling and simulation (SBP-BRiMS), July 10–July 13, Washington, DC, USA, pp. 389–394.
Cervone, G., Sava, E., Huang, Q., Schnebele, E., Harrison, J., & Waters, N. (2016). Using Twitter for tasking remote-sensing data collection and damage assessment: 2013 Boulder flood case study. International Journal of Remote Sensing, 37(1), 100–124.
Cheng, A., & Evans, M. (2009). Inside Twitter: An in-depth look at the 5% of most active users. Retrieved from Sysomos website: http://sysomos.com/insidetwitter/mostactiveusers.
Davis, Z. (2009, March 24). Definition of: Trolling [Encyclopedia]. Retrieved April 4, 2017, from PCMAG.COM website: http://www.pcmag.com/encyclopedia/term/53181/trolling#.
editorial, T. news. (2015). ISIL executes an Israeli Arab after accusing him of been an Israeli spy. TV7 Israel News. http://www.tv7israelnews.com/isil-executes-an-israeli-arab-after-accusing-him-of-been-an-israeli-spy/. Last checked: June 11, 2015.
Galeano, R., Galeano, K., Al-khateeb, S., Agarwal, N., & Turner, J. (2018). Chapter 10: Botnet evolution during modern day large scale combat operations. In C. M. Vertuli (Ed.), Large scale combat operations: Information operations: Perceptions are reality. Army University Press.
Galeano, K., Galeano, R., Al-khateeb, S., & Agarwal, N. (2019). Studying the weaponization of social media: A social network analysis and cyber forensics informed exploration of disinformation campaigns. In Open source intelligence and security informatics. Springer. (forthcoming).
Holbrook, B. (2013). LBPD prepared for potential bash mob event. In Everything Long Beach. http://www.everythinglongbeach.com/lbpd-prepared-for-potential-bash-mob-event/. Last checked: August 15, 2014.
Hu, X., & Liu, H. (2012). Text analytics in social media. In C. Aggarwal & C. Zhai (Eds.), Mining text data (pp. 385–414). Boston: Springer. https://doi.org/10.1007/978-1-4614-3223-4_12.
Huber, M., Mulazzani, M., Leithner, M., Schrittwieser, S., Wondracek, G., & Weippl, E. (2011). Social snapshots: Digital forensics for online social networks. In Proceedings of the 27th annual computer security applications conference (pp. 113–122). Orlando, Florida, USA.
Indiana University. (2013, January 3). What is a troll? [University Information Technology Services]. Retrieved April 4, 2017, from Indiana University Knowledge Base website: https://kb.iu.edu/d/afhc.
Janes, R. (2005). Losing our heads: Beheadings in literature and culture. NYU Press.
Metz, R. (2019, September 5). Facebook is making deepfake videos to help fight them [CNN]. Retrieved September 7, 2019, from https://www.cnn.com/2019/09/05/tech/facebook-deepfake-detection-challenge/index.html.
Moreau, E. (n.d.). Here's what you need to know about internet trolling. Retrieved February 6, 2018, from Lifewire website: https://www.lifewire.com/what-is-internet-trolling-3485891.
Mulazzani, M., Huber, M., & Weippl, E. (2012). Social network forensics: Tapping the data pool of social networks. In Eighth annual IFIP WG (Vol. 11). University of Pretoria, Pretoria, South Africa: Springer.
News, C. B. S. (2014). ISIS recruits fighters through powerful online campaign. http://www.cbsnews.com/news/isis-uses-social-media-to-recruit-western-allies/. Last checked: July 1, 2015.
Oxford-Dictionary. (2004). Definition of flash mob from Oxford English Dictionaries Online. In Oxford English Dictionaries. http://www.oxforddictionaries.com/defnition/english/flash-mob. Last checked: August 22, 2014.
Purdue, M. (2019, August 14). Deepfake 2020: New artificial intelligence is battling altered videos before elections. Retrieved September 6, 2019, from USA TODAY website: https://www.usatoday.com/story/tech/news/2019/08/14/election-2020-company-campaigns-against-political-deepfake-videos/2001940001/.
Quiggle, D. (2015). The ISIS beheading narrative. Small Wars Journal. Retrieved from https://smallwarsjournal.com/jrnl/art/the-isis-beheading-narrative.
Sen, F., Wigand, R., Agarwal, N., Yuce, S., & Kasprzyk, R. (2016). Focal structures analysis: Identifying influential sets of individuals in a social network. Journal of Social Network Analysis and Mining, 6(1), 1–22. Springer.
Shaheen, K. (2015). Isis video purports to show massacre of two groups of Ethiopian Christians. The Guardian. http://www.theguardian.com/world/2015/apr/19/isis-video-purports-to-show-massacre-of-two-groups-of-ethiopian-christians. Last checked: June 11, 2015.
Sindelar, D. (2014). The Kremlin's troll Army: Moscow is financing legions of pro-Russia internet commenters. But how much do they matter? The Atlantic. Retrieved from http://www.theatlantic.com/international/archive/2014/08/the-kremlins-troll-army/375932/.
Staff, C. (2015, February 16). ISIS video appears to show beheadings of Egyptian Coptic Christians in Libya [News Website]. Retrieved January 23, 2017, from CNN website: http://www.cnn.com/2015/02/15/middleeast/isis-video-beheadings-christians/.
Thompson, N. (2019, July 10). A Grieving Sandy Hook Father on How to Fight Online Hoaxers. Retrieved September 7, 2019, from Medium website: https://onezero.medium.com/a-grieving-sandy-hook-father-on-how-to-fight-online-hoaxers-ce2e0ef374c3.
Volz, D., & Finkle, J. (2016). U.S. helping Ukraine investigate power grid hack. Reuters. January 12. https://www.reuters.com/article/us-ukraine-cybersecurity-usa-idUSKCN0UQ24020160112. Last accessed 26 Dec 2018.
Walnycky, D., Baggili, I., Marrington, A., Moore, J., & Breitinger, F. (2015). Network and device forensic analysis of android social-messaging applications. Digital Investigation, 14, S77–S84.
Wei, W., Joseph, K., Liu, H., & Carley, K. M. (2016). Exploring characteristics of suspended users and network stability on Twitter. Social Network Analysis and Mining, 6(1), 51.
What is deepfake (deep fake AI)? – Definition from WhatIs.com. (2019). Retrieved September 6, 2019, from WhatIs.com website: https://whatis.techtarget.com/definition/deepfake.

Social Network Analysis

Magdalena Bielenia-Grajewska
Division of Maritime Economy, Department of Maritime Transport and Seaborne Trade, University of Gdansk, Gdansk, Poland
Intercultural Communication and Neurolinguistics Laboratory, Department of Translation Studies, University of Gdansk, Gdansk, Poland

Social Network Analysis: Origin and Introduction

The origins of Social Network Theory can be observed in the works of such sociologists as Ferdinand Tönnies, Émile Durkheim, and Georg Simmel, as well as in the works devoted to sociometry, such as the one by Jacob Moreno on sociograms. Moreover, researchers interested in holism study the importance of structure over individual entities and the way structures govern the performance of people. Although the interest in social networks can be traced back to the previous centuries, its great popularity can be observed in modern times. The reasons for this state are as follows. First of all, technology has led to the proliferation of social networks, nowadays also available on the web. Thus, an individual has the opportunity to enter into relationships not only in the “standard” way, but also in the online one, by participating in online discussion lists or social online networking tools. In addition, the performance of social networks in the offline mode is supported by the advancements of technology; an example can be the use of mobile telephones to stay in contact with other network members. Secondly, technological advancements have led to the emergence of data that require a proper methodological approach, taking into account the perspective of human beings as both authors and subjects of a given big data study. Thirdly, individuals are more and more conscious about the significance of social networks in their life. Starting from the
microlevel, people are embedded in different net- opportunities (e.g., LinkedIn). Another impor-
works, such as families, communities, or profes- tant factor for shaping social networks is lan-
sional groups. Looking at the issue from the guage. Individuals select the social networks
macrolevel, nowadays the world can be viewed that offer them the possibility to interact in the
as a complex system, being made of different language they know. It should be mentioned,
networks, of both national and international char- however, that the relation between social net-
acter, that concern, among others, such areas of works and language is mutual. Language does
life as economics, transportation, energy, as well not only shape social networks, being the tool of
as private and social life. The definition of a social efficient communication among network mem-
network provided by Wasserman and Faust bers. At the same time, social networks create
(1994) in Devan, Barnett and Kim (2011: 27) is linguistic repertoires, being visible in new terms
as follows: a social network is generally defined and forms of expressions created during the
as a system with a set of social actors and a interaction among network members. An exam-
collection of social relations that specify how ple can be the language used by the users of
these actors are relationally tied together. discussion lists who coin new terms to denote
the reality around them. Another important factor
for creating and sustaining social network is
Social Networks: Main Determinants and information. Networks are crucial because of
Characteristics the growing role of data, innovation, and knowl-
edge and new possibilities of creating and dis-
Technology belongs to the main factors of mod- seminating knowledge; only the ones who have
ern social networks since it offers the creation of access to information and can distribute it effec-
new types of social networks and supports the tively can compete on the modern market. Thus,
ones existing in the offline sphere. Taking into information is linked with competition visible
account modern technological advancements, from both organizational and individual perspec-
the Internet constitutes an important factor tives. Starting with the organizational dimension,
responsible for creating and sustaining the companies that cooperate with others in terms of
growth in the area of social networks. The devel- purchasing raw materials, production, and distri-
opments in the sphere of online communication bution have a chance to be successful on the
have led to the proliferation of social contacts competitive market. It should be stated, however,
existing on the web. Social networking tools are that competition is not exclusively the feature of
used for both professional and private purposes, business entities since individuals also have to be
being the place where an individual meets with competitive, e.g., on the job market. The interest
friends and family as well as with the ones he or in continuous education has led to the populari-
she does not know at all. Taking into account the zation of open universities or online courses,
mentioned online dimension of social networks, and, consequently, new social networks formed
these networks serve different purposes. Their within massive open online courses (MOOCs)
functionality can be viewed from the perspective have been created. The next determinant for
of synchronicity. Synchronous social networks forming social networks that should be stressed
require the real-time participation in discussion, is the need for belonging and socializing. Indi-
whereas asynchronous social networks do not viduals need the contact with others to share their
require immediate response by users. As far as feelings and emotions, to have fun and to quarrel.
the synchronous and asynchronous social net- In the case of those who have to be far away from
works are concerned, they are used, among their relatives and friends, online social networks
others, to talk (e.g., Skype), share photos or have become the sphere of socialization and
videos (e.g., Picassa and YouTube), connect interaction.
with friends, acquaintance, or customers (e.g., The mentioned multifactorial and multi-
Facebook), or search for professional aspectual character of social networks has resulted
in the intense studies on methodologies underly- expatriate, marrying a person of a higher social
ing the way social networks are formed and status, receiving a promotion). Relational ties are
exercised as well as their role for the environment determined by such factors as geographical envi-
in the micro, meso, and macro meaning. ronments and interior designs. Taking the exam-
ple of corporations, such notions as the division of
office space or the arrangement of managerial
Main Concepts and Terms in Social offices reflects the creation of social networks.
Network Analysis (SNA) Relational ties may also be shaped by, e.g., time
differences that determine the possibility of par-
Social Network Analysis can be defined as an ticipation in online networks. Relational ties may
approach that aims to study how the systems of also be governed by the type of access to commu-
grids, ties and lattices create human relations. nication tools, such as mobile telephones, social
Social Network Analysis focuses on both internal networking tools and the Internet. Relational ties
and external features shaping social networks, may be influenced by other types of networks,
studying individuals or organizations, and the such as transport or economic networks. In addi-
relations between them. Social networks are also tion, relational ties are connected with the flows of
studied by taking into account intercultural differ- ideas and things as well as the movement of peo-
ences. Applying the determinants used to charac- ple. For example, expatriates form social net-
terize national or professional cultures, works in host countries. Taking the reason into
researchers may study how social networks are consideration, relational ties may be formed for
formed and organized by taking into account the private and professional reasons. As far as the
attitude to hierarchy, punctuality, social norms, private domain is concerned, relational ties are
family values, etc. Social Network Analysis is connected with one’s need for emotional support,
mainly used in behavioral and social studies, but intimacy, or sharing common hobbies. Relational
it is also applied in different disciplines, such as ties are formed in a voluntary and involuntary
marketing, economics, linguistics, management, way. Voluntary relational ties are connected with
biology, neuroscience, cognitive studies, one’s free will to become the members of a group
etc. SNA relies on the following terminology, or close friendship with another person. On the
with many terms coming from the graph theory. other hand, involuntary relational ties may be of
As Wasserman and Faust (1994) stress, the fun- biological origin (family ties) or hierarchical
damental concepts in SNA are: actor, relational notions, such as relations at work. The number
tie, dyad, triad, subgroup, and group. Actors of actors participating in a given social network
(or vertices, nodes) are social entities, such as can be analyzed through the dyad or triad perspec-
individuals, companies, groups of people, com- tive. Dyad involves two actors and their relations,
munities, nation states, etc. An example of an whereas triads concern three actors and their rela- S
actor is a student at the university or a company tions. Another classification of networks includes
operating in one’s neighborhood. Ego is used to groups and subgroups. The set of relational ties
denote a focal actor. The characteristics of an actor constitutes relations. Another term discussed in
are called actor attributes. Relational ties consti- social actor network theory is the notion of struc-
tute the next important notions, being the linkages tural holes. Degenne and Forsé (1999) elaborate
used to transfer material and immaterial resources, on the concept of structural holes introduced by
such as information, knowledge, emotions, prod- Burt who states that the structural hole is
ucts, etc. They include one’s personal feelings and connected with non-redundancy between con-
opinions on other people or things (like, dislike, tacts. This concept is studied by these two
hatred, love), contacts connected with the change scholars through the prism of cohesion and equiv-
of ownership (purchasing and selling, giving and alence. According to the cohesive perspective
receiving things) or changes of geographical, presented by them, redundancy can be observed
social, or professional position (e.g., becoming when two of the egos’ relations have a direct link.
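To make the fundamental SNA vocabulary used in this entry concrete, the following minimal Python sketch builds a toy network with the open-source networkx library; the library choice, the actor names, and the attribute values are illustrative assumptions rather than part of this entry:

import networkx as nx

# Actors (nodes) with actor attributes; the names and roles are invented.
G = nx.Graph()
G.add_node("Anna", role="student")
G.add_node("Ben", role="manager")
G.add_node("Chris", role="student")
G.add_node("Dana", role="company")

# Relational ties (edges) linking the actors.
G.add_edges_from([("Anna", "Ben"), ("Anna", "Chris"), ("Ben", "Chris"), ("Ben", "Dana")])

print(G.number_of_edges())    # 4 ties, i.e., 4 connected dyads
print(nx.triangles(G))        # closed triads (triangles) each actor belongs to

# The ego network of a focal actor: the ego plus its direct contacts.
ego = nx.ego_graph(G, "Anna")
print(sorted(ego.nodes()))    # ['Anna', 'Ben', 'Chris']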
Consequently, when the cohesion is great, few horizontal and vertical networks. Vertical social
structural holes can be observed. They state that network encompass the relations between people
the approach of equivalence is connected with that occupy different positions in, e.g., hierarchi-
indirect relations in networks between the ego cal ladders. They include networks to be observed
and others. Structural holes exist when there are in professional and occupational settings, such as
no direct or indirect links between the ego and the organizations, universities, schools, etc. On the
contacts or when there is no structural equivalence other hand, horizontal social networks encompass
between them. members of an equal position in a given organi-
zation. Networks may also be classified by taking
into account their power and the strength of rela-
Types of Social Networks tions between networks members. Weak social
networks are the ones that are loosely composed,
Social networks can be categorized by taking into with fragile and loose relations between members,
account different dimensions. One of them is net- whereas in strong social networks the contacts are
work size; social networks vary as far as the very durable. Social networks can be classified by
number of network members is concerned. The taking into account the purpose why they were
second notion is the place where the social net- formed. For example, financial social networks
work is created and exercised. The main dichot- concern the money-related flows between mem-
omy is of technological nature; the distinction bers, whereas informational social networks focus
between online and offline social networks is on exchanging information. Networks may also
one of the most researched types. The next feature be studied by taking into account their flexibility.
that can subcategorize social networks is formal- Thus, fixed social networks rely on a strict
ity. Informal social networks are mainly used to arrangement, whereas flexible social networks do
socialize and entertain, whereas formal social net- not follow a fixed pattern of interactions. As far as
works encompass the contacts characterized by advantages of social networks are concerned, such
strict codes of behavior in a network, hierarchical notions as the access to information, creating
relations among network members, and regulated social relations can be named. As far as potential
methods of interaction. The next feature of social disadvantages are concerned, some state that net-
networks is uniformity. As Bielenia-Grajewska works demand the resign from independence. In
and Gunstone (2015) discuss, heterogeneous addition, to some extent the members of a network
social networks encompass members that differ bear responsibility for the mistakes made by other
as far as certain characteristics are concerned. On members since a single failure may influence the
the other hand, homogeneous social networks performance of the whole network. Analyzing
include similar network members. The types of more complex entity networks, they may demand
member compatibility differ, depending on the more energy and time to adjust to new conditions.
characteristics of social networks, and may be Social Networks can be divided into online social
connected with, e.g., profession, age, gender, networks and offline social networks. As far as
mother tongue, and hobby. In SNA terminology, online social networks are concerned, they can be
these networks are often described through the further subcategorized into asynchronous online
prism of homophily. Homophilous social net- social networks and synchronous social networks
works consist of people who are similar because (discussion on these networks provided above).
of age, gender, social status, cultural background, Social networks can be studied through the prism
profession, or hobbies. On the other hand, hetero- of purpose and the prism of investigation may
philous social networks are directed mainly at focus on the dichotomy of private and profes-
individuals that differ as far as their individual sional life. For example, professional social net-
attributes, social or professional positions are works may be categorized by taking into account
concerned. Bielenia-Grajewska (2014) also the notion of free will in network creation and
stresses that networks can also be divided into performance. Professional social networks
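One way to quantify the homophily discussed in this entry is attribute assortativity. The sketch below again assumes the networkx library; the actors and the "profession" attribute are invented for illustration:

import networkx as nx

G = nx.Graph()
professions = {"A": "lawyer", "B": "lawyer", "C": "teacher", "D": "teacher", "E": "lawyer"}
for person, profession in professions.items():
    G.add_node(person, profession=profession)

# Most ties connect actors sharing a profession, i.e., a homophilous pattern.
G.add_edges_from([("A", "B"), ("A", "E"), ("C", "D"), ("B", "C")])

# Values close to +1 indicate a homophilous network, values close to -1 a heterophilous one.
print(nx.attribute_assortativity_coefficient(G, "profession"))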
depend on the type of organizations. For example, a diary and note down the names of people they
at universities student and scientific networks can meet or interact with in some settings. The
be examined. They mainly include the relations methods that involve the direct participation of
formed at work or connected with work, as well as researchers include, e.g., observation and ethno-
the ones among specialists from the same disci- graphic studies. Thus, individuals and the rela-
pline. Private social networks are formed and tions between them are observed, e.g., in their
sustained in one’s free time to foster family rela- natural environment, such as at home or at work.
tions, participate in hobby, etc. The next dichot- Another way of using SNA is by conducting
omy may involve the notion of law and order. For interviews and surveys, asking respondents to
example, Social Network Analysis also studies answer questions related to their social networks.
illegal networks and the issue of crime in social As in the case of other social research, methods
networks. Another classification involves the can be divided into qualitative and quantitative
notion of entry conditions. Closed social networks ones. Social Network Analysis, as other methods
are aimed exclusively at carefully selected indi- used in social studies, may benefit from neuro-
viduals, providing barriers for entering them. scientific investigation, using such techniques as,
Examples of closed social networks are social e.g., fMRI or EEG to study emotions and
networks at work; the reason for their closeness involvement in communities or groups. Taking
is connected with the need for privacy and open- the growing role of the Internet, social networks
ness among the network users. On the contrary, are also analyzed by studying the interactions in
open social networks do not pose any entrance the online settings. Thus, SNA concerns the rela-
barriers for the users. Social networks may also be tions taking places in social online networking
categorized by taking into account their scope. tools, discussion forums, emails, etc. Social Net-
One of the ways to look at social networks is to work Analysis takes into account differences
divide them into local and global social networks. within the online places of interaction, by
Depending on other notions, the local character of observing the access to the online tool, types of
social networks is connected with the limited individuals the tool is directed at, etc. It should be
scope of its performance, being restricted, e.g., stressed that since social networks do not exist in
to the local community. Wasserman and Faust a vacuum, they should be studied by taking their
(1994) categorize networks by taking modes into environment into account. Thus, other network
account. The mode is understood as the number of approaches may prove useful to study the rela-
sets of entities. One-mode networks encompass a tion between different elements in systems. In
single set of entities, whereas two-mode networks addition, social network analysis studies not
involve two sets of entities or one set of entities only human beings and organizations and the
and one set of events. More complex networks are way they cooperate but also technological ele-
studied through the perspective of three or more ments are taken into account. For example, S
mode networks. Actor-Network-Theory (ANT) may prove useful
in the discussion on SNA since it facilitates the
understanding of the relations between living
Methods of Investigating Social and non-living entities in shaping social rela-
Networks tions. For example, computer network or tele-
phone networks and their influence on social
There are different ways of researching social networks may be studied. Moreover, since social
networks, depending what features of social net- networks are often created and exercised
works are to be investigated. For example, a in communication, modern methodological
researcher may study the influence of social net- approaches include discourse studies, such as
works on individuals or the types of social net- Critical Discourse Analysis, that stress how the
works and their implications for professional or selection of verbal and nonverbal elements of
private life. Participants may also be asked to run communication (e.g., drawings, pictures)
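The distinction between one-mode and two-mode networks described in this entry can also be illustrated in code. The sketch below uses the bipartite module of networkx (an assumed tool choice) with invented actors and events:

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
actors = ["Ola", "Piotr", "Maria"]            # first set of entities
events = ["conference", "online course"]      # second set of entities (events)
B.add_nodes_from(actors, bipartite=0)
B.add_nodes_from(events, bipartite=1)

# Ties in a two-mode (affiliation) network link actors to events, not actors to actors.
B.add_edges_from([("Ola", "conference"), ("Piotr", "conference"),
                  ("Piotr", "online course"), ("Maria", "online course")])

# Projecting onto the actor set yields a one-mode network in which two actors
# are tied whenever they share at least one event.
actor_network = bipartite.projected_graph(B, actors)
print(sorted(actor_network.edges()))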
facilitates the creation of social networks and Tabu Search heuristic (EA-TS). In addition, the
their performance. application of SNA in the studies on big data can
be analyzed by taking into account different con-
cepts crucial for the research conducted in vari-
Social Network Analysis and Big Data ous disciplines. One of such concept is identity
Studies that can be investigated in, e.g., organizational
setting. Company identity being understood as
There are different ways social networks are the image created at both the external and inter-
linked with big data. First of all, social networks nal level of corporations is a complex concept
generate a large amount of data. Depending on that requires multilevel studies. Within the phe-
the type of networks, big data concern pictorial nomenon of company identity, its linguistic
data, verbal data or audio data. Big data gathered dimension can be studied, by taking into account
from social networks can also be categorized by how communication is created and conducted
taking into account the type of network and within corporate social networks. It should also
accumulated data. For example, professional be stated that the study on complex social net-
social networks provide data on users’ education works is connected with some problems that
and professional experience, whereas private researchers may encounter; for example, compa-
social networks offer information on one’s nies may have different hierarchies and the ways
hobbies and interests. Depending on the type of they are organized. One of them is the issue of
network they are gathered from, data provide boundary setting and group membership. It
information on demographic changes, customer should also be stated that Social Network Anal-
preference and behaviors, etc. One of the ways is ysis relies on different visualization techniques
to organize information on social groups and that offer the pictorial presentation of
professional communities. Social Network Anal- gathered data.
ysis may be applied to the study on modern
organizations to show how big data is gathered
and distributed in organizations. SNA is also
important when there is an outbreak of a disease Cross-References
or other crisis situations to show how informa-
tion is administered within a given social net- ▶ Blogs
work. It should be stressed, however, that the ▶ Digital Storytelling, Big Data Storytelling
process of gathering and storing data should ▶ Economics
reflect the ethical principles of research. In the ▶ Facebook
case of showing big data, SNA visualization ▶ Network Analytics
techniques (e.g., VISONE) facilitate the presen- ▶ Network Data
tation of complex data. SNA may also benefit ▶ Social Media
from statistics, by applying, e.g., exponential
random graph models or such programs as
UCINET or PAJEK. Big data in social networks Further Reading
may be handled in different ways, but one of the
Bielenia-Grajewska, M. (2014). Topology of social net-
key problems in such analyses includes memory
works. In K. Harvey (Ed.), Encyclopedia of social
and time limits. Stanimirović and Mišković media and politics. Thousand Oaks: SAGE.
(2013) have developed three metaheuristic Bielenia-Grajewska, M., & Gunstone, R. (2015). Lan-
methods to overcome the mentioned difficulties: guage and learning science. In R. Gunstone (Ed.),
Encyclopedia of science education. Dordrecht:
a pure evolutionary algorithm (EA), a hybridiza- Springer.
tion of the EA and a Local Search Method Degenne, A., & Forsé, M. (1999). Introducing social net-
(EA-LS), and a hybridization of the EA and a works. London: SAGE.
Rosen, D., Barnett, G. A., & Kim, J. H. (2011). Social societies (Archeology, History, Demography),
networks and online environments: when science and social interaction (Political Economy, Sociology,
practice co-evolve. SOCNET 1, 27–42. https://doi.org/
10.1007/s13278-010-0011-7. Anthropology), or cognitive system (Psychology,
Stanimirović, Z., & Mišković, S. (2013). Efficient meta- Linguistics). There are also applied Social Sci-
heuristic approaches for exploration of online social ences (Law, Pedagogy) and other Social Sciences
networks. In W.-C. Hu & N. Kaabouch (Eds.), Data classified in the generic group of Humanities
management, technologies, and applications. Hershey:
IGI Global. (Political Science, Philosophy, Semiotics, Com-
Wasserman, S., & Faust, K. (1994). Social network analy- munication Sciences). The anthropologist Claude
sis. Cambridge: Cambridge University Press. Lévi-Strauss, the philosopher and political scien-
tist Antonio Gramsci, the philosopher Michel
Foucault, the economist and philosopher Adam
Smith, the economist John Maynard Keynes, the
Social Sciences psychoanalyst Sigmund Freud, the sociologist
Émile Durkheim, the political scientist and soci-
Ines Amaral ologist Max Weber, and the philosopher, sociolo-
University of Minho, Braga, Minho, Portugal gist, and economist Karl Marx are some of the
Instituto Superior Miguel Torga, Coimbra, leading social scientists of the last centuries.
Portugal The social scientist studies phenomena, struc-
Autonomous University of Lisbon, Lisbon, tures, and relationships that characterize the social
Portugal and cultural organizations; analyzes the move-
ments and population conflicts, the construction
of identities, and the formation of opinions;
Social Science is an academic discipline researches behaviors and habits and the relation-
concerned with the study of humans through ship between individuals, families, groups, and
their relations with society and culture. Social institutions; and develops and uses a wide range
Science disciplines analyze the origins, develop- of techniques and research methods to study
ment, organization, and operation of human soci- human collectivities and understand the problems
eties and cultures. The technological evolution of society, politics, and culture.
has strengthened Social Sciences since it enables The study of humans through their relations
empirical studies developed through quantitative with society and culture relied on “surface data”
means, allowing the scientific reinforcement of and “deep data.” “Surface data” was used in the
many theories about the behavior of man as a disciplines that adapted quantitative methods, like
social actor. The rise of big data represents an Economics. “Deep data” about individuals or
opportunity for the Social Sciences to advance small groups was used in disciplines that analyze
the understanding of human behavior using mas- society through qualitative methods, such S
sive sets of data. Sociology.
The issues related to Social Sciences began to Data collection has always been a problem for
have a scientific nature in the eighteenth century social research because of its inherent subjectivity
with the first studies on the actions of humans in as Social Sciences have traditionally relied on
society and their relationships with each other. It small samples using methods and tools gathering
was by this time that Political Economy emerged. information based on people. In fact, one of the
Most of the subjects belonging to the fields of critical issues of Social Science is the need to
Social Sciences, such as Anthropology, Sociol- develop research methods that ensure the objec-
ogy, and Political Science arisen in the nineteenth tivity of the results. Moreover, the objects of study
century. of Social Sciences do not fit into the models and
Social Sciences can be divided in disciplines methods used by other sciences and do not allow
that are dedicated to the study of the evolution of the performance of experiments under controlled
laboratory conditions. The quantification of infor- interconnecting disciplinary fields. Within the
mation is possible because there are several tech- social domain, data is being collected from trans-
niques of analysis that transform ideas, social actions and interactions through multiple devices
capital, relationships, and other variables from and digital networks. The analysis of large
social systems into numerical data. However, the datasets is not within the field of a single scientific
object of study always interacts with the culture of discipline or approach. In this regard, big data can
the social scientist, making it very difficult to have change Social Science because it requires an inter-
a real impartiality. section of sciences within different research tradi-
Big data is not self-explanatory. Consequently, tions and a convergence of methodologies and
it requires new research paradigms across multi- techniques. The scale of the data and the methods
ple disciplines, and for social scientists, it is a required to analyze them need to be developed
major challenge as it enables interdisciplinary combining expertise with scholars from other sci-
studies and the intersection between computer entific disciplines. Within this collaboration with
science, statistics, data visualization, and social data scientists, social scientists must have an
sciences. Furthermore, big data empowers the essential role in order to read the data and under-
use real-time data on the level of whole stand the social reality.
populations, to test new hypotheses and study The era of big data implies that Social Sciences
social phenomena on a larger scale. In the context rethink and update theories and theoretical ques-
of modern Social Sciences, large datasets allow tions such as small world phenomenon, complex-
scientists to understand and study different social ity of urban life, relational life, social networks,
phenomena, from the interactions of individuals study of communication and public opinion for-
and the emergence of self-organized global move- mation, collective effervescence, and social influ-
ments to political decisions and the reactions of ence. Although computerized databases are not
economic markets. new, the emergence of an era of big data is critical
Nowadays, social scientists have more infor- as it creates a radical shift of paradigm in social
mation on interaction and communication pat- research. Big data reframes key issues on the
terns than ever. The computational tools allow foundation of knowledge, the processes and tech-
understanding the meaning of what those patterns niques of research, the nature of information, and
reveal. The models build about social systems the classification of social reality.
within the analysis of large volumes of data must The new forms of social data have interesting
be coherent with the theories of human actors and dimensions: volume, variety, velocity, exhaustive,
their behavior. The advantages of large datasets indexical, relational, flexible, and scalable. Big
and of the scaling up the size of data are that it is data consists of relational information in large
possible to make sense of the temporal and spatial scale that can be created in or near real time with
dimensions. What makes big data so interesting to different structures, extensive in scope, capable of
Social Sciences is the possibility to reduce data, identifying and indexing information distinc-
apply filters that allow to identify relevant patterns tively, flexible, and able to expand in size quickly.
of information, aggregate sets in a way that helps The datasets can be created by personal data or
identify temporal scales and spatial resolutions, nonpersonal data. Personal data can be defined as
and segregate streams and variables in order to information relating to an identified person. This
analyze social systems. definition includes online user-generated content,
As big data is dynamic, heterogeneous, and online social data, online behavioral data, location
interrelated, social scientists are facing new chal- data, sociodemographic data, and information
lenges due to the existence of computational and from an official source (e.g., police records). All
statistical tools, which allow extracting and ana- data collected that do not directly identify individ-
lyzing large datasets of social information. Big uals are considered nonpersonal data. Personal
data is being generated in multiple and data can be collected from different sources with
three techniques: voluntary data that is created disciplines. The data-driven science uses a hybrid
and shared online by individuals; observed data, combination of abductive, inductive, and deduc-
which records the actions of the individual; and tive methods to the understanding of a phenome-
data inferred about individuals based on voluntary non. This approach assumes theoretical
information or observed. frameworks and pursues to generate scientific
The disciplinary outlines of Social Sciences in hypotheses from the data by incorporating a
the age of big data are in constant readjustment mode of induction into the research design. There-
because of the speed of change in the data land- fore, the epistemological strategy adopted within
scape. Some authors argued that the new data this model is to detect techniques to identify
streams could reconfigure and constitute social potential problems and questions, which can be
relations and populations. Academic researchers worth of further analysis, testing, and validation.
attempt to handle the methodological challenges Although big data enhance the set of data avail-
presented by the growth of big social data, and able for analysis and enable new approaches and
new scientific trends arise, although the diversity techniques, it does not replace the traditional small
of the philosophical foundations of Social Science data studies. Due to the fact that big data cannot
disciplines. Objectivity of the data does not result answer specific social questions, more targeted
directly in their interpretation. The scientific studies are required. Computational Social Sci-
method postulated by Durkheim attempts to ences can be the interface between computer sci-
remove itself from the subjective domain. Never- ence and the traditional social sciences. This
theless, the author stated that objectivity is made interdisciplinary and emerging scientific from
by subjects and is based on subjective observa- Social Sciences uses computationally methods to
tions and selections of individuals. model social reality and analyze phenomena, as
A new empiricist epistemology emerged in well as social structures and collective behavior.
Social Sciences and goes against the deductive The main computational approaches from Social
approach that is hegemonic within modern sci- Sciences to study big data are social network anal-
ence. According to this new epistemology, big ysis, automated information extraction systems,
data can capture an entire social reality and pro- social geographic information systems, complexity
vide their full understanding. Therefore, there is modeling, and social simulation models.
no need for theoretical models or hypotheses. This Computational Social Science is an intersec-
perspective assumes that patterns and relation- tion of Computer Science, Statistics, and the
ships within big data are characteristically signif- Social Sciences, which uses large-scale demo-
icant and accurate. Thus, the application of data graphic, behavioral, and network data to analyze
analytics transcends the context of a single scien- individual activity, collective behaviors, and rela-
tific discipline or a specific domain of knowledge tionships. Computational Social Sciences can be
and can be interpreted by those who can interpret the methodological approach to Social Sciences S
statistics or data visualization. study big data because of the use of mathematical
Several scholars, who believe that the new methods to model social phenomena and the abil-
empiricism operates as a discursive rhetorical ity to handle with large datasets.
device, criticize this approach. Kitchin argues The analysis of big volumes of data opens up
that whereas data can be interpreted free of con- new perspectives of research and makes it possi-
text and domain-specific expertise, such an epis- ble to answer questions that were previously
temological interpretation is probable to be incomprehensible. Though big data itself is rela-
unconstructive as it absences to be embedded in tive, its analysis within the theoretical tradition of
broader discussions. Social Sciences to build a context for information
As large datasets are highly distributed and will enable its understanding and the intersection
present complex data, a new model of data-driven with the smaller studies to explain specific data
science is emerging within the Social Science variables.
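As a concrete illustration of the social simulation models mentioned among the computational approaches in this entry, the following self-contained Python sketch runs a toy voter-style opinion dynamic; the population size, opinion labels, and number of steps are invented assumptions:

import random

random.seed(1)
n_agents = 100
opinions = [random.choice(["A", "B"]) for _ in range(n_agents)]

for _ in range(5000):
    # A randomly chosen agent copies the opinion of another randomly chosen agent.
    i, j = random.randrange(n_agents), random.randrange(n_agents)
    opinions[i] = opinions[j]

print(opinions.count("A"), opinions.count("B"))   # opinion distribution after the run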
Big data may have a transformational impact as
it can transform policy making, by helping to Socio-spatial Analytics
improve communication and governance in sev-
eral policy domains. Big social data also raise Xinyue Ye
significant ethical issues for academic research Landscape Architecture & Urban Planning, Texas
and request an urgent debate for a wider critical A&M University, College Station, TX, USA
reflection on the epistemological implications of
data analytics.
Questions on inequality lie at the heart of the
discipline of social science and geography, moti-
vating the development of socio-spatial analyt-
Cross-References ics. Growing socioeconomic inequality across
various spatial scales in any region threatens
▶ Anthropology social harmony. Meanwhile, a number of fasci-
▶ Communications nating debates on the trajectories and mecha-
▶ Complex Networks nisms of socioeconomic development are
▶ Computational Social Sciences reflected in numerous empirical studies ranging
▶ Computer Science from specific regions and countries to the scale of
▶ Data Science individuals and groups. With accelerated techno-
▶ Network Analytics logical advancements and convergence, there
▶ Network Data have been major changes in how people carry
▶ Psychology out their activities and how they interact with
▶ Social Network Analysis each other. With these changes in both technol-
▶ Visualization ogy and human behavior, it is imperative to
improve our understanding of human dynamics
in order to tackle the inequality challenges rang-
Further Reading ing from climate change, public health, traffic
congestion, economic growth, digital divide,
Allison, P. D. (2002). Missing data: Quantitative appli- social equity, political movements, and cultural
cations in the social sciences. British Journal of
conflicts, among others. Socio-spatial analytics
Mathematical and Statistical Psychology, 55(1),
193–196. has been, and continues to be, challenged by
Berg, B. L., & Lune, H. (2004). Qualitative research dealing with the temporal trend of spatial pat-
methods for the social sciences (Vol. 5). Boston: terns and spatial dynamics of social development
Pearson.
from the human needs perspective. As a frame-
Boyd, D., & Crawford, K. (2012). Critical questions for big
data: Provocations for a cultural, technological, and work promoting human-centered convergence
scholarly phenomenon. Information, Communication research, socio-spatial analytics has the potential
& Society, 15(5), 662–679. to enable more effective and symbiotic collabo-
Coleman, J. S. (1990). Foundations of social theory. Cam-
ration across disciplines to improve human soci-
bridge, MA: Belknap Press of Harvard University
Press. eties. The growth in citizen science and smart
Floridi, L. (2012). Big data and their epistemological chal- cities has reemphasized the importance of
lenge. Philosophy & Technology, 25, 435–437. socio-spatial analytics. Theory, methodology,
González-Bailón, S. (2013). Social science in the era of big
data. Polymer International, 5(2), 147–160.
and practice of computational social science
Lohr, S. (2012). The age of big data. New York Times 11. have emerged as an active domain to address
Lynch, C. (2008). Big data: How do your data grow? these challenges.
Nature, 455(7209), 28–29. Socio-spatial analytics can reveal the dynamics
Oboler, A., et al. (2012). The danger of big data: Social
of spatial economic structures, such as the emer-
media as computational social science. First Monday,
17(7-2). Retrieved from http://firstmonday.org/ojs/ gence and evolution of poverty traps and conver-
index.php/fm/article/view/3993/3269. gence clubs. Spatial inequality is multiscale in
nature. Sources or underlying forces of inequality under different contexts, many efforts have aimed
are also specific to geographic scale and social at analyzing fine-scale spatial patterns and geo-
groups. Such a scalar perspective presents a topol- graphical dynamics and maximizing the potential
ogy of inequality and has the potential to link of massive data to improve human well-being
inequalities at the macroscale to the microscale, toward human dynamics level. Big spatiotempo-
even everyday life experiences. The dramatic ral data have become increasingly available, allo-
improvement in computer technology and the wing the possibility for individuals’ behavior in
availability of large-volume geographically space and time to be modeled and for the results of
referenced social data have enabled spatial ana- such models to be used to gain information about
lytical methods to move from the fringes to central trends at a daily and street scale. Many research
positions of methodological domains. The history efforts in socioeconomic inequality dynamics can
of the open-source movement is much younger, be substantially transformed in the context of new
but its impact on quantitative social science and data and big data. Spatial inequality can be further
spatial analysis is impressive. The OSGeo pro- examined at the finer scale, such as social media
jects that support spatial data handling have a data and movement data, in order to catalyze
large developer community with extensive collab- knowledge and action on environment and sus-
orative activities, possibly due to the wide audi- tainability challenges in the built environment.
ence and publicly adopted OGC standards. In Given the multidimensionality, current research
comparison, spatial analysis can be quite flexible faces challenges of systematically uncovering
and is often field- and data-specific. Therefore, spatiotemporal and societal implications of
analysis routines are often written by domain sci- human dynamics. Particularly, a data-driven
entists with specific scientific questions in mind. policy-making process may need to use data
The explosion of these routines is also facilitated from various sources with varying resolutions,
by increasingly easier development processes analyze data at different levels, and compare the
with powerful scripting language environments results with different scenarios. As such, a synthe-
such as R and Python. sis of varying spatiotemporal and network
In addition to space, things near in time or in methods is needed to provide researchers and
statistical distribution are more related than planning specialists a foundation for studying
distant things. Hence, ignoring the interdepen- complex social and spatial processes. Socio-
dence across space, time, and statistical distri- spatial analytics can be delivered in an interactive
bution leads to overlooking many possible visual system to answer day-to-day questions by
interactions and dependencies among space, non-specialized users. The following questions
time, and attributes. To reveal these relation- can be asked: has this policy change brought any
ships, the distributions of space, time, and attri- positive effects to this street? Where can I tell my
butes should be treated as the context in which a patient to exercise that is safe, culturally accept- S
socio-spatial measurement is made, instead of able, and appropriate to who he/she is?
specifying a single space or time as the context. Professionals working with academic,
The “distribution” in space (the dimension of government, industry, and not-for-project orga-
space) refers to the spatial distribution of attri- nizations across the socio-spatial analytics also
butes, while the “distribution” of attributes (the recognize a widespread challenge with ade-
dimension of statistical distribution) implies the quately implementing spatial decision support
arrangement of attributes showing their capabilities to address complex sustainable sys-
observed or theoretical frequency of occur- tems problems. Wide-ranging knowledge gaps
rence. In addition, the “distribution” of time stem in large part from an inability to synthesize
(the dimension of time) signifies the temporal data, information, and knowledge emerging from
trend of attributes. diverse stakeholder perspectives broadly, deeply,
To advance core knowledge on how humans and flexibly within application domains of spa-
understand and communicate spatial relationships tial decision support systems. Socio-spatial
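The premise that observations near in space are more related than distant ones is often summarized with a spatial autocorrelation statistic such as Moran's I; a minimal sketch with plain numpy, using an invented four-location example, is shown below. Dedicated libraries such as PySAL provide fuller implementations.

import numpy as np

x = np.array([10.0, 12.0, 30.0, 33.0])   # e.g., a socioeconomic indicator per location
# Binary contiguity weights for an invented layout: locations 0-1 and 2-3 are neighbours.
w = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

z = x - x.mean()
n, s0 = len(x), w.sum()
morans_i = (n / s0) * (z @ w @ z) / (z @ z)
print(morans_i)   # roughly 0.97 here: similar values cluster among neighbouring locations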
analytics will facilitate an understanding of the
complicated mechanisms of human communica- South Korea
tions and policy development in both cyberspace
(online) and the real world (offline) for decision Jooyeon Lee
support. Hankuk University of Foreign Studies, Seoul,
The metrics for evaluation for socio-spatial Korea (Republic of)
analytics can cover the following quantitative
and qualitative aspects: (1) usability – whether
the proposed solutions achieve the goal of under- Keeping pace with global trends, the Korean gov-
standing the socioeconomic dynamics; (2) accept- ernment is seeking the use of big data in various
ability, whether and to what degree the proposed areas through efforts such as the publication of the
solutions are operational for dissemination into report A Plan to Implement a Smart Government
the community; (3) extensibility, whether the pro- Using Big Data by the President’s Council on
posed methodology and workflow can be used to Information Strategies in 2011. In particular, the
address different themes; (4) documentation, current government aims for the realization of a
whether the documentation has sufficient and smart government that creates convergence of
clear descriptions about the proposed solutions knowledge through data sharing between govern-
as well as software tools; and (5) community ment departments. One of the practical strategies
building, whether and how this research can for achieving these goals is active support of the
attract the attention and participation of use of big data in the public sector. For this pur-
researchers from diverse domain communities pose, the Big Data Strategy Forum was launched
and how the research can be extended to other in April 2012 led by the National Information
themes. Society Agency. In addition to this, the Electron-
ics and Telecommunications Research Institute
(ETRI) in South Korea has carried out a task to
build up knowledge assets for the use of big data
Cross-References in the public sector with the support of the Korean
Communications Commission and the Korean
▶ Social Sciences
Communications Agency. The South Korean gov-
▶ Spatiotemporal Analytics
ernment is also operating a national big data cen-
ter. The main purpose of this center is to support
small- and medium-sized businesses, universities,
Further Reading and institutions that find it difficult to manage or
maintain big data due to financial constraints.
Ye, X., & He, C. (2016). The new data landscape for
regional and urban analysis. GeoJournal. https://doi. Furthermore, this center is preparing to develop
org/10.1007/s10708-016-9737-8. new business models by collecting data from tele-
Ye, X., & Mansury, Y. (2016). Behavior-driven agent- communications companies, medical services,
based models of spatial systems. Annals of
and property developers.
Regional Science. https://doi.org/10.1007/s00168-
016-0792-3. There are many examples of how much effort
Ye, X., & Rey, S. (2013). A framework for exploratory the Korean government is making in applying big
space-time analysis of economic data. The Annals of data in many different public sector organizations.
Regional Science, 50(1), 315–339.
Ye, X., Huang, Q., & Li, W. (2016). Integrating big social
For example, there is a night bus project run by the
data, computing, and modeling for spatial social Seoul Metropolitan Government which was
science. Cartography and Geographic Information started in response to a night bus service problem
Science. https://doi.org/10.1080/15230406.2016. in Seoul in 2013. In order to achieve this, the
1212302.
Seoul Metropolitan Government took advantage
Ye, X., Zhao, B., Nguyen, T. H., & Wang, S. (2020). Social
media and social awareness. In Manual of digital earth of aggregated datasets(comprised of around three
(pp. 425–440). Singapore: Springer. billion calls and the analysis results of five million
South Korea 859

customers who got in and out of cabs) to create a its big data market is over 2 years behind the
night bus route map. A second example is the Chinese equivalent and does not have sufficient
National Health Care Service, which is operated specialists with practical skills. In addition, there
by Korean National Health Insurance. It set up are problems in that big data has yet not been fully
services such as Google Flu Trends, which is a utilized and has not even been discussed fully in
web service which estimates influenza activity in South Korea. Even though many scholars in social
South Korea by analyzing big data. The Korean sciences have stressed the importance of the prac-
National Health Insurance Service investigates tical use of big data, it is true that there have been
how many people search for the symptoms of a many problems in using big data in reality. Fur-
cold, such as a high fever and coughing, on SNS thermore, there have been many debates about the
(Social Network Services) and Twitter. In addi- leakage of personal information resulting from the
tion, the Ministry of Employment and Labour in use of big data, as in other countries. In 2012,
South Korea has used big data by consulting credit card companies experienced considerable
records of customer service centers and search issues due to a series of leakages of customer
engine systems on SNS to predict supply and information and, as a result, investments and
demand for job prospects in South Korea. Finally, development related to big data have shrunk
the Ministry of Gender Equality and Family in markedly. Thus, the Korean Communications
South Korea has analyzed big data by consulting Commission has been establishing and promoting
records about teenagers who are being bullied at guidelines for the protection of personal infor-
school and feel suicidal urges, blogs, and SNS in mation in big data. The main purposes of the
order to prevent potential teenager delinquency, guidelines are to prevent the misuse/abuse of per-
suicide, disappearance from home, and academic sonal information and to delimit the scope of
interruption. personal information collectible and usable
In addition, customized marketing strategies without the information subjects’ prior consent
based on big data are becoming popular in the within the current legal provisions on the big
commercial sector. This is reflected, for exam- data industry. Although many civic organizations
ple, in the marketing strategies of credit card are involved in arguments for and against the
companies. Shinhan Card, one of the major establishment of the guidelines, the Korean gov-
credit companies in South Korea, developed a ernment’s efforts in the use of big data continue.
card product known as Code Nine by analyzing Only a few years ago, big data was merely an
the consumption patterns and characteristics of abstract concept in South Korea. Now, however,
its 22 million customers. In addition, Shinhan the customized services and marketing strategies
opened its Big Data Center in December 2013. of Korean companies using extensive personal
Similarly, Samsung Card is also inviting experts information are emerging as crucial. In response
to assist in promoting the use of big data in its to this trend, government departments such as the S
business and has opened a marketing-related Ministry of Science and Technology and the
department responsible for big data analysis. National Information Society Agency are
There have been many other attempts at using supporting the big data industry actively; for
big data to analyze sociopolitical phenomena, example, the announcement of the Manual for
such as public opinion on sensitive political Big Data Work Processes and Technologies 1.0
issues. on May 2014 for the introduction and distribution
However, despite such high interest in big data, of big data services in South Korea. Moreover,
the big data market in South Korea is still smaller from 2015, national qualification examinations
than in other developed countries, such as the have been introduced and academic departments
United States and the United Kingdom. More- relating to big data will be opened in universities
over, although South Korea is a leader in the IT in order to develop big data experts for the future.
industry and has the highest Long-Term Evolution Furthermore, many social scientists have
(LTE) distribution rate among all Asian countries, published many articles related to the practical
860 Space Research Paradigm

use of big data and have discussed how the primary focus of space research has been in
Korean government and companies will be able advancing and maturing capabilities in satellites,
to fully use big data and problems they may face. telescopes, and auxiliary science instruments.
Thus, it is clear that the big data industry in South These advancements have provided unprece-
Korea has been rapidly developed and the efforts dented multi-spectral, multi-temporal, and multi-
of the Korean government and businesses alike in spatial data along with the computing abilities to
using big data are sure to continue. combine these multi-dimensional datasets to
enrich the study of various sub-systems within
our cosmos. In the new millennia, the research in
Cross-References the fields of earth sciences and astronomy have
been undergoing revolutionary changes, mainly
▶ Data Mining due to technical capabilities to acquire, store, han-
▶ Google Flu dle and analyze large volumes of scientific data.
▶ Industrial and Commercial Bank of China These advancements are building upon technical
innovation that the space-age accelerated, thus
transforming and enriching space research. In
Further Reading this entry space research refers to both earth and
astronomical sciences, conducted from both space
Diana, M. (2014, June 24). Could big data become big and ground-based observatories.
brother? Governemt Health IT.
Jee, K., & Kim, G. H. (2013). Potentiality of big data in the
medical sector: Focus on how to reshape the healthcare Space-Based Earth Sciences
system. Healthcare Informatics Research, 19(2), The very first artificial satellite, Sputnik 1, was
79–85. launched into low earth orbit in 1957 triggering
Lee, Y., & Chang, H. (2012). Ubiquitous health in Korea:
the dawn of space age. It provided the first set of
Progress, barriers, and prospects. Healthcare Informat-
ics Research, 18(4), 242–251. space-based scientific data about earth’s upper
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., atmosphere. Soon after, the US Government
Roxburgh, C., & Byers, A. H. (2011). Big data: The launched the world’s first series of meteorological
next frontier for innovation, competition, and produc-
satellites, TIROS 1 through 10. Although these
tivity. Washington, DC: McKinsey Global Institute.
Park, H. W., & Leydesdorff, L. (2013). Decomposing early satellites had low resolution cameras and
social and semantic networks in emerging “big data” limited capacity to store and transmit earth
research. Journal of Informetrics, 7(3), 756–765. images, they quickly demonstrated the national
and economic significance of accurately forecast-
ing and understanding weather patterns. The next
series of meteorological satellites included very
Space Research Paradigm high-resolution instruments with multi-spectral
sensors providing earth images ranging from vis-
Hina Kazmi ible light to infrared bands. The application of
George Mason University, Fairfax, VA, USA weather satellite technology rapidly expanded to
study land masses and led to the launch of Landsat
spacecraft series. These satellites have provided
Post-Space Age Advancements the world an unmatched recorded history of land-
forms and respective changes over the past
While the field of astronomy is as old as humanity 40 years, such as tropical rainforests, glaciers,
itself, technologically speaking a large part of the coral reefs, fault lines, and environmental impact
scientific understanding about our own planet, the of human-driven development like agriculture
solar system, the Milky Way galaxy, and beyond and deforestation (Tatem et al. 2008). Further
has occurred in just past 40–60 years. Between the advancements in spacecraft technologies enabled
1960s and the turn of the new millennia, the National Aeronautics and Space Agency (NASA)
Space Research Paradigm 861

to launch its Earth Observation System (EOS) – a embedded beyond science into mainstream cul-
series of large coordinated satellites with multiple ture and arts globally. Hubble’s findings have
highly complex science instruments that have been complemented with a series of other
been in operation since the 1990s. The EOS sci- advanced space telescopes observing in X-ray,
ence has expanded our understanding of earth’s gamma-ray, and infrared wavelengths, which are
environmental state such as cloud formation, pre- measurable only from space.
cipitation, ice sheet mass, and ozone mapping. At the same time, ground-based observatories
During this period the US military also commer- have also grown in size and complexity thanks to
cialized the Global Positioning System (GPS) a series of innovations such as: (a) mirror technol-
technology that led to the development of multiple ogies making telescopes lighter in weight and
Geographic Information System (GIS) tools. adjustable to correct for physical deformations
Today our dependency on continuous earth obser- over time (namely, active and adaptive optics)
vation data is fully intertwined with many day-to- and (b) detectors that allow for high resolution
day functions like traffic management and timely and wide angle. Even more, the addition of radio
responses to weather-related emergencies. astronomy as an established discipline in the field
has added to the ever-growing set of multi-
Ground- and Space-Based Astronomy spectral astronomical research.
Throughout the twentieth century, the field of
astronomy made major breakthroughs (deGrijs
and Hughes 2007). Among the notable discover- Paradigm Shift
ies are the expansion and the rate of acceleration
of our universe, the measurement of cosmic Sixty-three years after the launch of Sputnik 1,
background radiation (confirming the big bang the overall field of space research is diverse,
theory), and discovery and understanding of complex, and rich in data. At the same time,
black holes. These revolutionary findings were the digital age has equipped us with super-
confirmed in large part due to launch of space- computing, large data storage, and sophisticated
based telescopes, starting in the 1960s. There are analytical tools. The combination of these fac-
two key advantages of conducting astronomical tors is steadily leading us to new realms in
observations from space. First, images are much research that is more and more driven by big
sharper and stable above earth’s atmosphere; data. Civilian space agencies, such as NASA
second, we can observe the sky across wave- and European Space Agency (ESA), have pro-
lengths on the electromagnetic spectrum that moted open data policy for decades and have
are not detectable by ground-based telescopes. helped develop large publicly available online
In the post-space age era, one of the most signif- archives to encourage data-intensive research
icant contributions to astronomy has come from that varies in both depth and breadth. The S
Hubble Space Telescope (HST) (NAP 2005), the National Academies of Sciences, Engineering,
largest space-based observatory that included a and Medicine (NASEM) has also focused on the
range of instruments to observe in wavelengths increasing significance of archival data and
spanning from ultraviolet to near-infrared. interestingly has referred to science centers as
Hubble’s findings have revolutionized our “archival centers” for scientific research (NAP
understanding of the cosmos. It has made over 2007). Government continues to invest in vari-
1.4 million observations to date and created a ous data mining tools for the ease of access along
repository of science data that has resulted in with data recipes to facilitate the use of multiple
over 17,000 peer-reviewed publications over its layers of datasets as well as the metadata asso-
30 years of operational life; its H-Index remains ciated with the processed results, such as cata-
the highest among all observatories (257 as of logs of object positions and brightness.
2017) with well over 738,000 citations. HST is a In 2013, the Public Broadcast Service’s NOVA
historic treasure, and its images have been made a documentary “Earth From Space” using
862 Space Research Paradigm

NASA’s earth observation data. Combining the in its Earth Science Data and Information System
images from multiple satellites, the documentary (ESDIS 2020). The data volume in 2019 alone
creatively demonstrated the delicate interconnec- totaled to 34 PB, made up of 12,000 datasets,
tedness and interdependence of biological, geo- and served 1.3 million users. The ESDS program
logical, oceanic, and atmospheric elements of our projects that its total archive volume will reach
planet as one ecosystem. Earth scientists are able close to 250 PB by year 2025.
to now pursue such expansive research which is
indeed a study in system of systems at a full Astronomy Archives
planetary scale building on pool of scientific Astronomy is just showing up to the big data
knowledge of past decades that continues to grow. world. Civil space agencies are investing
In astronomy, the age of big data is changing resources to develop and mature various calibra-
the very nature of conducting astronomical tion algorithms and archives for all astronomical
research, and we are in the midst of this paradigm disciplines (planetary, heliophysics, and astro-
shift. For example, rather than astronomers gen- physics) and across multiple wavelengths on the
erating data for their research by proposing to electromagnetic spectrum. One such data source
telescopes to observe selected targets in the sky, that is increasing in demand and volume is
big data is influencing scientists to instead create NASA’s Infrared Science Archive (IRSA). It cur-
their research programs from the large swaths of rently holds 1 PB of data from 15 infrared space-
archival data already available. More and more based observatories and is in the process of adding
now, astronomers do not have to spend time a list of ground-based observatories. Hubble Tele-
observing with a telescope as their predecessors scope archives contain about 150 TB of data
traditionally did. This trend began with Hubble’s according to NASA’s website. The ESA science
expansive archives. New observatories like Gaia data center contains about 646 TB of total archival
and Alma are also driving such archival-based data for all its science missions, and its monthly
research. Gaia is a space-based astrometric tele- download rate is about 69 TB (ESDC 2020). The
scope that is designed to build a full 3D map of the data-intensive observatories (such as Gaia) aim to
Milky Way and track the movement of nearly scan the full sky on repeated basis for multiple
2 billion stars and other objects in it by scanning years and thus further pushing the big data-driven
the skies multiple times during its rotation around archival research in astronomy. The following
the sun (Castelvecchi 2020). This is an extraordi- table lists a few examples of such next generation
nary scale of information about galactic structure of observatories that are considered game
and patterns of movements of stars within it – changers due to the volume of data they intend
including our own Sun. In radio astronomy, the to generate (Table 1).
ground-based observatory, Alma, is transforming
the study of cosmology in a similar manner. Con- Analytical Challenges
sequently, the next chapters in astrophysics are The devil is in the details of mining, handling,
focusing on finding larger patterns and under- organizing, analyzing, and visualizing ever-
standing the structural characteristics of objects growing volumes of archival records. IRSA pro-
in space. Similar to understanding Earth as one gram cites limitations in computational capabili-
system, scientists hope to use big data to ulti- ties and efficiently transferring large volumes of
mately understand the interconnectedness across data (Grid n.d.). NASEM emphasizes the need to
system of galaxies and stars that can reveal the increase and sustain common archival systems
construct of system of systems at the cosmologi- such as IRSA and organizing astronomical
cal scale. archives by wavelengths and using standardized
tools that are repeatable and reliable (NAP 2007).
Earth Science Archives NASA, academia, and other stakeholders are
NASA’s Earth Science Data System (ESDS) pro- partnering to transition to cloud-based technolo-
gram adds on average 20 TB of new data per day gies where analytical tools are co-located with
Space Research Paradigm 863

Space Research Paradigm, Table 1 Next generation of observatories


Observatory Data volume Description
Gaia 1.3 TB with latest data dump Space-based astrometric telescope building a full 3D map of the
Milky Way galaxy and tracking the movement of nearly
2 billion stars and other objects
Alma Between 200 and 400 PB annually Ground-based radio observatory that studies the universe
Vera About 500 PB annually – over Ground-based observatory scheduled to start operations in 2023
C. Rubin 20 TB of data to be processed daily
Square Km 600 PB annually Planned largest radio telescope ever built which will include
Array thousands of dishes and up to a million low-frequency antennas

data archives – therefore resolving the limitation Cross-References


on downloading and migrating large data sizes.
For example, Yao et al. propose the Discrete ▶ Data Brokers and Data Services
Global Grid System (DGGS) which is a unified ▶ Earth Science
spatiotemporal framework combining data stor-
age, analytics, and imaging (Yao et al. 2019) for
earth observing data. Further Reading
Big data analytics are in preliminary phases in
space research. It will take an active participation Castelvecchi, D. (2020, December 3). Best map of Milky
Way reveals a billion stars in motion. Retrieved from
of scientists working in close collaboration with
Nature https://www.nature.com/articles/d41586-020-
software, IT, and data scientists to develop 03432-9.
DGGS-like tools. More importantly, this collab- deGrijs, R., & Hughes, D. W. (2007, October). The top ten
orative approach is needed to incorporate the astronomical “Breakthroughs” of the 20th century. The
CAP Journal, 1(1):11–17.
complicated data calibration and reduction algo-
ESDC. (2020). ESAC science data center. Retrieved from
rithms that scientists can trust. The development https://www.cosmos.esa.int/web/esdc/home.
of tools that can reliably and efficiently synergize ESDIS. (2020, October 20). Earth science data. Retrieved
these facets will be necessary to fully realize the from https://earthdata.nasa.gov/esds/nasa-earth-scien
ce-data-systems-program-highlights-2019.
potentials and expansion of space research
Grid, O. S. (n.d.). Astronomy archives are making new
disciplines. science every day. Retrieved from https://openscience
grid.org/news/2017/08/29/astronomy-archives.html.
NAP. (2005). Assessment of options for extending the life of
the Hubble Space Telescope. National Academies of
Summary Sciences. Retrieved from https://www.nap.edu/read/
11169/chapter/5.
NAP. (2007). The portals of the universe: The NASA S
Space research is evolving in both depth and Astronomy Science Centers. Retrieved from The Por-
breadth since the turn of the millennia. The data- tals of the Universe: The NASA Astronomy Science
driven paradigm shift in research promises to Centers. https://www.nap.edu/download/11909.
NASA. (2019, October). Earth observing systems.
build upon the technological advancements that Retrieved from Project Science Office. https://eospso.
were made in the post-space age era. Space agen- nasa.gov/content/nasas-earth-observing-system-proj
cies are investing in analytical tools and building ect-science-office.
large data archives, turning science centers into Tatem, A. J., Goetz, S. J., & Hay, S. I. (2008, September–
October). Fifty years of earth-observation satellites.
archive centers. As a result, the data-intensive American Scientist, 96(5). Retrieved from https://
transformation in astronomical and earth sciences www.americanscientist.org/article/fifty-years-of-earth-
is truly exciting, and because of it the humanity is observation-satellites.
at the cusp of understanding the cosmos and our Yao, X., Li, G., Xia, J., Ben, J., Cao, Q., Zhao, L., ... Zhu,
D. (2019). Enabling the big earth observation data via
place in it in unprecedented ways and for decades cloud computing and DGGS: Opportunities and chal-
to come. lenges. Remote Sensing, 12(1):62.
864 Spain

are missing and should be included to achieve


Spain further progress in the policy of Open Govern-
ment of Spain.
Alberto Luis García At this point, the Spanish government
Departamento de Ciencias de la Comunicación approved Royal Decree 1495/2011, which regu-
Aplicada, Facultad de Ciencias de la información, lates the tools available in the above described
Universidad Complutense de Madrid, Madrid, specific website and open data (general, technical,
Spain formats, semantics, etc.) and for the case mix of
each agency.
The articulation of the data provided by the
Big Data technology is growing and, in some administration was managed through catalogs.
cases, has already matured, when it is about to Access allows to introduce to them from a single
be fulfilled 104 years since the publication of point to various websites and resources of the
MapReduce, the model of massive and distributed central government to provide public informa-
computing that marked its beginning as the heart tion. The data are available, organized, and struc-
of Hadoop. MapReduce, as defined in IBM tured by formats and topics users, among other
webpage, “is this programming paradigm that criteria. Within the website you have the possi-
allows for massive scalability across hundreds or bility to search for catalogs or applications; there
thousands of servers in a concrete cluster.” The are examples of specific applications like
term MapReduce, as follows in IBM webpage, IPlayas, looking for the nearest beaches and
actually refers to map job, “that takes a set of coves to your mobile. Therefore, the objective
data and converts it into another set of data is to provide services that can help you obtain
where individual elements are broken down into economic returns in strategic sectors of the Span-
tuples.” ish economy.
The origins of the use of Big Data in Spain The profile of users is taken into account to
began in local performances in regions such as the interact with the data is of three types:
Basque Country, Catalonia, and Asturias; how-
ever, Spanish public administration has the • Anonymous users, i.e., those who can visit all
webpage (http://datos.gob.es/) that offers all public areas of the site, send suggestions, rate,
kinds of public data for reuse in matters private. and comment content.
This website is managed by the Ministry of Indus- • Users infomediaries who can publish and
try, Energy and Tourism and the Ministry of manage applications in the App Catalog (with
Finance and Public Administration, and a key prior registration on the portal).
objective is to create a strategy of openness to • Users of public sector, which are allowed to
the management of public sector data for use in add and manage their data sets within the data
business and to promote the necessary transpar- catalog (with prior registration on the portal).
ency of public policies.
In this same line of policy transparency from The profile of users undergoing a major
Big Data management, the government influence on the use of Big Data is the
published in April 2014 the II Action Plan for infomediary Sector that was defined as set of
Open Government under the Open Government companies that generate applications, products,
Partnership, an alliance born in 2011 between 64 and/or added value services for third parties,
countries – including Spain – whose mission is to from public sector information. The
develop commitments to achieve improvements Infomediary Sector has been cataloged into
in the three key aspects of open government: subsectors according to the area of reusable
accountability, participation, and transparency. information; these areas are: Business/Economy,
However, these action plans are under public Geographical/Cartographical, Social–Demograph-
consultation to indicate which important issues ical/Statisitical, Legal, Meteorogical, Transport,
Spatial Data 865

Information about Museums, Libraries and Cul- specific regulations that will allow the implemen-
tural Files. tation of the regulations.
Within the different types of activity would be
the most prolific Geographical/Cartographical
Information and Business/Economy Information. Further Reading
The sources of reused information for the
activities in the Infomediary Sector are – from http://datos.gob.es/.
the most used until the less used – State Admin- http://datos.gob.es/saber-mas.
http://www.access-info.org/es.
istration, Regional Administration, Local
IBM Webpage. (2014). What is map reduce? http://www-
Administration, European Union, Intelligence 01.ibm.com/software/data/infosphere/hadoop/
Agencies, and from another countries. And in mapreduce. Accessed Aug 2014.
the other hand, the Administration itself becomes
a client of infomediary companies, but the most
important clients of Spanish Infomediaries
Companies are Self-Employers Workers, Uni- Spatial Big Data
versities, and in a second level Public Adminis-
tration and Citizens. The revenue models for ▶ Big Geo-data
payment of services are payment for Works
done, Access, Use, Linear subscription, etc, and
the products or services offered from the sector
are Data Processed, Maps, Raw Data, and Publi- Spatial Data
cations. The main Generic Services from Data
are Custom Reports, Advices, Comparatives, Xiaogang Ma
and Clipping; the main applications are Client Department of Computer Science, University of
Software, Mobile Software, GPS Information, Idaho, Moscow, ID, USA
and SMS/Mail Alerts. And the Assessment of
Infomediary Sector for Clients is the develop-
ment for new products and applications and Synonyms
increased customer loyalty.
In Spain, the whole strategy is integrated into Geographic information; Geospatial data;
the Plan Aporta in which there are three types of Geospatial information
users involved in the reuse of information:

• Public bodies or content generators Introduction


• Infomediary Sector or generations of applica- S
tions and value added Spatial property is almost a pervasive component
• End users are those who use the information in the big data environment because everything
happening on the Earth happens somewhere. Spa-
The regulations governing the Big Data in tial data can be grouped into raster or vector
Spain are designed, therefore, in order to deepen according to the methods used in representations.
the use of two main elements: the reuse of infor- Web-based services facilitate the publication and
mation for business benefit and transparency in use of spatial data legacies, and the crowdsourcing
governance. This regulation is perfectly inte- approaches enable people to be both contributors
grated in the common strategy of the European and users of spatial data. Semantic technologies
Union around Access Info Europe will allow the further enable people to link and query the spatial
reuse of information at higher scales access as it data available on the Web, find patterns of interest,
relates to public data and geolocation. However, and to use them to tackle scientific and business
the Spanish government has not yet issued issues.
866 Spatial Data

Raster and Vector Representations geographic phenomena, as the cell boundaries


are independent of feature boundaries. However,
Spatial data are representations of facts that con- the raster representations are efficient for image
tain positional values, and geospatial data are processing. In contrast, the vector representations
spatial data that are about facts happening on the have complex data structures but are efficient for
surface of the Earth. Almost everything on the representing spatial interrelations. The vector rep-
Earth has location properties, so geospatial data resentations work well in scale changes but are
and spatial data are regarded as synonyms. Spatial hard to implement overlays. Also, they allow the
data can be seen almost everywhere in the big data representation of networks and enable easy asso-
deluge, such as social media data stream, traffic ciation with attribute data.
control, environmental sensor monitoring, and The collection, processing, and output of spa-
supply chain management, etc. Accordingly, tial data are often relevant to a number of plat-
there are various applications of spatial data in forms and systems, among them the most well-
the actual world. For example, one may find a known are the geographic information system,
preferred restaurant based on the grading results remote sensing, and the global positioning sys-
on Twitter. A driver may adjust his route based on tem. A geographic information system is a com-
the real-time local traffic information. An engi- puterized system that facilitates the phases of data
neer may identify the best locations for new build- collection, data processing, and data output, espe-
ings in an area with regular earthquakes. A forest cially for spatial data. Remote sensing is the use of
manager may optimize timber production using satellites to capture information about the surface
data of soil and tree species distribution and con- and atmosphere of the Earth. Remote sensing data
sidering a few constraints such as the requirement are normally stored in raster representations. The
of biodiversity and market price. global positioning system is a space-based satel-
Spatial data can be divided into two groups: lite navigation system that provides direct mea-
raster representations and vector representations. surement of position and time on the surface of the
A raster representation can be regarded as a group Earth. Remote sensing images and global posi-
of mutually exclusive cells which form the repre- tioning system signals can be regarded as primary
sentation of a partition of space. There are two data sources for the geographic information
types of raster representations: regular and irreg- system.
ular. The former has cells with same shape and
size and the latter with cells of varying shape and
size. Raster representations do not store coordi- Spatial Data Service
nate pairs. In contrast, vector representations use
coordinate pairs to explicitly describe a geo- Various proprietary and public formats for raster
graphic phenomenon. There are several types of and vector representations have been introduced
vector representations, such as points, lines, areas, since computers were used for spatial data collec-
and the triangulated irregular networks. A point is tion, analysis, and presentation. Plenty of remote
a single coordinate pair in a two-dimensional sensing images, digital maps, and sensor data
space or a coordinate triplet in a three-dimensional form a massive spatial data legacy. On the one
space. A line is defined by two end points and zero hand, they greatly facilitate the progress of using
to more internal points to define the shape. An spatial data to tackle scientific and social issues.
area is a partition of space defined by a boundary On the other hand, the heterogeneities caused by
(Huisman and de By 2009). the numerous data formats, conceptual models,
The raster representations have simple but less and software platforms bring huge challenges for
compact data structures. They enable simple data integration and reuse from multiple sources.
implementation of overlays but pose difficulties The Open Geospatial Consortium (OGC) (2016)
for the representation of interrelations among was formed in 1994 to promote a worldwide
Spatial Data 867

consensus process for developing publicly avail- A virtual globe has the capability to represent
able interface standards for spatial data. By early various different views on the surface of the
2015, the consortium consists of more than 500 Earth by adding spatial data as layers on the sur-
members from industry, government agencies, face of a three-dimensional globe. Well-known
and academia. Standards developed by OGC virtual globes include Google Earth, NASA
have been implemented for promoting interoper- World Wind, ESRI ArcGlobe, etc. Besides spatial
ability in spatial data collection, sharing, service, data browsing, most virtual globe programs also
and processing. Well-known standards include the enable the functionality of interactions with users.
Geography Markup Language, Keyhole Markup For example, the Google Earth can be extended
Language, Web Map Service, Web Feature Ser- with many add-ons encoded in the Keyhole
vice, Web Processing Service, Catalog Service for Markup Language, such as geological map layers
the Web, Observations and Measurements, etc. exported from OneGeology.
Community efforts such as the OGC service
standards offer a solution to publish multisource
heterogeneous spatial data legacy on the Web. A Open-Source Approaches
number of best practices have emerged in recent
years. The OneGeology is an international initia- There are already widely used free and open-
tive among the geological surveys across the source software programs serving different pur-
world. It was launched in 2007, and by early poses in spatial handling (Steiniger and Hunter
2015, it has 119 participating member nations. 2013). Those programs can be grouped into a
Most members in OneGeology share national number of categories:
and/or regional geological maps through the
OGC service standards, such as Web Map Service (1) Standalone desktop geographic information
and Web Feature Service. The OneGeology Portal systems such as GRASS GIS, QGIS, and
provides a central node for the various distributed ILWIS
data services. The Portal is open and easy to use. (2) Mobile and light geographic information sys-
Anyone with an internet browser can view the tems such as gvSIG Mobile, QGIS for
maps registered on the portal. People can also Android, and tangoGPS
use the maps in their own applications as many (3) Libraries with capabilities for spatial data pro-
software programs now provide interfaces to cessing, such as GeoScript, CGAL, and
access the spatial data services. Another more GDAL
comprehensive project is the GEO Portal of the (4) Data analysis and visualization tools such as
Global Earth Observation System of Systems, GeoVISTA Studio and R and PySAL;
which is coordinated by the Group on Earth (5) Spatial database management systems such as
Observations. It acts as a central portal and clear- PostgreSQL, Ingres Geospatial, and JASPA S
inghouse providing access to spatial data in sup- (6) Web-based spatial data publication and pro-
port of the whole system. The portal provides cessing servers such as GeoServer,
registry for both data services and standards used MapServer, and 52n WPS
in data services. It allows users to discover, (7) Web-based spatial data service development
browse, edit, create, and save spatial data from frameworks such as OpenLayers, GeoTools,
members of the Group on Earth Observations and Leaflet
across the world.
Another popular spatial data service is the vir- An international organization, the Open
tual globe, which provides three-dimensional rep- Source Geospatial Foundation, was formed in
resentation of the Earth or another world. It allows 2006 to support the collaborative development
users to navigate in a virtual environment by of open-source geospatial software programs and
changing the position, viewing angle, and scale. promote their widespread use.
868 Spatial Data

Companies such as Google, Microsoft, and methodologies and technologies to publish struc-
Yahoo! already provide free map services. One tured data on the Web so they can be annotated,
can browse maps on the service website, but the interlinked, and queried to generate useful infor-
spatial data behind the service is not open. In mation. The Web-based capabilities of linking and
contrast, the free and open-source spatial data querying are specific features of the Linked Data,
approach requires not only freely available which help people to find patterns from data and
datasets but also details about the data, such as use them in scientific or business activities. To
format, conceptual structure, vocabularies used, make full use of the Linked Data, the geospatial
etc. A well-known open-source spatial data pro- community is developing standards and technol-
ject is the OpenStreetMap, which aims at creating ogies to (1) transform spatial data into Semantic
a free editable map of the world. The project was Web compatible formats such as the Resource
launched in 2004. It adopts a crowdsourcing Description Framework (RDF), (2) organize and
approach, that is, to solicit contributions from a publish the transformed data using triple stores,
large community of people. By the middle of and (3) explore patterns in the data using new
2014, the OpenStreetMap project has more than query languages such as GeoSPARQL.
1.6 million contributors. Comparing with the The RDF uses a simple triple structure of sub-
maps, the data generated by the OpenStreetMap ject, predicate, and object. The structure is robust
are considered as the primary output. Due to the enough to support the linked spatial data
crowdsourcing approach, the current data quali- consisting of billions of triples. Building on the
ties vary across different regions. Besides the basis of the RDF, there are a number of specific
OpenStreetMap, there are numerous similar schemas for representing locations and spatial
open-source and collaborative spatial data pro- relationships in triples, such as the GeoSPARQL.
jects addressing the needs of different communi- Triple stores offer functionalities to manage spa-
ties, such as the GeoNames for geographical tial data RDF triples and query them, which are
names and features, the OpenSeaMap for a world- very similar to what the traditional relational data-
wide nautical chart, and the eBird project for real- bases are capable. As mentioned above, spatial
time data about bird distribution and abundance. data have two major sources: conventional data
Open-source spatial data formats have also legacy and crowdsourcing data. While technolo-
received increasing attention in recent years, espe- gies are being mature for transforming both of
cially Web-based formats. A typical example is them into triples, the crowdsourcing data provide
GeoJSON, which enables the encoding of simple a more flexible mechanism for the Linked Data
geospatial features and their attributes using approach and data exploration as they are fully
JavaScript Object Notation (JSON). GeoJSON is open. For example, there are already works done
now supported by various spatial data software to transform data of the OpenStreetMap and
packages and libraries, such as OpenLayers, GeoNames into RDF triples. For the pattern
GeoServer, and MapServer. Map services of Goo- exploration, there are already initial results, such
gle, Yahoo!, and Microsoft also support as those in the GeoKnow project (Athanasiou
GeoJSON in their application programming et al. 2014). The project built a prototype called
interfaces. GeoKnow Generator which provides functions to
link, enrich, query, and visualize RDF triples of
spatial data and build lightweight applications
Spatial Intelligence addressing specific requests in the actual world.
The linked spatial data is still far from mature
The Semantic Web brings innovative ideas to the yet. More efforts are needed on the annotation and
geospatial community. The Semantic Web is a accreditation of shared spatial RDF data, the inte-
web of data compared to the traditional web of gration and fusion of them, the efficient RDF
documents. A solid enablement of the Semantic query in a big data environment, and innovative
Web is the Linked Data, which is a group of ways to visualize and present the results.
Spatial Econometrics 869

Cross-References organization, political sciences, psychology, agri-


cultural economics, health economics, demogra-
▶ Geography phy, epidemiology, managerial economics, urban
▶ Socio-spatial Analytics planning, education, land use, social sciences,
▶ Spatiotemporal Analytics economic development, innovation diffusion,
environmental studies, history, labor, resources
and energy economics, transportation, food secu-
References rity, real estate, and marketing. But the list of
applied disciplines that can benefit from the
Athanasiou, S., Hladky, D., Giannopoulos, G., Rojas, A. advances in spatial econometrics is, in fact, a lot
G., Lehmann, J. (2014). GeoKnow: Making the web an
longer and likely to further increase in the future.
exploratory place for geospatial knowledge. ERCIM
News, 96. http://ercim-news.ercim.eu/en96/special/ The number of textbooks available to introduce
geoknow-making-the-web-an-exploratory-place-for-geo new scholars to the discipline has also raised a lot
spatial-knowledge. Accessed 29 Apr 2016. recently. To the long-standing traditional textbook
Huisman, O., & de By, R. A. (Eds.). (2009). Principles of
by Luc Anselin (1988), a list of new volumes were
geographic information systems. Enschede: ITC Edu-
cational Textbook Series. added in the last decade or so (e.g., Arbia 2006,
Open Geospatial Consortium (2016). About OGC. http:// 2014; LeSage and Pace 2009) that can introduce
www.opengeospatial.org/ogc. Accessed 29 Apr 2016. the topic to scholars at various levels of
Steiniger, S., & Hunter, A. J. S. (2013). The 2012 free and
formalization.
open source GIS software map: A guide to facilitate
research, development, and adoption. Computers, The broad field of spatial econometrics can be
Environment and Urban Systems, 39, 136–150. distinguished into two branches according to the
typology of data considered in the empirical ana-
lyses. Conventional spatial econometrics treats
mainly data aggregated at the level of a real geo-
Spatial Econometrics graphical partition, such as countries, regions,
counties, or census tracts. This first branch is
Giuseppe Arbia referred to as the spatial econometrics of regional
Universita’ Cattolica Del Sacro Cuore, Catholic data and represents, to date, the mainstream in the
University of the Sacred Heart, Rome, Italy scientific research. The second branch introduces
space and spatial relationships in the empirical
analysis of individual granular data referring to
Spatial Econometrics and Big Data the single economic agent, thus overcoming the
problems connected with data aggregation (see
Spatial econometrics is the branch of scientific “▶ Data Aggregation”). This second branch is
knowledge, at the intersection between statistics, termed spatial microeconometrics and is emerg- S
economics, and geography, which studies empir- ing in recent years as an important new field of
ically the geographical aspects of economic rela- research (Arbia 2016).
tionships. The term was coined by the father of the Both branches have been interested in the last
discipline, Jean H. Paelinck, in the general address decades by the big data revolution in terms of the
he delivered to the Annual Meeting of the Dutch volume and the velocity with which data are
Statistical Association in May 1974. The interest becoming more and more available to the scien-
in the discipline has recorded a particularly sharp tific community. Data geographically aggregated
increase in the last two decades which recorded an are more and more available at a very high level of
explosion in the number of applied disciplines resolution. For instance, the Italian National Sta-
interested in the subject and of the number of tistical Institute releases census information
papers appeared in scientific journals. The major related to about 402,000 census tracts. Many
application fields are subjects like regional eco- demographic variables are collected by Eurostat
nomics, criminology, public finance, industrial at the level of the European regular square (1 km-
870 Spatial Econometrics

by-1 km size) lattice grid involving many millions interpoint distances. In this case, many different
of observations. On the other hand, the availabil- definitions are possible considering, e.g., (i) an
ity of very large geo-referenced individual micro- inverse function of the interpoint distances dij,
data has also increased dramatically in all fields of e.g., wij ¼ daij ; a > 0 ; (ii) a threshold criterion
economic analysis, making it possible to develop expressed
 in a binary form by wij ¼
a spatial microeconometric approach which was 1 if dij < d *
, d being the threshold distance;
unconceivable only until few decades ago. For 0 otherwise
instance, the US Census Bureau provides annual (iii) a combination of threshold and inverse
observations for every private sector establish- distance
 definition such that wij ¼
ment with payroll and includes approximately d a
ij if d ij < d
; and (iv) a nearest neighbors
4 million establishments and 70 million 0 otherwise
employees each year. Examples of this kind can definition. Equation 1 considers the spatially
be increasingly found in all branches of lagged variable of the dependent variable y (the
economics. term WY) as one of the regressors and may also
Founding on the mathematical theory of ran- contain spatially lagged variables of some or all of
dom fields, the basic linear, isotropic (i.e., the exogenous variables (the term WX).
directionally invariant), homoschedastic spatial Equation 2 considers a spatial autoregressive
econometric models are based on the SARAR model for the stochastic disturbances (the term Wu).
(acronym for Spatial AutoRegressive with addi- The SARAR model represents the benchmark
tional AutoRegressive error structure) paradigm. for the analysis of both regional and individual
The general formulation of this model is based on microgeographical data. There are, however,
the following set of equations: some important differences in the two cases.
Indeed, when dealing with regional data, almost
y ¼ lWy þ Xbð1Þ þ WXbð2Þ þ u jlj < 1 ð1Þ invariably, the spatial units constitute a complete
cross section of territorial units with no missing
u ¼ rWu þ e jrj < 1 ð2Þ data, variables are observed directly, there is no
uncertainty on the spatial observations that are
where y is a vector of n observations of the inde- free from measurement error, and the location of
pendent variable, X is an n-by-n
 matrix
 of non- the regions is perfectly known. In contrast, gran-
stochastic regressors, eX  i:i:d:N 0, s2e n I n ular spatial microdata quite often present different
(with nIn the unitary matrix of dimension n) are forms of imperfections: they are often based on a
the disturbance terms, β(1), β(2) are vectors of sample drawn from a population of spatial loca-
parameters, and l and r scalar parameters to be tions, some data are missing, some variable only
estimated. The definition of the n-by-n W matrix proxy the target variables, and they almost invari-
deserves further explanations. In general the ably contain both attribute and locational errors
matrix W represents a set of, exogenously given, (see “▶ Big Data Quality”).
weights, which depend on the geography of the Many possible alternatives have been pro-
phenomenon. If data are aggregated at a regional posed to estimate the parameters of model (1–2)
 matrix, say wijϵW, is
level, the generic entry of the (see Arbia 2014 for a review). A maximum like-
1 if j  N ðiÞ lihood (ML) approach assuming normality of the
usually defined by wij ¼ (N(i)
0 otherwise residuals guarantees the optimal properties of the
being the set of neighbors of location j), with estimators, but, since no closed-form solution is
wii ¼ 0 by definition. Conversely, if data represent generally available, the solutions have to be
granular observations on the single economic obtained numerically raising severe problems of
agent, the W matrix is based on the information computing time, storage, and accuracy. Alterna-
about the (physical or economic) pairwise tively, the generalized method of moments
Spatial Econometrics 871

(GMM) procedures that have been proposed do the availability of very large databases is also
not require any distributional assumptions and increasing at an accelerated speed. Apart from a
may reduce (although not completely eliminate) large number of attempts to simplify the problem
the computational problems in the presence of computationally, some of the most recent litera-
very large databases and very dense W matrices. ture has concentrated on the specification of
These estimators, however, are not fully efficient. alternative models that are computationally sim-
Further models have been suggested in the litera- pler. In this respect the three most relevant
ture to overcome the limits of the basic SARAR methods that can be used for big data in spatial
model, considering methodological alternatives to econometrics are the matrix exponential spatial
remove the (often unrealistic) hypotheses of isot- specification (MESS), the unilateral approxima-
ropy, linearity, and homoschedasticity on which tion, and the bivariate marginal likelihood
they are based (Arbia 2014 for a review). Methods approach (see Arbia 2014 for a review).
and models are also available for the analysis of
spatiotemporal econometric data (Baltagi 2013)
(see “▶ Spatiotemporal Analytics”). Conclusions
The estimation of both the regional and the
microeconometric models may encounter severe Spatial econometrics is a rapidly changing disci-
computational problems connected with the pline also due to the formidable explosion of data
dimension of the dataset. Indeed, both the ML availability and their diffusion in all spheres of
and the GMM estimation procedures require human society. Under this respect the use of
repeated inversions of an n-by-n matrix satellite data and the introduction of new sophis-
expressed as some function of the W matrix. If ticated positioning devices together with the
n is very large, this operation could become widespread access to individual granular data
highly demanding if not prohibitive. A way out, deriving from archives, from social networks,
employed for years in the literature, consisted of crowdsourcing, and other sources have the
exploiting an eigenvalue decomposition of the potential to revolutionize the way in which
matrices involved, a solution which, however, econometric modelling of spatial data will be
does not completely eliminate the accuracy prob- approached in the future. Under this respect, in
lems if n is very large because the spectral the future we will progressively observe a tran-
decomposition in very large matrices is the out- sition from economic phenomena that are
come of an approximation. Many studies report modelled on a discrete to phenomena which are
that the computation of eigenvalues by standard observed on a continuous in space and time and
subroutines for general nonsymmetric matrices that will need a completely novel set of tools.
may be highly inaccurate already for relatively This is certainly true for many phenomena that
small sample sizes (n >400). The accuracy are intrinsically continuous in space and that S
improves if the matrix is symmetric, which, were observed so far on a discrete only due to
unfortunately, is not always the case with spatial our limitations in the observational tools (e.g.,
econometrics models. Many other approxima- environmental variables), but also for phenom-
tions were proposed, but none entirely satisfac- ena characterized by spatial discontinuities, like
tory especially when the W matrices are very those observed in transportation or health stud-
dense. The computational issues connected with ies, just to make few examples. Under this point
the estimation of spatial econometric models are of view, spatial econometrics will benefit in the
doomed to become more and more severe in the future from the cross-contamination with tech-
future even with the increasing power of com- niques developed for the analysis of continuous
puter machines and the diffusion of parallel pro- spatial data and with other useful tools that could
cessing (see “▶ Parallel Processing”) because be borrowed from physics.
872 Spatial Scientometrics

Cross-References science has been added into account since


research activities usually start from a certain
▶ Big Data Quality region or several places in the world and then
▶ Data Aggregation spread to other places, thus displaying spatiotem-
▶ Parallel Processing poral patterns. The analysis of spatial aspects of
▶ Socio-spatial Analytics the science system is composed of spatial
▶ Spatiotemporal Analytics scientometrics (Frenken et al. 2009), which
address the studies of geospatial distribution pat-
terns on scientific activities, domain interactions,
Further Reading co-publications, citations, academic mobility, and
so forth. The increasing availability of large-scale
Anselin, L. (1988). Spatial econometrics, methods and research metadata repositories in the big data age
models. Dordrecht: Kluwer Aacademic.
and the advancement in geospatial information
Arbia, G. (2006). Spatial econometrics: Statistical founda-
tions and applications to regional convergence. Hei- technologies have enabled geospatial big data
delberg: Springer. analytics for the quantitative study of science.
Arbia, G. (2014). A primer for spatial econometrics.
Basingstoke: Palgrave-MacMillan.
Arbia, G. (2016). Spatial econometrics. Foundations and
Trends in Econometrics, 8(3–4), 145–265. Main Research Topics
Baltagi, B. (2013). Econometric analysis of panel data
(5th ed.). New York: Wiley. The earliest spatial scientometrics studies date
LeSage, J., & Pace, K. (2009). Introduction to spatial
back to 1970s. Researchers analyzed the distribu-
econometrics. Boca Raton: Chapman and Hall/CRC
Press. tion of worldwide science productivity by region
and country. Later on, the availability of more
detailed affiliation address information, and geo-
graphic coordinate data offers the possibility to
investigate the role of physical distance in collab-
Spatial Scientometrics orative knowledge production. And the “spatial”
dimension can refer to not only the “geographic
Song Gao space” but also the “cyberspace.” The book Atlas
Department of Geography, University of of Science: Visualizing What We Know collected a
California, Santa Barbara, CA, USA series of visual maps in cyberspace for navigating
Department of Geography, University of the dynamic structure of science and technology
Wisconsin-Madison, Madison, WI, USA (Börner 2010). According to the research frame-
work for spatial scientometrics proposed by
Frenken et al. (2009), there are at least three
Synonyms main topics addressed in this research domain:
(1) spatial distribution, it studies the location
Geospatial scientometrics arrangement of different scientific activities
including research collaborations, publications,
and citations across the Earth’s surface. Whether
Definition/Introduction geographic concentration or clustered patterns can
bring advantages in scientific knowledge produc-
The research field of scientometrics (or tion is an important research issue in spatial
bibliometrics) is concerned with measuring and scientometrics. (2) Spatial bias, it refers to those
analyzing science, with the aim of quantifying a uneven spatial distributions on the scientific activ-
publication, a journal, or a discipline’s structure, ities and their structure because of the limits on
impact, change, and interrelations. The spatial research funding, intellectual property, equip-
dimension (e.g., location, place, proximity) of ment, language, and so on. One prominent spatial
Spatial Scientometrics 873

bias is that researchers collaborate domestically a citation, or a researcher. Popular geocoding tools
more frequently than internationally, which might include Google Maps Geocoding API and ArcGIS
also be influenced by the number of researchers in Online Geocoding Service.
a country. Another spatial bias is that collabora- After getting the coordinate information, a
tive articles from nearby research organizations variety of statistical analysis and mapping/
are more likely to be cited than articles from geovisualization techniques can be employed for
research organizations further away within the spatial scientometrics analyses (Gao et al. 2013).
same country. But there is a positive effect of A simplistic approach showing the spatial distri-
international co-publications on citation impact bution pattern is to map the affiliation location of
compared with domestic co-publications. Such pat- publications or citations or to aggregate the affil-
terns might change with the increasing accessibility iation locations to the administrative places (e.g.,
of crowdsourced or open-sourced bibliographic city or country boundaries). Another method is to
databases. Regarding researchers’ trajectory or aca- use the kernel density estimation (KDE) mapping
demic mobility patterns, they are also highly skew to identify the “hotspot regions” in the geography
distributed across countries. Recent interests arise of science (Bornmann and Waltman 2011). The
in the analysis of the origin patterns of regional or KDE mapping has been widely used in spatial
international conference participants. (3) Citation analysis to characterize a smooth density surface
impact, it attracts much attention in the that shows the geographic clustering of point or
scientometrics studies. In academia, the number line features. The two-dimensional KDE can iden-
of citations is an important criterion to estimate tify the regions of citation clusters for each cited
the impact of a scientific publication, a journal, or paper by considering both the quantity of citations
a scientist. Spatial scientometrics researchers study and the area of geographical space, compared to
and measure the geospatial distributions and the single-point representation which may neglect
impacts of citations for scientific publications and the multiple citations in the same location. More-
knowledge production. over, the concept of geographic proximity (dis-
tance) is widely used to quantify the spatial
patterns of co-publications and citations. In addi-
Key Techniques and Analysis Methods tion, the socioeconomic factors that affect the
scientific interactions have also been addressed.
In order to analyze the geospatial distribution and Boschma (2005) proposed a proximity framework
interaction patterns of scientific activities in of physical, cognitive, social, and institutional
scientometrics studies, one important task is to forms to study the scientific interaction patterns.
get the location information of publications or Researchers studied the relationship between each
research activities. There are two types of location proximity and citation impact by controlling other
information: (1) place names at different geopo- proximity variables. Also, the change of author S
litical scales (e.g., city, state, country, region) and affiliations over time adds complexity to the net-
(2) geographic coordinates (i.e., latitude and lon- work analysis of universities. The approach with
gitude). The place information can usually be thematic, spatial, and similarity operators has
retrieved from the affiliation information in stan- been studied in the GIScience community to
dard bibliographic databases such as Thomson address this challenging issue.
Reuters Web of Science or Elsevier Scopus. But When measuring the citation impact of a pub-
the geographic coordinate information is not lication, a journal, or a scientist, traditional
directly available in those databases. Additional approaches purely counting the number of cita-
processing techniques “georeferencing” which tions do not take into account the geospatial and
assigns a geographic coordinate to a place-name temporal impact of the evaluating target. The spa-
and “geocoding” which converts an address text tial distribution of citations could be different
into a geographic coordinate are required to gen- even for publications with the same number of
erate the coordinate information for a publication, citations. Similarly, some work may be relevant

and cited for decades, while other contributions quantitative, and computational approaches and
only have a short-term impact. Therefore, Gao et al. technologies into the spatial scientometrics ana-
(2013) proposed a theoretical and novel analytical lyses. The spatial scientometrics is still an infant
spatial scientometrics framework which employs interdisciplinary field with the support of spatial
spatiotemporal KDE, cartograms, distance distri- analysis, information science and statistic meth-
bution curves, and spatial point patterns to evaluate odologies. New data sources and measurements to
the spatiotemporal citation impacts for scientific evaluate the excellence in the geography of sci-
publications and researchers. Three geospatial cita- ence are emerging in the age of big data.
tion impact indices (Sinstitution index, Scity index,
and Scountry index) were developed to evaluate an
individual scientist’s geospatial citation impact, Further Reading
which complement traditional nonspatial measures
such as h-index and g-index. Börner, K. (2010). Atlas of science: Visualizing what we
know. Cambridge: The MIT Press.
Bornmann, L., & Waltman, L. (2011). The detection of
“hot regions” in the geography of science – A visuali-
Challenges in the Big Data Age zation approach by using density maps. Journal of
Informetrics, 5(4), 547–553.
Boschma, R. (2005). Proximity and innovation: A critical
Considering the three V’s characteristics (volume,
assessment. Regional Studies, 39(1), 61–74.
velocity, and variety) of big data, there are many Bratt, S., Hemsley, J., Qin, J., & Costa, M. (2017). Big
challenges in big-data-driven (spatial) data, big metadata and quantitative study of science: A
scientometrics studies. These challenges require workflow model for big scientometrics. Proceedings of
the Association for Information Science and Technol-
both computationally intensive processing and
ogy, 54(1), 36–45.
careful research design (Bratt et al. 2017). First, Frenken, K., Hardeman, S., & Hoekman, J. (2009). Spatial
the author names, affiliation trajectory, and insti- scientometrics: Towards a cumulative research pro-
tution names and locations often need to be dis- gram. Journal of Informetrics, 3(3), 222–232.
Gao, S., Hu, Y., Janowicz, K., & McKenzie, G. (2013,
ambiguated and uniquely identified. Second, the
November). A spatiotemporal scientometrics frame-
heterogonous formats (i.e., structured, semi-struc- work for exploring the citation impact of publications
tured, unstructured) of bibliographic data might and scientists. In Proceedings of the 21st ACM
be incredibly varied and cannot fit into a single SIGSPATIAL international conference on advances in
geographic information systems (pp. 204–213).
spreadsheet or a database application. Moreover,
Orlando, Florida, USA: ACM.
the metadata standards are inconsistent across
multiple sources and may change over time. All
the abovementioned challenges can affect the
validity and reliability of (spatial) scientometrics
studies. The uncertainty or sensitivity analyses Spatiotemporal Analytics
need to be included in the data processing and
analytical workflows. Tao Cheng and James Haworth
SpaceTimeLab, University College London,
London, UK
Conclusion

Spatial scientometrics involves the studies of spa- Spatiotemporal analytics or space-time analytics
tial patterns, impacts, and trends of scientific (STA) is the use of integrated space-time thinking
activities (e.g., co-publication, citation, academic and computation to discover insights from geo-
mobility). In the new era, because of the increas- located and time-stamped data. This involves
ing availability of digital bibliographic databases extracting unknown and implicit relationships,
and open data initiatives, researchers from multi- structures, trends, or patterns from massive
ple domains can contribute various qualitative, datasets collected at multiple locations and times

that make up space-time (ST) series. Examples of visualization with expert knowledge and data
such datasets include daily temperature series at analysis.
meteorological stations, street-level crime counts Hand in hand with geovisual analytics go the
in world capital cities, and daily traffic flows on statistical ESTDA tools of STA. Particularly cen-
urban roads. The toolkit of STA includes explor- tral to STA is the concept of spatiotemporal
atory ST data analysis (ESTDA) and visualiza- dependence. To paraphrase Tobler’s first law of
tion, spatiotemporal modeling, prediction, geography (Tobler 1970), an observation from
classification and clustering (profiling), and sim- nature is that near things tend to be more similar
ulation, which are developed based upon the latest than distant things both in space and in time. A
progress in spatial and temporal analysis. space-time series may exhibit ST dependence,
STA starts with ESTDA, which is used to which describes its evolution over space and
explore patterns and relationships in ST data. time. If the ST dependence in a dataset can be
This ranges from ST data visualization and map- modeled, then one can make predictions of future
ping to geovisual analytics and statistical values of the series. ST dependence can be quan-
hypothesis testing. ST data visualization titatively measured using ST autocorrelation indi-
explores the patterns hidden in the large ST ces such as the ST autocorrelation function
datasets using visualization, animation, and (Cheng et al. 2011) and the ST (semi)variogram
interactive techniques. This includes conven- (Griffith and Heuvelink 2009), which are key
tional 2D maps and graphs alongside advanced tools of ESTDA. Also important are tools for
3D visualizations. The 3D space-time cube, pro- measuring ST heterogeneity, whereby global pat-
posed by Hägerstraand (1970), is an important terns of ST autocorrelation are violated at the local
tool in STA. It consists of two dimensions of level. When dealing with point patterns, tests for
geographic locations on a horizontal plane and ST clustering or ST density estimation may be
a time dimension in the vertical plane (or axis). used.
The space-time cube is used to visualize trajec- ESTDA helps to reveal the most appropriate
tories of objects in 3D space-time dimension, or method for STA, which varies depending on the
“space-time paths,” but can also show hotspots, data type and objective. Alongside visualization,
isosurfaces, and densities (Cheng et al. 2013; the core tasks of STA are predictive modeling,
Demsar et al. 2015). clustering/profiling, and simulation. Predictive
In STA, ST data visualization is undertaken as modeling involves using past values of a ST series
part of an iterative process involving information (and possible covariates) to forecast future values.
gathering, data preprocessing, knowledge repre- Depending on the data, predictive modeling may
sentation, and decision-making, which is known involve either classification, whereby the desired
as geovisual analytics (Andrienko et al. 2007). output is two or more classes, or regression,
Geovisual analytics is an extension of visual ana- whereby the desired output is continuous. These S
lytics, which is becoming more important to many tasks are referred to as supervised learning as the
disciplines including scientific research, business desired output is known. Predictive modeling
enterprise, and other areas that face problems of methods can be separated into two broad catego-
an overwhelming avalanche of data. First, ST data ries: statistical and machine learning approaches.
are visualized to reveal basic patterns, and then Statistical methods are generally adaptations of
users will use their perception (intuition) to gain existing models from the fields of time series
insights from the images produced. Insights gen- analysis, spatial analysis, and econometrics to
erated are then transformed into knowledge. This deal with spatiotemporal data. Some of the
knowledge can be used to generate hypotheses methods commonly used in the literature include
and carry out further ESTDA, the results of space-time autoregressive integrated moving
which will be visualized for presentation, and average (STARIMA) models (Pfeifer and Deutsch
further knowledge generation. Geovisual analyt- 1980) and variants, multiple ARIMA models,
ics is an integrated approach to combining ST data space-time geostatistical models (Heuvelink and

Griffith 2010), spatial panel data models (Elhorst The aforementioned predictive modeling
2003), geographically and temporally weighted methods assume knowledge of the desired output.
regression (Huang et al. 2010; Fotheringham Often we will know little about an ST dataset and
et al. 2015), and eigenvector spatial filtering may wish to uncover hidden structure in the data.
(Patuelli et al. 2009). More recently, Bayesian This is known as unsupervised learning and is
hierarchical models have become popular due to addressed using clustering methods. Clustering
their ability to capture spatial, temporal, and spa- involves grouping unlabeled objects that share
tiotemporal effects (Blangiardo and Cameletti similar characteristics. The goal is to maximize
2015). the intraclass similarity and minimize the
The aforementioned methods tend to rely on interclass similarity. Widely used spatial cluster-
strong statistical assumptions and can be difficult ing techniques, e.g., K-means and K-medoids,
to fit to large datasets. Increasingly, researchers and have been extended to spatiotemporal clustering
practitioners are turning toward machine learning problems. Initial research on spatial clustering has
and data mining methods that are better equipped to focused on point data with popular algorithms
deal with the heterogeneous, nonlinear, and multi- such as DBSCAN and BIRCH. However, design-
scale properties of big ST data. Artificial neural ing an effective ST clustering algorithm is a diffi-
networks (ANNs), support vector machines cult task because it must account for the dynamics
(SVMs), and random forests (RFs) are now being of a phenomenon in space and time.
successfully applied to ST predictive modeling Very few clustering algorithms consider the spa-
problems (Kanevski et al. 2009). ANNs are a family tial, temporal, and thematic attributes seamlessly
of nonparametric methods for function approxima- and simultaneously. Capturing the dynamicity in
tion that have been shown to be very powerful tools the data is the most difficult challenge in ST cluster-
in many application domains. They are inspired by ing, which is the reason that traditional clustering
the observation that biological learning is governed algorithms, in which the clustering is carried out on
by a complex set of interconnected neurons. a cross section of the phenomenon, cannot be
Although individual neurons may be simple in directly applied to ST phenomena. The arbitrarily
structure, their interconnections allow them to per- chosen temporal intervals may not capture the real
form complex tasks such as pattern recognition and dynamics of the phenomena since they only con-
classification. SVMs are a set of supervised learning sider the thematic values at the same time, which
methods originally devised for classification tasks cannot capture the influence of flow (i.e., time lag
that are based on the principles of statistical learning phenomena). It is only recently that this has been
theory (Vapnik 1999). SVMs use a linear algorithm attempted. ST-DBSCAN is one method that has
to find a solution to classification or regression been developed and applied to clustering ST data
problems that is linear in a feature space and non- (Birant and Kut 2007). Spatiotemporal scan statis-
linear in the input space. This is accomplished using tics (STSS) is a clustering technique that was orig-
a kernel function. SVMs have many advantages: inally devised to detect disease outbreaks (Neill
(1) they have a globally optimal solution; (2) they 2008). The goal is to automatically detect regions
have a built-in capacity to deal with noisy data; and of space that are “anomalous,” “unexpected,” or
(3) they can model high-dimensional data effi- otherwise “interesting.” Spatial and temporal prox-
ciently. These advantages have made SVMs, imities are exploited by scanning the entire study
along with other kernel methods, a very important area via overlapping space-time regions (STRs).
tool in STA. RFs are ensembles of decisions trees. Each STR represents a possible disease outbreak
They work on the premise that the mode (classifi- with a geometrical shape which is either a cylinder
cation) or average (regression) of a large number of or rectangular prism. The base corresponds to the
trees trained on the same data will tend towards an spatial dimension and the height corresponds to the
optimal solution. RFs have achieved performance temporal dimension. The dimensions of the STR are
comparable to SVMs and ANNs and are becoming allowed to vary in order to detect outbreaks of
common in STA. varying sizes.
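To make the space-time clustering idea above concrete, the short sketch below applies a standard density-based clusterer to event points whose spatial and temporal coordinates have been rescaled onto comparable units. This only approximates the spirit of dedicated algorithms such as ST-DBSCAN; the synthetic data, the hours-to-kilometers conversion factor, and the eps/min_samples values are illustrative assumptions, not values taken from the literature.

```python
# Minimal sketch: density-based clustering of events in space and time.
# Standard DBSCAN is used on jointly scaled (x, y, t) coordinates as a rough
# stand-in for ST-DBSCAN; all parameter choices below are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic events: (x_km, y_km, t_hours), e.g., geocoded incident records.
cluster_a = rng.normal(loc=[2.0, 3.0, 10.0], scale=[0.3, 0.3, 1.0], size=(50, 3))
cluster_b = rng.normal(loc=[8.0, 1.0, 40.0], scale=[0.3, 0.3, 1.0], size=(50, 3))
noise = rng.uniform(low=[0, 0, 0], high=[10, 10, 48], size=(20, 3))
events = np.vstack([cluster_a, cluster_b, noise])

# Treat one hour as equivalent to 0.5 km so a single distance threshold
# applies jointly to space and time (an assumed conversion, tuned per problem).
km_per_hour_equivalent = 0.5
scaled = events.copy()
scaled[:, 2] *= km_per_hour_equivalent

labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(scaled)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```

A purpose-built space-time clusterer would instead use separate spatial and temporal neighborhood thresholds, which is exactly the dynamicity problem the surrounding discussion highlights.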

The final task of STA discussed here is simu- Cheng, T., Tanaksaranond, G., Brunsdon, C., & Haworth,
lation, which involves the development of models J. (2013). Exploratory visualisation of congestion evo-
lutions on urban transport networks. Transportation
for simulating complex ST processes. Two com- Research Part C: Emerging Technologies, 36, 296–
mon methods are cellular automata (CA) and 306. https://doi.org/10.1016/j.trc.2013.09.001.
agent-based modeling (ABM) (Batty 2007). In Demsar, U., Buchin, K., van Loon, E. E., & Shamoun-
CA, a spatial region is divided into cells that Baranes, J. (2015). Stacked space-time densities: A
geovisualisation approach to explore dynamics of
have certain states. The probability of a cell space use over time. GeoInformatica, 19, 85–115.
changing from one state to another is affected by doi:10.1007/s10707-014-0207-5.
the state of surrounding cells at the same or pre- Elhorst, J. P. (2003). Specification and estimation of spatial
vious times. In ABMs, agents are constructed that panel data models. International Regional Science
Review, 26, 244–268. https://doi.org/10.1177/
have certain behaviors that determine their inter- 0160017603253791.
action with their environment and other agents. In Fotheringham, A. S., Crespo, R., & Yao, J. (2015). Geo-
both model types, the aim is to study emergent graphical and temporal weighted regression (GTWR).
behavior from small-scale interactions. Simula- Geographical Analysis, 47, 431–452. https://doi.org/
10.1111/gean.12071.
tion models have been applied to study many Griffith, D. A., & Heuvelink, G. B. (2009, June). Deriving
phenomena including traffic congestion, urban space–time variograms from space–time auto-
change, emergency evacuation and vegetation regressive (STAR) model specifications. In: StatGIS
dynamics, and policing and security. If properly 2009 Conference, Milos, Greece.
Hägerstraand, T. (1970). What about people in
calibrated, simulation models can be used to pre- regional science? Papers in Regional Science, 24,
dict ST processes over long time periods and to 7–24. https://doi.org/10.1111/j.1435-5597.1970.
develop and test theories. However, the principal tb01464.x.
issue with such methods is validation against real Heuvelink, G. B. M., & Griffith, D. A. (2010). Space–time
geostatistics for geography: A case study of radiation
data, which is only recently being addressed monitoring across parts of Germany. Geographical
(Wise and Cheng 2016). Analysis, 42, 161–179.
Using the toolkit of STA, the researcher can Huang, B., Wu, B., & Barry, M. (2010). Geographically
uncover insights into their ST data that they may and temporally weighted regression for modeling
spatio-temporal variation in house prices. International
otherwise miss and make their data work for them, Journal of Geographical Information Science, 24,
thus realizing its potential whether it be for busi- 383–401. https://doi.org/10.1080/13658810802672
ness or scientific research. 469.
Kanevski, M., Timonin, V., & Pozdnukhov, A. (2009).
Machine learning for spatial environmental data: The-
ory, applications, and software. Har/Cdr. (Ed.), EFPL
Further Reading Press.
Neill, D. B. (2008). Expectation-based scan statistics for
Andrienko, G., Andrienko, N., Jankowski, P., Keim, D., monitoring spatial time series data. International Jour-
Kraak, M. J., MacEachren, A., & Wrobel, S. (2007). nal of Forecasting, 25(3), 498–517.
Geovisual analytics for spatial decision support: Set- Patuelli, R., Griffith, D. A., Tiefelsdorf, M., & S
ting the research agenda. International Journal of Geo- Nijkamp, P. (2009). Spatial filtering and eigenvec-
graphical Information Science, 21, 839–857. tor stability: Space-time models for German unem-
Batty, M. (2007). Cities and complexity: Understanding ployment data. Quad. Della Fac. Sci. Econ.
cities with cellular automata, agent-based models, and DellUniversità Lugano.
fractals. London: The MIT Press. Pfeifer, P. E., & Deutsch, S. J. (1980). A three-stage itera-
Birant, D., & Kut, A. (2007). ST-DBSCAN: An algorithm tive procedure for space-time modelling. Techno-
for clustering spatial–temporal data. Data Knowledge metrics, 22, 35–47.
Engineering Intelligent Data Mining, 60, 208–221. Tobler, W. R. (1970). A computer movie simulating urban
https://doi.org/10.1016/j.datak.2006.01.013. growth in the Detroit region. Economic Geography, 46,
Blangiardo, M., & Cameletti, M. (2015). Spatial and 234–240. https://doi.org/10.2307/143141.
spatio-temporal Bayesian models with R – INLA (1st Vapnik, V. (1999). The nature of statistical learning theory
ed.). Chichester: Wiley. (2nd ed.). New York: Springer.
Cheng, T., Haworth, J., & Wang, J. (2011). Spatio-tempo- Wise, S. C., & Cheng, T. (2016). How officers create
ral autocorrelation of road network data. Journal of guardianship: An agent-based model of policing.
Geographical Systems. https://doi.org/10.1007/ Transactions in GIS, 20, 790–806. https://doi.org/10.
s10109-011-0149-5. 1111/tgis.12173.

Speech Processing

▶ Voice User Interaction


Speech Recognition

▶ Voice User Interaction


Standardization

Travis Loux
Department of Epidemiology and Biostatistics, College for Public Health and Social Justice, Saint Louis University, St. Louis, MO, USA


Definition/Introduction

In many big data applications, the data was not collected through a formally designed study but through whatever means were available. The data is often observational (with no randomization mechanism) or incomplete (a convenience sample rather than the full population) (Fung 2014). Thus, subgroups within the data, and the data set as a whole, may not be representative of the appropriate population, leaving analyses based on such data open to biases. Data standardization is the process of scaling a data set to be comparable to a reference distribution. In practice, standardization is used to equate two or more groups within a sample on a subset of variables (internal standardization) or to equate a sample to an external source, such as another sample or a known population (external standardization). Roughly speaking, internal standardization is used to support causal inferences between the two groups, while external standardization is used to generalize results from a sample to a population. Done carefully, standardization allows analysts to make inferences and generalizations they would be unable to make with basic statistical comparisons.

Standardization, Fig. 1 The density function of a normal distribution with mean μ and standard deviation σ


A Simple Example

One of the most commonly used standardization tools, and an instructive starting point, is the standard normal distribution, commonly denoted as the Z distribution. If a variable Y follows a normal distribution with mean μ and standard deviation σ (Fig. 1), it can be standardized through the formula

Z = (Y − μ) / σ

The resulting Z variable will also be normally distributed but will have mean 0 and standard deviation 1. This is a standardization process because Z will have the same distribution regardless of the initial values of μ and σ, meaning any two normal distributions can be transformed to the same scale.

The initial motivation for standardizing normal distributions was computational: it is difficult to compute an area under a normal curve using pen-and-paper methods. The ability to rescale any normal distribution to a standard one meant that a single table of numbers could provide all the information necessary to compute areas for any normal distribution.

For example, IQ scores are approximately normal with a mean of 100 and a standard deviation of 15. To find the probability of having an IQ score above 120, the IQ distribution can be standardized, reducing the problem to one involving the standard normal distribution:

P(IQ > 120) = P((IQ − 100)/15 > (120 − 100)/15) = P(Z > 1.33)

Similarly, the heights of adult males in the USA are approximately normally distributed with a mean of 69 in. and a standard deviation of 3 in. To find the probability of being taller than 73 in., one can follow a similar procedure:

P(HT > 73) = P((HT − 69)/3 > (73 − 69)/3) = P(Z > 1.33)

Both solutions require only the Z distribution; these problems from very different contexts can be solved by referring to the same standardized scale.


Beyond Z

Standardization can be performed on data that is not normally distributed. In this case, the resulting standardized score (y − μ)/σ can generally be interpreted as the number of standard deviations above the mean at which the observation y lies, with positive standardized scores meaning y is greater than μ and negative scores meaning y is less than μ.

Another basic standardization process is normalization (though this term has multiple meanings in data-centric fields). Normalization rescales a data set so that all values lie between 0 and 1, using the formula

y′ = (y − y_min) / (y_max − y_min)

where y_min and y_max are the minimum and maximum values of the data set, respectively. The resulting normalized values y′ range from 0 to 1. Once a data set has been normalized, values can be compared within data sets based on relative standing or across data sets based on location relative to the range of the respective data sets.
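The two rescalings just described, along with a standard-normal tail probability of the kind used in the IQ example, are easy to reproduce directly. The sketch below is a minimal illustration with invented values; the array contents and variable names are assumptions made for the example, not data from the text.

```python
# Minimal sketch with invented data: z-score standardization, min-max
# normalization, and a standard-normal tail probability like P(IQ > 120).
import numpy as np
from scipy import stats

y = np.array([104.0, 91.0, 118.0, 87.0, 130.0, 99.0])   # hypothetical IQ-like scores

# Z = (Y - mu) / sigma : mean 0, standard deviation 1 after rescaling.
z = (y - y.mean()) / y.std(ddof=1)

# y' = (y - y_min) / (y_max - y_min) : all values mapped into [0, 1].
y_norm = (y - y.min()) / (y.max() - y.min())

# P(IQ > 120) for IQ ~ Normal(100, 15), via the standardized cutoff (120 - 100)/15.
cutoff = (120 - 100) / 15
p_above = 1 - stats.norm.cdf(cutoff)   # equivalently stats.norm.sf(cutoff)

print(z.round(2), y_norm.round(2), round(p_above, 4))
```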

Direct Versus Indirect Standardization

Standardization has a long history in the study of epidemiology (e.g., Miettinen 1972). Within the field, a distinction is made between direct standardization and indirect standardization. Direct standardization scales an outcome from a study sample to estimate rates in a reference population, while indirect standardization scales an outcome from a reference population to estimate rates in the study sample. In both cases, the study sample and reference population are stratified in similar ways (e.g., using the same age strata cut points).

In direct standardization, the risk or rate of the outcome within each stratum of the study sample is calculated and then multiplied by the stratum size within the reference population to estimate the number of events expected to occur in the corresponding stratum of the reference population. These expected stratum-specific counts are then totaled over all strata and divided by the total reference population size. Using the notation from Table 1 below, the direct standardized rate in the reference population is

(Σ_i (x_i / n_i) m_i) / M,

an estimated rate in the reference population.

Standardization, Table 1 Example table for direct and indirect standardization

Stratum   Study population (Events, Size)   Reference population (Events, Size)
1         x_1, n_1                          y_1, m_1
2         x_2, n_2                          y_2, m_2
...       ...                               ...
k         x_k, n_k                          y_k, m_k
Total     N                                 M

Indirect standardization reverses this process by applying stratum-specific outcome rates from the reference population to the study sample strata and taking a stratum-size weighted average across the strata of the study sample. In Table 1, the indirect standardized rate in the study sample is

(Σ_i (y_i / m_i) n_i) / N.

The result of indirect standardization is the outcome rate that would have been expected in the study sample if the individuals had experienced the outcome at the same rate as the reference population.

With larger data sets, simple stratification like the methods discussed above can be improved upon by using more variables, yielding more fully standardized results, and more observations, allowing for finer strata.

Internal Standardization

Internal standardization attempts to balance two or more subsets of a data set on a set of baseline variables and is commonly used in causal inference analyses. Common approaches to internal standardization include finding similar observations across the subsets through matching, or weighting observations within the subsets so that the baseline variables mirror a reference distribution.

There are numerous matching algorithms (Stuart 2010), though most follow a similar framework. In "greedy" or "nearest neighbor" one-to-one (1:1) matching, an observation from one group (usually the intervention or exposure group in a causal inference setting) is randomly selected, and a similar observation from the comparison group is found and paired to the intervention observation. This process is repeated until all observations in the intervention group have been paired to an observation in the comparison group. The result is a subset of the comparison group in which each individual looks similar to an observation in the intervention group, effectively standardizing the comparison group to the distribution of the intervention group (Fig. 2). Variations on this concept include matching two or more comparison observations to each intervention observation (1:2 or 1:k matching).

Standardization, Fig. 2 Matching of a hypothetical data set based on closeness of the variables represented on the horizontal and vertical axes

In contrast to greedy matching, optimal matching attempts to find the best comparison subset by using a global measure of imbalance between intervention and comparison groups and has been found to yield significant improvements over greedy matching (Rosenbaum 1989). Optimal matching is far more computationally intensive with regards to both memory/storage and

time, and may be infeasible in some big data settings without substantial resources. Other advances in matching include matching for more than two groups (e.g., Rassen et al. 2013). As an additional benefit of standardizing groups, matching also makes the resulting analytical conclusions less susceptible to model misspecification (Ho et al. 2007).

An alternative, or in some cases complementary, approach to matching is weighting. Standardization weighting begins with modeling the probability of subgroup membership based on a set of relevant variables. Weights are then defined as the ratio of the probabilities of group membership. For example, suppose an analyst wants to standardize group G = B to group G = A. Then the weight for observation i in Group B is P(G = A | X = x_i) / P(G = B | X = x_i), where X contains the variables to be standardized (Sato and Matsuyama 2003). Group membership probabilities can be estimated using any standard classification algorithm, such as logistic regression or neural networks.

In the simple intervention/comparison setting, this group membership probability is called a propensity score (Rosenbaum and Rubin 1983) and is commonly denoted e(x_i) = P(T = 1 | X = x_i), where T is an indicator for intervention. To standardize the comparison group to the intervention group, observation i in the comparison group gets the weight e(x_i) / (1 − e(x_i)). Alternatively, one could standardize both the intervention and comparison groups to the full sample distribution. To perform this analysis, observations in the intervention group get weight 1/e(x_i) while observations in the comparison group get weight 1/(1 − e(x_i)). In more complex settings, analysts can use the generalized propensity score (Imai and Dyk 2004) to obtain group membership probabilities.

As a concrete example, suppose in a data set there is a subgroup of n = 1000 for which e(x_i) = 0.20. Within this subgroup, there will be 200 intervention and 800 comparison observations. Weighting each intervention observation by 1/e(x_i) = 1/0.20 = 5 will yield a weighted sample size of 200 × 5 = 1000, while weighting each comparison observation by 1/(1 − e(x_i)) = 1/0.80 = 1.25 will yield a weighted sample size of 800 × 1.25 = 1000. The weighted distributions of the variables included in X will also match the full sample distribution. Thus, both the intervention and comparison groups are standardized to the full sample.


External Standardization

External standardization is used to scale a sample to a reference data source in order to account for selection bias. Common applications of external standardization include adjusting nonrandom samples to match a well-defined population, for example, the US voting population, and generalizing results from randomized trials.

An increasingly popular tool for standardizing a nonrandom sample to a target population is multilevel regression with poststratification (MRP). To begin MRP, a multilevel regression model is applied to a data set. The predicted values from this model are then weighted in proportion to the distribution of the generating predictors in the target population. Wang, Rothschild, Goel, and Gelman (2014) collected data on voting history and intent from a sample of Xbox users leading up to the 2012 US presidential election. Fitted values from the regression model were weighted to match the US electorate on demographics, political party identification, and 2008 presidential vote. In a retrospective analysis, this approach predicted national voting results within one percentage point. In other uses of MRP, Ghitza and Gelman (2013) and Zhang et al. (2015) used results from large surveys to weight specific small, hard-to-sample populations (electoral subgroups in Ghitza and Gelman (2013) and Missouri counties in Zhang et al. (2015)).

Though analysis for intervention effects requires internal standardization between intervention and comparison groups for causal inferences of average effects, external standardization is necessary to obtain population estimates if the effects vary across individuals (called heterogeneous effects). Cole and Stuart (2010) adapted propensity score weighting to generalize the estimated treatment effects from a clinical trial of HIV treatment to the full US HIV-positive population

as estimated by the Center for Disease Control and ▶ Correlation Versus Causation
Prevention. The two data sets were combined with ▶ Data Quality Management
a selection variable indicating selection into the ▶ Demographic Data
trial from the general population. Cole and col- ▶ Regression
leagues then use propensity score weighting,
replacing the intervention indicator with the selec-
tion indicator, to standardize the clinical trial sam- Further Reading
ple to the larger HIV-positive population. Stuart,
Cole, Bradshaw, and Leaf (2011) further Cole, S. R., & Stuart, E. A. (2010). Generalizing evidence
from randomized clinical trials to target populations:
developed these methods in the context of a
The ACTG 320 trial. American Journal of Epidemiol-
school-level behavioral intervention study and ogy, 172(1), 107–115. https://doi.org/10.1093/aje/
used propensity score weighting to evaluate the kwq084.
magnitude of selection bias. Rudolph, Diaz, Fung, K. (2014). Toward a more useful definition of Big
Data. Retrieved from http://junkcharts.typepad.com/
Rosenblum, and Stuart (2014) investigated the
numbersruleyourworld/2014/03/toward-a-more-use
use of other internal standardization techniques ful-definition-of-big-data.html.
for estimating popoulation intervention effects. Ghitza, Y., & Gelman, A. (2013). Deep interactions with
MRP: Election turnout and voting patterns among
small electoral subgroups. American Journal of Politi-
cal Science, 57(3), 762–776. https://doi.org/10.1111/
Conclusion ajps.12004.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007).
Standardization can be used to alleviate some of Matching as nonparametric preprocessing for reducing
model dependence in parametric causal inference.
the pitfalls of working with big data (Fung 2014).
Political Analysis, 15(3), 199–236. https://doi.org/10.
Since big data is usually observational in nature, 1093/pan/mpl013.
causal inferences cannot be made from basic Imai, K., & Dyk, D. a. v. (2004). Causal inference with
between-group comparisons. Internal standardi- general treatment regimes. Journal of the American
Statistical Association, 99(467), 854–866. https://doi.
zation will equate intervention or exposure groups
org/10.1198/016214504000001187.
on a set of baseline variables. This procedure Miettinen, O. S. (1972). Standardization of risk ratios.
ensures comparability between the two groups American Journal of Epidemiology, 96(6), 383–388.
on these measures and excludes them as potential Rassen, J. A., Shelat, A. A., Franklin, J. M., Glynn, R. J.,
Solomon, D. H., & Schneeweiss, S. (2013). Matching
causal explanations. In addition, big data is often
by propensity score in cohort studies with three treat-
not complete and may have serious selection ment groups. Epidemiology, 24(3), 401–409. https://
biases, meaning certain types of observations doi.org/10.1097/EDE.0b013e318289dedf.
may be systematically less likely to appear in the Rosenbaum, P. R. (1989). Optimal matching for observa-
tional studies. Journal of the American Statistical Asso-
data set. A naive analysis of such data will yield
ciation, 84(408), 1024–1032. https://doi.org/10.1080/
results that reflect this disparity and do not accu- 01621459.1989.10478868.
rately represent the broader population. External Rosenbaum, P. R., & Rubin, D. B. (1983). The central role
standardization can be used to poststratify, of the propensity score in observational studies for
causal effects. Biometrika, 70(1), 41–55. https://doi.
weight, or otherwise equate a data set with a
org/10.1093/biomet/70.1.41.
known population distribution, e.g., from the US Rudolph, K. E., Diaz, I., Rosenblum, M., & Stuart, E. A.
Census Bureau. The resulting conclusions may (2014). Estimating population treatment effects from a
then be more representative of the full population survey subsample. American Journal of Epidemiol-
ogy, 180(7), 737–748. https://doi.org/10.1093/aje/
and more easily generalizable.
kwu197.
Sato, T., & Matsuyama, Y. (2003). Marginal structural
models as a tool for standardization. Epidemiology, 14
Cross-References (6), 680–686. https://doi.org/10.1097/01.EDE.
0000081989.82616.7d.
Stuart, E. A. (2010). Matching methods for causal infer-
▶ Association Versus Causation ence: A review and a look forward. Statistical Science,
▶ Big Data Quality 25(1), 1–21. https://doi.org/10.1214/09-STS313.

Stuart, E. A., Cole, S. R., Bradshaw, C. P., & Leaf, P. J. processes used by state educational agencies to
(2011). The use of propensity scores to assess the make education data transparent through federal
generalizability of results from randomized trials. Jour-
nal of the Royal Statistical Society: Series A, 174(2), and public reporting (US Department of
369–386. https://doi.org/10.1111/j.1467-985X.2010. Education 2015). The Statewide Longitudinal
00673.x. Data Systems Grant Program funds states’ efforts
Wang, W., Rothschild, D., Goel, S., & Gelman, A. (2014). to develop and implement these data systems in
Forecasting elections with non-representative polls.
International Journal of Forecasting. https://doi.org/ respond to legislative initiatives (US Department
10.1016/j.ijforecast.2014.06.001. of Education 2015).
Zhang, X., Holt, J. B., Yun, S., Lu, H., Greenlund, K. J., &
Croft, J. B. (2015). Validation of multilevel regression
and poststratification methodology for small area esti-
mation of health indicators from the behavioral risk Information Offered
factor surveillance system. American Journal of Epide-
miology. https://doi.org/10.1093/aje/kwv002. The data system aligns p-12 student education
records with secondary and postsecondary edu-
cation and the workforce records, with linkable
student and teacher identification numbers and
State Longitudinal Data student and teacher information on student level
System (National Center for Education Statistics 2010).
The student education records include informa-
Ting Zhang tion on enrollment, demographics, program par-
Department of Accounting, Finance and ticipation, test records, transcript information,
Economics, Merrick School of Business, college readiness test scores, successful transi-
University of Baltimore, Baltimore, MD, USA tion to postsecondary programs, enrollment in
postsecondary remedial courses, entries, and
exits from various levels of the education sys-
Definition tem (National Center for Education Statistics
2010).
State Longitudinal Data Systems (SLDS) connect
databases across two or more of state-level agen-
cies of early learning, K–12, postsecondary, and Statewide Longitudinal Data Systems
workforce. It is a state-level Integrated Data Sys- Grant Program
tem and focuses on tracking individuals
longitudinally. According to US Department of Education
(2015), the Statewide Longitudinal Data Sys-
tems Program awards grants to State educa- S
Purpose of the SLDS tional agencies to design, develop, and
implement SLDS to efficiently and accurately
SLDS are intended to enhance the ability of states manage, analyze, disaggregate, and use individ-
to capture, manage, develop, analyze, and use ual student data. As authorized by the Educa-
student education records, to support evidence- tional Technical Assistance Act of 2002, Title II
based decisions to improve student learning, to of the statute that created the Institute of Edu-
facilitate research to increase student achievement cation Sciences (IES), the SLDS Grant Program
and close achievement gaps (National Center for has awarded competitive, cooperative agree-
Education Statistics 2010), to address potential ment grants to almost all states since 2005; in
recurring impediments to student learning, to addition to the grants, the program offers many
measure and document education long-term services and resources to assist education agen-
return on investment, to support education cies with SLDS-related work (US Department
accountability systems, and to simplify the of Education 2016).

Challenges Maintaining Longitudinal Data


Many state’s SLDS already have linked student
In addition to the challenges an Integrated Data records, but decision making based on a short-term
System has, SLDS has the following main return on education investment is not necessarily
challenges: useful; the word “longitudinal” is the keystone
needed for development of a strong business case
Training/Education Provider Participation for sustained investment in a SLDS (Stevens and
In spite of the recent years’ progress, participa- Zhang 2014). “Longitudinal” means the capability
tion by training/education providers has not been to link information about individuals across defined
universal. To improve the training and education segments and through time. While there is no evi-
coverage, a few states have taken effective dence that the length of data retention increases
action. For example, the Texas state legislature identity disclosure risk, public concern about data
has tied a portion of the funding of state technical retention is escalating (Stevens and Zhang 2014).
colleges to their ability to demonstrate high
levels of program completion and employment
in occupations related to training (Davis Examples
et al. 2014).
Examples of US SLDS include:
Privacy Issues and State Longitudinal Data
Systems Florida Education & Training Placement Informa-
To ensure data privacy and protect personal infor- tion Program
mation, Family Educational Rights and Privacy Louisiana Workforce Longitudinal Data System
Act (FERPA), the Pupil Protection Rights Act (WLDS)
(PPRA), and Children’s Online Privacy Protec- Minnesota’s iSEEK data.
tion Act (COPPA) are issued (Parent Coalition Heldrich Center data at Rutgers University
for Student Privacy 2017). However, the related Ohio State University’s workforce longitudinal
issues and rights are complex, and the privacy administrative database
rights provided by law are often not provided in University of Texas Ray Marshall Center database,
practice (National Center for Education Statistics Virginia Longitudinal Data System
2010). For a sustained SLDS, a push in the Washington’s Career Bridge
established privacy rights is important. Connecticut’s Preschool through Twenty and
Workforce Information Network
FERPA Interpretation Delaware Education Insight Dashboard
Another challenge is that some state education Georgia Statewide Longitudinal Data System and
agencies have been reluctant to share their educa- Georgia Academic and Workforce Analysis
tion records, largely due to narrow state interpre- and Research Data System (GA AWARDS)
tations of the confidentiality provisions of FERPA Illinois Longitudinal Data System
and its implementing regulations (Davis et al. Indiana Network of Knowledge (INK),
2014). Many states have overcome potential Maryland Longitudinal Data System
FERPA-related obstacles in their own unique Missouri Comprehensive Data System
ways, for example: (1) obtaining legal advice Ohio Longitudinal Data Archive (OLDA)
recognizing that the promulgation of amended South Carolina Longitudinal Information Center
FERPA regulations was intended to facilitate the for Education (SLICE)
use of individual-level data for research purposes, Texas Public Education Information Resource
(2) maintaining the workforce data within the (TPEIR) and Texas Education Research Center
education state’s agency, and (3) creating a special (ERC)
agency that holds both the education and work- Washington P-20W Statewide Longitudinal Data
force data (Davis et al. 2014). System.

Conclusion Federal Register. Available at https://www.fed


eralregister.gov/documents/2016/10/07/2016-24298/
agency-information-collection-activities-comment-
SLDS connects databases across two or more of request-state-longitudinal-data-system-slds-survey.
agencies of p-20 and Workforce. It is a US state- Parent Coalition for Student Privacy (2017). Federal Stu-
level Integrated Data System and focuses on dent Privacy Rights: FERPA, PPRA AND COPPA,
tracking individuals longitudinally. SLDS are retrieved on May 14, 2017 from the World Wide Web
https://www.studentprivacymatters.org/ferpa_ppra_
intended to enhance the ability of states to capture, coppa/.
manage, design, develop, analyze, and use student
education records and to support data-driven deci-
sions to improve student learning and to facilitate
research to increase student achievement and
close achievement gaps. The Statewide Longitu- Statistician
dinal Data Systems (SLDS) Grant Program funds
states’ efforts to develop and implement these data ▶ Data Scientist
systems in respond to legislative initiatives. The
main challenges of SLDS include training/educa-
tion provider participation, privacy issues and
State Longitudinal Data Systems, and FERPA Statistics
interpretation, and maintaining longitudinal data.
There are many Nationwide SLDS examples. ▶ “Small” Data

Cross-References Storage
▶ Integrated Data System
Christopher Nyamful1 and Rajeev Agrawal2
1
Department of Computer Systems Technology,
North Carolina A&T State University,
Further Reading Greensboro, NC, USA
2
Information Technology Laboratory, US Army
Davis, S., Jacobson, L., & Wandner, S. (2014). Using
workforce data quality initiative databases to develop Engineer Research and Development Center,
and improve consumer report card systems. Vicksburg, MS, USA
Washington, DC: Impaq International.
National Center for Education Statistics. (2010). “Data
stewardship: Managing personally identifiable infor-
mation in student education records.” SLDS technical Introduction S
brief. Available at http://nces.ed.gov/pubsearch/
pubsinfo.asp?pubid¼2011602. Data storage generally refers to the keeping of
Stevens, D., & Zhang, T. (2014). “Toward a business case
data in an electronic or a hard copy form, which
for sustained investment in State Longitudinal Data
Systems.” Jacob France Institute. Available at http:// can be processed by a computer or a device. Most
www.jacob-france-institute.org/wp-content/uploads/ data today are captured in electronic format, pro-
JFI-WDQI-Year-Three-Research-Report1.pdf. cessed, and stored likewise. Data storage is a key
US Department of Education. (2015). “Applications for
component of the Information Technology (IT)
new awards; Statewide Longitudinal Data Systems
Program,” Federal register. Available at https://www. infrastructure. Different types of data storage,
federalregister.gov/documents/2015/03/12/2015-05682/ such as, on-site storage, remote storage, and
applications-for-new-awards-statewide-longitudinal- more recently, cloud storage, play different roles
data-systems-program.
US Department of Education (2016). “Agency information
in the computing environment. Huge streams of
collection activities; Comment request; State Longitu- data are being generated daily. Data activities
dinal Data System (SLDS) Survey 2017–2019.” from social media, data-intensive applications,

scientific research, and industries are increasing Big data storage systems face complex chal-
exponentially. These huge volumes of data sets lenges. Big data has outgrown its current infra-
must be stored for analytical purposes and also to structure, and its complexities translate into
be compliant with state laws such as the Data variables such as volume, velocity, and variety.
Protection Act. Big data means big storage. The demand for stor-
Companies such as YouTube receives one bil- age capacity and scalability has become a huge
lion unique users each month, and 100 hours of challenge for large organizations and governmen-
video are uploaded to YouTube every minute tal agencies. The existing traditional system can-
(YouTube Data Statistics 2015). Flickr receives not efficiently store and support processing of
on the average 3.5 million uploaded images these data. Data is being transmitted and received
daily, and Facebook processes 300 million photos from every conceivable direction. To enable high-
per day, and scans roughly 105 terabytes of data velocity capture, big data storage system must
each half hour. Storing these massive volumes of process with speed. Clients and automated
data has become problematic, since the conven- devices demands real-time or near real-time
tional data storage reaches a bottleneck. The stor- response in order to function or stay in business.
age demand for big data at the organizational level Late results from a storage system are of no or
is reaching petabytes (PB) and even beyond. A little value. In addition to speed, big data comes in
new generation of data storage systems that different forms. They may consist of structured
focuses on large data sets has now become a data – tables, log files and other database files,
research focus. semi-structured data and unstructured data such as
An ideal storage system comprises of a vari- pictures, blogs, and videos. There has to be a
ety of components, including disk arrays, stor- connection and correlation between these diverse
age controllers, servers, storage network data types. The complex relationship between this
switches, and management software. These key data types cannot be efficiently processed by tra-
components must fit together to achieve high ditional storage systems.
storage performance. Disks or storage devices
are fundamental to every storage system out
there. Solid-state drives and hard disk drives Storage Systems
are mostly used by organizations as their storage
device, with their capacity density expected to Different types of data storage systems serve dif-
increase at a rate of 20% (Fontana et al. 2012). ferent purposes. Organizational data requirement
Several attributes such as capacity, data transfer usually determines the choice of storage system.
rate, access time, and cost influences the choice Small to medium size enterprises may prefer to
of disk for a storage system. Magnetic disc, such keep on-site data storage. Direct-attached storage
as the hard disk drive (HDD) provides huge (DAS) is an example of on-site storage system.
capacity at a relatively low cost. More HDDs DAS architecture connects storage device directly
can be added to a storage system to scale to to hosts. This connection can be internal or exter-
meet the rate of data growth, but are subject to nal. External DAS often attached dedicated stor-
reliability risk, such as overheating, external age arrays directly to their host, and data can be
magnetic faults, and electrical faults. Besides, accessed at both block level and file level. DAS
they have relatively poor input/output opera- provides users with enhanced performance than
tions per second (IOPS) capabilities. Solid- network storage, since host does not have to tra-
state disks (SSDs) on the other hand, are more verse the network in order to read and write data.
recent and more reliable than HDDs. They pro- Communication between storage arrays and hosts
vide a high aggregate input/output data transfer can be over small computer system interface
rate and consume less energy in a storage sys- (SCSI) or Fibre Channel (FC) protocol. DAS are
tem. The disadvantage is that they are very easy to deploy and manage. In big data environ-
expensive per the capacity they provide. ment, DAS is highly limited in terms of

DAS cannot be shared among multiple nodes, and hence, when one server fails, there is no failover to ensure availability. The storage array has a limited number of ports, so DAS does not scale well to meet the demands of data growth.

The efficient dissemination of mission-critical data among clients over a wide geographical area is crucial in the big data era. Network-attached storage (NAS) infrastructure provides the flexibility of file sharing over a wide area network. NAS achieves this by consolidating widespread storage used by clients into a single system. It makes use of file-sharing protocols to provide access to the storage units. A NAS device can exist anywhere on the local area network (LAN). The device is optimized for cross-platform file services such as file sharing, retrieving, and storing. NAS comes with its own operating system, optimized to enhance performance and throughput. It provides centralized and simplified storage management to minimize data redundancy on client workstations. For large data sets, a scale-out NAS can be implemented: more storage nodes can be added while maintaining performance and low latency. Despite the functionality provided by NAS, it still has some shortcomings. NAS operates on the internet protocol (IP) network; therefore, factors such as bandwidth and response time that affect IP networks equally affect NAS performance. Massive volumes of data can increase latency, since the IP network cannot always process input/output operations in a timely manner.

Storage area networks (SANs) employ a technology that, to a considerable extent, deals with the challenges posed by big data storage. SAN architecture comes in two forms – Fibre Channel (FC) SAN and IP-SAN. FC-SAN uses the Fibre Channel protocol to communicate between hosts and storage devices. Fibre Channel is a high-speed, high-performance network technology that increases the data transfer rate between hosts and large storage systems. FC-SAN architecture is made up of servers, FC switches, connectors, storage arrays, and management software. The introduction of FC switches has enhanced the performance of FC-SAN to be highly scalable and to enable better data accessibility. FC-SAN focuses on consolidating storage nodes to increase scalability, to facilitate balanced input/output operations, and to provide high throughput. Implementing an FC-SAN architecture is very costly. Besides, it is limited in the distance it can span, on average about 10 km. Large organizations are striving to achieve the best out of their storage systems while maintaining a low cost. To make use of organizations' existing IP-based infrastructure, IP-SAN technology allows block data to be sent across IP networks. The widespread availability of IP networks makes IP-SAN attractive to many organizations that are geographically dispersed.

More often than not, big data analysis and processing involve both block-level and file-level data. Object storage technology provides for both file- and block-based data storage. A storage object is a logical collection of discrete units of data storage. Each object includes data, metadata, and a unique identifier which allows the retrieval of object data without the need to know the actual location of the storage device. Objects are of variable sizes and are ideal for storing the different types of data found in the big data environment. Object-based storage's metadata capabilities and flat addressing allow it to scale with data growth compared to the file system approach. Storing data and metadata together ensures easier manageability and migration for long-term storage. Object-based storage is a unified system which combines the advantages of both NAS and SAN. This makes it ideal for storing the massive growth of unstructured data such as photos, videos, tweets, and blogs. It also makes it attractive for cloud deployments.

Active data centers provide data storage service capabilities to clients. They make use of virtualization to efficiently deploy storage resources to organizations. Virtualization ensures flexibility of resource utilization and optimizes resource management. Data centers are built to host and manage very large data sets. Large organizations keep their data across multiple data centers to ease workload processing and as a backup in case of any eventualities. The storage layer of a data center usually consists of servers, storage devices, switches, routers, and connectors. A Fibre Channel switch is used in a SAN data center for high-speed transmission of data and commands between servers and storage disks.
Storage network devices provide the needed connectivity between hosts and storage nodes. The data center environment supports high-speed transmission of data for both block-level access, supported by SAN, and file-level access, supported by NAS. The consolidation of applications, servers, and storage under one central management increases flexibility and performance in big data settings.

Distributed Storage

Distributed file systems, such as the Hadoop Distributed File System (HDFS) (White 2012), have become very significant in the era of big data. HDFS provides a less expensive but reliable alternative to current data storage systems and runs on low-cost hardware. HDFS is optimized to handle huge volumes of data – terabytes and petabytes. It provides a high-performance data transfer rate and scalability to multiple nodes in a single cluster. HDFS is designed to be very reliable. It stores files as a sequence of blocks of data, and each block of data is replicated to another storage node to ensure reliability and fault tolerance. An HDFS cluster has two types of nodes – the NameNode and the DataNodes. The NameNode manages the file system namespace and the metadata for all files. Storage and retrieval of block-level data is done by the DataNodes as per client requests or instructions, with the NameNode keeping the list of stored blocks for each file. Storage vendors such as EMC are releasing ViPR HDFS Storage and Isilon HDFS Storage for large enterprises and data centers. These systems allow large organizations to roll an HDFS file system over their existing data in place, to perform various services efficiently.

Data backup and recovery have played a significant role in data storage systems. By creating an additional copy of production data, organizations are insured against corrupted or deleted data and can recover lost data in case of extreme disaster. Besides this objective, there is also the need for compliance with regulatory standards for data storage. The retention of data to ensure high availability brings data backup, archiving, and replication into the storage domain. Because organizations require quick recovery from backups, IT departments and storage vendors are faced with a huge challenge in supporting these activities in the big data environment. An ideal backup solution should ensure minimal loss of data, avoid storing redundant data, and provide an efficient recovery method. The tolerable rate of data loss and downtime of an organization, expressed in terms of RPO and RTO, determines the backup solution to choose. The recovery point objective (RPO) is the point in time from which data must be restored in order to resume processing transactions; the RPO determines backup frequencies. The recovery time objective (RTO) is the period of time allowed for recovery, the time that can elapse between the disaster and the activation of the secondary site. Backup media or storage devices can significantly affect data recovery time, especially with large data sets.

The implementation of a large-scale storage system is not straightforward. An ideal storage system comprises well-balanced components that fit together to achieve optimal performance. Big data requires a high-performance storage system. Such a system usually consists of a cluster of hosts, interconnected by a high-speed network to an array of disks. An example is the Lustre file system (LFS), which was partly developed by Hewlett Packard and Intel. LFS provides up to 512 PB of storage space for one file system. It has a high throughput of about 2 TB/s in a production system. It can contain up to ten million files in a directory and two billion files in a system. It allows 25,000+ clients access in a production system. It provides high availability of data and supports automated failover to meet no-single-point-of-failure requirements. EMC Isilon scale-out NAS also provides a high-performance distributed file system (Rui-Xia and Bo 2012). It consists of nodes of modular hardware arranged in a cluster. Its operating system combines memory, I/O, CPUs, and disk arrays into a cohesive storage unit as a single file system. It provides the capability to address big data challenges by providing multiprotocol file access, dynamic expansion of the file system, and high scalability.
Conclusion

Storage is still evolving with advances in technology. The explosive growth of data has outpaced storage capacity. Organizations and businesses are now more concerned about how to efficiently keep and retain all of their data. Storage vendors and data center providers are researching further ways of improving storage systems to address the overwhelming data volumes facing the industry.

References

Fontana, R. E., Hetzler, S. R., & Decad, G. (2012). Technology roadmap comparisons for TAPE, HDD, and NAND flash: Implications for data storage applications. IEEE Transactions on Magnetics, 48(5), 1692–1696. https://doi.org/10.1109/TMAG.2011.2171675.
Rui-Xia, Y., & Bo, Y. (2012). Study of NAS secure system base on IBE. Paper presented at the Industrial Control and Electronics Engineering (ICICEE), 2012 International Conference on.
White, T. (2012). Hadoop: The definitive guide. O'Reilly Media, Inc. ISBN: 9781449311520.
YouTube Data Statistics. (2015). Retrieved 01-15-2015, from http://www.youtube.com/yt/press/statistics.html.

Storage Media

▶ Data Storage

Storage System

▶ Data Storage

Stream Reasoning

▶ Data Streaming

Structured Data

▶ Data Integration

Structured Query Language (SQL)

Joshua Lee
Schar School of Policy and Government, George Mason University, Fairfax, VA, USA

Introduction

Storing, modifying, and retrieving data is one of the most important tasks in modern computing. Computers will always need to retain certain information for later use, and that information needs to be organized, secure, and easily accessible. For smaller, simpler datasets, applications like Microsoft Excel can suffice for data management. For smaller data transfer needs, XML functions effectively. However, when the size and complexity of the information becomes great enough, a complex relational database management system (RDBMS) such as SQL becomes necessary.

SQL stands for structured query language. It was initially invented in 1974 by IBM under the name SEQUEL, and was bought by Oracle in 1979, and since then it has become the dominant force in RDBMS (see http://docs.oracle.com/cd/B12037_01/server.101/b10759/intro001.htm). Unlike programming languages such as C or BASIC, which are imperative programming languages, SQL is a declarative programming language. The difference is explored in greater detail below.

In a SQL environment, you create one or more databases. Each database contains any number of tables. Each table in turn contains any number of rows of data, and each row contains cells, just like an Excel spreadsheet.
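As a minimal sketch of this database-table-row structure (the database name, table name, and columns below are illustrative examples, not taken from the entry), the following statements create a database, create a table within it, add one row, and read it back:

  -- Create a database and switch to it.
  CREATE DATABASE shop_db;
  USE shop_db;

  -- Create a table; each row will hold one customer.
  CREATE TABLE Customers (
      Customer_ID INT PRIMARY KEY,
      Name        VARCHAR(100),
      City        VARCHAR(100)
  );

  -- Insert a row (one "spreadsheet line") and read it back.
  INSERT INTO Customers (Customer_ID, Name, City)
  VALUES (1, 'John Doe', 'Los Angeles');

  SELECT Name, City FROM Customers;

Exact statements such as USE vary slightly across SQL dialects, but the overall pattern of declaring structure first and then working with rows is common to all of them.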
Definitions

Declarative programming language: A programming language which expresses the logic of a computation without describing its flow control. SQL is an example of a declarative programming language: SQL commands describe the logic of computation in great detail, but SQL does not contain flow control elements (such as IF statements) without extensions.
Imperative programming language: A programming language which focuses on how a program should operate, generally via flow-control statements. C and BASIC are examples of imperative programming languages.
XML: Stands for Extensible Markup Language. XML documents are designed to store data in such a way that it is easily readable both for human beings and computers. XML is a common format for sending organized, structured data across the Internet.
Flat file database: A database in which there is a single table. Microsoft Excel spreadsheets are a popular example of a flat file database.
Relational database management system (RDBMS): A database which contains multiple tables that are related (linked) to one another in various possible ways.
Table: A set of data values represented by rows and columns.
SQL statement: Shorthand for the data manipulation language category of SQL commands. See Types of SQL Commands below for more information.
SQL query: Shorthand for the data query language category of SQL commands. See Types of SQL Commands below for more information.
SQL clause: The building blocks of both SQL statements and queries.

Flat File Databases Versus Relational Databases: A Comparison

Importantly, RDBMS are not always the correct solution. While they allow for a much greater degree of control, optimization in processing speed, and handling of complex relational data, they also require substantially more time and skill to configure and maintain properly. Therefore, for simpler, non-relational data, there is no problem with using a flat file database.

Visualization is a critical component of understanding how relational databases (such as SQL) compare to flat file databases. Consider the following Excel spreadsheet:

Orders
Order_No   Name         City
46321      John Doe     Los Angeles
94812      James Hill   Miami
29831      Maria Lee    Austin
59822      James Hill   Miami

This example represents a classic database need: there is an online business, and that business needs to store data. They naturally want to store the orders that each of their customers have made for later analysis. Thus, there are two kinds of data being represented in a single spreadsheet: the orders that a customer makes and the customer's information. These two types of information are distinct, but they're also inherently related to one another: orders only exist because a specific customer makes them.

However, storing this information as a flat file database is flawed. First and foremost, there is a many-to-one relationship at play: one person can have multiple orders. To represent this in an Excel spreadsheet, you would thus need to have multiple rows dedicated to a single user, with each row differing simply in the order that the customer made.
We can see this in the second and fourth Orders rows: James Hill has his information repeated for two separate orders. This is a highly inefficient method of data storage. In addition, it causes data modification tasks to become more complicated and time-consuming than necessary.

For example, if a customer needed to change their City, the data for every single order that person has made must be modified. This is needlessly repetitive – the customer's information needs to be repeated for every order, even though it should always be identical. Now, instead of the small amount of data above, imagine a spreadsheet with 10,000 orders and 2000 unique customers. Clearly, the amount of extraneous data needed to represent this information is huge – with an average of five orders per person, that person's name and city must be repeatedly entered. For this problem, a flat file database has both excessive file size and excessive processing time.

Next, consider how a RDBMS such as SQL would handle this situation:

Online_Store_DB

Persons
Person_ID   Name         City
1           James Hill   Miami
2           John Doe     Los Angeles
3           Maria Lee    Austin

Orders
Order_ID   Order_No   Person_ID
1          46321      2
2          94812      1
3          29831      3
4          59822      1

This SQL database is named Online_Store_DB, and it contains two tables: Persons and Orders. Persons contains three rows and three columns, and Orders contains four rows and three columns. The first immediately noticeable difference is that SQL splits this dataset into its two distinct tables – a Person and an Order are two different things, after all. However, even if you split an Excel spreadsheet in the same way, SQL does more than merely split the data. It is the relational aspect of SQL that allows for a far more elegant solution.

Looking above, both the Persons and the Orders tables have a column titled Person_ID. Each Person will always get their own Person_ID in the Persons table, so it's always unique. What's more, the Orders table is linked to the Persons table through that column (in SQL parlance, Person_ID in the Orders table is known as a foreign key). Rather than repeat all the information (as the flat file database does), each Order is linked to their associated customer's Person_ID. Thus, by modifying the City in a single row in the Persons table, all Orders associated with that person both (a) don't need to be changed and (b) will always be linked to the most current information.

Notably, even if you attempted to create such a structure in two different Excel spreadsheets, it still wouldn't be able to achieve the same result. Excel doesn't allow for this kind of linking of columns in different spreadsheets. This linking not only speeds up the querying of data substantially; it also adds fail-safe features to ensure that modifications to one table don't inadvertently corrupt data in another table. For example, SQL can (optionally) stop you from deleting a person from the Persons table if they currently have any Orders associated with them – after all, orders shouldn't exist without an associated person. By contrast,
it's easy to imagine two (or more) different Excel spreadsheets having inconsistent data appear over a long period.
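A hedged sketch of how the Online_Store_DB example above could be declared and queried in SQL follows; the exact column types and the placement of the foreign key constraint are illustrative assumptions rather than part of the original example:

  CREATE TABLE Persons (
      Person_ID INT PRIMARY KEY,
      Name      VARCHAR(100),
      City      VARCHAR(100)
  );

  CREATE TABLE Orders (
      Order_ID  INT PRIMARY KEY,
      Order_No  INT,
      Person_ID INT,
      -- The foreign key links each order to exactly one person and lets the
      -- database refuse to delete a person who still has orders on file.
      FOREIGN KEY (Person_ID) REFERENCES Persons (Person_ID)
  );

  -- Updating a city once updates it, in effect, for every related order.
  UPDATE Persons SET City = 'Houston' WHERE Person_ID = 1;

  -- A join reassembles the flat-file view without storing anything twice.
  SELECT o.Order_No, p.Name, p.City
  FROM Orders o
  JOIN Persons p ON p.Person_ID = o.Person_ID;

The join in the last statement is what replaces the repeated Name and City columns of the flat-file spreadsheet: the shared information is stored once in Persons and merely looked up at query time.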
Types of SQL Commands

SQL functions via the use of commands. These commands can be separated into six general categories: DDL (data definition language), DML (data manipulation language), DQL (data query language), DCL (data control language), data administration commands, and transactional control commands. A short combined sketch follows the list below.

• DDL includes all SQL commands to create/modify/delete entire databases or tables, such as CREATE, ALTER, and DROP.
• DML includes all SQL commands to manipulate all data stored inside of tables, such as INSERT, UPDATE, and DELETE, among others.
• DQL includes only a single SQL command, SELECT, which focuses exclusively on retrieving data from within SQL. Rather than creating/modifying/deleting data, DQL focuses simply on grabbing the data for use by the user. SELECT statements can have many possible clauses, such as WHERE, GROUP BY, HAVING, ORDER BY, JOIN, and AS. These clauses modify how the SELECT statement functions.
• DCL includes all SQL commands to control user permissions within SQL. For example, User A may be allowed to view Table A and Table B, while User B should only be allowed to view Table A. Additionally, User B should not be permitted to modify any data he views. DCL commands include ALTER PASSWORD, REVOKE, and GRANT, among others.
• Data administration controls focus on analyzing the performance of other SQL commands – how quickly are they processed, how often are certain SQL queries used, and where are the greatest bottlenecks in performance. AUDIT is a common data administration control command.
• Transactional control commands are for controlling whether certain SQL queries should be executed. For example, perhaps the user needs three consecutive SQL queries to be run, Queries A, B, and C. However, the user also wants to ensure that if any one of the three queries fails to execute properly, the other two queries are immediately reversed and the database is restored to its previous state. Transactional control commands include COMMIT, ROLLBACK, and SAVEPOINT, among others.
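The short sequence below is a sketch, not a canonical form, showing one statement from several of these categories; the table, values, and user name are hypothetical, and the exact GRANT and transaction syntax varies by dialect:

  -- DDL: define a table.
  CREATE TABLE Orders (Order_ID INT PRIMARY KEY, Order_No INT, Person_ID INT);

  -- DML: change the data inside it.
  INSERT INTO Orders (Order_ID, Order_No, Person_ID) VALUES (5, 77120, 3);

  -- DQL: read data back, using several clauses.
  SELECT Person_ID, COUNT(*) AS order_count
  FROM Orders
  WHERE Order_No > 10000
  GROUP BY Person_ID
  ORDER BY order_count DESC;

  -- DCL: control who may read the table.
  GRANT SELECT ON Orders TO user_b;

  -- Transactional control: commit only if every step succeeds.
  START TRANSACTION;
  UPDATE Orders SET Person_ID = 2 WHERE Order_ID = 5;
  DELETE FROM Orders WHERE Order_ID = 4;
  COMMIT;   -- or ROLLBACK; to reverse both statements together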
SQL Extensions

Multiple organizations have created their own unique "extended" versions of SQL with additional capabilities. Some of these versions are public and open-source, whereas others are proprietary. While these extended versions of SQL generally don't remove any of its core capabilities as described above, they instead add features on top. These additional capabilities usually involve flow control.

For example, one popular example of SQL with such an extension is MySQL (see https://dev.mysql.com/doc/refman/5.7/en/ for official documentation). MySQL is open-source and has all the capabilities of SQL noted above. However, it also provides SQL with flow control capabilities like an imperative programming language. The full list of such features is beyond the scope of this section, but some of the most important include (1) stored procedures, (2) triggers, and (3) nested SELECT statements. Notably, these three features are also very common in other extensions of SQL.

Stored procedures are specific collections of SQL commands that can be run through with a single EXECUTE command. Not only can multiple different SQL commands be executed in order, but they also can be run using common flow-control mechanisms found in any imperative programming language. These include IF statements, WHILE loops, RETURN values, and iterating through values obtained via SELECT statements as if they were an array. It also includes the ability
to store values temporarily in variables outside of the permanent database. While much of this logic could theoretically be handled by another programming language that connects to SQL, executing this logic directly in SQL can increase both performance and security since the processing is all done server-side (see https://www.sitepoint.com/stored-procedures-mysql-php/ for more information).
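A minimal sketch of such a stored procedure in MySQL syntax is shown below; the procedure name, table, and threshold logic are hypothetical illustrations, and MySQL invokes procedures with CALL, whereas some other dialects use EXECUTE:

  DELIMITER //

  CREATE PROCEDURE flag_order_volume(IN min_orders INT)
  BEGIN
      DECLARE cnt INT;

      -- Store a value temporarily in a variable outside the permanent tables.
      SELECT COUNT(*) INTO cnt FROM Orders;

      -- Imperative-style flow control running directly on the server.
      IF cnt >= min_orders THEN
          SELECT 'High order volume' AS status, cnt AS total_orders;
      ELSE
          SELECT 'Normal order volume' AS status, cnt AS total_orders;
      END IF;
  END //

  DELIMITER ;

  -- Run the whole collection of statements with a single call.
  CALL flag_order_volume(100);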
Triggers are related to stored procedures in that they are also collections of SQL commands. However, whereas stored procedures are generally activated by a user, triggers are event-based in their activation. For example, a trigger can be set to run every 30 min, or can be set to run every X queries that are executed, or whenever a certain table is modified.
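As a hedged illustration of the last case, the sketch below defines a MySQL trigger that fires whenever a certain table is modified; the audit table and column names are hypothetical, and time- or query-count-based activation as described above would be handled by separate scheduling features rather than by this basic syntax:

  -- Keep a simple audit trail of city changes in the Persons table.
  CREATE TABLE Person_Audit (
      Person_ID  INT,
      Old_City   VARCHAR(100),
      New_City   VARCHAR(100),
      Changed_At TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );

  CREATE TRIGGER log_city_change
  AFTER UPDATE ON Persons
  FOR EACH ROW
  INSERT INTO Person_Audit (Person_ID, Old_City, New_City)
  VALUES (OLD.Person_ID, OLD.City, NEW.City);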
Nested SELECT statements are another feature of MySQL. They allow for SELECT statements to contain other SELECT statements inside of them. This nesting allows for on-the-fly sorting and filtering of complex data without needing to make unnecessary database modifications along the way.
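A small sketch of a nested SELECT follows, reusing the hypothetical Persons and Orders tables from earlier; the inner query filters on the fly without creating any extra tables or modifying the database:

  -- Which customers currently have at least one large order on file?
  SELECT Name, City
  FROM Persons
  WHERE Person_ID IN (
      SELECT Person_ID        -- the inner (nested) SELECT
      FROM Orders
      WHERE Order_No > 10000
  );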
Further Reading

Introductory book: SQL in 10 Minutes, Sams Teach Yourself (4th ed.). ISBN: 978-0672336072.
Online interactive SQL tutorial: https://www.khanacademy.org/computing/computer-programming/sql/sql-basics/v/welcome-to-sql.
Quick-start SQL command cheat sheet: https://www.w3schools.com/sql/sql_intro.asp.

Stylistics

▶ Authorship Analysis and Attribution

Stylometry

▶ Authorship Analysis and Attribution

Supercomputing, Exascale Computing, High Performance Computing

Anamaria Berea
Department of Computational and Data Sciences, George Mason University, Fairfax, VA, USA
Center for Complexity in Business, University of Maryland, College Park, MD, USA

Supercomputing and High Performance Computing (HPC)

Supercomputing and High Performance Computing are synonymous; both terms refer to a computational system whose performance is measured in FLOPS and which requires a complex computing architecture. Exascale computing, on the other hand, is a specific type of supercomputing, with a computational power of a billion billion (10^18) calculations per second. While there are supercomputing and high performance computing systems in existence now in various countries (the USA, China, India, and the EU), exascale computing has not yet been achieved.

Supercomputing has proved to be very useful for large-scale computational models, such as weather and climate change models, nuclear weapons and security simulations, brute-force decryption, molecular dynamics, the Big Bang and the beginning of the Universe, gene interactions, and simulations of the brain.

Supercomputers also represent significant human capital and innovation. The USA offers current opportunities for HPC access to any American company that demonstrates strategies to "make the country more competitive." High Performance Computing will emerge as the ultimate signifier of talent and scientific prestige; at least one study found that universities that invest in supercomputers have a competitive edge in research. Meanwhile, Microsoft has reorganized HPC efforts into a new "big compute" team, denoting a new era of supercomputing.

On the other hand, there are many challenges that come with achieving supercomputing and
HPC, among which are the "end of Moore's Law," parallelization mechanisms, and economic costs. Specifically, current research efforts in supercomputing are not focused on improving clock speeds as in classic Moore's Law, but on improving speeds to the core and on parallelization. In addition, parallelization efforts in supercomputing are not focused on improving the system design, but the architecture design. And lastly, these efforts come at a significant price: currently, the USA has invested more than $120 million into supercomputing and high performance computing research.

Some other important technological challenges for exascale and HPC computing are resilience and scalability; with the physical increase in the systems there also comes a decrease in resilience, and their scalability becomes more difficult to achieve (Shalf et al. 2010).

Supercomputing Projects Around the World

The European Commission estimates that High Performance Computing (HPC) will accelerate the speed of big data analysis toward a future where a variety of scientific, environmental, and social challenges can be addressed, especially on extremely large and small scales (IDC 2014). Tens of thousands of times more powerful than laptop computers, HPC conducted on supercomputers processes information using parallel computing, allowing for many simultaneous computations to occur at once. These integrated machines are measured in "flops," which stands for "floating point operations per second."

As of June 2013, Tianhe-2 (in translation, Milky Way-2), a supercomputer developed by China's National University of Defense Technology, is the world's fastest system, with a performance of 33.86 petaflop/s.

Exascale Computing

During the past 20 years, we have witnessed the move from terascale to petascale computing. For example, Pleiades was NASA's first petascale computer (Vetter 2013).

Current forecasts place HPC at "exascale" capacity by 2020, developing computing capacities 50 times greater than today's most advanced supercomputers. Exascale feasibility depends on the rise of energy-efficient technology: the processing power exists, but the energy to run it, and cool it, does not. Currently, the American supercomputer MIRA, while not the fastest, is the most energy efficient, thanks to circulating water-chilled air around the processors inside the machine rather than merely using fans.

Intel has revealed successful test results on servers submerged in mineral oil liquid coolant. Immersion cooling will impact the design, housing, and storage of servers and motherboards, shifting the paradigm from traditional air cooling to liquid cooling and increasing the energy efficiency of HPC. Assurances that computer equipment can be designed to withstand liquid immersion will be important to the development of this future. Another strategy to keep energy costs down is ultra-low-power mobile chips.

Applications of HPC

Promising applications of HPC to address numerous global challenges exist:

• Nuclear weapons and deterrence: monitoring the health of America's atomic arsenal and performing "uncertainty quantification" calculations to pinpoint the degree of confidence in each prediction of weapons behavior.
• Supersonic noise: a study at Stanford University to better understand impacts of engine noise from supersonic jets.
• Climate change: understanding natural disasters and environmental threats, predicting extreme weather.
• Surveillance: intelligence analysis software to rapidly scan video images. IARPA (Intelligence Advanced Research Projects Activity) has asked the HPC industry to develop a small computer especially designed to address intelligence gathering and analysis.
• Natural Resources: simulating the impact of renewable energy on the grid without disrupting existing utilities, making fossil fuel use more efficient through modeling and simulations of small (molecular) and large-scale power plants.
• Neuroscience: mapping the neural structure of the brain.

The HPC future potentially offers solutions to a wide range of seemingly insurmountable critical challenges, like climate change and natural resource depletion. The simplification of "big problems" using "big data" with the processing power of "big compute" runs the risk of putting major decisions at the mercy of computer science. Meanwhile, some research has suggested that crowdsourcing (human data capture and analysis) may in fact exceed or improve the outcomes of supercomputer tasks.

Also in the USA, the Oak Ridge National Laboratory (ORNL) provides the scientific research community with access to the largest US supercomputer. Some of the projects using the Oak Ridge Leadership Computing Facility's supercomputer are in the fields of sustainable energy solutions, nuclear energy systems, advanced materials, sustainable transportation, climate change, and the atomic-level structure and dynamics of materials.

The Exascale Initiative launched by the Department of Energy, National Nuclear Security Administration (NNSA), and the Office of Science (SC) has a main goal to "target the R&D, product development, integration, and delivery of at least two Exascale computing systems for delivery in 2023."

Supercomputing and Big Data

The link between HPC or exascale computing and big data is obvious and intuitive in theory, but in practice it is more difficult to realize (Reed and Dongarra 2015). Some of these implementation challenges come from the difficulty of creating resilient and scalable data architectures. On the other hand, the emergence of GPUs will likely help solve some of these current challenges (Keckler et al. 2011). GPUs are particularly good at handling big data.

Some supercomputers can cost up to $20 million, and they are made of thousands of processors. Alternatively, clusters of computers can work together as a supercomputer. For example, a small business could have a supercomputer with as few as 4 nodes, or 16 cores. A common cluster size in many businesses is between 16 and 64 nodes, or from 64 to 256 cores.

As the power consumption of supercomputers increases, though, so does the energy consumption necessary to cool and maintain the physical infrastructure of HPCs. This means that energy efficiency will move from desirable to mandatory, and there is currently research underway to understand how green HPC, or energy-efficient computing, can be achieved (Hemmert 2010).

Architectures and Software

The architectures of HPC can be categorized into: (1) commodity-based clusters, with standard HPC software stacks, from Intel or AMD; (2) GPU-accelerated commodity-based clusters (GPUs are specific to the gaming and professional graphics markets); (3) customized architectures, with customization both for the nodes and for their interconnectivity networks (e.g., the K computer and Blue Gene systems); and (4) specialized systems (e.g., for protein folding). These last systems are more robust but less adaptable, while the customized ones are the most adaptable in terms of efficiency, resilience, and energy consumption (Vetter 2013).

HPC systems share many data architectures and servers, but the software is much more integrated and hierarchical. The HPC software stack has system software, development software, system management software, and scientific data management and visualization software. These include operating systems, runtime systems, and file systems. They facilitate programming models, compilers, scientific frameworks and libraries, and performance tools.
Outside of HPC, clouds and grids have increased in popularity (e.g., Amazon EC2). These are specific to data centers and enterprise markets, as well as internal corporate clouds.

Further Reading

Hemmert, S. (2010). Green HPC: From nice to necessity. Computing in Science & Engineering, 12(6), 8–10.
IDC. (2014). High performance computing in the EU: Progress on the implementation of the European HPC strategy. Brussels: European Commission. ISBN: 978-92-79-49475-8. https://doi.org/10.2759/034719.
Keckler, S. W., et al. (2011). GPUs and the future of parallel computing. IEEE Micro, 31(5), 7–17.
Reed, D. A., & Dongarra, J. (2015). Exascale computing and big data. Communications of the ACM, 58(7), 56–68.
Shalf, J., Dosanjh, S., & Morrison, J. (2010). Exascale computing technology challenges. In International conference on high performance computing for computational science. Berlin/Heidelberg: Springer.
Vetter, J. S. (Ed.). (2013). Contemporary high performance computing: From petascale toward exascale. Boca Raton: CRC Press.

Supply Chain and Big Data

Kai Hoberg
Kühne Logistics University, Hamburg, Germany

Introduction

Supply chain management focuses on managing the different end-to-end flows (i.e., material, information, and financial flows) within a particular company and across businesses within the supply chain (Min et al. 2019). As such, it encompasses all activities along the value chain, e.g., from planning, sourcing, and manufacturing to warehousing and transportation, in collaboration with suppliers, customers, and third parties. Many of these activities require different business functions and cross-functional teams for successful decision-making. In recent years, new and existing technologies have been introduced that are dramatically changing the business environment and affecting supply chain management practices (Min et al. 2019) as firms look for opportunities to improve their long-term performance. While supply chain management has always been technology-oriented and data-intensive, the ongoing explosion of big data, and the tools to make use of this data, is opening many avenues to advance decision-making along the supply chain (Alicke et al. 2019). Big data enables companies to use new data sources and analytical techniques to design and run smarter, cheaper, and more flexible supply chains. These benefits can often be observed in areas making use of automated, high-frequency decisions, such as demand forecasting, inventory planning, picking, or routing. However, other supply chain activities also benefit from the different Vs of big data (e.g., volume, variety, or velocity). This improved decision-making, often supported by artificial intelligence (AI) and machine learning (ML) approaches, is frequently referred to as supply chain analytics.

Supply Chain Activities

Supply chain management encompasses many complex decisions along the various supply chain activities. Here, we distinguish between strategic, tactical, and operational decisions. Strategic decisions, such as network design or product design, typically have a long-term impact. As such, they rely on a holistic perspective that requires data as an input along with human judgment. For planning and execution with a mid- to short-term focus, big data offers tremendous possibilities to improve key decisions.

Figure 1 outlines the key decisions along the supply chain for a typical consumer goods product. The complexity of supply chain operations is obvious: raw materials are sourced from hundreds of suppliers for thousands of stock keeping units (SKUs) that are produced in many plants worldwide and are delivered to tens of thousands of stores. The overarching sales and operations planning (S&OP) process is centered around demand, production, and inventory planning and aims to create a consensus, efficient plan that aligns supply and demand.
Supply Chain and Big Data, Fig. 1 Mid- and short-term decisions in key supply chain activities

These activities leverage data to minimize forecast errors, improve production plan stability, and minimize inventory holding and shortage costs. In sourcing, activities are focused on supplier selection decisions, managing the risk of supply chain disruptions and shortages, and continued cost improvement. Among many other purposes, data is used to model product and supplier cost structures, to forecast disruptions from suppliers at different tiers, and to track and manage supplier performance. In manufacturing, activities are centered around detailed production scheduling, quality control, and maintenance. Here, the key objectives typically include improving the overall equipment effectiveness (OEE) of production equipment, reducing maintenance costs, and diagnosing and improving processes. Next, in warehousing, activities are centered around storing, picking, and packing, often including some additional value-added services (VAS). Data is leveraged, e.g., to allocate goods to storage areas, to reduce walking distances, to increase process quality, and to redesign processes. Further, in transportation, mid- to short-term activities are centered around asset management (for trucks, ships, or planes), allocating transportation jobs to assets, loading goods, and defining truck routing based on ETA (estimated time of arrival). Using the available data, transportation times and costs can be reduced, e.g., by optimizing the routing according to customer preferences, road conditions, and traffic. Finally, goods are handled at the point of sale at the retailer (or point of consumption for other industrial settings). Here, activities are centered around shelf-space optimization, inventory control, detecting stockouts, and optimizing pricing. Data is used to maximize revenues, reduce lost sales, and avoid waste. Based on the goods handled in the supply chain, activities could have a very different emphasis, and additional key activities, such as returns or recycling, must be carefully considered.

Data Sources

Until recently, enterprise resource planning (ERP) systems (provided by commercial vendors, such as SAP, Oracle, or Sage) have been the primary source of data for supply chain decision-making. Typically, data from ERP systems is sufficiently structured and available at decent quality levels. ERP data includes, among lots of other
information, master data, historic sales and production data, customer and supplier orders, and cost information. However, many additional data sources can be leveraged to enrich the various supply chain decisions. Data collection plays an essential role, as without an efficient and effective approach for capturing data, it would be impossible to carry out data-based analytics (Zhong et al. 2016). Among the many additional data sources, the following set has particular relevance for supply chain management since the acquired information can be leveraged for many purposes.

Advanced Barcodes and RFID Tags
Classic barcodes have long been applied to uniquely identify products across supply chain partners. With the emergence of advanced barcodes (e.g., 2D data matrix or QR codes), additional information, such as batch information, serial number, or expiry date, can be stored. In contrast to cheaper barcodes, RFID tags can provide similar information without the need for a direct line of sight. The pharmaceutical industry and fashion industry are among the early adopters of these technologies due to the benefits and their needs for increased visibility.

IoT Devices
The introduction of relatively cheap sensors with Internet-of-things (IoT) connectivity has triggered numerous opportunities to obtain data in the supply chain (Ben-Daya et al. 2019). Applications include temperature sensors tracking the performance of reefer containers, gyroscopes to monitor shocks during the transportation of fragile goods, voltmeters to monitor the performance of electric engines and to trigger preventive maintenance, GPS sensors to track the routes of delivery trucks, and consumption sensors to track the use of, e.g., coffee machines (Hoberg and Herdmann 2018).

Cameras and Computer Vision
HD cameras are already frequently applied and enhance the visibility and security in the supply chain. For example, cameras installed in production lines measure the product quality and automatically trigger alerts based on deviations. In retail stores, camera-equipped service robots are used to measure inventory levels on store shelves, or fixed cameras can be used to identify customers waiting for assistance and to notify staff to help (Musalem et al. 2016).

Wearables
Easy-to-use interfaces introduced by wearables (e.g., handheld radio-frequency devices, smart glasses) offer location-based and augmented reality-enabled instructions to workers (Robson et al. 2016). However, wearables also offer the potential to log data as the worker uses them. For example, indoor identifiers can track walking paths within plants, barometers can measure the altitude of a delivery in a high-rise building, and eye trackers can capture information about picking processes in warehouses.

Data Streams and Archives
While most analyses focus on historic demand data to forecast future sales, numerous exogenous factors affect demand, and so incorporating them can increase forecast accuracy. Data sources that can be leveraged include macro-level developments (e.g., interest rates, business climate, construction volumes), prices (e.g., competitor prices, market prices, commodity prices), or customer market developments (e.g., motor vehicle production, industrial production). Demand forecasting models can be customized using data streams and archival data relevant to the specific industry or by incorporating company-specific factors, such as the daily weather (Steinker et al. 2017).

Internet and Social Media
Substantial amounts of information are available online and on social media. For example, firms can obtain insights into consumer sentiments and behaviors, crises, or natural disasters in real time. Social media information from Facebook can improve the accuracy of daily sales forecasts (Cui et al. 2018). In a supply chain context, any timely information allows securing an alternative supply in the case of supply chain disruptions or improving sourcing volumes for high-demand items.
In assessing the suitability of the different data sources, it is important to answer several key questions:

• Is the required data directly available to the relevant decision-maker, or is there a need to obtain data from other functions or external partners (e.g., promotion information from sales functions for manufacturing decisions, point-of-sale data from retailers for consumer goods manufacturers' demand forecasting)?
• Is the data sufficiently structured to support the analysis, or is a lot of pre-processing required (e.g., unstructured customer feedback on delivery performance may require text analytics, data integration from different supply chain partners may be challenging without common identifiers)?
• Is the data sufficiently timely to support the decision-making for the considered activity (e.g., real-time traffic data for routing decisions or weekly information about a supplier's inventory level)?
• Is the data quality sufficient for the purpose of the decision-making (e.g., is the inventory accuracy at a retail store or master data on the supplier lead times sufficient for replenishment decisions)?
• Is the data volume manageable for the decision-maker, or does the amount of data require support by data engineers/scientists (e.g., aggregate monthly demand data for supplier selection decisions vs. location-SKU-level information on hourly demand for pharmacy replenishment)?

Supply Chain Opportunities for Big Data

The benefits of applying big data for supply chain analytics are generally obvious but often very context-specific (Waller and Fawcett 2013). Many opportunities exist along the different key supply chain processes.

Sales and Operations Planning
Planning can particularly benefit from big data to increase demand forecast accuracy (e.g., using archives, real-time data feeds, consumption, and point-of-consumption (POC) inventory data) and end-to-end planning. Leveraging concurrent, end-to-end information, the visibility in planning can extend beyond the currently still predominant intra-firm focus (upon partner approval). As a result, detailed information about a supplier's production volumes, in-transit goods, and estimated times of arrival (ETAs) can reduce costly production changes/express shipments and allow for demand shaping.

Sourcing
In sourcing, big data enables complex cost models to be developed that improve the understanding of cost drivers and optimal sourcing strategies. Further, supply chain risk management can benefit from social media information on supply chain disruptions (e.g., accidents, strikes, bankruptcies) as second-, third-, and fourth-tier suppliers are better mapped. Finally, contract compliance can be improved by analyzing shipping and invoice information in real time.

Manufacturing
One way of using big data in manufacturing is to improve the product quality and to increase yields. Information on tolerances, temperatures, or pressures obtained in real time can enable engineers to continuously optimize production processes. Unique part identifiers allow autonomous live corrections for higher tolerances in later manufacturing steps. IoT sensor data enables condition-based maintenance strategies to reduce breakdowns and production downtimes.

Warehousing
Large amounts of data are already widely used in warehousing to increase operational efficiency for storing, picking, and packing processes. Further advances would be possible due to improved forecasts on individual item pick probabilities (to decide on storage location) or by the better prediction of individual picker preferences (to customize recommendations on routing or packing). As warehouses install more goods-to-man butler systems and picking robots, data is also required
to coordinate machines and to enhance computer vision.

Transportation
To boost long-range transportation and last-mile efficiency, accuracy of detailed future volumes (in air cargo, also the weight) is important. Better data on travel times and for ETA projection (e.g., using weather and traffic data) could further boost asset effectiveness and customer satisfaction. Further, data allows analyzing and managing driver behavior for increased performance and sustainability. Customer-specific information can further improve efficiency, e.g., what are the likely demands that can be bundled, when is the customer likely to be at home, or where to park and how long to walk to the customer's door?

Point of Sale
Bricks-and-mortar stores are increasingly exploring many new data-rich technologies to compete with online competitors. In particular, the potential of data has been recognized to tweak processes and store layouts, to increase customer intimacy for individual advertising, and to advance inventory control and pricing by forecasting individual sales by store. Using IoT sensors and HD cameras, experts can build offline "click-stream" data to track customers throughout the store. Integrated data from online and offline sources can be used to create coupons and customer-specific offers. Finally, accurate real-time stock information can improve inventory control and mark-down pricing.

Conclusion

Big data in supply chain management offers many interesting opportunities to improve decision-making, yet supply chain managers still need to adjust to the new prospects (Alicke et al. 2019; Hazen et al. 2018). However, the ultimate question arises as to if and where the expected value justifies the effort of collecting, cleaning, and analyzing the data. For example, a relatively high increase in forecasting accuracy for a made-to-order product would not provide tangible benefits, whereas any small increase in ETA accuracy in last-mile delivery can improve routing and increase customer satisfaction significantly. More research is evolving to obtain insights about where big data can provide the most value in supply chain management.

References

Alicke, K., Hoberg, K., & Rachor, J. (2019). The supply chain planner of the future. Supply Chain Management Review, 23, 40–47.
Ben-Daya, M., Hassini, E., & Bahroun, Z. (2019). Internet of things and supply chain management: A literature review. International Journal of Production Research, 57(15–16), 4719–4742.
Cui, R., Gallino, S., Moreno, A., & Zhang, D. J. (2018). The operational value of social media information. Production and Operations Management, 27, 1749–1769.
Hazen, B. T., Skipper, J. B., Boone, C. A., & Hill, R. R. (2018). Back in business: Operations research in support of big data analytics for operations and supply chain management. Annals of Operations Research, 270, 201–211.
Hoberg, K., & Herdmann, C. (2018). Get smart (about replenishment). Supply Chain Management Review, 22(1), 12–19.
Min, S., Zacharia, Z. G., & Smith, C. D. (2019). Defining supply chain management: In the past, present, and future. Journal of Business Logistics, 40, 44–55.
Musalem, A., Olivares, M., & Schilkrut, A. (2016). Retail in high definition: Monitoring customer assistance through video analytics (February 10, 2016). Columbia Business School Research Paper No. 15-73. Available at SSRN: https://ssrn.com/abstract=2648334 or https://doi.org/10.2139/ssrn.2648334.
Robson, K., Pitt, L. F., Kietzmann, J., & APC Forum. (2016). Extending business values through wearables. MIS Quarterly Executive, 15(2), 167–177.
Steinker, S., Hoberg, K., & Thonemann, U. W. (2017). The value of weather information for E-commerce operations. Production and Operations Management, 26(10), 1854–1874.
Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34, 77–84.
Zhong, R. Y., Newman, S. T., Huang, G. Q., & Lan, S. (2016). Big data for supply chain management in the service and manufacturing sectors: Challenges, opportunities, and future perspectives. Computers & Industrial Engineering, 101, 572–591.
Surface Web

▶ Surface Web vs Deep Web vs Dark Web

Surface Web vs Deep Web vs Dark Web

Esther Mead1 and Nitin Agarwal2
1Department of Information Science, University of Arkansas Little Rock, Little Rock, AR, USA
2University of Arkansas Little Rock, Little Rock, AR, USA

Synonyms

Dark web; Darknet; Deep web; Indexed web, Indexable web; Invisible web, Hidden web; Lightnet; Surface web; Visible web

Key Points

The World Wide Web (WWW) is a collection of billions of web pages connected by hyperlinks. A minority of these pages are "discoverable" due to their having been "indexed" by search engines, such as Google, rendering them visible to the general public. Traditional search engines only see about 0.03% of the existing web pages (Weimann 2016). This indexed portion is called the "surface web." The remaining unindexed portion of the web pages, which comprises the majority, is called the "deep web," which search engines cannot find, but which can be accessed through direct Uniform Resource Locators (URLs) or IP (Internet Protocol) addresses. The deep web is estimated to be 400–500 times larger than the surface web (Weimann 2016). Common examples of the deep web include email, online banking, and Netflix, which can only be accessed past their main public page when a user creates an account and utilizes login credentials. There are parts of the deep web called the "dark web" that can only be accessed by using special technologies, such as the TOR browser, which access the "overlay networks" within which the dark web pages reside (Weimann 2016; Chertoff 2017). Most of the dark web pages are hosted anonymously and are encrypted. Dark web pages are intentionally hidden due to their tendency to involve content of an illegal nature, such as the purchasing and selling of pornography, drugs, and stolen consumer financial and identity information. An effective way to visualize these three web components is as an iceberg in the ocean: only about 10% of the iceberg is visible above the water's surface, while the remaining 90% lies hidden beneath the water's surface (Chertoff 2017).

Technological Fundamentals

The surface web is comprised of billions of statically linked HTML (Hypertext Markup Language) web pages, which are stored as files in searchable databases on web servers that are accessed over HTTP (Hypertext Transfer Protocol) using a web browser application (e.g., Google Chrome, Firefox, Safari, Internet Explorer, Microsoft Edge). Users can find these surface web pages by using a standard search engine – Google, Bing, Yahoo, etc. – entering keywords or phrases to initiate a search. These web pages can be found because these search engines create a composite index by crawling the surface web, traveling through the static web pages via their hyperlinks. Deep web pages, on the other hand, cannot be found using traditional search engines because they cannot "see" them due to their readability, dynamic nature, or proprietary content; hence the other often-used terms "hidden web" and "invisible web." The term "invisible web" was coined in 1994 by Jill Ellsworth, and the term "deep web" was coined in 2001 by M. K. Bergman (Weimann 2016). The issue of readability has to do with file types; the issue of dynamic nature has to do with consistent content updates; and the issue of proprietary content has to do with the idea of
pages do not have static URL links; consequently, searching the deep web including BrightPlanet
information on the deep web can only be accessed Deep Query Manager (DQM2), Quigo Technolo-
by executing a dynamic query via the query inter- gies’ Intellisionar, and Deep Web Technologies’
face of a database or by logging in to an account Distributed Explorit. Another category of key
with a username and password. The dark web applications for accessing portions of the deep
(or darknet) is a collection of networks and tech- web are academic libraries that require users to
nologies that operate within a protocol layer that create an account and log in with a username and
sits on the conventional internet. The term password; some of these libraries are cost-based if
“darknet” was coined in the 1970s to indicate it you are not affiliated with an educational institu-
was insulated from ARPANET, the network cre- tion. Common social media networks such as
ated by the U.S. Advanced Research Projects Facebook, Twitter, and Snapchat are classified as
Agency that became the basis for the Surface deep web applications because they can only be
Web back in 1967. The dark web pages cannot fully utilized if a user accesses them through their
be found or accessed by traditional means due to respective application program interface and sets
one or more of the following: (1) The network up an account. Popular instant messaging appli-
host is not using the standard router configuration cations such as iChat, WhatsApp, and Facebook
(Border Gateway Protocol); (2) The host server’s Messenger are also part of the deep web, as well as
IP address does not have a standard DNS entry are some file-sharing and storage applications
point because it is not allocated; (3) The host such as DropBox and Google Drive (Chertoff
server has been set up to not respond to pinging 2017). Key applications for accessing the dark
by the Intelligent Contact Manager; and (4) Fast- web include peer-to-peer file sharing programs
fluxing DNS techniques were employed that such as Napster, LimeWire, and BitTorrent, and
enables the host server’s IP address to continually peer-to-peer networking applications such as Tor.
and quickly change (Biddle et al. 2002). The Many of these applications have been forced to
technologies of the dark web ensure that users’ cease operations, but various subsequent applica-
IP addresses are disguised and their identity tions continue to be developed (Biddle et al.
remains anonymous by using a series of com- 2002). Tor is actually a network and a browser
puters to route users’ activity through so that the application that enables users to browse the inter-
traffic cannot be traced. Once users are on the dark net anonymously by either installing the client or
web, they can use directories such as the “Hidden using “a web proxy to access the Tor network that
Wiki” to help find sites by category. Otherwise, to uses the pseudo-top level domain .onion that can
access a particular dark website, a user must know only be resolved through Tor” (Chertoff 2017).
the direct URL and use the same encryption tool The Tor network was first presented in 2002 by
as the site (Weimann 2016). the United States Naval Research Laboratory as
“The Onion Routing project” intended to be a
method for anonymous online communication
Key Applications (Weimann 2016; Gehl 2016; Gadek et al. 2018).
Using Tor allows users to discover hidden online
Key applications for accessing the surface web are anonymous marketplaces and messaging net-
common web browsers such as Google Chrome, works such as Silk Road (forced to cease opera-
Firefox, Safari, Internet Explorer, Microsoft Edge, tions in 2013), I2P, Agora, and Evolution, which
etc. For finding surface web pages, users can use have relatively few restrictions on the types of
standard search engine applications such as Goo- goods and services sellers can offer or that buyers
gle, Bing, Yahoo, etc., to enter in keywords or can solicit. Bitcoin is the common currency used
phrases to initiate a search. Accessing the deep on these marketplaces due to its ability to preserve
web requires database interfaces and queries. payment anonymity (Weimann 2016). Dark web
There have been some commercial products applications also include those used by journal-
directed toward the area of trying to enable ists, activists, and whistleblowing for file-sharing
such as SecureDrop, GlobalLeaks, and Wikileaks (Gadek et al. 2018). There are also social media networks on the dark web such as Galaxy2 that offer anonymous and censor-free alternatives to Twitter, Facebook, YouTube, etc. (Gehl 2016; Gadek et al. 2018). Additionally, there are real-time anonymous chat alternatives such as Tor Messenger, The Hub, OnionChat, and Ricochet.
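As a concrete, added illustration of how client software reaches .onion services, the hedged sketch below routes an HTTP request through a locally running Tor client via its default SOCKS proxy on port 9050. It assumes the requests package is installed with SOCKS support (requests[socks]) and that a Tor daemon is listening locally; the .onion address is a placeholder, not a real service.

```python
import requests

# Tor's client exposes a SOCKS5 proxy on localhost:9050 by default.
# The "socks5h" scheme makes name resolution happen inside Tor, which is
# necessary for .onion names, since they cannot be resolved by normal DNS.
TOR_PROXY = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_over_tor(url: str) -> str:
    """Fetch a URL through the local Tor SOCKS proxy and return the body."""
    response = requests.get(url, proxies=TOR_PROXY, timeout=60)
    response.raise_for_status()
    return response.text

# Placeholder onion address, for illustration only:
# html = fetch_over_tor("http://exampleonionaddress.onion/")
```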
Behavioral Aspects of Users

Users of the surface web tend to be law-abiding, well-intentioned individuals who are simply engaging in their routine daily tasks such as conducting basic internet searches or casually watching YouTube. The surface web is easy to navigate and does not present many technological challenges to most users. Surface web pages tend to be reliably available to users as long as they follow the conduct and use policies that are increasingly common, but they also tend to track their traffic, location, and IP address. Today, many surface web pages are embedded with advertising that relies on these tracking and identifying mechanisms. The majority of surface web users are aware of their acceptance of these conduct and use policies, and tacitly comply so that they can continue to use the sites. Surface web users are also increasingly becoming relatively immune to the advertising practices of many of these surface web pages, choosing to endure the ads so that they can continue on with the use of the site. Users of the deep web tend to be somewhat more technologically savvy in that deep web pages require users to first find and navigate the site, and also create accounts and maintain the use of usernames and passwords to enable them to continue to use specific sites. Additionally, some deep web sites such as academic databases require a user to be knowledgeable about common search techniques such as experimenting with different combinations of keywords, phrases, and date ranges. Users operate on the dark web in order to deliberately remain anonymous and untraceable. User behavior on the dark web is commonly associated with illegal activity such as cybercrime, drugs, weapons, money laundering, restricted chemicals, hardcore pornography, contract killing, and coordination activity of terrorist groups, for example (Weimann 2016; Chertoff 2017). Many dark web users experience a sense of freedom and power knowing that they can operate online anonymously (Gehl 2016; Gadek et al. 2018).

Not all dark web users fall into this category, however; some users, for example, may wish to hide their identities and locations due to fear of violence and political retaliation, while still others simply use it because they believe that internet censorship is an infringement on their rights (Chertoff 2017). Nonetheless, dark web sites predominantly serve as underground marketplaces for the exchange of various illegal (or ethically questionable, such as hacker forums) products and services such as those mentioned above (Gadek et al. 2018).

Topology of the Web

There are two primary attempts at modeling the topology of the hyperlinks between the web pages of the surface web, the Jellyfish and Bow Tie models. The Jellyfish model (Siganos et al. 2006) depicts web pages as nodes that are connected to one another to varying degrees. Those that are strongly connected comprise the dense, main body of the jellyfish, and there is a hyperlink path from any page within the core of the group to every other page within the core; whereas the nodes that are loosely connected and do not have a clear path to every other page constitute the dangling tentacles. The Bow Tie model (Metaxas 2012) also identifies a strongly connected central core of web pages as nodes. This central core is the "knot" in the bow tie diagram, and two other large groups of web pages (the "bows") reside on opposite sides of the knot. One of these "bows" consists of all of the web pages that link to the core but do not have any pages from the core that link back to them. The other "bow" in the bow tie comprises the web pages that the core links to but that do not link back to the core. These groups are called the "in-" and "out-" groups, respectively, referring to the "origination"
and "termination" aspects of their hyperlinks; i.e., the "In-group" web pages originate outside of the strongly connected core and link into it, while the "Out-group" web pages are linked to from the core and terminate there. Similar to the Jellyfish model, the "In" and "Out" groups of the Bow Tie model also each have "tendrils," which are the web pages that have links to and from the web pages within the larger group. The web pages within a tendril do not belong to the larger "In" or "Out" groups but link to or from them for some reason, which means that within each tendril there exist both "origination" and "termination" links. There is a fourth group in the Bow Tie model, which is comprised of all of the web pages that are entirely disconnected from the bow tie, meaning that they are not linked in any way to or from the core. A final group of web pages within the Bow Tie model are called "Tubes," which consist of web pages that are not part of the strongly connected core but link from the "In" bow to the "Out" bow; i.e., these web pages link the web pages within the "In" group to the web pages within the "Out" group.
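The core/In/Out decomposition described above can be computed directly on a hyperlink graph. The following sketch is an added illustration rather than part of either cited model: it uses the networkx library on a toy directed graph, where the largest strongly connected component plays the role of the "knot," pages that can reach it form the "In" bow, and pages reachable from it form the "Out" bow.

```python
import networkx as nx

# Toy hyperlink graph: nodes are pages, directed edges are links.
web = nx.DiGraph([
    ("in1", "core1"), ("in2", "core1"),        # "In" bow links into the core
    ("core1", "core2"), ("core2", "core3"),
    ("core3", "core1"),                        # strongly connected core (the "knot")
    ("core2", "out1"), ("core3", "out2"),      # core links out to the "Out" bow
    ("island1", "island2"),                    # pages disconnected from the bow tie
])

# The "knot" is the largest strongly connected component.
core = max(nx.strongly_connected_components(web), key=len)
some_core_page = next(iter(core))

ancestors = nx.ancestors(web, some_core_page)      # pages that can reach the core
descendants = nx.descendants(web, some_core_page)  # pages reachable from the core

in_bow = ancestors - core
out_bow = descendants - core
disconnected = set(web) - core - in_bow - out_bow

print("core:", core)
print("in:", in_bow, "out:", out_bow, "disconnected:", disconnected)
```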

Socio-technical Implications

Several socio-technical implications can be identified underlying the dark web. The interaction between users and dark web applications can create both good and bad situations. For example, the online anonymity that the dark web offers can help numerous groups of users, such as civilians, military personnel, journalists and their audiences, law enforcement, and whistleblowers and activists. But the same online anonymity can also help users to commit crimes and escape being held accountable for committing those crimes (Chertoff 2017). From a global perspective, policymakers need to continue to work together to understand the deep web and the dark web in order to develop better search methods aimed at rooting out this criminal activity while still maintaining a high level of privacy protection for noncriminal users. There are a few already existing legal frameworks in place, which these hacking and cybercrime tools violate, such as the United States' Computer Fraud and Abuse Act, which addresses interstate and international commerce and bars trafficking in unauthorized computer access and computer espionage. Additionally, the 2001 Convention on Cybercrime of the Council of Europe (known as the Budapest Convention) allows for international law enforcement cooperation on several issues. Yet there still exist wide variations in international approaches to crime and territorial limitations with regard to jurisdiction, and "international consensus remains elusive," with some countries such as Russia and China actively resisting the formation of international norms. At the same time, however, Russia, China, and Austria have passed some of their own strict laws concerning the dark web, such as the forced collection of encryption keys from internet service providers and deanonymizing or blocking Tor and/or arresting users discovered to be hosting a Tor relay on their computer (Chertoff 2017). Furthermore, high levels of government censorship in some countries can actually push users to the dark web to find application alternatives.

Challenges

There are challenges to the surface web, mainly in the forms of continually maintaining the development of standard search engines in order to optimize their effectiveness and deterring web spammers who utilize tricks to attempt to influence page ranking metrics. Another challenge to the surface web is that some people and organizations object to their sites or documents being included in the index and are lobbying for the government to come up with an easy way to maintain a right to be deindexed. The main challenge with regard to the deep web is the searchability of its contents. With vast amounts of documents being invisible to standard internet search engines, users are missing out on useful and perfectly legal information. Deep web search engines have been developed but for the most part remain in the realm of academic and proprietary business use. There are several challenges surrounding the use of the dark web, such as the
lack of endpoint anonymity in peer-to-peer file sharing, meaning that sometimes the host nodes and destination nodes can be identified and may therefore be subjected to legal action (Biddle et al. 2002). This can be most problematic if users use their work or university networks to create host nodes. Workplace and institutional policymakers have the challenge of devising, implementing, and maintaining pertinent safeguards. Policy and law on a global level is also a continual challenge for the dark web. Of particular concern is the challenge of identifying and preventing organizational attempts of terrorist activity on the dark web (Weimann 2016). However, finding and maintaining an appropriate balance between privacy and freedom of expression and crime prevention is a difficult endeavor, especially when it comes to reaching some type of international agreement on the issues (Gehl 2016). Another challenge is deciding on the appropriate level of governmental action with regard to the dark web. For example, some of the aspects of the dark web are considered by many to be beneficial, such as whistleblowing and positive types of hacktivism (such as when hackers executed a coordinated effort to take down a child abuse website) (Chertoff 2017).

Future Directions

Various data and web mining techniques have been used for data collection and analysis to study the dark web. Some projects are ongoing, such as The University of Arizona Dark Web project that focuses on terrorism. The project has been successful in generating an extensive archive of extremist websites, forums, and documents. The project recognizes the need, however, for continual methodology development due to the rapid reactive nature of the terrorist strategists (Weimann 2016). The United States Defense Advanced Research Projects Agency (DARPA) and the National Security Agency (NSA) also actively pursue projects for understanding the dark web and developing related applications (Weimann 2016). Researchers are also experimenting with deep web social media analysis, developing tools to scan the network and analyze the text according to common techniques such as topic modeling, sentiment analysis, influence analysis, user clustering, and graph network analysis (Gadek et al. 2018).
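As an added illustration of the kind of text analysis mentioned above (not the cited study's actual pipeline), the short sketch below applies topic modeling to a handful of invented placeholder posts using scikit-learn's CountVectorizer and LatentDirichletAllocation.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder forum posts standing in for scraped dark web text.
posts = [
    "bitcoin escrow payment marketplace vendor review",
    "vendor shipping review marketplace escrow bitcoin",
    "privacy censorship journalism whistleblower leak",
    "leak censorship privacy anonymous journalism",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"Topic {topic_id}: {', '.join(top_terms)}")
```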
Further Reading

Biddle, P., England, P., Peinado, M., & Willman, B. (2002). The darknet and the future of content protection. In Proceedings from ACM CCS-9 workshop, digital rights management (pp. 155–176). Washington, DC: Springer.
Chertoff, M. (2017). A public policy perspective of the dark web. Journal of Cyber Policy, 2(1), 26–38.
Gadek, G., Brunessaux, S., & Pauchet, A. (2018). Applications of AI techniques to deep web social network analysis. [PDF file]. North Atlantic Treaty Organization Science and Technology Organization: Semantic Scholar. Retrieved from https://pdfs.semanticscholar.org/e6ca/0a09e923da7de315b2c0b146cdf00703e8d4.pdf.
Gehl, R. W. (2016). Power/freedom on the dark web: A digital ethnography of the dark web social network. New Media & Society, 18(7), 1219–1235.
Metaxas, P. (2012). Why is the shape of the web a bowtie? In Proceedings from April 2012 International World Wide Web Conference. Lyon: Wellesley College Digital Scholarship and Archive.
Siganos, G., Tauro, S. L., & Faloutsos, M. (2006). Jellyfish: A conceptual model for the AS internet topology. Journal of Communications and Networks, 8(3), 339–350.
Weimann, G. (2016). Going dark: Terrorism on the dark web. Studies in Conflict & Terrorism, 30(3), 195–206.

Sustainability

Andrea De Montis1 and Sabrina Lai2
1Department of Agricultural Sciences, University of Sassari, Sassari, Italy
2Department of Civil and Environmental Engineering and Architecture, University of Cagliari, Cagliari, Italy

Synonyms

Eco-development; Sustainable development
Definition/Introduction

Sustainability is a flexible and variously defined concept that – irrespective of the exact wording – encompasses the awareness that natural resources are finite, that social and economic development cannot be detached from the environment, and that equity across space and time is required if development is to be carried on in the long term.

The concept, as well as its operative translations, was shaped across the years through several global conferences and meetings, in which state representatives agreed upon policy documents, plans, and goals. Hence, this institutional context must be kept in mind to understand properly the sustainability concept as well as the concerted efforts to integrate it within both regulatory evaluation processes, whereby the environmental effects of plans and projects are appraised, and voluntary schemes aiming at certifying the environmental "friendliness" of processes, products, and organizations.

From an operational standpoint, several attempts have been made at measuring sustainability through quantitative indicators and at finding aggregate, easily communicable indices to measure progress and trends. In this respect, the adoption of big data is key to the assessment of the achievement of sustainability goals by complex societies. Recently, the resilience concept has emerged in sustainability discourses; this is a soft and somewhat unstructured approach, which is currently gaining favor because it is deemed appropriate for dealing with ever-evolving environmental and social conditions.

Origins of the Term: First Definitions and Interpretations

Although an early warning of the environmental impacts of unsustainable agriculture was already present in the 1962 book Silent Spring by Rachel Carson, the origins of the concept can be traced back to the beginning of the 1970s, when the first two notable attempts at overcoming the long-standing view of the planet Earth as an unlimited source of resources at mankind's disposal were made with both the book The Limits to Growth, which prompted the concept of the carrying capacity of the planet, and the United Nations Conference on the Human Environment held in Stockholm, in which the tentative term "eco-development" was coined. Subsequently, an early definition, and possibly the first one, provided in 1980 by the International Union for Conservation of Nature (IUCN) in its World Conservation Strategy stated that sustainable development (SD) "must take account of social and ecological factors, as well as economic ones; of the living and non-living resource base; and of the long term as well as the short-term advantages and disadvantages of alternative actions." Hence, sustainability cannot be detached from – to the contrary, it tends to identify itself with – sustainable development.

The most widely known and cited definition is, however, the one later provided by the Brundtland Commission in 1987, according to which SD is development that "meets the needs of the present without compromising the ability of future generations to meet their own needs," often criticized in ecologists' and environmentalists' circles because it puts the "needs" of human beings, both present and future generations, at the core. Notwithstanding, this broad definition of sustainability was often understood as synonymous with environmental sustainability, primarily concerned with the consumption of renewable resources within their regeneration capacity and of non-renewable resources at a rate slow enough not to prevent future generations from using them as well. Therefore, the two other pillars of sustainability (social and economic) implied in the definition by the IUCN were often left in the background.

Two significant and opposing standpoints about sustainability concern "strong" and "weak" sustainability. While strong sustainability assumes that natural capital cannot be replaced by man-made capital (comprising manufactured goods, technological advancement, and knowledge), weak sustainability assumes that natural capital can be substituted for man-made capital provided that the total stock of capital is maintained for future generations.
Sustainability and Institutions

The evolution of the concept of SD and its operative translations in the public domain are intertwined with the organization and follow-up of a number of mega-conferences, also known as world summits. A synopsis of these events, mostly organized by the United Nations, is reported in Table 1.

The first two conferences were held before – and heralded – the definition of SD, as they stressed the opportunity to limit human development when it affects negatively the environment. Rio +20 was the latest mega-conference on SD and addressed the discussion of strategic SDGs, whose achievement should be properly encouraged and monitored.

Sustainability, Table 1 Synopsis of the major conferences on sustainable development

| Place, date, website | Name, acronym (short name) | Main products | Key issues |
| --- | --- | --- | --- |
| Stockholm, 5–16 June 1972, https://sustainabledevelopment.un.org/ | United Nations Conference on the Human Environment, UNCHE | Declaration and action plan | Environmental consequences of human actions, environmental quality, improvement of the human environment for present and future generations, responsibility of the international community, safeguard of natural, especially renewable, resources |
| Nairobi, 10–18 May 1982 | United Nations Environment Program, UNEP (Stockholm +10) | Declaration | Follow-up renovated recalls of the issues stressed in the UNCHE, focus on the unsatisfactory implementation of the UNCHE action plan |
| Rio de Janeiro, 3–14 June 1992, http://www.un.org/geninfo/bp/enviro.html | United Nations Conference on Environment and Development, UNCED ("Earth Summit") | Rio declaration and action plan (Agenda 21) | Production of toxic substances, scarcity of water, and alternative energy sources; public transport systems; comprehensive action plan for the implementation of SD, monitoring role of the Commission of Sustainable Development (CSD) |
| New York, 23–28 June 1997 | United Nations General Assembly Special Session, UNGASS ("Earth Summit II") | Program for the further implementation of Agenda 21 | Review of progress since the UNCED |
| Johannesburg, 26 August–6 September 2002, www.earthsummit2002.org/ | World Summit on Sustainable Development, WSSD (Rio +10) | Declaration on SD ("political declaration"); civil society declaration | Sustainable development agreements in four specific areas: freshwater, sustainable energy, food security, and health |
| Rio de Janeiro, 20–22 June 2012, https://sustainabledevelopment.un.org/rio20 | United Nations Conference on Sustainable Development, UNCSD (Rio +20) | "The future we want" report | Green economy and the institutional framework for SD; seven critical issues: jobs, energy, cities, food, water, oceans, and disasters; Sustainable Development Goals (SDGs) |

Sustainability Measures, Big Data, and Assessment Frameworks

Agenda 21, Chapter 40, urged international governmental and nongovernmental bodies to conceive, design, and operationalize SD indicators and to harmonize them at the national, regional, and global levels. Similarly, the milestone document "The Future We Want" has recently
emphasized "the importance of time-bound and specific targets and indicators when assessing progress toward the achievement of SDGs [. . .]". According to Kwatra et al. (2016), several measures have been developed: over 500 indicators have been proposed by various governmental and nongovernmental organizations. Nearly 70 have been applied at the global level, over 100 at the national level, more than 70 at the subnational level, and about 300 at the local or metropolitan level. SD indicators should be designed as effective instruments able to support policy makers by communicating the performance of administrations at all levels in a timely and precise manner. They should be relevant and describe phenomena communities need to know about, be easy to understand even by nonexperts, be reliable and correspond to factual and up-to-date situations, and be accessible and available when there is still time to react. Indicators often feed into broader frameworks, which are adopted to ascertain whether ongoing processes correctly lead to SD.
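A minimal sketch of how such indicators are often rolled up into an aggregate measure is given below. It is an added illustration, and the indicator names, values, and weights are invented: each indicator is min–max normalized across the units being compared and then combined as a weighted average, which is the basic recipe behind many composite sustainability indices.

```python
# Hypothetical indicator values for three regions (higher raw value = better).
indicators = {
    "renewable_energy_share": {"A": 20.0, "B": 45.0, "C": 60.0},
    "air_quality_index":      {"A": 55.0, "B": 70.0, "C": 40.0},
    "employment_rate":        {"A": 68.0, "B": 72.0, "C": 64.0},
}
weights = {"renewable_energy_share": 0.4, "air_quality_index": 0.3, "employment_rate": 0.3}

def min_max(values):
    """Rescale a dict of values to the [0, 1] range."""
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

normalized = {name: min_max(vals) for name, vals in indicators.items()}

composite = {
    region: sum(weights[name] * normalized[name][region] for name in indicators)
    for region in ["A", "B", "C"]
}
print(composite)  # roughly {'A': 0.30, 'B': 0.85, 'C': 0.40}
```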
Following Rio +20, in 2015 the "2030 Agenda for Sustainable Development" was adopted through a resolution of the United Nations. The Agenda comprises a set of 17 Sustainable Development Goals (SDGs) together with 169 targets that have been envisaged to monitor the progress towards sustainability. Maarof (2015) argues that big data provide an unprecedented "opportunity to support the achievements of the SDGs"; the potential of big data to control, report, and monitor trends has been emphasized by various scholars, while Gijzen (2013) highlights three ways big data can help secure the sustainable future implied by the Agenda: first, they allow for modeling and testing different scenarios for sustainable conversion and enhancement of production processes; second, big data gathering, analysis, and modeling can help better understand current major environmental challenges such as climate change or biodiversity loss; third, global coordination initiatives of big datasets developed by States or research centers (which also implies institutional coordination between big data collectors and analysts) would enable tracking each goal's and target's trend, hence the global progress towards sustainability. Because of its openness and transparency, this continuous monitoring process enabled by big data would also entail improvements in accountability and in people empowerment (Maarof 2015). Big data analytics and context aware computing are expected to contribute to the project and development of innovative solutions integrating the Internet of Things (IoT) into smarter cities and territories (Bibri 2018). In this perspective, several attempts have been made to apply big data techniques to many other IoT domains, including healthcare, energy, transportation, building automation, agriculture, industry, and military (Ge et al. 2018). Some key issues need, however, to be considered, such as disparities in data availability between developed and developing countries (UN 2018), gender inequalities implying under- or over-representations, and the need for public-private partnerships in data production and collection.

Sustainability assessment is a prominent concept that groups any ex-ante process that aims to direct decision-making to sustainability. Sustainability assessment encompasses two dimensions: sustainability discourse and representation, and decision-making context. Discourse is based on a pragmatic integration of development and environmental goals through constraints on human activities, which leads to representing sustainability in the form of disaggregated Triple Bottom Line (TBL) and composite variables. As for the second dimension, decisional contexts are disentangled in three areas: assessment (policies, plans, programs, projects, the whole Planet, and persistent issues), decision question (threshold and choice), and responsible party (regulators, proponents, and third parties). In an institutional context, sustainability assessment constitutes the main theoretical and practical reference for several mandatory and voluntary processes of impact assessment. Mandatory processes consist of procedures regulated and imposed by specific laws and concerning the evaluation of the impacts over the environment caused by human activities. They include the North-American Environmental Impact Statement (EIS) and the European Environmental Impact Assessment (EIA) and Strategic
Environmental Assessment (SEA). EIS and EIA were introduced, respectively, in the USA in 1969 by the National Environmental Policy Act and in Europe in 1985 by Directive 85/337/EEC, whereas SEA was introduced in Europe in 2001 by Directive 2001/42/EC. EIS consists of a public procedure managed to clarify whether given plans or projects exert impacts over the environment and, if so, to propose proper mitigation strategies. As European counterparts, EIA and SEA have been introduced to assess and prevent environmental impacts generated by human activities connected, respectively, to certain projects of linear or isolated infrastructure or buildings and to the implementation of given plans and programs. Voluntary processes are spontaneous procedures originally carried out by private enterprises (and recently also by public bodies) to certify that products and processes comply with certain regulations regarding the quality of the Environmental Management System (EMS). Regulations include the Eco-Management and Audit Scheme (EMAS) elaborated by the European Commission and the ISO 14000 family of standards set by the International Standard Organization Technical Committee (ISO/TC 207). These processes imply relevant changes and continuous virtuous cycles and lead to the enhancement of credibility, transparency, and reputation; environmental risk and opportunity management; environmental and financial performance; and employee empowerment and motivation. Techniques to appraise the sustainability of products and processes prominently include Life Cycle Assessment (LCA). As standardized by the ISO 14040 and 14044 norms, LCA is a methodology for measuring and disentangling environmental impacts associated with the life cycle of products, services, and processes. Historically, LCA was applied to evaluate the environmental friendliness of functional units, such as enterprises. Territorial LCA constitutes a new approach to the study of a territory, whereby the reference flow is the relationship between a territory and a studied land planning scenario. Territorial LCA yields two outputs: the environmental impacts and the impacts associated with human activities in the territory.

Conclusions: Future of Sustainable Decision Making with Big Data

The operationalization of the concept of sustainability has so far heavily relied on attempts to identify proper indicators and measure them, possibly within the framework of dashboards (i.e., user-friendly software packages) and composite indices. Such indices are conceived to synthesize complex and multidimensional phenomena within an aggregate measure and include the Environmental Sustainability Index for the years 1999–2005 and the subsequent Environmental Performance Index, from 2006 onwards, both developed by Yale University and Columbia University, and the Human Development Index maintained by the United Nations. Other endeavors to quantify the (un)sustainability of current development include communicative concepts aiming at raising awareness of the consequences of consumption choices and lifestyles, such as the Earth Overshoot Day, the Ecological Footprint, or the Carbon Footprint.

A different and complementary approach to such quantitative tools for measuring progress towards, and divergence from, sustainability has taken place in the last years with the emergence of the resilience concept. As with "sustainability," different definitions of "resilience" coexist. In the engineering domain, resilience is grounded on the single-equilibrium model and focuses on the pace at which a system returns to an equilibrium state after a disturbance. To the contrary, in the ecology domain, where it is assumed that multiple equilibrium states can exist, resilience focuses on the amount of disturbance that a system can tolerate before shifting from one stability state to another, while reorganizing itself so as to maintain its functions in a changing environment. Similarly, when the resilience concept is applied to social-ecological systems, it carries the idea that such systems can endure disturbance by adapting themselves to a different environment through learning and self-organization, hence tending towards a new desirable state. Therefore, the resilience concept is often used in the context of mitigation and adaptation to climate change, and building
resilient communities and societies is increasingly becoming an imperative reference in sustainability discourses. These concepts call for the design of complex monitoring systems able to manage big data in real time by pruning and visualizing their trends and rationales immediately. The achievement of the SDGs for 2030 will constitute the political frontier of the extensive implementation of the IoT framework, including sensors that capture real-time, continuous data, hardware devices for storage, software for processing big data and elaborating the analytics, and, ultimately, decision making tools able to select and, eventually, implement the necessary (re-)actions.

Cross-References

▶ Environment
▶ Internet of Things (IoT)
▶ Social Sciences
▶ United Nations Educational, Scientific and Cultural Organization (UNESCO)

Further Reading

Bibri, S. E. (2018). The IoT for smart sustainable cities of the future: An analytical framework for sensor-based big data applications for environmental sustainability. Sustainable Cities and Society, 38, 230–253.
Ge, M., Bangui, H., & Buhnova, B. (2018). Big data for internet of things: A survey. Future Generation Computer Systems, 87, 601–614.
Gijzen, H. (2013). Big data for a sustainable future. Nature, 502, 38.
Kwatra, S., Kumar, A., Sharma, P., Sharma, S., & Singhal, S. (2016). Benchmarking sustainability using indicators: An Indian case study. Ecological Indicators, 61, 928–940.
Maarof, A. (2015). Big data and the 2030 agenda for sustainable development. Draft report. https://www.unescap.org/sites/default/files/Final%20Draft_%20stock-taking%20report_For%20Comment_301115.pdf.
UN. (2018). Big data for sustainable development. http://www.un.org/en/sections/issues-depth/big-data-sustainable-development/index.html. Accessed 25 July 2018.

Sustainable Development

▶ Sustainability

Systemology

▶ Systems Science

Systems Science

Carolynne Hultquist
Geoinformatics and Earth Observation Laboratory, Department of Geography and Institute for CyberScience, The Pennsylvania State University, University Park, PA, USA

Synonyms

Systemology; Systems theory

Definition

Systems science is a broad interdisciplinary field that developed as an area of study in many disciplines that have a natural inclination to systems thinking. Instead of scientific reductionism, which seeks to reduce things to their parts, systems thinking focuses on relating the parts as a holistic paradigm that considers interactions within the system and dynamic behavior. A system is not just a random collection of parts. Even Descartes, who argued for breaking complex problems into manageable parts, warned to be careful how one breaks things apart. A system is connected by interactions and behaviors that provide meaning to the configuration that forms a system (Checkland 1981). Big data has volume and complexity that require processing to characterize and recognize patterns both within and between systems. Systems science has a long history of theoretical development in areas that deal with big data problems and is applied in a diversity of fields that employ systems thinking.

Introduction

Much of modern scientific thought focuses on reductionism, which seeks to reduce things to
their parts. However, how parts fit together matters. Descartes even warned to be careful of how things are broken apart, as a system is not just a random collection of parts. Aristotle argued that the whole is greater than the sum of its parts and supported viewing a system as a whole (Cordon 2013). Systems thinking focuses on relating the parts in a holistic paradigm to consider interactions within the system and represent dynamic behavior from emerging properties. The systems science view focuses on the connection by interactions and behaviors that provide meaning to the configuration that forms a system (Checkland 1981). These theories could be applicable to characterizing complex big data of large scales over time with nonstandard sampling methods and data quality issues. The configuration of representation matters when attempting to recognize patterns, and big data analysis could benefit from a hierarchical systems modeling approach.

It is frequently asked in the data science field if meaningful general patterns of reality can be found in the data. Big datasets are often not a product of intentional sampling and are sometimes thought of as truly capturing the entire system or population. Some argue that big data is a well-defined system that enables new analysis by not relying on statistical sampling but looking at the "whole picture," as it is assumed to all be there due to the size of the dataset. However, regardless of how much data is available, data is simply a representation of reality, and we should not forget that portions of a system may inherently be excluded from collection. In addition, big data analysis presents problems of overfitting when more parameters are set than is necessary and when meaningful variations of the system are erroneously specified as noise. This presents issues for future modeling, as systematic trends may not be captured and reliably predicted. Systems science approaches can help to identify the system from the noise in order to improve modeling performance.

Systems science approaches could also encourage a critical awareness of the implications of data structure on analysis. Typically, data is fit into structured databases, which is an imposition of a structure on the data that might cause data loss and not incorporate useful content on uncertainty. On the other hand, denormalization is an approach that provides scalability and keeps data integrity by not reducing the data into smaller parts or making it necessary to do large-scale joins. There are yet many questions about the structure of datasets and the form of analysis we utilize in data science fields, as our understanding of patterns can be led astray in ill-defined systems. Theoretical assumptions on how we collect, represent, structure, and analyze the data should be critically considered in a systematic manner before determining what meaningful sense can be made of observed patterns.

Theoretical Frameworks

Systems theory is often used to refer to general systems theory, which is the concept of applying systems thinking to all fields. Early research on systems science was driven by general systems research from the 1950s, which endeavored to find a unified systems explanation of all scientific fields. It is an approach that was adapted over time to meet new application requirements, and the theories can be applied to understanding systems without the construction of a hypothesis-driven, methodologically based analysis.

Systems science can be used to study systems from the simple to the complex. Complex systems or complexity science is a sister field that is specifically engaged with the philosophical and computational modeling challenges associated with complex behaviors and properties in systems (Mitchell 2009). As a framework, complexity builds off a holistic approach that argues it is impossible to predict emergent properties from only the initial parts, as reduction does not give insight into the interactions of the system (Miller and Page 2007). Often, complex systems research prioritizes connections that can be represented using models such as agent-based models (ABM) or networks.

ABM and networks can grow in complexity by scaling up the model to more agents or adding in new ties, which makes the processing more computationally intensive. In addition, there can be difficulties in parallelizing computation when a process cannot be split between multiple cores. Larger networks and more interactions can make
it difficult to identify specific patterns. Modeling agents or networks as a system, be it a social or physical model, can provide an environment to test the perceived patterns directly and perhaps allow for a comparison of the resulting model to raw data.
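As a minimal, added illustration of the agent-based and network models mentioned above (not drawn from the cited works), the sketch below simulates simple adoption spreading over a small random network: each agent adopts a behavior, with some probability, once at least half of its neighbors have adopted it.

```python
import random
import networkx as nx

random.seed(1)
network = nx.erdos_renyi_graph(n=50, p=0.1, seed=1)

# Agent state: True = has adopted the behavior; seed a few initial adopters.
adopted = {node: False for node in network}
for seed_node in random.sample(list(network), 5):
    adopted[seed_node] = True

for step in range(10):
    for node in network:
        neighbors = list(network.neighbors(node))
        if not neighbors or adopted[node]:
            continue
        share = sum(adopted[n] for n in neighbors) / len(neighbors)
        # Simple threshold rule with a stochastic component.
        if share >= 0.5 and random.random() < 0.8:
            adopted[node] = True
    print("step", step, "adopters:", sum(adopted.values()))
```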
Systems modeling of big data could bring analysis beyond general correlation to causation. Standard big data analysis techniques lack explanatory power, as they do not typically produce a hierarchical structure that leads to unification in order to make an argument from a general law. Instead, analysis focuses on causal models without an understanding of the system in which it occurred. Systems theory can provide a theoretical basis for creating systematic structures of modelled causal relationships that build on other conditions and make an argument for generalized rules. In addition to building on complexity, chaos theory could find applications through big data analysis of chaotic systems that have so far only been systematically modeled. This theoretical branch of mathematics is applied as theory to many fields and is based on the concept that small changes of initial conditions in deterministic systems can have a significant impact on behavior in a dynamical system.
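The sensitivity to initial conditions referred to here can be shown with the classic logistic map, a standard textbook example added for illustration: two trajectories that start almost identically diverge after a few dozen iterations when the system is in its chaotic regime.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x_{t+1} = r * x_t * (1 - x_t)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)  # nearly identical starting point

for t in (0, 10, 20, 30, 40, 50):
    print(t, round(a[t], 6), round(b[t], 6), "difference:", round(abs(a[t] - b[t]), 6))
```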
Key Applied Fields

The concepts of systems science are interdisciplinary, and as a result the field has been developed and applied in numerous areas. For example, Earth Systems Science is an interdisciplinary field that allows for a holistic consideration of the dynamic interactions of processes on Earth. Some applied fields, such as systems engineering, information systems, social systems, evolutionary economics, and network theory, gain significant attention through breakthroughs in these niche fields. The system and complexity sciences have long-standing traditions that could guide data scientists grappling with theoretical and applied problems.

Conclusion

Systems science transcends disciplinary boundaries and is argued to have unlimited scope (Warfield 2006). It can be considered a non-disciplinary paradigm for developing knowledge and insights in any discipline. It can also be viewed in relation to systems design in order to think critically about how parts seem to fit together. Essentially, systems science challenges reductionist thinking—whether theoretical or applied—by considering the dynamic interactions among elements in the system.

Systems science has a long history of developing theoretical frameworks. However, like big data, systems science has become a buzzword, which has led those long in the field to question if new work in applied fields is grounded in theory. Basically, does the theory inform the practice? And if so, do the research outputs advance an understanding of either the applied system itself or systems thinking? The point is, systems science becoming a popular term does not necessarily advance the development of systems theories. Data scientists could provide benefit to the field by building off of the theoretical basis that informs the applied work.

Further Reading

Checkland, P. (1981). Systems thinking. In Systems practice. Chichester, UK: Wiley.
Cordon, C. P. (2013). System theories: An overview of various system theories and its application in healthcare. American Journal of Systems Science, 2(1), 13–22.
Miller, J. H., & Page, S. E. (2007). Complex adaptive systems: An introduction to computational models of social life. Princeton: Princeton University Press.
Mitchell, M. (2009). Complexity: A guided tour. New York: Oxford University Press.
Warfield, J. N. (2006). An introduction to systems science. Hackensack: World Scientific.

Systems Theory

▶ Systems Science
T

Tableau Software

Andreas Veglis
School of Journalism and Mass Communication, Aristotle University of Thessaloniki, Thessaloniki, Greece

Introduction

Tableau Software is a computer software company situated in Seattle (USA) that produces a series of interactive data visualization products (http://tableau.com). The company offers a variety of products that query relational databases, cubes, cloud databases, and spreadsheets and generate a variety of graph types. These graphs can be combined into dashboards and shared over the internet. The products utilize a database visualization language called VizQL (Visual Query Language), which is a combination of a structured query language for databases with a descriptive language for rendering graphics. VizQL is the core of the Polaris system, which is an interface for exploring large multidimensional databases. Special attention is given to the support of big data sets, since in recent years the demand for big data visualization has increased significantly. Tableau's users can work with big data without having advanced knowledge of query languages.

History

The company was founded by Chris Stolte, Christian Chabot, and Pat Hanrahan in Mountain View, California. The initial aim of the company was to commercialize research conducted by two of the founders at Stanford University's Department of Computer Science. The research included visualization techniques for exploring and analyzing relational databases and data cubes. Shortly after, the company was moved to its present location, Seattle, Washington. It is worth noting that Tableau Software was one of the first companies to withdraw support from WikiLeaks after it started publishing US embassy cables.

Tableau and Big Data

Tableau supports more than 75 native data connectors and also a significant number of others via its extensibility options. Some examples of the supported connectors include SQL-based connections, NoSQL interfaces, open database connectivity (ODBC), and Web data connectors. In order for its customers to have fast interaction with their data, Tableau has developed a number of technologies, namely the Hyper data engine, hybrid data architecture, and VizQL. Thus, Tableau offers real-time interaction with the data.
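A brief sketch of programmatic access is given below. It is an added example, not part of the original entry: it assumes the separately distributed tableauserverclient Python package and a reachable Tableau Server or Tableau Online site, and the server address and credentials are placeholders.

```python
import tableauserverclient as TSC

# Placeholder credentials and server address.
auth = TSC.TableauAuth("USERNAME", "PASSWORD", site_id="SITE_NAME")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    # List published data sources, e.g., to review which connections
    # (SQL, NoSQL, ODBC, web data connectors) are in use on the site.
    for datasource in TSC.Pager(server.datasources):
        print(datasource.name, datasource.datasource_type)
```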

Products

Tableau's state-of-the-art data visualization is considered to be among the best of the business intelligence (BI) suites. BI can be defined as the transformation of raw data into meaningful and useful information for business analysis purposes. Tableau offers clean and elegant dashboards. The software utilizes drag-and-drop authoring in analysis and design processes. Tableau usually deals with at least two special dimensions, namely time and location. The use of special dimensions is quite interesting since some dimensions should be treated differently for more efficient analysis. For example, in maps and spatial analysis we usually employ location dimensions. Also, in the case of time dimensions, they should be treated differently since in almost all cases information is relevant only in a specific time context. Maps are considered to be one of the strongest features of Tableau products. Maps are usually quite difficult for BI developers to build since most BI platforms do not offer strong support for maps. But this is not the case with Tableau, since it incorporates excellent mapping functionalities. The latter is supported by regular updates of the offered maps and complementary information (for example, income, population, and other statistical data) licensed from third parties. It is worth noting that Tableau can import, manipulate, and visualize data from various big data sources, thus following the current trend of working with big data.

As of June 2020, Tableau Software offers seven main products: Tableau Desktop, Tableau Server, Tableau Online, Tableau Prep Builder, Tableau Mobile, Tableau Public, and Tableau Reader.

Tableau Desktop: It is a business intelligence tool that allows the user to easily visualize, analyze, and share large amounts of data. It supports importing data or connecting with various types of databases for automatic updates. While importing data, the software also attempts to identify and categorize it. For instance, Tableau recognizes country names and automatically adds information on the latitude and longitude of each country. This means that, without any extra data entry, the user is able to use the program's mapping function to create a map of a specific parameter by country. The software also supports working with a number of different data sets. After the data has been imported by Tableau Desktop, the user can illustrate it with graphs, diagrams, and maps. The product is available for both Windows and MacOS platforms.

Tableau Server: Tableau Server is an enterprise-class business analytics platform that can scale up to hundreds of thousands of users. It supports distributed mobile and browser-based users, and Tableau Desktop clients, interacting with Tableau workbooks published to the server from Tableau Desktop. The platform comprises four main components, namely the application server, the VizQL Server, the data server, and the backgrounder. The product can be installed on Windows and Linux servers, or it can be hosted in Tableau's data centers. Tableau Server is accessible through annual subscriptions.

Tableau Online: It is a hosted version of the company's data visualization product, running on its own multitenant cloud infrastructure. Customers can upload data and create, maintain, and collaborate on their visualizations in the cloud (on Tableau's servers). Tableau Online is usable from a Web browser but is also compatible with Tableau Desktop. Tableau Online can connect to cloud-based data sources (for example, ▶ Salesforce, Google BigQuery, and Amazon Redshift). Customers are also able to connect their own on-premise data.

Tableau Prep Builder: Introduced in 2018, this tool is used for the preparation of the data that will be analyzed with the help of another product of the Tableau ecosystem. It supports extraction, combination, and cleaning of data. As expected, Tableau Prep Builder works seamlessly with the other Tableau products.

Tableau Mobile: Tableau Mobile is a free mobile app (for iPhones, iPads, and Android phones and tablets) that allows users to access their data from mobile devices. Users can select, filter, and drill down into their data and generally interact with the data using controls that are automatically optimized for devices with touch
screens. Tableau Mobile can connect securely to Tableau Online and Tableau Server. It is worth noting that Tableau also offers Tableau Vizable, a free mobile app available only for iPads, through which a user can access and explore data.

Tableau Public: Tableau Public is a freely available tool for any user who wants to create interactive data stories on the web. It is delivered as a service so it can be up and running immediately. Users are able to connect to data, create interactive data visualizations, and publish them directly to their website. Also, they are able to guide readers through a narrative of data insights and allow them to interact with the data to make new discoveries. All visualizations are stored on the web and are visible to everyone. The product is available for both Windows and MacOS platforms.

Tableau Reader: Tableau Reader is a free desktop application that allows users to open, view, and interact with visualizations built in Tableau Desktop. It supports actions such as filtering, drilling down, and viewing details of the data as far as the author allows. Users are not able to edit or perform any interactions that the author has not built in. Tableau Reader is a simple way to share analytical insights.

Licenses

As of June 2020, Tableau offers annual subscriptions for individuals that include Tableau Desktop, Tableau Prep Builder, and one license for Tableau Server or Tableau Online. For teams and organizations there are similar subscriptions per user. Except for the free products, all other products are available for downloading on a trial basis for 14 days. Also, Tableau Software offers free access to university students and instructors who want to utilize Tableau products in their courses.

It is worth mentioning that Tableau Software provides a variety of start-up guides (http://www.tableau.com/support/) and training options to help customers get the most out of their data, and also instructors who want to teach interactive visualizations with Tableau products.

Competitors

Today there are many companies that offer tools for creating interactive visualizations that can be considered competitors of Tableau Software. But its direct competitors are other BI platforms like Microsoft BI, SAP Business Objects, QlikView, IBM Cognos, Oracle Analytics Server, Sisense, Dundas BI, MicroStrategy, Domo, and Birst.

Conclusions

Tableau's products are considered to be well designed and suited for nontechnical users. They are powerful, easy to use, highly visual, and aesthetically pleasant. By utilizing the free editions, users can create complex interactive visualizations that can be employed for exploring complex data sets.

Cross-References

▶ Business Intelligence
▶ Interactive Data Visualization
▶ Visualization

Further Reading

Lorth, A. (2019). Visual analytics with Tableau. Hoboken: Wiley.
Milligan, J. (2019). Learning Tableau 2019: Tools for business intelligence, data prep, and visual analytics (3rd ed.). Birmingham: Packt Publishing.
Sleeper, R. (2018). Practical Tableau. Newton: O'Reilly.
Stirrup, J. (2016). Tableau dashboard cookbook. Birmingham: Packt Publishing.

Taxonomy

▶ Ontologies
Technological Singularity

Laurie A. Schintler and Connie L. McNeely
George Mason University, Fairfax, VA, USA

Technology is progressing at a rapid and unprecedented pace. Over the last half-century – and in the context of the "digital revolution" – global data and information storage, transmission, and processing capacities have ballooned, expanding super-exponentially rather than linearly. Moreover, profound advancements and breakthroughs in artificial intelligence (AI), genetics, nanotechnology, robotics, and other technologies and fields are being made continuously, in some cases on a day-to-day basis. Considering these trends and transformations, some scholars, analysts, and futurists envision the possibility of a technological singularity – i.e., a situation in which technological growth becomes unsustainable, resulting in a gradual or punctuated transition beyond even combined human capabilities alone (Vinge 1993). A technological singularity would be "an event or phase beyond which human civilization, and perhaps even human nature itself, would be radically changed" (Eden et al. 2012).

Theories about the technological singularity generally assume that accelerating scientific, industrial, social, and cultural change will give rise to a technology that is so smart that it can self-learn and self-improve. It would become ever-more intelligent with each iteration, possibly enabling an "intelligence explosion" and, ultimately, the rise of super-intelligence (Sandberg 2013). A "greater-than-intelligent" entity could be a single machine or a network of devices and humans with a collective intellect that far exceeds that of human beings (i.e., a social machine). Alternatively, it might arise from human intelligence amplification enabled by technologies such as computer/machine interfaces or whole brain emulation, or even profound advancements in biological science. At a technological singularity, there is the possibility of a "prediction horizon," where projections become meaningless or impossible, where "we can no longer say anything useful about the future" (Vinge 1993). Accordingly, regardless of how it manifests, a super-intelligent entity may be "the last invention that man need ever make" (Good 1966).

Big data has a big hand to play in the path to a technological singularity (Dresp-Langley et al. 2019). Indeed, the integration of big data, machine learning, and AI is increasingly described in terms of a data singularity (Arora 2018), arguably making for a disruptive intersection leading to the technological singularity. Therefore, new and expanding sources of structured and unstructured data have the potential to catalyze the transition to an intelligence explosion. For example, massive troves of streaming data produced by sensors, satellites, mobile devices, observatories, imaging, social media, and other sources help drive and shape AI innovations. The rapidly accelerating volume of big data also creates continual pressures to develop better and expanded computational and storage capabilities. Moreover, acceleration itself, referring to increasing rates of growth, is common across conceptions of the technological singularity (Eden et al. 2012). In recent years, related demands have led to an array of technological breakthroughs, e.g., quantum technology, amplifying in magnitudes of order the ability to acquire, process, and disseminate data and information. This situation has meant a better understanding and increased capabilities for modeling complex phenomena, such as climate change, dynamics in the cosmos, the physiology and pathology of diseases, and even the human brain's inner workings and human intelligence, which in turn fuel technological developments even further.

Alternatively, big data – or, more aptly, the "data tsunami" – can be viewed as an obstacle in the way to a technological singularity. The rate at which data and information are produced today is staggering, far exceeding management and analytical capacities – and the gap is widening. Thus, there is a deepening information overload. In this context, the share of imprecise, incorrect, and irrelevant data is amassing much faster than the proportion of data that can be used or trusted, contributing to increasing levels of "data smog" or "information pollution" (Shenk 1997). In other words, the signal-to-noise ratio is in a downward
spiral. Although new developments in artificial intelligence, such as deep learning, are significantly enhancing abilities to process and glean insights from big data, i.e., to see the signals, such technologies also are used in misinformation and disinformation, i.e., to produce noise. Consider, for example, "deepfake" images and videos or nefarious "bots" operating in social media.

Society has long been better at producing more data and information than it can consume. In fact, the information overload problem pre-dates the digital age by thousands of years (Blair 2010). Moreover, while new technologies and approaches always come on to the scene to help find, filter, organize, sort, index, integrate, appraise, and interpret data and information, there is always a predictable "knock-on" information explosion (de Solla Price 1961). For instance, the Web 2.0 era ushered in new tools and strategies for managing the information and data deluge produced in the first generation of the World Wide Web. This, in turn, resulted in an endless and ever-expanding collection of digital tags, ratings, annotations, and comments with each ensuing iteration of the Web.

For any information processing entity – whether an algorithm, a robot, an organization, a city, or the human brain – to be intelligent in the face of information overload, it must know "when to gather information...; where and in what form to store it; how to rework and condense it; how to index and give access to it; and when and on whose initiative to communicate it to others" (Simon 1996). That is, it must be a screener, compressor, synthesizer, and interpreter of information, with the capacity to listen and think more than it speaks. In other words, the capacity to consume big data must exceed the capacity to produce it. Accordingly, the extent to which big data can facilitate a transition to a technological singularity hinges largely on the ability for machines (and humans) to manage the information overload.

Cross-References

▶ Artificial Intelligence
▶ Big Data Concept
▶ Information Overload
▶ Information Quantity
▶ Machine Learning

Further Reading

Arora, A. (2018). Heading towards the data singularity. Towards Data Science. https://towardsdatascience.com/heading-towards-the-data-singularity-829bfd82b3a0
Blair, A. M. (2010). Too much to know: Managing scholarly information before the modern age. New Haven/London: Yale University Press.
de Solla Price, D. (1961). Science since Babylon. New Haven: Yale University Press.
Dresp-Langley, B., Ekseth, O. K., Fesl, J., Gohshi, S., Kurz, M., & Sehring, H. W. (2019). Occam's razor for big data? On detecting quality in large unstructured datasets. Applied Sciences, 9(15), 3065.
Eden, A. H., Moor, J. H., Søraker, J. H., & Steinhart, E. (Eds.). (2012). Singularity hypotheses: A scientific and philosophical assessment. Heidelberg: Springer.
Good, I. J. (1966). Speculations concerning the first ultraintelligent machine. In Advances in computers (Vol. 6, pp. 31–88). Amsterdam: Elsevier.
Sandberg, A. (2013). An overview of models of technological singularity. In The transhumanist reader: Classical and contemporary essays on the science, technology, and philosophy of the human future (pp. 376–394). Hoboken: Wiley.
Shenk, D. (1997). Data smog. New York: HarperCollins Publishers.
Simon, H. A. (1996). Designing organizations for an information-rich world. International Library of Critical Writings in Economics, 70, 187–202.
Vinge, V. (1993, March). Technological singularity. In VISION-21 symposium sponsored by NASA Lewis Research Center and the Ohio Aerospace Institute (pp. 30–31).

Telemedicine

Warren Bareiss
Department of Fine Arts and Communication Studies, University of South Carolina Upstate, Spartanburg, SC, USA

Overview

Telemedicine is the transmission and reception of health information and/or treatment from point to
point or among many points across space. It is used for diagnosis, treatment, and prevention as well as for research and continuing education. Other terms that are sometimes used synonymously, or at least with some semantic overlap, are "telehealth" and "e-health." Telemedicine, in its various forms, brings healthcare services to remote locations, while making it possible to collect big data to better understand trends and service opportunities for poor and underserved populations far from urban medical centers.

The purpose of telemedicine is to increase accessibility to healthcare and health-related information where spatial distance is problematic, leading to challenges pertaining not only to traveling distance but also to time, expense, and shortage of medical professionals in areas that are not well served with regard to medical options or even accessibility to transportation. Telemedicine thus facilitates the flow of information among patients and providers for caregiving, on the one hand, and among healthcare professionals for training and coordination, on the other. Benefits include potentially faster diagnoses resulting in faster healing and more time available in treatment centers to care for patients who need to be admitted on-site.

Services provided via telemedicine are particularly needed in developing nations marked by high poverty, large rural communities, weak infrastructures including transportation systems, and chronic shortages of medical personnel – specialists in particular.

Telemedicine has three primary forms. Asynchronous telemedicine, called "store and forward," is the collection and storage of medical data for later retrieval. For example, a nurse practitioner at a remote location could take an X-ray, forward the image to a specialist via the Internet, and have the result in a matter of hours. Informational and reminder messages can also be programmed and sent to patients.

The second form of telemedicine is the monitoring of patients in real time from a distant facility. Examples include use of step counters, gait sensors, and electromyography devices.

The third form is also synchronous, wherein patients engage with healthcare providers "face to face" via a combination of audio and video technologies. Examples include patients being assisted at rural clinics and paramedics working with hospital emergency departments, transmitting vital signs and ascertaining if transportation to an emergency facility is required.

A wide range of telemedicine applications has been used globally, including dermatology, psychotherapy, neonatal testing, genetic counseling, wound care, diabetes management, and neurology. Accuracy of synchronous telemedicine examinations has been shown to be comparable to that of traditional examinations when a bedside clinician is present to assist with the requisite technology.

Telemedicine in its current forms began in the early 1990s with the development of the fiber optics necessary to carry large amounts of data to and from central hubs within respective systems. Because of its reliance upon visual data, radiology was the first major application.

Early in its development, telemedicine was still somewhat space biased, requiring patients and providers to communicate via dedicated facilities equipped with expensive videoconferencing technology. The cost and required technical support of such systems was prohibitive, especially in poor regions and in locations where reception was weak. Today, telemedicine systems are more commonly used due to the ease and relatively low cost of using the Internet and readily available video cameras.

Regulation

Regulatory structures are closely associated with the expansion, acceptance, and use of telemedicine systems. Canada, for example, has a centralized system with universal accreditation across provinces. Regulations regarding reimbursement and malpractice are comparable with traditional care, in some cases providing higher reimbursement for telemedicine services as an incentive strategy.

The United States, on the other hand, lacks a centralized system, so that licenses are not valid from state to state except in the case of the Veterans Administration. The issue reflects ongoing debate in the United States regarding the legitimacy of federal regulation in areas traditionally left to states, in this case, state medical boards.

In the United States, the telemedicine industry is represented by multiple lobbying agencies and professional organizations. One of the largest organizations advocating on behalf of telemedicine is the American Telemedicine Association (ATA), which provides information to professionals and to the general public, organizes an annual conference, and publishes a peer-reviewed journal: Telemedicine and e-Health. Among the most pressing issues for the ATA is promotion of interstate licensing agreements, without which regional and national systems of telemedicine are currently impossible in the United States.

In 2015, the Center for Medicare and Medicaid Services (CMS) permitted Medicare to reimburse telemedicine providers, but maintained limitations so that only rural patients are eligible for telemedicine services. Furthermore, patients must receive treatment at approved "originating sites" such as health clinics, nursing facilities, and physicians' offices. A final restriction is that Medicare reimbursement will only cover synchronous means of telemedicine between originating and distal sites.

The new ruling was welcomed, in part, by the ATA. The ATA's reaction to the measure's limitations, however, also reveals discord between the telemedicine industry and federal regulators. In response to the new ruling, for example, the ATA suggested that the CMS had slowed the proliferation of telemedicine when calling for more research instead of forging a multistate licensure agreement.

Like the ATA, the American Medical Association (AMA) also supports interstate licensure. In 2014, the AMA formally adopted the tripartite definition of telemedicine described above (synchronous delivery, remote monitoring, and store-and-forward systems) while joining in support of the interstate licensing as proffered by the Interstate Licensing Board.

A fundamental barrier to multistate licensure is the fact that there is no single, universally agreed upon definition of what exactly telemedicine is or does. "Telemedicine" as a term is widely applied to include a plethora of technologies, applications, and personnel. Cursory examination of medical literature on telemedicine reveals everything from a telephone conversation between doctor and patient to a host of specialized services emanating to a potentially vast public from a centralized source employing hundreds of healthcare providers.

Medicaid – administered on a state-by-state basis – has been much more supportive of telemedicine than has Medicare. According to a 2015 report published by the Center for Connected Health Policy (CCHP), 46 states currently provide some Medicaid reimbursement for synchronous, interactive telemedicine. Nine Medicaid programs support store-and-forward services apart from radiology, and 14 programs reimburse remote monitoring. Alaska, Minnesota, and Mississippi reimburse all three forms of telemedicine.

While the decentralized approach has slowed the growth of a national telemedicine system in the United States, more localized networks such as the South Carolina Telehealth Alliance and the Ohio Health Stroke Network support cooperative telemedicine endeavors among practitioners, medical centers, and government regulators within respective states.

Further, the so-called retail health movement in the United States has moved forward with telemedicine endeavors on a state-by-state basis, with California taking the lead. Pharmacy chain CVS began offering consultations with nurse practitioners via audio- and video-based technologies in selected California clinics in 2014. Other providers such as Kaiser Permanente, also based in California, are moving forward with retail-based telemedicine in conjunction with Target Corp. A similar partnership is under development between Sutter Express Care and Rite Aid in its California pharmacies.

Benefits

Beneficiaries of telemedicine include patients in rural locations where health practitioners are scarce. Travel to medical facilities requires transportation, time, and money, any of which can be prohibitive to potential patients. Also, in nations where there are not enough doctors to serve the population regardless of the terrain, telemedicine can be used to more efficiently reach large numbers of patients. Finally, telemedicine is useful for illnesses where physical mobility itself is problematic.

From the start, the US military has been at the forefront of telemedicine, for example, among military personnel deployed to locations far from fully equipped medical facilities. Store-and-forward systems are used in multiple applications to collect images from cameras or cell phones. Images are then sent from an on-site medical provider via secure e-mail to a central manager who then distributes the information to a consultant and/or a specialist. The process is then reversed. Telemedicine used in this way has led to a reduction in unnecessary evacuations from remote locations which, in turn, reduces the need for replacement personnel. The time in which military personnel receive consultation has been reduced from several weeks to a matter of hours. Also, the US Department of Veterans Affairs (VA) manages a robust telemedicine program stateside due to the high number of rural veterans, especially those who are losing mobility due to age and illness.

Telemedicine has demonstrated similar results among civilian populations by reducing unnecessary and costly transferal of patients from remote regions, again speeding patient evaluations and promoting greater concentration on patients whose transfer requirements are more urgent and necessary. These factors also benefit insurers due to reduction of fees accruing from ambulance or helicopter transport and hospital readmission.

Numerous studies have reported that patient satisfaction with telemedicine is remarkably high. Patients appreciate convenient accessibility, shorter stays in medical facilities, avoiding time away from work, and financial savings.

Barriers

Despite benefits to patients, providers, and insurers, development of telemedicine-based systems faces many impediments. Obstacles include lack of suitable infrastructure and national policies such as those described above, lack of expertise, perceived costs, and providers' unwillingness to adopt technologies and abandon old routines.

Lack of a generally agreed upon system of reimbursement is perhaps the greatest barrier to the proliferation of telemedicine systems worldwide, despite the fact that some research – albeit limited at this stage – has shown that over time telemedicine increases profit and costs less than traditional, in-person healthcare.

Another major concern is protection of privacy as information is shared and stored electronically. Besides ensuring standard means of protecting patient privacy, such as HIPAA requirements, secure telemedicine systems require dedicated data storage and servers along with point-to-point encryption. Privacy is also a concern when delivery is not in a completely secure environment, such as a kiosk at a retail center.

Further problematic issues include questions about the costs involving system maintenance and training. Also, the entire model of medicine at a distance via electronic technology puts much of the responsibility for health care in the hands of patients, for example in describing symptoms in isolation, apart from a more integrated examination involving blood tests combined with the knowledge and trust built from traditional, long-term doctor-patient relationships.

Telemedicine and Big Data

Use of telemedicine technologies among large numbers of patients permits the collection of a vast amount of data across space and through time. Such data can be used to examine patterns and differences in health conditions within and
across geographical settings and to track changes across space and through time. Such data could be used to provide regional, national, and international profiles, while also pinpointing localized health issues, thus mitigating against the spread of health crises such as viral contagion. As such, big data gathered through telemedicine can be used to inform regional, national, and international healthcare policymaking in the areas of prediction, prevention, intervention, and promotion.

Conversely, big data can be used in conjunction with telemedicine platforms and applications to help determine treatment for individual patients and thereby improve medical service on a case-by-case basis. For example, big data can be used to predict emergency medical situations and the onset of specific diseases among individual patients when comparing aggregate data with patient records, family history, immediate symptoms, and so on. Further, use of easily portable telehealth devices makes such service relatively inexpensive when contrasted with long wait times and travel distances from remote locations to well-equipped healthcare facilities in population centers.

Looking Forward

Despite ongoing questions and regulatory issues, national and even global telemedicine technologies, practices, and uses seem poised for exponential growth due to the ubiquity of cell phones across the globe, even in many poor nations and regions severely lacking in healthcare providers. Furthermore, given their propensity to communicate via social media, today's adolescents are likely to welcome health care via telemedicine if messaging is easy to access and use, personalized, and interactive. Easy access and affordability are not ends in themselves, however, as new questions will be raised particularly regarding appropriate training and oversight.

Although telemedicine does little to alleviate economic and social conditions that cause shortages of medical personnel in many parts of the world, it does appear to offer many benefits to patients and providers in need of immediate, accessible, and cost-effective health care.

Cross-References

▶ Data Sharing
▶ Data Storage
▶ Health Care Delivery
▶ Social Media

Further Reading

Achey, M., et al. (2014). Past, present, and future of telemedicine for Parkinson's disease. Movement Disorders, 29.
American Telemedicine Association. http://www.americantelemed.org. Accessed May 2015.
Azkari, A., et al. (2014). The 60 most highly cited articles published in the Journal of Telemedicine and Telecare and Telemedicine Journal and E-health. Journal of Telemedicine and Telecare, 20, 871–883.
Center for Connected Health Policy. http://cchpca.org/telehealth-medicaid-state-policy. Accessed May 2015.
Desjardins, D. (2015). Telemedicine comes to retail clinics. Health Leaders Media, 21/1, 1–3.
Dougherty, J. P., et al. (2015). Telemedicine for adolescents with type I diabetes. Western Journal of Nursing Research, 36, 1199–1221.
Hwang, J. S. (2014). Utilization of telemedicine in the U.S. military in a deployed setting. Military Medicine, 179, 1347–1353.
Islam, R., et al. (2019). Portable health clinic: An advanced tele-healthcare system for unreached communities. In L. Ohno-Machado & B. Séroussi (Eds.), MEDINFO 2019: Health and wellbeing e-networks for all. International Medical Informatics Association (IMIA) and IOS Press, 616–619.
Jelnes, R. (2014). Reflections on the use of telemedicine in wound care. EWMA Journal, 14, 48–51.
Kalid, N., et al. (2018). Based real time remote health monitoring systems: A review on patients prioritization and related 'big data' using body sensors information and communication technology. Journal of Medical Systems, 42(2), 30.
Leventhal, R. (2014). In Ohio, optimizing stroke care with telemedicine. Retrieved from https://www.hcinnovationgroup.com/policy-value-based-care/article/13024392/in-ohio-optimizing-stoke-care-with-telemedicine, Accessed 27 Nov 2020.
Ma, L. V., et al. (2016). An efficient session weight load balancing and scheduling methodology for high-quality telehealth care service based on WebRTC. The Journal of Supercomputing, 72, 3909–3926.
Sibson, L. (2014). The use of telemedicine technology to support pre-hospital patient care. Journal of Paramedic Practice, 6, 344–353.
Wenger, T. L. (2014). Telemedicine for genetic and neurologic evaluation in the neonatal intensive care unit. Journal of Perinatology, 34, 234–240.

Testing and Evaluation

▶ Anomaly Detection

The Big Data Research and Development Initiative (TBDRDI)

▶ White House Big Data Initiative

Time Series

▶ Financial Data and Trend Prediction

Time Series Analysis

▶ Time Series Analytics

Time Series Analytics

Erik Goepner
George Mason University, Arlington, VA, USA

Synonyms

Time series analysis, Time series data

Introduction

Time series analytics utilizes data observations recorded over time at certain intervals. Subsequent values of time-ordered data often depend on previous observations. Time series analytics is, therefore, interested in techniques that can analyze this dependence (Box et al. 2015; Zois et al. 2015). Up until the second half of the twentieth century, social scientists largely ignored the possibility of dependence within time series data (Kirchgässner et al. 2012). Statisticians have since demonstrated that adjacent observations are frequently dependent in a time series and that previous observations can often be used to accurately predict future values (Box et al. 2015).

Time series data abound and are of importance to many. Physicists and geologists investigating climate change, for example, use annual temperature readings, economists study quarterly gross domestic product and monthly employment reports, and policy makers might be interested in before and after annual traffic accident data to determine the efficacy of safety legislation. Time series analytics can be used to forecast, determine the transfer function, assess the effects of unusual intervention events, analyze the relationships between variables of interest, and design control schemes (Box et al. 2015). Preferably, observations have been recorded at fixed time intervals. If the time intervals vary, interpolation can be used to fill in the gaps (Zois et al. 2015).

Of critical importance is whether the variables are stationary or nonstationary. Stationary variables are not time dependent (i.e., mean, variance, and covariance remain constant over time). However, time series data are quite often nonstationary. The trend of nonstationary variables can be deterministic (e.g., following a time trend), stochastic (i.e., random), or both. Addressing nonstationarity is a key requirement for those working with time series and is discussed further under "Challenges" (Box et al. 2015; Kirchgässner et al. 2012).

Time series are frequently comprised of four components. There is the trend over the long term and, often, a cyclical component that is normally understood to be a year or more in length. Within the cycle, there can be a seasonal variation. And finally, there is the residual, which includes all variation not explained by the trend, cycle, and seasonal components. Prior to the 1970s, only the
residual was thought to include random impact, with trend, cycle, and seasonal change understood to be deterministic. That has changed, and now it is assumed that all four components can be stochastically modeled (Kirchgässner et al. 2012).

The Evolution of Time Series Analytics

In the first half of the 1900s, fundamentally different approaches were pursued by different disciplines. Natural scientists, mathematicians, and statisticians generally modeled the past history of the variable of interest to forecast future values of the variable. Economists and other social scientists, however, emphasized theory-driven models with their accompanying explanatory variables. In 1970, Box and Jenkins published an influential textbook, followed in 1974 by a study from Granger and Newbold, that has substantially altered how social scientists interact with time series data (Kirchgässner et al. 2012).

The Box Jenkins approach, as it has been frequently called ever since, relies on extrapolation. Box Jenkins focuses on the past behavior of the variable of interest rather than a host of explanatory variables to predict future values. The variable of interest must be transformed so that it becomes stationary and its stochastic properties time invariant. At times, the terms Box Jenkins approach and time series analysis have been used interchangeably (Kennedy 2008).

Time Series Analytics and Big Data

Big Data has stimulated interest in efficient querying of time series data. Both time series and Big Data share similar characteristics relating to volume, velocity, variety, veracity, and volatility (Zois et al. 2015). The unprecedented volume of data can overwhelm computer memory and prevent processing in real time. Additionally, the speed at which new data arrives (e.g., from sensors) has also increased. The variety of data includes the medium from which it comes (e.g., audio and video) as well as differing sampling rates, which can prove problematic for data analysis. Missing data and incompatible sampling rates are discussed further in the "Challenges" section below. Veracity includes issues relating to inaccurate, missing, or incomplete data. Before analysis, these issues should be addressed via duplicate elimination, interpolation, data fusion, or an influence model (Zois et al. 2015).

Contending with Massive Amounts of Data
Tremendous amounts of time series data exist, potentially overwhelming computer memory. In response, solutions are needed to lessen the effects on secondary memory access. Sliding windows and time series indexing may help. Both are commonly used; however, newer users may find the learning curve unhelpfully steep for time series indexing. Similarly, consideration should be given to selecting management schemes and query languages simple enough for common users (Zois et al. 2015).
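The sliding-window idea can be illustrated with a brief, hedged sketch in Python using the pandas library; the one-second sensor readings, the five-minute window, and the summary statistics below are invented for the example rather than drawn from the discussion above:

import numpy as np
import pandas as pd

# Hypothetical one-second sensor readings covering a single day
idx = pd.date_range("2021-01-01", periods=86_400, freq="S")
readings = pd.Series(np.random.default_rng(0).normal(size=len(idx)), index=idx)

# Sliding five-minute window: only a small buffer of recent values is needed
window = readings.rolling("5min")
summary = pd.DataFrame({"mean": window.mean(), "max": window.max()})

# Keep one summary row per window for storage or plotting
print(summary.resample("5min").last().head())

Because only the windowed summaries need to be retained, such a scheme keeps memory demands modest even when the raw series is far too large to hold at once.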
Analysis and Forecasting

Time series are primarily used for analysis and forecasting (Zois et al. 2015). A variety of potential models exist, including autoregressive (AR), moving average (MA), mixed autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA). ARMA models are used with stationary processes and ARIMA models for nonstationary ones (Box et al. 2015). Forecasting options include regression and nonregression based models. Model development should follow an iterative approach, often executed in three steps: identification, estimation, and diagnostic checking. Diagnostic checks examine whether the model is properly fit, and the checks analyze the residuals to determine model adequacy. Generally, 100 or more observations are preferred. If fewer than 50 observations exist, development of the initial model will require a combination of experience and past data (Box et al. 2015; Kennedy 2008).

Autoregressive, Moving Average, and Mixed Autoregressive Moving Average Models
An autoregressive model predicts the value of the variable of interest based on its values from one or more previous time periods (i.e., its lagged value).
If, for instance, the model only relied on the value of the immediately preceding time period, then it would be a first-order autoregression. Similarly, if the model included the values for the prior two time periods, then it would be referred to as a second-order autoregression and so on. A moving average model also uses lagged values, but of the error term rather than the variable of interest (Kennedy 2008). If neither an autoregressive nor moving average process succeeds in breaking off the autocorrelation function, then a mixed autoregressive moving average approach may be preferred (Kirchgässner et al. 2012). AR, MA, and ARMA models are used with stationary time series, to include time series made stationary through differencing. However, the potential loss of vital information during differencing operations should be considered (Kirchgässner et al. 2012).

ARMA models produce unconditional forecasts, using only the past and current values of the variable. Because such forecasts frequently perform better than traditional econometric models, they are often preferred. However, blended approaches, which transform linear dynamic simultaneous equation systems into ARMA models or the inverse, are also available. These blended approaches can retain information provided by explanatory variables (Kirchgässner et al. 2012).

Autoregressive Integrated Moving Average (ARIMA) Models
In ARIMA models, also known as ARIMA(p, d, q), p indicates the number of lagged values of Y*, which represents the variable of interest after it has been made stationary by differencing. d indicates the number of differencing operations required to transform Y into its stationary version, Y*. The number of lagged values of the error term is represented by q. ARIMA models can forecast for univariate and multivariate time series (Kennedy 2008).
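As a hedged illustration of this notation, the following Python sketch fits an ARIMA(p, d, q) model with the statsmodels library; the simulated random-walk series and the order (1, 1, 1) are assumptions made only for the example, not recommendations from the discussion above:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Simulated nonstationary (random walk) series of 120 monthly observations
rng = np.random.default_rng(42)
y = pd.Series(np.cumsum(rng.normal(size=120)),
              index=pd.date_range("2010-01-01", periods=120, freq="MS"))

# p = 1 lag of the (differenced) variable, d = 1 differencing operation,
# q = 1 lag of the error term
result = ARIMA(y, order=(1, 1, 1)).fit()
print(result.summary())

# Forecast the next 12 periods from the fitted model
print(result.forecast(steps=12))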
Vector Autoregressive (VAR) Models
VAR models blend the Box Jenkins approach with traditional econometric models. They can be quite helpful in forecasting. VAR models express a single vector (of all the variables) as a linear function of the vector's lagged values combined with an error vector. The single vector is derived from the linear function of each variable's lagged values and the lagged values for each of the other variables. VAR models are used to investigate the potential causal relationship between different time series, yet they are controversial because they are atheoretical and include dubious assertions (e.g., orthogonal innovation of one variable is assumed to not affect the value of any other variable). Despite the controversy, many scholars and practitioners view VAR models as helpful, particularly VAR's role in analysis and forecasting (Kennedy 2008; Kirchgässner et al. 2012; Box et al. 2015).

Error Correction Models
These models attempt to harness positive features of both ARIMA and VAR models, accounting for the dynamic feature of time series data while also taking advantage of the contributions explanatory variables can make. Error correction models add theory-driven exogenous variables to a general form of the VAR model (Kennedy 2008).

Challenges

Nonstationarity
Nonstationarity can be caused by deterministic and stochastic trends (Kirchgässner et al. 2012). To transform nonstationary processes into stationary ones, the deterministic and/or stochastic trends must be eliminated. Measures to accomplish this include differencing operations and regression on a time trend. However, not all nonstationary processes can be transformed (Kirchgässner et al. 2012).

The Box Jenkins approach assumes that differencing operations will make nonstationary variables stationary. A number of unit root tests have been developed to test for nonstationarity, but their lack of power remains an issue. Additionally, differencing (as a means of eliminating unit roots and creating stationarity) comes with
the undesirable effect of eliminating any theory-driven information that might otherwise contribute to the model.
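A minimal Python sketch of this workflow, using the augmented Dickey-Fuller unit root test from statsmodels on a simulated random walk (both choices are illustrative assumptions, not part of the original discussion), follows:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Simulated random walk: a classic unit root (nonstationary) process
rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(size=200)))

adf_stat, p_value, *_ = adfuller(y)
print(f"ADF statistic {adf_stat:.2f}, p-value {p_value:.3f}")

if p_value > 0.05:                  # unit root cannot be rejected
    y_diff = y.diff().dropna()      # one differencing operation (d = 1)
    print("p-value after differencing:", adfuller(y_diff)[1])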
Granger and colleagues developed cointegrated procedures to address this challenge (Kirchgässner et al. 2012). When nonstationary variables are cointegrated, that is, the variables remain relatively close to each other as they wander over time, procedures other than differencing can be used. Examples of cointegrated variables include prices and wages and short- and long-term interest rates. Error correcting models may be an appropriate substitute for differencing operations (Kennedy 2008). Cointegration analysis has helped shrink the gap between traditional econometric methods and time series analytics, facilitating the inclusion of theory-driven explanatory variables into the modeling process (Kirchgässner et al. 2012).

Autocorrelation
Time series data are frequently autocorrelated and, therefore, violate the assumption of randomly distributed error terms. When autocorrelation is present, the current value of a variable serves as a good predictor of its next value. Autocorrelation can disrupt models such that the analysis incorrectly concludes the variable is statistically significant when, in fact, it is not (Berman and Wang 2012). Autocorrelation can be detected visually or with statistical techniques like the Durbin-Watson test. If present, autocorrelation can be corrected with differencing or by adding a trend variable, for instance (Berman and Wang 2012).
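The Durbin-Watson check can be sketched in Python with statsmodels; the simulated trending series and the regression on a time trend below are assumptions used only for illustration (values of the statistic near 2 suggest little first-order autocorrelation):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated trending, autocorrelated series
rng = np.random.default_rng(7)
t = np.arange(100)
y = 0.5 * t + np.cumsum(rng.normal(size=100))

# Regression on a time trend, then inspect the residuals
X = sm.add_constant(t)
residuals = sm.OLS(y, X).fit().resid

# Values near 2 indicate little first-order autocorrelation
print("Durbin-Watson statistic:", durbin_watson(residuals))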
Missing Data and Incompatible Sampling Rates
Missing data occur for any number of reasons. Records may be lost, destroyed, or otherwise unavailable. At certain points, sampling rates may fail to follow the standard time measurement of the data series. Specialized algorithms may be necessary. Interpolation can be used as a technique to fill in missing data or to smooth the gaps between intervals (Zois et al. 2015).
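A small, hedged Python example of such interpolation with pandas follows; the timestamps, values, and five-minute grid are invented for illustration:

import pandas as pd

# Irregularly timed observations with a gap between readings
observed = pd.Series(
    [10.0, 11.5, 14.0],
    index=pd.to_datetime(["2021-01-01 00:00",
                          "2021-01-01 00:07",
                          "2021-01-01 00:20"]),
)

# Align to a regular five-minute grid, then fill the gaps by interpolation
regular = observed.resample("5min").mean().interpolate(method="time")
print(regular)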

Conclusion

Time series analytics utilizes data observations recorded over time at certain intervals, observations which often depend on each other. Time series analytics focuses on this dependence (Box et al. 2015; Zois et al. 2015). A variety of models exist for use in time series analysis (e.g., ARMA, ARIMA, VAR, and ECM). Of critical importance is whether the variables are stationary or nonstationary. Stationary variables are not time dependent (i.e., mean, variance, and covariance remain constant over time). However, time series data are quite often nonstationary. Addressing nonstationarity is a key requirement for users of time series (Box et al. 2015; Kirchgässner et al. 2012).

Cross-References

▶ Core Curriculum Issues (Big Data Research/Analysis)
▶ Spatiotemporal Analytics
▶ Time Series Analytics

Further Reading

Berman, E., & Wang, X. (2012). Essential statistics for public managers and policy analysts (3rd ed.). Los Angeles: CQ Press.
Box, G., Jenkins, G., Reinsel, G., & Ljung, G. (2015). Time series analysis: Forecasting and control. Hoboken: Wiley.
Kennedy, P. (2008). A guide to econometrics (6th ed.). Malden: Blackwell.
Kirchgässner, G., Wolters, J., & Hassler, U. (2012). Introduction to modern time series analysis (2nd ed.). Heidelberg: Springer Science & Business Media.
Zois, V., Chelmis, C., & Prasanna, V. (2015). Querying of time series for big data analytics. In L. Yan (Ed.), Handbook of research on innovative database query processing techniques (pp. 364–391). Hershey: IGI Global.

Time Series Data

▶ Time Series Analytics

Transnational Crime

Louise Shelley
Terrorism, Transnational Crime, and Corruption Center, George Mason University, Fairfax, VA, USA

Transnational crime has expanded dramatically in the past two decades as criminals have benefited from the speed and anonymity of the cyber world and encrypted social media. Developments in technology have facilitated the growth of many forms of traditional crime as well as introduced cyber-dependent crime in which the crime is linked to pernicious items sold primarily on the dark web, such as ransomware, botnets, and trojans. These new tools deployed by criminals have permitted the theft of billions of private records, the theft of identities, and the enormous growth of illicit e-commerce. This criminal activity has expanded even more during the COVID-19 pandemic, when individuals are isolated and spend a greater amount of time on the Internet and on cell phones. The increase in this crime has required cyber security firms and law enforcement to rely more on large-scale data analytics to stop this crime and to locate and aid its victims and bring criminals to justice.

Transnational criminals have capitalized on the possibilities of the Internet, the deep and the dark web, and social media, especially its end-to-end encryption, to expand their activities globally. Criminals are among the major beneficiaries of the anonymity of this new world of big data, providing them greater speed and outreach than previously. This phenomenal growth has occurred over the last two decades but has intensified particularly during the COVID-19 pandemic as individuals are more isolated and use their computers and cell phones more to engage with the outside world.

The use of big data for criminal purposes can be divided into two distinct categories: cyber-enabled crime and cyber-dependent crime that can exist only in the cyber world. Cyber-enabled crimes include existing forms of crime that have been transformed in scale or form by criminal use of the Internet, dark web, or social media. Included in this category are such crimes as drug trafficking, credit card fraud, human trafficking, and online sales of counterfeits, wildlife, and antiquities. For example, dark websites have allowed bulk sales of narcotics, facilitating impersonal interactions of drug traffickers and buyers. Silk Road, the first large online dark web drug marketplace, did billions of dollars in sales in its relatively short existence. Its replacements have continued to sell significant supplies of drugs online (Shelley 2018). During the COVID-19 pandemic, such cyber-enabled crimes as online fraud, dissemination of child abuse and pornography imagery, and sale of counterfeit medical products needed for the medical emergency have grown in particular.

Cyber-dependent crimes are defined as criminal activity in which a digital system is the target as well as the means of attack. Dark websites, accessed only by special software (e.g., Tor), sell these criminal tools such as ransomware, trojans, and botnets. Under this category of crime, information technology (IT) infrastructure can be disrupted, and data can be stolen on a massive scale using malware and phishing attacks. Many online cyber products are sold that can be used to extract ransoms, spread spam, and execute denial of service attacks. These same tools can lead to massive numbers of identity thefts and the theft of personal passwords, facilitating intrusion into bank and other financial accounts and loss of large sums by victims. Ransomware sold online has been used to freeze the record systems of hospitals treating patients until ransom payments are made. Year-on-year growth is detected in cyber-dependent crimes, and tens if not hundreds of millions of individuals were affected in 2020 through large-scale hacks and data breaches (Osborne 2020).

The availability of the Internet has provided for the dramatic expansion of customer access to purchase commercial sex and for exploiters to advertise victims of human trafficking. A major US government-funded computer research program, known as Memex, reported identified advertisement sales of approximately $250 million spent
on posting more than 60 million advertisements for commercial sexual services in a 2-year period (Greenmeier 2015). The Memex tool that provides big data analytics for the deep web is now used to target the human trafficking criminals operating online. One human trafficking network, operating out of China, indicted by federal prosecutors was linked to hundreds of thousands of escort advertisements and 55 websites in more than 25 cities in the USA, Canada, and Australia. This case reveals how large-scale data analytics is now key to understanding the networks and the activities behind transnational organized crime (United States of America et al. 2018).

Online and dark web sales, as well as those conducted through social media, are all facilitated by payment systems that process billions of transactions. The growth of global payments and the increased use of crypto currencies, many of them anonymized, make the identification of the account owners challenging. Therefore, finding the criminal transactions among the numerous international wire transfers, credit card, prepaid credit card, and crypto-currency transactions is difficult. Understanding the illicit activity requires the development of complex data analytics and artificial intelligence to ascertain the suspicious payments and link them with actual criminal activity.

Transnational criminals have been major beneficiaries of globalization and the rise of new technology. With their ability to use the Internet, deep and dark web, and social media to their advantage, capitalizing on anonymity and encryption, they have managed to advance their criminal objectives. Millions of individuals and institutions globally have suffered both personal and financial losses as law enforcement rarely possesses or keeps up with the advanced data analytics skills needed to counter the criminals' pernicious activities in social media and cyberspace.

Further Reading

Global Initiative Against Transnational Organized Crime. (2020). Crime and contagion: The impact of a pandemic on organized crime. https://globalinitiative.net/wp-content/uploads/2020/03/CovidPB1rev.04.04.v1.pdf. Accessed 22 Dec 2020.
Goodman, M. (2015). Future crimes: Everything is connected, everyone is vulnerable and what we can do about it. New York: Doubleday.
Greenmeier, L. (2015). Human traffickers caught on hidden Internet. https://www.scientificamerican.com/article/human-traffickers-caught-on-hidden-internet/. Accessed 22 Dec 2020.
Lusthaus, J. (2018). Industry of anonymity: Inside the business of cybercrime. Cambridge, MA: Harvard University Press.
Osborne, C. (2020). The biggest hacks, data breaches of 2020. https://www.zdnet.com/article/the-biggest-hacks-data-breaches-of-2020/. Accessed 22 Dec 2020.
Shelley, L. (2018). Dark commerce: How a new illicit economy is threatening our future. Princeton: Princeton University Press.
United States of America, Chen, Z. a.k.a. Chen, M., Zhou, W., Wang, Y. a.k.a. Sarah, Fu, T., & Wang, C. (2018, November 15). https://www.justice.gov/usao-or/press-release/file/1124296/download. Accessed 22 Dec 2020.

Transparency

Anne L. Washington
George Mason University, Fairfax, VA, USA

Transparency is a policy mechanism that encourages organizations to disclose information to the public. Scholars of big data and transparency recognize the inherent power of information and share a common intellectual history. Government and corporate transparency, which is often implemented by releasing open data, increases the amount of material available for big data projects. Furthermore, big data has its own need for transparency as data-driven algorithms support essential decisions in society with little disclosure about operations and procedures. Critics question whether information can be used as a control mechanism in an industry that functions as a distributed network.

Definition

Transparency is defined as a property of glass or any object that lets in light. As a governance
mechanism, transparency discloses the inner mechanisms of an organization. Organizations implement or are mandated to abide by transparency policies that encourage the release of information about how they operate. Hood and Heald (2006) use a directional typology to define transparency. Upward and downward transparency refers to disclosure within an organization. Supervisors observing subordinates is upward transparency, while subordinates observing the hierarchy above is downward transparency. Inward and outward transparency refers to disclosure beyond organizational boundaries. An organization aware of its environment is outward transparency, while citizen awareness of government activity is inward transparency. Transparency policies encourage the visibility of operating status and standard procedures.

First, transparency may compel information on operating status. When activities may impact others, organizations disclose what they are doing in frequent updates. For example, the US government required regular reports from stock exchanges and other financial markets after the stock market crash in 1929. Operating status information gives any external interest an ability to evaluate the current state of the organization.

Second, transparency efforts may distribute standard procedures in order to enforce ideal behaviors. This type of transparency holds people with the public trust accountable. For example, cities release open data with transportation schedules and actual arrival times. The planned information is compared to the actual information to evaluate behaviors and resource distribution. Procedural transparency assumes that organizations can and should operate predictably.
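A hedged sketch of such a comparison in Python with pandas is shown below; the feed, stop identifiers, and column names are hypothetical and serve only to illustrate how disclosed schedule and arrival data can be turned into a simple delay measure:

import pandas as pd

# Hypothetical disclosure of scheduled and actual arrival times
feed = pd.DataFrame({
    "stop_id":   ["A", "A", "B", "B"],
    "scheduled": pd.to_datetime(["2021-03-01 08:00", "2021-03-01 08:10",
                                 "2021-03-01 08:05", "2021-03-01 08:15"]),
    "actual":    pd.to_datetime(["2021-03-01 08:02", "2021-03-01 08:19",
                                 "2021-03-01 08:05", "2021-03-01 08:21"]),
})

# Planned versus actual: a simple accountability measure in minutes
feed["delay_minutes"] = (feed["actual"] - feed["scheduled"]).dt.total_seconds() / 60
print(feed.groupby("stop_id")["delay_minutes"].mean())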
Disclosures allow comparison and review. Detailed activity disclosure of operations answers questions of who, what, when, and where. Conversely, disclosures can also answer questions about influential people or wasteful projects. Disclosure may emphasize predictive trends and retrospective measurement, while other disclosures may emphasize narrative interpretation and explanation.

Implementation

Transparency is implemented by disclosing timely information to meet specific needs. This assumes that stakeholders will discover the disclosed information, comprehend its importance, and subsequently use it to change behavior. Organizations, including corporations and government, often implement transparency using technology which creates digital material used in big data.

Corporations release information about how their actions impact communities. The goal of corporate transparency is to improve services, share financial information, reduce harm to the public, or reduce reputation risks. The veracity of corporate disclosures has been debated by management science scholars (Bennis et al. 2008). On the one hand, mandatory corporate reporting fails if the information provided does not solve the target issue (Fung et al. 2007). On the other hand, organizations that are transparent to employees, management, stockholders, regulators, and the public may have a competitive advantage. In any case, there are real limits to what corporations can disclose and still remain both domestically and internationally competitive.

Governments release information as a form of accountability. From the creation of the postal code system to social security numbers, governments have inadvertently provided core categories for big data analytics (Washington 2014). Starting in the mid-twentieth century, legislatures around the world began to write freedom of information laws that supported the release of government materials on request. Subsequently, electronic government projects developed technology capabilities in public sector organizations.

Advances in computing have increased the use of big data techniques to automatically review transparency disclosures. Transparency can be implemented without technology, but often the two are intrinsically linked. One impact technology has on transparency is that information now comes in multiple forms. Disclosure before technology was the static production of documents and regularly scheduled reports that could be released on paper by request. Disclosure with technology is the dynamic streaming of real-time data available through machine-readable search and discovery. Transparency is
often implemented by releasing digital material as open data that can be reused with few limitations. Open data transparency initiatives disclose information in formats that can be used with big data methods.

Intellectual History

Transparency has its origins in economic and philosophical ideas about disclosing the activities of those in authority. In Europe, the intellectual history spans from Aristotle in fifth-century Greece to Immanuel Kant in eighteenth-century Prussia. Debates on big data can be positioned within these conversations about the dynamics of information and power. An underlying assumption of transparency is that there are hidden and visible power relationships in the exchange of information. Transparency is often an antidote to situations where information is used as power to control others.

Michel Foucault, the twentieth-century French philosopher, considered how rulers used statistics to control populations in his lecture on Governmentality. Foucault engaged with Jeremy Bentham's eighteenth-century descriptions of the ideal prison and the ideal government, both of which require full visibility. This philosophical position argues that complete surveillance will result in complete cooperation. While some research suggests that people will continue bad behavior under scrutiny, transparency is still seen as a method of enforcing good behavior.

Big data extends concerns about the balance of authority, power, and information. Those who collect, store, and aggregate big data have more control than those generating data. These conceptual foundations are useful in considering both the positive and negative aspects of big data.

Big Data Transparency

Big data transparency discloses the transfer and transformation of data across networks. Big data transparency brings visibility to the embedded power dynamic in predicting human behavior. Analysis of digital material can be done without explicit acknowledgment or agreement. Furthermore, the industry that exchanges consumer data is easily obscured because transactions are all virtual. While a person may willingly agree to free services from a platform, it is not clear if users know who owns, sees, collects, or uses their data. The transparency of big data is described from three perspectives: sources, organizations, and the industry.

Transparency of sources discloses information about the digital material used in big data. Disclosure of sources explains which data generated on which platforms were used in which analysis. The flip side of this disclosure is that those who create user-generated content would be able to trace their digital footprint. User-generated content creators could detect and report errors and also be aware of their overall data profile. Academic big data research on social media was initially questioned because of opaque sources from private companies. Source disclosure increases confidence in data quality and reliability.

Transparency of platforms considers organizations that provide services that create user-generated content. Transparency within the organization allows for internal monitoring. While part of normal business operations, someone with command and control is able to view personally identifiable information about the activities of others. The car ride service Uber was fined in 2014 because employees used the internal customer tracking system inappropriately. Some view this as a form of corporate surveillance because it includes monitoring customers and employees.

Transparency of the analytics industry discloses how the big data market functions. Industry transparency of operations might establish technical standards or policies for all participating organizations. The World Wide Web Consortium's data provenance standard provides a technical solution by automatically tracing where data originated. Multi-stakeholder groups, such as those for Internet Governance, are a possible tool to establish self-governing policy solutions. The intent is to heighten awareness of the data supply chain from upstream content quality to downstream big data production. Industry transparency
of procedure might disclose algorithms and research designs that are used in data-driven decisions.

Big data transparency makes it possible to compare data-driven decisions to other methods. It faces particular challenges because its production process is distributed across a network of individuals and organizations. The process flows from an initial data capture to secondary uses and finally into large-scale analytic projects. Transparency is often associated with fighting potential corruption or attempts to gain unethical power. Given the influence of big data in many aspects of society, the same ideas apply to the transparency of big data.

Criticism

A frequent criticism of transparency is that its unintended consequences may thwart the anticipated goals. In some cases, the trend toward visibility is reversed as those under scrutiny stop creating findable traces and turn to informal mechanisms of communication.

It is important to note that a transparency label may be used to legitimize authority without any substantive information exchange. Large amounts of information released under the name of transparency may not, in practice, provide the intended result. Helen Margetts (1999) questions whether unfiltered data dumps obscure more than they reveal. Real-time transparency may lack meaningful engagement because it requires intermediary interpretation. This complaint has been lodged at open data transparency initiatives that did not release crucial information.

Implementation of big data transparency is constrained by complex technical and business issues. Algorithms and other technology are layered together, each with its own embedded assumptions. Business agreements about the exchange of data may be private, and release may impact market competition. Scholars question how to analyze and communicate what drives big data, given these complexities.

Other critics question whether what is learned through disclosure is looped back into the system for reform or learning. Information disclosed for transparency may not be channeled to the right places or people. Without any feedback mechanism, transparency can be a failure because it does not drive change. Ideally, either organizations improve performance or individuals make new consumer choices.

Summary

Transparency is a governance mechanism for disclosing activities and decisions that profoundly enhances confidence in big data. It builds on existing corporate and government transparency efforts to monitor the visibility of operations and procedures. Transparency scholarship builds on earlier research that examines the relationship between power and information. Transparency of big data evaluates the risks and opportunities of aggregating sources for large-scale analytics.

Cross-References

▶ Business
▶ Data Governance
▶ Economics
▶ Privacy
▶ Standardization

Further Reading

Bennis, W. G., Goleman, D., & O'Toole, J. (2008). Transparency: How leaders create a culture of candor. San Francisco: Jossey-Bass.
Fung, A., Graham, M., & Weil, D. (2007). Full disclosure: The perils and promise of transparency. New York: Cambridge University Press.
Hood, C., & Heald, D. (Eds.). (2006). Transparency: The key to better governance? Oxford/New York: Oxford University Press.
Margetts, H. (1999). Information technology in government: Britain and America. London: Routledge.
Washington, A. L. (2014). Government information policy in the era of big data. Review of Policy Research, 31(4). https://doi.org/10.1111/ropr.12081.

Transportation Visualization

Xinyue Ye
Landscape Architecture & Urban Planning, Texas A&M University, College Station, TX, USA

The massive amounts of granular mobility data of people and transportation vehicles form a basic component in the smart cities paradigm. The volume of available trajectory data has increased considerably because of the increasing sophistication and ubiquity of information and communication technology. The movement data records real-time trajectories sampled as a series of georeferenced points over transportation networks. There is an imperative need for effective and efficient methods to represent and examine the human and vehicle attributes as well as the contextual information of transportation phenomena in a comparative context. Visualizing the emerging large-scale transportation data can offer the stakeholders unprecedented capability to carry out data-driven urban system studies based on real-world flow information, in order to enhance communities in the twenty-first century. Nowadays, a large number of transportation data sets are collected or distributed by administrations, companies, researchers, and volunteers. Some of them are available for public use, allowing researchers to duplicate the results of a prior study using the same data and procedures. More datasets are not publicized due to privacy, but the results of a prior study can be duplicated if the same procedures are followed but a similar transportation data set is collected. The coming decades will witness more such datasets and especially openness of visual analytics procedures due to the increasing popularity of trajectory recording devices and citizen science. Open-source visual analytics represents a paradigm shift in transportation research that has facilitated collaboration across disciplines.

Since the early twenty-first century, the development of spatial data visualization for computational social science has been gaining momentum. As a multidimensional and multiscale phenomenon, transportation calls for scalable and interactive visualization. To gain insights from the heterogeneous and unstructured transportation data, the users need to conduct iterative, evolving information foraging and sense making using their domain knowledge or collaborative thinking. Iterative visual exploration is fundamental in this process, which needs to be supported by efficient data management and visualization capabilities. Even simple tasks, such as smoothly displaying the heat maps of thousands of taxis' average, maximum, or minimum speed, cannot be easily completed with active user interactions without the appropriate algorithm design. Such operations require temporal and spatial data aggregations and visualizations with random access patterns, where advanced computational technologies should be employed. In general, an ideal visual analytics software system needs to offer the following: (1) a powerful computing platform, so that common users are not limited by their computational resources and can accomplish their tasks over daily-used computers or mobile devices; (2) an easy-access gateway, so that the transportation data can be retrieved, analyzed, and visualized by different user groups and their results can be shared and leveraged by others; (3) scalable data storage and management, so that a variety of data queries can be responded to immediately; (4) exploratory visualizations, so that intuitive and efficient interactions can be facilitated; and (5) a multiuser system, so that simultaneous operations are allowed by many users from different places.
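A hedged Python sketch of the kind of spatiotemporal aggregation behind such a speed heat map follows; the coordinate ranges, column names, grid resolution, and simulated GPS points are assumptions made for illustration only:

import numpy as np
import pandas as pd

# Simulated GPS points from many taxis (coordinates, speed in km/h, timestamp)
rng = np.random.default_rng(3)
n = 10_000
points = pd.DataFrame({
    "lon":   rng.uniform(-84.60, -84.30, n),
    "lat":   rng.uniform(33.60, 33.90, n),
    "speed": rng.gamma(shape=4.0, scale=8.0, size=n),
    "time":  pd.Timestamp("2021-06-01")
             + pd.to_timedelta(rng.integers(0, 86_400, n), unit="s"),
})

# Bin space into a 50 x 50 grid and time into hours, then aggregate speeds
points["lon_bin"] = pd.cut(points["lon"], bins=50, labels=False)
points["lat_bin"] = pd.cut(points["lat"], bins=50, labels=False)
points["hour"] = points["time"].dt.hour

grid = (points.groupby(["hour", "lat_bin", "lon_bin"])["speed"]
              .agg(["mean", "max", "min", "count"]))
print(grid.head())

Each cell of the resulting grid is a candidate pixel in an hourly heat map, which is why interactive display at scale depends on fast aggregation and indexing rather than on rendering raw points.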
Since the early twenty-first century, the devel- researchers demand handy and effective visual
opment of spatial data visualization for computa- analytics software systems integrating scalable
tional social science has been gaining momentum. trajectory databases, intuitive and interactive visu-
As a multidimensional and multiscale phenome- alization, and high-end computing resources. The
non, transportation calls for scalable and availability of codes or publicly accessible
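As a rough illustration of the temporal and spatial aggregation behind such speed heat maps, the following Python sketch bins georeferenced taxi fixes into grid cells and hours and summarizes speed per bin. It is a minimal example; the file name, column names, and grid size are hypothetical rather than taken from any system discussed in this entry.

import pandas as pd

# Hypothetical trajectory table: one GPS fix per row with
# columns taxi_id, timestamp, lon, lat, speed_kmh.
points = pd.read_csv("taxi_points.csv", parse_dates=["timestamp"])

# Spatial binning: snap each fix to a ~0.01-degree grid cell.
points["cell_x"] = (points["lon"] / 0.01).round().astype(int)
points["cell_y"] = (points["lat"] / 0.01).round().astype(int)
# Temporal binning: aggregate by hour of the day.
points["hour"] = points["timestamp"].dt.hour

# Average, maximum, and minimum speed per cell and hour, i.e.,
# the summary statistics a speed heat map would display.
heat = (points
        .groupby(["cell_x", "cell_y", "hour"])["speed_kmh"]
        .agg(["mean", "max", "min", "count"])
        .reset_index())

On a small sample a single machine suffices, but the same aggregation over billions of fixes is exactly the kind of operation that the scalable back ends described in this entry are meant to support.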
Conventional transportation design software, such as TransCAD, Cube, and EMME, provides platforms for transportation forecasting, planning, and analysis with some visual representations of the results. However, these software packages are not specifically developed for big transportation data visualization. Domain practitioners, researchers, and decision-makers need to store, manage, query, and visualize big and dynamic transportation data. Therefore, transportation researchers demand handy and effective visual analytics software systems integrating scalable trajectory databases, intuitive and interactive visualization, and high-end computing resources. The availability of code or publicly accessible software will further play a transformative role for reproducible, replicable, and generalizable transportation sciences. Early transportation visualization with limited interaction capabilities relied on the abilities of traditional visualization methods, such as bar charts, line plots, and geographic information system mapping (e.g., heat maps or choropleths). However, many new methods and packages have been developed to visually explore trajectory data, using various visual metaphors and instant interactions, such as GeoTime, TripVista, FromDaDy, vessel movement, and TrajAnalytics. Such visualization will facilitate easy exploration of big trajectory data by an extensive community of stakeholders. Knowledge coproduction and community engagement can be strategically realized by using transportation visualization as a networking platform.

To conduct transportation visualization, two major technical components are needed: a robust database and an interactive visualization interface. A scalable database is a must for transportation data management, supporting fast computation over various data queries in a remote and distributed computing environment. An interactive visualization interface allows the researchers to query the data stored in the database, to discover patterns, to generate hypotheses, and to share their insights with others. In addition to semantic zoom, brushing, and linking, the users can perform progressive visual exploration, for the purpose of interactively formulating hypotheses instead of only testing hypotheses. Moreover, transportation visualization can reveal complex urban system phenomena not identified otherwise. Public trajectory datasets can be made available on the cloud, so that researchers all over the world can stimulate transportation research and other activities to enhance the robustness and reliability of their urban and regional studies. On the other hand, the software can also be used privately by companies and research centers to manage their trajectory data on clouds or local clusters, where researchers with granted rights can access and visually analyze the data easily.
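A small sketch of this division of labor, with a query against a hypothetical trajectory store handed to a plotting front end, might look as follows in Python. The database file, table, and column names are invented for illustration, and a production system would use a scalable spatial database rather than a local SQLite file.

import sqlite3
import matplotlib.pyplot as plt

# Hypothetical trajectory store answering a space-time window query.
conn = sqlite3.connect("trajectories.db")
query = """
    SELECT lon, lat, speed_kmh
    FROM gps_points
    WHERE timestamp BETWEEN '2019-06-01 08:00' AND '2019-06-01 09:00'
      AND lon BETWEEN -81.8 AND -81.4
      AND lat BETWEEN 41.3 AND 41.6
"""
lon, lat, speed = zip(*conn.execute(query).fetchall())

# A very small interactive front end: render the query result and let
# the analyst pan and zoom in the plotting window.
plt.scatter(lon, lat, c=speed, s=2, cmap="viridis")
plt.colorbar(label="speed (km/h)")
plt.title("Taxi fixes returned by the window query")
plt.show()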
In summary, getting acquainted with data is the task of transportation visualization. In order to develop a more spatially explicit transportation theory, it is first necessary to develop operational visualization that captures the spatial dynamics inherent in the datasets. The debates on transportation dynamics have been informed by, and to some extent inspired by, a parallel development of new visualization methods which has quantified the magnitude of human dynamics. Users can be released from the burden of limited computing capacity, allowing them to focus on their research questions on transport systems. Furthermore, transportation visualization can act as an outreach platform which can help government agencies communicate transportation planning and policies more effectively to the communities. With the coming age of the digital twin, a digital replica of the transportation system would also offer a new ecosystem for transportation visualization to play a critical role.

Cross-References

▶ Cell Phone Data
▶ Visualization

Further Reading

Al-Dohuki, S., Kamw, F., Zhao, Y., Ye, X., Yang, J., & Jamonnak, S. (2019). An open source TrajAnalytics software for modeling, transformation and visualization of urban trajectory data. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC) (pp. 150–155). Piscataway: IEEE.
Huang, X., Zhao, Y., Yang, J., Zhang, C., Ma, C., & Ye, X. (2016). TrajGraph: A graph-based visual analytics approach to studying urban network centralities using taxi trajectory data. IEEE Transactions on Visualization and Computer Graphics, 22(1), 160–169.
Li, M., Ye, X., Zhang, S., Tang, X., & Shen, Z. (2017). A framework of comparative urban trajectory analysis. Environment and Planning B. https://doi.org/10.1177/2399808317710023.
Pack, M. L. (2010). Visualization in transportation: Challenges and opportunities for everyone. IEEE Computer Graphics and Applications, 30(4), 90–96.
Shaw, S., & Ye, X. (2019). Capturing spatiotemporal dynamics in computational modeling. In J. P. Wilson (Ed.), The geographic information science & technology body of knowledge. https://doi.org/10.22224/gistbok/2019.1.6.
Treatment

Qinghua Yang1 and Yixin Chen2
1 Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA
2 Department of Communication Studies, Sam Houston State University, Huntsville, TX, USA

Treatment and Big Data

The Information Age has witnessed a rapid increase in biomedical information, which can lead to information overload and make information management difficult. One solution for managing large volumes of data and reducing diagnostic errors is big data, which involve individuals' basic information, daily activities, and health conditions. Such information can come from different sources ranging from patients' health records and public health reports to social media posts. After being digitalized, archived, and/or transformed, they grow into big data, which can serve as a valuable source for public health professionals and researchers to obtain new medical knowledge and find new treatments for diseases. For example, big data can be used to develop predictive models of clinical trials, and the results from such modeling and simulation can inform early decision-making for treatment.

One primary application of big data to medical treatment is personalized treatment, which refers to developing individualized therapies based on subgroups of patients who have a specific type of disease. Driven by the need for personalized medicine, big data have already been applied in the health care industry, particularly in treatment for cancer and rare diseases. Doctors can consult the databases to get advice on treatment strategies that might work for specific patients, based on the records of similar patients around the world. For instance, by interpreting biological data on childhood cancer patients, a research team at the University of Technology, Sydney, compared existing and previous patients' gene expressions and variations to assist clinicians at the bedside in determining the best treatment for patients. The information revealed by huge medical databases (i.e., medical big data) can improve the understanding of potential risks and benefits of various treatments, accelerate the development of new medicines or treatments, and ultimately advance health care quality.

There has been rapid development in academic research on personalized treatment using big data in recent years. For instance, a research project funded by the National Science Foundation used big data to explore better treatments for pain management by creating a system that coordinates and optimizes all the available information, including the pain data, and takes into consideration a number of variables, such as daily living activities, marital status, and drug use. Similarly, in a project sponsored by the American Society of Clinical Oncology (ASCO), the researchers collected patients' age, gender, medications, and other illnesses, along with their diagnoses, treatment, and, eventually, date of death. Since the sheer volume of patients should overcome some data limitations (e.g., outliers), their overarching goal is to first accumulate data on as many cancer patients as possible and then to analyze and quantify the data. Instead of giving a particular treatment to everyone with a particular disease, these projects aimed to personalize health care by tailoring the treatment specifically for each patient.

Besides personalized treatment, there is also an increasing application of artificial intelligence and machine learning (ML) techniques in discovering new drugs and assessing their efficacy, as well as improving treatment. For instance, ranking methods, a new class of ML methods that can rank chemical structures based on their chances of clinical success, can be invaluable in prioritizing chemical compounds for screening and saving resources in developing compounds for new drugs. Although this new class of ML methods is promising during initial stages of screening, it turns out to be unsuitable for further development after several rounds of expensive (pre)clinical testing. Other important applications of ML techniques include self-organizing maps, multilayer perceptrons, Bayesian neural networks, counterpropagation neural networks, and support vector
machines, whose performance was found to have significant advantages compared to some traditional statistical methods (e.g., multiple linear regressions, partial least squares) in drug design, especially in solving actual problems such as the prediction of biological activities, the construction of quantitative structure–activity relationship (QSAR) or quantitative structure–property relationship (QSPR) models, virtual screening, and the prediction of pharmacokinetic properties. Furthermore, IBM Watson's cognitive computing helps physicians and researchers analyze huge volumes of data and focus on critical decision points, which is essential to improving the delivery of effective therapies and to using treatment data to personalize therapies.
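The ranking idea mentioned above can be sketched in a few lines of Python: a model is fit to descriptors of compounds with known outcomes, and its scores are used to order untested candidates for screening. The arrays below are randomly generated placeholders, so this is only a schematic illustration and not the method of any study cited in this entry.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data: rows are compounds, columns are numeric
# molecular descriptors; y records past (pre)clinical success.
X_train = np.random.rand(200, 16)
y_train = np.random.randint(0, 2, size=200)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score a library of untested compounds and rank them so the most
# promising candidates are screened first.
X_candidates = np.random.rand(1000, 16)
scores = model.predict_proba(X_candidates)[:, 1]
ranking = np.argsort(scores)[::-1]      # highest-scoring compounds first
shortlist = ranking[:50]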
Controversy

Despite the promise of applying big data to medical treatment, some issues related to big data application are equally noteworthy. First, several questions remain unanswered regarding patients' consent to have their data join the system, including how often the consent should be given, in what form the consent should be obtained, and whether it is possible to obtain true consent given the public's limited knowledge about big data. Failure to appropriately answer these questions may engender ethical issues and misuse of big data in medical treatment.

Second, there is a gap between the curriculum in medical education and the need to integrate big data into better treatment decisions. Therefore, doctors may find information in medical big data, though overwhelming, not particularly relevant for making treatment decisions. On the other hand, the algorithmic models generated by big data analyses may not be transparent enough about why a specific treatment is recommended for certain patients, making these models like black boxes that may not be trusted by doctors.

Lastly, the translation process from academic research to medical practice is often expensive and time-consuming. Thus, researchers in charge of advancing treatment are often constrained in their ability to improve people's health and quality of life. Also, some researchers can still publish their studies and get grants by following traditional methods, so there is not enough motivation for them to try innovative approaches such as big data for treatment development. The application of big data to treatment and health care can be held back due to the long translation process from research to practice and the limited new knowledge generated by published studies following traditional methods.

Cross-References

▶ Biomedical Data
▶ Health Care Delivery
▶ Health Informatics
▶ Patient Records

Further Reading

Agarwal, S., Dugar, D., & Sengupta, S. (2010). Ranking chemical structures for drug discovery: A new machine learning approach. Journal of Chemical Information and Modeling, 50(5), 716–731.
DeGroff, C. G., Bhatikar, S., Hertzberg, J., Shandas, R., Valdes-Cruz, L., & Mahajan, R. L. (2001). Artificial neural network-based method of screening heart murmurs in children. Circulation, 103(22), 2711–2716.
Duch, W., Swaminathan, K., & Meller, J. (2007). Artificial intelligence approaches for rational drug design and discovery. Current Pharmaceutical Design, 13(14), 1497–1508.
Gertrudes, J. C., Maltarollo, V. G., Silva, R. A., Oliveira, P. R., Honorio, K. M., & Da Silva, A. B. F. (2012). Machine learning techniques and drug design. Current Medicinal Chemistry, 19(25), 4289–4297.
Hoffman, S., & Podgurski, A. (2013). The use and misuse of biomedical data: Is bigger really better? American Journal of Law & Medicine, 39(4), 497–538.
Liu, B. (2014). Utilizing big data to build personalized technology and system of diagnosis and treatment in traditional Chinese medicine. Frontiers in Medicine, 8(3), 272–278.
Weingart, N. S., Wilson, R. M., Gibberd, R. W., & Harrison, B. (2000). Epidemiology of medical error. BMJ, 320(7237), 774–777.
United Nations Educational, Scientific and Cultural Organization (UNESCO)

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

United Nations Educational, Scientific and Cultural Organization (UNESCO), founded in 1945, is an agency of the United Nations (UN) which specializes in education, natural sciences, social and human sciences, culture, and communications and information. With 195 members, 9 associate members, and 50 field offices, working with over 300 international NGOs, UNESCO carries out activities in all of these areas, with the post-2015 development agenda underpinning their overall agenda.

As the only UN agency with a mandate to address all aspects of education, it proffers that education is at the heart of development, with a belief that education is fundamental to human, social, and economic development. It coordinates the "Education for All" movement, a global commitment to provide quality basic education for all children, youth, and adults, monitoring trends in education and, where possible, making attempts to raise the profile of education on the global development agenda. For the natural sciences, UNESCO acts as an advocate for science as it focuses on encouraging international cooperation in science as well as promoting dialogue between scientists and policy-makers. In doing so, it acts as a platform for the dissemination of ideas in science and encourages efforts on crosscutting themes including disaster risk reduction, biodiversity, engineering, science education, climate change, and sustainable development. Within the social and human sciences, UNESCO plays a large role in promoting heritage as a source of identity and cohesion for communities. It actively contributes by developing cultural conventions that provide mechanisms for international cooperation. These international agreements are designed to safeguard natural and cultural heritage across the globe, for example, through designation as UNESCO World Heritage sites. The development of communication and the sharing of information is embedded in all their activities.

UNESCO has five key objectives: to attain quality education for all and lifelong learning; mobilize science knowledge and policy for sustainable development; address emerging social and ethical challenges; foster cultural diversity, intercultural dialogue, and a culture of peace; and build inclusive knowledge societies through information and communication. Like other UN agencies, UNESCO has been involved in debates about the data revolution for development and the role that big data can play.

The data revolution for sustainable development is an international initiative designed to improve the quality of data and information that is generated and made available. It recognizes that
societies need to take advantage of new technologies and crowd-sourced data and improve digital connectivity in order to empower citizens with information that can contribute towards progress towards wider development goals. While there are many data sets available about the state of global education, it is argued that better data could be generated, even around basic measures such as the number of schools. In fact, rather than focus on "big data," which has captured the attention of many leaders and policy-makers, more effort should instead focus on "little data," that is, data that are both useful and relevant to particular communities. Discussions are now shifting to identify which indicators and data should be prioritized.

The UNESCO Institute for Statistics is the organization's own statistical arm; however, much of the data collection and analysis that takes place there relies on much more conventional management and information systems, which in turn rely on national statistical agencies that in many developing countries are often unreliable or heavily focused on administrative data (UNESCO 2012). This means that the data used by UNESCO are often out of date or not detailed enough. While digital technologies have become widely used in many societies, more potential sources of data are generated (Pentland 2013). For example, mobile phones are now used as banking devices as well as for standard communications. Official statistics organizations in many countries and international organizations are still behind in that they have not developed ways to adapt and make use of these data alongside the standard administrative data already collected.

There are a number of innovative initiatives to make better use of survey data and mobile phone-based applications to collect data more efficiently and provide more timely feedback to schools, communities, and ministries on target areas such as enrolment, attendance, and learning achievement. UNESCO could make a significant contribution to a data revolution in education by investing resources in collecting these innovations and making them more widely available to countries.

Access to big data for development, as with all big data sources, presents a number of ethical considerations based around the ownership of data and privacy. This is an area the UN recognizes that policy-makers will need to address to ensure that data will be used safely to address their objectives while still protecting the rights of the people whom the data are about or generated from. Furthermore, there are a number of critiques of big data which make more widespread use of big data for UNESCO problematic: claims that big data are objective and accurate representations are misleading; not all data produced can be used comparably; there are important ethical considerations about the use of big data; and limited access to big data is exacerbating existing digital divides.

The Scientific Advisory Board of the Secretary-General of the United Nations, which is hosted by UNESCO, provided comments on the report on the data revolution in sustainable development. It highlighted concerns over equity and access to data, noting that the data revolution should lead to equity in access and use of data for all. Furthermore, it suggested that a number of global priorities should be included in any agenda related to the data revolution: countries should seek to avoid contributing to a data divide between rich and poor countries; there should be some form of harmonization and standardization of data platforms to increase accessibility internationally; there should be national and regional capacity-building efforts; and there should be a series of training institutes and training programs in order to develop skills and innovation in areas related to data generation and analysis (Manyika et al. 2011). A key point made here is that the quality and integrity of the data generated need to be addressed, as it is recognized that big data often play an important role in political and economic decision-making. Therefore, a series of standards and methods for the analysis and evaluation of data quality should be developed.

In the journal Nature, Hubert Gijzen of the UNESCO Regional Science Bureau for Asia and
the Pacific calls for more big data to help secure a sustainable future (Gijzen 2013). He argues that more data should be collected which can be used to model different scenarios for sustainable societies concerning a range of issues, from energy consumption and improving water conditions to poverty eradication. Big data, according to Gijzen, has the potential, if coordinated globally between countries, regions, and relevant institutions, to have a big impact on the way societies address some of these global challenges. The United Nations has begun to take action to do this through the creation of the Global Pulse initiative, bringing together experts from the government, academic, and private sectors to consider new ways to use big data to support development agendas. Global Pulse is a network of innovation labs which conduct research on Big Data for Development via collaborations between governments, academia, and the private sector. The initiative is designed especially to make use of the digital data flood that has developed in order to address the development agendas that are at the heart of UNESCO, and the UN more broadly.

The UN Secretary-General's Independent Expert Advisory Group on the Data Revolution for Sustainable Development produced the report "A World That Counts" in November 2014, which suggested a number of key principles that should be sought with regard to the use of data: data quality and integrity, to ensure clear standards for the use of data; data disaggregation, to provide a basis for comparison; data timeliness, to encourage a flow of high-quality data for use in evidence-based policy-making; data transparency, to encourage systems which allow data to be made freely available; data usability, to ensure data can be made user-friendly; data protection and privacy, to establish international and national policies and legal frameworks for regulating data generation and use; data governance and independence; data resources and capacity, to ensure all countries have effective national statistical agencies; and finally data rights, to ensure human rights remain a core part of any legal or regulatory mechanisms that are developed with respect to big data (United Nations 2014). These principles are likely to influence UNESCO's engagement with big data in the future.

UNESCO, and the UN more broadly, acknowledge that technology has been, and will continue to be, a driver of the data revolution and of a wider variety of data sources. For big data that are derived from this technology to have an impact, these data sources need to be leveraged in order to develop a greater understanding of the issues related to the development agenda.

Cross-References

▶ International Development
▶ United Nations Educational, Scientific and Cultural Organization (UNESCO)
▶ World Bank

Further Reading

Gijzen, H. (2013). Development: Big data for a sustainable future. Nature, 52, 38.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. New York. http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation. Accessed 12 Nov 2014.
Pentland, A. (2013). The data driven society. Scientific American, 309, 78–83.
UNESCO (2012). Learning analytics. UNESCO Institute for Information Technologies Policy Brief. Available from http://iite.unesco.org/pics/publications/en/files/3214711.pdf. Accessed 11 Nov 2014.
United Nations (2014). A world that counts. United Nations. Available from http://www.unglobalpulse.org/IEAG-Data-Revolution-Report-A-World-That-Counts. Accessed 28 Nov 2014.

Unstructured Data

▶ Data Integration
Upturn

Katherine Fink
Department of Media, Communications, and Visual Arts, Pace University, Pleasantville, NY, USA

Introduction

Upturn is a think tank that focuses on the impact of big data on civil rights. Founded in 2011 as Robinson + Yu, the organization announced a name change in 2015 and expansion of its staff from two to five people. The firm's work addresses issues such as criminal justice, lending, voting, health, free expression, employment, and education. Upturn recommends policy changes with the aim of ensuring that institutions use technology in accordance with shared public values. The firm has published white papers, academic articles, and an online newsletter targeting policymakers and civil rights advocates.

Background

Principals of Upturn include experts in law, public policy, and software engineering. David Robinson was formerly the founding Associate Director of Princeton University's Center for Information Technology Policy, which conducts interdisciplinary research in computer science and public policy. Robinson holds a JD from Yale University's Law School and has reported for the Wall Street Journal and The American, an online magazine published by the American Enterprise Institute. Harlan Yu holds a PhD in Computer Science from Princeton University, where he developed software to make court records more accessible online. He has also advised the US Department of Labor on open government policies and analyzed privacy, advertising, and broadband access issues for Google. Aaron Rieke has a JD from the University of California Berkeley's Law School and has worked for the Federal Trade Commission and the Center for Democracy and Technology on data security and privacy issues.

Cofounders Robinson and Yu began their collaboration at Princeton University as researchers on government transparency and civic engagement. They were among four coauthors of the 2009 Yale Journal of Law & Technology article "Government Data and the Invisible Hand," which argued that the government should prioritize opening access to more of its data rather than creating websites. The article suggested that "private parties in a vibrant marketplace of engineering ideas" were better suited to develop websites that could help the public access government data. In 2012, Robinson and Yu coauthored the UCLA Law Review article "The New Ambiguity of 'Open Government,'" in which they argued that making data more available to the public did not by itself make government more accountable. The article recommended separating the notion of open government from the technologies of open data in order to clarify the potential impacts of public policies on civic life.

Criminal Justice

Upturn has worked with the Leadership Conference, a coalition of civil rights and media justice organizations, to evaluate police department policies on the use of body-worn cameras. The organizations, noting increased interest in the use of such cameras following police-involved deaths in communities such as Ferguson (Missouri), New York City, and Baltimore, also cautioned that body-worn cameras could be used for surveillance, rather than protection, of vulnerable individuals. The organizations released a scorecard on body-worn camera policies of 25 police departments in November 2015. The scorecard included criteria such as whether body-worn camera policies were publicly available, whether footage was available to people who file misconduct complaints, and whether the policies limited the use of biometric technologies to identify people in recordings.
Lending

Upturn has warned of the use of big data by predatory lenders to target vulnerable consumers. In a 2015 report, "Led Astray," Upturn explained how businesses used online lead generation to sell risky payday loans to desperate borrowers. In some cases, Upturn found that the companies violated laws against predatory lending. Upturn also found some lenders exposed their customers' sensitive financial data to identity thieves. The report recommended that Google, Bing, and other online platforms tighten restrictions on payday loan ads. It also called on the lending industry to promote best practices for online lead generation and for greater oversight of the industry by the Federal Trade Commission and Consumer Financial Protection Bureau.

Robinson + Yu researched the effects of the use of big data in credit scoring in a guide for policymakers titled "Knowing the Score." The guide endorsed the most widely used credit scoring methods, including FICO, while acknowledging concerns about disparities in scoring among racial groups. The guide concluded that the scoring methods themselves were not discriminatory, but that the disparities rather reflected other underlying societal inequalities. Still, the guide advocated some changes to credit scoring methods. One recommendation was to include "mainstream alternative data" such as utility bill payments in order to allow more people to build their credit files. The guide expressed reservations about "nontraditional" data sources, such as social network data and the rate at which users scroll through terms of service agreements. Robinson + Yu also called for more collaboration among financial advocates and the credit industry, since much of the data on credit scoring is proprietary. Finally, Robinson + Yu advocated that government regulators more actively investigate "marketing scores," which are used by businesses to target services to particular customers based on their financial health. The guide suggested that marketing scores appeared to be "just outside the scope" of the Fair Credit Reporting Act, which requires agencies to notify consumers when their credit files have been used against them.

Voting

Robinson + Yu partnered with Rock the Vote in 2013 in an effort to simplify online voter registration processes. The firm wrote a report, "Connected OVR: a Simple, Durable Approach to Online Voter Registration." At the time of the report, nearly 20 states had passed online voter registration laws. Robinson + Yu recommended that all states allow voters to check their registration statuses in real time. It also recommended that online registration systems offer alternatives to users who lack state identification, and that the systems be responsive to devices of various sizes and operating systems. Robinson + Yu also suggested that states streamline and better coordinate their online registration efforts. Robinson + Yu recommended that states develop a simple, standardized platform for accepting voter data and allow third-party vendors (such as Rock the Vote) to design interfaces that would accept voter registrations. Outside vendors, the report suggested, could use experimental approaches to reach new groups of voters while still adhering to government registration requirements.

Big Data and Civil Rights

In 2014, Robinson + Yu advised The Leadership Conference on "Civil Rights Principles for the Era of Big Data." Signatories of the document included the American Civil Liberties Union, Free Press, and NAACP. The document offered guidelines for developing technologies with social justice in mind. The principles included an end to "high-tech profiling" of people through the use of surveillance and sophisticated data-gathering techniques, which the signatories argued could lead to discrimination. Other principles included fairness in algorithmic decision-making; the preservation of core legal principles such as the right to privacy and freedom of association; individual control of personal data; and protections from data inaccuracies.

The "Civil Rights Principles" were cited by the White House in its report, "Big Data: Seizing Opportunities, Preserving Values." John Podesta,
Counselor to President Barack Obama, cautioned in his introduction to the report that big data had the potential "to eclipse longstanding civil rights protections in how personal information is used." Following the White House report, Robinson + Yu elaborated upon four areas of concern in the white paper "Civil Rights, Big Data, and Our Algorithmic Future." The paper included four chapters: Financial Inclusion, Jobs, Criminal Justice, and Government Data Collection and Use.

The Financial Inclusion chapter argued the era of big data could result in new barriers for low-income people. The automobile insurance company Progressive, for example, installed devices in customers' vehicles that allowed for the tracking of high-risk behaviors. Such behaviors included nighttime driving. Robinson + Yu argued that many lower-income workers commuted during nighttime hours and thus might have to pay higher rates, even if they had clean driving records. The report also argued that marketers used big data to develop extensive profiles of consumers based on their incomes, buying habits, and English-language proficiency, and such profiling could lead to predatory marketing and lending practices. Consumers often are not aware of what data has been collected about them and how that data is being used, since such information is considered to be proprietary. Robinson + Yu also suggested that credit scoring methods can disadvantage low-income people who lack extensive credit histories.

The report found that big data could impair job prospects in several ways. Employers used the federal government's E-Verify database, for example, to determine whether job applicants were eligible to work in the United States. The system could return errors if names had been entered into the database in different ways. Foreign-born workers and women have been disproportionately affected by such errors. Resolving errors can take weeks, and employers often lack the patience to wait. Other barriers to employment arise from the use of automated questionnaires some applicants must answer. Some employers use the questionnaires to assess which potential employees will likely stay in their jobs the longest. Some studies have suggested that longer commute times correlate to shorter-tenured workers. Robinson + Yu questioned whether asking the commuting question was fair, particularly since it could lead to discrimination against applicants who lived in lower-income areas. Finally, Robinson + Yu raised concerns about "subliminal" effects on employers who conducted web searches for job applicants. A Harvard researcher, they noted, found that Google algorithms were more likely to show advertisements for arrest records in response to web searches of "black-identifying names" rather than "white-identifying names."

Robinson + Yu found that big data had changed approaches to criminal justice. Municipalities used big data in "predictive policing," or anti-crime efforts that targeted ex-convicts and victims of crimes as well as their personal networks. Robinson + Yu warned that these systems could lead to police making "guilt by association" mistakes, punishing people who had done nothing wrong. The report also called for greater transparency in law enforcement tactics that involved surveillance, such as the use of high-speed cameras that can capture images of vehicle license plates, and so-called stingray devices, which intercept phone calls by mimicking cell phone towers. Because of the secretive nature with which police departments procure and use these devices, the report contended that it was difficult to know whether they were being used appropriately. Robinson + Yu also noted that police departments were increasingly using body cameras and that early studies suggested the presence of the cameras could de-escalate tension during police interactions.

The Government Data Collection and Use chapter suggested that big data tools developed in the interest of national security were also being used domestically. The DEA, for example, worked closely with AT&T to develop a secret database of phone records for domestic criminal investigations. To shield the database's existence, agents avoided mentioning it by name in official documents. Robinson + Yu warned that an abundance of data and a lack of oversight could result in abuse, citing cases in which law enforcement workers used government data to stalk people they knew socially or romantically. The report also raised concerns about data collection by the US Census Bureau, which sought to lower the
cost of its decennial count by collecting data from government records. Robinson + Yu cautioned that the cost-cutting measure could result in undercounting some populations.

Newsletter

Equal Future, Upturn's online newsletter, began in 2013 with support from the Ford Foundation. The newsletter has highlighted news stories related to social justice and technology. For instance, Equal Future has covered privacy issues related to the FBI's Next Generation Identification system, a massive database of biometric and other personal data. Other stories have included a legal dispute in which a district attorney forced Facebook to grant access to the contents of nearly 400 user accounts. Equal Future also wrote about an "unusually comprehensive and well-considered" California law that limited how technology vendors could use educational data. The law was passed in response to parental concerns about sensitive data that could compromise their children's privacy or limit their future educational and professional prospects.

Cross-References

▶ American Civil Liberties Union
▶ Biometrics
▶ e-commerce
▶ Financial Services
▶ Google
▶ Governance
▶ National Association for the Advancement of Colored People
▶ Online Advertising

Further Reading

Civil Rights Principles for the Era of Big Data. (2014, February). http://www.civilrights.org/press/2014/civil-rights-principles-big-data.html.
Robinson, D., & Yu, H. (2014, October). Knowing the score: New data, underwriting, and marketing in the consumer credit marketplace. https://www.teamupturn.com/static/files/Knowing_the_Score_Oct_2014_v1_1.pdf.
Robinson + Yu. (2013). Connected OVR: A simple, durable approach to online voter registration. Rock the Vote. http://www.issuelab.org/resource/connected_ovr_a_simple_durable_approach_to_online_voter_registration.
Robinson, D., Yu, H., Zeller, W. P., & Felten, E. W. (2008). Government data and the invisible hand. Yale JL & Tech., 11, 159.
The Leadership Conference on Civil and Human Rights & Upturn. (2015, November). Police body worn cameras: A policy scorecard. https://www.bwcscorecard.org/static/pdfs/LCCHR_Upturn-BWC_Scorecard-v1.04.pdf.
Upturn. (2014, September). Civil rights, big data, and our algorithmic future. https://bigdata.fairness.io/.
Upturn. (2015, October). Led Astray: Online lead generation and payday loans. https://www.teamupturn.com/reports/2015/led-astray.
Yu, H., & Robinson, D. G. (2012). The new ambiguity of 'open government'. UCLA L. Rev. Disc. 59, 178.
Verderer

▶ Forestry

Verification

▶ Anomaly Detection

Visible Web

▶ Surface Web vs Deep Web vs Dark Web

Visual Representation

▶ Visualization

Visualization

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Data visualization; Information visualization; Visual representation

Introduction

People use visualization for information communication. Data visualization is the study of creating visual representations of data, which bears two levels of meaning: the first is to make information visible, and the second is to make it obvious and easy to understand. Visualization is pervasive throughout the data life cycle, and a recent trend is to promote the use of visualization in data analysis rather than use it only as a way to present results. Community standards and open source libraries set the foundation for visualization of Big Data, and domain expertise and creative ideas are needed to put standards into innovative applications.

Visualization and Data Visualization

Visualization, in its literal meaning, is the procedure of forming a mental picture of something that is not present to the sight (Cohen et al. 2002). People can also illustrate such mental pictures using various visible media such as paper and computer screens. Seen as a way to facilitate information communication, the meaning of visualization can be understood at two levels. The first level is to make something visible, and the second level is to make it obvious so it is easy to understand (Tufte 1983). People's daily experience shows that graphics are easier to read and understand than words and numbers, such as the
use of maps in automotive navigation systems to show the location of an automobile and the road to the destination. This daily experience is supported by scientific findings. Studies on visual object perception explain the difference between reading graphics and reading text or numbers: the human brain deciphers image elements simultaneously but decodes language in a linear and sequential manner, and the linear process takes more time than the simultaneous process.

Data are representations of facts, and information is the meaning worked out from data. In the context of Big Data, visualization is a crucial method to tackle the considerable needs of extracting information from data and presenting it. Data visualization is the study of creating visual representations of data. In practice, data visualization means to visually display one or more objects by the combined use of words, numbers, symbols, points, lines, color, shading, coordinate systems, and more. While there are various choices of visual representations for the same piece of data, there are a few general guidelines that can be applied to establish effective and efficient data visualization. The first is to avoid distorting what the data have to say. That is, the visualization should not give a false or misleading account of the data. The second is to know the audience and serve a clear purpose. For instance, the visualization can be a description of the data, a tabulation of the records, or an exploration of the information that is of interest to the audience. The third is to make large datasets coherent. A few artistic designs will be required to present the data and information in an orderly and consistent way. The presidential, Senate, and House elections of the United States have been reported with well-presented data visualization, such as those on the website of The New York Times. The visualization on that website is underpinned by dynamic datasets and can show the latest records simultaneously.

Visualization in the Data Life Cycle

Visualization is crucial in the process from data to information. However, information retrieval is just one of the many steps in the data life cycle, and visualization is useful throughout the whole data life cycle. In the conventional understanding, a data life cycle begins with data collection and continues with cleansing, processing, archiving, and distribution. Those steps are from the perspective of data providers. Then, from the perspective of data users, the data life cycle continues with data discovery, access, analysis, and then repurposing. From repurposing, the life cycle may go back to the collection or processing step, restarting the cycle. Recent studies show that there is another step, called concept, before the step of data collection. The concept step covers work such as conceptual models, logical models, and physical models for relational databases, and ontologies and vocabularies for Linked Data in the Semantic Web.

Visualization, or more specifically data visualization, provides support to different steps in the data life cycle. For example, the Unified Modeling Language (UML) provides a standard way to visualize the design of information systems, including the conceptual and logical models of databases. Typical relationships in UML include association, aggregation, and composition at the instance level, generalization and realization at the class level, and general relationships such as dependency and multiplicity. For ontologies and vocabularies in the Semantic Web, concept maps are widely used for organizing concepts in a subject domain and the interrelationships among those concepts. In this way a concept map is the visual representation of a knowledge base. Concept maps are more flexible than UML because they cover all the relationships defined in UML and allow people to create new relationships that apply to the domain being modeled (Ma et al. 2014). For example, there are concept maps for the ontology of the Global Change Information System led by the US Global Change Research Program. The concept maps are able to show that a report is a subclass of publication, and that there are several components in a report, such as chapter, table, figure, array, and image. Recent work in information technologies also enables online visualized tools to capture and explore concepts underlying collaborative science
activities, which greatly facilitate the collaboration between domain experts and computer scientists.
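The statements drawn in such a concept map can be written down as plain subject–predicate–object triples. The Python fragment below is a hand-made sketch of the report example mentioned above; it is not the actual Global Change Information System ontology, only an illustration of the idea.

# Each statement is a (subject, predicate, object) triple,
# mirroring one labeled link in a concept map.
concept_map = [
    ("report", "is_subclass_of", "publication"),
    ("report", "has_component", "chapter"),
    ("report", "has_component", "table"),
    ("report", "has_component", "figure"),
    ("report", "has_component", "image"),
]

def components_of(concept, triples):
    # Return everything linked to a concept by "has_component".
    return [o for s, p, o in triples if s == concept and p == "has_component"]

print(components_of("report", concept_map))  # ['chapter', 'table', 'figure', 'image']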
Visualization is also used to facilitate data archiving, distribution, and discovery. For instance, the Tetherless World Constellation at Rensselaer Polytechnic Institute recently developed the International Open Government Dataset Catalog, which is a Web-based faceted browsing and search interface to help users find datasets of interest. A facet represents a part of the properties of a dataset, so faceted classification allows the assignment of the dataset to multiple taxonomies, and then datasets can be classified and ordered in different ways. On the user interface of a data center, the faceted classification can be visualized as a number of small windows and options, which allows the data center to hide the complexity of data classification, archiving, and search on the server side.

Visual Analytics

The pervasive existence of visualization in the data life cycle shows that visualization can be applied broadly in data analytics. Yet, in actual practice visualization is often treated as a method to show the result of data analysis rather than as a way to enable interactions between users and complex datasets. That is, the visualization as a result is separated from the datasets upon which the result is generated. Many of the data analysis and visualization tools scientists use nowadays do not allow dynamic and live linking between visual representations and datasets, and when a dataset changes, the visualization is no longer updated to reflect the changes. In the context of Big Data, many socioeconomic challenges and scientific problems facing the world are increasingly linked to interdependent datasets from multiple fields of research, organizations, instruments, dimensions, and formats. Interactions are becoming an inherent characteristic of data analytics with Big Data, which requires new methodologies and technologies of data visualization to be developed and deployed.

Visual analytics is a field of research that addresses the demand for interactive data analysis. It combines many existing techniques from data visualization with those from computational data analysis, such as those from statistics and data mining. Visual analytics is especially focused on the integration of interactive visual representations with the underlying computational process. For example, the IPython Notebook provides an online collaborative environment for interactive and visual data analysis and report drafting. IPython Notebook uses JavaScript Object Notation (JSON) as the scripting language, and each notebook is a JSON document that contains a sequential list of input/output cells. There are several types of cells to contain different contents, such as text, mathematics, plots, code, and even rich media such as video and audio. Users can design a workflow of data analysis through the arrangement and update of cells in a notebook. A notebook can be shared with others as a normal file, or it can also be shared with the public using online services such as the IPython Notebook Viewer. A completed notebook can be converted into a number of standard output formats, such as HyperText Markup Language (HTML), HTML presentation slides, LaTeX, Portable Document Format (PDF), and more. The conversion is done through a few simple operations, which means that once a notebook is complete, a user only needs to press a few buttons to generate a scientific report. The notebook can be reused to analyze other datasets, and the cells inside it can also be reused in other notebooks.
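At bottom, a notebook of this kind is simply a JSON document holding an ordered list of cells. The fragment below is a simplified, hand-written sketch of such a structure in Python; real notebook files carry additional metadata and version fields beyond what is shown here.

import json

# A minimal notebook-like document: an ordered list of cells, each
# holding either prose or executable code.
notebook = {
    "cells": [
        {"cell_type": "markdown",
         "source": ["## Taxi speed analysis"]},
        {"cell_type": "code",
         "source": ["mean_speed = sum(speeds) / len(speeds)"],
         "outputs": []},
    ]
}

with open("analysis_sketch.json", "w") as f:
    json.dump(notebook, f, indent=2)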
Standards and Best Practices

Any application of Big Data will face the challenges caused by the four dimensions of Big Data: volume, variety, velocity, and veracity. Commonly accepted standards or community consensus are a proven way to reduce the heterogeneities between the datasets at hand. Various standards have already been used in applications tackling scientific, social, and business issues, such as the aforementioned JSON for transmitting data with human-readable text, the
Scalable Vector Graphics (SVG) for two-dimensional vector graphics, and the GeoJSON for representing collections of georeferenced features.
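A single georeferenced feature of the kind GeoJSON describes can be assembled directly as nested dictionaries and serialized to JSON. The coordinates and properties below are made up for illustration.

import json

# One georeferenced feature: a geometry plus free-form properties.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-116.999, 46.732],  # longitude, latitude
    },
    "properties": {"name": "sample station", "speed_kmh": 42.5},
}
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))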
There are also organizations coordinating the work on community standards. The World Wide Web Consortium (W3C) coordinates the development of standards for the Web. For example, the SVG is an output of the W3C. Other W3C standards include the Resource Description Framework (RDF), the Web Ontology Language (OWL), and the Simple Knowledge Organization System (SKOS). Many of them are used for data in the Semantic Web. The Open Geospatial Consortium (OGC) coordinates the development of standards relevant to geospatial data. For example, the Keyhole Markup Language (KML) was developed for presenting geospatial features in Web-based maps and virtual globes such as Google Earth. The Network Common Data Form (netCDF) was developed for encoding array-oriented data. Most recently, the GeoSPARQL was developed for encoding and querying geospatial data in the Semantic Web.

Standards just enable the initial elements for data visualization, and domain expertise and novel ideas are needed to put standards into practice (Fox and Hendler 2011). For example, Google Motion Chart adapts the fresh idea of motion charts to extend traditional static charts, and the aforementioned IPython Notebook allows the use of several programming languages and data formats through the use of cells. There are various programming libraries developed for data visualization, and many of them are made available on the Web. The D3.js is a typical example of such open source libraries (Murray 2013). The D3 here represents Data-Driven Documents. It is a JavaScript library using digital data to drive the creation and running of interactive graphics in Web browsers. D3.js-based visualization uses JSON as the format of input data and SVG as the format for the output graphics. The OneGeology data portal provides a platform to browse geological map services across the world, using standards developed by both OGC and W3C, such as SKOS and Web Map Service (WMS). GeoSPARQL is a relatively newer standard for geospatial data, but there are already feature applications. The demo system of the Dutch Heritage and Location shows the linked open dataset of the National Cultural Heritage with more than 13 thousand archaeological monuments in the Netherlands. Besides GeoSPARQL, GeoJSON and a few other standards and libraries are also used in that demo system.

Cross-References

▶ Data Visualization
▶ Data-Information-Knowledge-Action Model
▶ Interactive Data Visualization
▶ Pattern Recognition

References

Cohen, L., Lehericy, S., Chochon, F., Lemer, C., Rivaud, S., & Dehaene, S. (2002). Language-specific tuning of visual cortex? Functional properties of the visual word form area. Brain, 125(5), 1054–1069.
Fox, P., & Hendler, J. (2011). Changing the equation on scientific data visualization. Science, 331(6018), 705–708.
Ma, X., Fox, P., Rozell, E., West, P., & Zednik, S. (2014). Ontology dynamics in a data life cycle: Challenges and recommendations from a geoscience perspective. Journal of Earth Science, 25(2), 407–412.
Murray, S. (2013). Interactive data visualization for the web. Sebastopol: O'Reilly.
Tufte, E. (1983). The visual display of quantitative information. Cheshire: Graphics Press.

Vocabulary

▶ Ontologies

Voice Assistants

▶ Voice User Interaction

Voice Data

▶ Voice User Interaction
Voice User Interaction

Steven J. Gray
The Bartlett Centre for Advanced Spatial Analysis, University College London, London, UK

Synonyms

Speech processing; Speech recognition; Voice assistants; Voice data; Voice user interfaces

Introduction to Voice Interaction

Voice User Interaction is the study and practice of designing systems and workflows that process natural speech into commands and actions that can be carried out automatically on behalf of the user. The convergence of natural language processing research, machine learning, and the availability of vast amounts of data, both written and spoken, has allowed new ways of interacting with data discovery and of traversing knowledge graphs. These interfaces allow users to control systems in a conversational way, realizing the science fiction vision of conversing with computers to discover insights and control computing systems.

History of Conversational Interfaces

Interactive Voice Response (IVR) systems were developed, and first introduced commercially in 1973, to allow users to interact with automated phone systems using Dual-tone multi-frequency signaling (DTMF) tones (Corkrey and Parkinson 2002), more commonly known as "touch tone dialling," to select options, allowing callers to navigate single actions through a tree on a phone call. As technology advanced and Computer Telephony Integration (CTI) was introduced into call centers, IVR systems became pseudo-interactive, allowing the recognition of simple words to enable routing of calls to specific agents to handle the calls. Text-to-speech systems allow primitive feedback to callers but are hardly intelligent, as responses would be crafted based on the options selected on the call. Aural response statements would be pre-recorded or created from streams of words spoken by voice actors, which often sound robotic or unnatural.

IVR systems are regarded as the first wave of VUI, but now, with the advancement of natural language processing systems, machine learning models for detection, Named Entity Recognition, AI computational voice creation, and advanced speech recognition, it is possible to create conversational interfaces that allow users to interact with larger Information Retrieval systems and Knowledge Graphs to answer requests. Vast amounts of voice data were collected through automated IVR systems such as Google Voice Local Search ("GOOG-411"), a service which allowed users to call Google to get search results over the phone (Bacchiani et al. 2008), allowing the training of AI systems that recognize speech patterns and accents to provide the various models needed to detect speech on device (Heerden et al. 2009).

The Introduction of Voice Assistants

Voice assistants are available on multiple surfaces (mobile devices, home hubs, watches, car entertainment systems, and televisions, for example), allowing users to ask questions of these systems directly (Hoy 2018). Many devices can process the speech recording on the device itself, for example, for automated subtitles during video playback or transcription of messages instead of using touch interfaces. Voice detection is processed locally, on the device, to activate the assistant, and cloud-based systems are then used to send the user's recording to cloud workflows that fulfill the request for data based on the query. Responses are created within seconds to create the illusion of conversation between the user and the device. Linking these voice assistants to knowledge bases allows answers to be digested by real-time workflows and surfaced in conversational ways (Shalaby et al. 2020). The state of the conversation is saved in the cloud, allowing users to ask follow-up questions, with the context being preserved to provide relevant
responses, as would happen in natural conversation (Eriksson 2018).
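A toy sketch of this request handling in Python is given below: a transcribed utterance is matched to an intent, and a little per-user conversational state is carried between turns so that a follow-up question can reuse the previous context. The intents, keywords, and phrasing are invented for illustration and do not correspond to any particular assistant platform.

# Minimal rule-based intent matcher with per-user conversation state.
INTENTS = {
    "weather": ["weather", "rain", "temperature"],
    "booking": ["book", "table", "reservation"],
}

sessions = {}  # user_id -> context carried over from the previous turn

def handle_utterance(user_id, text):
    words = text.lower().split()
    intent = next((name for name, keywords in INTENTS.items()
                   if any(k in words for k in keywords)), None)
    context = sessions.get(user_id, {})
    if intent is None and context.get("last_intent"):
        intent = context["last_intent"]   # follow-up turn: reuse prior context
    sessions[user_id] = {"last_intent": intent}
    return intent

print(handle_utterance("u1", "what is the weather in london"))  # weather
print(handle_utterance("u1", "and what about tomorrow"))        # weather (follow-up)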
Voice Assistants bring an additional form of modality into multimodal interfaces (Bourguet 2003). Voice will not necessarily replace other forms of interfaces but will augment them and work in conjunction with them to provide the best interface for the user, whatever the user's current circumstances are.

VUI design is diametrically opposed to Graphical User Interface design and standard User Experience paradigms. There are no graphical elements to select or to present error states to the user; the system has to gracefully recover from error conditions by using prompts and responses. Simply responding in the fashion of "I'm sorry, I don't understand that input" will confuse or frustrate a user, leading users to interact less with such systems (Suhm et al. 2001). New design patterns for voice data systems have to adapt to the user's inputs and lead them through the workflows to successful outcomes (Pearl 2016).
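One common way to realize this guidance is an escalating re-prompt: rather than repeating a flat error message, each failed attempt gives the user more help and finally offers a different path to the goal. The sketch below is illustrative only; the prompt wording, the booking slot, and the recognize callback are hypothetical, not taken from any specific VUI toolkit.

# Illustrative "graceful recovery" pattern: progressively more helpful prompts,
# then a fallback instead of a dead end.
REPROMPTS = [
    "When would you like the booking?",
    "Sorry, I didn't catch that. You can say a day and a time, like 'Friday at 7 pm'.",
    "I'm still having trouble. Would you like me to text you a booking link instead?",
]

def ask_for_slot(recognize, max_attempts=3):
    """recognize() returns a parsed value, or None when recognition fails."""
    for attempt in range(max_attempts):
        print("Assistant:", REPROMPTS[min(attempt, len(REPROMPTS) - 1)])
        value = recognize()
        if value is not None:
            return value      # success: the workflow continues with this slot filled
    return None               # hand the user off to a fallback channel

# Example: simulate two failed recognitions followed by a success.
attempts = iter([None, None, "Friday 19:00"])
print("Result:", ask_for_slot(lambda: next(attempts)))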
Expanding Voice Interfaces

Allowing developers to create customized third-party applications to surface on these virtual assistants unlocks the potential to expand these voice interfaces to systems that allow data entry and surface information to the user. In recent years, virtual assistants have also been expanded to screen calls on personal mobile phones on behalf of users, as well as to make bookings and organize events with local businesses. These businesses will soon replace their IVR systems with more conversational systems that interact with customers, freeing up employee time spent dealing with simple requests, bookings, and customer queries. In the near future, virtual assistants will be able to interface automatically with virtual business agents, allowing machines to communicate directly with each other in natural language and becoming a digital personal assistant for the masses.

The combination of virtual assistants, knowledge graphs, and Voice User Interfaces has brought the science fiction dream of conversational computing to reality within the home. Being able to converse with computers and information systems in a natural way will not only prove useful for many people in their daily lives but will also allow new surfaces for multimodal interaction for users who find graphical user interfaces prohibitive or who have various accessibility issues (Corbet and Weber 2016).

Further Reading

Bacchiani, M., Beaufays, F., Schalkwyk, J., Schuster, M., & Strope, B. (2008, March). Deploying GOOG-411: Early lessons in data, measurement, and testing. In 2008 IEEE international conference on acoustics, speech and signal processing (pp. 5260–5263). IEEE.
Bourguet, M. L. (2003). Designing and prototyping multimodal commands. In Proceedings of human-computer interaction (INTERACT'03) (pp. 717–720).
Corbet, E., & Weber, A. (2016). What can I say? Addressing user experience challenges of a mobile voice user interface for accessibility. In Proceedings of the 18th international conference on human-computer interaction with mobile devices and services (MobileHCI'16) (pp. 72–82). Association for Computing Machinery, New York. https://doi.org/10.1145/2935334.2935386.
Corkrey, R., & Parkinson, L. (2002). Interactive voice response: Review of studies 1989–2000. Behavior Research Methods, Instruments, & Computers, 34, 342–353. https://doi.org/10.3758/BF03195462.
Eriksson, F. (2018). Onboarding users to a voice user interface: Comparing different teaching methods for onboarding new users to intelligent personal assistants (Dissertation). Retrieved from http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-149580.
Heerden, C. V., Schalkwyk, J., & Strope, B. (2009). Language modeling for what-with-where on GOOG-411. In Tenth annual conference of the international speech communication association.
Hoy, M. B. (2018). Alexa, Siri, Cortana, and more: An introduction to voice assistants. Medical Reference Services Quarterly, 37(1), 81–88.
Pearl, C. (2016). Designing voice user interfaces: Principles of conversational experiences. Newton: O'Reilly.
Shalaby, W., Arantes, A., GonzalezDiaz, T., & Gupta, C. (2020, June). Building chatbots from large scale domain-specific knowledge bases: Challenges and opportunities. In 2020 IEEE international conference on prognostics and health management (ICPHM) (pp. 1–8). IEEE.
Suhm, B., Myers, B., & Waibel, A. (2001). Multimodal error correction for speech user interfaces. ACM Transactions on Computer-Human Interaction, 8(1), 60–98.

Voice User Interfaces

▶ Voice User Interaction
Vulnerability

Laurie A. Schintler and Connie L. McNeely
George Mason University, Fairfax, VA, USA

Vulnerability is an essential and defining aspect of big data in today's increasingly digitalized society. Along with volume, variety, velocity, variability, veracity, and value, it is one of the "7 Vs" identified as principal determinant features in the conceptualizations of big data. Vulnerability is an integrated notion that concerns security and privacy challenges posed by the vast amounts, range of sources and formats, and the transfer and distribution of big data (Smartym Pro 2020). Data – whether a data feed, a trade secret, internet protocol, credit card numbers, flight information, email addresses, passwords, personal identities, transportation usage, employment, purchasing patterns, etc. – are accessible (Morgan 2015), and the nature of that accessibility can vary by purpose and outcome. Additionally, vulnerability encompasses the susceptibility of selected individuals, groups, and communities who are particularly vulnerable to data manipulation and inequitable application and use, and broader societal implications. To that end, vulnerability is a vital consideration regarding every piece of data collected (Marr 2016).

Vulnerability is a highly complex issue, with information theft and data breaches occurring regularly (DeAngelis 2018) and, more to the point, "a data breach with big data is a big breach" (Firican 2017). As an increasingly typical example, in the United States in September 2015, addresses and social security numbers of over 21 million current and former federal government employees were stolen, along with the fingerprints of 5.6 million (Morgan 2015). Data security and privacy challenges may occur through data leaks and cyber-attacks and the blatant hijacking and sale of data collected for legitimate purposes, for example, financial and medical data. Frankly, an unbreachable data repository simply does not exist (Morgan 2015). Organizational data breaches can mean the theft of proprietary information, client data, and employee work and personal data. Indeed, banking and financial organizations, government agencies, and healthcare providers all face such big data security issues as a matter of course (Smartym Pro 2020).

The volume of data being collected about people, organizations, and places is exceptional, and what can be done with that data is growing in ways that would have been beyond imagination in previous years. "From bedrooms to boardrooms, from Wall Street to Main Street, the ground is shifting in ways that only the most cyber-savvy can anticipate" (Morgan 2015) and, in a world where sensitive data may be sold, traded, leaked, or stolen, questions of confidentiality and access, in addition to purpose and consequences, all underscore and point to problems of vulnerability.

Privacy and safety violations of personal data have been of particular concern. Data breaches and misuse have been broadly experienced by individuals and targeted groups in general, with broad social implications. Vulnerability addresses the fact that personal data – "the lifeblood of many commercial big data initiatives" – is being used to pry into individual behaviors and encourage purchasing behavior (Marr 2016). Personal data, including medical and financial data, are increasingly extracted and distributed via a range of connected devices in the internet-of-things (IoT). "As more sensors find their way into everything from smartphones to household appliances, cars, and entire cities, it is possible to gain unprecedented insight into the behaviors, motivations, actions, and plans of individuals and organizations" (Morgan 2015), such that privacy itself has less and less meaning over time. In general, big data brings with it a range of security and privacy challenges – including the rampant sale of personal information on the dark web – and the proliferation of data has left many people feeling exposed and vulnerable to the way their data is being violated and used (Experian 2017).

Traditionally disadvantaged and disenfranchised populations (e.g., the poor, migrants, minorities) are often the ones who are most vulnerable as a "result of the collection and aggregation of big data and the application of predictive analytics" (Madden et al. 2017). Indeed, there is an "asymmetric relationship between those who collect, store, and mine large quantities of data, and those whom data collection targets" (Andrejevic 2014). Moreover, algorithms and machine learning models have the propensity to produce unfair outcomes in situations where the underlying data used to develop and train them reflect societal gaps and disparities in the first place.

Big data and related analytics often are discussed as means for improving the world. However, the other side of the story is one of laying bare and generating vulnerabilities through information that can be used for nefarious purposes and to take advantage of various populations through, for example, identity theft and blanket marketing and purchasing manipulation. It is in this sense that an ongoing question is how to minimize big data vulnerability. That is, the challenge is to manage the vulnerabilities presented by big data now and in the future.

Cross-References

▶ Big Data Concept
▶ Cybersecurity
▶ Ethics
▶ Privacy

Further Reading

Andrejevic, M. (2014). Big data, big questions| the big data divide. International Journal of Communication, 8, 17.
DeAngelis, S. (2018). The seven 'Vs' of big data. https://www.enterrasolutions.com/blog/the-seven-vs-of-big-data.
Experian. (2017). A data powered future. https://www.experian.co.uk/blogs/latest-thinking/small-business/a-data-powered-future.
Firican, G. (2017). The 10 Vs of big data. Upside. https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx.
Madden, M., Gilman, M., Levy, K., & Marwick, A. (2017). Privacy, poverty, and big data: A matrix of vulnerabilities for poor Americans. Washington University Law Review, 95, 53.
Marr, B. (2016). Big data: The 6th 'V' everyone should know about. Forbes. https://www.forbes.com/sites/bernardmarr/2016/12/20/big-data-the-6th-v-everyone-should-know-about/?sh=4182896e2170.
Morgan, L. (2015). 14 creepy ways to use big data. InformationWeek. https://www.informationweek.com/big-data/big-data-analytics/14-creepy-ways-to-use-big-data/d/d-id/1322906.
Smartym Pro. (2020). How to protect big data? The key big data security challenges. https://smartym.pro/blog/how-to-protect-big-data-the-main-big-data-security-challenges.
Web Scraping

Bo Zhao
College of Earth, Ocean, and Atmospheric Sciences, Oregon State University, Corvallis, OR, USA

Web scraping, also known as web extraction or harvesting, is a technique to extract data from the World Wide Web (WWW) and save it to a file system or database for later retrieval or analysis. Commonly, web data is scraped utilizing the Hypertext Transfer Protocol (HTTP) or through a web browser. This is accomplished either manually by a user or automatically by a bot or web crawler. Because an enormous amount of heterogeneous data is constantly generated on the WWW, web scraping is widely acknowledged as an efficient and powerful technique for collecting big data (Mooney et al. 2015; Bar-Ilan 2001). To adapt to a variety of scenarios, current web scraping techniques have grown from smaller ad hoc, human-aided procedures into fully automated systems that are able to convert entire websites into well-organized data sets. State-of-the-art web scraping tools are not only capable of parsing markup languages or JSON files but also of integrating with computer visual analytics (Butler 2007) and natural language processing to simulate how human users browse web content (Yi et al. 2003).

The process of scraping data from the Internet can be divided into two sequential steps: acquiring web resources and then extracting desired information from the acquired data. Specifically, a web scraping program starts by composing an HTTP request to acquire resources from a targeted website. This request can be formatted either as a URL containing a GET query or as a piece of HTTP message containing a POST query. Once the request is successfully received and processed by the targeted website, the requested resource will be retrieved from the website and then sent back to the web scraping program. The resource can be in multiple formats, such as web pages that are built from HTML, data feeds in XML or JSON format, or multimedia data such as images, audio, or video files. After the web data is downloaded, the extraction process continues to parse, reformat, and organize the data in a structured way. There are two essential modules of a web scraping program – a module for composing an HTTP request, such as Urllib2 or Selenium, and another one for parsing and extracting information from raw HTML code, such as Beautiful Soup or Pyquery. Here, the Urllib2 module defines a set of functions for dealing with HTTP requests, such as authentication, redirections, cookies, and so on, while Selenium is a web browser wrapper that builds up a web browser, such as Google Chrome or Internet Explorer, and enables users to automate the process of browsing a website by programming. Regarding data extraction, Beautiful Soup is designed for scraping HTML and other XML documents. It provides convenient Pythonic functions for navigating, searching, and modifying a parse tree, and a toolkit for decomposing an HTML file and extracting desired information via lxml or html5lib. Beautiful Soup can automatically detect the encoding of the document under processing and convert it to a client-readable encoding. Similarly, Pyquery provides a set of Jquery-like functions to parse XML documents. But unlike Beautiful Soup, Pyquery only supports lxml for fast XML processing.
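The two steps described above (acquiring a resource over HTTP and then parsing the returned HTML) can be combined in a few lines of Python. The snippet below is a minimal sketch using the standard-library urllib together with Beautiful Soup; the target URL and the choice of elements to extract are placeholders rather than a recommendation, and a real scraper would add error handling and respect the site's terms.

# Minimal two-step scrape: (1) acquire the page over HTTP, (2) parse the HTML.
import urllib.request
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

url = "https://example.com/"          # placeholder target
request = urllib.request.Request(url, headers={"User-Agent": "research-bot/0.1"})
with urllib.request.urlopen(request) as response:      # step 1: acquisition
    html = response.read()

soup = BeautifulSoup(html, "html.parser")               # step 2: extraction
title = soup.title.string if soup.title else None
links = [a.get("href") for a in soup.find_all("a")]     # collect hyperlinks
print(title, len(links))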
Of the various types of web scraping programs, some are created to automatically recognize the data structure of a page, such as Nutch or Scrapy, or to provide a web-based graphic interface that eliminates the need for manually written web scraping code, such as Import.io. Nutch is a robust and scalable web crawler, written in Java. It enables fine-grained configuration, parallel harvesting, robots.txt rule support, and machine learning. Scrapy, written in Python, is a reusable web crawling framework. It speeds up the process of building and scaling large crawling projects. In addition, it also provides a web-based shell to simulate the website browsing behaviors of a human user. To enable nonprogrammers to harvest web contents, the web-based crawler with a graphic interface is purposely designed to mitigate the complexity of using a web scraping program. Among them, Import.io is a typical crawler for extracting data from websites without writing any code. It allows users to identify and convert unstructured web pages into a structured format. Import.io's graphic interface for data identification allows users to train it and learn what to extract. The extracted data is then stored in a dedicated cloud server and can be exported in CSV, JSON, and XML format. A web-based crawler with a graphic interface can easily harvest and visualize real-time data streams based on an SVG or WebGL engine but falls short in manipulating a large data set.

Web scraping can be used for a wide variety of scenarios, such as contact scraping, price change monitoring/comparison, product review collection, gathering of real estate listings, weather data monitoring, website change detection, and web data integration. For example, at a micro-scale, the price of a stock can be regularly scraped in order to visualize the price change over time (Case et al. 2005), and social media feeds can be collectively scraped to investigate public opinions and identify opinion leaders (Liu and Zhao 2016). At a macro-level, the metadata of nearly every website is constantly scraped to build up Internet search engines, such as Google Search or Bing Search (Snyder 2003).

Although web scraping is a powerful technique for collecting large data sets, it is controversial and may raise legal questions related to copyright (O'Reilly 2006), terms of service (ToS) (Fisher et al. 2010), and "trespass to chattels" (Hirschey 2014). A web scraper is free to copy a piece of data in figure or table form from a web page without any copyright infringement because it is difficult to prove a copyright over such data, since only a specific arrangement or a particular selection of the data is legally protected. Regarding the ToS, although most web applications include some form of ToS agreement, their enforceability usually lies within a gray area. For instance, the owner of a web scraper that violates the ToS may argue that he or she never saw or officially agreed to the ToS. Moreover, if a web scraper sends data acquisition requests too frequently, this is functionally equivalent to a denial-of-service attack, in which case the web scraper owner may be refused entry and may be liable for damages under the law of "trespass to chattels," because the owner of the web application has a property interest in the physical web server which hosts the application. An ethical web scraping tool will avoid this issue by maintaining a reasonable requesting frequency.
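A hedged sketch of what "a reasonable requesting frequency" can look like in practice is shown below: the scraper checks the site's robots.txt and pauses between requests. The one-second delay and the example URLs are arbitrary illustrations, not a legal or technical standard.

# Illustrative polite-scraping loop: honor robots.txt and pause between requests.
import time
import urllib.request
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

pages = ["https://example.com/", "https://example.com/about"]  # placeholder URLs
downloaded = {}
for url in pages:
    if not robots.can_fetch("research-bot", url):   # skip paths the site disallows
        continue
    with urllib.request.urlopen(url) as response:
        downloaded[url] = response.read()
    time.sleep(1.0)                                 # keep the request rate modest
print({url: len(body) for url, body in downloaded.items()})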
A web application may adopt one of the following measures to stop or interfere with a web scraping tool that collects data from the given website. Those measures may identify whether an operation was conducted by a human being or a bot. Some of the major measures include the following: HTML "fingerprinting," which investigates the HTML headers to identify whether a visitor is malicious or safe (Acar et al. 2013); IP reputation determination, where IP addresses with a recorded history of use in website assaults will be treated with suspicion and are more likely to be heavily scrutinized (Sadan and Schwartz 2012); behavior analysis for revealing abnormal behavioral patterns, such as placing a suspiciously high rate of requests and adhering to anomalous browsing patterns; and progressive challenges that filter out bots with a set of tasks, such as cookie support, JavaScript execution, and CAPTCHA (Doran and Gokhale 2011).

Further Reading

Acar, G., Juarez, M., Nikiforakis, N., Diaz, C., Gürses, S., Piessens, F., & Preneel, B. (2013). Fpdetective: Dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on computer & communications security. New York: ACM.
Bar-Ilan, J. (2001). Data collection methods on the web for infometric purposes – A review and analysis. Scientometrics, 50(1), 7–32.
Butler, J. (2007). Visual web page analytics. Google Patents.
Case, K. E., Quigley, J. M., & Shiller, R. J. (2005). Comparing wealth effects: The stock market versus the housing market. The BE Journal of Macroeconomics, 5(1), 1.
Doran, D., & Gokhale, S. S. (2011). Web robot detection techniques: Overview and limitations. Data Mining and Knowledge Discovery, 22(1), 183–210.
Fisher, D., Mcdonald, D. W., Brooks, A. L., & Churchill, E. F. (2010). Terms of service, ethics, and bias: Tapping the social web for CSCW research. Computer Supported Cooperative Work (CSCW), Panel discussion.
Hirschey, J. K. (2014). Symbiotic relationships: Pragmatic acceptance of data scraping. Berkeley Technology Law Journal, 29, 897.
Liu, J. C.-E., & Zhao, B. (2016). Who speaks for climate change in China? Evidence from Weibo. Climatic Change, 140(3), 413–422.
Mooney, S. J., Westreich, D. J., & El-Sayed, A. M. (2015). Epidemiology in the era of big data. Epidemiology, 26(3), 390.
O'Reilly, S. (2006). Nominative fair use and Internet aggregators: Copyright and trademark challenges posed by bots, web crawlers and screen-scraping technologies. Loyola Consumer Law Review, 19, 273.
Sadan, Z., & Schwartz, D. G. (2012). Social network analysis for cluster-based IP spam reputation. Information Management & Computer Security, 20(4), 281–295.
Snyder, R. (2003). Web search engine with graphic snapshots. Google Patents.
Yi, J., Nasukawa, T., Bunescu, R., & Niblack, W. (2003). Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, IEEE. Melbourne, Florida, USA.

White House Big Data Initiative

Gordon Alley-Young
Department of Communications and Performing Arts, Kingsborough Community College, City University of New York, New York, NY, USA

Synonyms

The Big Data Research and Development Initiative (TBDRDI)

Introduction

On March 29, 2012, the White House introduced The Big Data Research and Development Initiative (TBDRDI) at a cost of $200 million. Big data (BD) refers to the collection and interpretation of enormous datasets, using supercomputers running smart algorithms to rapidly uncover important features (e.g., interconnections, emerging trends, anomalies, etc.). The Obama Administration developed TBDRDI because having the large amounts of instantaneous data that are continually being produced by research and development (R&D) and emerging technology go unprocessed hurts the US economy and society. President Obama requested an all-hands-on-deck effort for TBDRDI including the public (i.e., government) and private (i.e., business) sectors to maximize economic growth, education, health, clean energy, and national security (Raul 2014; Savitz 2012).
The administration stated that the private sector would lead by developing BD while the government will promote R&D, facilitate private sector access to government data, and shape public policy. Several government agencies made the initial investment in this initiative to advance the tools/techniques required to analyze and capitalize on BD. TBDRDI has been compared by the Obama Administration to previous administrations' investments in science and technology that led to innovations such as the Internet. Critics of the initiative argue that administration BD efforts need to be directed elsewhere.

History of the White House Big Data Initiative

TBDRDI is the White House's $200 million federal agency funded initiative that seeks to secure the US's position as the world's most powerful and influential economy by channeling the information power of BD into social and economic development (Raul 2014). BD is an all-inclusive name for the nonstop supply of sophisticated electronic data that is being produced by a variety of technologies and by scientific inquiry. In short, BD includes any digital file, tag, or data that is created whenever we interact with technology, no matter how briefly (Carstensen 2012). The dilemma posed by BD to the White House, as well as to other countries, organizations, and businesses worldwide, is that so much of it goes unanalyzed due to its sheer volume and the limits of our current technological tools to effectively store, organize, and analyze it. Processing BD is not so simple because it requires supercomputing capabilities, some of which are still emerging. Experts argue that up until 2003, only 5 exabytes (EB) of data were produced; that number has since exploded to over five quintillion bytes of data (approximately 4.3 EB) every 2 days.

The White House Office of Science and Technology Policy (WHOSTP) announced TBDRDI in March 2012 in conjunction with the National Science Foundation (NSF), National Institutes of Health (NIH), US Geological Survey (USGS), and the Department of Defense (DoD) and Department of Energy (DoE). Key concerns to be addressed by TBDRDI are to manage BD by significantly increasing the speed of scientific inquiry and discovery, bolstering national security, and overhauling US education. TBDRDI is the result of recommendations in 2011 by the President's Council of Advisors on Science and Technology and represents the US government's wish to get ahead of the wave of BD and prevent a cultural lag by revamping its BD practices (Executive Office of the President 2014). John Holdren, Director of WHOSTP, compared the $200 million being invested in BD to prior federal investments in science and technology that are responsible for our current technological age (Scola 2013). The innovations of the technology age ironically have created the BD that makes initiatives such as these necessary.

In addition to the US government agencies that helped to unveil TBDRDI, several other federal agencies had been requested to develop BD management strategies in the time leading up to and following this initiative. A US government fact sheet listed between 80 and 85 BD projects across a dozen federal agencies including, in addition to the departments previously mentioned, the Department of Homeland Security (DHS), Department of Health and Human Services (DHHS), and the Food and Drug Administration (FDA) (Henschen 2012). The White House referred to TBDRDI as representing it placing its bet on BD, meaning that the financial investment in this initiative is expected to yield a significant return for the country in coming years. To this end, President Obama has sought the involvement of public, private, and other (e.g., academia, nongovernmental organizations) experts and organizations to work in a way that emphasizes collaboration. For spearheading TBDRDI and for choosing to stake the future of the country on BD, President Barack Obama has been dubbed the BD president by the media.

Projects of the White House Big Data Initiative

The projects included under the umbrella of TBDRDI are diverse, but they share common themes of emphasizing collaboration (i.e., to maximize resources and eliminate data overlap) and making data openly accessible for its social and economic benefits.
One project undertaken with the co-participation of NIH and Amazon, the world's largest online retailer, aims to provide public access to the 1,000 Genomes Project using cloud computing (Smith 2012). The 1,000 Genomes Project involved scientists and researchers sequencing the genomes of over 1,000 anonymous and ethnically diverse people between 2008 and 2012 in order to better treat illness and predict medical conditions that are genetically influenced. The NIH will deposit 200 terabytes (TB) of genomic data into Amazon's Web Services. According to the White House, this is currently the world's largest collection of human genetic data. In August 2014, the UK reported that it would undertake a 100,000 genomes project that is slated to finish in 2017.

The NIH and NSF will cooperate to fund 15–20 research projects for a cost of $25 million. Other collaborations include the DoE's and University of California's creation of a new facility as part of their Lawrence Berkeley National Laboratory called the Scalable Data Management, Analysis, and Visualization Institute ($25 million) and the NSF and University of California, Berkeley's geosciences Earth Cube BD project ($10 million).

The CyberInfrastructure for Billions of Electronic Records (CIBER) project is a co-initiative of the National Archives and Records Administration (NARA), the NSF, and the University of North Carolina Chapel Hill. The project will assemble decades of historical and digital-era documents on demographics and urban development/renewal. The project draws on citizen-led sourcing or citizen sourcing, meaning that the project will build a participative archive fueled by engaged community members and not just by professional archivists and/or governmental experts. Elsewhere, the NSF will partner with NASA on its Global Earth Observation System of Systems (GEOSS), an international project to share and integrate Earth observation data. Similarly, the National Oceanic and Atmospheric Administration (NOAA) and NASA, who collectively oversee hundreds of thousands of environmental sensors producing reams of climate data, have partnered with Computer Science Corporation (CSC) to manage this climate data using their ClimatEdge™ risk management suite of tools. CSC will collect and interpret the climate data and make it available to subscribers in the form of monthly reports that anticipate how climate changes could affect global agriculture, global energy demand/production, sugar/soft commodities, grain/oilseeds, and energy/natural gas. These tools are promoted to help companies and consumers make better decisions. For example, fluctuating resource prices caused by climate changes will allow a consumer/business to find new supplies/suppliers in advance of natural disasters and weather patterns. Future goals include providing streaming data to advanced users of the service and expanding this service to other sectors including disease and health trends (Eddy 2014).

The DoD argues that it will spend $250 million annually on BD. Several of its initiatives promote cybersecurity, like its Cyber-Insider Threat program for quick and precise targeting of cyber espionage threats to military computer networks. The DoD's cybersecurity projects also include developing cloud-computing capabilities that would retain function in the midst of an attack, programming languages that stay encrypted whenever in use, and security programs suitable for BD supercomputer networks. In keeping with the TBDRDI maxim to collaborate and share, the DoD has partnered with Lockheed Martin Corporation to provide the military and its partners with time-sensitive intelligence, surveillance, and reconnaissance data in what is being called a Distributed Common Ground System (DCGS). This project is touted as having the potential to save individual soldiers' lives on the battlefield. Other defense-oriented initiatives under TBDRDI include how the Pentagon is working to increase its ability to extract information from texts to over 100 times its current rate and the Defense Advanced Research Projects Agency's (DARPA) development of XDATA (Raul 2014), a $100 million program for sifting BD.

Influences of the Initiative and Expected Outcomes

The United Nations' (UN) Global Pulse Initiative (GPI) may have shaped TBDRDI (UN Global Pulse 2012).
Realizing in 2009–2010 that the data it relied upon to respond to global crises were outdated, the UN created its GPI to provide real-time data. In 2011, the proof of concept (i.e., primary project) phase began with the analysis of 2 years' worth of US and Irish social media data for mood scores/conversation indicators that could, in some cases, predict economic downturns 5 months out and economic upturns 2 months out. Success in this project justified opening GPI labs in Jakarta, Indonesia, and Kampala, Uganda.

Similarly, in 2010, President Obama's Council of Advisors on Science and Technology urged focused investment in information technology (IT) to avoid overlapping efforts (Henschen 2012). This advice fit with 2010's existing cost-cutting efforts that were moving government work to less expensive Internet-based applications. TBDRDI, emerging from IT recommendations and after a period of economic downturn, differs from the so-called reality-based community (i.e., studying what has happened) of the Bush Administration by focusing instead on what will happen in the future. Some argue that an inkling of TBDRDI can be seen as early as 2008, when then Senator Obama cosponsored a bipartisan online federal spending database bill (i.e., for USAspending.gov) and, as a presidential candidate, actively utilized BD techniques (Scola 2013).

TBDRDI comes at a time when International Data Corporation (IDC) predicts that by 2020, over a third of digital information will generate value if analyzed. Making BD open and accessible will bring businesses an estimated three trillion dollars in profits. Mark Weber, President of US Public Sector for NetApp and government IT commentator, argues that the value of BD lies in transforming it into quality knowledge for increasing efficiency and better informed decision-making (CIO Insight 2012). TBDRDI is also said to be vital to national security. Kaigham Gabriel, a Google executive and the next CEO and President of Draper Laboratory, argued that the cluttered nature of the BD field allows America's adversaries to hide, and that field is becoming increasingly cluttered as it is estimated that government agencies generated one petabyte (PB), or one quadrillion bytes, of data from 2012 to 2014 (CIO Insight 2012). One would need almost 14,552 64-gigabyte (GB) iPhones in order to store this amount of data. Experts argue that the full extent of technology/applications required to successfully manage the amounts of BD that TBDRDI could produce now and in the future remains to be seen.
President Obama promised that TBDRDI would stimulate the economy and save taxpayer money, and there is evidence to indicate this. The employment outlook for individuals trained in mathematics, science, and technology is strong as the US government attempts to hire sufficient staff to carry out the work of TBDRDI. Hiring across governmental agencies requires the skilled work of deriving actionable knowledge from BD. This responsibility falls largely on a subset of highly trained professionals known as quantitative analysts, or the quants for short. Currently these employees are in high demand and thus can be difficult to source, as the US government must compete alongside private sector businesses for talent when the latter may be able to provide larger salaries and also higher profile positions (e.g., Wall Street firms). Some have argued for the government to invest more money in the training of quantitative analysts to feed initiatives such as this (Tucker 2012).

In terms of cutting overspending, cloud computing (platform-as-a-service technologies) has been identified under TBDRDI as a means to consolidate roughly 1,200 unneeded federal data centers (Tucker 2012). The Obama Administration has stated that it will eliminate 40% of federal data centers by 2015. This is estimated to generate $5 billion in savings. Some in the media applaud the effort and corresponding savings, while some critics of the plan argue that the data centers should be streamlined and upgraded instead. As of 2014, the US government reports that 750 data centers have been eliminated.

In January 2014, after classified information leaks by former NSA contractor Edward Snowden, President Obama asked the White House for a comprehensive review of BD that some argue dampened the enthusiasm for TBDRDI (Raul 2014). The US does not have a specific BD privacy law, leading critics to claim a policy deficit. Others point to the Federal Trade Commission (FTC) Act, Section 5, which prohibits unfair or deceptive acts or practices in or affecting commerce, as being firm enough to handle any untoward business practices that might emerge from BD while being flexible enough to not hinder the economy (Raul 2014).
Advocates note that the European Union (EU) has adopted a highly detailed privacy policy that has done little to foster commercial innovation and economic growth (Raul 2014).

Conclusion

Other criticism argues that TBDRDI, and the Obama Administration by default, actually serves big business instead of individual consumers and citizens. In support of this argument, critics argue that the administration pressured communications companies to provide more affordable and higher speeds of mobile broadband. As of the summer of 2014, Hong Kong has the world's fastest mobile broadband speeds that are also some of the most affordable, with South Korea second and Japan third; the US and its neighbor Canada are not even in the top ten list of fastest mobile broadband speed countries. Supporters of the administration cite that the Obama Administration has instead chosen to emphasize its unprecedented open data initiatives under TBDRDI. The US Open Data Action Plan emphasizes making high-priority US government data both mobile and publically accessible, while Japan is reported to have fallen behind in open-sourcing its BD, specifically in providing access to their massive stores of state/local data, costing its economy trillions of yen.

Cross-References

▶ Big Data
▶ Cloud Computing
▶ Cyberinfrastructure (U.S.)
▶ National Oceanic and Atmospheric Administration

References

Carstensen, J. (2012). Berkeley group digs in to challenge of making sense of all that data. Retrieved from http://www.nytimes.com/2012/04/08/us/berkeley-group-tries-to-make-sense-of-big-data.html?_r=0.
CIO Insight (2012). Can government IT meet the big data challenge? Retrieved from http://www.cioinsight.com/c/a/Latest-News/Big-Data-Still-a-Big-Challenge-for-Government-IT-651653/.
Eddy, N. (2014). Big data proves alluring to federal IT pros. Retrieved from http://www.eweek.com/enterprise-apps/big-data-proves-alluring-to-federal-it-pros.html.
Executive Office of the President (2014). Big data: Seizing opportunities, preserving values. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf.
Henschen, D. (2012). Big data initiative or big government boondoggle? Retrieved from http://www.informationweek.com/software/information-management/big-data-initiative-or-big-government-boondoggle/d/d-id/1103666?.
Raul, A. C. (2014). Don't throw the big data out with the bath water. Retrieved from http://www.politico.com/magazine/story/2014/04/dont-throw-the-big-data-out-with-the-bath-water-106168_full.html?print#.U_PA-lb4bFI.
Savitz, E. (2012). Big data in the enterprise: A lesson or two from big brother. Retrieved from http://www.forbes.com/sites/ciocentral/2012/12/26/big-data-in-the-enterprise-a-lesson-or-two-from-big-brother/.
Scola, N. (2013). Obama, the 'big data' president. Retrieved from http://www.washingtonpost.com/opinions/obama-the-big-data-president/2013/06/14/1d71fe2e-d391-11e2-b05f-3ea3f0e7bb5a_story.html.
Smith, J. (2012). White House aims to tap power of government data. Retrieved from https://www.yahoo.com/news/white-house-aims-tap-power-government-data-093701014.html?ref=gs.
Tucker, S. (2012). Budget pressures will drive government IT change. Retrieved from http://www.washingtonpost.com/business/capitalbusiness/budget-pressures-will-drive-government-it-change/2012/08/24/ab928a1e-e898-11e1-a3d2-2a05679928ef_story.html.
UN Global Pulse. (2012). Big data for development: Challenges & opportunities. Retrieved from UN Global Pulse, Executive Office of the Secretary-General United Nations, New York, NY at http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-UNGlobalPulseJune2012.pdf.

White House BRAIN Initiative

Gordon Alley-Young
Department of Communications and Performing Arts, Kingsborough Community College, City University of New York, New York, NY, USA

Synonyms

Brain research through advancing innovative neurotechnologies
Introduction

The White House BRAIN Initiative (TWHBI) includes an acronym where BRAIN stands for Brain Research Through Advancing Innovative Neurotechnologies. The goal of the initiative is to spur brain research, such as mapping the brain's circuitry, and technology that will lead to treatments and preventions for common brain disorders. President Barack Obama first announced the initiative in his February 2013 State of the Union Address (SOTUA). More than 200 leaders from universities, research institutes, national laboratories, and federal agencies were invited to attend when President Obama formally unveiled TWHBI on April 2, 2013. The Obama administration identified this initiative as one of the grand challenges of the twenty-first century. The $100 million initiative is funded via the National Institutes of Health (NIH), the Defense Advanced Research Projects Agency (DARPA), and the National Science Foundation (NSF), with matching support for the initiative reported to come from private research institutions and foundations. TWHBI has drawn comparisons to the Human Genome Project (HGP) for the potential scientific discovery that the project is expected to yield. The HGP and TWHBI are also big data projects for the volume of data that they have already produced and will produce in the future.

History and Aims of the Initiative

TWHBI aims to provide opportunities to map, study, and thus treat brain disorders including Alzheimer's disease, epilepsy, autism, and traumatic brain injuries. The NIH will lead efforts under the initiative to map brain circuitry, measure electrical/chemical activity along those circuits, and understand the role of the brain in human behavioral and cognitive output. The initiative is guided by eight key goals. The first is to make various types of brain cells available for experimental researchers to study their role in illness and well-being. The second is to create multilayered maps of the brain's different circuitry levels as well as a map of the whole organ. The third would see the creation of a dynamic picture of the brain through large-scale monitoring of neural activity. Fourth is to link brain activity to behavior with tools that could intervene in and change neural circuitry. A fifth goal is to increase understanding of the biological basis for mental processes by theory building and developing new data analysis tools. The sixth is to innovate technology to better understand the brain so as to better treat disorders. The seventh is to establish and sustain interconnected networks of brain research. Finally, the last goal is to integrate the outcomes of the other goals to discover how dynamic patterns of neural activity get translated into human thought, emotion, perception, and action in illness and in health.

NIH Director Dr. Francis Collins echoed President Obama in publically stating that TWHBI will change the way we treat the brain and grow the economy (National Institutes of Health 2014). During his 2013 SOTUA, President Obama drew an analogy to the Human Genome Project (HGP), arguing that for every dollar the USA invested in the project, the US economy gained $140. Estimates suggest that the HGP created $800 billion in economic activity. The HGP was estimated to cost $3 billion and take 15 years (i.e., 1990–2005). The project finished 2 years early and under cost at $2.7 billion in 1991 dollars. The HGP project is estimated to have cost $3.39–$5 billion in 2003 dollars. TWHBI has a budget of $100 million allocated in budget year 2014 with comparable funds ($122 million) contributed by private investors. A US federal report calls for $4.5 billion in funding for brain research over the next 12 years.

Projects Undertaken by the Initiative

The first research paper believed to be produced under the TWHBI initiative came from a paper published on June 19, 2014, by principal investigator Dr. Karl Deisseroth of Stanford University. The research described Deisseroth and his team's innovation of the CLARITY technique that can remove fat from the brain without damaging its wiring and enable the imaging of a whole transparent brain.
Data from the study is being used by international biomedical research projects.

TWHBI was undertaken because it addresses what science, society, and government consider one of the grand challenges of the twenty-first century (i.e., the HGP was previously deemed a grand challenge). Unlocking the secrets of the brain will tell us how the brain can record, process, utilize, retain, and recall large amounts of information. Dr. Geoffrey Ling, deputy director of the Defense Sciences Office at the Defense Advanced Research Projects Agency (DARPA), states that TWHBI is needed to attract young and intelligent people into the scientific community. Ling cites a lack of available funding as a barrier to persuading students to pursue research careers (Vallone 2013). Current NIH director and former HGP director Dr. Francis Sellers Collins notes the potential of TWHBI to create jobs while potentially curing diseases of the brain and the nervous system, for instance, Alzheimer's disease (AD). In 2012 Health and Human Services Secretary Kathleen Sebelius stated the Obama administration's goal to cure AD by 2025. The Alzheimer's Association (AA) estimates that AD/dementia health and care cost $203 billion in 2013 ($142 billion by Medicare/Medicaid); this will reach $1.2 trillion by 2050 (Alzheimer's Association 2013).

Dr. Ling argues that for scientists to craft and validate their hypotheses and build on their knowledge in ways that potentially lead to medical breakthroughs, they need access to the latest research tools. Ling states that some of today's best clinical brain research tools are nonetheless limited and outdated in light of the TWHBI work that remains to be done. To bolster his case for better research tools, Ling uses an analogy whereby the physical brain is hardware and the dynamic processes across the brain's circuits are software. Ling notes that cutting-edge tools can help identify bugs in the brain's software caused by a physical trauma (i.e., to the hardware) that, once found, might be repairable. The tools necessary for medical research will need to be high-speed tools with a much greater capacity for recording signals from brain cells. TWHBI, by bringing together scientists and researchers from a variety of fields such as nanoscience, imaging, engineering, and informatics, has the greatest opportunity to develop these tools.

Earlier Efforts and Influences

Brain research was emphasized prior to TWHBI by the previous two administrations. The Clinton administration held a White House conference on early childhood development and learning focused on insights gleaned from the latest brain research in 1997. In 2002 the Bush administration's National Drug Control Policy Director John Walters donated millions of dollars of drug-war money to purchase dozens of MRI machines. Their goal was a decade-long, $100 million brain-imaging initiative to study the brain to better understand addiction.

Publicity surrounding TWHBI brings attention to how much science has learned about the brain in a relatively short period of time. In the nineteenth century, brain study focused mostly on what happens when parts of the brain are damaged/removed. For instance, Phineas Gage partially lost his prefrontal cortex in an 1848 accident, and scientists noted how Mr. Gage changed from easygoing and dependable before to angry and irresponsible afterward. From the late eighteenth to mid-nineteenth centuries, pseudoscientists practiced phrenology, or reading a person's mind by handling a person's skull.

Phillip Low, a director of San Diego-based NeuroVigil Inc. (NVI), states that the White House talked to many scientists and researchers while planning TWHBI but did not reveal to these individuals that they were talking to many others, all of whom potentially believed they were the parent of TWHBI. However, the originators of the idea that led to TWHBI are said to be six scientists, whose journal article in the June 2012 issue of Neuron proposed a brain-mapping project. The six are A. Paul Alivisatos (University of California Berkeley), Miyoung Chun (The Kavli Foundation), George M. Church (Harvard University), Ralph J. Greenspan (The Kavli Institute), Michael L. Roukes (Kavli Nanoscience Institute), and
Rafael Yuste (Columbia University) (Alivisatos et al. 2012). New York Times reporter Steve Connor says the roots of TWHBI occur 10 years earlier, when Microsoft cofounder and philanthropist Paul G. Allen established a brain science institute in Seattle for a $300 million investment. Similarly, with a $500 million investment, billionaire philanthropist Fred Kavli funded brain institutes at Yale, Columbia, and the University of California (Broad 2014). It was primarily scientists from these two institutes that crafted the TWHBI blueprint. Connor states that there are benefits and downsides to TWHBI's connections to private philanthropy. Connor acknowledges that philanthropists are able to invest in risky initiatives in a way that the government cannot, but that this can lead to a self-serving research focus, the privileging of affluent universities at the expense of poorer ones, and a US government that is following the lead of private interests rather than setting the course itself (Connor 2013).

The $100 million for the first phase of TWHBI in fiscal year 2014 comes from three government agencies' budgets, specifically NIH, DARPA, and NSF. The NIH Blueprint for Neuroscience Research will lead with contributions specifically geared to projects that would lead to the development of cutting-edge, high-speed tools, training, and other resources. The next generation of tools is viewed as vital to the advancement of this initiative. Contributor DARPA will invest in programs that aim to understand the dynamic functions of the brain, noted in Dr. Ling's analogy as the software of the brain, and to show breakthrough applications based on the dynamic function insights gained. DARPA also seeks to develop new tools for capturing and processing dynamic neural and synaptic activities. DARPA develops applications for improving the diagnosis and treatment of post-traumatic stress, brain injury, and memory loss sustained through war and battle. Such applications would include generating new information processing systems related to the information processing system in the brain and mechanisms of functional restoration after brain injury. DARPA is mindful that advances in neurotechnology, such as those outlined above, will entail ethical, legal, and social issues that it will oversee via its own experts. Ethics are also at the forefront of TWHBI. Specifically, President Obama identified adhering to the highest standards of research protections as a prime focus. Oversight of ethical issues related to this as well as any other neuroscience initiative will fall to the administration's Commission for the Study of Bioethical Issues.

The NSF's strength as a contributor to TWHBI is that it will sponsor interdisciplinary research that spans the fields of biology, physics, engineering, computer science, social science, and behavioral science. The NSF's contribution to TWHBI again emphasizes the development of tools and equipment, specifically molecular-scale probes that can sense and record the activity of neural networks. Additionally, the NSF will also seek to address the innovations that will be necessary in the field of big data in order to store, organize, and analyze the enormous amounts of data that will be produced. Finally, NSF projects under TWHBI will see better understanding of how thoughts, emotions, actions, and memories get represented in the brain.

In addition to federal government agencies, at least four private institutes and foundations have pledged an estimated $122 million to support TWHBI: The Allen Institute (TAI), the Howard Hughes Medical Institute (HHMI), The Kavli Foundation (TKF), and The Salk Institute for Biological Studies (TSI). TAI's strengths lie in large-scale brain research, tools, and data sharing, which are necessary for a big data project like the one TWHBI represents. Starting in March 2012, TAI undertook a 10-year project to unlock the neural code (i.e., how brain activity leads to perception, decision-making, and action). HHMI by comparison is the largest nongovernmental funder of basic biomedical research and has long supported neuroscience research. TKF anticipates drawing on the endowments of existing Kavli Institutes (KI) to fund its participation in TWHBI. This includes funding new KIs. Finally, the TSI, under its dynamic BRAIN initiative, will support cross-boundary research in neuroscience. For example, TSI researchers will map the brain's neural networks to determine their interconnections. TSI scientists will lay the groundwork for solving neurological puzzles such as Alzheimer's/Parkinson's by studying age-related brain differences (The White House 2013).
The work of TWHBI will be spread across affiliated research institutions and laboratories across the USA. The NIH is said to be establishing a bicoastal, cochaired working group under Dr. Cornelia Bargmann, a former UCSF Professor now with the Rockefeller University in New York City, and Dr. William Newsome from California's Stanford University to specify goals for the NIH's investment and create a multiyear plan for achieving these goals with timelines and costs (University of California San Francisco 2013). On the east coast of the USA, the NIH Blueprint for Neuroscience Research, which draws on 15 of the 27 NIH Institutes and Centers headquartered in Bethesda, MD, will be a leading NIH contributor to TWHBI. Research will occur in nearby Virginia at HHMI's Janelia Farm Research Campus, which focuses on developing new imaging technologies and finding out how information is stored and processed in neural networks. Imaging technology furthers TWHBI's goals of mapping the brain's structures by allowing researchers to create dynamic brain pictures down to the level of single brain cells as they interact with complex neural circuits at the speed of thought.

Conclusion

Contributions to and extensions of TWHBI are also happening on the US west coast and internationally. San Diego State University (SDSU) is contributing to TWHBI via its expertise in clinical and cognitive neuroscience, specifically its investigations to understand and treat brain-based disorders like autism, aphasia, fetal alcohol spectrum (FAS) disorders, and AD. San Diego's NVI, founded in 2007 and advised by Dr. Stephen Hawking, and its founder, CEO, and Director Dr. Phillip Low, helped to shape the TWHBI initiative. NVI is notable for its iBrain™ single-channel electroencephalograph (EEG) device that noninvasively monitors the brain (Keshavan 2013). Dr. Low has also taken the message of TWHBI international, as he was asked to go to Israel and help them develop their own BRAIN initiative. To this end, Dr. Low delivered one of two keynotes for Israel's first International Brain Technology Conference in Tel Aviv in October 2013. Australia also supports TWHBI through neuroscience research collaboration and increased hosting of the NSF's US research fellows for collaborating on relevant research projects.

Cross-References

▶ Big Data
▶ Data Sharing
▶ Medicaid

References

Alivisatos, A. P., Chun, M., Church, G. M., Greenspan, R. J., Roukes, M. L., & Yuste, R. (2012). The brain activity map project and the challenge of functional connectomics. Neuron, 74(6), 970–974.
Alzheimer's Association. (2013). Alzheimer's Association applauds White House Brain Mapping Initiative. Retrieved from Alzheimer's Association National Office, Chicago, IL at http://www.alz.org/news_and_events_alz_association_applauds_white_house.asp.
Broad, W. J. (2014). Billionaires with big ideas are privatizing American science. Retrieved from The New York Times, New York, NY at http://www.nytimes.com/2014/03/16/science/billionaires-with-big-ideas-are-privatizing-american-science.html.
Connor, S. (2013). One of the biggest mysteries in the universe is all in the head. Retrieved from Independent Digital News and Media, London, UK at http://www.independent.co.uk/voices/comment/one-of-the-biggest-mysteries-in-the-universe-is-all-in-the-head-8791565.html.
Keshavan, M. (2013). BRAIN Initiative will tap our best minds. San Diego Business Journal, 34(15), 1.
National Institutes of Health. (2014). NIH embraces bold, 12-year scientific vision for BRAIN Initiative. Retrieved from National Institutes of Health, Bethesda, MD at http://www.nih.gov/news/health/jun2014/od-05.htm.
The White House. (2013). Fact sheet: BRAIN Initiative. Retrieved from The White House Office of the Press Secretary, Washington, DC at http://www.whitehouse.gov/the-press-office/2013/04/02/fact-sheet-brain-initiative.
University of California San Francisco. (2013). President Obama unveils brain mapping project. Retrieved from the University of California San Francisco at http://www.ucsf.edu/news/2013/04/104826/president-obama-unveils-brain-mapping-project.
Vallone, J. (2013). Federal initiative takes aim at treating brain disorders. In Investors Business Daily, Los Angeles, CA (p. A04).
962 WikiLeaks

WikiLeaks

Kim Lacey
Saginaw Valley State University, University Center, MI, USA

WikiLeaks is a nonprofit organization devoted to sharing classified, highly secretive, and otherwise controversial documents to promote transparency among global superpowers. These shared documents are commonly referred to as "leaks." WikiLeaks has received both highly positive and negative attention for this project, particularly because of its mission to share leaked information. WikiLeaks is operated by the Icelandic Sunshine Press, and Julian Assange is often named the founder of the organization.
WikiLeaks began in 2006, and its founding is largely attributed to Australian Julian Assange, often described as an Internet activist and hacker. The project, which aims to share government documents usually kept from citizens, is a major source of division between individuals and officials. The perspective on this division differs depending on the viewpoint. From the perspective of its opponents, the WikiLeaks documents are obtained illegally, and their distribution is potentially harmful for national security purposes. From the perspective of its supporters, the documents point to egregious offenses perpetrated, and ultimately stifled, by governments. On its website, WikiLeaks notes that it is working toward what it calls "open governance," the idea that leaks are not only for international, bureaucratic diplomacy but more importantly for clarity of citizens' consciousness.
In 2010, Chelsea (born Bradley) Manning leaked a United States' military cable containing 400,000 files regarding the Iraq War. According to Andy Greenberg, this leak, which later became known as Cablegate, marked the largest leak of United States' government information since Daniel Ellsberg photocopied The Pentagon Papers. After chatting for some time, Manning confessed to former hacker Adrian Lamo. Eventually, Lamo turned Manning over to the army authorities, leading to her arrest. The United States' government officials were outraged by the leak of classified documents and viewed Manning as a traitor. This leak eventually led to Manning's detention, and officials kept her detained for more than 1,000 days without a trial. Because of this delay, supporters of WikiLeaks were outraged at Manning's denial of a swift trial. Manning was eventually acquitted of aiding the enemy but, in August 2013, was sentenced to 35 years for various crimes including violations of the Espionage Act.
One of the most well-known documents Manning shared put WikiLeaks on the map for many who were previously unfamiliar with the organization. This video, known familiarly as "Collateral Murder," shows a United States' Apache helicopter shooting Reuters reporters and individuals helping these reporters, and seriously injuring two children. There have been two versions of the video released: a shorter, 17-min video and a more detailed 39-min video. Both videos were leaked by WikiLeaks and remain on its website.
WikiLeaks uses a number of different drop boxes in order to obtain documents and maintain the anonymity of the leakers. Many leakers are well versed in anonymity protective programs such as Tor, which uses what they call "onion routing": several layers of encryption to avoid detection. However, in order to make leaking less complicated, WikiLeaks provides instructions on its website for users to skirt around regular detection through normal identifiers. Users are instructed to submit documents in one of many anonymous drop boxes to avoid detection.
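The "onion routing" idea mentioned above - wrapping a message in successive layers of encryption so that each relay can remove only its own layer - can be sketched in a few lines of code. This is a simplified illustration of the layering concept only, not Tor's actual protocol; the three-relay setup and the use of the Fernet cipher from the Python cryptography package are assumptions made for the example.

```python
from cryptography.fernet import Fernet

# Three hypothetical relays, each holding its own symmetric key.
relay_keys = [Fernet.generate_key() for _ in range(3)]

message = b"document for the drop box"

# Wrap the message in layers: the last relay's layer is applied first,
# so the first relay peels the outermost layer.
onion = message
for key in reversed(relay_keys):
    onion = Fernet(key).encrypt(onion)

# Each relay removes exactly one layer; only the final relay ever sees
# the plaintext, and no single relay sees both sender and content.
for key in relay_keys:
    onion = Fernet(key).decrypt(onion)

assert onion == message
```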
In order to verify the authenticity of a document, WikiLeaks performs several forensic tests, including weighing the price of forgery as well as possible motives for falsifying information. On its website, WikiLeaks explains that it verified the now infamous "Collateral Murder" video by actually sending journalists to interview individuals affiliated with the attack. WikiLeaks states simply that when it publishes a document, the fact that it has been published is verification enough. By making information more freely available, WikiLeaks aims to start a larger conversation within the press about access to authentic documents and democratic information.
Funding for WikiLeaks has been a contentious issue since its founding. Since 2009, Assange has noted several times that WikiLeaks is in danger of running out of funding. One of the major reasons for these funding shortages is that many corporations (including Visa, MasterCard, and PayPal) have ceased to allow their customers to donate money to WikiLeaks. On the WikiLeaks website, this action is described as the "banking blockade." To work around this banking blockade, many mirror sites (websites that are hosted separately but contain the same information) have appeared, allowing users to access WikiLeaks documents and also donate with "blocked" payment methods. WikiLeaks also sells paraphernalia on its website, but it is unclear if these products fall under the banking blockade restrictions.
Because of his affiliation with WikiLeaks, Julian Assange was granted political asylum in Ecuador in 2012. Prior to his asylum, he had been accused of molestation and rape in Sweden but evaded arrest. In June 2013, Edward Snowden, a former employee of the National Security Agency (NSA), leaked evidence of the United States spying on its citizens to the UK's The Guardian. On many occasions, WikiLeaks has supported Snowden, helping him apply for political asylum, providing funding, and also providing him with escorts on flights (most notably Sarah Harrison accompanying Snowden from Hong Kong to Russia).
WikiLeaks has been nominated for multiple awards for reporting. Among the awards it has won are the Economist Index on Censorship Freedom of Expression award (2008) and the Amnesty International human rights reporting award (2009, New Media). In 2011, Norwegian citizen Snorre Valen publicly announced that he nominated Julian Assange for the Nobel Peace Prize, although Assange did not win.

Cross-References

▶ National Security Agency (NSA)
▶ Transparency

Further Reading

Dwyer, D. (n.d.). WikiLeaks' Assange for Nobel Prize? ABC News. Available at: http://abcnews.go.com/Politics/wikileaks-julian-assange-nominated-nobel-peace-prize/story?id=12825383. Accessed 28 Aug 2014.
Greenberg, A. (2012). This machine kills secrets: How wikileakers, cypherpunks, and hacktivists aim to free the world's information. New York: Dutton.
Sifry, M. L. (2011). WikiLeaks and the age of transparency. New York: OR Books.
Tate, J. (n.d.). Bradley Manning sentenced to 35 years in WikiLeaks case. Washington Post. Available at: http://www.washingtonpost.com/world/national-security/judge-to-sentence-bradley-manning-today/2013/08/20/85bee184-09d0-11e3-b87c-476db8ac34cd_story.html. Accessed 26 Aug 2014.
WikiLeaks. (n.d.). Available at: https://www.wikileaks.org/. Accessed 28 Aug 2014.
WikiRebels: The documentary. (n.d.). Available at: https://www.youtube.com/watch?v=z9xrO2Ch4Co. Accessed 1 Sept 2012.

Wikipedia

Ryan McGrady
North Carolina State University, Raleigh, NC, USA

Wikipedia is an open-access online encyclopedia hosted and operated by the Wikimedia Foundation (WMF), a San Francisco-based nonprofit organization. Unlike traditional encyclopedias, Wikipedia is premised on an open editing model whereby everyone using the site is allowed and encouraged to contribute content and make changes. Since its launch in 2001, it has grown to over 40 million articles across nearly three hundred languages, constructed almost entirely by unpaid pseudonymous and anonymous users. Since its infancy, Wikipedia has attracted researchers from many disciplines to its vast collection of user-generated knowledge, unusual production model, active community, and open approach to data.
Wikipedia works on a type of software called a wiki, a popular kind of web application designed to facilitate collaboration. Wiki pages can be modified directly using a built-in text editor.
When a user saves his or her changes, a new version of the article is created and immediately visible to the next visitor. Part of what allows Wikipedia to maintain standards for quality is the meticulous record-keeping of changes provided by wiki software, which stores each version of a page permanently in a way that is easily accessible. If someone makes changes that are not in the best interest of the encyclopedia, another user can easily see the extent of those changes and if necessary restore a previous version or make corrections. Each change is timestamped and attributed to either a username or, if made anonymously, an IP address. Although Wikipedia is transparent about what data it saves and draws little criticism on privacy matters, any use of a wiki requires self-awareness given that one's actions will be archived indefinitely.
Article histories largely comprise the Wikipedia database, which the WMF makes available to download for any purpose compatible with its Creative Commons license, including mirroring, personal and institutional offline use, and data mining. The full English language database download amounts to more than ten terabytes, with several smaller subsets available that, for example, exclude discussion pages and user profiles or only include the most current version of each page.
As with any big data project, there is a challenge in determining not just what questions to ask but how to use the data to convey meaningful answers. Wikipedia presents an incredible amount of knowledge and information, but it is widely dispersed and collected in a database organized around articles and users, not structured data. One way the text archive is rendered intelligible is through visualization, wrangling the unwieldy information by expressing statistics and patterns through visuals like graphs, charts, or histograms. Given the multi-language and international nature of Wikipedia, as well as the disproportionate size and activity of the English version in particular, geography is important in its critical discourse. Maps are thus popular visuals to demonstrate disparities, locate concentrations, and measure coverage or influence. Several programs have been developed to create visualizations using Wikipedia data as well. One of the earliest, the IBM History Flow tool, produces images based on stages of an individual article's development over time, giving a manageable, visual form to an imposingly long edit history and the disagreements, vandalism, and controversies it contains.
The Wikipedia database has been and continues to be a valuable resource, but there are limitations to what can be done with its unstructured data. It is downloaded as a relational database filled with text and markup, but the machines that researchers use to process data are not able to understand text like a human, limiting what tasks they can be given. It is for this reason that there have been a number of attempts to extract structured data as well. DBPedia is a database project started in 2007 to put as much of Wikipedia as possible into the Resource Description Framework (RDF). Whereas content on the web typically employs HTML to display and format text, multimedia, and links, RDF emphasizes not what a document looks like but how its information is organized, allowing for arbitrary statements and associations which effectively make the items meaningful to machines. The article for the film Moonrise Kingdom may contain the textual statement "it was shot in Rhode Island," but a machine would have a difficult time extracting the desired meaning, instead preferring to see a subject "Moonrise Kingdom" with a standard property "filming location" set to the value "Rhode Island."
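The contrast between free text and a machine-readable statement can be sketched in a few lines of code. The snippet below is only illustrative: the example.org namespaces and the filmingLocation property name are stand-ins invented for the sketch, not necessarily the exact terms DBPedia uses.

```python
from rdflib import Graph, Namespace

# Hypothetical namespaces standing in for DBPedia-style resource and ontology URIs.
DBR = Namespace("http://example.org/resource/")
DBO = Namespace("http://example.org/ontology/")

g = Graph()
# The free-text claim "it was shot in Rhode Island" becomes an explicit
# subject-property-value statement that software can query directly.
g.add((DBR["Moonrise_Kingdom"], DBO["filmingLocation"], DBR["Rhode_Island"]))

# Any program can now ask a precise question of the data instead of parsing prose.
for film, _, place in g.triples((None, DBO["filmingLocation"], None)):
    print(f"{film} was filmed in {place}")

print(g.serialize(format="turtle"))
```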
In 2012, WMF launched Wikidata, its own structured database. In addition to Wikipedia, WMF operates a number of other sites like Wiktionary, Wikinews, Wikispecies, and Wikibooks. Like Wikipedia, these sites are available in many languages, each more or less independent from the others. To solve redundancy issues and to promote resource sharing, the Wikimedia Commons was introduced in 2004 as a central location for images and other media for all WMF projects. Wikidata works on a similar premise with data. Its initial task was to centralize inter-wiki links, which connect, for example, the English article "Cat" to the Portuguese "Gato" and Swedish "Katt." Inter-language links had previously been handled locally, creating links at the bottom of an article to its counterparts at every other applicable version. Since someone adding links to the Tagalog Wikipedia is not likely to speak Swedish, and because someone who speaks Swedish is not likely to actively edit the Tagalog Wikipedia and vice versa, this process frequently resulted in inaccurate translations, broken links, one-way connections, and other complications. Wikidata helps by acting as a single junction for each topic.
A topic, or an item, on Wikidata is given its own page which includes an identification number. Users can then add a list of alternative terms for the same item and a brief description in every language. Items also receive statements connecting values and properties. For example, The Beatles's 1964 album A Hard Day's Night is item Q182518. The item links to the album's Wikipedia articles in 49 languages and includes 17 statements with properties and values. The very common instance of property has the value "album," a property called record label has the value "Parlophone Records," and four statements connect the property genre with "rock and roll," "beat music," "pop music," and "rock music." Other statements describe its recording location, personnel, language, and chronology, and many applicable properties are not yet filled in. Like Wikipedia, Wikidata is an open community project and anybody can create or modify statements. Some of the other properties items are given include names, stage names, pen names, dates, birth dates, death dates, demographics, genders, professions, geographic coordinates, addresses, manufacturers, alma maters, spouses, running mates, predecessors, affiliations, capitals, awards won, executives, parent companies, taxonomic orders, and architects, among many others. So as to operate according to the core Wikipedia tenet of neutrality, multiple conflicting values are allowed. Property-value pairs can furthermore be assigned their own property-value pairs such that the record sales property and its value can have the qualifier as of and another value to reflect when the sales figure was accurate. Each property-value pair along the way can be assigned references akin to cited sources on Wikipedia.
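A hedged sketch of how such an item can be read programmatically: Wikidata publishes each item as JSON through its Special:EntityData endpoint, and the snippet below lists a few labels and counts the properties attached to Q182518. The JSON layout shown here is a simplification from memory and should be checked against the current API documentation before relying on it.

```python
import requests

# Public per-item JSON export for A Hard Day's Night (Q182518).
url = "https://www.wikidata.org/wiki/Special:EntityData/Q182518.json"
item = requests.get(url, timeout=30).json()["entities"]["Q182518"]

# Labels in a few of the languages the item links to.
for lang in ("en", "pt", "sv"):
    print(lang, item["labels"].get(lang, {}).get("value"))

# Each claim groups statements under a property ID such as P31 ("instance of").
claims = item["claims"]
print(f"{len(claims)} properties with one or more statements")
for prop, statements in sorted(claims.items())[:5]:
    print(prop, "->", len(statements), "statement(s)")
```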
Some Wikipedia metadata is easy to locate and parse as fundamental elements of wiki technology: timestamps, usernames, and article titles, for example. Other data is incidental, like template parameters. Design elements that would otherwise be repeated in many articles are frequently copied into a separate template which can then be invoked when relevant, using parameters to customize it for the particular page on which it is displayed. For example, in the top-right corner of articles about books there is typically a neatly formatted table called an infobox which includes standardized information input as template parameters like author, illustrator, translator, awards received, number of pages, Dewey decimal classification, and ISBN number. A fundamental part of DBPedia and the second goal for Wikidata is the collection of data based on these relatively few structured fields that exist in Wikipedia.
Standardizing the factual information in Wikipedia holds incredible potential for research. Wikidata and DBPedia, used in conjunction with the Wikipedia database, make it possible to, for example, assess article coverage of female musicians as compared to male musicians in different parts of the world. Since they use machine-readable formats, they can also interface with one another and with many other sources like GeoNames, Library of Congress Subject Headings, Internet Movie Database, MusicBrainz, and Freebase, allowing for richer, more complex queries. Likewise, just as these can be used to support Wikipedia research, Wikipedia can be used to support other forms of research and even enhance commercial products. Google, Facebook, IBM, and many others regularly make use of data from Wikipedia and Wikidata in order to improve search results or provide better answers to questions. By creating points of informational intersection and interpretation for hundreds of languages, Wikidata also has potential for use in translation applications and to enhance cultural education. The introduction of Wikidata in 2012, built on an already impressively large knowledge base, and its ongoing development have opened many new areas for exploration and accelerated the pace of experimentation, incorporating the data into many areas of industry, research, education, and entertainment.

Cross-References

▶ Anonymity
▶ Crowdsourcing
▶ Open Data

Further Reading

Jemielniak, D. (2014). Common knowledge: An ethnography of Wikipedia. Stanford: Stanford University Press.
Krötzsch, M., et al. (2007). Semantic Wikipedia. Web Semantics: Science, Services and Agents on the World Wide Web, 5(4), 251–261.
Leetaru, K. (2012). A big data approach to the humanities, arts, and social sciences: Wikipedia's view of the world through supercomputing. Research Trends, 30, 17–30.
Stefaner, M., et al. Notability – Visualizing deletion discussions on Wikipedia. http://www.notabilia.net/.
Viégas, F., et al. (2004). Studying cooperation and conflict between authors with history flow visualizations. Paper presented at CHI 2004, Vienna.

World Bank

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

The World Bank, part of the World Bank Group established in 1944, is the international financial institution responsible for promoting economic development and reducing poverty. The World Bank has two key objectives: to end extreme poverty by reducing the proportion of the world's population living on less than $1.25 a day and to promote shared prosperity by fostering income growth in the lowest 40% of the population.
A core activity for the World Bank is the provision of low-interest loans and zero- to low-interest grants to developing countries. This could be to support a wide range of activities, from education and health care to infrastructure, agriculture, or natural resource management. In addition to the financial support, the World Bank provides policy advice, research, analysis, and technical assistance to various countries in order to inform its own investments and ultimately to work toward its key objectives. Part of its activities relate to the provision of tools to research and address development challenges, some of which are in the form of providing access to data, for example, the Open Data website, which includes a comprehensive range of downloadable data sets related to different issues. This shows its recognition of the demand for access to quantitative data to inform development strategies (Lehdonvirta and Ernkvist 2011).
A significant amount of the data hosted and disseminated by the World Bank is drawn from national statistical organizations, and it recognizes that the quality of global data is therefore reliant on the capacity and effectiveness of these national statistical organizations. The World Bank has ten key principles with respect to its statistical activities (in line with the Fundamental Principles of Official Statistics and the Principles Governing International Statistical Activities of the United Nations Statistical Division): quality, innovation, professional integrity, partnership, country ownership, client focus, results, fiscal responsibility, openness, and good management.
The world is now experiencing unprecedented capacity to generate, store, process, and interact with data (McAfee and Brynjolfsson 2012), a phenomenon that has been recognized by the World Bank, like other international institutions. For the World Bank, data is seen as critical for the design, implementation, and evaluation of efficient and effective development policy recommendations. In 2014, Jim Yong Kim, the President of the World Bank, discussed the importance of efforts to invest in infrastructure, including data systems. Big data is recognized as a new advancement which has the potential to enhance efforts to address development, although it recognizes there are a series of challenges associated with this.
In 2013, the World Bank hosted an event where over 150 experts, data scientists, civil society groups, and development practitioners met to analyze various forms of big data and consider how it could be used to tackle development issues. The event was a public acknowledgement of how the World Bank viewed the importance of expanding awareness of how big data can help combine various data sets to generate knowledge which can in turn foster development solutions.
A report produced in conjunction with the World Bank, Big Data in Action for Development, highlights some of the potential ways in which big data can be used to work toward development objectives and some of the challenges associated with doing so. The report sets out a conceptual framework for using big data in the development sector, highlighting the potential transformative capacity of big data, particularly in relation to raising awareness, developing understanding, and contributing to forecasting.
Using big data to develop and enhance awareness of different issues has been widely acknowledged. Examples of this include using demographic data in Afghanistan to detect impacts of small-scale violence outbreaks, using social media content to indicate unemployment rises or crisis-related stress, or using tweets to recognize where cholera outbreaks were appearing at a much faster rate than was recognized in official statistics. This ability to gain awareness of situations, experiences, and sentiments is seen to have the potential to reduce reaction times and improve processes which deal with such situations.
Big data can also be used to develop understanding of societal behaviors (LaValle et al. 2011). Examples include investigation of twitter data to explore the relationship between food and fuel price tweets and changes in official price indexes in Indonesia; after the 2010 earthquake in Haiti, mobile phone data was used to track population displacement after the event, and satellite rainfall data was used in combination with qualitative data sources to understand how rainfall affects migration.
Big data is also seen to have potential for contributing to modelling and forecasting. Examples include the use of GPS-equipped vehicles in Stockholm, providing real-time traffic assessments, which are used in conjunction with other data sets such as weather to make traffic predictions, and the use of mobile phone data to predict mobility patterns.
The World Bank piloted some activities in Central America to explore the potential of big data to impact on development agendas. This region has historically experienced low frequencies of data collection for traditional data forms, such as household surveys, and so other forms of data collection were viewed as particularly important. One of these pilot studies used Google Trends data to explore the potential to forecast price changes to commodities. Another study, in conjunction with the UN Global Pulse, explored the use of social media content to analyze public perceptions of policy reforms, in particular a gas subsidy reform in El Salvador, highlighting the potential for this form of data to complement other studies on public perception (United Nations Global Pulse 2012).
The report from the World Bank, Big Data in Action for Development, presents a matrix of different ways in which big data could be used in transformational ways toward the development agenda: using mobile data (e.g., reduced mobile phone top-ups as an indicator of financial stress), financial data (e.g., increased understanding of customer preferences), satellite data (e.g., to crowdsource information on damage after an earthquake), internet data (e.g., to collect daily prices), and social media data (e.g., to track parents' perception of vaccination). The example of examining the relationship between food and fuel prices and corresponding changes in official price index measures by using twitter data (by the UN Global Pulse Lab) is outlined in detail, explaining how it was used to provide an indication of social and economic conditions in Indonesia. This was done by extracting tweets mentioning food and fuel prices between 2011 and 2013 (around 100,000 relevant tweets after filtering for location and language) and analyzing these against corresponding changes from official data sets. The analysis indicated a clear relationship between official food inflation statistics and the number of tweets about food price increases. This study was cited as an example of how big data could be used to analyze public sentiment, in addition to objective economic conditions.
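A stripped-down sketch of the kind of comparison described above: monthly counts of food-price tweets set against an official food inflation series, with a simple correlation as the output. The figures below are invented placeholders; a real analysis would start from the filtered tweet archive and the official statistics.

```python
import pandas as pd

# Invented monthly figures standing in for the filtered tweet counts and
# the official food inflation series (year-on-year, percent).
data = pd.DataFrame(
    {
        "food_price_tweets": [310, 290, 420, 510, 640, 600],
        "official_food_inflation": [4.1, 4.0, 4.8, 5.6, 6.3, 6.1],
    },
    index=pd.period_range("2012-01", periods=6, freq="M"),
)

# Pearson correlation between the social media signal and the official series.
corr = data["food_price_tweets"].corr(data["official_food_inflation"])
print(f"correlation between tweet volume and food inflation: {corr:.2f}")
```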
The examples mentioned here are just some of the activities undertaken by the World Bank to embrace the world of big data. As with many other international institutions which recognize the potential uses for big data, the World Bank also recognizes there are a range of challenges associated with the generation, analysis, and use of big data.
One of the most basic challenges for many organizations (and individuals) is gaining access to data, from both government institutions and the private sector. A new ecosystem needs to be developed where data is made openly available and sharing incentives are in place. It is acknowledged by the World Bank that international agencies will need to address this challenge not only by promoting the availability of data but also by promoting collaboration and mechanisms for sharing data. In particular, a shift in business models will be required in order to ensure the private sector is willing to share data, and governments will need to design policy mechanisms to ensure the value of big data is captured and shared across departments. Related to this, there need to be considerations of how to engage the public with this data.
Thinking particularly about the development agenda at the heart of the World Bank, there is a paradox: countries where poverty is high or where development agendas require the most attention are often countries where data infrastructures or technological systems are insufficient. Because the generation of big data relies largely on technological capabilities, relying on those who use or interact with digital sources may be systematically unrepresentative of the larger population that forms the focus of the research.
The ways in which data are recorded have implications for the results which are interpreted. Where data is passively recorded, there is less potential for bias in the results generated; where data is actively recorded, there is greater potential for the results to be susceptible to selection bias. Furthermore, processing data into a more structured form from the often very large and unstructured data sets requires expertise to both clean the data and, where necessary, aggregate it (e.g., if one set of data is collected every hour and another every day). The media through which data is collected is also an important factor to consider. Mobile phones, for example, produce highly sensitive data, satellite images produce highly unstructured data, and social media platforms produce a lot of unstructured text which requires filtering and codifying, which in itself requires specific analytic capabilities.
In order to make effective use of big data, those using it also need to consider elements of the data itself. The generation of big data has been driven by advances in technology, yet these advances alone are not sufficient to understand the results which can be gleaned from big data. Transforming vast data sets into meaningful results requires effective human capabilities. Depending on how the data is generated, and by whom, there is scope for bias and therefore misleading conclusions. With large amounts of data, there is a tendency for patterns to be observed where there may be none; because of its nature, big data can give rise to significant statistical correlations, and it is important to remember that correlation does not imply causation. Just because there is a large amount of data available does not necessarily mean it is the right data for the question or issue being investigated.
The World Bank acknowledges that for big data to be made effective for development, there will need to be collaboration between practitioners, social scientists, and data scientists in order to ensure that understanding of real-world conditions and data generation mechanisms, and methods of interpretation, are effectively combined. Beyond this there will need to be cooperation between public and private sector bodies in order to foster greater data sharing and incentivize the use of big data across different sectors. Even when data has been accessed, in nearly all cases it needs to be filtered and made suitable for analysis. Filters require human input and need to be applied carefully as their use may preclude information and affect the results. Data also needs to be cleaned: mobile data is received in unstructured form as millions of files, which require time-intensive processing to obtain data suitable for analysis.
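The aggregation problem raised in the parenthetical above - one source reporting hourly, another daily - is the kind of pre-processing that has to happen before any joint analysis. Below is a minimal sketch, with made-up readings, of collapsing an hourly series to daily averages so it can sit alongside a daily series; the variable names and figures are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up hourly readings (e.g., from a sensor or mobile network feed).
hourly = pd.Series(
    np.random.default_rng(0).normal(25, 3, size=72),
    index=pd.date_range("2013-06-01", periods=72, freq="h"),
    name="hourly_reading",
)

# Made-up daily series (e.g., an official statistic published once a day).
daily = pd.Series(
    [101.2, 101.9, 102.4],
    index=pd.date_range("2013-06-01", periods=3, freq="D"),
    name="daily_index",
)

# Aggregate the hourly data to daily means so the two sources share a time step.
combined = pd.concat([hourly.resample("D").mean(), daily], axis=1)
print(combined)
```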
Likewise, analysis of text from social media requires a decision-making process to filter out suitable search terms.
Finally, there are a series of concerns about how privacy is ensured with big data, given that often there are elements of big data which can be sensitive in nature (either to the individual or commercially). This is made more complicated as each country will have different regulations about data privacy, which poses particular challenges for institutions working across national boundaries, like the World Bank.
For the World Bank, the use of big data is seen to have potential for improving and changing the international development sector. Underpinning the ideas of the World Bank's approach to big data is the recognition that while the technological capacities for generation, storage, and processing of data continue to develop, this also needs to be accompanied by institutional capabilities to enable big data analysis to contribute to effective actions that can contribute to development, whether this is through strengthening of warning systems, raising awareness, or developing understanding of social systems or behaviors.
The World Bank has begun to consider an underlying conceptual framework around the use of big data, in particular considering the challenges it presents in terms of using big data for development. In the report Big Data in Action for Development, it is acknowledged that there is great potential for big data to provide a valuable input for designing effective development policy recommendations but also that big data is no panacea (Coppola et al. 2014). The World Bank has made clear efforts to engage with the use of big data and has begun to explore areas of clear potential for big data use. However, questions remain about how it can support countries to take ownership and create, manage, and maintain their own data, contributing to their own development agendas in effective ways.

Cross-References

▶ International Development
▶ United Nations Educational, Scientific and Cultural Organization (UNESCO)

Further Reading

Coppola, A., Calvo-Gonzalez, O., Sabet, E., Arjomand, N., Siegel, R., Freeman, C., & Massarat, N. (2014). Big data in action for development. Washington, DC: World Bank and Second Muse. Available at: http://live.worldbank.org/sites/default/files/Big%20Data%20for%20Development%20Report_final%20version.pdf.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–31.
Lehdonvirta, V., & Ernkvist, M. (2011). Converting the virtual economy into development potential: Knowledge map of the virtual economy. InfoDev/World Bank White Paper, 1, 5–17.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–66.
United Nations Global Pulse. (2012). Big data for development: Challenges & opportunities. New York: UN.
Zappos

Jennifer J. Summary-Smith
Florida SouthWestern State College, Fort Myers, FL, USA
Culver-Stockton College, Canton, MO, USA

As one of the largest online retailers of shoes, Zappos (derived from the Spanish word zapatos, meaning shoes) is a company that is setting an innovative trend in customer service and management style. According to Zappos' website, one of its primary goals is to provide the best online service. The company envisions a world where online customers will make 30% of all retail transactions in the United States. Zappos hopes to be the company that leads the market in online sales, setting itself apart from other online retail competitors by offering the best customer service and selection.

History of the Company

Zappos was founded in 1999 by Nick Swinmurn, who developed the idea for the company while walking around a mall in San Francisco, California, looking for a pair of shoes. After spending an hour in the mall searching from store to store for the right color and shoe size, he left the mall empty handed and frustrated. Upon arriving home, Swinmurn turned to the Internet to continue his search for his preferred shoes, which again was unsuccessful. Swinmurn realized that there were no major online retailers specializing in shoes. It was at this point that Swinmurn decided to quit his full-time job and start an online shoe retailer named Zappos. Over time the company has evolved, focusing on making the speed of its customers' online purchases central to its business model. In order to achieve this, Zappos warehouses everything it sells. As the company grew, it reached new heights in 2009 when Zappos and Amazon joined forces, combining their passion for strong customer service. Since then, Zappos has grown significantly and restructured into ten separate companies.

Security Breach

Unfortunately, Zappos has not been without a few missteps. In 2012, the company experienced a security breach compromising as many as 24 million customers. Ellen Messmer reports that cyberhackers successfully gained access to the company's internal network and systems. To address this security breach, Zappos CEO Tony Hsieh announced that existing customer passwords would be terminated as a result of the breach. Still, the cyberhackers likely gained access to names, phone numbers, the last four digits of credit card numbers, cryptographically scrambled passwords, email, billing information, and shipping addresses.
After Zappos CEO Tony Hsieh posted an open letter explaining the breach and how the company would head off resulting problems, there were mixed responses to how the company had handled the situation. As part of its response to the breach, the company sent out emails informing its customers of the problem and urging them to change their passwords. Zappos also provided an 800-number phone service to its customers, helping them through the process of choosing a new password.
However, some experts familiar with the online industry have criticized the moves by Zappos. In an article by Ellen Messmer, she interviewed an Assistant Professor of Information Technology from the University of Notre Dame, who argued that the response strategy by Zappos was not appropriate. Professor John D'Arcy posits that the company's decision to terminate customers' passwords promotes a panic mode, creating a sense of panic in its customers. In contrast, other analysts claim that Zappos' public response to the situation was the right move, communicating to its customers publicly.
Nevertheless, Zappos did a good job of getting the information about the security breach out to the public as soon as possible, according to Professor John D'Arcy. This typically benefits the customers, creating favorable reactions. In terms of the cost of security breaches, the Ponemon Institute estimates that, on average, a data breach costs $277 per compromised record.

Lawsuits

After the security breach, dozens of lawsuits were filed. Zappos attempted to send the lawsuits to arbitration, citing its user agreement. In the fall of 2012, a federal court struck down Zappos.com's user agreement, according to Eric Goldman. Eric Goldman is a professor of law at Santa Clara University School of Law who writes about Internet law, intellectual property, and advertising law. He states that Zappos made mistakes that are easily avoidable. The courts typically divide user agreements into one of three groups: "clickwraps" or "click-through agreements," "browsewraps," and "clearly not a contract." Eric Goldman argues that click-through agreements are effective in courts, unlike browsewraps. Browsewraps are user agreements that bind users simply for browsing the website. The courts ruled that Zappos presented its user agreement as a browsewrap. Furthermore, Zappos claimed on its website that the company reserved the right to amend the contract whenever it saw fit. Despite other companies using this language online, it is detrimental to a contract. The courts ruled that Zappos can amend the terms of the user agreement at any time, making the arbitration clause susceptible to change as well. This makes the clause unenforceable. Eric Goldman posits that the court ruling left Zappos in a bad position because all of its risk management provisions are ineffective. In other words, losing the contract left Zappos without the following: its waiver of consequential damages, its disclaimer of warranties, its clause restricting class actions in arbitration, and its reduced statute of limitations. Conversely, companies that use click-through agreements and remove clauses that state they can amend the contract unilaterally are in a better legal position, according to Eric Goldman.

Holacracy

Zappos CEO Tony Hsieh announced in November 2013 that his company would be implementing the management style known as Holacracy. With Holacracy, there are two key elements that Zappos will follow: distributed authority and self-organization. According to an article by Nicole Leinbach-Reyhle, distributed authority allows employees to evolve the organization's structure by responding to real-world circumstances. In regard to self-organization, employees have the authority to engage in useful action to express their purpose as long as it does not "violate the domain of another role." There is a common misunderstanding that Holacracy is nonhierarchical when in fact it is strongly hierarchical, distributing power within the organization. This approach to management creates an atmosphere where employees can speak up, evolving
into leaders rather than followers. Zappos CEO Tony Hsieh states that he is trying to structure Zappos less like a bureaucratic corporation and more like a city, resulting in increased productivity and innovation. To date, with 1,500 employees, Zappos is the largest company to adopt the management model of Holacracy.

Innovation

The work environment at Zappos has become known for its unique corporate culture, which incorporates fun and humor into daily work. As stated on Zappos.com, the company has a total of ten core values: "deliver WOW through service, embrace and drive change, create fun and a little weirdness, be adventurous, creative, and open-minded, pursue growth and learning, build open and honest relationships with communication, build a positive team and family spirit, do more with less, be passionate and determined, and be humble." Nicole Leinbach-Reyhle writes that Zappos' values help to encourage its employees to think outside of the box.
To date, Zappos is a billion-dollar online retailer, expanding beyond selling shoes. The company is also making waves in its corporate culture and hierarchy. Additionally, information technology plays a huge role in the corporation, serving its customers and the business. Based upon the growing success of Zappos, it is keeping true to its mission statement "to provide the best customer service possible." It is evident that Zappos will continue to make positive changes for the corporation and its corporate headquarters in Las Vegas. In 2013, Zappos CEO Tony Hsieh committed $350 million to rebuild and renovate the downtown Las Vegas region. As Sara Corbett notes in her article, he hopes to change the area into a start-up fantasyland.

Cross-References

▶ Ethical and Legal Issues

Further Reading

Corbett, S. (n.d.). How Zappos' CEO turned Las Vegas into a startup fantasyland. http://www.wired.com/2014/01/zappos-tony-hsieh-las-vegas/.
Goldman, E. (n.d.). How Zappos' user agreement failed in court and left Zappos legally naked. http://www.forbes.com/sites/ericgoldman/2012/10/10/how-zappos-user-agreement-failed-in-court-and-left-zappos-legally-naked/. Accessed Jul 2014.
Leinbach-Reyhle, N. (n.d.). Shedding hierarchy: Could Zappos be setting an innovative trend? http://www.forbes.com/sites/nicoleleinbachreyhle/2014/07/15/shedding-hierarchy-could-zappos-be-setting-an-innvoative-trend/. Accessed Jul 2014.
Messmer, E. (n.d.). Zappos data breach response a good idea or just panic mode? Online shoe and clothing retailer Zappos has taken assertive steps after breach, but is it enough? http://www.networkworld.com/article/2184860/malware-cybercrime/zappos-data-breach-response-a-good-idea-or-just-panic-mode-.html. Accessed Jul 2014.
Ponemon Group. (n.d.). 2013 cost of data breach study: Global analysis. http://www.ponemon.org. Accessed Jul 2014.
Zappos. (n.d.). http://www.zappos.com. Accessed Jul 2014.

Zillow

Matthew Pittman and Kim Sheehan
School of Journalism & Communication, University of Oregon, Eugene, OR, USA

Overview and Business Model

Like most industries, real estate is undergoing dynamic shifts in the age of big data. Real estate information, once in the hands of a few agents or title companies, is being democratized for any and all interested consumers. What were previously physical necessities – real estate agents, showings, and physical homes – are being obsolesced by digital platforms like Zillow. Real estate developers can use technology to track how communities flow and interact with one another, which will help build smarter, more efficient neighborhoods in the future.
The companies that succeed in the future will be the ones that, like Zillow, find innovative, practical, and valuable ways to navigate and harness the massive amounts of data being produced in and around their field.
Founded in Seattle in 2005, Zillow is a billion-dollar real estate database that uses big data to help consumers learn about home prices, rent rates, market trends, and more. It provides estimates for most housing units in the United States. It acquired its closest competitor, Trulia, in 2014 for $3.5 billion, and it is the most-viewed real estate destination in the country. Now with Trulia, it accounts for 48% of Web traffic for real estate listings, though that number is diminished to around 15% if you factor in individual realtor sites and local MLS (multiple listing service) listings. The company's chief economist Stan Humphries created a tool that processes 1.2 million proprietary statistical models three times per week on the county and state real estate data it is constantly gathering. In 2011, the company shifted from an in-house computer cluster to renting space in the Amazon cloud to help with the massive computing load.
On the consumer side, Zillow is a web site or mobile app that is free to use. Users can enter a city or zip code and search, filtering out home types, sizes, or prices that are undesirable. There are options to see current homes for sale, recently sold properties, foreclosures, rental properties, and even Zillow "zestimates" (the company's signature feature) of a home's current value based on similar homes in the area, square footage, amenities, and more. Upon clicking on a house of interest, the user can see a real estate agent's description of the home, how long it has been on the market – along with any price fluctuations – as well as photos, similarly priced nearby houses, proposed mortgage rates on the home, the agents associated with it, the home's sale history, and facts and features.
Zillow makes money from real estate firms and agents that advertise through the site and by providing subscriptions to real estate professionals. It can charge more for ads that appear during a search for homes in Beverly Hills than in Bismarck, North Dakota. Some 57,000 agents spend an average of $4,000 every year for leads to get new buyers and sellers. Zillow keeps a record of how many times a listing has been viewed, which may help negotiate the price among agents, buyers, and sellers. Real estate agents can subscribe to silver, gold, or platinum programs to get CRM (customer relationship management) tools, their photo in listings, a web site, and more. Basic plans start at 10 dollars a month.
Zillow's mortgage marketplace also earns the company revenue. Potential homebuyers can find and engage with mortgage brokers and firms. The mortgage marketplace shows potential buyers what their monthly payment would be and how much they can afford, and lets them submit loan requests and get quotes from various lenders. In the third quarter of 2013, Zillow's mortgage marketplace received 5.9 million loan requests from borrowers (more than in all of 2011), which grew its revenue stream 120% to $5.7 million. A majority of Zillow's revenue comes from the real estate segment that lets users browse homes for sale and for rent; this earned the company over $35 million in 2013's third quarter.
Analysts and shareholders have voiced some concerns over Zillow's business model. Zillow now spends over 70% of its revenues on sales and marketing, as opposed to 33% for LinkedIn and between 21% and 23% for IBM and Microsoft. Spending money on television commercials and online ads for its services seems to have diminishing returns for Zillow, which is spending more and more on marketing for the same net profit. What once seemed like a sure-fire endeavor – making money by connecting customers to agents through relevant and concise management of huge amounts of data – is no longer a sure thing. Zillow will have to continually evolve its business model if it is to stay afloat.

Zillow and the Real Estate Industry

Zillow has transformed the real estate industry by finding new and practical ways to make huge amounts of data accessible to common people.
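The comps-based idea behind an automated estimate such as the zestimate described above can be illustrated with a toy calculation: take recently sold homes near the subject property, compute their price per square foot, and scale by the subject's size. This is only a sketch of the general approach with invented figures; Zillow's actual models are proprietary and far more elaborate.

```python
import pandas as pd

# Invented recent sales of comparable homes ("comps") in the same area.
comps = pd.DataFrame(
    {
        "sale_price": [405_000, 372_000, 448_000, 390_000],
        "sqft": [1800, 1650, 2000, 1720],
    }
)

subject_sqft = 1850

# Naive estimate: median price per square foot of the comps, scaled to the subject.
price_per_sqft = (comps["sale_price"] / comps["sqft"]).median()
estimate = price_per_sqft * subject_sqft
print(f"toy estimate: ${estimate:,.0f}")
```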
Potential buyers no longer need to contact a real estate agent before searching for homes – they can start a detailed search on just about any house in the country from their own mobile or desktop device. This is empowering for consumers, but it shakes up an industry that has long relied on human agents. These agents made it their business to know specific areas, learn the ins and outs of a given community, and then help connect interested buyers to the right home. Sites that give users a tool to peer into huge amounts of data (like Zillow) are useful to a point, but some critics feel only a human being who is local and present in a community can really serve potential buyers.
Because it takes an aggregate of multiple national and MLS listing sites, Zillow is rarely perfect. Any big data computing service that works with offline or subjective entities – and real estate prices certainly fit this description – will have to make logical (some would say illogical) leaps where information is scarce. When Zillow does not have exact or current data on a house or neighborhood, it "guesses." When prices come in too high, sellers have unrealistic expectations of the potential price of their home. Buyers, too, may end up paying more for a home than it is actually worth.
A human expert (real estate agent) has traditionally been the authority in this area, yet people are still surprised when too much stock is put into an algorithm. Zillow zestimates tend to work best for midrange homes in an area where there are plenty of comparable houses. Zestimates are less accurate for low- and high-end homes because there are fewer comps (comparable houses for sale or recently sold). Similarly, zestimates of rural, unique, or fixer-upper homes are difficult to gauge. Local MLS sites may have more detail on a specific area, but Zillow has broader, more general information over a larger area. The company estimates its coverage of American homes to be around 57%.
Real estate data is more difficult to come by in some areas. Texas doesn't provide public records of housing transaction prices, so Zillow had to access sales data from property databases through real estate brokers. Because of the high number of cooperative buildings, New York City is another difficult area in which to gauge real estate prices. Tax assessments are made on the co-ops, not the individual units, which negates that factor in zestimate calculations. Additional information, like square footage or amenities, is also difficult to come by, forcing Zillow to seek out alternative sources.
Of course, zestimates can be accurate as well. As previously noted, when the house is midrange and in a neighborhood with plenty of comps (and thus plenty of data), zestimates can be very good indicators of the home's actual worth. As Zillow zestimates – and the sources from which to draw factoring information – continue to evolve, the service may continue growing in popularity. The more popular Zillow becomes, the more incentive real estate agents will have to list all of their housing database information with the service. Agents know that, in a digital society, speed is key: 74% of buyers and 76% of sellers will work with the first agent with whom they talk.
Recently, Zillow has recognized a big shift to mobile: about 70% of Zillow's usage now occurs on mobile platforms. This trend is concurrent with other platforms' shift to mobile usage; Facebook, Instagram, Zynga, and others have begun to recognize and monetize users' access from smartphones and tablets. For real estate, this mobile activity is about more than just convenience: users can find information on homes in real time as they drive around a neighborhood, looking directly at the potential homes, and contact the relevant agent before they get home. This sort of activity bridges the traditional brick-and-mortar house hunting of the past with the instant big data access of the future (and increasingly, the present). Zillow has emerged as a leader in its field of real estate by connecting its customers not just to big data but to the right data at the right time and place.

Cross-References

▶ E-Commerce
lation and the US housing price cycle between 2000
Arribas-Bel, D. (2014). Accidental, open and everywhere: and 2009. Journal of Urban Economics, 71(1), 93–99.
Emerging data sources for the understanding of cities. Wheatley, M. (n.d.). Zillow-Trulia merger will create bound-
Applied Geography, 49, 45–53. less new big data opportunities. http://siliconangle.com/
Cranshaw, J., Schwartz, R., Hong, J.I., Sadeh, blog/2014/07/31/zillow-trulia-merger-will-create-bound
N.M. (2012). The livelihoods project: Utilizing social less-new-big-data-opportunities/. Accessed on Sept
media to understand the dynamics of a city. In ICWSM. 2014.
Hagerty, J. R.(2007). How good are Zillow’s estimates?
Wall Street Journal.
