American Library Association

David Brown
Southern New Hampshire University, University of Central Florida College of Medicine, Independent Consultant, Huntington Beach, CA, USA

The American Library Association (ALA) is a voluntary organization that represents libraries and librarians around the world. Worldwide, the ALA is the largest and oldest professional organization for libraries, librarians, information science centers, and information scientists. The association was founded in 1876 in Philadelphia, Pennsylvania. Since its inception, the ALA has provided leadership for the development, promotion, and improvement of libraries, information access, and information science. The ALA is primarily concerned with learning enhancement and information access for all people. The organization strives to advance the profession through its initiatives and divisions. The primary action areas for the ALA are advocacy, education, lifelong learning, intellectual freedom, organizational excellence, diversity, equitable access to information and services, expansion of all forms of literacy, and library transformation to maintain relevance in a dynamic and increasingly digitalized global environment. While the ALA is composed of several different divisions, there is no single division devoted exclusively to big data. Rather, a number of different divisions are working to develop and implement policies and procedures that will enhance the quality of, the security of, the access to, and the utility of big data.

ALA Divisions Working with Big Data

At this time, the Association of College & Research Libraries (ACRL) is the primary division of the ALA concerned with big data issues. The ACRL has published a number of papers, guides, and articles related to the use of, the promise of, and the risks associated with big data. Several other ALA divisions are also involved with big data. The Association for Library Collections & Technical Services (ALCTS) division addresses issues related to the management, organization, and cataloging of big data and its sources. The Library Information Technology Association (LITA) is an ALA division involved with the technological and user services activities that advance the collection, access, and use of big data and big data sources.

Big Data Activities of the Association of College & Research Libraries (ACRL)

The Association of College & Research Libraries (ACRL) is actively involved with the opportunities and challenges presented by big data. As science and technology advance, our world becomes more and more connected and linked. These links in and of themselves may be considered big data, and much of the information that they transmit is big data. Within the ACRL, big data is conceptualized in terms of the three Vs: its volume, its velocity, and its variety. Volume refers to the tremendously large size of big data. However, the ACRL stresses that the size of a data set is a function of the particular problem one is investigating, and size is only one attribute of big data. Velocity refers to the speed at which data is generated, needed, and used. As new information is generated exponentially, the need to catalogue, organize, and develop user-friendly means of accessing these big data grows even faster. The utility of big data is a function of the speed at which it can be accessed and used. For maximum utility, big data needs to be accurately catalogued, interrelated, and integrated with other big data sets. Variety refers to the many different types of data that are typically components of, and are integrated into, big data. Traditionally, data sets consist of a relatively small number of data types, such as word-processed documents, graphs, and pictures. Big data, on the other hand, is typically concerned with many additional types of information such as emails, audio and videotapes, sketches, artifacts, data sets, and many other kinds of quantitative and qualitative data. In addition, big data is usually presented in many different languages, dialects, and tones. A key point that the ACRL stresses is that as disciplines advance, the need for and the value of big data will increase. However, this advancement can be facilitated or inhibited by the degree to which the big data can be accessed and used. Within this context, librarians who are also information scientists are, and will continue to be, invaluable resources that can assist with the collection, storage, retrieval, and utilization of big data. Specifically, the ACRL anticipates needs for specialists in the areas of big data management, big data security, big data cataloguing, big data storage, big data updating, and big data access.

Conclusion

The American Library Association and its member libraries, librarians, and information scientists are involved in shaping the future of big data. As disciplines and professions continue to advance with big data, the skills of librarians and information scientists need to advance as well, enabling them to provide valuable resources for strategists, decision-makers, policy-makers, researchers, marketers, and many other big data users. The ability to use big data effectively will be a key to success as the world economy and its data sources expand. In this rapidly evolving environment, the work of the ALA will be highly valuable and an important resource for business, industry, government, academic and research planners, decision-makers, and program evaluators who want and need to use big data.

Cross-References

▶ Automated Modeling/Decision Making
▶ Big Data Curation
▶ Big Data Quality
▶ Data Preservation
▶ Data Processing
▶ Data Storage
▶ Digital Libraries

Further Readings

American Library Association. About ALA. http://www.ala.org/aboutala/. Accessed 10 Aug 2014.
American Library Association. Association for Library Collections and Technical Services. http://www.ala.org/alcts/. Accessed 10 Aug 2014.
American Library Association. Library Information Technology Association (LITA). http://www.ala.org/lita/. Accessed 10 Aug 2014.
Bieraugel, M. Keeping up with... big data. American Library Association. http://www.ala.org/acrl/publications/keeping_up_with/big_data. Accessed 10 Aug 2014.
Carr, P. L. (2014). Reimagining the library as a technology: An analysis of Ranganathan's five laws of library science within the social construction of technology framework. The Library Quarterly, 84(2), 152–164.
Federer, L. (2013). The librarian as research informationist: A case study. Journal of the Medical Library Association, 101(4), 298–302.
Finnemann, N. O. (2014). Research libraries and the Internet: On the transformative dynamic between institutions and digital media. Journal of Documentation, 70(2), 202–220.
Gordon-Murnane, L. (2012). Big data: A big opportunity for librarians. Online, 36(5), 30–34.

Anonymization Techniques

Mick Smith¹ and Rajeev Agrawal²
¹North Carolina A&T State University, Greensboro, NC, USA
²Information Technology Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, USA

Synonyms

Anonymous data; Data anonymization; Data privacy; De-identification; Personally identifiable information

Introduction

Personal information is constantly being collected on individuals as they browse the internet or share data electronically. This collection of information has been further exacerbated with the emergence of the Internet of Things and the connectivity of many electronic devices. As more data is disseminated into the world, interconnected patterns are created connecting one data record to the next. The massive data sets that are collected are of great value to businesses and data scientists alike. To properly protect the privacy of these individuals, it is necessary to de-identify or anonymize the data. In other words, personally identifiable information (PII) needs to be encrypted or altered so that a person's sensitive data remains indiscernible to outside sources while staying readable to pre-approved parties. Some popular anonymization techniques include noise addition, differential privacy, k-anonymity, l-diversity, and t-closeness.

The need for anonymizing data has come with the availability of data through big data. Cheaper storage, improved processing capabilities, and a greater diversity of analysis techniques have created an environment in which big data can thrive. This has allowed organizations to collect massive amounts of data on their customer/client base. This information in turn can be subjected to a variety of business intelligence applications so as to improve the efficiency of the collecting organization. For instance, a hospital can collect various patient health statistics over a series of visits. This information could include vital statistics measurements, family history, frequency of visits, test results, or any other health-related metric. All of this data could be analyzed to provide the patient with an improved plan of care and treatment, ultimately improving the patient's overall health and the facility's ability to provide a diagnosis.

However, the benefits that can be realized from the analysis of massive amounts of data come with the responsibility of protecting the privacy of the entities whose data is collected. Before the data is released, or in some instances analyzed, the sensitive personal information needs to be altered. The challenge comes in deciding upon a method that can achieve anonymity while preserving the integrity of the data.

Noise Addition

The premise of noise addition is that by adding noise to a data set, the data becomes ambiguous and the individual subjects cannot be identified. The noise refers to the skewing of an attribute so that it is displayed as a value within a range. For instance, instead of giving one static value for a person's age, it could be adjusted by ±2 years. If the subject's age is displayed as 36, the observer would not know the exact value, only that the age may be between 34 and 38. The challenge with this technique comes in identifying the appropriate amount of noise. There needs to be enough to mask the true attribute value, while at the same time preserving the data mining relationships that exist within the dataset.

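To make the idea concrete, here is a minimal sketch of attribute-level noise addition in Python, assuming a small in-memory list of records; the field names and the ±2-year spread mirror the age example above and are purely illustrative.

```python
import random

def add_noise(records, attribute, spread=2):
    """Perturb a numeric attribute by a random offset drawn from [-spread, +spread].

    The true value of each record is masked, but because the noise is
    zero-centered, aggregate patterns such as the average age are roughly preserved.
    """
    noisy = []
    for row in records:
        perturbed = dict(row)  # leave the original record untouched
        perturbed[attribute] = row[attribute] + random.randint(-spread, spread)
        noisy.append(perturbed)
    return noisy

# A subject whose true age is 36 may be published as anything from 34 to 38.
patients = [{"id": 1, "age": 36}, {"id": 2, "age": 52}]
print(add_noise(patients, "age"))
```
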
Differential Privacy

Differential privacy is similar to the noise addition technique in that the original data is altered slightly to prevent any de-identification. However, it is done in such a manner that if a query is run on two databases that differ in only one row, the information contained in the missing row is not discernible. Cynthia Dwork provides the following definition:

A randomized function K gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(K),

Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S]

As an example, think of a database containing the incomes of 75 people in a neighborhood, where the average income is $75,000. If one person were to leave the neighborhood and the average income dropped to $74,000, it would be easy to identify the income of the departing individual. To overcome this, it would be necessary to apply a minimum amount of noise so that the average income before and after does not reveal the change, while at the same time the computational integrity of the data is maintained. The amount of noise, and whether an exponential or Laplacian mechanism is used, remains the subject of ongoing research and discussion.

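A minimal sketch of the Laplacian mechanism mentioned above, applied to a bounded mean query such as the neighborhood income example; the clipping bounds, the epsilon value, and the sensitivity calculation are illustrative assumptions rather than a production-ready implementation.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from a Laplace(0, scale) distribution via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_mean(values, lower, upper, epsilon):
    """Release the mean of a bounded numeric attribute with epsilon-differential privacy.

    After clipping to [lower, upper], changing any single record shifts the mean
    by at most (upper - lower) / n, which is used as the query's sensitivity.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon)

# The released average hides whether any one household is present in the data.
incomes = [52_000, 61_000, 75_000, 88_000, 99_000]
print(private_mean(incomes, lower=0, upper=150_000, epsilon=0.5))
```
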
K-Anonymity

In the k-anonymity algorithm, two common methods for anonymizing data are suppression and generalization. With suppression, the values of a categorical variable, such as name, are removed entirely from the data set. With generalization, quantitative variables, such as age or height, are replaced with a range. This in turn makes each record in a data set indistinguishable from at least k–1 other records. One of the major drawbacks of k-anonymity is that it may be possible to infer identity if certain characteristics are already known. As a simple example, consider a data set that contains credit decisions from a bank (Table 1). The names have been omitted, the ages categorized, and the last two digits of the zip code removed.

Anonymization Techniques, Table 1  K-anonymity credit example

Age     Gender   Zip      Credit decision
18–25   M        149**    Yes
18–25   M        148**    No
32–39   F        149**    Yes
40–47   M        149**    Yes
25–32   F        148**    No
32–39   M        149**    Yes

This deliberately obvious example demonstrates the weakness of a potential homogeneity attack on k-anonymity. In this case, if it were known that a 23-year-old man living in zip code 14999 was in this data set, the credit decision for that particular individual could be inferred.

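A brief sketch of how the generalization in Table 1 can be checked against a chosen k; the age-banding and zip-truncation helpers below are simplified assumptions for illustration, not a published anonymization algorithm.

```python
from collections import Counter

def generalize(record):
    """Map a raw record to its quasi-identifier equivalence class:
    ages are bucketed into 7-year bands and zip codes are truncated to three digits."""
    band_start = 18 + ((record["age"] - 18) // 7) * 7
    age_band = f"{band_start}-{band_start + 7}"
    return (age_band, record["gender"], record["zip"][:3] + "**")

def is_k_anonymous(records, k):
    """True if every equivalence class of quasi-identifiers contains at least k records."""
    classes = Counter(generalize(r) for r in records)
    return all(count >= k for count in classes.values())

applicants = [
    {"age": 23, "gender": "M", "zip": "14901", "credit": "Yes"},
    {"age": 24, "gender": "M", "zip": "14905", "credit": "Yes"},
    {"age": 35, "gender": "F", "zip": "14890", "credit": "No"},
]
# False: the (32-39, F, 148**) class contains a single record that can be singled out.
print(is_k_anonymous(applicants, k=2))
```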

L-Diversity

L-diversity can be viewed as an extension of k-anonymity in which the goal is to anonymize specific sensitive values of a data record. For instance, in the previous example, the sensitive information would be the credit decision. As with k-anonymity, generalization and suppression techniques are used to mask the true values of the target variable. The authors of the l-diversity principle, Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam, define it as follows:

A q*-block is l-diverse if it contains at least l well-represented values for the sensitive attribute S. A table is l-diverse if every q*-block is l-diverse.

The concept of "well-represented" has been defined in three possible ways: distinct l-diversity, entropy l-diversity, and recursive (c, l)-diversity. A criticism of the l-diversity model is that it does not hold up well when the sensitive value has a minimal number of states. As an example, consider the credit decision table above. If that table were extended to include 1,000 records and 999 of them had a decision of "yes," then l-diversity would not be able to provide sufficient equivalence classes.

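The simplest of the three notions, distinct l-diversity, can be checked in a few lines; the quasi-identifier and sensitive column names below are assumptions that follow the credit example.

```python
from collections import defaultdict

def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every q*-block (equivalence class over the quasi-identifiers)
    contains at least l distinct values of the sensitive attribute."""
    blocks = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        blocks[key].add(record[sensitive])
    return all(len(values) >= l for values in blocks.values())

rows = [
    {"age": "18-25", "zip": "149**", "credit": "Yes"},
    {"age": "18-25", "zip": "149**", "credit": "No"},
    {"age": "32-39", "zip": "149**", "credit": "Yes"},
    {"age": "32-39", "zip": "149**", "credit": "Yes"},
]
# False: the (32-39, 149**) block holds only "Yes" decisions, so its value can be inferred.
print(is_distinct_l_diverse(rows, ["age", "zip"], "credit", l=2))
```
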
T-Closeness

Continuing the refinement of de-identification techniques, t-closeness is an extension of l-diversity. The goal of t-closeness is to create equivalence classes that approximate the original distribution of the attributes in the initial database. Privacy can be considered a measure of information gain, and t-closeness takes this characteristic into consideration by assessing an observer's prior and posterior beliefs about the content of a data set as well as the influence of the sensitive attribute. As with l-diversity, this approach hides the sensitive values within a data set while maintaining association through "closeness." The algorithm uses a distance metric known as the Earth Mover's Distance to measure the level of closeness, which takes into consideration the semantic interrelatedness of the attribute values. However, it should be noted that the distance metric may differ depending on the data type; numerical, equal, and hierarchical distance measures are all used.

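For an ordered numerical attribute, the Earth Mover's Distance reduces to a normalized sum of cumulative differences, which is enough to sketch a t-closeness check; the salary values and the threshold t below are illustrative assumptions.

```python
from collections import Counter

def emd_ordered(block_values, table_values, domain):
    """Earth Mover's Distance between the sensitive-value distribution of one
    equivalence class and that of the whole table, for an ordered attribute."""
    p = Counter(block_values)
    q = Counter(table_values)
    carried, total = 0.0, 0.0
    for value in domain:  # domain must be sorted in its natural order
        carried += p[value] / len(block_values) - q[value] / len(table_values)
        total += abs(carried)
    return total / (len(domain) - 1)

def is_t_close(block_values, table_values, domain, t):
    """The block satisfies t-closeness if its distribution lies within t of the table's."""
    return emd_ordered(block_values, table_values, domain) <= t

salaries = [3, 4, 5, 6, 7, 8, 9, 10, 11]   # sensitive attribute over the whole table
low_block = [3, 4, 5]                      # one equivalence class concentrating low salaries
# The EMD here works out to 0.375, so the block violates t-closeness for t = 0.15.
print(is_t_close(low_block, salaries, sorted(set(salaries)), t=0.15))
```
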
Conclusion

To be effective, each anonymization technique should protect against the following risks: singling out, linkability, and inference. Singling out is the process of isolating data that could identify an individual. Linkability occurs when two or more records in a data set can be linked to either an individual or a group of individuals. Finally, inference is the ability to determine the value of the anonymized data through the values of other elements within the set. An anonymization approach that can mitigate these risks should be considered robust and will reduce the possibility of re-identification. Each of the techniques presented addresses these risks differently; Table 2 outlines their respective performance.

Anonymization Techniques, Table 2  Anonymization algorithm comparison

Technique              Singling out   Linkability   Inference
Noise addition         At risk        Possibly      Possibly
K-anonymity            Not at risk    At risk       At risk
L-diversity            Not at risk    At risk       Possibly
T-closeness            Not at risk    At risk       Possibly
Differential privacy   Possibly       Possibly      Possibly

For instance, unlike k-anonymity, l-diversity and t-closeness are not subject to inference attacks that exploit the homogeneity or background knowledge of the data set. Similarly, the three generalization techniques (k-anonymity, l-diversity, and t-closeness) all present differing levels of association that can be made due to the clustering nature of each approach.

As with any aspect of data collection, sharing, publishing, and marketing, there is the potential for malicious activity. However, the benefits that can be achieved from the analysis of such data cannot be overlooked. It is therefore extremely important to mitigate such risks through the use of effective de-identification techniques so as to protect sensitive personal information. As the amount of data becomes more abundant and accessible, there is an increased importance to continuously modify and refine existing anonymization techniques.

Further Reading

Dwork, C. (2006). Differential privacy. In Automata, languages and programming. Berlin: Springer.
Li, N., et al. (2007). t-Closeness: Privacy beyond k-anonymity and l-diversity. IEEE 23rd International Conference on Data Engineering.
Machanavajjhala, A., et al. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), Article 3, 1–12.
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5).
The European Parliament and of the Council Working Party. (2014). Opinion 05/2014 on anonymisation techniques. http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf. Retrieved 29 Dec 2014.

Archaeology

Stuart Dunn
Department of Digital Humanities, King's College London, London, UK

Introduction

In one sense, archaeology deals with the biggest dataset of all: the entire material record of human history, from the earliest human origins c. 2.2 million years Before Present (BP) to the present day. However, this dataset is, by its nature, incomplete, fragmentary, and dispersed. Archaeology therefore brings a very particular kind of challenge to the concept of big data. Rather than real-time analyses of the shifting digital landscape of data produced by the day-to-day transactions of millions of people and billions of devices, approaches to big data in archaeology refer to the sifting and reverse-engineering of masses of data derived from both primary and secondary investigation into the history of material culture.

Big Data and the Archaeological Research Cycle

Whether derived from excavation, post-excavation analysis, experimentation, or simulation, archaeologists have only tiny fragments of the "global" dataset that represents the material record, or even the record of any specific time period or region. If one takes any definition of "Big Data" as it is generally understood, a corpus of information which is too massive for desktop-based or manual analysis or manipulation, no single archaeological dataset is likely to have these attributes of size and scale. The significance of Big Data for archaeology lies not so much in the analysis and manipulation of single or multiple collections of vast datasets, but rather in the bringing together of multiple data, created at different times, for different purposes, and according to different standards, and in the interpretive and critical frameworks needed to create knowledge from them. Archaeology is "Big Data" in the sense that it is "data that is bigger than the sum of its parts."

Those parts are massively varied. Data in archaeology can be normal photographic images, images and data from remote sensing, tabular data of information such as artifact findspots, numerical databases, or text. It should also be noted that the act of generating archaeological data is rarely, if ever, the end of the investigation or project. Any dataset produced in the field or the lab typically forms part of a larger interpretation and interpolation process and – crucially – archaeological data is often not published in a consistent or interoperable manner, although approaches to so-called Grey Literature, which constitutes reports from archaeological surveys and excavations that typically do not achieve a wide readership, are discussed below.

This fits with a general characteristic of Big Data, as opposed to the "e-Science/Grid Computing" paradigm of the 2000s. Whereas the latter was primarily concerned with "big infrastructure," anticipating the need for scientists to deal with a "deluge" of monolithic data emerging from massive projects such as the Large Hadron Collider, as described by Tony Hey and Anne Trefethen, Big Data is concerned with the mass of information which grows organically as the result of the ubiquity of computing in everyday life and in everyday science. In the case of archaeology, it may be considered more of a "complexity deluge," where small data, produced on a daily basis, forms part of a bigger picture.

There are exceptions: some individual projects in archaeology are concerned with terabyte-scale data. The most obvious example in the UK is North Sea Paleolandscapes, led by the University of Birmingham, a project which has reconstructed the Early Holocene landscape of the bed of the North Sea, which was an inhabitable landscape until its inundation between 20,000 and 8,000 BP – so-called Doggerland. As Vince Gaffney and others describe, drawing on 3D seismic data gathered during the process of oil prospection, this project has used large-scale data analytics and visualization to reconstruct the topography of the preinundation land surface spanning an area larger than the Netherlands, and thus to allow inferences as to what environmental factors might have shaped human habitation of it, although it must be stressed that there is no direct evidence at all of that human occupation. While such projects demonstrate the potential of Big Data technologies for conducting large-scale archaeological research, they remain the exception. Most applications in archaeology remain relatively small scale, at least in terms of the volume of data that is produced, stored, and preserved.

However, this is not to say that approaches which are characteristic of Big Data are not changing the picture significantly in archaeology, especially in the field of landscape studies. Data from geophysics, the science of scanning subterranean features using techniques such as magnetometry and resistivity, typically produce relatively large datasets, which require holistic analysis in order to be understood and interpreted. This trend is accentuated by the rise of more sophisticated data capture techniques in the field, which is increasing the capacity of data that can be gathered and analyzed. Although still not "big" in the literal sense of "Big Data," this class of material undoubtedly requires the kinds of approaches in thinking and interpretation familiar from elsewhere in the Big Data agenda. Recent applications in landscape archaeology have highlighted the need both for large capacity and for interoperation. For example, the integration of data in the Stonehenge Hidden Landscapes project, also directed by Gaffney, provides for "seamless" capture of reams of geophysical data from remote sensing, visualizing the Neolithic landscape beneath modern Wiltshire to a degree of clarity and comprehensiveness that would hitherto only have been possible with expensive and laborious manual survey. Due to improved capture techniques, this project succeeded in gathering a quantity of data in its first two weeks equivalent to that of the landmark Wroxeter survey project in the 1990s.

These early achievements of big data in an archaeological context fall against a background of falling hardware costs, lower barriers to usage, and the availability of generic web-based platforms where large-scale distributed research can be conducted. This combination of affordability and usability is bringing about a revolution in applications such as those described above, where remote sensing is reaching new concepts and applications. For example, coverage of freely available satellite imagery is now near-total; graphical resolution is finer for most areas than ever before (1 m or less); and pre-georeferenced satellite and aerial images are delivered to the user's desktop, removing the costly and highly specialized process of locating imagery of the Earth's surface. Such platforms also allow access to imagery of archaeological sites in regions which are practically very difficult or impossible to survey, such as Afghanistan, where declassified CORONA spy satellite data are now being employed to construct inventories of the region's (highly vulnerable) archaeology. If these developments cannot be said to have removed the boundaries within which archaeologists can produce, access, and analyze data, then they have certainly made them more porous.

As in other domains, strategies for the storage and preservation of data in archaeology have a fundamental relationship with relevant aspects of the Big Data paradigm. Much archaeological information lives on the local servers of institutions, individuals, and projects; this has always constituted an obvious barrier to its integration into a larger whole. Weighing against this, however, is the ethical and professional obligation to share, especially in a discipline where the process of gathering the data (excavation) destroys its material context. National strategies and bodies encourage the discharge of this obligation. In the UK, as well as data standards and collections held by English Heritage, the main repository for archaeological data is the Archaeology Data Service (ADS), based at the University of York. The ADS considers for accession any archaeological data produced in the UK in a variety of formats. This includes most of the data formats used in day-to-day archaeological workflows: Geographic Information System (GIS) databases and shapefiles, images, numerical data, and text. In the latter case, particular note should be given to the "Grey Literature" library of archaeological reports from surveys and excavations, which typically present archaeological information and data in a format suitable for rapid publication, rather than for the linking and interoperation of that data. Currently, the Library contains over 27,000 such reports, and the total volume of the ADS's collections stands at 4.5 Tb (I thank Michael Charno for this information). While this could be considered "big" in terms of any collection of data in the humanities, it is not of a scale which would overwhelm most analysis platforms; what is key here, however, is that it is most unlikely to be useful to perform any "global" scale analysis across the entire collection. The individual datasets therein relate to each other only inasmuch as they are "archaeological." In the majority of cases, there is only fragmentary overlap in terms of content, topic, and potential use.

A 2007 ADS/English Heritage report on the challenges of Big Data in archaeology identified four types of data format potentially relevant to Big Data in the field: LIDAR (Light Detection and Ranging, or Laser Imaging Detection and Ranging) data, which models terrain elevation from airborne sensors; 3D laser scanning; maritime survey; and digital video. At first glance this appears to underpin an assumption that the primary focus is data formats which convey larger individual data objects, such as images and geophysics data, with the report noting that "many formats have the potential to be Big Data, for example, a digital image library could easily be gigabytes in size. Whilst many of the conclusions reached here would apply equally to such resources this study is particularly concerned with Big Data formats in use with technologies such as lidar surveys, laser scanning and maritime surveys."

However, the report also acknowledges that "If long term preservation and reuse are implicit goals data creators need to establish that the software to be used or toolsets exist to support format migration where necessary." It is true that any "Big Data" which is created from an aggregation of "small data" must interoperate. In the case of "social data" from mobile devices, for example, location is a common and standardizable attribute that can be used to aggregate Tb-scale datasets: heat maps of mobile device usage can be created which show concentrations of particular kinds of activity in particular places at particular times. In more specific contexts, hashtags can be used to model trends and exchanges between large groups. Similarly intuitive attributes that can be used for interoperation, however, elude archaeological data, although there is much emerging interest in Linked Data technologies, which allow the creation of linkages between web-exposed databases, provided they conform (or can be configured to conform) to predefined specifications in descriptive languages such as RDF. Such applications have proved immensely successful in areas of archaeology concerned with particular data types, such as geodata, where there is a consistent base reference (such as latitude and longitude). However, this raises a question which is fundamental to archaeological data in any sense. Big Data approaches here, even if the data is not "Big" relative to the social and natural sciences, potentially allow an "n=all" picture of the data record.
As noted above, however, this record represents only a tiny fragment of the entire picture. A key question, therefore, is whether "Big Data" thinking risks technological determinism, constraining what questions can be asked. This is a point which has concerned archaeologists since the very earliest days of computing in the discipline. In 1975, a skeptical Sir Moses Finley noted that "It would be a bold archaeologist who believed he could anticipate the questions another archaeologist or a historian might ask a decade or a generation later, as the result of new interests or new results from older researchers. Computing experience has produced examples enough of the unfortunate consequences . . . of insufficient anticipation of the possibilities at the coding stage."

Conclusion

Such questions probably cannot be predicted, but big data is (also) not about predicting questions. The kind of critical framework that Big Data is advancing, in response to the ever-more linkable mass of pockets of information, each themselves becoming larger in size as hardware and software barriers lower, allows us to go beyond what is available "just" from excavation and survey. We can look at the whole landscape in greater detail and at new levels of complexity. We can harvest public discourse about cultural heritage in social media and elsewhere and ask what that tells us about that heritage's place in the contemporary world. We can examine what the fundamental building blocks of our knowledge about the past are, and ask what we gain, as well as lose, by putting them into a form that the World Wide Web can read.

References

Archaeology Data Service. http://archaeologydataservice.ac.uk. Accessed 25 May 2017.
Austin, T., & Mitcham, J. (2007). Preservation and management strategies for exceptionally large data formats: 'Big Data'. York: Archaeology Data Service & English Heritage, 28 Sept 2007.
Gaffney, V., Thompson, K., & Finch, S. (2007). Mapping Doggerland: The Mesolithic landscapes of the Southern North Sea. Oxford: Archaeopress.
Gaffney, C., Gaffney, V., Neubauer, W., Baldwin, E., Chapman, H., Garwood, P., Moulden, H., Sparrow, T., Bates, R., Löcker, K., Hinterleitner, A., Trinks, I., Nau, W., Zitz, T., Floery, S., Verhoeven, G., & Doneus, M. (2012). The Stonehenge Hidden Landscapes Project. Archaeological Prospection, 19(2), 147–155.
Tudhope, D., Binding, C., Jeffrey, S., May, K., & Vlachidis, A. (2011). A STELLAR role for knowledge organization systems in digital archaeology. Bulletin of the American Society for Information Science and Technology, 37(4), 15–18.

Asian Americans Advancing Justice

Francis Dalisay
Communication & Fine Arts, College of Liberal Arts & Social Sciences, University of Guam, Mangilao, GU, USA

Asian Americans Advancing Justice (AAAJ) is a national nonprofit organization founded in 1991. It was established to empower Asian Americans, Pacific Islanders, and other underserved groups, ensuring a fair and equitable society for all. The organization's mission is to promote justice, unify local and national constituents, and empower communities. To this end, AAAJ dedicates itself to developing public policy, educating the public, litigating, and facilitating the development of grassroots organizations. Some of its recent accomplishments have included increasing Asian Americans and Pacific Islanders' voter turnout and access to polls, enhancing immigrants' access to education and employment opportunities, and advocating for greater protections of rights as they relate to the use of "big data."

The Civil Rights Principles for the Era of Big Data

In 2014, AAAJ joined a diverse coalition comprising civil, human, and media rights groups, such as the ACLU, the NAACP, and the Center for Media Justice, to propose, sign, and release the "Civil Rights Principles for the Era of Big Data." The coalition acknowledged that progress and advances in technology would foster improvements in the quality of life of citizens and help mitigate discrimination and inequality. However, because various types of "big data" tools and technologies – namely, digital surveillance, predictive analytics, and automated decision-making – could make it easier for businesses and governments to encroach upon the private lives of citizens, the coalition found it critical that such tools and technologies be developed and employed with the intention of respecting equal opportunity and equal justice.

According to civilrights.org (2014), the Civil Rights Principles for the Era of Big Data proposes five key principles: (1) stop high-tech profiling, (2) guarantee fairness in automated decisions, (3) maintain constitutional protections, (4) enhance citizens' control of their personal information, and (5) protect citizens from inaccurate data. These principles were intended to inform law enforcement, companies, and policymakers about the impact of big data practices on racial justice and the civil and human rights of citizens.

1. Stop high-tech profiling. New and emerging surveillance technologies and techniques have made it possible to piece together comprehensive details on any citizen or group, resulting in an increased risk of profiling and discrimination. For instance, it was alleged that police in New York had used license plate readers to document vehicles that were visiting certain mosques, which allowed the police to track where the vehicles were traveling. The accessibility and convenience of this technology meant that this type of surveillance could happen without policy constraints. The principle of stopping high-tech profiling is thus intended to limit such acts by setting clear limits and establishing auditing procedures for surveillance technologies and techniques.

2. Ensure fairness in automated decisions. Today, computers are responsible for making critical decisions that have the potential to affect the lives of citizens in the areas of health, employment, education, insurance, and lending. For example, major auto insurers are able to use monitoring devices to track drivers' habits, and as a result, insurers could potentially deny the best coverage rates to those who often drive when and where accidents are more likely to occur. The principle of ensuring fairness in automated decisions advocates that computer systems should operate fairly in situations and circumstances such as the one described. The coalition recommended, for instance, that independent reviews be employed to assure that systems are working fairly.

3. Preserve constitutional protections. This principle advocates that government databases must be prohibited from undermining core legal protections, including those concerning citizens' privacy and their freedom of association. Indeed, it has been argued that data from warrantless surveillance conducted by the National Security Agency have been used by federal agencies, including the DEA and the IRS, even though such data were gathered outside the policies that govern those agencies. Individuals with access to government databases could also potentially use them for improper purposes. The principle of preserving constitutional protections is thus intended to limit such instances from occurring.

4. Enhance citizens' control of their personal information. According to this principle, citizens should have direct control over how corporations gather data from them, and how corporations use and share such data. Indeed, personal and private information known and accessible to a corporation can be shared with other companies and the government. For example, unscrupulous companies can find vulnerable customers by accessing and using highly targeted marketing lists, such as one that might contain the names and contact information of citizens who have cancer. In this case, the principle of enhancing citizens' control of personal information ensures that the government and companies should not be able to disclose private information without a legal process for doing so.

5. Protect citizens from inaccurate data. This principle advocates that when it comes to making important decisions about citizens – particularly the disadvantaged (the poor, persons with disabilities, the LGBT community, seniors, and those who lack access to the Internet) – corporations and the government should work to ensure that their databases contain accurate personal information about citizens. Ensuring the accuracy of data could require disclosing the underlying data and granting citizens the right to correct information that is inaccurate. For instance, government employment verification systems have had higher error rates for legal immigrants and individuals with multiple surnames (including many Hispanics) than for other legal workers, which has created a barrier to employment. In addition, some individuals have lost job opportunities because of inaccuracies in their criminal history information, or because their information had been expunged.

The five principles above continue to help inspire subsequent movements highlighting the growing need to strengthen and protect civil rights in the face of technological change. Asian Americans Advancing Justice and the other members of the coalition continue to advocate for these rights and protections.

Cross-References

▶ American Civil Liberties Union
▶ Center for Democracy and Technology
▶ Center for Digital Democracy
▶ National Hispanic Media Coalition

Further Readings

Civil rights and big data: Background material. http://www.civilrights.org/press/2014/civil-rights-and-big-data.html. Accessed 20 June 2016.

Automated Modeling/Decision Making

Murad A. Mithani
School of Business, Stevens Institute of Technology, Hoboken, NJ, USA

Big data promises a significant change in the nature of information processing and, hence, decision making. The general reaction to this trend is that the access and availability of large amounts of data will improve the quality of individual and organizational decisions. However, there are also concerns that our expectations may not be entirely correct. Rather than simplifying decisions, big data may actually increase the difficulty of making effective choices. I synthesize the current state of research and explain how the fundamental implications of big data offer both a promise of improvement and a challenge to our capacity for decision making.

Decision making pertains to the identification of the problem, the understanding of the potential alternatives, and the evaluation of those alternatives to select the ones that optimally resolve the problem. While the promise of big data relates to all aspects of decision making, it more often affects the understanding, the evaluation, and the selection of alternatives. The resulting implications comprise the dual decision model; higher granularity, objectivity, and transparency of decisions; and bottom-up decision making in organizational contexts. I explain each of these implications in detail to illustrate the associated opportunities and challenges.

With data and information exceeding our capacity for storage, there is a need for decisions to be made on the fly. While this does not imply that all decisions have to be immediate, our inability to store large amounts of data that is often generated continuously suggests that decisions pertaining to the use and storage of data, and therefore the boundaries of the eventual decision-making context, need to be defined earlier in the process. With the parameters of the eventual decision becoming an a priori consideration, big data is likely to overcome the human tendency of procrastination. It imposes the discipline to recognize the desired information content early in the process. Whether this entails decision processes that prefer immediate conclusions, or whether the early choices are limited to the identification of critical information that will be used for later evaluation, the dual decision model, with a preliminary decision far removed from the actual decision, offers an opportunity to examine the available alternatives more comprehensively. It allows decision makers to have a greater understanding of the alignment between goals and alternatives. Compare this situation to the recruitment model of a human resource department that either screens and finalizes prospective candidates in a single round of interviews, or separates the process into two stages where potential candidates are first identified from the larger pool and then selected from the short-listed candidates in the second stage.
The dual decision model not only facilitates greater insights, it also eliminates the fatigue that can seriously dampen the capacity for effective decisions. Yet this discipline comes at a cost. Goals, values, and biases that are part of the early phase of a project can leave a lasting imprint. Any realization later in the project that was not deliberately or accidentally situated in the earlier context becomes more difficult to incorporate into the decision. In the context of recruitment, if the skills desired of the selected candidate change after the first stage, it is unlikely that the short-listed pool will rank highly in that skill. The more unique the requirement that emerges in the later stage, the greater the likelihood that it will not be sufficiently fulfilled. This tradeoff suggests that an improvement in our understanding of the choices comes at the cost of limited maneuverability within an established decision context.

In addition to the benefits and costs of early decisions in the data generation cycle, big data allows access to information at a much more granular level than was possible in the past. Behaviors, attitudes, and preferences can now be tracked in extensive detail, fairly continuously, and over longer periods of time. They can in turn be combined with other sources of data to develop a broader understanding of consumers, suppliers, employees, and competitors. Not only can we understand in much more depth the activities and processes that pertain to various social and economic landscapes, but a higher level of granularity also makes decisions more informed and, as a result, more effective. Unfortunately, granularity also brings with it the potential for distraction. All data that pertains to a choice may not be necessary for the decision, and excessive understanding can overload our capacity to make inferences. Imagine the human skin, which is continuously sensing and discarding thermal information generated from our interaction with the environment. What if we had to consciously respond to every signal detected by the skin? It is this loss of granularity, with the human mind responding only to significant changes in temperature, that saves us from being overwhelmed by data. Even though information granularity makes it possible to know what was previously impossible, information overload can lead us astray towards inappropriate choices, and at worst, it can incapacitate our ability to make effective decisions.

The third implication of big data is the potential for objectivity. When a planned and comprehensive examination of alternatives is combined with a deeper understanding of the data, the result is more accurate information. This makes it less likely that individuals will come to an incorrect conclusion, and it eliminates the personal biases that can prevail in the absence of sufficient information. Since the traditional response to overcoming the effect of personal bias is to rely on individuals with greater experience, big data predicts an elimination of the critical role of experience. In this vein, Andrew McAfee and Erik Brynjolfsson (2012) find that regardless of the level of experience, firms that extensively rely on data for decision making are, on average, 6% more profitable than their peers. This suggests that as decisions become increasingly imbued with an objective orientation, prior knowledge becomes a redundant element. This, however, does not eliminate the value of domain-level experts. Their role is expected to evolve into that of individuals who know what to look for (by asking the right questions) and where to look (by identifying the appropriate sources of data). Domain expertise, and not just experience, is the mantra for identifying people who are likely to be the most valuable in this new information age. However, it needs to be acknowledged that this belief in objectivity is based on a critical assumption: individuals endowed with identical information that is sufficient and relevant to the context reach identical conclusions. Yet anyone watching the same news story reported by different media outlets knows the fallacy of this assumption. The variations that arise when identical facts lead individuals to contrasting conclusions are a manifestation of the differences in the way humans work with information. Human cognitive machinery associates meanings to concepts based on personal history. As a result, even while being cognizant of our biases, the translation of information into conclusions can be unique to individuals.
Moreover, this effect compounds with the increase in the amount of information that is being translated. While domain experts may help ensure consistency with the prevalent norms of translation, there is little reason to believe that all domain experts are generally in agreement. Consensus is possible in the domains of the physical sciences, where objective solutions, quantitative measurements, and conceptual boundaries leave little ambiguity. However, the larger domain of human experience is generally devoid of standardized interpretations. This may be one reason that a study by the Economist Intelligence Unit (2012) found a significantly higher proportion of data-driven organizations in industrial sectors such as natural resources, biotechnology, healthcare, and financial services. The lack of extensive reliance on data in other industries is symptomatic of our limited ability for consensual interpretation in areas that challenge the positivistic approach.

The objective nature of big data produces two critical advantages for organizations. The first is transparency. A clear link between data, information, and decision implies the absence of personal and organizational biases. Interested stakeholders can take a closer look at the data and the associated inferences to understand the basis of conclusions. Not only does this promise greater buy-in from participants that are affected by those decisions, it develops a higher level of trust between decision makers and the relevant stakeholders, and it diminishes the need for external monitoring and governance. Thus, transparency favors a context in which human interaction becomes easier. It paves the way for a richer exchange of information and ideas, which in turn improves the quality of future decisions. But due to its very nature, big data makes replication rather difficult. The time, energy, and other resources required to fully understand or reexamine the basis of choices make transparency not an antecedent but a consequence of trust. Participants are more likely to believe in transparency if they already trust the decision makers, and those that are less receptive to the choices remain free to accuse the process of being opaque. Regardless of the comprehensiveness of the disclosed details, transparency largely remains a symbolic expression of the participants' faith in the people managing the process.

A second advantage that arises from the objective nature of data is decentralization. Given that decisions made in the presence of big data are more objective and require less monitoring, they are easier to delegate to people who are closer to the action. By relying on proximity and exposure as the basis of assignments, organizations can save time and costs by avoiding the repeated concentration and evaluation of information that often occurs at the various hierarchical levels as information travels upwards. So unlike the flatter organizations of the current era, which rely on the free flow of information, lean organizations of the future may decrease the flow of information altogether, replacing it with data-driven, contextually rich, and objective findings. In fact, this is imminent, since the dual decision model defines the boundaries of subsequent choices, and any attempt to disengage the later decision from the earlier one is likely to eliminate the advantages of granularity and objectivity. Flatter organizations of the future will delegate not because managers have greater faith in the lower cadres of the organization, but because individuals at the lower levels are the ones likely to be best positioned to make timely decisions. As a result, big data is moving us towards a bottom-up model of organizational decisions where people at the interface between data and findings determine the strategic priorities within which higher-level executives can make their call. Compare this with the traditional top-down model of organizational decisions, where the strategic choices of the higher executives define the boundaries of action for the lower-level staff. However, the bottom-up approach is also fraught with challenges. It minimizes the value of executive vision. The subjective process of environmental scanning allows senior executives to imbue their valued preferences into organizational choices through selective attention to information. It enables organizations to do what would otherwise be uninformed and, at times, highly irrational. Yet it sustains the spirit of beliefs that take the form of entrepreneurial action.
By setting up a mechanism where facts and findings run supreme, organizations of the future may constrain themselves to do only what is measurable. Extensive reliance on data can impair our capacity to imagine what lies beyond the horizon (Table 1).

Automated Modeling/Decision Making, Table 1  Opportunities and challenges for the decision implications of big data

Big data implication           Opportunity                                  Challenge
1. Dual decision model         Comprehensive examination of alternatives   Early choices can constrain later considerations
2. Granularity                 In-depth understanding                       Critical information can be lost due to information overload
3. Objectivity                 Lack of dependence on experience             Inflates the effect of variations in translation
4. Transparency                Free flow of ideas                           Difficult to validate
5. Bottom-up decision making   Prompt decisions                             Impairment of vision

In sum, the big data revolution promises a change in the way individuals and organizations make decisions. But it also brings with it a host of challenges. The opportunities and threats discussed in this article reflect different facets of the implications that are fundamental to this revolution: the dual decision model, granularity, objectivity, transparency, and the bottom-up approach to organizational decisions. Table 1 summarizes how the promise of big data is an opportunity as well as a challenge for the future of decision making.

Cross-References

▶ Big Data Quality
▶ Data Governance
▶ Decision Theory
▶ Decision Tree

Further Readings

Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
Economist Intelligence Unit. (2012). The deciding factor: Big data & decision making. New York: Capgemini/The Economist.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 61–67.

Behavioral Analytics

Lourdes S. Martinez
School of Communication, San Diego State University, San Diego, CA, USA

Behavioral analytics can be conceptualized as a process involving the analysis of large datasets comprised of behavioral data in order to extract behavioral insights. This definition encompasses the three goals of behavioral analytics: generating behavioral insights for the purposes of improving organizational performance and decision-making, as well as increasing understanding of users. Coinciding with the rise of big data and the development of data mining techniques, a variety of fields stand to benefit from the emergence of behavioral analytics and its implications. Although there exists some controversy regarding the use of behavioral analytics, it has much to offer organizations and businesses that are willing to explore its integration into their models.

Definition

The concept of behavioral analytics has been defined by Montibeller and Durbach as an analytical process of extracting behavioral insights from datasets containing behavioral data. This definition is derived from previous conceptualizations of the broader overarching idea of business analytics put forth by Davenport and Harris as well as Kohavi and colleagues. Business analytics in turn is a subarea within business intelligence, described by Negash and Gray as systems that integrate data processes with analytics tools to demonstrate insights relevant to business planners and decision-makers. According to Montibeller and Durbach, behavioral analytics differs from traditional descriptive analysis of behavioral data by focusing analyses on driving action and improving decision-making among individuals and organizations. The purpose of this process is threefold. First, behavioral analytics facilitates the detection of users' behavior, judgments, and choices. For example, a health website that tracks the click-through behavior, views, and downloads of its visitors may offer an opportunity to personalize the user experience based on profiles of different types of visitors.

Second, behavioral analytics leverages findings from these behavioral patterns to inform decision-making at the organizational level and improve performance. If personalizing the visitor experience on a health website reveals a mismatch between certain users and the content provided on the website's navigation menu, the website may alter the items on its navigation menu to direct this group of users to relevant content in a more efficient manner. Lastly, behavioral analytics informs decision-making at the individual level by improving the judgments and choices of users. A health website that is personalized to the unique health characteristics and demographics of visitors may help users fulfill their informational needs so that they can apply the information to improve decisions they make about their health.


Applications

According to Kokel and colleagues, the largest behavioral databases can be found at Internet technology companies such as Google as well as in online gaming communities. The sheer size of these datasets is giving rise to new methods, such as data visualization, for behavioral analytics. Fox and Hendler note the opportunity in implementing data visualization as a tool for exploratory research and argue for a need to create a greater role for it in the process of scientific discovery. For example, Carneiro and Mylonakis explain how Google Flu relies on data visualization tools to predict outbreaks of influenza by tracking online search behavior and comparing it to geographical data. Similarly, Mitchell notes how Google Maps analyzes traffic patterns through data provided via real-time cell phone location to provide recommendations for travel directions. In the realm of social media, Bollen and colleagues have also demonstrated how analysis of Twitter feeds can be used to predict public sentiment.

According to Jou, the value of behavioral analytics has perhaps been most notably observed in the area of commercial marketing. The consumer marketing space has borne witness to the progress made through extracting actionable and profitable insights from user behavioral data. For example, between recommendation engines at Amazon and teams of data scientists at LinkedIn, behavioral analytics has allowed these companies to transform their plethora of user data into increased profits. Similarly, advertising efforts have turned toward the use of behavioral analytics to glean further insights into consumer behavior. Yamaguchi discusses several tools on which digital marketers rely that go beyond examining data from site traffic.

Nagaitis notes observations that are consistent with Jou's view of behavioral analytics' impact on marketing. According to Nagaitis, in the absence of face-to-face communication, behavioral analytics allows commercial marketers to examine e-consumers through additional lenses apart from traditional demographic and traffic tracking. In approaching the selling process from a relationship standpoint, behavioral analytics uses data collected via web-based behavior to increase understanding of consumer motivations and goals and to fulfill their needs. Examples of these sources of data include keyword searches, navigation paths, and click-through patterns. By inputting data from these sources into machine learning algorithms, computational social scientists are able to map human factors of consumer behavior as it unfolds during purchases. In addition, behavioral analytics can use web-based behaviors of consumers as proxies for cues typically conveyed through in-person, face-to-face communication. Previous research suggests that web-based dialogs can capture rich data pointing toward behavioral cues, the analysis of which can yield highly accurate predictions comparable to data collected during face-to-face interactions. The significance of this ability to capture communication cues is reflected in marketers' increased ability to speak to their consumers with greater personalization that enhances the consumer experience.
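As a deliberately simplified illustration of the kind of pipeline described above, the sketch below turns hypothetical web-behavior features (search counts, navigation-path length, click-throughs, time on site) into a model that predicts whether a visitor completes a purchase. The file name, column names, and the choice of logistic regression are assumptions made for this example, not a method prescribed by the authors cited in this entry.

```python
# Hypothetical sketch: predicting purchase behavior from web-based
# behavioral signals (keyword searches, navigation paths, click-throughs).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# One row per visitor session; "purchased" is the behavior of interest.
sessions = pd.read_csv("sessions.csv")  # assumed export of web logs

X = sessions[["n_searches", "path_length", "n_clickthroughs", "seconds_on_site"]]
y = sessions["purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```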

Behavioral analytics has also enjoyed increasingly widespread application in game development. El-Nasr and colleagues discuss the growing significance of assessing and uncovering insights related to player behavior, both of which have emerged as essential goals for the game industry and catapulted behavioral analytics into a central role with commercial and academic implications for game development. A combination of evolving mobile device technology and shifting business models that focus on game distribution via online platforms has created a situation for behavioral analytics to make important contributions toward building profitable businesses.

Increasingly available data on user behavior has given rise to the use of behavioral analytic approaches to guide game development. Fields and Cotton note the premium placed in this industry on data mining techniques that reduce the complexity of behavioral datasets while extracting knowledge that can drive game development. However, determining cutting-edge methods in behavioral analytics within the game industry is a challenge due to reluctance on the part of various organizations to share analytic methods. Drachen and colleagues observe a difficulty in assessing both the data and the analytical methods applied in this area due to a perception that these approaches represent a form of intellectual property. Sifa further notes that to the extent that data mining, behavioral analytics, and the insights derived from these approaches provide a competitive advantage over rival organizations in an industry that already exhibits fierce competition in the entertainment landscape, organizations will not be motivated to share knowledge about these methods.

Another area receiving attention for its application of behavioral analytics is business management. Noting that much interest in applying behavioral analytics has focused on modeling and predicting consumer experiences, Géczy and colleagues observe a potential for applying these techniques to improve employee usability of internal systems. More specifically, Géczy and colleagues describe the use of behavioral analytics as a critical first step toward user-oriented management of organizational information systems through identification of relevant user characteristics. Through behavioral analytics, organizations can observe characteristics of usability and interaction with information systems and identify patterns of resource underutilization. These patterns have important implications for designing streamlined and efficient user-oriented processes and services. Behavioral analytics can also offer prospects for increasing personalization of the user experience by drawing from information provided in user profiles. These profiles contain information about how the user interacts with the system, and the system can adjust accordingly based on clustering of users.
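A minimal sketch of what such clustering might look like in practice is given below. The interaction features (logins per week, documents opened, distinct modules used, help-page visits) and the use of k-means with two clusters are illustrative assumptions, not the specific method of Géczy and colleagues.

```python
# Hypothetical sketch: grouping users of an internal information system
# by interaction profile so that the system can adapt to each group.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows = users; columns = assumed usage features:
# [logins per week, documents opened, distinct modules used, help-page visits]
profiles = np.array([
    [12, 40, 6, 1],
    [2, 5, 1, 9],
    [15, 55, 7, 0],
    [1, 3, 1, 12],
    [11, 38, 5, 2],
])

scaled = StandardScaler().fit_transform(profiles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)  # e.g., heavy users vs. users who lean on help pages
```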
Despite advances made in behavioral analytics within the commercial marketing and game industries, several areas are ripe with opportunities for integrating behavioral analytics to improve performance and decision-making practices. One area that has not yet reached its full potential for capitalizing on the use of behavioral analytics is security. Although Brown reports on exploration in the use of behavioral analytics to track cross-border smuggling activity in the United Kingdom through vehicle movement, the application of these techniques under the broader umbrella of security remains understudied. Along these lines, and in the context of an enormous amount of available data, Jou discusses the possibilities for implementing behavioral analytics techniques to identify insider threats posed by individuals within an organization. Inputting data from a variety of sources into behavioral analytics platforms can offer organizations an opportunity to continuously monitor users and machines for early indicators and detection of anomalies. These sources may include email data, network activity via browser activity and related behaviors, intellectual property repository behaviors related to how content is accessed or saved, end-point data showing how files are shared or accessed, and other less conventional sources such as social media or credit reports. Connecting data from various sources and aggregating them under a comprehensive data plane can provide enhanced behavioral threat detection. Through this, robust behavioral analytics can be used to extract insights into patterns of behavior consistent with an imminent threat. At the same time, the use of behavioral analytics can also measure, accumulate, verify, and correctly identify real insider threats while preventing the inaccurate classification of nonthreats. Jou concludes that implementing behavioral analytics in an ethical manner can provide practical and operative intelligence, while raising the question as to why implementation in this field has not occurred more quickly.
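The continuous monitoring Jou describes is often operationalized as unsupervised anomaly detection over per-user activity counts aggregated from sources like those listed above. The following sketch assumes a hypothetical table of daily activity features and uses an isolation forest; the feature names, contamination rate, and data source are illustrative rather than a prescription from the literature cited here.

```python
# Hypothetical sketch: flagging unusual user behavior from aggregated
# activity counts (emails sent, repository downloads, off-hours logins,
# end-point file transfers). Flags are leads for analysts, not verdicts.
import pandas as pd
from sklearn.ensemble import IsolationForest

activity = pd.read_csv("daily_user_activity.csv")  # assumed: one row per user-day
features = ["emails_sent", "repo_downloads", "offhours_logins", "endpoint_transfers"]

detector = IsolationForest(contamination=0.01, random_state=0)
activity["flag"] = detector.fit_predict(activity[features])  # -1 marks anomalies

print(activity.loc[activity["flag"] == -1, ["user_id", "date"] + features])
```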

In conclusion, behavioral analytics has been previously defined as a process in which large datasets consisting of behavioral data are analyzed for the purpose of deriving insights that can serve as actionable knowledge. This definition includes three goals underlying the use of behavioral analytics, namely, to enhance organizational performance, improve decision-making, and generate insights into user behavior. Given the burgeoning presence of big data and the spread of data mining techniques to analyze these data, several fields have begun to integrate behavioral analytics into their approaches for problem-solving and performance-enhancing actions. While concerns related to the accuracy and ethical use of these insights remain to be addressed, behavioral analytics can present organizations and businesses with unprecedented opportunities to enhance business, management, and operations.

Cross-References

▶ Big Data
▶ Business Analytics
▶ Data Mining
▶ Data Science
▶ Data Scientist
▶ Data-Driven Decision-Making

Further Readings

Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
Brown, G. M. (2007). Use of Kohonen self-organizing maps and behavioral analytics to identify cross-border smuggling activity. Proceedings of the World Congress on Engineering and Computer Science.
Carneiro, H. A., & Mylonakis, E. (2009). Google Trends: A web-based tool for real-time surveillance of disease outbreaks. Clinical Infectious Diseases, 49(10).
Davenport, T., & Harris, J. (2007). Competing on analytics: The new science of winning. Boston: Harvard Business School Press.
Drachen, A., Sifa, R., Bauckhage, C., & Thurau, C. (2012). Guns, swords and data: Clustering of player behavior in computer games in the wild. Proceedings of the IEEE Computational Intelligence and Games.
El-Nasr, M. S., Drachen, A., & Canossa, A. (2013). Game analytics: Maximizing the value of player data. New York: Springer.
Fields, T. (2011). Social game design: Monetization methods and mechanics. Boca Raton: Taylor & Francis.
Fox, P., & Hendler, J. (2011). Changing the equation on scientific data visualization. Science, 331(6018).
Géczy, P., Izumi, N., Shotaro, A., & Hasida, K. (2008). Toward user-centric management of organizational information systems. Proceedings of the Knowledge Management International Conference, Langkawi, Malaysia (pp. 282–286).
Kohavi, R., Rothleder, N., & Simoudis, E. (2002). Emerging trends in business analytics. Communications of the ACM, 45(8).
Mitchell, T. M. (2009). Computer science: Mining our reality. Science, 326(5960).
Montibeller, G., & Durbach, I. (2013). Behavioral analytics: A framework for exploring judgments and choices in large data sets. Working Paper LSE OR13.137. ISSN 2041-4668.
Negash, S., & Gray, P. (2008). Business intelligence. Berlin/Heidelberg: Springer.
Sifa, R., Drachen, A., Bauckhage, C., Thurau, C., & Canossa, A. (2013). Behavior evolution in Tomb Raider Underworld. Proceedings of the IEEE Computational Intelligence and Games.

Big Humanities Project

Ramon Reichert
Department for Theatre, Film and Media Studies, Vienna University, Vienna, Austria

"Big Humanities" is a heterogeneous field of research between IT, cultural studies, and the humanities in general. Recently, because of the higher availability of digital data, it has gained even more importance. The term "Big Humanities Data" has prevailed due to the wider usage of the Internet, replacing terms like "computational science" and "humanities computing," which had been used since the beginning of the computer era in the 1960s. These terms referred mostly to the methodological and practical development of digital tools, infrastructures, and archives.

In addition to the theoretical explorations of the field by Davidson (2008), Svensson (2010), Anne et al. (2010), and Gold (2012), "Big Humanities Data" can be divided into three trendsetting theoretical approaches, which simultaneously cover the historical development and changes in the field of research and its epistemological policy:

1. The usage of computers and the digitalization of "primary data" within the humanities and cultural studies are at the center of the Digital Humanities. On the one hand, the digitization projects relate to the digitalized portfolios; on the other hand, they relate to the computerized philology tools for the application of secondary data or results. Even today these elementary methods of digital humanities rest on a philological tradition, which sees the evidence-driven collection and management of data as the foundation of hermeneutics and interpretation. Beyond the narrow discussions about methods, computer-based measuring within the humanities and cultural studies claims the media-like postulates of objectivity of the modern sciences. In contrast to the curriculum of text studies of the 1950s and 1960s within "Humanities Computing" (McCarty 2005), the research area of related disciplines has been differentiated and broadened to the history of art, culture and sociology, media studies, technology, archaeology, history, and musicology (Gold 2012).
2. According to the second phase, in addition to the quantitative digitalization of texts, research practices are being developed in accordance with the methods and processes of production, analysis, and modeling of digital research environments for work within the humanities with digital data. This approach stands behind the enhanced humanities and tries to find new methodological approaches for the qualitative application of generated, processed, and archived data for the reconceptualization of traditional research subjects (Ramsey and Rockwell 2012, pp. 75–84).

3. The development from humanities 1.0 to humanities 2.0 (Davidson 2008, pp. 707–717) marks the transition from the digital development of methods within the "Enhanced Humanities" to the "Social Humanities," which use the possibilities of Web 2.0 to construct the research infrastructure. The social humanities draw on the interdisciplinarity of scientific knowledge by making use of software for open access, social reading, and open knowledge and by enabling cooperative and collaborative online work on research and development. On the basis of the new digital infrastructure of the social web (hypertext systems, wiki tools, crowdfunding software, etc.), these products transfer the computer-based processes of the early phase of digital humanities into the network culture of the social sciences. Today it is Blogging Humanities (work on digital publications and mediation in peer-to-peer networks) and Multimodal Humanities (presentation and representation of knowledge within multimedia software environments) that stand for the technical modernization of academic knowledge (McPherson 2008). Because of them, Big Social Humanities claims the right to represent a paradigmatically alternative form of knowledge production. In this context one should reflect on the technical fundamentals of the computer-based process of gaining insights within research in the humanities and cultural studies, while critically considering data, knowledge genealogy, and media history, in order to properly evaluate their role in the context of digital knowledge production and distribution (Thaller 2012, pp. 7–23).

History of Big Humanities

Big Humanities have been considered only occasionally from the perspective of science and media history over the last few years (Hockey 2004). Historical approaches to the interdependent relation between the humanities and cultural studies and the usage of computer-based processes relativize the claim of digital methods to evidence and truth and support the argument that the digital humanities developed from a network of historical cultures of knowledge and media technologies whose roots lie at the end of the nineteenth century.

In the research literature on the historical context and genesis of Big Humanities, a concordance of Thomas Aquinas produced on punch cards by Roberto Busa is regarded as one of the first projects of genuinely humanistic usage of the computer (Vanhoutte 2013, p. 126). Roberto Busa (1913–2011), an Italian Jesuit priest, is considered a pioneer of the Digital Humanities. This project enabled the achievement of uniformity in the historiography of computational science in its early stage (Schischkoff 1952). Busa, who in 1949 developed the linguistic corpus of the "Index Thomisticus" together with Thomas J. Watson, the founder of IBM (Busa 1951, 1980, pp. 81–90), is regarded as a founder of the point of intersection between the humanities and IT. The first digital edition on punch cards initiated a series of subsequent philological projects: "In the 60s the first electronic version of the 'Modern Language Association International Bibliography' (MLAIB) came up, a specific periodical bibliography of all modern philologies, which could be searched through with a telephone coupler. The retrospective digitalization of cultural heritage started after that, having had ever more works and lexicons such as the German vocabulary by the Grimm brothers, historical vocabularies as the Krünitz, or regional vocabularies" (Lauer 2013, p. 104).

At first, a large number of other disciplines and non-philological areas were involved, such as literature, library, and archive studies. They had a longer epistemological history in the field of philological case studies and practical information studies. Since the introduction of punch card methods, they have been dealing with quantitative and IT procedures for facilities of knowledge management. As one can see, neither the research question nor Busa's methodological procedure was without predecessors, so they can be seen as part of a larger and longer history of knowledge and media archeology.

Sketch models of mechanical knowledge apparatuses capable of combining information can be found in the manuscripts of the Swiss archivist Karl Wilhelm Bührer (1861–1917; Bührer 1890, pp. 190–192). This figure of thought, the flexible and modularized information unit, became a conceptual core of mechanical data processing. Archive and library studies took a direct part in this historical paradigm shift in information processing. It was John Shaw Billings, a physician and later director of the National Medical Library, who worked further on the development of an apparatus for the machine-driven processing of statistical data, a machine developed by Hermann Hollerith in 1886 (Krajewski 2007, p. 43). Punch card technology thus traces its roots to the technical pragmatics of library knowledge organization, even if the librarian's working procedures were automated in specific areas only later, within the rationalization movement of the 1920s. Other data processing projects show that the automated production of an index or a concordance marks the beginning of computer-based humanities and cultural studies in the lexicography and catalogue apparatus of libraries. Until the late 1950s, it was the automated processing of large text data with the punch card system following the Hollerith procedure that stood at the center of the first applications. The technical procedure of punch cards changed the reading practice of text analysis by transforming a book into a database and by turning the linear-syntagmatic structure of a text into a factual, term-based system. As early as 1951, an academic debate among contemporaries began in scholarly journals. This debate regarded the possible applications of the punch card system largely positively and placed them in the context of economically motivated rationality. Between December 13 and 16, 1951, the German Society for Documentation and the Advisory Board of the German Economic Chamber organized a working conference on the mechanization and automation of the documentation process, which was enthusiastically discussed by the philosopher Georgi Schischkoff. He spoke of a "significant simplification and acceleration [. . .] by mechanical remembrance" (Schischkoff 1952, p. 290). The representatives of computer-based humanities saw in "literary computing," starting in the early 1950s, the first autonomous research area that could provide an "objective analysis of exact knowledge" (Pietsch 1951). In the 1960s, the first studies in the field of computational linguistics concerning the automated indexing of large text corpora appeared, publishing computer-based analyses of word indexing, word frequency, and word groups.
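For readers unfamiliar with what such analyses involve, the sketch below shows a present-day equivalent of the early word-frequency and concordance work described here: counting word occurrences and printing a simple keyword-in-context listing. The sample text and window size are arbitrary choices for illustration only.

```python
# Illustrative sketch: word frequencies and a keyword-in-context (KWIC)
# concordance, the modern analogue of early punch-card philology.
import re
from collections import Counter

text = ("In the beginning was the word and the word was counted "
        "and the counting of the word became a method")
tokens = re.findall(r"[a-zäöüß]+", text.lower())

# Word frequency table
print(Counter(tokens).most_common(5))

# Keyword-in-context: every occurrence of a term with two words of context
def kwic(tokens, term, window=2):
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>20} | {term} | {right}")

kwic(tokens, "word")
```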
The automated evaluation of texts for editorial work within literary studies was described already in the early stages of "humanities computing" (mostly within its areas of "computer philology" and "computer linguistics") on the basis of two discourse figures that remain relevant today. The first figure of discourse describes the achievements of the new tools in terms of the instrumental availability of data ("helping tools"); the other focuses on the economical disclosure of data and emphasizes the efficiency and effectiveness of machine methods of documentation. The media figure of automation was finally combined with the expectation that interpretative and subjective influences could be systematically removed from the processing and analysis of information. In the 1970s and 1980s, computational linguistics was established as an institutionally positioned area of research with its own university facilities, specialist journals (Journal of Literary and Linguistic Computing, Computing in the Humanities), discussion panels (HUMANIST), and conference activities. Computer-based work in historical-sociological research saw its first large rise, but in the work reports it was regarded less as an autonomous method than as a tool for critical text examination and as a simplification measure for quantifying the prospective subjects (Jarausch 1976, p. 13).

A sustainable media turn, both in the field of production and in the field of reception aesthetics, appeared with the application of standardized text markup, such as the Standard Generalized Markup Language established in 1986, and software-driven programs for text processing. They made available additional series of digital modules, analytical tools, and text functions and transformed the text into the model of a database.

Texts could be loaded as structured information and were available as (relational) databases. In the 1980s and 1990s, technical development and the reception of texts were dominated by the paradigm of the database.

With the dominance of the World Wide Web, research and teaching practices changed drastically: specialized communication experienced a lively dynamic through the digital network culture of publicly accessible online resources, e-mail distribution, chats, and forums, and it became largely responsive through the media-driven feedback mentality of rankings and voting. With its aspiration to go beyond the hierarchical structures of the academic system through the reengineering of scientific knowledge, Digital Humanities 2.0 made the ideals of equality, freedom, and omniscience appear attainable again. As opposed to its beginnings in the 1950s, the Digital Humanities today also aspire to reorganize the knowledge of society. Therefore, they regard themselves "both as a scientific as well as a socioutopistic project" (Hagner and Hirschi 2013, p. 7). With the usage of social media in the humanities and cultural studies, the technological possibilities and scientific practices of the Digital Humanities not only developed further but also brought to life new phantasmagoria of scientific distribution, quality evaluation, and transparency in the World Wide Web (Haber 2013, pp. 175–190). In this context, Bernhard Rieder and Theo Röhle identified five central problematic perspectives for the current Digital Humanities in their 2012 text "Digital Methods: Five Challenges." These are the following: the temptation of objectivity, the power of visual evidence, black-boxing (fuzziness, problems of random sampling, etc.), institutional turbulences (rivaling service facilities and teaching subjects), and the claim of universality. Computer-based research is usually dominated by the evaluation of data, so that some researchers see advanced analysis within the research process even as a substitute for substantial theory construction. That means that research interests are almost completely data driven. This evidence-based concentration on the possibilities of data can lead researchers to neglect the heuristic aspects of their own subject. Since the social net is not only a neutral, power-free channel for reading, research, writing, and publication resources but also a governmental power structure of scientific knowledge, the epistemological probing of the social, political, and economic contexts of the Digital Humanities also includes a data-critical and historical questioning of their computer-based reform agenda (Schreibman 2012, pp. 46–58).

What did the usage of computer technology change for cultural studies and the humanities at the level of theoretical essentials? Computers reorganized and accelerated the quantification and calculation of scientific knowledge; they entrenched the metrical paradigm in cultural studies and the humanities and promoted the hermeneutical-interpretative approaches with a mathematical formalization of the respective subject field. In addition to these epistemological shifts, research practices within the Big Humanities have shifted, since research and development are seen as project related, collaborative, and network formed, and on the network horizon they become the subject of network analysis. Network analysis itself has the goal of revealing the correlations and relation patterns of the digital communication of scientific networks and of making the Big Humanities itself the subject of reflection within a social constructivist actor-network theory.

Further Readings

Anne, B., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2010). Digital_humanities. Cambridge, MA: MIT Press. Online: http://mitpress.mit.edu/sites/default/files/titles/content/9780262018470_Open_Access_Edition.pdf
Bührer, K. W. (1890). Ueber Zettelnotizbücher und Zettelkatalog. Fernschau, 4, 190–192.
Busa, R. (1951). S. Thomae Aquinatis Hymnorum Ritualium Varia Specimina Concordantiarum. Primo saggio di indici di parole automaticamente composti e stampati da macchine IBM a schede perforate. Milano: Bocca.
Busa, R. (1980). The annals of humanities computing: The index Thomisticus. Computers and the Humanities, 14(2), 83–90.

Davidson, C. N. (2008). Humanities 2.0: Promise, perils, predictions. Publications of the Modern Language Association (PMLA), 123(3), 707–717.
Gold, M. K. (Ed.). (2012). Debates in the digital humanities. Minneapolis: University of Minnesota Press.
Haber, P. (2013). 'Google Syndrom'. Phantasmagorien des historischen Allwissens im World Wide Web. Zürcher Jahrbuch für Wissensgeschichte, 9, 175–190.
Hagner, M., & Hirschi, C. (2013). Editorial Digital Humanities. Zürcher Jahrbuch für Wissensgeschichte, 9, 7–11.
Hockey, S. (2004). History of humanities computing. In S. Schreibman, R. Siemens, & J. Unsworth (Eds.), A companion to digital humanities. Oxford: Blackwell.
Jarausch, K. H. (1976). Möglichkeiten und Probleme der Quantifizierung in der Geschichtswissenschaft. In ders., Quantifizierung in der Geschichtswissenschaft. Probleme und Möglichkeiten (pp. 11–30). Düsseldorf: Droste.
Krajewski, M. (2007). In Formation. Aufstieg und Fall der Tabelle als Paradigma der Datenverarbeitung. In D. Gugerli, M. Hagner, M. Hampe, B. Orland, P. Sarasin, & J. Tanner (Eds.), Nach Feierabend. Zürcher Jahrbuch für Wissenschaftsgeschichte (Vol. 3, pp. 37–55). Zürich/Berlin: Diaphanes.
Lauer, G. (2013). Die digitale Vermessung der Kultur. Geisteswissenschaften als Digital Humanities. In H. Geiselberger & T. Moorstedt (Eds.), Big Data. Das neue Versprechen der Allwissenheit (pp. 99–116). Frankfurt/M: Suhrkamp.
McCarty, W. (2005). Humanities computing. London: Palgrave.
McPherson, T. (2008). Dynamic vernaculars: Emergent digital forms in contemporary scholarship. Lecture presented to HUMLab Seminar, Umeå University, 4 Mar. http://stream.humlab.umu.se/index.php?streamName=dynamicVernaculars
Pietsch, E. (1951). Neue Methoden zur Erfassung des exakten Wissens in Naturwissenschaft und Technik. Nachrichten für Dokumentation, 2(2), 38–44.
Ramsey, S., & Rockwell, G. (2012). Developing things: Notes toward an epistemology of building in the digital humanities. In M. K. Gold (Ed.), Debates in the digital humanities (pp. 75–84). Minneapolis: University of Minnesota Press.
Rieder, B., & Röhle, T. (2012). Digital methods: Five challenges. In D. M. Berry (Ed.), Understanding digital humanities (pp. 67–84). London: Palgrave.
Schischkoff, G. (1952). Über die Möglichkeit der Dokumentation auf dem Gebiete der Philosophie. Zeitschrift für Philosophische Forschung, 6(2), 282–292.
Schreibman, S. (2012). Digital humanities: Centres and peripheries. In M. Thaller (Ed.), Controversies around the digital humanities (Historical Social Research, Vol. 37(3), pp. 46–58). Köln: Zentrum für Historische Sozialforschung.
Svensson, P. (2010). The landscape of digital humanities. Digital Humanities Quarterly (DHQ), 4(1). Online: http://www.digitalhumanities.org/dhq/vol/4/1/000080/000080.html
Thaller, M. (Ed.). (2012). Controversies around the digital humanities: An agenda. Historical Social Research, 37(3), 7–23.
Vanhoutte, E. (2013). The gates of hell: History and definition of digital | humanities. In M. Terras, J. Nyhan, & E. Vanhoutte (Eds.), Defining digital humanities (pp. 120–156). Farnham: Ashgate.

Biomedical Data

Qinghua Yang¹ and Fan Yang²
¹Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA
²Department of Communication Studies, University of Alabama at Birmingham, Birmingham, AL, USA

Thanks to the development of modern data collection and analytic techniques, biomedical research generates increasingly large amounts of data in various formats and at all levels, which is referred to as big data. Big data is a collection of data sets that are large in volume and complex in structure. To illustrate, the data managed by America's leading healthcare provider Kaiser is 4,000 times more than the amount of information stored in the Library of Congress. As to data structure, the range of nutritional data types and sources makes such data very difficult to normalize. This volume and complexity make big data difficult to process with traditional data analytic techniques. Therefore, to further knowledge and uncover hidden value, there is an increasing need to better understand and mine biomedical big data with innovative techniques and new approaches, which requires interdisciplinary collaborations involving data providers and users (e.g., biomedical researchers, clinicians, and patients), data scientists, funders, publishers, and librarians.

The collection and analysis of big data in the biomedical area have demonstrated their ability to enable efficiencies and accountability in health care, which provides strong evidence for the benefits of big data usage. Electronic health records (EHRs), an example of biomedical big data, can provide timely data to assist in the monitoring of infectious diseases, disease outbreaks, and chronic illnesses, which could be particularly valuable during public health emergencies. By collecting and extracting data from EHRs, public health organizations and authorities could receive an extraordinary amount of information. By analyzing the massive data from EHRs, public health researchers could conduct comprehensive observational studies with countless patients who are treated in real clinical settings over years. Disease progression, clinical outcomes, treatment effectiveness, and the efficacy of public health interventions can also be studied by analyzing EHR data, which may influence public health decision-making (Hoffman and Podgurski 2013).

At a crucial juncture for addressing the opportunities and challenges presented by biomedical big data, the National Institutes of Health (NIH) has initiated a Big Data to Knowledge (BD2K) initiative to maximize the use of biomedical big data. BD2K, a response to the Data and Informatics Working Groups (DIWG), focuses on enhancing:

(a) the ability to locate, access, share, and apply biomedical big data,
(b) the dissemination of data analysis methods and software,
(c) the training in biomedical big data and data science, and
(d) the establishment of centers of excellence in data science (Margolis et al. 2014).

First, the BD2K initiative fosters the emergence of data science as a discipline relevant to biomedicine by developing solutions to specific high-need challenges confronting the research community. For instance, the Centers of Excellence in Data Science initiated the first BD2K Funding Opportunity to test and validate new ideas in data science. Second, BD2K aims to enhance the training of methodologists and practitioners in data science by improving their skills in the fields in demand under the data science "umbrella," such as computer science, mathematics, statistics, biomedical informatics, biology, and medicine. Third, given the complex questions posed by the generation of large amounts of data requiring interdisciplinary teams, the BD2K initiative facilitates the development of investigators in all parts of the research enterprise for interdisciplinary collaboration to design studies and perform subsequent data analyses (Margolis et al. 2014).

Besides these initiatives from national research institutes such as NIH, great endeavors in improving biomedical big data processing and analysis have also been made by biomedical researchers and for-profit organizations. National cyberinfrastructure has been suggested by biomedical researchers as one of the systems that could efficiently handle many of the big data challenges facing the medical informatics community. In the United States, the national cyberinfrastructure (CI) refers to an existing system of research supercomputer centers and the high-speed networks that connect them (LeDuc et al. 2014). CI has been widely used by physical and earth scientists, and more recently biologists, yet it has been little used by biomedical researchers. It has been argued that more comprehensive adoption of CI could address many challenges in the biomedical area. One example of an innovative biomedical big data technique provided by for-profit organizations is GENALICE MAP, a next-generation sequencing (NGS) DNA processing software launched by the Dutch software company GENALICE. Processing biomedical big data one hundred times faster than conventional data analytic tools, MAP demonstrated robustness and spectacular performance and raised NGS data processing and analysis to a new level.

Challenges

Despite the opportunities brought by biomedical big data, certain noteworthy challenges also exist. First, to use big biomedical data effectively, it is imperative to identify the potential sources of healthcare information and to determine the value of linking them together (Weber et al. 2014). The "bigness" of biomedical data sets is multidimensional: some big data, such as EHRs, provide depth by including multiple types of data (e.g., images, notes, etc.) about individual patient encounters; others, such as claims data, provide longitudinality, which refers to patients' medical information over a period of time. Moreover, social media, credit cards, census records, and various other types of data can help assemble a holistic view of a patient and shed light on social and environmental factors that may be influencing health.

The second technical obstacle in linking big biomedical data results from the lack of a national unique patient identifier (UPI) in the United States (Weber et al. 2014). To address the absence of a UPI and enable precise linkage, hospitals and clinics have developed sophisticated probabilistic linkage algorithms based on other information, such as demographics. By requiring enough variables to match, hospitals and clinics are able to reduce the risk of linkage errors to an acceptable level even when two different patients share the same characteristics (e.g., name, age, gender, zip code). In addition, the same techniques used to match patients across different EHRs can be extended to data sources outside of health care, which is an advantage of probabilistic linkage.
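A toy version of such probabilistic matching is sketched below: each demographic field that agrees contributes a weight to a match score, and pairs above a threshold are treated as the same patient. The fields, weights, and threshold are invented for illustration; production linkage algorithms are considerably more sophisticated.

```python
# Hypothetical sketch: probabilistic record linkage on demographics
# when no unique patient identifier is available.
FIELD_WEIGHTS = {"last_name": 4.0, "first_name": 2.0, "birth_date": 5.0,
                 "sex": 1.0, "zip_code": 2.0}
MATCH_THRESHOLD = 9.0  # arbitrary cutoff for this example

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the weights of fields on which two records agree."""
    return sum(w for field, w in FIELD_WEIGHTS.items()
               if rec_a.get(field) and rec_a.get(field) == rec_b.get(field))

a = {"last_name": "garcia", "first_name": "ana", "birth_date": "1956-03-02",
     "sex": "f", "zip_code": "35294"}
b = {"last_name": "garcia", "first_name": "anna", "birth_date": "1956-03-02",
     "sex": "f", "zip_code": "35294"}

score = match_score(a, b)
print(score, "-> same patient" if score >= MATCH_THRESHOLD else "-> no link")
```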

Third, besides the technical challenges, privacy and security concerns turn out to be a social challenge in linking biomedical big data (Weber et al. 2014). As more data are linked, they become increasingly difficult to de-identify. For instance, although clinical data from EHRs offer considerable opportunities for advancing clinical and biomedical research, unlike most other forms of biomedical research data, clinical data are typically obtained outside of traditional research settings and must be converted for research use. This process raises important issues of consent and protection of patient privacy (Institute of Medicine 2009). Possible constructive responses could be to regulate legality and ethics, to ensure that benefits outweigh risks, to include patients in the decision-making process, and to give patients control over their data. Additionally, changes in policies and practices are needed to govern research access to clinical data sources and facilitate their use for evidence-based learning in healthcare. Improved approaches to patient consent and risk-based assessments of clinical data usage, enhanced quality and quantity of clinical data available for research, and new methodologies for analyzing clinical data are all needed for the ethical and informed use of biomedical big data.

Cross-References

▶ Biometrics Databases
▶ Data Sharing
▶ Health Informatics
▶ National Institutes of Health

Further Readings

Hoffman, S., & Podgurski, A. (2013). Big bad data: Law, public health, and biomedical databases. The Journal of Law, Medicine & Ethics, 41(8), 56–60.
Institute of Medicine. (2009). Beyond the HIPAA privacy rule: Enhancing privacy, improving health through research. Washington, DC: The National Academies Press.
LeDuc, R., Vaughn, M., Fonner, J. M., Sullivan, M., Williams, J. G., Blood, P. D., et al. (2014). Leveraging the national cyberinfrastructure for biomedical research. Journal of the American Medical Informatics Association, 21(2), 195–199.
Margolis, R., Derr, L., Dunn, M., Huerta, M., Larkin, J., Sheehan, J., et al. (2014). The National Institutes of Health's Big Data to Knowledge (BD2K) initiative: Capitalizing on biomedical big data. Journal of the American Medical Informatics Association, 21(6), 957–958.
Weber, G., Mandl, K. D., & Kohane, I. S. (2014). Finding the missing link for big biomedical data. Journal of the American Medical Association, 311(24), 2479–2480.

Biosurveillance

Ramón Reichert
Universität Wien, Wien, Austria

Internet biosurveillance, or Digital Disease Detection, represents a new paradigm of public health governance. While traditional approaches to health prognosis operated with data collected in clinical diagnosis, Internet biosurveillance studies use the methods and infrastructures of health informatics. More precisely, they use unstructured data from different web-based sources and derive from the collected and processed data information about changes in health-related behavior. The two main tasks of Internet biosurveillance are (1) the early detection of epidemic diseases and of biochemical, radiological, and nuclear threats (Brownstein et al. 2009) and (2) the implementation of strategies and measures of sustainable governance in the target areas of health promotion and health education (Walters et al. 2010). Biosurveillance established itself as an independent discipline in the mid-1990s, as military and civilian agencies began to take an interest in automatic monitoring systems. In this context, the biosurveillance program of the Applied Physics Laboratory of Johns Hopkins University has played a decisive and pioneering role (Burkom et al. 2008).

Internet biosurveillance uses the access to data and analytic tools provided by the digital infrastructures of social media, participatory sources, and non-text-based sources. The structural change generated by digital technologies, a main driver of big data, offers a multitude of applications for sensor technology and biometrics as key technologies. Biometric analysis technologies and methods are finding their way into all areas of life, changing people's daily lives. In particular, the areas of sensor technology and biometric recognition processes, and the general tendency toward convergence of information and communication technologies, are stimulating big data research. The conquest of mass markets by sensor and biometric recognition processes can partly be explained by the fact that mobile, web-based terminals are equipped with a large variety of different sensors. More and more users come into contact in this way with sensor technology or with the measurement of individual body characteristics. Due to more stable and faster mobile networks, many people are permanently connected to the Internet through their mobile devices, giving connectivity an extra boost.

With the development of apps, application software for mobile devices such as smartphones (iPhone, Android, BlackBerry, Windows Phone) and tablet computers, the application culture of biosurveillance changed significantly, since these apps are strongly influenced by the dynamics of bottom-up participation.

Andreas Albrechtslund speaks in this context of "participatory surveillance" (2008) on social networking sites, in which biosurveillance increasingly becomes a place of open production of meaning and permanent negotiation, by providing comment functions, hypertext systems, and ranking and voting procedures through collective framing processes. This is the case of the sports app Runtastic, which monitors different sports activities using GPS, mobile devices, and sensor technology and makes information such as distance, time, speed, and burned calories accessible and visible to friends and acquaintances in real time. The Eatery app is used for weight control and requires of its users the ability to self-optimize through self-tracking. Considering that health apps also aim to influence the attitudes of their users, they can additionally be understood as persuasive media of health governance. With their feedback technologies, the apps not only facilitate issues related to healthy lifestyles but also multiply social control over compliance with health regulations in peer-to-peer networks. Considering the network connection of information technology equipment, as well as the commercial availability of biometric tools (e.g., "Nike Fuel," "Fitbit," "iWatch") and infrastructure (apps), biosurveillance is frequently associated in public debates with dystopian ideas of a biometrically organized society of control.

Organizations and networks for health promotion, health information, and health education observed with great interest that, every day, millions of users worldwide search for information about health using the Google search engine. During the influenza season, searches for flu increase considerably, and the frequency of certain search terms can provide good indicators of flu activity. Back in 2006, Eysenbach evaluated, in a study on "infodemiology" or "infoveillance," the Google AdSense click quotas, with which he analyzed indicators of the spread of influenza and observed a positive correlation between increasing search engine entries and increased influenza activity. Further studies on the volume of search patterns have found that there is a significant correlation between the number of flu-related search queries and the number of people showing actual flu symptoms (Freyer-Dugas et al. 2012). This epidemiological correlation structure was subsequently extended to provide early warning of epidemics in cities, regions, and countries, in cooperation with Google Flu Trends, established in 2008 in collaboration with the US authority for the surveillance of epidemics (CDC). On the Google Flu Trends website, users can visualize the development of influenza activity both geographically and chronologically. Some studies criticize that the predictions of the Google project are far above the actual flu cases.
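The correlation at the heart of these studies is straightforward to compute once weekly query volumes and clinical surveillance counts are aligned. The sketch below uses invented numbers and Pearson's r purely to illustrate the idea; it does not reproduce the actual Google Flu Trends methodology.

```python
# Illustrative sketch: correlating weekly flu-related search query volume
# with clinically reported influenza-like illness (ILI) rates.
# The numbers below are invented for demonstration.
import numpy as np

query_volume = np.array([120, 150, 240, 410, 620, 580, 390, 210])  # searches/week
ili_rate = np.array([1.1, 1.4, 2.2, 3.9, 5.8, 5.5, 3.6, 2.0])      # % of visits

r = np.corrcoef(query_volume, ili_rate)[0, 1]
print(f"Pearson correlation: {r:.2f}")  # high r suggests queries track flu activity
```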
Ginsberg et al. (2009) point out that in the case of an epidemic it is not clear whether the search engine behavior of the public remains constant and thus whether the significance of Google Flu Trends is secured or not. They refer to the medialized presence of the epidemic as a distorting cause, an "epidemic of fear" (Eysenbach 2006, p. 244), which can lead to miscalculations concerning the impending influenza activity. Subsequently, the prognostic reliability of the correlation between increasing search engine entries and increased influenza activity has been questioned. In recent publications on digital biosurveillance, communication processes in online networks are analyzed more intensively. Especially in the field of Twitter research (Paul and Dredze 2011), researchers have developed specific techniques and knowledge models for the study of future disease development and, backed by context-oriented sentiment analysis and social network analysis, hold out the prospect of a socially and culturally differentiated biosurveillance.

Further Readings

Albrechtslund, A. (2008). Online social networking as participatory surveillance. First Monday, 13(3). Online: http://firstmonday.org/ojs/index.php/fm/article/viewArticle/2142/1949
Brownstein, J. S., et al. (2009). Digital disease detection – Harnessing the web for public health surveillance. The New England Journal of Medicine, 360(21), 2153–2157.
Burkom, H. S., et al. (2008). Decisions in biosurveillance: Tradeoffs driving policy and research. Johns Hopkins Technical Digest, 27(4), 299–311.

Eysenbach, G. (2006). Infodemiology: Tracking flu-related searches on the Web for syndromic surveillance. AMIA Annual Symposium Proceedings, 244–248.
Freyer-Dugas, A., et al. (2012). Google Flu Trends: Correlation with emergency department influenza rates and crowding metrics. Clinical Infectious Diseases, 54(15), 463–469.
Ginsberg, J., et al. (2009). Detecting influenza epidemics using search engine query data. Nature, 457, 1012–1014.
Paul, M. J., & Dredze, M. (2011). You are what you Tweet: Analyzing Twitter for public health. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media. Online: www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/.../3264
Walters, R. A., et al. (2010). Data sources for biosurveillance. In J. G. Voeller (Ed.), Wiley handbook of science and technology for homeland security (Vol. 4, pp. 2431–2447). Hoboken: Wiley.
C

Cancer stages within the cancer continuum. Sources of


data include laboratory investigations, feasibility
Christine Skubisz studies, clinical trials, cancer registries, and
Department of Communication Studies, Emerson patient medical records. The paragraphs that fol-
College, Boston, MA, USA low describe current practices and future direc-
Department of Behavioral Health and Nutrition, tions for cancer-related research in the era of big
University of Delaware, Newark, DE, USA data.

Cancer is an umbrella term that encompasses Cancer Prevention and Early Detection
more than 100 unique diseases related to the
uncontrolled growth of cells in the human body. Epidemiology is the study of the causes and pat-
Cancer is not completely understood by scientists, terns of human diseases. Aggregated data allows
but it is generally accepted to be caused by both epidemiologists to study why and how cancer
internal genetic factors and external environmen- forms. Researchers study the causes of cancer
tal factors. The US National Cancer Institute and ultimately make recommendations about
describes cancer on a continuum, with points of how to prevent cancer. Data provides medical
significance that include prevention, early detec- practitioners with information about populations
tion, diagnosis, treatment, survivorship, and end- at risk. This can facilitate proactive and preventive
of-life care. This continuum provides a frame- action. Data is used by expert groups including the
work for research priorities. Cancer prevention American Cancer Society and the United States
includes lifestyle interventions such as tobacco Preventive Services Task Force to write recom-
control, diet, physical activity, and immunization. mendations about screening for detection. Screen-
Detection includes screening tests that identify ing tests, including mammography and
atypical cells. Diagnosis and treatment involves colonoscopy, have advantages and disadvantages.
informed decision making, the development of Evidence-based results, from large representative
new treatments and diagnostic tests, and outcomes samples, can be used to recommend screening for
research. Finally, end-of-life care includes pallia- those who will gain the largest benefit and sustain
tive treatment decisions and social support. Large the fewest harms. Data can be used to identify
data sets can be used to uncover patterns, view where public health education and resources
trends, and examine associations between vari- should be disseminated.
ables. Searching, aggregating, and cross- At the individual level, aggregated information
referencing large data sets is beneficial at all can guide lifestyle choices. With the help of

With the help of technology, people have the ability to quickly and easily measure many aspects of their daily lives. Gary Wolf and Kevin Kelly coined this rapid accumulation of personal data the quantified self movement. Individual-level data can be collected through wearable devices, activity trackers, and smartphone applications. The data that is accumulated is valuable for cancer prevention and early detection. Individuals can track their physical activity and diet over time. These wearable devices and applications also allow individuals to become involved in cancer research. Individuals can play a direct role in research by contributing genetic data and information about their health. Health care providers and researchers can view genetic and activity data to understand the connections between health behaviors and outcomes.

Diagnosis and Treatment

Aggregated data that has been collected over long periods of time has made a significant contribution to research on the diagnosis and treatment of cancer. The Human Genome Project, completed in 2003, was one of the first research endeavors to harness large data sets. Researchers have used information from the Human Genome Project to develop new medicines that can target genetic changes or drivers of cancer growth. The ability to sequence the DNA of large numbers of tumors has allowed researchers to model the genetic changes underlying certain cancers.

Genetic data is stored in biobanks, repositories in which samples of human DNA are stored for testing and analysis. Researchers draw from these samples and analyze genetic variation to observe differences in the genetic material of someone with a specific disease compared to a healthy individual. Biobanks are run by hospitals, research organizations, universities, or other medical centers. Many biobanks do not meet the needs of researchers due to an insufficient number of samples. The burgeoning ability to aggregate data across biobanks, within the United States and internationally, is invaluable and has the potential to lead to new discoveries in the future.

Data is also being used to predict which medications may be good candidates to move forward into clinical research trials. Clinical trials are scientific studies that are designed to determine if new treatments and diagnostic procedures are safe and effective. Margaret Mooney and Musa Mayer estimate that only 3% of adult cancer patients participate in clinical trials. Much of what is known about cancer treatment is based on data from this small segment of the larger population. Data from patients who do not participate in clinical trials exists, but this data is unconnected and stored in paper and in electronic medical records. New techniques in big data aggregation have the potential to facilitate patient recruitment for clinical trials. Thousands of studies are in progress worldwide at any given point in time. The traditional, manual process of matching patients with appropriate trials is both time consuming and inefficient. Big data approaches can allow for the integration of medical records and clinical trial data from across multiple organizations. This aggregation can facilitate the identification of patients for inclusion in an appropriate clinical trial. Nicholas LaRusso writes that IBM's supercomputer Watson will soon be used to match cancer patients with clinical trials. Patient data can be mined for lifestyle factors and genetic factors. This can allow for faster identification of participants who meet inclusion criteria. Watson, and other supercomputers, can shorten the patient identification process considerably, matching patients in seconds. This has the potential to increase enrollment in clinical trials and ultimately advance cancer research.
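In its simplest form, the matching step described above is a filter of structured patient records against a trial's inclusion and exclusion criteria. The sketch below illustrates the idea with invented fields and thresholds; a real system would replace these with coded eligibility criteria drawn from trial registries and EHRs.

```python
# Hypothetical sketch: screening patient records against the inclusion
# criteria of a clinical trial. Fields and criteria are invented.
patients = [
    {"id": "P1", "age": 62, "diagnosis": "NSCLC", "ecog": 1, "prior_chemo": False},
    {"id": "P2", "age": 48, "diagnosis": "breast", "ecog": 0, "prior_chemo": True},
    {"id": "P3", "age": 71, "diagnosis": "NSCLC", "ecog": 2, "prior_chemo": False},
]

trial_criteria = {
    "diagnosis": "NSCLC",         # inclusion: non-small cell lung cancer
    "min_age": 18, "max_age": 75,
    "max_ecog": 1,                # performance status cutoff
    "prior_chemo": False,         # exclusion: prior chemotherapy
}

def eligible(p, c):
    return (p["diagnosis"] == c["diagnosis"]
            and c["min_age"] <= p["age"] <= c["max_age"]
            and p["ecog"] <= c["max_ecog"]
            and p["prior_chemo"] == c["prior_chemo"])

print([p["id"] for p in patients if eligible(p, trial_criteria)])  # -> ['P1']
```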
Health care providers' access to large data sets can improve patient care. When making a diagnosis, providers can access information from patients exhibiting similar symptoms, lifestyle choices, and demographics to form more accurate conclusions. Aggregated data can also improve a patient's treatment plan and reduce the costs of conducting unnecessary tests. Knowing a patient's prognosis helps a provider decide how aggressively to treat cancer and what steps to take after treatment.

If aggregate data from large and diverse groups of patients were available in a single database, providers would be better equipped to predict long-term outcomes for patients. Aggregate data can help providers select the best treatment plan for each patient, based on the experiences of similar patients. This can also allow providers to uncover patterns to improve care. Providers can also compare their patient outcomes to the outcomes of their peers. Harlan Krumholz, a professor at the Yale School of Medicine, argued that the best way to study cancer is to learn from everyone who has cancer.

Survivorship and End-of-Life Care

Cancer survivors face physical, psychological, social, and financial difficulties after treatment and for the remaining years of their lives. As science advances, people are surviving cancer and living in remission. A comprehensive database on cancer survivorship could be used to develop, test, and maintain patient navigation systems to facilitate optimal care for cancer survivors.

Treating or curing cancer is not always possible. Health care providers typically base patient assessments on past experiences and the best data available for a given condition. Aggregate data can be used to create algorithms that model the severity of illness and predict outcomes. This can assist doctors and families who are making decisions about end-of-life care. Detailed information, based on a large number of cases, can allow for more informed decision making. For example, if a provider is able to tell a patient's family with confidence that it is extremely unlikely that the patient will survive, even with radical treatment, this eases the discussion about palliative care.
access to their personal information.
Challenges and Limitations

The ability to search, aggregate, and cross-refer- Cross-References


ence large data sets has a number of advantages in
the prevention and treatment of cancer. Yet, there ▶ DNA
are multiple challenges and limitations to the use ▶ Evidence Based Medicine
of big data in this domain. First, we are limited to ▶ Genetics
▶ Genome Data
▶ Health Care Delivery
▶ Human Genome Project
▶ Nutrition
▶ Prevention
▶ Treatment

Further Readings

Murdoch, T. B., & Detsky, A. S. (2013). The inevitable application of big data to health care. Journal of the American Medical Association, 309(13), 1351–1352.
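The automated patient–trial matching idea discussed in this entry can be illustrated with a minimal, purely hypothetical Python sketch: records are screened against a trial's inclusion criteria. The patient records, field names, and criteria below are invented for illustration and do not describe Watson or any real system.

```python
"""Toy sketch: screen aggregated patient records against hypothetical
inclusion criteria for a clinical trial."""

patients = [
    {"id": "p1", "age": 62, "diagnosis": "breast cancer", "stage": 2},
    {"id": "p2", "age": 45, "diagnosis": "breast cancer", "stage": 4},
    {"id": "p3", "age": 70, "diagnosis": "lung cancer",   "stage": 3},
]

trial_criteria = {"diagnosis": "breast cancer", "min_age": 50, "max_stage": 3}

def eligible(patient, criteria):
    """Return True if a record satisfies every inclusion criterion."""
    return (patient["diagnosis"] == criteria["diagnosis"]
            and patient["age"] >= criteria["min_age"]
            and patient["stage"] <= criteria["max_stage"])

matches = [p["id"] for p in patients if eligible(p, trial_criteria)]
print(matches)  # ['p1']
```

In practice such screening would run over millions of aggregated records and far richer criteria, but the core operation remains a rule-based filter of this kind.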
Cloud Services

Paula K. Baldwin
Department of Communication Studies, Western Oregon University, Monmouth, OR, USA

As consumers and institutions accumulate larger and larger portions of data, hardware storage has become inadequate. These additional storage needs led to the development of virtual data centers, also known as the cloud, cloud computing, or, in the case of the cloud providers, cloud services. The origin of the term cloud computing is somewhat unclear, but a cloud-shaped symbol is often used as a representation of the cloud on the Internet. The cloud symbol also represents the remote, complex system infrastructure used to store and manage the consumer's data.

The first reference to cloud computing in the contemporary age appeared in the mid-1990s, and the term became popular in the mid-2000s. As cloud services become more versatile and economical, consumers' use is increasing. The cloud offers users immediate access to a shared pool of computer resources. As processors continue to develop both in power and in economic feasibility, these data centers (the cloud) have expanded on an enormous scale. Cloud services incentivize migration to the cloud as users recognize the elastic potential for data storage at a reasonable cost. Cloud services are the new generation of computing infrastructures, and there are multiple cloud vendors providing a range of cloud services. The fiscal benefit of cloud computing is that the consumer pays only for the resources they need, without any concern about exhausting their physical storage. The cloud service manages the data on the back end. In an era in which physical storage limitations have become problematic with increased downloads of movies, books, graphics, and other memory-intensive products, cloud computing has been a welcome development.

Choosing a Cloud Service

As the cloud service industry grows, choosing a cloud service can be confusing for the consumer. One of the first areas to consider is the unique cloud service configurations. Cloud services are configured in four ways. One, public clouds may be free, bundled with other services, or offered as pay per usage. Generally speaking, public cloud service providers like Amazon AWS, Microsoft, and Google own and operate their own infrastructure data centers, and access to these providers' services is through the Internet. Private cloud services are data management infrastructures created solely for one particular organization. Management of the private cloud may be internal or external. Community cloud services exist when multiple organizations from a specific community with common needs choose to share an infrastructure. Again, management of the community cloud
service may be internal or external, and fiscal responsibility is shared between the organizations. Hybrid clouds are a grouping of two or more clouds, public, private, or community, in which the cloud service is composed of some combination that extends the capacity of the service through aggregation, integration, or customization with another cloud service. Sometimes a hybrid cloud is used on a temporary basis to meet short-term data needs that cannot be fulfilled by the private cloud. The ability to use a hybrid cloud enables the organization to pay for extra resources only when they are needed, which serves as a fiscal incentive for organizations to use a hybrid cloud service.

The other aspect to consider when evaluating cloud services is the specific service models offered to the consumer or organization. Cloud computing offers three different levels of service: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). SaaS provides a specific application or service subscription for the customer (e.g., Dropbox, Salesforce.com, and QuickBooks). With SaaS, the service provider handles the installation, setup, and running of the application, with little to no customization. PaaS gives businesses an integrated platform on which they can create and deploy custom apps, databases, and line-of-business services (e.g., Microsoft Windows Azure, IBM Bluemix, Amazon Web Services (AWS) Elastic Beanstalk, Heroku, Force.com, Apache Stratos, Engine Yard, and Google App Engine). The PaaS service model includes the operating system, programming language execution environment, database, and web server designed for a specific framework, with a high level of customization. With Infrastructure as a Service (IaaS), businesses can purchase infrastructure from providers as virtual resources. Components include servers, memory, firewalls, and more, but the organization provides the operating system. IaaS providers include Amazon Elastic Compute Cloud (Amazon EC2), GoGrid, Joyent, AppNexus, Rackspace, and Google Compute Engine.

Once the correct cloud service configuration is determined, the next step is to match user needs with the correct service level. When looking at cloud services, it is important to examine four different aspects: application requirements, business expectations, capacity provisioning, and cloud information collection and processing. These four areas complicate the process of selecting a cloud service. First, the application requirements refer to features such as data volume, data production rate, data transfer and updating, communication, and computing intensities. These factors are important because differences in them will affect the CPU (central processing unit), memory, storage, and network bandwidth available to the user. Business expectations fluctuate depending on the applications and potential users, which, in turn, affect the cost. The pricing model depends on the level of service required (e.g., voicemail, a dedicated service, the amount of storage required, additional software packages, and other custom services). Capacity provisioning is based on the concept that different IT technologies are employed according to need, and each technology has its own unique strengths and weaknesses. The downside for the consumer is the steep learning curve required. The final challenge requires that consumers invest a substantial amount of time to investigate individual websites, collect information about each cloud service offering, collate their findings, and employ their own assessments to determine their best match. If an organization has an internal IT department or employs an IT consultant, the decision is easier to make; for the individual consumer without an IT background, the choice may be considerably more difficult.

Cloud Safety and Security

For the consumer, two primary issues are relevant to cloud usage: a check-and-balance system on usage versus the service level purchased, and data safety. The on-demand computation model of cloud computing is processed through large virtual data centers (clouds), offering storage and computation for all types of cloud users. These needs are based on service level agreements. Although cloud services are relatively
low cost, there is no way for consumers to know whether the services they are purchasing are equivalent to the service level purchased. Being able to confirm that usage matches the purchased service level matters, but the more serious concern for consumers is data safety. Because users do not have physical possession of their data, public cloud services are underutilized due to trust issues. Larger organizations use privately held clouds, but if a company does not have the resources to develop its own cloud service, it is still unlikely to use public cloud services due to safety concerns. Currently, there is no global standardization of data encryption between cloud services, and experts have raised concerns that there is no way to be completely sure that data, once moved to the cloud, remain secure. With most cloud services, control of the encryption keys is retained by the cloud service, making your data vulnerable to a rogue employee or a governmental request to see your data.

The Electronic Frontier Foundation (EFF) is a privacy advocacy group that maintains a section on its website (Who Has Your Back) that rates the largest Internet companies on their data protections. The EFF site uses six criteria to rate the companies: requires a warrant for content, tells users about government data requests, publishes transparency reports, publishes law enforcement guidelines, fights for user privacy rights in courts, and fights for user privacy rights in Congress. Another consumer and corporate data protection effort is the Tahoe Least Authority File System (Tahoe-LAFS) project. Tahoe-LAFS is a free, open-source storage system created and developed by Zooko Wilcox-O'Hearn with the goal of data security and protection from hardware failure. The strength of this storage system is its encryption and integrity checks – files first go through gateway servers, and after the process is complete, the data is stored on a secondary set of servers that cannot read or modify the data.

Security for data storage via cloud services is a global concern, whether for individuals or organizations. From a legal perspective, there is a great deal of variance in how different countries and regions deal with security issues. At this point in time, until there are universal rules or legislation specifically addressing data privacy, consumers must take responsibility for their own data. There are several strategies for keeping your data secure in the cloud, beyond what the cloud services offer. First, consider storing crucial information somewhere other than the cloud. For this type of information, utilizing the available hardware storage might be a better solution than using a cloud service. Second, when choosing a cloud service, take the time to read the user agreement. The user agreement should clearly delineate the parameters of the service level, and that will help with the decision-making. Third, take creating passwords seriously. Oftentimes, the easy route for passwords is familiar information such as dates of birth, hometowns, and pets' or children's names. With the advances in hardware and software designed specifically to crack passwords, it is particularly important to use robust, unique passwords for each of your accounts. Fourth, the best way to protect data is through encryption. The way encryption works in this instance is to use encryption software on a file before you move the file to the cloud. Without the password to the encryption, no one will be able to read the file content. When considering a cloud service, investigate its encryption services. Some cloud services encrypt and decrypt user files locally as well as provide storage and backup. Using this type of service ensures that data is encrypted before it is stored in the cloud and after it is downloaded from the cloud, providing the optimal safety net for consumer data.

Cross-References

▶ Cloud
▶ Cloud Computing
▶ Cloud Safety
▶ Cloud Storage
▶ Computer Network Storage
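The client-side encryption strategy recommended in this entry (encrypting a file locally before it is moved to the cloud) can be sketched in a few lines of Python using the third-party cryptography package. The file names are hypothetical, and this is only an illustrative sketch, not a recommendation of any particular tool or cloud service.

```python
"""Minimal sketch: encrypt a file locally before uploading it to a cloud
service, so the provider only ever stores ciphertext."""

from cryptography.fernet import Fernet

# Generate a key and keep it safe locally; whoever holds it can decrypt.
key = Fernet.generate_key()
fernet = Fernet(key)

with open("tax_records.pdf", "rb") as f:        # hypothetical local file
    ciphertext = fernet.encrypt(f.read())

with open("tax_records.pdf.enc", "wb") as f:    # this is what gets uploaded
    f.write(ciphertext)

# Later, after downloading the encrypted file back from the cloud:
with open("tax_records.pdf.enc", "rb") as f:
    plaintext = fernet.decrypt(f.read())
```

Because the key never leaves the consumer's machine, the cloud provider stores only unreadable ciphertext; losing the key, however, means losing access to the file.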
Further Readings

Ding, S., et al. (2014). Decision support for personalized cloud service selection through multi-attribute trustworthiness evaluation. PLoS One, 9(6), e97762.
Gui, Z., et al. (2014). A service brokering and recommendation mechanism for better selecting cloud services. PLoS One, 8(8), e105297. https://doi.org/10.1371/journal.pone.0105297
Hussain, M., et al. (2014). Software quality in the clouds: A cloud-based solution. Cluster Computing, 17(2), 389–402.
Kun, H., et al. (2014). Securing the cloud storage audit service: Defending against frame and collude attacks of third party auditor. IET Communications, 8(12), 2106–2113.
Mell, P., et al. (2011). National Institute of Standards and Technology, U.S. Department of Commerce. The NIST definition of cloud computing. Special Publication 800-145, 9–17.
Qi, Q., et al. (2014). Cloud service-aware location update in mobile cloud computing. IET Communications, 8(8), 1417–1424.
Rehman, Z., et al. (2014). Parallel cloud service selection and ranking based on QoS history. International Journal of Parallel Programming, 42(5), 820–852.
Communications

Alison N. Novak
Department of Public Relations and Advertising, Rowan University, Glassboro, NJ, USA

There is much debate about the origins and history of the field of Communications. While many researchers point to a rhetorical origin in ancient Greece, others suggest the field is much newer, developing from psychology and propaganda studies of the 1940s. The discipline includes scholars exploring subtopics such as political communication, media effects, and organizational relationships. The field generally uses both qualitative and quantitative approaches, as well as developing a variety of mixed-methods techniques to understand social phenomena.

Russell W. Burns argues that the field of Communications developed from a need to explore the ways in which media influenced people to behave, support, or believe in a certain idea. Much of Communication studies investigates the idea of media and texts, such as newspaper discourses, social media messages, or radio transcripts. As the field has developed, it has investigated new technologies and media, including those still in their infancies.

Malcolm R. Parks states that the field of Communications has not adopted one set definition of big data, but rather sees the term as a means to identify datasets and archival techniques. Singularly thinking of big data as a unit of measurement or a size fails to underscore the many uses and methods used by Communications to explore big datasets.

One frequent source of big data analysis in Communications is network analysis or social network analysis. This method is used to explore the ways in which individuals are connected in physical and digital spaces. Communications research on social networks particularly investigates how close individuals are to each other, whom they are connected through, and what resources can be shared amongst networks. These networks can be archived from social networking sites such as Twitter or Facebook or, alternatively, can be constructed through surveys of people within a group, organization, or community. The automated data aggregation of digital social networks makes the method appealing to Communications researchers because it produces large networks quickly and with limited possibility of human error in recording nodes. Additionally, the subfield of Health Communications has adopted the integration of big datasets in an effort to study how healthcare messages are spread across a network.

Natural language processing is another area of big data inquiry in the field of Communications. In this vein of research, scholars explore the way that computers can develop an understanding of language and generate responses. Often studied along with Information Science researchers and Artificial Intelligence developers, natural language processing
draws from Communications' association with linguistics and modern languages. Natural language processing is an attempt to build communication into computers so they can understand and provide more sender-tailored messages to users.

The field of communication has also been outspoken about the promises levied with big data analytics as well as the ethics of big data use. Recognizing that the field is still early in its development, scholars point to the lifespan of other technologies and innovations as examples of how optimism early in the lifecycle often turns into critique. Pierre Levy is one Communications scholar who explains that although new datasets and technologies are viewed as positive changes with big promises early in their trajectory, as more information is learned about their effects, scholars often begin to challenge their use and ability to provide insight.

Communications scholars often refer to big data as the "datafication" of society, meaning turning everyday interactions and experiences into quantifiable data that can be segmented and analyzed using broad techniques. This in particular refers to analyzing data that have not previously been viewed as data. Although this is partially where the value of big data develops from, for Communications researchers it complicates the ability to think holistically or qualitatively.

Specifically, big datasets in Communications research include information taken from social media sites, health records, media texts, political polls, and brokered language transcriptions. The wide variety of types of datasets reflects the truly broad nature of the discipline and its subfields.

Malcolm Parks offers suggestions on the future of big data research within the field of Communications. First, the field must situate big data research within larger theoretical contexts. One critique of the data revolution is the false identification of this form of analysis as being new. Rather than considering big data as an entirely new phenomenon, by situating it within a larger history of Communications theory, more direct comparisons between past and present datasets can be drawn. Second, the field requires more attention to the topic of validity in big data analysis. While quantitative and statistical measurements can support the reliability of a study, validity asks researchers to provide examples or other forms of support for their conclusions. This greatly challenges the ethical notions of anonymity in big data, as well as the consent process for individual protections. This is one avenue in which the quality of big data research needs more work within the field of communications.

Communications asserts that big data is an important technological and methodological advancement within research; however, due to its newness, researchers need to exercise caution when considering its future. Specifically, researchers must focus on the ethics of inclusion in big datasets, along with the quality of analysis and the long-term effects of this type of dataset on society.

Further Readings

Burns, R. W. (2003). Communications: An international history of the formative years. New York: IEE History of Technology Series.
Levy, P. (1997). Collective intelligence: Mankind's emerging world in cyberspace. New York: Perseus Books.
Parks, M. R. (2014). Big data in communication research: Its contents and discontents. Journal of Communication, 64, 355–360.
cations. First, the field must situate big data
Computational Social Sciences

Ines Amaral
University of Minho, Braga, Minho, Portugal
Autonomous University of Lisbon, Lisbon, Portugal

Computational social sciences is a research discipline at the interface between computer science and the traditional social sciences. This interdisciplinary and emerging scientific field uses computational methods to analyze and model social phenomena, social structures, and collective behavior. The main computational approaches to the social sciences are social network analysis, automated information extraction systems, social geographic information systems, complexity modeling, and social simulation models.

New areas of social science research have arisen due to the existence of computational and statistical tools, which allow social scientists to extract and analyze large datasets of social information. Computational social sciences diverges from conventional social science because of the use of mathematical methods to model social phenomena. As an intersection of computer science, statistics, and the social sciences, computational social science is an interdisciplinary subject, which uses large-scale demographic, behavioral, and network data to analyze individual activity, collective behaviors, and relationships. Modern distributed computing frameworks, algorithms, statistics, and machine learning methods can improve several social science fields like anthropology, sociology, economics, psychology, political science, media studies, and marketing. Therefore, computational social sciences is an interdisciplinary scientific area, which explores the social dynamics of society through advanced computational systems.

Computational social science is a relatively new field, and its development is closely related to computational sociology, which is often associated with the study of social complexity, a useful conceptual framework for the analysis of society. Social complexity is theory neutral and frames both local and global approaches to social research. The theoretical background of this conceptual framework dates back to the work of Talcott Parsons on action theory, the integration of the study of social order with the structural features of macro and micro factors. Several decades later, in the early 1990s, social theorist Niklas Luhmann began to work on the themes of complex behavior. By then, new statistical and computational methodologies were being developed for social science problems.

Nigel Gilbert, Klaus G. Troitzsch, and Joshua M. Epstein are the founders of modern computational sociology, merging social science research with simulation techniques in order to model complex policy issues and essential features of human societies. Nigel Gilbert is a pioneer in the use of agent-based models in the social sciences. Klaus G. Troitzsch introduced the method of computer-based simulation
in the social sciences. Joshua M. Epstein developed, with Robert Axtell, the first large-scale agent-based computational model, which aimed to explore the role of social experiences such as seasonal migrations, pollution, and transmission of disease.

As an instrument-based discipline, computational social sciences enables the observation and empirical study of phenomena through computational methods and quantitative datasets. Quantitative methods such as dynamical systems, artificial intelligence, network theory, social network analysis, data mining, agent-based modeling, computational content analysis, social simulations (macrosimulation and microsimulation), and statistical mechanics are often combined to study complex social systems.

Technological developments are constantly changing society, ways of communication, behavioral patterns, the principles of social influence, and the formation and organization of groups and communities, enabling the emergence of self-organized movements. As technology-mediated behaviors and collectives are primary elements in the dynamics and in the design of social structures, computational approaches are critical to understanding the complex mechanisms that form part of many social phenomena in contemporary society. Big data can be used to understand many complex phenomena as it offers new opportunities to work toward a quantitative understanding of our complex social systems. Technologically mediated social phenomena emerging over multiple scales are available in complex datasets. Twitter, Facebook, Google, and Wikipedia have shown that it is possible to relate, compare, and predict opinions, attitudes, social influences, and collective behaviors. Online and offline big data can provide insights that allow the understanding of social phenomena like diffusion of information, polarization in politics, formation of groups, and evolution of networks.

Big data is dynamic, heterogeneous, and interrelated. But it is also often noisy and unreliable. Even so, big data may be more valuable to the social sciences than small samples because the overall statistics obtained from frequent patterns and correlation analysis often disclose hidden patterns and more reliable knowledge. Furthermore, when big data is connected, it forms large networks of heterogeneous information with data redundancy that can be exploited to compensate for the lack of data, to validate trust relationships, to disclose inherent groups, and to discover hidden patterns and models. Several methodologies and applications in the context of modern social science datasets allow scientists to understand and study different social phenomena, from political decisions to the reactions of economic markets to the interactions of individuals and the emergence of self-organized global movements.

Trillions of bytes of data can be captured by instruments or generated by simulation. Through better analysis of the large volumes of data that are becoming available, there is the potential to make further advances in many scientific disciplines and to improve social knowledge and the success of many companies. More than ever, science is now a collaborative activity. Computational systems and techniques have created new ways of collecting, crossing, and interconnecting data. Analysis of big data is now at the disposal of the social sciences, allowing the study of cases at macro- and microscales in connection with other scientific fields.

Cross-References

▶ Computer Science
▶ Data Visualization
▶ Network Analytics
▶ Network Data
▶ Physics
▶ Social Network Analysis (SNA)
▶ Sociology
▶ Visualization

Further Readings

Bankes, S., Lempert, R., & Popper, S. (2002). Making computational social science effective: Epistemology, methodology, and technology. Social Science Computer Review, 20(4), 377–388.
Bainbridge, W. S. (2007). Computational sociology. In The Blackwell encyclopedia of sociology. Malden, MA: Blackwell Publishing.
Cioffi-Revilla, C. (2010). Computational social science. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 259–271.
Conte, R., et al. (2012). Manifesto of computational social science. The European Physical Journal Special Topics, 214(1), 325–346.
Lazer, D., et al. (2009). Computational social science. Science, 323(5915), 721–723.
Miller, J. H., & Page, S. E. (2009). Complex adaptive systems: An introduction to computational models of social life. Princeton: Princeton University Press.
Oboler, A., et al. (2012). The danger of big data: Social media as computational social science. First Monday, 17(7). Retrieved from http://firstmonday.org/article/view/3993/3269/
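The agent-based modeling approach discussed in this entry can be illustrated with a very small simulation sketch in Python. The update rule, neighborhood size, and parameters below are illustrative assumptions only, not any published model.

```python
"""Tiny agent-based simulation: agents repeatedly adopt the majority
opinion of a randomly sampled set of peers."""

import random

random.seed(1)
N_AGENTS, STEPS = 100, 1000

# Each agent starts with a random binary opinion.
opinions = [random.choice([0, 1]) for _ in range(N_AGENTS)]

for _ in range(STEPS):
    i = random.randrange(N_AGENTS)              # agent being updated
    peers = random.sample(range(N_AGENTS), 5)   # a random "neighborhood"
    # Adopt the majority opinion among the sampled peers.
    opinions[i] = 1 if sum(opinions[p] for p in peers) >= 3 else 0

print("share holding opinion 1:", sum(opinions) / N_AGENTS)
```

Even this toy model shows the basic workflow of computational social science: specify simple individual-level rules, simulate many interactions, and observe the collective pattern that emerges.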
Content Moderation

Sarah T. Roberts
Department of Information Studies, University of California, Los Angeles, Los Angeles, CA, USA

Synonyms

Community management; Community moderation; Content screening

Definition

Content moderation is the organized practice of screening user-generated content (UGC) posted to Internet sites, social media, and other online outlets, in order to determine the appropriateness of the content for a given site, locality, or jurisdiction. The process can result in UGC being removed by a moderator, acting as an agent of the platform or site in question. Increasingly, social media platforms rely on massive quantities of UGC data to populate them and to drive user engagement; with that increase has come the concomitant need for platforms and sites to enforce their rules and relevant or applicable laws, as the posting of inappropriate content is considered a major source of liability.

The style of moderation can vary from site to site, and from platform to platform, as rules around what UGC is allowed are often set at a site or platform level and reflect that platform's brand and reputation, its tolerance for risk, and the type of user engagement it wishes to attract. In some cases, content moderation may take place in haphazard, disorganized, or inconsistent ways; in others, content moderation is a highly organized, routinized, and specific process. Content moderation may be undertaken by volunteers or, increasingly, in a commercial context by individuals or firms who receive remuneration for their services. The latter practice is known as commercial content moderation, or CCM. The firms who own social media sites and platforms that solicit UGC employ content moderation as a means to protect the firm from liability and negative publicity and to curate and control user experience.

History

The Internet and its many underlying technologies are highly codified and protocol-reliant spaces with regard to how data are transmitted within it (Galloway 2006), yet the subject matter and nature of content itself has historically enjoyed a much greater freedom. Indeed, a central claim to the early promise of the Internet as espoused by many of its proponents was that it was highly resistant, as a foundational part of both its architecture and ethos, to censorship of any kind.

Nevertheless, various forms of content moderation occurred in early online communities. Such content moderation was frequently undertaken by
volunteers and was typically based on the enforcement of local rules of engagement around community norms and user behavior. Moderation practices and style therefore developed locally among communities and their participants and could inform the flavor of a given community, from the highly rule-bound to the anarchic: the Bay Area-based online community the WELL famously banned only three users in its first 6 years of existence, and then only temporarily (Turner 2005, p. 499).

In social communities on the early text-based Internet, mechanisms to enact moderation were often direct and visible to the user and could include demanding that a user alter a contribution to eliminate offensive or insulting material, the deletion or removal of posts, the banning of users (by username or IP address), the use of text filters to disallow posting of specific types of words or content, and other overt moderation actions. Examples of sites of this sort of content moderation include many Usenet groups, BBSes, MUDs, listservs, and various early commercial services.

Motives for people participating in voluntary moderation activities varied. In some cases, users carried out content moderation duties for prestige, status, or altruistic purposes (i.e., for the betterment of the community); in others, moderators received non-monetary compensation, such as free or reduced-fee access to online services, e.g., AOL (Postigo 2003). The voluntary model of content moderation persists today in many online communities and platforms; one such high-profile site where volunteer content moderation is used exclusively to control site content is Wikipedia.

As the Internet has grown into large-scale adoption and a massive economic engine, the desire for major mainstream platforms to control the UGC that they host and disseminate has also grown exponentially. Early on in the proliferation of so-called Web 2.0 sites, newspapers and other news media outlets, in particular, began noticing a significant problem with their online comments areas, which often devolved into unreadable spaces filled with invective, racist and sexist diatribes, name-calling, and irrelevant postings. These media firms began to employ a variety of techniques to combat what they viewed as the misappropriation of the comments spaces, using in-house moderators, turning to firms that specialized in the large-scale management of such interactive areas, and deploying technological interventions such as word filter lists or disallowing anonymous posting, to bring the comments sections under control. Some media outlets went the opposite way, preferring instead to close their comments sections altogether.

Commercial Content Moderation and the Contemporary Social Media Landscape

The battle with text-based comments was just the beginning of a much larger issue. The rise of Friendster, MySpace, and other social media applications in the early part of the twenty-first century has given way to more persistent social media platforms of enormous scale and reach. As of the second quarter of 2016, Facebook alone approached two billion users worldwide, all of whom generate content by virtue of their participation on the platform. YouTube reported receiving upwards of 100 hours of UGC video per minute as of 2014.

The contemporary social media landscape is therefore characterized by vast amounts of UGC uploads made by billions of users to massively popular commercial Internet sites and social media platforms with a global reach. Mainstream platforms, often owned by publicly traded firms responsible to shareholders, simply cannot afford the risk – legal, financial, and to reputation – that unchecked UGC could cause. Yet, contending with the staggering amounts of transmitted data from users to platforms is not a task that can currently be addressed reliably and at large scale by computers. Indeed, making nuanced decisions about what UGC is acceptable and what is not currently exceeds the abilities of machine-driven processes, save for the application of some algorithmically informed filters or bit-for-bit or hash value matching, which occur at relatively low levels of computational complexity.
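The word-filter lists and hash value matching mentioned above are indeed computationally simple. A minimal Python sketch of both mechanisms is shown below; the banned terms, digests, and sample content are hypothetical and purely illustrative.

```python
"""Minimal sketch of low-complexity automated screening: a word-filter
list and SHA-256 hash matching against known banned files."""

import hashlib

BANNED_TERMS = {"spamword", "slur1", "slur2"}   # hypothetical filter list
BANNED_HASHES = {                                # hypothetical SHA-256 digests
    "9f2feb0f1ef425b292f2f94bdd4f4de6cf2445c5a3c339ad0b7dcf44e0ceca78",
}

def violates_word_filter(text: str) -> bool:
    """Flag text containing any banned term (case-insensitive)."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return bool(tokens & BANNED_TERMS)

def matches_banned_hash(payload: bytes) -> bool:
    """Flag uploads whose digest matches a known banned file."""
    return hashlib.sha256(payload).hexdigest() in BANNED_HASHES

comment = "This post contains spamword and nothing else."
upload = b"binary content of an uploaded file"
print(violates_word_filter(comment))   # True
print(matches_banned_hash(upload))     # False unless the digest is listed
```

What such checks cannot do is weigh context, taste, or intent, which is precisely why the nuanced decisions described next still fall to human moderators.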
The need for adjudication of UGC – video- and image-based content, in particular – therefore calls on human actors who rely upon their own linguistic and cultural knowledge and competencies to make decisions about UGC's appropriateness for a given site or platform. Specifically, "they must be experts in matters of taste of the site's presumed audience, have cultural knowledge about location of origin of the platform and of the audience (both of which may be very far removed, geographically and culturally, from where the screening is taking place), have linguistic competency in the language of the UGC (that may be a learned or second language for the content moderator), be steeped in the relevant laws governing the site's location of origin and be experts in the user guidelines and other platform-level specifics concerning what is and is not allowed" (Roberts 2016). These human workers are the people who make up the legions of commercial content moderators: moderators who work in an organized way, for pay, on behalf of the world's largest social media firms, apps, and websites who solicit UGC.

CCM processes may take place prior to material being submitted for inclusion or distribution on a site, or they may take place after material has already been uploaded, particularly on high-volume sites. Specifically, content moderation may be triggered as the result of complaints about material from site moderators or other site administrators, from external parties (e.g., companies alleging misappropriation of material they own; from law enforcement; from government actors) or from other users themselves who are disturbed or concerned by what they have seen and then invoke protocols or mechanisms on a site, such as the "flagging" of content, to prompt a review by moderators (Crawford and Gillespie 2016). In this regard, moderation practices are often uneven, and the removal of UGC may reasonably be likened to censorship, particularly when it is undertaken in order to suppress speech, political opinions, or other expressions that threaten the status quo.

CCM workers are called upon to match and adjudicate volumes of content, typically at rapid speed, against the specific rules or community guidelines of the platform for which they labor. They must also be aware of the laws and statutes that may govern the geographic or national location from where the content emanates, for which the content is destined, and for where the platform or site is located – all of which may be distinct places in the world. They must be aware of the platform's tolerance for risk, as well as the expectations of the platform for whether or how CCM workers should make their presence known.

In many cases, CCM workers may work at organizational arm's length from the platforms that they moderate. Some labor arrangements in CCM have workers located at great distances from the headquarters of the platforms for which they are responsible, in places such as the Philippines and India. The workers may be structurally removed from those firms, as well, via outsourcing companies who take on CCM contracts and then hire the workers under their auspices, in call center (often called BPO, or business process outsourcing) environments. Such outsourcing firms may also recruit CCM workers using digital piecework sites such as Amazon Mechanical Turk or Upwork, in which the relationships between the social media firms, the outsourcing company, and the CCM worker can be as ephemeral as one review.

Even when CCM workers are located on-site at a headquarters of a social media firm, they often are brought on as contract laborers and are not afforded the full status, or pay, of a regular full-time employee. In this regard, CCM work, wherever it takes place in the world and by whatever name, often shares the characteristic of being relatively low wage and low status as compared to other jobs in tech. These arrangements of institutional and geographic removal can pose a risk for workers, who can be exposed to disturbing and shocking material as a condition of their CCM work, but can be a benefit to the social media firms who require their labor, as they can distance themselves from the impact of the CCM work on the workers. Further, the working conditions, practices, and existence of CCM workers in social media are little known to the general public, a fact that is often by design. CCM workers are frequently compelled to sign NDAs, or
nondisclosure agreements, that preclude them from discussing the work that they do or the conditions in which they do it. While social media firms often gesture at the need to maintain secrecy surrounding the exact nature of their moderation practices and the mechanisms they used to undertake them, claiming the possibility of users' being able to game the system and beat the rules if armed with such knowledge, the net result is that CCM workers labor in secret. The conditions of their work – its pace, the nature of the content they screen, the volume of material to be reviewed, and the secrecy – can lead to feelings of isolation, burnout, and depression among some CCM workers. Such feelings can be enhanced by the fact that few people know such work exists, assuming, if they think of it at all, that algorithmically driven computer programs take care of social media's moderation needs. It is a misconception that the industry has been slow to correct.

Conclusion

Despite claims and conventional wisdom to the contrary, content moderation has likely always existed in some form on the social Internet. As the Internet's many social media platforms grow and their financial, political, and social stakes increase, the undertaking of organized control of user expression through such practices as CCM will likewise only increase. Nevertheless, CCM remains a little discussed and little acknowledged aspect of the social media production chain, despite its mission-critical status in almost every case in which it is employed. The existence of a globalized CCM workforce abuts many difficult, existential questions about the nature of the Internet itself and the principles that have long been thought to undergird it, particularly the free expression and circulation of material, thought, and ideas. These questions are further complicated by the pressures related to contested notions of jurisdiction, borders, and the application and enforcement of laws, social norms, and mores that frequently vary and often are in conflict with each other. The acknowledgement and understanding of the history of content moderation and the contemporary reality of large-scale CCM is central to many of these core questions of what the Internet has been, is now, and will be in the future, and yet the continued invisibility and lack of acknowledgment of CCM workers by the firms for which their labor is essential means that such questions cannot fully be addressed. Nevertheless, discussions of moderation practices and the people who undertake them are essential to the end of more robust, nuanced understandings of the state of the contemporary Internet and to better policy and governance based on those understandings.

Cross-References

▶ Algorithm
▶ Facebook
▶ Internet
▶ Social Media
▶ Wikipedia
▶ YouTube

Further Readings

Crawford, K., & Gillespie, T. (2016). What is a flag for? Social media reporting tools and the vocabulary of complaint. New Media & Society, 18(3), 410–428.
Galloway, A. R. (2006). Protocol: How control exists after decentralization. Cambridge, MA: MIT Press.
Postigo, H. (2003). Emerging sources of labor on the internet: The case of America Online volunteers. International Review of Social History, 48(S11), 205–223.
Roberts, S. T. (2016). Commercial content moderation: Digital laborers' dirty work. In S. U. Noble & B. Tynes (Eds.), The intersectional internet: Race, sex, class and culture online (pp. 147–160). New York: Peter Lang.
Turner, F. (2005). Where the counterculture met the new economy: The WELL and the origins of virtual community. Technology and Culture, 46(3), 485–512.
Crowdsourcing

Heather McIntosh
Mass Media, Minnesota State University, Mankato, MN, USA

Crowdsourcing is an online participatory culture activity that brings together large, diverse sets of people and directs their energies and talents toward varied tasks designed to achieve specific goals. The concept draws on the principle that the diversity of knowledge and skills offered by a crowd exceeds the knowledge and skills offered by an elite, select few. For big data, it offers access to abilities for tasks too complex for computational analysis. Corporations, government groups, and nonprofit organizations all use crowdsourcing for multiple projects, and the crowds consist of volunteers who choose to engage tasks toward goals determined by the organizations. Though these goals may benefit the organizations more so than the crowds helping them, ideally the benefit is shared between the two. Crowdsourcing breaks down into basic procedures, the tasks and their applications, the crowds and their makeup, and the challenges and ethical questions.

Crowdsourcing follows a general procedure. First, an organization determines the goal or the problem that requires a crowd's assistance in order to achieve or solve. Next, the organization defines the tasks needed from the crowd in order to fulfill its ambitions. After, the organization seeks the crowd's help, and the crowd engages the tasks. In selective crowdsourcing, the best solution from the crowd is chosen, while in integrative crowdsourcing, the crowd's solutions become worked into the overall project in a useful manner.

Working online is integral to the crowdsourcing process. It allows the gathering of diverse individuals who are geographically dispersed to "come together" for working on the projects. The tools the crowds need to engage the tasks also appear online. Since using an organization's own tools can prove too expensive for big data projects, organizations sometimes use social networks for recruitment and task fulfillment. The documentary project Life in a Day, for example, brought together video footage from people's everyday lives from around the world. When possible, people uploaded their footage to YouTube, a video-sharing platform. To address the disparities of countries without access to digital production technologies and the Internet, the project team sent cameras and memory storage cards through the mail. Other services assist with recruitment and tasks. LiveWork and Amazon Mechanical Turk are established online service marketplaces, while companies such as InnoCentive and Kaggle offer both the crowds and the tools to support an organization's project goals.

Tasks vary depending on the project's goals, and they vary in structure, interdependence, and commitment. Some tasks follow definite boundaries or procedures, while others are open-ended.
Some tasks depend on other tasks for completion, while others stand alone. Some tasks require but a few seconds, while others demand more time and mental energy. More specifically, tasks might include finding and managing information, analyzing information, solving problems, and producing content. With big data, crowds may enter, clean, and validate data. The crowds may even collect data, particularly geospatial data, which prove useful for search and rescue, land management, disaster response, and traffic management. Other tasks might include transcription of audio or visual data and tagging.

When bringing crowdsourcing to big data, the crowd offers skills that benefit through matters of judgment, contexts, and visuals – skills that exceed computational models. In terms of judgment, people can determine the relevance of items that appear within a data set, identify similarities among items, or fill in holes within the set. In terms of contexts, people can identify the situations surrounding the data and how those situations influence them. For example, a person can determine the difference between the Statue of Liberty on Liberty Island in New York and the replica on The Strip in Las Vegas. The contexts then allow determination of accuracy or ranking, such as in this case differentiating the real from the replica. People also can determine more in-depth relationships among data within a set. For example, people can better decide the accuracy of matches between search engine terms and results, better determine the top search result, or even predict other people's preferences.

Properly managed crowdsourcing begins within an organization that has clear goals for its big data. These organizations can include government, corporations, and nonprofit organizations. Their goals can include improving business practices, increasing innovations, decreasing project completion times, developing issue awareness, and solving social problems. These goals frequently involve partnerships that occur across multiple entities, such as government or corporations partnering with not-for-profit initiatives.

At the federal level and managed through the Massachusetts Institute of Technology's Center for Collective Intelligence, Climate CoLab brings together crowds to analyze issues related to global climate change, registering more than 14,000 members who participate in a range of contests. Within the contests, members create and refine proposals that offer climate change solutions. The proposals then are evaluated by the community and, through voting, recommended for implementation. Contest winners presented their proposals to those who might implement them at a conference. Some contests build their initiatives on big data, such as Smart Mobility, which relies on mobile data for tracking transportation and traveling patterns in order to suggest ways for people to reduce their environmental impacts while still getting where they want to go.

Another government example comes from the city of Boston, wherein a mobile app called Street Bump tracks and maps potential potholes throughout the city in order to guide crews toward fixing them. The crowdsourcing for this initiative comes from two levels. One, the information gathered from the app helps city crews do their work more efficiently. Two, the app's first iteration reported too many false positives, leading crews to places where no potholes existed. The city then worked with a crowd drawn together through InnoCentive to improve the app and its efficiency, with the top suggestions coming from a hackers group, a mathematician, and a software engineer.

Corporations also use crowdsourcing to work with their big data. AOL needed help with cataloging the content on its hundreds of thousands of web pages, specifically the videos and their sources, and turned to crowdsourcing as a means to expedite the project and streamline its costs. Between 2006 and 2010, Netflix, an online streaming and mail DVD distributor, sought help with perfecting its algorithm for predicting user ratings of films. The company developed a contest with a $1 million prize, and for the contest, it offered data sets consisting of multiple millions of units for analysis. The goal was to beat Netflix's current algorithm by 10%, which one group achieved and took home the prize.

Not-for-profit groups also incorporate crowdsourcing as part of their initiatives. AARP Foundation, which works on behalf of older Americans, used crowdsourcing to tackle such
issues as eliminating food insecurity and food deserts (areas where people do not have convenient or close access to grocery stores). Humanitarian Tracker crowdsources data from people "on the ground" about issues such as disease, human rights violations, and rape. Focusing particularly on Syria, Humanitarian Tracker aggregates these data into maps that show the impacts of systematic killings, civilian targeting, and other human tolls.

Not all crowdsourcing and big data projects originate within these organizations. For example, Galaxy Zoo demonstrates the expanses of both big data and crowds. The project asked people to classify a data set of one million galaxies into three categories: elliptical, merger, and spiral. By the project's completion, 150,000 people had contributed 50 million classifications. The data feature multiple independent classifications as well, adding reliability. The largest crowdsourcing project involved searching satellite images for wreckage from Malaysia Airlines flight MH370, which went missing in March 2014. Millions of people searched for signs among the images made available by Colorado-based DigitalGlobe. The amount of crowdsourcing traffic even crashed websites.

Not all big data crowdsourced projects succeed, however. One example is the Google Flu tracker. The tracker included a map to show the disease's spread throughout the season. It was later revealed that the tracker overestimated the expanse of the flu spreading, predicting twice as much as actually occurred.

In addition to their potentially not succeeding, another drawback to these projects is their overall management, which tends to be time-consuming and difficult. Several companies attempt to fulfill this role. InnoCentive and Kaggle use crowds to tackle challenges brought to them by industries, government, and nonprofit organizations. Kaggle in particular offers almost 150,000 data scientists – statisticians – to help companies develop more efficient predictive models, such as deciding the best order in which to show hotel rooms for a travel app or guessing which customers would leave an insurance company within a year. Both InnoCentive and Kaggle run their crowdsourcing activities as contests or competitions, as these are often tasks that require a higher time and mental commitment than others.

Crowds bring wisdom to crowdsourced tasks on big data through their diversity of skills and knowledge. Determining the makeup of that crowd proves more challenging, but one study of Mechanical Turk offers some interesting findings. It found that US females outnumber males by 2 to 1 and that many of the workers hold bachelor's and even master's degrees. Most live in small households of two or fewer people, and most use the crowdsourcing work to supplement their household incomes as opposed to being the primary source of income.

Crowd members choose the projects on which they want to work, and multiple factors contribute to their motivations for joining a project and staying with it. For some working on projects that offer no further incentive to participate, the project needs to align with their interests and experience so that they feel they can make a contribution. Others enjoy connecting with other people, engaging in problem-solving activities, seeking something new, learning more about the data at hand, or even developing a new skill. Some projects offer incentives such as prize money or top-contributor status. For some, entertainment motivates them to participate, in that the tasks offer a diversion. For others, though, working on crowdsourced projects might be an addiction as well.

While crowdsourcing offers multiple benefits for the processing of big data, it also draws some criticism. A primary critique centers on the notion of labor, wherein the crowd contributes knowledge and skills for little-to-no pay, while the organization behind the data stands to gain much more financially. Some crowdsourcing sites offer low cash incentives for the crowd participants, and in doing so, they sidestep labor laws requiring minimum wage and other worker benefits. Opponents of this view cite that the labor involved frequently requires menial tasks and that the labor faces no obligation in completing the assigned tasks. They also cite that crowd participants engage the tasks because they enjoy doing so.

Ethical concerns come back to the types of crowdsourced big data projects and the intentions behind them, such as information gathering,
Ethical concerns come back to the types of crowdsourced big data projects and the intentions behind them, such as information gathering, surveillance, and information manipulation. With information manipulation, for example, crowd participants might create fake product reviews and ratings for various web sites, or they might crack anti-spam devices such as CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). Other activities involve risks and possible violations of other individuals, such as gathering large amounts of personal data for sale. Overall, the crowd participants may remain unaware that they are engaging in unethical activities.

Cross-References

▶ Amazon
▶ Cell Phone Data
▶ Netflix
▶ Predictive Analytics

Further Readings

Brabham, D. C. (2013). Crowdsourcing. Cambridge, MA: MIT Press.
Howe, J. (2009). Crowdsourcing: Why the power of the crowd is driving the future of business. New York: Crown.
Nakatsu, R. T., Grossman, E. B., & Charalambos, L. I. (2014). A taxonomy of crowdsourcing based on task complexity. Journal of Information Science, 40(6), 823–834.
Shirky, C. (2009). Here comes everybody: The power of organizing without organizations. New York: Penguin.
Surowiecki, J. (2005). The wisdom of crowds. New York: Anchor.
Curriculum, Higher Education, and Social Sciences

Stephen T. Schroth
Department of Early Childhood Education, Towson University, Baltimore, MD, USA

Big data, which has revolutionized many practices in business, government, healthcare, and other fields, promises to radically change the curriculum offered in many of the social sciences. Big data involves the capture, collection, storage, collation, search, sharing, analysis, and visualization of enormous data sets so that this information may be used to spot trends, to prevent problems, and to proactively engage in activities that make success more likely. The social sciences, which include fields as disparate as anthropology, economics, education, political science, psychology, and sociology, constitute a varied area, and the tools of big data are being embraced differently within each. The economic demands of setting up systems that permit the use of big data in higher education have also hindered some efforts to use these processes, as these institutions often lack the infrastructure necessary to proceed with such efforts. Opponents of the trend toward using big data tools for social science analyses often stress that while these tools may prove helpful for certain analyses, it is also crucial for students to receive training in more traditional methods. As equipment and training concerns are overcome, however, the use of big data by social sciences departments at colleges and universities seems likely to increase.

Background

A variety of organizations, including government agencies, businesses, colleges, universities, schools, hospitals, research centers, and others, collect data regarding their operations, clients, students, patients, and findings. Disciplines within the social sciences, which are focused upon society and the relationships among individuals within a society, often use such data to inform their studies. Such a volume of data has been generated, however, that many social scientists have found it impossible to use this information in their work in a meaningful manner. The emergence of computers and other electronic forms of data storage has resulted in more data than ever before being collected, especially during the last two decades of the twentieth century. This data was generally stored in separate databases, which made data from different sources inaccessible to most social science users. As a result, much of the information that could potentially be obtained from such sources was not used.
Over the past decade and a half, many businesses became increasingly interested in making use of data they had but did not use regarding customers, processes, sales, and other matters. Big data became seen as a way of organizing and using the numerous sources of information in ways that could benefit organizations and individuals. Infonomics, the study of how information could be used for economic gain, grew in importance as companies and organizations worked to make better use of the information they possessed, with the end goal being to use it in ways that increased profitability. A variety of consulting firms and other organizations began working with large corporations and organizations in an effort to accomplish this. They defined big data as consisting of three “v”s: volume, variety, and velocity.

Volume, as used in this context, refers to the increase in data volume caused by technological innovation. This includes transaction-based data that has been gathered by corporations and organizations over time but also includes unstructured data that derives from social media and other sources as well as increasing amounts of sensor and machine-to-machine data. For years, excessive data volume was a storage issue, as the cost of keeping much of this information was prohibitive. As storage costs have decreased, however, cost has diminished as a concern. Today, how best to determine relevance within large volumes of data and how best to analyze data to create value have emerged as the primary issues facing those wishing to use it.

Velocity refers to the speed at which data stream in, which raises the issue of how best to deal with them in an appropriate way. Technological developments, such as sensors and smart meters, and client and patient needs emphasize the necessity of overseeing and handling inundations of data in near real time. Responding to data velocity in a timely manner represents an ongoing struggle for most corporations and other organizations. Variety in the types of formats in which data today comes to organizations presents a problem for many. Data today includes structured numeric forms stored in traditional databases but has grown to include information created from business applications, e-mails, text documents, audio, video, financial transactions, and a host of others. Many corporations and organizations struggle with governing, managing, and merging different forms of data.

Some have added two additional criteria to these: variability and complexity. Variability concerns the potential inconsistency that data can demonstrate at times, which can be problematic for those who analyze the data. Variability can hamper the process of managing and handling the data. Complexity refers to the intricate process that data management involves, in particular when large volumes of data come from multiple and disparate sources. For analysts and other users to fully understand the information that is contained in these data, they must first be connected, correlated, and linked in a way that helps users make sense of them.

Big Data Comes to the Social Sciences

Colleges, universities, and other research centers have tracked the efforts of the business world to use big data in a way that helped to shape organizational decisions and increase profitability. Many working in the social sciences were intrigued by this process, as they saw it as a useful tool that could be used in their own research. The typical program in these areas, however, did not provide students, be they at the undergraduate or graduate level, the training necessary to engage in big data research projects. As a result, many programs in the social sciences have altered their curriculum in an effort to assure that researchers will be able to carry out such work. For many programs across the social sciences that have pursued curricular changes that will enable students to engage in big data research, these changes have resulted in more coursework in statistics, networking, programming, analytics, database management, and other related areas. As many programs already required a substantial number of courses in other areas, the drive toward big data competency has required many departments to reexamine the work required of their students.
This move toward more coursework that supports big data has not been without its critics. Some have suggested that changes in curricular offerings have come at a high cost, with students now being able to perform certain operations involved with handling data but unable to competently perform other tasks, such as establishing a representative sample or composing a valid survey. These critics also suggest that while big data analysis has been praised for offering tremendous promise, in truth the analysis performed is shallow, especially when compared to that done with smaller data sets. Indeed, representative sampling would negate the need for, and expense of, many big data projects. Such critics suggest that increased emphasis in the curriculum should focus on finding quality, rather than big, data sources and that efforts to train students to load, transform, and extract data are sublimating other more important skills.

Despite these criticisms, changes to the social sciences curriculum are occurring at many institutions. Many programs now require students to engage in work that examines practices and paradigms of data science, which would provide students with a grounding in the core concepts of data science, analytics, and data management. Work in algorithms and modeling, which provides proficiency in basic statistics, classification, cluster analysis, data mining, decision trees, experimental design, forecasting, linear algebra, linear and logistic regression, market basket analysis, predictive modeling, sampling, text analytics, summarization, time series analysis, unsupervised learning, and constrained optimization, is also an area of emphasis in many programs (a brief illustration of such an exercise appears below). Students require exposure to tools and platforms, which provides proficiency in the modeling, development, and visualization tools to be used on big data projects, as well as knowledge about the platforms used for execution, governance, integration, and storage of big data. Finally, work with applications and outcomes, which emphasizes the primary applications of data science to one’s field and how it interacts with disciplinary issues and concerns, has been emphasized by many programs.
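As an illustration of the kind of exercise such algorithms-and-modeling coursework might include — a hypothetical assignment, not drawn from any specific program, and assuming scikit-learn is available — the following sketch fits a small k-means cluster analysis on invented survey data.

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical survey data: years of education and weekly hours online per respondent
respondents = np.array([
    [12, 5], [16, 20], [18, 25], [10, 3], [14, 12], [20, 30], [11, 4], [15, 18],
])

# Group respondents into three clusters and inspect the cluster centers
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(respondents)
print("Cluster assignments:", model.labels_)
print("Cluster centers:\n", model.cluster_centers_)
```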
Some programs have embraced big data tools but suggested that not every student needs mastery of them. Instead, these programs have suggested that big data has emerged as a field of its own and that certain students should be trained in these skills so that they can work with others within the discipline to provide support for those projects that require big data analysis. This approach offers more incremental changes to the social science curricular offerings, as it would require fewer changes for most students yet still enable departments to produce scholars who are equipped to engage in research projects involving big data.

Cross-References

▶ Big Data Quality
▶ Correlation vs. Causation
▶ Curriculum, Higher Education, Business
▶ Curriculum, Higher Education, Humanities
▶ Education
▶ Public Administration/Government

Further Readings

Foreman, J. W. (2013). Data smart: Using data science to transform information into insight. Hoboken: Wiley.
Lane, J. E., & Zimpher, N. L. (2014). Building a smarter university: Big data, innovation, and analytics. Albany: The State University of New York Press.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data. New York: Mariner Books.
Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley.
Data Science

Lourdes S. Martinez
School of Communication, San Diego State University, San Diego, CA, USA

Data science has been defined as the structured study of data for the purpose of producing knowledge. Going beyond simply using data, data science revolves around extracting actionable knowledge from said data. Despite this definition, confusion exists surrounding the conceptual boundaries of data science, in large part due to its intersection with other concepts, including big data and data-driven decision making. Given that unprecedented amounts of data are generated and collected every day, the growing importance of the data science field is undeniable. As an emerging area of research, data science holds promise for optimizing the performance of companies and organizations. The implications of advances in data science are relevant for fields and industries spanning an array of domains.

Defining Data Science

The basis of data science centers around established guiding principles and techniques that help organize the process of drawing out information and insights from data. Conceptually, data science closely resembles data mining, a process relying on technologies that implement these techniques in order to extract insights from data. According to Dhar, Jarke, and Laartz, data science seeks to move beyond simply explaining a phenomenon. Rather, its main purpose is to answer questions that explore and uncover actionable knowledge that informs decision making or predicts outcomes of interest. As such, most of the challenges currently facing data science emanate from properties of big data and the size of its datasets, which are so massive they require the use of alternative technologies for data processing.

Given these characteristics, data science as a field is charged with navigating the abundance of data generated on a daily basis, while supporting machine and human efforts in using big data to answer the most pressing questions facing industry and society. These aims point toward the interdisciplinary nature of data science. According to Loukides, the field itself falls inside the area where computer programming and statistical analysis converge within the context of a particular area of expertise. However, data science differs from statistics in its holistic approach to gathering, amassing, and examining user data to generate data products. Although several areas across industry and society are beginning to explore the possibilities offered by
data science, the idea of what constitutes data science remains nebulous.

Controversy in Defining the Field

According to Provost and Fawcett, one reason why data science is difficult to define relates to its conceptual overlap with big data and data-driven decision making. Data-driven decision making represents an approach characterized by the use of insights gleaned through data analysis for deciding on a course of action. This form of decision making may also incorporate varying amounts of intuition, but does not rely solely on it for moving forward. For example, a marketing manager faced with a decision about how much promotional effort should be invested in a particular product has the option of solely relying on intuition and past experiences, or using a combination of intuition and knowledge gained from data analysis. The latter represents the basis for data-driven decision making. At times, however, in addition to enabling data-driven decision making, data science may also overlap with data-driven decision making. The case of automated online recommendations of products based on user ratings, preferences, and past consumer behavior is an example of where the distinction between data science and data-driven decision making is less clear.

Similarly, differentiating between the concepts of big data and data science becomes murky when considering that approaches used for processing big data overlap with the techniques and principles used to extract knowledge and espoused by data science. This conceptual intersection exists where big data technologies meet data mining techniques. For example, technologies such as Apache™ Hadoop®, which are designed to store and process large-scale data, can also be used to support a variety of data science efforts related to solving business problems, such as fraud detection, and social problems, such as unemployment reduction. As the technologies associated with big data are also often used to apply and bolster approaches to data mining, the boundary between where big data ends and data science begins continues to be imprecise.
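To make the connection concrete, the sketch below mimics, in plain Python, the map-and-reduce style of aggregation that platforms such as Hadoop distribute across many machines; the transaction data and the fraud rule are invented for illustration and are not any organization’s actual detection logic.

```python
from collections import defaultdict

# Hypothetical transaction log: (account_id, amount)
transactions = [
    ("A-17", 120.0), ("B-02", 80.0), ("A-17", 9500.0),
    ("C-11", 45.0), ("B-02", 15000.0), ("A-17", 60.0),
]

# "Map" step: emit (account, amount) pairs; "reduce" step: sum per account
totals = defaultdict(float)
for account, amount in transactions:
    totals[account] += amount

# A toy rule flags accounts whose total spend exceeds a threshold for review
THRESHOLD = 10000.0
flagged = [account for account, total in totals.items() if total > THRESHOLD]
print("Accounts flagged for review:", flagged)
```

On an actual cluster, the same per-account aggregation would run in parallel over data far too large for a single machine, which is precisely the capability these big data technologies contribute to data science work.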
Another source of confusion in defining data science stems from the absence of formalized academic programs in higher education. The lack of these programs exists in part due to challenges in launching novel programs that cross disciplines and the natural pace at which these programs are implemented within the academic environment. Although several institutions within higher education now recognize the importance of this emerging field and the need to develop programs that fulfill industry’s need for practitioners of data science, the result up to now has been to leave the task of defining the field to data scientists.

Data scientists currently occupy an enviable position as among the most coveted employees for twenty-first-century hiring, according to Davenport and Patil. They describe data scientists as professionals, usually of senior-level status, who are driven by curiosity and guided by creativity and training to prepare and process big data. Their efforts are geared toward uncovering findings that solve problems in both private and public sectors. As businesses and organizations accumulate greater volumes of data at faster speeds, Davenport and Patil predict the need for data scientists will continue on a very steep upward trajectory.

Opportunities in Data Science

Several sectors stand to gain from the explosion in big data and the acquisition of data scientists to analyze and extract insights from it. Chen, Chiang, and Storey note the opportunities inherent through data science for various areas. Beginning with e-commerce and the collection of market intelligence, Chen and colleagues focus on the development of product recommendation systems via e-commerce vendors such as Amazon that are comprised of consumer-generated data. These product recommendation systems allow for real-time access to consumer opinion and behavior data in record quantities. New data analytic techniques to
harness consumer opinions and sentiments have accompanied these systems, which can help businesses become better able to adjust and adapt quickly to the needs of consumers. Similarly, in the realm of e-government and politics, a multitude of data science opportunities exist for increasing the likelihood of achieving a range of desirable outcomes, including political campaign effectiveness, political participation among voters, and support for government transparency and accountability. Data science methods used to achieve these goals include opinion mining, social network analysis, and social media analytics.

Public safety and security represents another area that Chen and colleagues observe has prospects for implementing data science. Security remains an important issue for businesses and organizations in a post-September 11th, 2001 era. Data science offers unique opportunities to provide additional protections in the form of security informatics against terrorist threats to transportation and key pieces of infrastructure (including cyberspace). Security informatics uses a three-pronged approach coordinating organizational, technological, and policy-related efforts to develop data techniques designed to promote international and domestic security. The use of data science techniques such as crime data mining, criminal network analysis, and advanced multilingual social media analytics can be instrumental in preventing attacks as well as pinpointing the whereabouts of suspected terrorists.

Another sector flourishing with the rise of data science is science and technology (S&T). Chen and colleagues note that several areas within S&T, such as astrophysics, oceanography, and genomics, regularly collect data through sensor systems and instruments. The result has been an abundance of data in need of analysis, and the recognition that information sharing and data analytics must be supported. In response, the National Science Foundation (NSF) now requires the submission of a data management plan with every funded project. Data-sharing initiatives such as the 2012 NSF Big Data program are examples of government endeavors to advance big data analytics for science and technology research. The iPlant Collaborative represents another NSF-funded initiative that relies on cyberinfrastructure to instill skills related to computational techniques that address evolving complexities within the field of plant biology among emerging biologists.

The health field is also flush with opportunities for advances using data science. According to Chen and colleagues, opportunities for this field are rising in the form of massive amounts of health- and healthcare-related data. In addition to data collected from patients, data are also generated through advanced medical tools and instrumentation, as well as online communities formed around health-related topics and issues. Big data within the health field is primarily comprised of genomics-based data and payer-provider data. Genomics-based data encompasses genetic-related information such as DNA sequencing. Payer-provider data comprises information collected as part of encounters or exchanges between patients and the healthcare system, and includes electronic health records and patient feedback. Despite these opportunities, Miller notes that the application of data science techniques to health data remains behind that of other sectors, in part due to a lack of initiatives that leverage scalable analytical methods and computational platforms. In addition, research and ethical considerations surrounding privacy and protection of patients’ rights in the use of big data present some challenges to full utilization of existing health data.

Challenges to Data Science

Despite the enthusiasm for data science and the potential application of its techniques for solving important real-world problems, there are some challenges to full implementation of tools from this emerging field. Finding individuals with the right training and combination of skills to become data scientists represents one challenge. Davenport and Patil discuss the shortage of data scientists as a case in which demand has grossly
exceeded supply, resulting in intense competition among organizations to attract highly sought-after talent.

Concerns related to privacy represent another challenge to data science analysis of big data. Errors, mismanagement, or misuse of data (specifically data that by its nature is traceable to individuals) can lead to potential problems. One famous incident involved Target correctly predicting the pregnancy status of a teenaged girl before her father was aware of the situation, resulting in wide media coverage over issues equating big data with “Big Brother.” This perception of big data may cause individuals to become reluctant to provide their information, or to alter their behavior when they suspect they are being tracked, potentially undermining the integrity of the data collected.

Data science has been characterized as a field concerned with the study of data for the purpose of gleaning insight and knowledge. The primary goal of data science is to produce knowledge through the use of data. Although this definition provides clarity to the conceptualization of data science as a field, there persists confusion as to how data science differs from related concepts such as big data and data-driven decision making. The future of data science appears very bright, and as the amount and speed with which data is collected continues to increase, so too will the need for data scientists to harness the power of big data. The opportunities for using data science to maximize corporate and organizational performance cut across several sectors and areas.

Cross-References

▶ Big Data
▶ Data Mining
▶ Data Scientist
▶ Data-Driven Decision-Making

Further Readings

Chen, H. (2006). Intelligence and security informatics for international security: Information sharing and data mining. New York: Springer.
Chen, H. (2009). AI, e-government, and politics 2.0. IEEE Intelligent Systems, 24(5), 64–86.
Chen, H. (2011). Smart health and wellbeing. IEEE Intelligent Systems, 26(5), 78–79.
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–1188.
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90, 70–76.
Dhar, V., Jarke, M., & Laartz, J. (2014). Big data. Business & Information Systems Engineering, 6(5), 257–259.
Hill, K. (2012). How Target figured out a teen girl was pregnant before her father did. Forbes Magazine.
Loukides, M. (2011). What is data science? The future belongs to the companies and people that turn data into products. Sebastopol: O’Reilly Media.
Miller, K. (2012). Big data analytics in biomedical research. Biomedical Computation Review, 2, 14–21.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51–59.
Wactlar, H., Pavel, M., & Barkis, W. (2011). Can computer science save healthcare? IEEE Intelligent Systems, 26(5), 79–83.
Industrial and Commercial Bank of China

Jing Wang1 and Aram Sinnreich2
1 School of Communication and Information, Rutgers University, New Brunswick, NJ, USA
2 School of Communication, American University, Washington, DC, USA

The Industrial and Commercial Bank of China (ICBC)

The Industrial and Commercial Bank of China (ICBC) was the first state-owned commercial bank of the People’s Republic of China (PRC). It was founded on January 1st, 1984, and is headquartered in Beijing. In line with Deng Xiaoping’s economic reform policies launched in the late 1970s, the State Council (the chief administrative authority of China) decided to transfer all the financial businesses related to the industrial and commercial sectors from the central bank (People’s Bank of China) to ICBC (China Industrial Map Committee 2016). This decision, made in September 1983, is considered a landmark event in the evolution of China’s increasingly specialized banking system (Fu and Hefferman 2009). While the government retains control over ICBC, the bank began to take on public shareholders in October 2006. As of May 2016, ICBC was ranked as the world’s largest public company by the Forbes “Global 2000” (Forbes Ranking 2016). With its combination of state and private ownership, state governance, and commercial dealings, ICBC serves as a perfect case study to examine the transformation of China’s financial industry.

Big data collection and database construction are fundamental to ICBC’s management strategies. Beginning in the late 1990s, ICBC paid unprecedented attention to the implementation of information technology (IT) in its daily operations. Several branches adopted computerized input and internet communication of transactions, which had previously relied upon manual practices by bank tellers. Technological upgrades increased work efficiency and also helped to save labor costs. More importantly, compared to the labor-driven mechanism, the computerized system was more effective for retrieving data from historical records and analyzing these data for business development. At the same time, it became easier for the headquarters to control the local branches by checking digitalized information records. Realizing the benefits of these informatization and centralization tactics, the head company assigned its Department of Information Management to develop a centralized database collecting data from every single branch. This database is controlled and processed by ICBC headquarters but is also available for use by local branches with the permission of top executives.

In this context, “big data” refers to all the information collected from ICBC’s daily operations and can be divided into two general
categories: “structured data” (which is organized according to preexisting database categories) and “unstructured data” (which is not) (Davenport and Kim 2013). For example, a customer’s account information is typically structured data. The branch has to input the customer’s gender, age, occupation, etc., into the centralized network. This information then flows into the central database, which is designed specifically to accommodate it. Any data other than the structured data will be stored as raw data and preserved without processing. For example, the video recorded at a local branch’s business hall will be saved with only a date and a location label. Though “big data” in ICBC’s informational projects refers to both structured and unstructured data, the former is the core of ICBC’s big data strategy and is primarily used for data mining.
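The distinction can be illustrated with a minimal sketch; the field names and values below are hypothetical and do not reflect ICBC’s actual schema.

```python
# Structured record: fixed fields that map directly onto database columns
customer_record = {
    "customer_id": "310000001",
    "gender": "F",
    "age": 42,
    "occupation": "engineer",
}

# Unstructured record: raw content kept as-is, with only minimal metadata attached
video_record = {
    "captured_on": "2014-06-01",
    "branch": "Shanghai-Pudong",
    "payload": b"...raw video bytes...",  # stored without further processing
}

# Structured fields can be queried directly; the raw payload cannot
print(customer_record["occupation"])
print(len(video_record["payload"]), "bytes of unprocessed data")
```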
Since the late 1990s, ICBC has invested in big data development with increasingly large economic and human resources. On September 1st, 1999, ICBC inaugurated its “9991” project, which aimed at centralizing the data collected from ICBC branches nationwide. This project took more than 3 years to accomplish its goal. Beginning in 2002, all local branches were connected to ICBC’s Data Processing Center in Shanghai – a data warehouse with a 400 terabyte (TB) capacity. The center’s prestructured database enables ICBC headquarters to process and analyze data as soon as they are generated, regardless of the location. With its enhanced capability in storing and managing data, ICBC also networked and digitized its local branch operations. Tellers are able to input customer information (including their profiles and transaction records) into the national Data Center through their computers at local branches. These two-step strategies of centralization and digitization allow ICBC to converge local operations on one digital platform, which intensifies the headquarters’ control over national businesses. In 2001, ICBC launched another data center in Shenzhen, China, which is in charge of the big data collected from its overseas branches. ICBC’s database thus enables the headquarters’ control over business and daily operations globally and domestically.

By 2014, ICBC’s Data Center in Shanghai had collected more than 430 million individual customers’ profiles and more than 600,000 commercial business records. National transactions – exceeding 215 million on a daily basis – have all been documented at the Data Center. Data storage and processing on such a massive scale cannot be fulfilled without a powerful and reliable computer system. The technology infrastructure supporting ICBC’s big data strategy consists of three major elements: hardware, software, and cloud computing. Suppliers are both international and domestic, including IBM, Teradata, and Huawei.

Further, ICBC has also invested in data backup to secure its database infrastructure and data records. The Shanghai Data Center has a backup system in Beijing which can record data when the main server fails to work properly. The Beijing data center serves as a redundant system in case the Shanghai Data Center fails. It takes less than 30 seconds to switch between the two centers. To speed data backup and minimize data loss in significant disruptive events, ICBC undertakes multiple disaster recovery (DR) tests on a regular basis.

The accumulation and construction of big data is significant for ICBC’s daily operation in three respects. First of all, big data allows ICBC to develop its customers’ business potential through a so-called “single-view” approach. A customer’s business data collected from one of ICBC’s 35 departments are available to all the other departments. By mining the shared database, ICBC headquarters is able to evaluate both a customer’s comprehensive value and the overall quality of all existing customers. Cross-departmental business has also been propelled (e.g., the Credit Card Department may share business opportunities with the Savings Department). Second, the ICBC marketing department has been using big data for email-based marketing (EBM). Based on the data collected from branches, the Marketing and Business Development Department is able to locate target customers and follow up with customized marketing/advertising information via customized email communications. This data-driven marketing approach is increasingly popular among financial institutions in China. Third, customer
management systems rely directly on big data. All customers have been segmented into six levels, ranging from “one star” to “seven stars” (one star and two stars fall into a single segment), with the levels indicating the customers’ savings or investment levels at ICBC. “Seven star” clients have the highest level of credit and enjoy the best benefits provided by ICBC.
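A minimal sketch of how such a tiering rule might look is shown below; the thresholds and labels are hypothetical and are not ICBC’s actual criteria — they only illustrate how seven star labels can collapse into six segments when the lowest two share one tier.

```python
def star_level(balance):
    """Map a (hypothetical) account balance to a star label."""
    thresholds = [  # invented cutoffs for illustration only
        (5_000_000, "seven stars"),
        (1_000_000, "six stars"),
        (200_000, "five stars"),
        (50_000, "four stars"),
        (10_000, "three stars"),
        (1_000, "two stars"),
    ]
    for cutoff, label in thresholds:
        if balance >= cutoff:
            return label
    return "one star"

def segment(label):
    """One and two stars share a single segment, yielding six segments overall."""
    return "entry segment" if label in ("one star", "two stars") else label

for balance in (500, 8_000, 6_000_000):
    label = star_level(balance)
    print(balance, "->", label, "/", segment(label))
```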
Big data has influenced ICBC’s decision-making on multiple levels. For local branches, market insights are available at a lower cost. Consumer data generated and collected at local branches have been stored on a single platform provided and managed by the national data center. For example, a branch in an economically developing area may predict demand for financial products by checking the purchase data from branches in more developed areas. The branch could also develop greater insights regarding the local consumer market by examining data from multiple branches in the geographic area. For ICBC headquarters, big data fuels a dashboard through which it monitors ICBC’s overall business and is alerted to potential risks. Previously, individual departments used to manage their financial risk through their own balance sheets. This approach was potentially misleading and even dangerous for ICBC’s overall risk profile. A given branch providing many loans and mortgages may be considered to be performing well, but if a large number of branches overextended themselves, the emergent financial consequences might create a crisis for ICBC or even for the financial industry at large. Consequently, today, a decade after its data warehouse construction, ICBC considers big data indispensable in providing a holistic perspective, mitigating risk for its business and development strategies.

To date, ICBC has been a pioneer in big data construction among all the financial enterprises in China. It was the first bank to have all local data centralized in a single database. As the Director of ICBC’s Informational Management Department claimed in 2014, ICBC has the largest Enterprise Database (EDB) in China.

Parallel to its aggressive strategies in big data construction, the issue of privacy protection has always been a challenge in ICBC’s customer data collection and data mining. The governing policies primarily regulate the release of data from ICBC to other institutions, yet the protection of customer privacy within ICBC itself has rarely been addressed. According to the central bank’s Regulation on the Administration of the Credit Investigation Industry issued by the State Council in 2013, interbank sharing of customer information is forbidden. Further, a bank is not eligible to release customer information to its nonbanking subsidiaries. For example, the fund management company (ICBCCS) owned by ICBC is not allowed to access customer data collected from ICBC banks. The only situation in which ICBC could release customer data to a third party is when such information has been linked to an official inquiry by law enforcement. These policies prevent consumer information from leaking to other companies for business purposes. Yet the policies have also affirmed the fact that ICBC has full ownership of the customer information, thus giving ICBC greater power to use the data in its own interests.

Cross-References

▶ Data Driven Marketing
▶ Data Mining
▶ Data Warehouse
▶ Hardware
▶ Structured Data

Further Reading

China Industrial Map Editorial Committee, China Economic Monitoring & Analysis Center, & Xinhua Holdings. (2016). Industrial map of China’s financial sectors, Chapter 6. World Scientific Publishing.
Davenport, T., & Kim, J. (2013). Keeping up with the quants: Your guide to understanding and using analytics. Boston: Harvard Business School Publishing.
Fu, M., & Hefferman, S. (2009). The effects of reform on China’s bank structure and performance. Journal of Banking & Finance, 33(1), 39–52.
Forbes Ranking. (2016). The world’s biggest public company. Retrieved from https://www.forbes.com/companies/icbc/
Information Commissioner, United Kingdom

Ece Inan
Provost & Academic Dean, Girne American University Canterbury, Canterbury, UK

The Information Commissioner’s Office (ICO) is the UK’s independent public authority responsible for data protection mainly in England, Scotland, Wales, and Northern Ireland; the ICO also has the right to conduct some international duties. The ICO was first set up to uphold information rights by implementing the Data Protection Act 1984. The ICO declared its mission statement as promoting respect for the private lives of individuals and, in particular, for the privacy of their information by implementing the Data Protection Act 1984 and also influencing national and international thinking on privacy and personal information.

The ICO enforces and oversees data protection issues under the Freedom of Information Act 2000, the Environmental Information Regulations 2004, and the Privacy and Electronic Communications Regulations 2003, and it also has some limited responsibilities under the INSPIRE Regulations 2009, in England, Wales, Northern Ireland, and UK-wide public authorities based in Scotland. Scotland, on the other hand, has complementary INSPIRE Regulations and its own Scottish Environmental Information Regulations, regulated by the Scottish Information Commissioner and the Freedom of Information (Scotland) Act 2002.

The Information Commissioner is appointed by the Queen and reports directly to Parliament. The Commissioner is supported by a management board. The ICO’s headquarters is in Wilmslow, Cheshire; in addition, three regional offices in Northern Ireland, Scotland, and Wales aim to provide relevant services where legislation or administrative structure is different.

Under the Freedom of Information Act, Environmental Information Regulations, INSPIRE Regulations, and associated codes of practice, the functions of the ICO include noncriminal enforcement and assessments of good practice, providing information to individuals and organizations, taking appropriate action when the law on freedom of information is broken, considering complaints, disseminating publicity and encouraging sectoral codes of practice, and taking action to change the behavior of organizations and individuals that collect, use, and keep personal information. The main aim is to promote data privacy for individuals; to provide this service, the ICO has different tools such as criminal prosecution, noncriminal enforcement, and audit. The Information Commissioner also has the power to serve a monetary penalty notice on a data controller and promotes openness to the public.

The Data Protection Act 1984 introduced basic rules of registration for users of data and rights of
access to that data for the individuals to which it related. In order to comply with the Act, a data controller must comply with the following eight principles: “data should be processed fairly and lawfully; should be obtained only for specified and lawful purposes; should be adequate, relevant, and not excessive; should be accurate and, where necessary, kept up to date; should not be kept longer than is necessary for the purposes for which it is processed; should be processed in accordance with the rights of the data subject under the Act; appropriate technical and organisational measures should be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data; and should not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data.”

In 1995, the EU formally adopted the General Directive on Data Protection. In 1997, DUIS, the Data User Information System, was implemented, and the Register of Data Users was published on the internet. In 2000, the majority of the Data Protection Act came into force. The name of the office was changed from the Data Protection Registrar to the Data Protection Commissioner. Notification replaced the registration scheme established by the 1984 Act. Revised regulations implementing the provisions of the Data Protection Telecommunications Directive 97/66/EC came into effect. In January 2001, the office was given the added responsibility of the Freedom of Information Act and changed its name to the Information Commissioner’s Office. On 1 January 2005, the Freedom of Information Act 2000 was fully implemented. The Act was intended to improve the public’s understanding of how public authorities carry out their duties, why they make the decisions they do, and how they spend their money. Placing more information in the public domain would ensure greater transparency and trust and widen participation in policy debate. In October 2009, the ICO adopted a new mission statement: “The ICO’s mission is to uphold information rights in the public interest, promoting openness by public bodies and data privacy for individuals.” In 2011, the ICO launched the “data sharing code of practice” at the House of Commons and was enabled to impose monetary penalties of up to £500,000 for serious breaches of the Privacy and Electronic Communications Regulations.

Cross-References

▶ Data Protection
▶ Open Data

Further Readings

Data Protection Act 1984. http://www.out-law.com/page-413. Accessed Aug 2014.
Data Protection Act 1984. http://www.legislation.gov.uk/ukpga/1984/35/pdfs/ukpga_19840035_en.pdf?view=extent. Accessed Aug 2014.
Smartt, U. (2014). Media & entertainment law (2nd ed.). London: Routledge.
Interactive Data Visualization

Andreas Veglis
School of Journalism and Mass Communication, Aristotle University of Thessaloniki, Thessaloniki, Greece

Definition

Data visualization is a modern branch of descriptive statistics that involves the creation and study of the visual representation of data. It is the graphical display of abstract information for data analysis and communication purposes. Static data visualization offers only precomposed “views” of data. Interactive data visualization supports multiple static views in order to present a variety of perspectives on the same information. Important stories often lie “hidden” in the data, and interactive data visualization is the appropriate means to discover, understand, and present these stories. In interactive data visualization there is user input (a control of some aspect of the visual representation of information), and the changes made by the user must be incorporated into the visualization in a timely manner. Interactive visualizations are based on existing sets of data, and this subject is therefore strongly related to the issue of big data. Data visualization is the best method for transforming chunks of data into meaningful information (Ward et al. 2015).

History

Although people have been using tables in order to arrange data since the second century BC, the idea of representing quantitative information graphically first appeared in the seventeenth century. Rene Descartes, who was a French philosopher and mathematician, proposed a two-dimensional coordinate system for displaying values, consisting of a horizontal axis for one variable and a vertical axis for another, primarily as a graphical means of performing mathematical operations. In the eighteenth century William Playfair began to exploit the potential of graphics for the communication of quantitative data by developing many of the graphs that are commonly used today. He was the first to employ a line moving up and down as it progressed from left to right to show how values changed through time, and he invented the bar graph as well as the pie chart. In the 1960s Jacques Bertin proposed that visual perception operates according to rules that can be followed to express information visually in ways that represent it intuitively, clearly, accurately, and efficiently. John Tukey, a statistics professor, set the basis of exploratory data analysis by demonstrating the power of data visualization as a means for exploring and making sense of quantitative data (Few 2013).

In 1983, Edward Tufte published his groundbreaking book “The Visual Display of Quantitative Information,” in which he distinguished between the effective ways of displaying data
visually and the ways that most people are doing it without much success. Also around this time, William Cleveland extended and refined data visualization techniques for statisticians. At the end of the century, the term information visualization was proposed. In 1999, Stuart Card, Jock Mackinlay, and Ben Shneiderman published their book entitled “Readings in Information Visualization: Using Vision to Think.” Moving into the twenty-first century, Colin Ware published two books, “Information Visualization: Perception for Design” (2004) and “Visual Thinking for Design” (2008), in which he compiled, organized, and explained what we have learned from several scientific disciplines about visual thinking and cognition and applied that knowledge to data visualization (Few 2013).

Since the turn of the twenty-first century, data visualization has been popularized, and it has reached the masses through commercial software products that are distributed through the web. Many of these data visualization products promote superficially appealing esthetics and neglect useful and effective data exploration, sense-making, and communication. Nevertheless, there are a few serious contenders that offer products which help users fulfill data visualization’s potential in practical and powerful ways.

From Static to Interactive

Visualization can be categorized into static and interactive. In the case of static visualization, there is only one view of the data, yet on many occasions multiple views are needed in order to fully understand the available information. The number of dimensions of the data is also limited; thus, representing multidimensional datasets fairly in static images is almost impossible. Static visualization is ideal when alternate views are neither needed nor desired and is especially suited for a static medium (e.g., print) (Knaffic 2015). It is worth mentioning that infographics are also part of static visualization. Infographics (or information graphics) are graphic visual representations of data or knowledge, which are able to present complex information quickly and clearly. Infographics have been used for many years, and recently the availability of many easy-to-use free tools has made the creation of infographics available to every Internet user (Murray 2013).

Of course, static visualizations can also be published on the World Wide Web in order to be disseminated more easily and rapidly. Publishing on the web is considered to be the quickest way to reach a global audience. An online visualization is accessible by any Internet user that employs a recent web browser, regardless of the operating system (Windows, Mac, Linux, etc.) and device type (laptop, desktop, smartphone, tablet). But the true capabilities of the web are being exploited in the case of interactive data visualization.

Dynamic, interactive visualizations can empower people to explore data on their own. The basic functions of most interactive visualization tools were set back in 1996, when Ben Shneiderman proposed a “Visual Information-Seeking Mantra”: overview first, zoom and filter, and then details on demand. These functions allow the data to be accessible to every user, from the one who is just browsing or exploring the dataset to the one who approaches the visualization with a specific question in mind. This design pattern is the basic guide for every interactive visualization today.
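The mantra can be illustrated with a minimal, library-free sketch; the dataset and functions below are invented for illustration and are far simpler than what the interactive tools listed later in this entry provide.

```python
# Hypothetical records: (country, year, internet_users_percent)
records = [
    ("Greece", 2014, 63.2), ("Greece", 2015, 66.8),
    ("Kenya", 2014, 16.6), ("Kenya", 2015, 16.6),
    ("Chile", 2014, 61.1), ("Chile", 2015, 64.3),
]

# 1. Overview first: a coarse summary of the whole dataset
countries = sorted({country for country, _, _ in records})
print("Countries covered:", countries)

# 2. Zoom and filter: narrow the view to a subset the user selects
selected = [r for r in records if r[0] == "Greece"]

# 3. Details on demand: show the full underlying values only when asked
for country, year, share in selected:
    print(f"{country} {year}: {share}% of the population online")
```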
An interactive visualization should initially offer an overview of the data, but it must also include tools for discovering details. Thus it will be able to facilitate different audiences, from those who are new to the subject to those who are already deeply familiar with the data. Interactive visualization may also include animated transitions and well-crafted interfaces in order to engage the audience with the subject it covers.

User Control

In the case of interactive data visualization, users interact with the visualization by introducing a number of input types. Users can zoom in on a particular part of an existing visualization, pinpoint an area that interests them, select an option from an offered list, choose a path, and input numbers or text that customize the visualization. All the
previously mentioned input types can be accomplished by using keyboards, mice, touch screens, and other more specialized input devices. With the help of these input actions, users can control both the information being represented on the graph and the way that the information is being presented. In the second case, the visualization is usually part of a feedback loop. In most cases the actual information remains the same, but the representation of the information does change. One other important parameter in interactive data visualizations is the time it takes for the visualization to be updated after the user has introduced an input. A delay of more than 20 ms is noticeable by most people. The problem is that when large amounts of data are involved, this immediate rendering is impossible.

Interactive framerate is a term that is often used to measure the frequency with which a visualization system generates an image. In the case that the rapid response time required for interactive visualization is not feasible, there are several approaches that have been explored in order to provide people with rapid visual feedback based on their input. These approaches include (a brief sketch of the level-of-detail idea follows the list):

Parallel rendering: in this case the image is rendered simultaneously by two or more computers (or video cards). Different frames are rendered at the same time by different computers, and the results are transferred over the network for display on the user’s computer.
Progressive rendering: in this case a framerate is guaranteed by rendering some subset of the information to be presented. It also provides progressive improvements to the rendering once the visualization is no longer changing.
Level-of-detail (LOD) rendering: in this case simplified representations of information are rendered in order to achieve the desired frame rate while a user is providing input. When the user has finished manipulating the visualization, the full representation is used to generate a still image.
Frameless rendering: in this type of rendering, the visualization is not presented as a time series of images. Instead, a single image is generated where different regions are updated over time.
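The level-of-detail idea, for example, can be reduced to a few lines; the sketch below is a simplified illustration, not taken from any particular visualization library, in which only every k-th data point is drawn while the user is interacting and the full dataset is drawn once interaction stops.

```python
def points_to_render(points, interacting, stride=100):
    """Return a decimated subset during interaction, the full data otherwise."""
    if interacting:
        return points[::stride]  # simplified representation keeps the frame rate up
    return points               # full representation once the view is static

data = list(range(1_000_000))
print(len(points_to_render(data, interacting=True)))   # 10000 points while dragging
print(len(points_to_render(data, interacting=False)))  # 1000000 points when idle
```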
Types of Interactive Data Visualizations

Information, and more specifically statistical information, is abstract, since it describes things that are not physical. It can concern education, sales, diseases, and various other things. But everything can be displayed visually, if a way is found to give it a suitable form. The transformation of the abstract into physical representation can only succeed if we understand a bit about visual perception and cognition. In other words, in order to visualize data effectively, one must follow design principles that are derived from an understanding of human perception.

Heer, Bostock and Ogievetsky (2010) defined the types (and also their subcategories) of data visualization:

(i) Time series data (index charts, stacked graphs, small multiples, horizon graphs)
(ii) Statistical distributions (stem-and-leaf plots, Q-Q plots, scatter plot matrix (SPLOM), parallel coordinates)
(iii) Maps (flow maps, choropleth maps, graduated symbol maps, cartograms)
(iv) Hierarchies (node-link diagrams, adjacency diagrams, enclosure diagrams)
(v) Networks (force-directed layout, arc diagrams, matrix views)

Tools

There are a lot of tools that can be used for creating interactive data visualizations. All of them are either free or offer a free version alongside a paid version that includes more features. According to datavisualization.ch, the list of tools that most users employ includes: Arbor.js, CartoDB, Chroma.js, Circos, Cola.js, ColorBrewer, Cubism.js, Cytoscape, D3.js, Dance.js, Data.js, DataWrangler, Degrafa, Envision.js, Flare, GeoCommons, Gephi, Google Chart Tools, Google Fusion Tables, I Want
Hue, JavaScript InfoVis Toolkit, Kartograph, Leaflet, Many Eyes, MapBox, Miso, Modest Maps, Mr. Data Converter, Mr. Nester, NVD3.js, NodeBox, OpenRefine, Paper.js, Peity, Polymaps, Prefuse, Processing, Processing.js, Protovis, Quadrigram, R, Raphael, Raw, Recline.js, Rickshaw, SVG Crowbar, Sigma.js, Tableau Public, Tabula, Tangle, Timeline.js, Unfolding, Vega, Visage, and ZingCharts.

Conclusion

Data visualization is a significant discipline that is expected to become even more important as we, as a society, gradually move into the era of big data. In particular, interactive data visualization allows data analysts to turn complex data into meaningful information that can be searched, explored, and understood by end users.

Cross-References

▶ Business Intelligence
▶ Tableau Software
▶ Visualization

Further Readings

Few, S. (2013). Data visualization for human perception. In M. Soegaard & R. F. Dam (Eds.), The encyclopedia of human-computer interaction (2nd ed.). Aarhus: The Interaction Design Foundation. http://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception. Accessed 12 July 2016.
Heer, J., Bostock, M., & Ogievetsky, V. (2010). A tour through the visualization zoo. Communications of the ACM, 53(6), 59–67.
Knaffic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. Hoboken: Wiley.
Murray, S. (2013). Interactive data visualization for the web. Sebastopol: O’Reilly Media.
Ward, M., Grinstein, G., & Keim, D. (2015). Interactive data visualization: Foundations, techniques, and applications. Boca Raton: CRC Press.
International Development

Jon Schmid
Georgia Institute of Technology, Atlanta, GA, USA

Big data can affect international development in two primary ways. First, big data can enhance our understanding of underdevelopment by expanding the evidence base available to researchers, donors, and governments. Second, big data-enabled applications can affect international development directly by facilitating economic behavior, monitoring local conditions, and improving governance. The following sections will look first at the role of big data in increasing our understanding of international development and then look at examples where big data has been used to improve the lives of the world’s poor.

Big Data in International Development Research

Data quality and data availability tend to be low in developing countries. In Kenya, for example, poverty data was last collected in 2005, and income surveys in other parts of sub-Saharan Africa often take up to 3 years to be tabulated. When national income-accounting methodologies were updated in Ghana (2010) and Nigeria (2014), GDP calculations had to be revised upward by 63% and 89%, respectively. Poor-quality or stale data prevent national policy makers and donors from making informed policy decisions.

Big data analytics has the potential to ameliorate this problem by providing alternative methods for collecting data. For example, big data applications may provide a novel means by which national economic statistics are calculated. The Billion Prices Project – started by researchers at the Massachusetts Institute of Technology – uses daily price data from hundreds of online retailers to calculate changes in price levels. In countries where inflation data is unavailable – or in cases such as Argentina where official data is unreliable – these data offer a way of calculating national statistics that does not require a high-quality national statistics agency.
big data has been used to improve the lives of Data from mobile devices is a particularly rich
the world’s poor. source of data in the developing world. Roughly
20% of mobile subscriptions are held by individ-
uals that earn less than 5 $ a day. Besides emitting
Big Data in International Development geospatial, call, and SMS data, mobile devices are
Research increasingly being used in the developing world
to perform a broad array of economic functions
Data quality and data availability tend to be low in such as banking and making purchases. In many
developing countries. In Kenya, for example, African countries (nine in 2014), more people
poverty data was last collected in 2005, and have online mobile money accounts than have
income surveys in other parts of sub-Saharan traditional bank accounts. Mobile money services
Africa often take up to 3 years to be tabulated. such M-Pesa and MTN Money produce trace data
When national income-accounting methodologies and thus offer intriguing possibilities for increas-
were updated in Ghana (2010) and Nigeria ing understanding of spending and saving
# Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_117-1
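As a rough illustration of how scraped online prices can be turned into a measure of price-level change, the minimal Python sketch below computes a Jevons-style index (the geometric mean of each item's price relative between two days). The prices are invented, and the Billion Prices Project's actual methodology is considerably more elaborate.

```python
from math import exp, log

# Hypothetical prices for the same basket of items scraped on two days.
prices_day1 = {"rice_1kg": 1.20, "cooking_oil_1l": 2.50, "soap_bar": 0.80}
prices_day2 = {"rice_1kg": 1.32, "cooking_oil_1l": 2.55, "soap_bar": 0.80}

def jevons_index(base, current):
    """Geometric mean of price relatives for items present on both days."""
    items = base.keys() & current.keys()
    log_relatives = [log(current[i] / base[i]) for i in items]
    return exp(sum(log_relatives) / len(log_relatives))

index = jevons_index(prices_day1, prices_day2)
print(f"Day-over-day price level change: {(index - 1) * 100:.2f}%")
```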
Data from mobile devices is a particularly rich source of data in the developing world. Roughly 20% of mobile subscriptions are held by individuals who earn less than $5 a day. Besides emitting geospatial, call, and SMS data, mobile devices are increasingly being used in the developing world to perform a broad array of economic functions such as banking and making purchases. In many African countries (nine in 2014), more people have online mobile money accounts than have traditional bank accounts. Mobile money services such as M-Pesa and MTN Money produce trace data and thus offer intriguing possibilities for increasing understanding of spending and saving behavior in the developing world. As the functionality provided by mobile money services extends into loans, money transfers from abroad, cash withdrawal, and the purchase of goods, the data yielded by these platforms will become even richer.

The data produced by mobile devices has already been used to glean insights into complex economic or social systems in the developing world. In many cases, the insights into local economic conditions that result from the analysis of mobile device data can be produced more quickly than national statistics. For example, in Indonesia the UN Global Pulse monitored tweets about the price of rice and found them to be highly correlated with national spikes in food prices. The same study found that tweets could be used to identify trends in other types of economic behavior such as borrowing. Similarly, research by Nathan Eagle has shown that reductions in additional airtime purchases are associated with falls in income. Researchers Han Wang and Liam Kilmartin examined Call Detail Record (CDR) data generated from mobile devices in Uganda and identified differences in the way that wealthy and poor individuals respond to price discounts. The researchers also used the data to identify centers of economic activity within Uganda.

Besides providing insight into how individuals respond to price changes, big data analytics allows researchers to explore the complex ways in which the economic lives of the poor are organized. Researchers at Harvard's Engineering Social Systems lab have used mobile phone data to explore the behavior of inhabitants of slums in Kenya. In particular, the authors tested theories of rural-to-urban migration against spatial data emitted by mobile devices. Some of the same researchers have used mobile data to examine the role of social networks on economic development and found that diversity in individuals' network relationships is associated with greater economic development. Such research supports the contention that insular networks – i.e., highly clustered networks with few ties to outside nodes – may limit the economic opportunities that are available to members.

Big data analytics are also being used to enhance understanding of international development assistance. In 2009, the College of William and Mary, Brigham Young University, and Development Gateway created AidData (aiddata.org), a website that aggregates data on development projects to facilitate project coordination and provide researchers with a centralized source for development data. AidData also maps development projects geospatially and links donor-funded projects to feedback from the project's beneficiaries.

Big Data in Practice

Besides expanding the evidence base available to international development scholars and practitioners, large data sets and big data analytic techniques have played a direct role in promoting international development. Here the term "development" is considered in its broad sense as referring not to a mere increase in income, but to improvements in variables such as health and governance.

The impact of infectious diseases on developing countries can be devastating. Besides the obvious humanitarian toll of outbreaks, infectious diseases prevent the accumulation of human capital and strain local resources. Thus there is great potential for big data-enabled applications to enhance epidemiological understanding, mitigate transmission, and allow for geographically targeted relief. Indeed, it is in the tracking of health outcomes that the utility of big data analytics in the developing world has been most obvious. For example, Amy Wesolowski and colleagues used mobile phone data from 15 million individuals in Kenya to understand the relationship between human movement and malaria transmission. Similarly, after noting in 2008 that search trends could be used to track flu outbreaks, researchers at Google.org have used data on searches for symptoms to predict outbreaks of the dengue virus in Brazil, Indonesia, and India. In Haiti, researchers from Columbia University and the Karolinska Institute used SIM card data to track the dispersal of people following a cholera outbreak. Finally, the Centers for Disease Control
and Prevention used mobile phone data to direct resources to appropriate areas during the 2014 Ebola outbreak.

Big data applications may also prove useful in improving and monitoring aspects of governance in developing countries. In Kenya, India, and Pakistan, witnesses of public corruption can report the incident online or via text message to a service called "I Paid A Bribe." The provincial government in Punjab, Pakistan, has created a citizens' feedback model, whereby citizens are solicited for feedback regarding the quality of government services they received via automated calls and texts. In an effort to discourage absenteeism in India and Pakistan, certain government officials are provided with cell phones and required to text geocoded pictures of themselves at jobsites. These mobile government initiatives have created a rich source of data that can be used to improve government service delivery, reduce corruption, and more efficiently allocate resources.

Applications that exploit data from social media have also proved useful in monitoring elections in sub-Saharan Africa. For example, Aggie, a social media tracking software designed to monitor elections, has been used to monitor elections in Liberia (2011), Ghana (2012), Kenya (2013), and Nigeria (2011 and 2014). The Aggie system is first fed with a list of predetermined keywords, which are established by local subject matter experts. The software then crawls social media feeds – Twitter, Facebook, Google+, Ushahidi, and RSS – and generates real-time trend visualizations based on keyword matches. The reports are monitored by a local Social Media Tracking Center, which identifies instances of violence or election irregularities. Flagged incidents are passed on to members of the election commission, police, or other relevant stakeholders.
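As a rough illustration of the keyword-matching step such a system performs, the Python sketch below (with invented posts and keywords, not the actual Aggie code) counts, per hour, how many posts from a social media feed mention any monitored term; in a full system these counts would feed a real-time trend chart.

```python
from collections import Counter
from datetime import datetime

# Keywords chosen in advance by local subject matter experts (invented here).
keywords = {"ballot", "violence", "queue", "intimidation"}

# A few invented posts standing in for a crawled social media feed.
posts = [
    ("2014-03-29 09:14", "Long queue at the polling station but calm so far"),
    ("2014-03-29 09:40", "Reports of violence near the market polling unit"),
    ("2014-03-29 10:05", "Ballot boxes arrived late in our ward"),
]

def hourly_keyword_counts(posts, keywords):
    """Count posts per hour that mention at least one monitored keyword."""
    counts = Counter()
    for timestamp, text in posts:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").strftime("%Y-%m-%d %H:00")
        words = {w.strip(".,!?").lower() for w in text.split()}
        if words & keywords:
            counts[hour] += 1
    return counts

for hour, n in sorted(hourly_keyword_counts(posts, keywords).items()):
    print(hour, n)
```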
The history of international economic development initiatives is fraught with would-be panaceas that failed to deliver. White elephants – large-scale capital investment projects for which the social surplus is negative – are strewn across poor countries as reminders of the preferred development strategies of the past. While more recent approaches to reducing poverty that have focused on improving institutions and governance within poor countries may produce positive development effects, the history of development policy suggests that optimism should be tempered. The same caution holds in regard to the potential role of big data in international economic development. Martin Hilbert's 2016 systematic review article rigorously enumerates both the causes for optimism and reasons for concern. While big data may assist in understanding the nature of poverty or lead to direct improvements in health or governance outcomes, the availability and ability to process large data sets are not a panacea.

Cross-References

▶ Economics
▶ Epidemiology
▶ U.S. Agency International Development
▶ United Nations Global Pulse (Development)
▶ World Bank

Further Reading

Hilbert, M. (2016). Big data for development: A review of promises and challenges. Development Policy Review, 34(1), 135–174.
Wang, H., & Kilmartin, L. (2014). Comparing rural and urban social and economic behavior in Uganda: Insights from mobile voice service usage. Journal of Urban Technology, 21(2), 61–89.
Wesolowski, A., et al. (2012). Quantifying the impact of human mobility on malaria. Science, 338(6104), 267–270.
World Economic Forum. (2012). Big data, big impact: New possibilities for international development. Cologny/Geneva: World Economic Forum. http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf
social surplus is negative – are strewn across
I

International Labor Organization

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

Every day people across the world in both developed and developing economies are creating an ever-growing ocean of digital data. This "big data" represents a new resource for international organizations with the potential to revolutionize the way policies, programs, and projects are generated. The International Labour Organization (ILO) is no exception to this and has begun to discuss and engage with the potential uses of big data to contribute to its agenda.

Focus

The ILO, founded in 1919 in the wake of the First World War, became the first specialized agency of the United Nations. It focuses on labor issues including child labor, collective bargaining, corporate social responsibility, disability, domestic workers, forced labor, gender equality, informal economy, international labor migration, international labor standards, labor inspection, microfinance, minimum wages, rural development, and youth employment. By 2013 the ILO had 185 members (of the 193 member states of the United Nations). Among its multifarious activities, it is widely known for its creation of Conventions and Recommendations (189 and 203, respectively, by 2014) related to labor market standards.

Where Conventions are ratified, come into force, and are therefore legally binding, they create a legal obligation for ratifying nations. For many Conventions, even in countries where they are not ratified, they are often adopted and interpreted as the international labor standard. There have been many important milestones created by the ILO to shape the landscape to encourage the promotion of improved working lives globally, although a significant milestone is often considered to be the 1998 Declaration on Fundamental Principles and Rights at Work, which had four key components: the right of workers to associate freely and collectively, the end of forced and compulsory labor, the end of child labor, and the end of unfair discrimination among workers. ILO members have an obligation to work toward these objectives and respect the principles which are embedded in the Conventions.

Decent Work Agenda

The ILO believes that work plays a crucial role in the well-being of workers and families and therefore the broader social and economic development of individuals, communities, and societies. While the ILO works on many issues related to
employment, their key agenda which has dominated activities in recent decades is "decent work."

"Decent work" refers to an aspiration for people to have work that is productive, provides a fair income with security and social protection, safeguards basic rights, and offers equal opportunities and treatment, opportunities for personal development, and a voice in society. "Decent work" is central to efforts to reduce poverty and is a path to achieving equitable, inclusive, and sustainable development; ultimately it is seen as a feature which underpins peace and security in communities and societies (ILO 2014a).

The "decent work" concept was formulated by the ILO in order to identify the key priorities to focus their efforts. "Decent work" is designed to reflect priorities on the social, economic, and political agenda of countries as well as the international system. In a relatively short time, this concept has formed an international consensus among government, employers, workers, and civil society around equitable globalization, a path to reduce poverty as well as inclusive and sustainable development. The overall goal of "decent work" is to instigate positive change in/for people at all spatial scales.

Putting the decent work agenda into practice is achieved through the implementation of the ILO's four strategic objectives, with gender equality as a crosscutting objective:

1. Creating jobs to foster an economy that generates opportunities for investment, entrepreneurship, skills development, job creation, and sustainable livelihoods.
2. Guaranteeing rights at work in order to obtain recognition for work achieved as well as respect for the rights of all workers.
3. Extending social protection to promote both inclusion and productivity of all workers. To be enacted by ensuring both women and men experience safe working conditions, allowing free time, taking into account family and social values and situations, and providing compensation where necessary in the case of lost or reduced income.
4. Promoting social dialogue by involving both workers and employers in the organizations in order to increase productivity, avoid disputes and conflicts at work, and more broadly build cohesive societies.

ILO Data

The ILO produces research on important labor market trends and issues to inform constituents, policy makers, and the public about the realities of employment in today's modern globalized economy and the issues facing workers and employers in countries at all development stages. In order to do so, it draws on data from a wide variety of sources.

The ILO is a major provider of statistics as these are seen as important tools to monitor progress toward labor standards. In addition to the maintenance of key databases (ILO 2014b) such as LABOURSTA, it also publishes compilations of labor statistics, such as the Key Indicators of Labour Markets (KILM), which is a comprehensive database of country-level data for key indicators in the labor market and is used as a research tool for labor market information. Other databases include ILOSTAT, a series of databases with labor-related data; NATLEX, which includes legislation related to labor markets, social security, and human rights; and NORMLEX, which brings together ILO labor standards and national labor and security laws (ILO 2014c). The ILO database provides a range of datasets with annual labor market statistics including over 100 indicators worldwide, including annual indicators as well as short-term indicators, estimates and projections of total population, and labor force participation rates.

Statistics are vital for the development and evaluation of labor policies, as well as more broadly to assess progress toward key ILO objectives. The ILO supports member states in the collection and dissemination of reliable and recent data on labor markets. While the data produced by the ILO are both wide ranging and widely used, they are not considered by most to be "big data," and this has been recognized.
ILO, Big Data, and the Gender Data Gap

In October 2014, a joint ILO-Data2X roundtable event held in Switzerland identified the importance of developing innovative approaches to the better use of technology to include big data, in particular where it can be sourced and where innovations can be made in survey technology. This event, which brought together representatives from national statistics offices, key international and regional organizations, and nongovernmental organizations, was organized to discuss where there were gender data gaps, particularly focusing on informal and unpaid work as well as agriculture. These discussions were sparked by wider UN discussions about the data revolution and the importance of development data in the post-2015 development agenda. It is recognized that big data (including administrative data) can be used to strengthen existing collection of gender statistics, but there need to be more efforts to find new and innovative ways to work with new data sources to meet a growing demand for more up-to-date (and frequently updating) data on gender and employment (United Nations, 2013). The fundamental goal of the discussion was to improve gender data collection which can then be used to guide policy and inform the post-2015 development agenda, and here big data is acknowledged as a key component. At this meeting, four types of gender data gaps were identified: coverage across countries and/or regular country production, international standards to allow comparability, complexity, and granularity (sizeable and detailed datasets allowing disaggregation by demographic and other characteristics). Furthermore, a series of big data types that have the potential to increase collection of gender data were identified:

• Mobile phone records: for example, mobile phone use and recharge patterns could be used as indicators of women's socioeconomic welfare or mobility patterns.
• Financial patterns: exploring engagement with financial systems.
• Online activity: for example, Google searches or Twitter activity which might be used to gain insights into women's maternal health, cultural attitudes, or political engagement.
• Sensing technologies: for example, satellite data which might be used to examine agricultural productivity, access to healthcare, and education services.
• Crowdsourcing: for example, disseminating apps to gain views about different elements of societies.

A primary objective of this meeting was to highlight that existing gender data gaps are large, often reflect traditional societal norms, and that no data (or poor data) can have significant development consequences. Big data here has the potential to transform the understanding of women's participation in work and communities. Crucially, it was posited that while better data is needed to monitor the status of women in informal employment conditions, it is not necessarily important to focus on trying to extract more data but to make an impact with the data that is available to try and improve wider social, economic, and environmental conditions.

ILO, the UN, and Big Data

The aforementioned meeting represented one example of where the ILO has engaged with other stakeholders to not only acknowledge the importance of big data but begin to consider potential options for its use with respect to their agendas. However, as a UN agency, they partake in wider discussion with the UN regarding the importance of big data, as was seen in the 45th session of the UN Statistical Commission in March 2014 where the report of the secretary general on "big data and the modernization of statistical systems" was discussed (United Nations, 2014). This report is significant as it touches upon important issues, opportunities, and challenges that are relevant for the ILO with respect to the use of big data.

The report makes reference to the UN "Global Pulse," which is an initiative on big data established in 2009 which included a vision of a
future where big data was utilized safely and responsibly. Its mission was to accelerate the adoption of big data innovation. Partnering with UN agencies such as the ILO, governments, academics, and the private sector, it sought to achieve a critical mass of implemented innovation and strengthen the adoption of big data as a tool to foster the transformation of societies.

There is a recognition that the national statistical system is essentially now subject to competition from other actors producing data outside of their system, and there is a need for data collection of national statistics to adjust in order to make use of the mountain of data now being produced almost continuously (and often automatically). To make use of the big data, a shift may be required from the traditional survey-oriented collection of data to a more secondary data-focused orientation from data sources that are high in volume, velocity, and variety. Increasing demand from policy makers for real-time evidence in combination with declining response rates to national household and business surveys means that organizations like the ILO will have to acknowledge the need to make this shift. There are a number of different sources of big data which may be potentially useful for the ILO: sources from administration, e.g., bank records; commercial and transaction data, e.g., credit card transactions; sensor data, e.g., satellite images or road sensors; tracking devices, e.g., mobile phone data; behavioral data, e.g., online searches; and opinion data, e.g., social media. Official statistics like those presented in ILO databases often rely on administrative data, and these are traditionally produced in a highly structured manner which can in turn limit their use. If administrative data was collected in real time, or on a more frequent basis, then it has the potential to become "big data."

There are, however, a number of challenges related to the use of big data which face the UN, its agencies, and national statistical services alike:

• Legislative: in many countries, there will not be legislation in place to enable the access to, and use of, big data particularly from the private sector.
• Privacy: a dialogue will be required in order to gain public trust around the use of data.
• Financial: related to costs for access data.
• Management: policies and directives to ensure management and protection of data.
• Methodological: data quality, representativeness, and volatility are all issues which present potential barriers to the widespread use of big data.
• Technological: the nature of big data, particularly the volume in which it is often created, meaning that some countries would need enhanced information technology.

An assessment of the use of big data for official statistics carried out by the UN indicates that there are good examples where it has been used, for example, using transactional, tracking, and sensor data. However, in many cases, a key implication is that statistical systems and IT infrastructures need to be enhanced in order to be able to support the storage and processing of big data as it accumulates over time.

Modern society has witnessed an explosion of the quantity and diversity of real-time information known more commonly as big data, presenting a potential paradigm shift in the way official statistics are collected and analyzed. In the context of increased demand for statistics information, organizations recognize that big data has the potential to generate new statistical products in a timelier manner than traditional official statistical sources. The ILO, alongside a broader UN agenda to acknowledge the data revolution, recognizes the potential for future uses of big data at the global level, although there is a need for further investigation of the data sources, challenges and areas of use of big data, and its potential contribution to efforts working toward the "better work" agenda.

Cross-References

▶ United Nations
▶ United Nations Educational, Scientific and Cultural Organization (UNESCO)
▶ United Nations Global Pulse
Further Readings

International Labour Organization. (2014a). Key indicators of the labour market. International Labour Organization. http://www.ilo.org/empelm/what/WCMS_114240/lang–en/index.htm. Accessed 10 Sep 2014.
International Labour Organization. (2014b). ILO databases. International Labour Organization. http://www.ilo.org/public/english/support/lib/resource/ilodatabases.htm. Accessed 1 Oct 2014.
International Labour Organization. (2014c). ILOSTAT database. International Labour Organization. http://www.ilo.org/ilostat/faces/home/statisticaldata?_afrLoop=342428603909745. Accessed 10 Sep 2014.
United Nations. (2013). Big data and modernization of statistical systems. Report of the Secretary-General. United Nations Economic and Social Council. http://unstats.un.org/unsd/statcom/doc14/2014-11-BigData-E.pdf. Accessed 1 Dec 2014.
United Nations. (2014). UN global pulse. United Nations. http://www.unglobalpulse.org/. Accessed 10 Sep 2014.
Internet Association, The

David Cristian Morar
Schar School of Policy and Government, George Mason University, Fairfax, VA, USA

Synonyms

Internet Lobby; Internet Trade Association; Internet Trade Organization

Introduction

The Internet Association is a trade organization that represents a significant number of the world's largest Internet companies, all of whom are based, founded, or run in the United States of America. While issues such as net neutrality or copyright reform are at the forefront of their work, the Internet Association is also active in expressing the voice of the Internet industry in matters of Big Data. On this topic, it urges a commitment to the status quo in privacy regulation and increased government R&D for innovative ways of enhancing the benefits of Big Data, while also calling for dispelling the belief that the web is the only sector that collects large data sets, as well as for a more thorough review of government surveillance. These proposals are underlined by the perspective that the government has a responsibility to protect the economic interests of the US industries, internationally, and a responsibility to protect the privacy of the American citizens, nationally.

Main Text

Launched in 2012 with 14 members and designed as the unified voice in Washington D.C. for the industry, the Internet Association now boasts 41 members and is dedicated, according to their statements, to protecting the future of the free and innovative Internet. Among these 41 members, some of the more notable include Amazon, AOL, Groupon, Google, Facebook, Twitter, eBay, Yelp, IAC, Uber Technologies Inc, Expedia, and Netflix. As part of both their purpose and mission statements, the Internet Association believes that the decentralized architecture of the Internet, which it vows to protect, is what led it to become one of the world's most important engines for growth, economically and otherwise. The Association's representational role, also referred to as lobbying, is portrayed as not simply an annex of Silicon Valley but as a voice of its community of users as well. The policy areas it promotes are explained with a heavy emphasis on the user and the benefits and rights the user gains.

The President and CEO, Michael Beckerman, a former congressional staffer, is the public face of the Internet Association, and he is usually the one that signs statements or comments on important
issues on behalf of the members. Beyond their "business crawl" efforts promoting local businesses and their connection to, and success yielding from, the Internet economy, the Association is active in many other areas. These areas include Internet freedom (nationally and worldwide) and patent reform, among others, with their most important concern being net neutrality. As Big Data is associated with the Internet, and the industry is interested in being an active stakeholder in related policy, the Association has taken several opportunities to make its opinions heard on the matter. These opinions can also be traced throughout the policies it seeks to propose in other major connected areas.

Most notably, after the White House Office of Science and Technology Policy's (OSTP) 2014 request for information, as part of their 90-day review on the topic of Big Data, the Internet Association released a set of comments that crystallize their views on the matter. Prior communications have also brought up certain aspects related to Big Data; however, the comments made to the OSTP have been the most comprehensive and detailed public statement to date by the industry on issues of Big Data, privacy, and government surveillance.

In matters of privacy regulation, the Association believes that the current framework is both robust and effective in relation to commercial entities. In their view, reform is mostly necessary in the area of government surveillance, by adopting an update to the Electronic Communications Privacy Act (which would give service providers a legal basis in denying government requests for data that are not accompanied by a warrant), prohibiting bulk governmental collection of metadata from communications, and clearly bounding surveillance efforts by law.

The Internet Association subscribes to the notion that the current regime for private sector privacy regulation is not only sufficient but also perfectly equipped to deal with potential concerns brought about by Big Data issues. The status quo is, in the acceptation of the Internet industry, a flexible and multilayered framework, designed for businesses that embrace privacy protective practices. The existing framework, beyond a sometimes overlapping federal-state duality of levels, also includes laws in place through the Federal Trade Commission that guard against unfair practices and that target and swiftly punish the bad actors that perpetrate the worst harms. This allows companies to harness the potential of Big Data within a privacy-aware context that does not allow or tolerate gross misconduct. In fact, the Association even cites the White House's 2012 laudatory comments on the existing privacy regimes, to strengthen its argument for regulatory status quo, beyond simply an industry's desire to be left to its own devices to innovate without restrictions.

The proposed solutions by the industry would center on private governance mechanisms that include a variety of stakeholders in the decision-making process and are not, in fact, a product of the legislative system. Such actions have been taken before and, according to the views of the Association, are successful in the general sector of privacy, and they allow industry and other actors that are involved in the specific areas to have a seat at the table beyond the traditional lobbying route.

One part that needs further action, according to the views of the Association, is educating the public on the entire spectrum of activities that lead to the collection and analysis of large data sets. With websites as the focus of most privacy-related research, the industry advocates a more consumer-oriented approach that would permeate the whole range of practices from understudied sectors to the Internet, centered around increasing user knowledge on how their data is being handled. This would allow the user to understand the entire processes that go on beyond the visible interfaces, without putting any more pressure on the industries to change their actions.

While the Internet Association considers that commercial privacy regulation should be left virtually intact, substantial government funding for research and development should be funneled into unlocking future and better societal benefits of Big Data. These funds, administered through the National Science Foundation and other instruments, would be directed toward a deeper understanding of the complexities of Big Data, including accountability mechanisms,
de-identification, and public release. Prioritizing such government-funded research over new regulation, the industry believes that current societal benefits from commercial Big Data usage (ranging from genome research to better spam filters) would multiply in number and effect.

The Association deems that the innovation economy would suffer from any new regulatory approaches that are designed to restrict the free flow of data. In their view, not only would the companies not be able to continue with their commercial activities, which would hurt the sector, and the country, but the beneficial aspects of Big Data would suffer as well. Coupled with the revelations about the data collection projects of the National Security Agency, this would significantly impact the standing of the United States internationally, as important international agreements, such as the Transatlantic Trade and Investment Partnership with the EU, are in jeopardy, says the industry.

Conclusion

The Internet Association thus sees privacy as a significant concern with regard to Big Data. However, it strongly emphasizes governmental missteps in data surveillance, and offers an unequivocal condemnation of such actions, while lauding and extolling the virtues of the regulatory framework in place to deal with the commercial aspect. The Association believes that current nongovernmental policies, such as agreements between users and service providers, or industry self-regulation, are also adequate, and promoting such a user-facing approach to a majority of privacy issues would continue to be useful. Governmental involvement is still desired by the industry, primarily through funding for what might be called basic research into the Big Data territory, as the benefits of this work would be spread around not just between the companies involved but also with the government, as best practices would necessarily involve governmental institutions as well.

Cross-References

▶ Amazon
▶ De-identification, Re-identification
▶ Genome Data
▶ Google
▶ National Security Agency
▶ Netflix
▶ Office of Science and Technology Policy: White House Report (2014 Report)
▶ Twitter

Further Readings

The Internet Association. Comments of the Internet Association in response to the White House Office of Science and Technology Policy's Government 'Big Data' Request for Information. http://internetassociation.org/wp-content/uploads/2014/03/3_31_-2014_The-Internet-Association-Comments-Regarding-White-House-OSTP-Request-for-Information-on-Big-Data.pdf. Accessed July 2016.
The Internet Association. Comments on 'Big Data' to the Department of Commerce. http://internetassociation.org/080614comments/. Accessed July 2016.
The Internet Association. Policies. https://internetassociation.org/policy-platform/protecting-internet-freedom/. Accessed July 2016.
The Internet Association. Privacy. http://internetassociation.org/policies/privacy/. Accessed July 2016.
The Internet Association. Statement on the White House Big Data Report. http://internetassociation.org/050114bigdata/. Accessed July 2016.
The Internet Association. The Internet Association's Press Kit. http://internetassociation.org/the-internet-associations-press-kit/. Accessed July 2016.
The Internet Association. The Internet Association Statement on White House Big Data Filed Comments. http://internetassociation.org/bigdatafilingstatement/. Accessed July 2016.
Italy

Chiara Valentini
Department of Management, Aarhus University, School of Business and Social Sciences, Aarhus, Denmark

Introduction

Italy is a Parliamentary republic in southern Europe. It has a population of about 60 million people, of which 86.7% are Internet users (Internet World Stat 2017). Public perception of handling big data is generally very liberal, and the phenomenon has been associated with more transparency and digitalized economic and social systems. The collection and processing of personal data have been increasingly used to counter tax evasion, which is one of the major problems of the Italian economy. The Italian Revenue Agency is using data collected through different private and public data collectors to cross-check tax declarations (DPA 2014a).

According to the results of a study on Italian companies' perception of big data conducted by researchers at the Big Data Analytics & Business Intelligence Observatory of Milan Polytechnic, more and more companies (+22% in 2013) are interested in investing in technologies that allow them to handle and use big data. Furthermore, the number of companies seeking professional managers that are capable of interpreting data and assisting senior management on decision-making is also increasing. Most of the Italian companies (76% of 184 interviewed) claim that they use basic analytics strategically and another 36% use more sophisticated tools for forecasting activities (Mosca 2014, January 7).

Data Protection Agency and Privacy Issues

Despite the positive attitude and increased use of big data by Italian organizations, an increasing public expectation for privacy protection has emerged as a result of rising debates on personal data, data security, and protection in the whole European Union. In the past years, the Italian Data Protection Authority (DPA) reported several instances of data collection of telephone and Internet communications of Italian users which may have harmed Italians' fundamental rights (DPA 2014b). Personal data laws have been developed as these are considered important instruments for the overall protection of fundamental human rights, thereby adding new legal specifications to the existing privacy framework. The first specific law on personal data was adopted by the Italian Parliament in 1996, and this incorporated a number of guidelines already included in the European Union 1995 Data Protection Directive. At the same time, an independent authority, the Italian Data Protection Authority (Garante per la protezione dei dati personali), was created in 1997 to
protect fundamental rights and freedoms of people when personal data are processed. The Italian Data Protection Authority (DPA) is run by a four-member committee elected by the Italian Parliament for a seven-year mandate (DPA 2014a).

The main activities of the DPA consist of monitoring and assuring that organizations comply with the latest regulations on data protection and individual privacy. In order to do so, the DPA carries out inspections on organizations' databases and data storage systems to guarantee that their requirements for preserving individual freedom and privacy are of high standards. It checks that the activities of the police and the Italian Intelligence Service comply with the legislation, reports privacy infringements to judicial authorities, and encourages organizations to adopt codes of conduct promoting fundamental human rights and freedom. The authority also handles citizens' reports and complaints of privacy loss or any misuse or abuse of personal data. It bans or blocks activities that can cause serious harm to individual privacy and freedom. It grants authorizations to organizations and institutions to have access and use sensitive and/or judicial data. Sensitive and judicial data concern, for instance, information on a person's criminal records, ethnicity, religion or other beliefs, political opinions, membership of parties, trade unions and/or associations, health, or sex life. Access to sensitive and judicial data is granted only for specific purposes, for example, in situations where it is necessary to know more about a certain individual for national security reasons (DPA 2014b).

The DPA participates in data protection activities involving the European Union and other international supervisory authorities and follows existing international conventions (Schengen, Europol, and Customs Information System) when regulating Italian data protection and security matters. It carries out an important role in increasing public awareness of privacy legislation and in soliciting the Italian Parliament to develop legislation on new economic and social issues (DPA 2014b). The DPA has also formulated specific guidelines on cloud computing for helping Italian businesses. Yet, according to this authority, these cloud computing guidelines require that Italian laws are updated to be fully effective in regulating this area. Critics indicate that there are limits in existing Italian laws concerning the allocation of liabilities, data security, jurisdiction, and notification of infractions to the supervisory authority (Russo 2012).

Another area of great interest for the DPA is the collection of personal data via video surveillance both in the public and in the private sector. The DPA has acted on specific cases of video surveillance, sometimes banning and other times allowing it (DPA 2014c). For instance, the DPA reported to have banned the use of webcams in a nursery school to protect children's privacy and to safeguard freedom of teaching. It banned police headquarters from processing images collected via CCTV cameras installed in streets for public safety purposes because such cameras also captured images of people's homes. The use of customers' pre-recorded, operator-unassisted phone calls for debt collection purposes is among those activities that have been prohibited by this authority. Yet, the DPA permits the use of video surveillance in municipalities for counter-vandalism purposes (DPA 2014b).

Conclusion

Overall, Italy is advancing with the regulation of the big data phenomenon, following also the impetus given by the EU institutions and international debates on data protection, security, and privacy. Nonetheless, Italy is still lagging behind many western and European countries regarding the adoption and development of frameworks for a full digital economy. According to the Networked Readiness Index 2015 published by the World Economic Forum, Italy is ranked 55th. As indicated by the report, Italy's major weakness is still a political and regulatory environment that does not facilitate the development of a digital economy and its innovation system (Bilbao-Osorio et al. 2014).
Cross-References

▶ Cell Phone Data
▶ Data Security
▶ European Union
▶ Privacy
▶ Security Best Practices
▶ Surveillance Cameras

References

Bilbao-Osorio, B., Dutta, S., & Lanvin, B. (2014). The global information technology report 2014: Rewards and risks of big data. World Economic Forum. http://www3.weforum.org/docs/WEF_GlobalInformationTechnology_Report_2014.pdf. Accessed 31 Oct 2014.
DPA (2014a). Summary of key activities by the Italian DPA in 2013. http://www.garanteprivacy.it/web/guest/home/docweb/-/docweb-display/docweb/3205017. Accessed 31 Oct 2014.
DPA (2014b). Who we are. http://www.garanteprivacy.it/web/guest/home_en/who_we_are. Accessed 31 Oct 2014.
DPA (2014c). "Compiti del Garante" [Tasks of the DPA]. http://www.garanteprivacy.it/web/guest/home/autorita/compiti. Accessed 31 Oct 2014.
Internet World Stat (2017). Italy. http://www.internetworldstats.com/europa.htm. Accessed 15 May 2017.
Mosca, G. (2014, January 7). Big data, una grossa opportunità per il business, se solo si sapesse come usarli. La situazione in Italia [Big data, a great opportunity for business, if only one knew how to use it. The situation in Italy]. http://www.ilsole24ore.com/art/tecnologie/2014-01-07/big-data-grossa-opportunita-il-business-se-solo-si-sapesse-come-usarli-situazione-italia-110103.shtml?uuid=ABuGM6n. Accessed 31 Oct 2014.
Russo, M. (2012). Italian data protection authority releases guidelines on cloud computing. In McDermott Will & Emery (Eds.), International News (Focus on Data Privacy and Security, 4). http://documents.lexology.com/475569eb-7e6b-4aec-82df-f128e8c67abf.pdf. Accessed 31 Oct 2014.
Journalism

Brian E. Weeks1, Trevor Diehl2, Brigitte Huber2, and Homero Gil de Zúñiga2
1Communication Studies Department, University of Michigan, Ann Arbor, USA
2Media Innovation Lab (MiLab), Department of Communication, University of Vienna, Wien, Austria

The Pew Research Center notes that journalism is a mode of communication that provides the public verified facts and information in a meaningful context so that citizens can make informed judgments about society. As aggregated, large-scale data have become readily available, the practice of journalism has increasingly turned to big data to help fulfill this mission. Journalists have begun to apply a variety of computational and statistical techniques to organize, analyze, and interpret these data, which are then used in conjunction with traditional news narratives and reporting techniques. Big data are being applied to all facets of news including politics, health, the economy, weather, and sports.

The growth of "data-driven journalism" has changed many journalists' news gathering routines by altering the way news organizations interact with their audience, providing new forms of content for the public, and incorporating new methodologies to achieve the objectives of journalism. Although big data offer many opportunities for journalists to report the news in novel and interesting ways, critics have noted data journalism also faces potential obstacles that must be considered.

Origins of Journalism and Big Data

Contemporary data journalism is rooted in the work of reporters like Philip Meyer, Elliot Jaspin, Bill Dedman, and Stephen Doig. In his 1973 book, Meyer introduced the concept of "precision journalism" and advocated applying social science methodology to investigative reporting practices. Meyer argued that journalists needed to employ the same tools as scientific researchers: databases, spreadsheets, surveys, and computer analysis techniques.

Based on the work of Meyer, computer-assisted reporting developed as a niche form of investigative reporting by the late 1980s, as computers became smaller and more affordable. A notable example from this period was Bill Dedman's Pulitzer Prize winning series "The Color of Money." Dedman obtained lending statistics on computer tape through the federal Freedom of Information Act. His research team combined that data with demographic information from the US Census. Dedman found widespread
racial discrimination in mortgage lending practices throughout the Atlanta metropolitan area.

Over the last decade, the ubiquity of large, often free, data sets has created new opportunities for journalists to make sense of the world of big data. Where precision journalism was once the domain of a few investigative reporters, data-driven reporting techniques are now a common, if not necessary, component of contemporary news work. News organizations like The Guardian, The New York Times' Upshot, and The Texas Tribune represent the mainstream embrace of big data. Some websites, like Nate Silver's FiveThirtyEight, are entirely devoted to data journalism.

How Do Journalists Use Big Data?

Big data provide journalists with new and alternative ways to approach the news. In traditional journalism, reporters collect and organize information for the public, often relying on interviews and in-depth research to report their stories. Big data allow journalists to move beyond these standard methods and report the news by gathering and making sense of aggregated data sets. This shift in methods has required some journalists and news organizations to change their information-gathering routines. Rather than identifying potential sources or key resources, journalists using big data must first locate relevant data sets, organize the data in a way that allows them to tell a coherent story, analyze the data for important patterns and relationships, and, finally, report the news in a comprehensible manner. Because of the complexity of the data, news organizations and journalists are increasingly working alongside computer programmers, statisticians, and graphic designers to help tell their stories.

One important aspect of big data is visualization. Instead of writing a traditional story with text, quotations, and the inverted-pyramid format, big data allow journalists to tell their stories using graphs, charts, maps, and interactive features. These visuals enable journalists to present insights from complicated data sets in a format that is easy for the audience to understand. These visuals can also accompany and buttress news articles that rely on traditional reporting methods.

Nate Silver writes that big data analyses provide several advantages over traditional journalism. They allow journalists to further explain a story or phenomenon through statistical tests that explore relationships, to more broadly generalize information by looking at aggregate patterns over time, and to predict future events based on prior occurrences. For example, using an algorithm based on historical polling data, Silver's website, FiveThirtyEight (formerly hosted by the New York Times), correctly predicted the outcome of the 2012 US presidential election in all 50 states. Whereas methods of traditional journalism often lend themselves to more microlevel reporting, more macrolevel and general insights can be gleaned from big data.

An additional advantage of big data is that, in some cases, they reduce the necessary resources needed to report the story. Stories that would otherwise have taken years to produce can be assembled relatively quickly. For example, WikiLeaks provided news organizations nearly 400,000 unreleased US military reports related to the war in Iraq. Sifting through these documents using traditional reporting methods would take a considerable amount of time, but news outlets like The Guardian in the UK applied computational techniques to quickly identify and report the important stories and themes stemming from the leak, including a map noting the location of every death in the war.

Big data also allow journalists to interact with their audience to report the news. In a process called crowdsourcing the news, large groups of people contribute relevant information about a topic, which in the aggregate can be used to make generalizations and identify patterns and relationships. For example, in 2013 the New York Times website released an interactive quiz on American dialects that used responses to questions about accents and phrases to demonstrate regional patterns of speech in the US. The quiz became the most visited content on the website that year.
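A minimal Python sketch (with invented responses, not the Times' actual methodology) shows the basic aggregation step behind such crowdsourced features: tallying the most common answer in each region.

```python
from collections import Counter, defaultdict

# Invented quiz responses: (region, word used for a sweetened carbonated drink).
responses = [
    ("Northeast", "soda"), ("Northeast", "soda"), ("Midwest", "pop"),
    ("Midwest", "pop"), ("Midwest", "soda"), ("South", "coke"),
]

def dominant_answer_by_region(responses):
    """Return each region's most frequent answer and its share of responses."""
    by_region = defaultdict(Counter)
    for region, answer in responses:
        by_region[region][answer] += 1
    result = {}
    for region, counts in by_region.items():
        answer, n = counts.most_common(1)[0]
        result[region] = (answer, n / sum(counts.values()))
    return result

for region, (answer, share) in dominant_answer_by_region(responses).items():
    print(f"{region}: '{answer}' ({share:.0%} of responses)")
```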
Journalism 3

Data Sets and Methodologies be designed to automatically write news stories,


without a human author. These automated “robot
Journalists have a multitude of large data sets and journalists” have been used to produce stories for
methodologies at their disposal to create news news outlets like the Associated Press and The
stories. Much of the data used is public and orig- Los Angeles Times. Algorithms have also changed
inates from government agencies. For example, the way news is delivered, as news aggregators
the US government has created a website, data. like Google News employ these methods to col-
gov, which offers over 100,000 datasets in a vari- lect and provide users personalized news feed.
ety of areas including education, finance, health,
jobs, and public safety. Other data, like the
WikiLeaks reports, were not intended to be public
Limitations of Big Data for Journalism
but became primary sources of big data for jour-
nalists. News organizations can also utilize publi-
Although big data offer numerous opportunities to
cally available data from private Internet
journalists reporting the news, scholars and prac-
companies like Google or social networking
titioners have both highlighted several potential
sites such as Facebook and Twitter to help report
general limitations of these data. As much as big
the news.
data can help journalists in their reporting, they
Once the data are secured, journalists can apply
need to make an active effort to contextualize the
numerous techniques to make sense of the data.
information. Big data storytelling also elicits
For example, at a basic level, journalists could get
moral and ethical concerns with respect the data
a sense of public interest about a topic or issue by
collection of individuals as aggregated informa-
examining the volume of online searches about
tion. These reporting techniques also need to bear
the topic or the number of times it was referenced
in mind potential data privacy transgressions.
in social media. Mapping or charting occurrences
of events across regions or countries also offers
basic descriptive visualizations of the data. Jour-
nalists can also apply content or sentiment ana- Cross-References
lyses to get a sense of the patterns of phrases or
tone within a set of documents. Further, network ▶ Big Data Storytelling (Digital Storytelling)
analyses could be utilized to assess connections ▶ Computational Social Sciences
between points in the data set, which could pro- ▶ Data Visualization
vide insights on the flow or movement of infor- ▶ Information Society
mation, or on power structures. ▶ Interactive Data Visualization
These methods can be combined to produce a ▶ Open Data
more holistic account of events. For example,
journalists at the Associated Press used textual
and network analysis to examine almost 400,000
Further Readings
WikiLeaks documents related to the Iraq war that
identified related clusters of words used in the Pew Research Center. The core principles of journalism.
reports. In doing so, they were able to demonstrate http://www.people-press.org/1999/03/30/section-i-the-
patterns of content within the documents, which core-principles-of-journalism. Accessed April 2016.
Shorenstein Center on Media, Politics and Public Policy.
shed previously unseen light on what was happen-
Understanding data journalism: Overview of resources,
ing on the ground during the war. tools and topics. http://journalistsresource.org/reference/
Computer algorithms, and self-taught machine reporting/understanding-data-journalism-overview-tools-
learning techniques, also play an important role in topics. Accessed April 2016.
Silver, N. What the fox knows. http://fivethirtyeight.com/
the big data journalistic process. Algorithms can
features/what-the-fox-knows. Accessed August 2014.
4 Journalism

Special Issues and Volumes The ANNALS of American of the American Academy of
Digital Journalism–Journalism in an Era of Big Data: Political and Social Science – Toward Computational
Cases, concepts, and critiques. v. 3/3 (2015). Social Science: Big Data in Digital Environments.
Social Science Computer Review – Citizenship, Social v. 659/1 (2015).
Media, and Big Data: Current and Future Research in
the Social Sciences (in press).
K

Keystroke Capture

Gordon Alley-Young
Department of Communications and Performing Arts, Kingsborough Community College, City University of New York, New York, NY, USA

Synonyms

Keycatching; Keylogger; Keystroke logger; Keystroke recorder

Introduction

Keystroke capture (KC) tracks computer or mobile device users' keyboard activity using hardware or software. KC is used by businesses to keep employees from misusing company technology, by families to monitor possible misuse of family computers, and by computer hackers who seek gain through secretly possessing an individual's personal information and account passwords. KC software can be purchased for use on a device or may be placed maliciously without the user's knowledge through contact with untrusted websites or e-mail attachments. KC hardware can also be purchased and is disguised to look like computer cords and accessories. KC detection can be difficult because the software and hardware are designed to avoid detection by anti-KC programs. KC can be avoided by using security software as well as through careful computing practices. KC affects individual computer users as well as small, medium, and large organizations internationally.

How Keystroke Capture (KC) Works

Keystroke capture (KC), also called keystroke logger, keylogger, keystroke recorder, and keycatching, tracks a computer or mobile device user's activities, including keyboard activity, using hardware or software. KC is knowingly employed by businesses to deter their employees from misusing company devices and also by families seeking to monitor the technology activities of vulnerable family members (e.g., teens, children). Romantic partners and spouses use KC to catch their significant others engaged in deception and/or infidelity. Computer hackers install KC onto unsuspecting users' devices in order to steal their personal data, website passwords, and financial information, to read their correspondence/online communication, to stalk/harass/intimidate users, and/or to sabotage organizations or individuals that hackers consider unethical. When used covertly to hurt and/or steal from others, KC is called malware (malicious software used to interfere with a device) and/or spyware (software used to steal information or to spy on someone).
KC software (e.g., WebWatcher, SpectorPro, Cell Phone Spy) is available for free and also for purchase, and it is usually downloaded onto the device, where it either saves captured data onto the hard drive or sends it through networks/wirelessly to another device/website. KC hardware (e.g., KeyCobra, KeyGrabber, KeyGhost) may be an adaptor device into which a keyboard/mouse USB cord is plugged before it is inserted into the computer, or it may look like an extension cable. Hardware can also be installed inside the computer/keyboard. KC is placed on devices maliciously by hackers when computer and mobile device users visit websites, open e-mail attachments, or click links to files that are from untrusted sources. Individual technology users are frequently lured by untrusted sources and websites that offer free music files or pornography. KCs infiltrate organizations' computers when an employee is completing company business (i.e., financial transactions) on a device that he/she also uses to surf the Internet in their free time.

When a computer is infected with a malicious KC, it can be turned into what is called a zombie, a computer that is hijacked and used to spread KC malware/spyware to other unsuspecting individuals. A network of zombie computers that is controlled by someone other than the legitimate network administrator is called a botnet. In 2011, the FBI shut down the Coreflood botnet, a global KC operation affecting 2 million computers. This botnet spread KC software via an infected e-mail attachment and seemed to infect only computers using Microsoft Windows operating systems. The FBI seized the operators' computers and charged 13 "John Doe" defendants with wire fraud, bank fraud, and illegally intercepting electronic communication. Then in 2013 security firm SpiderLabs found 2 million passwords in the Netherlands stolen by the Pony botnet. While researching the Pony botnet, SpiderLabs discovered that it contained over a million and a half Twitter and Facebook passwords and over 300,000 Gmail and Yahoo e-mail passwords. Payroll management company ADP, with over 600,000 clients in 125 countries, was also hacked by this botnet.

The Scope of the Problem Internationally

In 2013 the Royal Canadian Mounted Police (RCMP) served White Falcon Communications with a warrant that alleged that the company was controlling an unknown number of computers known as the Citadel botnet (Vancouver Sun 2013). In addition to distributing KC malware/spyware, the Citadel botnet also distributed spam and conducted network attacks that reaped over $500 million in illegal profit, affecting more than 5 million people globally (Vancouver Sun 2013). The Royal Bank of Canada and HSBC in Great Britain were among the banks attacked by the Citadel botnet (Vancouver Sun 2013). The operation is believed to have originated from Russia or Ukraine, as many websites hosted by White Falcon Communications end in the .ru suffix (i.e., the country code for Russia). Microsoft claims that the 1,400 botnets running Citadel malware/spyware were interrupted due to the RCMP action, with the highest infection rates in Germany (Vancouver Sun 2013). Other countries affected were Thailand, Italy, India, Australia, the USA, and Canada. White Falcon owner Dmitry Glazyrin's voicemail claimed he was out of the country on business when the warrant was served (Vancouver Sun 2013).

Trojan horses allow others to access and install KC and other malware. Trojan horses can alter or destroy a computer and its files. One of the most infamous Trojan horses is called Zeus. Don Jackson, a senior security researcher with Dell SecureWorks who has been widely interviewed, claims that Zeus is so successful because those behind it, seemingly in Russia, are well funded and technologically experienced, and this allows them to keep Zeus evolving into different variations (Button 2013). In 2012 Microsoft's Digital Crimes Unit with its partners disrupted a variation of Zeus botnets in Pennsylvania and Illinois responsible for an estimated 13 million infections globally. Another variation of Zeus called GameOver tracks computer users' every login and uses the information to lock them out and drain their bank accounts (Lyons 2014).
In some instances GameOver works in concert with CryptoLocker. If GameOver finds that an individual has little in the bank, then CryptoLocker will encrypt users' valuable personal and business files, agreeing to release them only once a ransom is paid (Lyons 2014). Often ransoms must be paid in Bitcoin, which is Internet based and currently anonymous and difficult to track. Victims of CryptoLocker will often receive a request for a one Bitcoin ransom (estimated to be worth 400€/$500USD) to unlock the files on their personal computer, which could include records for a small business, academic research, and/or family photographs (Lyons 2014).

KC is much more difficult to achieve on a smartphone, as most operating systems operate only one application at a time, but it is not impossible. As an experiment, Dr. Hao Chen, an Associate Professor in the Department of Computer Science at the University of California, Davis, with an interest in security research, created KC software that operates using smartphone motion data. When tested, Chen's application correctly guessed more than 70% of the keystrokes on a virtual numerical keypad, though he asserts that it would probably be less accurate on an alphanumerical keypad (Aron 2011). Point-of-sale (POS) data, gathered when a credit card purchase is made in a retail store or restaurant, is also vulnerable to KC software (Beierly 2010). In 2009 seven Louisiana restaurant companies (i.e., Crawfish Town USA Inc., Don's Seafood & Steak House Inc., Mansy Enterprises LLC, Mel's Diner Part II Inc., Sammy's LLC, Sammy's of Zachary LLC, and B.S. & J. Enterprises Inc.) sued Radiant Systems Inc., a POS system maker, and Computer World Inc., a POS equipment distributor, charging that the vendors did not secure the Radiant POS systems. The customers were then defrauded by KC software, and restaurant owners incurred financial costs related to this data capture. Similarly, Patco Construction Company, Inc. sued People's United Bank for failing to implement sufficient security measures to detect and address suspicious transactions due to KC. The company finally settled for $345,000, the amount that was stolen plus interest.

Teenage computer hackers, so-called hacktivists (people who protest ideologically by hacking computers), and governments under the auspices of cyber espionage engage in KC activities, but cyber criminals attain the most notoriety. Cyber criminals are as effective as they are evasive due to the organization of their criminal gangs. After taking money from bank accounts via KC, many cyber criminals send the payments to a series of money mules. Money mules are sometimes unwitting participants in fraud who are recruited via the Internet with promises of money for working online. The mules are then instructed to wire the money to accounts in Russia and China (Krebs 2009). Mules have no face-to-face contact with the heads of KC operations, so it can be difficult to secure prosecutions, though several notable cyber criminals have been identified, charged, and/or arrested. In late 2013 the RCMP secured a warrant for Dmitry Glazyrin, the apparent operator of a botnet, who left Canada before the warrant could be served. Then in early 2014, Russian SpyEye creator Aleksandr Panin was arrested for cyber crime (IMD 2014). Another is Estonian Vladimir Tsastsin, the cyber criminal who created DNSChanger and became rich off online advertising fraud and KC by infecting millions of computers. Finnish Internet security expert Mikko Hermanni Hyppönen claimed that Tsastsin owned 159 Estonian properties when he was arrested in 2011 (IMD 2014). Tsastsin was released 10 months after his arrest due to insufficient proof. As of 2014 Tsastsin has been extradited to the US for prosecution (IMD 2014). Also in 2014 the US Department of Justice (DOJ) filed papers accusing a Russian, Evgeniy Mikhailovich Bogachev, of leading the gang behind GameOver Zeus. The DOJ claims GameOver Zeus caused $100 million in losses from individuals and large organizations.

Suspected Eastern European malware/spyware oligarchs have received ample media attention for perpetuating KC via botnets and Trojan horses, while other perpetrators have taken the public by surprise. In 2011 critics accused software company Carrier IQ of placing KC and geographical position spyware in millions of users' Android devices (International Business Times 2011).
The harshest critics have alleged illegal wiretapping on the part of the company, while Carrier IQ has rebutted that what was identified as spyware is actually diagnostic software that provides network improvement data (International Business Times 2011). Further, the company stated that the data was both encrypted and secured and not sold to third parties. In January 2014, 11 students were expelled from Corona del Mar High School in California's affluent Orange County for allegedly using KC to cheat for several years with the help of tutor Timothy Lai. Police report being unable to find Lai, a former resident of Irvine, CA, since the allegations surfaced in December 2013. The students are accused of placing KC hardware onto teachers' computers to get passwords to improve their grades and steal exams. All 11 students signed expulsion agreements in January 2014 whereby they abandoned their right to appeal their expulsions in exchange for being able to transfer to other schools in the district. Subsequently, five of the students' families sued the district for denying the students the right to appeal and/or claiming tutor Lai committed the KC crimes. By the end of March, the school district had spent almost $45,000 in legal fees.

When large organizations are hacked via KC, the news is reported widely. For instance, Visa found KC software that was able to transmit card data to a fixed e-mail or IP address where hackers could retrieve it. Here the hackers attached KC to a POS system. Similarly, KC was used to capture the keystrokes of pilots flying the US military's Predator and Reaper drones that have been used in Afghanistan (Shachtman 2011). Military officials were unsure whether the KC software was already built into the drones or was the work of a hacker (Shachtman 2011). Finally, Kaspersky Labs has publicized how it is possible to get control of BMW's Connected Drive system via KC and other malware, and thus gain control of a luxury car that uses this Internet-based system.

Research by Internet security firm Symantec shows that many small and medium-sized businesses believe that malware/spyware is a problem only for large organizations (e.g., Visa, the US military). However, since 2010 the company notes that 40% of all companies attacked have fewer than 500 employees, while only 28% of attacks target large organizations. A case in point is a 2012–2013 attack on a California escrow firm, Efficient Services Escrow Group of Huntington Beach, CA, that had one location and nine employees. Using KC malware/spyware, the hackers drained the company of $1.5 million in three transactions wired to bank accounts in China and Russia. Subsequently, $432,215 sent to a Moscow bank was recovered, while the $1.1 million sent to China was never recouped. The loss was enough to shutter the business's one office and put its nine employees out of work.

Though popular in European computer circles, the relatively low-profile Chaos Computer Club learned that German state police were using KC malware/spyware as well as saving screenshots and activating the cameras/microphones of club members (Kulish and Homola 2014). News of the police's actions led the German justice minister to call for stricter privacy rules (Kulish and Homola 2014). This call echoes a 2006 commission report to the EU Parliament that calls for strengthening the regulatory framework for electronic communications. KC is a pressing concern in the US; as of 2014, 18 states and one territory (i.e., Alaska, Arizona, Arkansas, California, Georgia, Illinois, Indiana, Iowa, Louisiana, Nevada, New Hampshire, Pennsylvania, Rhode Island, Texas, Utah, Virginia, Washington, Wyoming, and Puerto Rico) all have anti-spyware laws on the books (NCSL 2015).

Tackling the Problem

The problem of malicious KC can be addressed through software interventions and changes in computer users' behaviors, especially when online. Business travelers may be at a greater risk for losses if they log onto financial accounts using hotel business centers, as these high-traffic areas provide ample opportunities to hackers (Credit Union Times 2014). Many Internet security experts recommend not using public wireless networks, where KC spyware thrives. Experts at Dell also recommend that banks have separate computers dedicated only to banking transactions, with no emailing or web browsing.
Individuals without the resources to devote one computer to financial transactions can, experts argue, protect themselves from KC by changing several computer behaviors. First, individuals should change their online banking passwords regularly. Second, they should not use the same password for multiple accounts or use common words or phrases. Third is checking one's bank account on a regular basis for unauthorized transfers. Finally, it is important to log off of banking websites when finished with them and to never click on third-party advertisements that post to online banking sites and take you to a new website upon clicking.

Configurations of one's computer features, programs, and software are also urged to thwart KC. This includes removing remote access (i.e., accessing one's work computer from home) configurations when they are not needed, in addition to using a strong firewall (Beierly 2010). Users need to continually check their devices for unfamiliar hardware attached to mice or keyboards as well as check the listings of installed software (Adhikary et al. 2012; Beierly 2010). Many financial organizations are opting for virtual keypads and virtual mice, especially for online transactions (Kumar 2009). Under this configuration, instead of typing a password and username on the keyboard using number and letter keys, the user scrolls through numbers and letters using the cursor on a virtual keyboard. Using the online virtual keyboard for a banking password, when available, avoids the risk of keystrokes being logged.

Conclusion

Having anti-KC/malware/spyware software alone does not guarantee protection, but experts agree that it is an important component of an overall security strategy. Anti-KC programs include SpyShelter Stop-Logger, Zemana AntiLogger, KeyScrambler Premium, Keylogger Detector, and GuardedID Premium. Some computer experts claim that PCs are more susceptible to KC malware/spyware than are Macs, as KC malware/spyware is often reported to exploit holes in PCs' operating systems, but new wisdom suggests that all devices can be vulnerable, especially when programs and plug-ins are added to devices. Don Jackson, a senior security researcher with Dell SecureWorks, argues that one of the most effective methods for preventing online business fraud, the air-gap technique, is not widely utilized despite being around since 2005. The air-gap technique creates a unique verification code that is transmitted as a digital token, text message, or other device not connected to the online account device, so the client can read and then key in the code as a signature for each transaction over a certain amount. Alternately, in 2014 Israeli researchers presented research on a technique to hack an air-gap network using just a cellphone.
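The verification-code idea behind the air-gap technique can be sketched briefly. The Python example below is only an illustration of the general approach, not any bank's actual implementation: the shared secret, the transaction counter, and the six-digit format are assumptions made for the example. It derives a one-time code from a secret key and a transaction counter with an HMAC, in the spirit of HMAC-based one-time-password schemes; in practice the code would be delivered out of band (e.g., by text message or hardware token) and keyed in by the client to authorize a transaction.

```python
import hmac
import hashlib
import struct

def one_time_code(secret: bytes, counter: int, digits: int = 6) -> str:
    """Derive a short numeric code from a secret and a transaction counter."""
    msg = struct.pack(">Q", counter)                      # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    offset = digest[-1] & 0x0F                            # dynamic truncation
    value = int.from_bytes(digest[offset:offset + 4], "big") & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)

# Hypothetical shared secret known to the bank and embedded in the client's token.
SECRET = b"example-shared-secret"

# The bank generates a code for transaction #42 and sends it to the customer's
# phone; the customer keys it in, and the bank verifies the same computation.
sent = one_time_code(SECRET, 42)
entered = sent  # what the customer types back in
print("transaction approved" if hmac.compare_digest(sent, entered) else "rejected")
```

Because the code is generated and displayed on a device that never touches the possibly infected computer, a keystroke logger that captures the typed-in code learns nothing reusable for the next transaction.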

Cross-References

▶ Banking Industry
▶ Canada
▶ China
▶ Cyber Espionage
▶ Cyber Threat/Attack
▶ Department of Homeland Security
▶ Germany
▶ Microsoft
▶ Point-of-Sales Data
▶ Royal Bank of Canada
▶ Spyware
▶ Visa

Further Readings

Adhikary, N., Shrivastava, R., Kumar, A., Verma, S., Bag, M., & Singh, V. (2012). Battering keyloggers and screen recording software by fabricating passwords. International Journal of Computer Network & Information Security, 4(5), 13–21.
Aron, J. (2011). Smartphone jiggles reveal your private data. New Scientist, 211(2825), 21.
Beierly, I. (2010). They'll be watching you. Retrieved from http://www.hospitalityupgrade.com/_files/File_Articles/HUSum10_Beierly_Keylogging.pdf
Button, K. (2013). Wire and online banking fraud continues to spike for businesses. Retrieved from http://www.americanbanker.com/issues/178_194/wire-and-online-banking-fraud-continues-to-spike-for-businesses-1062666-1.html
Credit Union Times. (2014). Hotel business centers hacked. Credit Union Times, 25(29), 11.
IMD: International Institute for Management Development. (2014). Cybercrime buster speaks at IMD. Retrieved from http://www.imd.org/news/Cybercrime-buster-speaks-at-IMD.cfm
International Business Times. (2011). Carrier iq spyware: Company's Android app logging the keystrokes of millions. Retrieved from http://www.ibtimes.com/carrier-iq-spyware-companys-android-app-logs-keystrokes-millions-video-377244
Krebs, B. (2009). Data breach highlights role of 'money mules'. Retrieved from http://voices.washingtonpost.com/securityfix/2009/09/money_mules_carry_loot_for_org.html
Kulish, N., & Homola, V. (2014). Germans condemn police use of spyware. Retrieved from http://www.nytimes.com/2011/10/15/world/europe/uproar-in-germany-on-police-use-of-surveillance-software.html?_r=0
Kumar, S. (2009). Handling malicious hackers & assessing risk in real time. Siliconindia, 12(4), 32–33.
Lyons, K. (2014). Is your computer already infected with dangerous Gameover Zeus software? Virus could be lying dormant in thousands of Australian computers. Retrieved from http://www.dailymail.co.uk/news/article-2648038/Gameover-Zeus-lying-dormant-thousands-Australian-computers-without-knowing.html#ixzz3AmHLKlZ9
NCSL: National Conference of State Legislatures. (2015). State spyware laws. Retrieved from http://www.ncsl.org/research/telecommunications-and-information-technology/state-spyware-laws.aspx
Shachtman, N. (2011). Exclusive: Computer virus hits US drone fleet. Retrieved from http://www.wired.com/2011/10/virus-hits-drone-fleet/
Vancouver Sun. (2013). Police seize computers linked to large cybercrime operation: Malware responsible for over $500 million in losses has affected more than five million people globally. Retrieved from http://www.vancouversun.com/news/Police+seize+computers+linked+large+cybercrime+operation/8881243/story.html#ixzz3Ale1G13s
L

LexisNexis

Jennifer Summary-Smith
Culver-Stockton College, Canton, MO, USA

As stated on its website, LexisNexis is a leading global provider of content-enabled workflow solutions. This corporation provides data and solutions for professionals in areas such as academia, accounting, the corporate world, government, law enforcement, legal, and risk management. LexisNexis is a subscription-based service, with two data centers located in Springfield and Miamisburg, Ohio. The centers are among the largest complexes of their kind in the United States, providing LexisNexis with "one of the most complete comprehensive collections of online information in the world."

Data Centers

The LexisNexis data centers hold network servers, software, and telecommunication equipment, which is a vital component of the entire range of LexisNexis products and services. The data centers serve LexisNexis Group Inc., providing assistance for application development, certification and administrative services, and testing. The entire complex serves its Reed Elsevier sister companies while also providing LexisNexis customers with the following: backup services, data hosting, and online services. LexisNexis opened its first remote data center and development facility in Springfield, Ohio, in 2004, which hosts new product development. Both data centers function as a backup and recovery facility for each other.

According to the LexisNexis website, its customers use services that span multiple servers and operating systems. For example, when a subscriber submits a search request, the systems explore and sift through massive amounts of information. The answer set is typically returned to the customer within 6–10 seconds, resulting in a 99.99% average for reliability and availability of the search. This service is accessible to five million subscribers, with nearly five billion documents of source information available online and stored in the Miamisburg facility. The online services also provide access to externally hosted data from the Delaware Secretary of State, Dun & Bradstreet Business Reports, Historical Quote, and Real-Time Quote. Given that a large incentive for data center services is to provide expansion capacity for all future hosting opportunities, this has led to an increase in the percentage of total revenue for Reed Elsevier. Currently, the Miamisburg data center supports over two billion dollars in online revenue for Reed Elsevier.
Mainframe Servers

There are over 100 servers housed in the Springfield center, managing over 100 terabytes of data storage. As for the Miamisburg location, this complex holds 11 huge mainframe servers, running 34 multiple virtual storage (MVS) operating system images. The center also has 300 midrange Unix servers and almost 1,000 multiprocessor NT servers. They provide a wide range of computer services, including patent images for customers, preeminent US case law citation systems, hosting channel data for Reed Elsevier, and computing resources for the LexisNexis enterprise. As the company states, its processors have access to over 500 terabytes (or one trillion characters) of data storage capacity.

Telecommunications

LexisNexis has developed a large telecommunications network, permitting the corporation to support its data collection requirements while also serving its customers. As noted on its website, subscribers to the LexisNexis Group have a search rate of one billion times annually. LexisNexis also provides bridges and routers and maintains firewalls, high-speed lines, modems, and multiplexors, providing an exceptional degree of connectivity.

Physical Dimensions of the Miamisburg Data Center

LexisNexis Group has hardware, software, electrical, and mechanical systems housed in a 73,000 ft² data center hub. Its sister complex, located in Springfield, comprises a total of 80,000 ft². In these facilities, the data center hardware, software, electrical, and mechanical systems have multiple levels of redundancy in the event that a single component fails, ensuring uninterrupted service. The company's website states that its systems are maintained and tested on a regular basis to ensure they perform correctly in case of an emergency. The LexisNexis Group also holds and stores copies of critical data off-site. Multiple times a year, emergency business resumption plans are tested. Furthermore, the data center has system management services 365 days a year and 24 hours a day provided by skilled operations engineers and staff. If needed, there are additional specialists on site, or on call, to provide the best support to customers. According to its website, LexisNexis invests a great deal in protection architecture to prevent hacking attempts, viruses, and worms. In addition, the company also has third-party contractors that conduct security studies.

Security Breach

In 2013, Byron Acohido reported that a hacking group hit three major data brokerage companies. LexisNexis, Dun & Bradstreet, and Kroll Background America are companies that stockpile and sell sensitive data. The group that hacked these data brokerage companies specialized in obtaining and selling social security numbers. The security breach was disclosed by cybersecurity blogger Brian Krebs. He stated that the website ssndob.ms (SSNDOB, an acronym for social security number and date of birth) markets itself on underground cybercrime forums, offering services to customers who want to look up social security numbers, birthdays, and other data on any US resident. LexisNexis found an unauthorized program called nbc.exe on two of its systems listed in the botnet interface network located in Atlanta, Georgia. The program was placed as far back as April 2013, compromising their security for at least 5 months.

LexisNexis Group Expansion

As of July 2014, LexisNexis Risk Solutions expanded its healthcare solutions to the life science marketplace. In an article, Amanda Hall notes that an internal analysis revealed that 40% of the customer files in a typical life science company have missing or inaccurate information.
LexisNexis Risk Solutions has leveraged its leading databases, reducing costs, improving effectiveness, and strengthening identity transparency. LexisNexis is able to deliver data on over 6.5 million healthcare providers in the United States. This will benefit life science companies, allowing them to tailor their marketing and sales strategies and to identify the correct providers to pursue. The LexisNexis databases are more efficient, which will help health science organizations gain compliance with federal and state laws.

Following the healthcare solutions announcement, Elisa Rodgers writes that Reed Technology and Information Services, Inc., a LexisNexis company, acquired PatentCore. PatentCore is an innovator in patent data analytics. PatentAdvisor is a user-friendly suite, delivering information to assist with more effective patent prosecution and management. Its web-based patent analytic tools will help IP-driven companies and law firms by making patent prosecution a more strategic and probable process.

The future of the LexisNexis Group should include more acquisitions, expansion, and increased capabilities for the company. According to its website, the markets for its companies have grown over the last three decades, serving professionals in academic institutions, corporations, governments, and business. LexisNexis Group provides critical information, in easy-to-use electronic products, to the benefit of subscribed customers. The company has a long history of fulfilling its mission statement "to enable its customers to spend less time searching for critical information and more time using LexisNexis knowledge and management tools to guide critical decisions." For more than a century, legal professionals have trusted the LexisNexis Group. It appears that the company will continue to maintain this status and remain one of the leading providers in the data brokerage marketplace.

Cross-References

▶ American Bar Association
▶ Big Data Quality
▶ Data Breach
▶ Data Center
▶ Legal Issues
▶ Reed Elsevier

Further Readings

Acohido, B. LexisNexis, Dunn & Bradstreet, Kroll hacked. http://www.usatoday.com/story/cybertruth/2013/09/26/lexisnexis-dunn–bradstreet-altegrity-hacked/2878769/. Accessed July 2014.
Hall, A. LexisNexis verified data on more than 6.5 million providers strengthens identity transparency and reduces costs for life science organizations. http://www.benzinga.com/pressreleases/14/07/b4674537/lexisnexis-verified-data-on-more-than-6-5-million-providers-strengthens. Accessed July 2014.
Krebs, B. Data broker giants hacked by ID theft service. http://krebsonsecurity.com/2013/09/data-broker-giants-hacked-by-id-theft-service/. Accessed July 2014.
LexisNexis. http://www.lexisnexis.com. Accessed July 2014.
Rodgers, E. Adding multimedia reed tech strengthens line of LexisNexis intellectual property solutions by acquiring PatentCore, an innovator in patent data analytics. http://in.reuters.com/article/2014/07/08/supp-pa-reed-technology-idUSnBw015873a+100+BSW20140708. Accessed July 2014.
Link/Graph Mining

Derek Doran
Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA

Synonyms

Network analysis; Network science; Relational data analytics

Definition/Introduction

Link/graph mining is defined as the extraction of information within a collection of interrelated objects. Whereas conventional data mining imagines a database as a collection of "flat" tables, where entities are rows and attributes of these entities are columns, link/graph mining imagines entities as nodes or vertices in a network, with attributes attached to the nodes themselves. Relationships among datums in a "flat" database may be seen by primary key relationships or by common values across a set of attributes. In the link/graph mining view of a database, these relationships are made explicit by defining links or edges between vertices. The edges may be homogeneous, where a single kind of relationship defines the edges that are formed, or heterogeneous, where multiple kinds of data are used to develop a vertex set, and relationships define edges among network vertices. For example, a relation from vertex A to B and a relation from vertex C to D in a homogeneous graph means that A is related to B in the same way that C is related to D. An example of a homogeneous graph may be one where nodes represent individuals and connections represent a friendship relationship. An example of a heterogeneous graph is one where different types of network devices connect to each other to form a corporate intranet. Different node types correspond to different device types, and different relationships may correspond to the type of network protocol that two devices use to communicate with each other. Networks may be directed (e.g., a link may be present from A to B but not vice versa) or undirected (e.g., a link from A to B exists if and only if a link from B to A exists). Link/graph mining is intimately related to network science, which is the scientific study of the structure of complex systems. Common link/graph mining tasks include discovering shortest or expected paths in the network, an importance ranking of nodes or vertices, understanding relationship patterns, identifying common clusters or regions of a graph, and modeling propagation phenomena across the graph. Random graph models give researchers a way to identify whether a structural or interaction pattern seen within a dataset is statistically significant.
Network Representations of Data

While a traditional "tabular" representation of a dataset contains information necessary to understand a big dataset, a network representation makes explicit datum relations that may be implicit in a data table. For example, in a database of employee personnel and their meeting calendars, a network view may be constructed where employees are nodes and edges are present if two employees will participate in the same meeting. The network thus captures a "who works with who" relationship that is only implicit in the data table. Analytics over the network representation itself can answer queries such as "how did somebody at meeting C hear about information that was only discussed during meeting A?", or "which employee may have been exposed to the most amount of potential information, rumors, and views, as measured by participating in many meetings where few other participants overlap?"

The network representation of data has another important advantage: the network itself represents the structure of a complex system of interconnected participants. These participants could be people or even components of a physical system. There is some agreement in the scientific community that the complexity of most technological, social, biological, and natural systems is best captured by its representation as a network. The field of network science is devoted to the scientific application of link and graph mining techniques to quantitatively understand, model, and make predictions over complex systems. Network science defines two kinds of frameworks under which link/graph mining is performed: (i) exploratory analysis and (ii) hypothesis-driven analysis. In exploratory analysis, an analyst has no specific notion about why and how nodes in a complex system connect or are related to each other or why a complex network takes on a specific structure. Exploratory analysis leads to a hypothesis about an underlying mechanism of the system based on regularly occurring patterns or based on anomalous graph metrics. In hypothesis-driven analysis, the analyst has some evidence at hand supporting an underlying mechanism about how a system operates and is interested in understanding how the structural qualities of the system speak in favor or in opposition to the mechanism. Under either setting, hypotheses may be tested by comparing observations against random network models to identify whether or not patterns in support or in opposition of a hypothesis are significant or merely occurred by chance. Network science is intimately tied to link/graph mining: it defines an apparatus for analysts to use link/graph mining methods that can answer important questions about a complex system. Similarly, network science procedures and analyses are the primary purpose for the development of link/graph mining techniques. The utility of one would thus not nearly be as high without the other.

Representation

The mathematical representation of a graph is a basic preprocessing step for any link/graph mining task. One form may be as follows: every node in the graph is labeled with an integer i = 1 . . . n, and a tuple (i, j) is defined for a relationship between nodes i and j. A network may then be defined by the value n and a list of all tuples. For example, let n = 5 and define the set {(1, 2), (3, 4), (2, 4), (4, 1), (2, 3)}. This specifies a graph with five vertices, one of which is disconnected (vertex 5) and the others of which have edges between them as defined by the set. Such a specification of a network is called an edge list. Another approach is to translate the edge list representation into an adjacency matrix A. This is defined as an n × n matrix where the element A_ij, corresponding to the ith row and jth column of the matrix, is equal to 1 if the tuple (i, j) or (j, i) exists in the edge list. When edges are unlabeled or unweighted, A is simply a binary matrix. Alternatively, if the graph is heterogeneous or allows multiple relationships between the same pair of nodes, then A_ij is equal to the number of edges between i and j. When A is not symmetric, the graph is directed rather than undirected.
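The edge list and adjacency matrix representations just described are easy to see in code. The following minimal Python sketch uses the five-vertex example from the text and only the standard library; it converts the edge list into a symmetric binary adjacency matrix for an undirected graph.

```python
n = 5
edge_list = [(1, 2), (3, 4), (2, 4), (4, 1), (2, 3)]

# Build an n x n adjacency matrix A with A[i][j] = 1 when (i, j) or (j, i)
# appears in the edge list; vertices are 1-indexed as in the text.
A = [[0] * n for _ in range(n)]
for i, j in edge_list:
    A[i - 1][j - 1] = 1
    A[j - 1][i - 1] = 1  # undirected: keep the matrix symmetric

for row in A:
    print(row)
# Row 5 is all zeros, reflecting the disconnected vertex 5.
```

For a directed graph one would set only the (i, j) entry, and for a multigraph one would increment the entry instead of assigning 1, matching the A_ij conventions above.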

Types of Link/Graph Mining Techniques

The discovery and analysis of algorithms for extracting knowledge from networks are ongoing. Common types of analyses, emphasizing those types often used in practice, are explained below.

Path analysis: A path p in a graph is a sequence of vertices p = (v1, v2, . . . , vm), vi ∈ V, such that each consecutive pair vi, vj of vertices in p is matched by an edge of the form (vj, vi) (if the network is undirected) or (vi, vj) (if the network is directed or undirected). If one were to draw a graph graphically, a path is any sequence of movements along the edges of the network that brings you from one vertex to another. Any path is valid, even ones that have loops or cross the same vertex many times. Paths that do not intersect with themselves (i.e., vi does not equal vj for any vi, vj ∈ p) are self-avoiding. The length of a path is defined by the total number of edges along it. A geodesic path between vertices i and j is a minimum-length path of size k where p1 = i and pk = j. A breadth-first search starting from node d, which iterates over all paths of length 1, and then 2 and 3, and so on up to the largest path that originates at d, is one way to compute geodesic paths.

Network interactions: Whereas path analysis considers the global structure of a graph, the interactions among nodes are a concept related to subgraphs or microstructures. Microstructural measures consider a single node, members of its nth degree neighborhood (the set of nodes no more than n hops from it), and the collection of interactions that run between them. If macro-measures study an entire system as a whole (the "forest"), micro-measures such as interactions try to get at the heart of the individual conditions that cause nodes to bind together locally (the "trees"). Three popular features for microstructural analysis are reciprocity, transitivity, and balance.

Reciprocity measures the degree to which two nodes are mutually connected to each other in a directed graph. In other words, if one observes that a node A connects to B, what is the chance that B will also connect to A? The term reciprocity comes from the field of social network analysis, which describes a particular set of link/graph mining techniques designed to operate over graphs where nodes represent people and edges represent the social relationships among them. For example, if A does a favor for B, will B also do a favor for A? If A sends a friend request to B on an online social system, will B reply? On the World Wide Web, if website A has a hyperlink to B, will B link to A?

Transitivity refers to the degree to which two nodes in a network have a mutual connection in common. In other words, if there is an edge between nodes A and B and between B and C, graphs that are highly transitive indicate a tendency for an edge to also exist between A and C. In the context of social network analysis, transitivity carries an intuitive interpretation based on the old adage "a friend of my friend is also my friend." Transitivity is an important measure in other contexts as well. For example, in a graph where edges correspond to paths of energy, as in a power grid, highly transitive graphs correspond to more efficient systems compared to less transitive ones: rather than having energy take the path A to B to C, a transitive relation would allow a transmission from A to C directly. The transitivity of a graph is measured by counting the total number of closed triangles in the graph (i.e., counting all subgraphs that are complete graphs of three nodes) multiplied by three and divided by the total number of connected triples in the graph (e.g., all sets of three vertices A, B, and C where at least the edges (A,B) and (B,C) exist).
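Both microstructural measures just described reduce to simple counting. The sketch below, a plain Python illustration over a small, made-up directed edge list, computes reciprocity as the fraction of directed edges whose reverse edge also exists, and transitivity as three times the number of closed triangles divided by the number of connected triples, following the definitions above.

```python
from itertools import combinations

# Hypothetical directed edge list used purely for illustration.
edges = {(1, 2), (2, 1), (2, 3), (3, 1), (1, 3), (3, 4)}

# Reciprocity: fraction of directed edges (u, v) for which (v, u) also exists.
reciprocity = sum((v, u) in edges for u, v in edges) / len(edges)

# For transitivity, ignore edge direction.
und = {frozenset(e) for e in edges}
nodes = {u for e in und for u in e}
neighbors = {u: {v for v in nodes if frozenset((u, v)) in und} for u in nodes}

# Closed triangles: all three pairwise edges among a node triple exist.
triangles = sum(
    1 for a, b, c in combinations(sorted(nodes), 3)
    if b in neighbors[a] and c in neighbors[a] and c in neighbors[b]
)
# Connected triples: for each center node, every pair of its neighbors forms one.
triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in neighbors.values())

transitivity = 3 * triangles / triples if triples else 0.0
print(reciprocity, transitivity)
```

On this toy graph the script reports a reciprocity of about 0.67 (four of the six directed edges are reciprocated) and a transitivity of 0.6 (one closed triangle against five connected triples).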

Balance is defined for networks where edges carry a binary variable that, without loss of generality, is either "positive" (i.e., a "+," "1," "Yes," "True," etc.) or "negative" (i.e., a "−," "0," "No," "False," etc.). Vertices incident to positive edges are harmonious or non-conflicting entities in a system, whereas vertices incident to negative edges may be competitive or introduce a tension in the system. Subgraphs over three nodes that are complete are balanced or imbalanced depending on the assignment of + and − labels to the edges of the triangle, as follows:

• Three positive: Balanced. All edges are "positive" and in harmony with each other.
• One positive, two negative: Balanced. In this triangle, two nodes exhibit a harmony, and both are in conflict with the same other. The state of this triangle is "balanced" in the sense that every node is either in harmony or in conflict with all others in kind.
• Two positive, one negative: Imbalanced. In this triangle, node A is harmonious with B, and B is harmonious with C, yet A and C are in conflict. This is an imbalanced disagreement since, if A does not conflict with B, and B does not conflict with C, one would expect A to also not conflict with C. For example, in a social context where positive means friend and negative means enemy, B can fall into a conflicting situation when friends A and C disagree.
• Three negative: Imbalanced. In this triangle, all vertices are in conflict with one another. This is a dangerous scenario in systems of almost any context. For example, in a dataset of nations, mutual disagreement among three states has consequences for the world community. In a dataset of computer network components, three routers that are interconnected but in "conflict" (e.g., a down connection or a disagreement among routing tables) may lead to a system outage.

Datasets drawn from social processes always tend toward balanced states because people do not like tension or conflict. It is thus interesting to use link/graph mining to study social systems where balance may actually not hold. If a graph where most triangles are not balanced comes from a social system, one may surmise that there exist latent factors pushing the system toward imbalanced states. A labeled complete graph is balanced if every one of its triangles is balanced.

Quantifying node importance: The importance of a node is related to its ability to reach out or connect to other nodes. A node may also be important if it carries a strong degree of "flow," that is, if the values of relationships connected to it are very high (so that it acts as a strong conduit for the passage of information). Nodes may be important if they are vital to maintaining network connectivity, so that if an important node was removed, the graph may suddenly fragment or become disconnected. Importance may be measured recursively: a node is important if it is connected to other nodes that themselves are important. For example, people who work in the United States White House or serve as Senior Aides to the President are powerful people, not necessarily because of their job title but because they have a direct and strong relationship with the Commander in Chief. Importance is measured by calculating the centrality of a node in a graph. Different centrality measures encode different interpretations of node importance and should thus be selected according to the analysis at hand. Degree centrality defines importance as being proportional to the number of connections a node has. Closeness centrality defines importance as having a small average distance to all other nodes in the graph. Betweenness centrality defines importance as being part of as many shortest paths in the graph between other pairs of nodes as possible. Eigenvector centrality defines importance as being connected not only to many other nodes but to many other nodes that are themselves important.

Graph partitioning: In the same way that clusters of datums in a dataset correspond to groups of points that are similar, interesting, or signify some other demarcation, vertices in graphs may also be divided into groups that correspond to a common affiliation, property, or connectivity structure. Graph partitioning takes as an input the number and size of the groups and then searches for the "best" partitioning under these constraints. Community detection algorithms are similar to graph partitioning methods except that they do not require the number and size of groups to be specified a priori. But this is not necessarily a disadvantage to graph partitioning methods; if a graph miner understands well the domain from where the graph came, or if for her application she requires a partitioning into exactly k groups, graph partitioning methods should be used.
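The centrality measures and community detection approach described above are available off the shelf in standard graph libraries. The sketch below is an illustration of the concepts, not part of the original entry; it assumes the third-party Python library networkx is installed and uses its bundled karate club graph as example data.

```python
import networkx as nx
from networkx.algorithms import community

# A small, well-known social network bundled with networkx.
G = nx.karate_club_graph()

# Four interpretations of node importance discussed in the text.
degree      = nx.degree_centrality(G)
closeness   = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)

# Report the most central node under each measure.
for name, scores in [("degree", degree), ("closeness", closeness),
                     ("betweenness", betweenness), ("eigenvector", eigenvector)]:
    best = max(scores, key=scores.get)
    print(f"{name:12s} most central node: {best} ({scores[best]:.3f})")

# Community detection: groups emerge from the connectivity structure alone,
# without fixing the number or size of groups in advance.
groups = community.greedy_modularity_communities(G)
print("detected communities:", [sorted(g) for g in groups])
```

Different measures can, and often do, rank different nodes as most important, which is exactly why the text recommends selecting a centrality according to the analysis at hand.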

Conclusion

As systems that our society relies on become ever more complex, and as technological advances continue to help us capture the structure of this complexity at high definition, link/graph mining methods will continue to rise in prevalence. As the primary means to understand and extract knowledge from complex systems, link/graph mining methods need to be included in the toolkit of any big data analyst.

Cross-References

▶ Computer Science
▶ Computational Science and Engineering
▶ Computational Social Sciences
▶ Graph-Theoretic Computations
▶ Mathematics
▶ Statistics

Further Readings

Cook, D. J., & Holder, L. B. (2006). Mining graph data. Wiley.
Getoor, L., & Diehl, C. P. (2005). Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2), 3–12.
Lewis, T. G. (2011). Network science: Theory and applications. Wiley.
Newman, M. (2010). Networks: An introduction. New York: Oxford University Press.
Philip, S. Y., Han, J., & Faloutsos, C. (2010). Link mining: Models, algorithms, and applications. Berlin: Springer.
LinkedIn

Jennifer Summary-Smith
Culver-Stockton College, Canton, MO, USA

According to its website, LinkedIn is the largest professional network in the world, servicing over 300 million members in over 200 territories and countries. Its mission statement is to "connect the world's professionals to make them more productive and successful. When you join LinkedIn, you get access to people, jobs, news, updates, and insights that help you be great at what you do." Through its online service, LinkedIn earns around $473.2 million from premium subscriptions, marketing solutions, and talent solutions. It offers free and premium memberships allowing people to network, obtain knowledge, and locate potential job opportunities. The greatest asset to LinkedIn is its data, making a significant impact in the job industry.

Company Information

Cofounder Reid Hoffman conceptualized the company in his living room in 2002, launching LinkedIn on May 5, 2003. Hoffman, a Stanford graduate, became one of PayPal's earliest executives. After PayPal was sold to eBay, he cofounded LinkedIn. The company had one million members by 2004. Today, the company is run by chief executive Jeff Weiner, a former executive of Yahoo! Inc. LinkedIn's headquarters are located in Mountain View, California, with US offices in Chicago, Los Angeles, New York, Omaha, and San Francisco. LinkedIn also has international offices in 21 locations, and its online content is available in 23 languages. LinkedIn currently employs 5,400 full-time employees with offices in 27 cities globally. LinkedIn states that professionals are signing up to join the service at the rate of two new members per second, with 67% of its membership located outside of the United States. The fastest growing demographic using LinkedIn is students and recent college graduates, accounting for around 39 million users. LinkedIn's corporate talent solutions product lines and its memberships include all executives from the 2013 Fortune 500 companies and 89 Fortune 100 companies. In 2012, its members conducted over 5.7 billion professionally oriented searches, with three million companies utilizing LinkedIn company pages.

As noted on cofounder Reid Hoffman's LinkedIn account, a person's network is how one stays competitive as a professional, keeping up-to-date on one's industry. LinkedIn provides a space where professionals learn about key trends, information, and transformations of their industry. It provides opportunities for people to find jobs, clients, and other business connections.
Relevance of Data

MIT Sloan Management Review contributing editor Renee Boucher Ferguson interviewed LinkedIn's director of relevance science, Deepak Agarwal, who states that relevance science at LinkedIn plays the role of improving the relevancy of its products by extracting information from LinkedIn data. In other words, LinkedIn provides recommendations using its data to predict user responses to different items.

To achieve this difficult task, LinkedIn has relevance scientists who provide an interdisciplinary approach with backgrounds in computer science, economics, information retrieval, machine learning, optimization, software engineering, and statistics. Relevance scientists work to improve the relevancy of LinkedIn's products. According to Deepak Agarwal, LinkedIn relevance scientists significantly enhance products such as advertising, job recommendations, news, the LinkedIn feed, people recommendations, and much more. He further points out that most of the company's products are based upon its use of data.

Impact on the Recruiting Industry

As it states on LinkedIn's website, the company's free membership allows its members the opportunity to upload resumes and/or curriculum vitae, join groups, follow companies, establish connections, view and/or search for jobs, endorse connections, and update profiles. It also suggests to its members several people that they may know, based on their connections. LinkedIn's premium service provides members with additional benefits, allowing access to hiring managers and recruiters. Members can send personalized messages to any person on LinkedIn. Additionally, members can also find out who has viewed their profile, detailing how others found them for up to 90 days. There are four premium search filters, permitting premium members to find decision makers at target companies. The membership also provides individuals the opportunity to get noticed by potential employers. When one applies as a featured applicant, it raises his or her rank to the top of the application list. OpenLink is a network that also lets any member on LinkedIn view another member's full profile to make a connection.

The premium LinkedIn membership assists with drawing attention to members' profiles, adding an optional premium or job seeker badge. When viewing the job listings, members have the option to sort by salary range, comparing salary estimates for all jobs in the United States, Australia, Canada, and the United Kingdom. LinkedIn's premium membership also allows users to see more profile data in one's extended network, including first-, second-, and third-degree connections. A member's first-level connections are people who have either received an invitation from the member or sent the member an invitation to connect. Second-level connections are people who are connected to first-level connections but are not connected to the actual member. Third-level connections are only connected to the second-level members. Moreover, members can receive advice and support from a private group of LinkedIn experts, assisting with job searches.

In a recent article, George Anders notes the impact that LinkedIn has made on the recruiting industry. He spoke with the chief executive of LinkedIn, Jeff Weiner, who brushes off comparisons between LinkedIn and Facebook. While both companies connect a vast number of people via the Internet, each social media platform occupies a different niche within the social networking marketplace. Facebook generates 85% of its revenue from advertisements, whereas LinkedIn focuses its efforts on monetizing members' information. Furthermore, LinkedIn's mobile media experience is growing significantly, changing the face of job searching, career networking, and online profiles. George Anders also interviewed the National Public Radio head of talent acquisition, Lars Schmidt, who notes that recruiters no longer remain chiefly in their offices but are becoming more externally focused. The days of exchanging business cards are quickly being replaced by smartphone applications such as CardMunch.
CardMunch is an iPhone app that captures business card photos, transferring them into digital contacts. In 2011, LinkedIn bought the company, retooling it to pull up existing LinkedIn profiles from each card and improving the ability of members to make connections. A significant part of LinkedIn's success comes from its dedication to selling services to people who purchase talent.

The chief executive of LinkedIn, Jeff Weiner, has created an intense sales-focused culture. The company celebrates new account wins during its biweekly meetings. According to George Anders, LinkedIn has doubled the number of sales employees in the past year. In addition, the company has made a $27 billion impact on the recruiting industry. Jeff Weiner also states that every time LinkedIn expands its sales team for hiring solutions, the payoff increases "off the charts." He also talks about how sales keep rising and its customers are spreading enthusiasm for LinkedIn's products. Jeff Weiner further states that once sales are made, LinkedIn customers are loyal, recurring, and low maintenance. This trend is reflected in current stock market prices in the job-hunting sector. George Anders writes that older search firm companies, such as Heidrick & Struggles, which recruits candidates the old-fashioned way, have slumped 67%. Monster Worldwide has experienced a more dramatic drop, tumbling 81%.

As noted on its website, "LinkedIn operates the world's largest professional network on the Internet." This company has made billions of dollars, hosting a massive amount of data with a membership of 300 million people worldwide. The social network for professionals is growing at a fast pace under the tenure of Chief Executive Jeff Weiner. In a July 2014 article, David Gelles reports that LinkedIn has made its second acquisition in the last several weeks, buying Bizo for $175 million. A week prior, it purchased Newsle, which is a service that combs the web for articles that are relevant to members. It quickly notifies a person whenever friends, family members, coworkers, and so forth are mentioned online in the news, blogs, and/or articles.

LinkedIn continues to make great strides by leveraging its large data archives to carve out a niche in the social media sector, specifically targeting the needs of online professionals. It is evident that, through the use of big data, LinkedIn is changing and significantly influencing the job-hunting process. This company provides a service that allows its members to connect and network with professionals. LinkedIn is the world's largest professional network, proving to be an innovator in the employment service industry.

Cross-References

▶ Facebook
▶ Information Society
▶ Online Identity
▶ Social Media

Further Readings

Anders, G. How LinkedIn has turned your resume into a cash machine. http://www.forbes.com/sites/georgeanders/2012/06/27/how-linkedin-strategy/. Accessed July 2014.
Boucher Ferguson, R. The relevance of data: Behind the scenes at LinkedIn. http://sloanreview.mit.edu/article/the-relevance-of-data-going-behind-the-scenes-at-linkedin/. Accessed July 2014.
Gelles, D. LinkedIn makes another deal, buying Bizo. http://dealbook.nytimes.com/2014/07/22/linkedin-does-another-deal-buying-bizo/?_php=true&_type=blogs&_php=true&_type=blogs&_php=true&_type=blogs&_r=2. Accessed July 2014.
LinkedIn. https://www.linkedin.com. Accessed July 2014.
M

Media

Colin Porlezza
IPMZ - Institute of Mass Communication and Media Research, University of Zurich, Zürich, Switzerland

Synonyms

Computer-assisted reporting; Data journalism; Media ethics

Definition/Introduction

Big data can be understood as "the capacity to search, aggregate and cross-reference large data sets" (Boyd and Crawford 2012, p. 663). The proliferation of large amounts of data concerns the media in at least three different ways. First, large-scale data collections are becoming an important resource for journalism. As a result, practices such as data journalism are gaining attention among newsrooms and becoming more relevant as the amount of data collected and published on the Internet expands and as legal frameworks for access to public data, such as Freedom of Information Acts, come into effect. Recent success stories of data journalism, such as uncovering the MPs' expenses scandal in the UK or the giant data leak in the case of the Panama Papers, have contributed to further improving newsrooms' capacities to deal with large amounts of data. Second, big data are not only important in reference to the practice of reporting. They also play a decisive role with regard to what kind of content finally gets published. Many newsrooms no longer use the judgment of human editors alone to decide what content ends up on their websites; instead they use real-time data analytics generated by the clicks of their users to identify trends, to see how content is performing, and to boost virality and user engagement. Data is also used to improve product development in entertainment formats. Social media like Facebook have perfected this technique by using the personal preferences, tastes, and moods of their users to offer personalized content and targeted advertising. This datafication means that social media transform intangible elements such as relationships into a valuable resource or an economic asset on which to build entire business models. Third, datafication and the use of large amounts of data also give rise to risks with regard to ethics, privacy, transparency, and surveillance. Big data can have huge benefits because it allows organizations to personalize and target products and services. But at the same time, it requires clear and transparent information
handling, governance, and data protection. Handling big data increases privacy risks, because (social) media and internet-based services require a lot of personal information in order to be used. Moreover, analyzing big data entails a higher risk of errors, for instance, in statistical calculations or visualizations of big data.

Big Data in the Media Context

Within media, big data mainly refers to huge amounts of structured (e.g., sales, clicks) or unstructured (e.g., videos, posts, or tweets) data generated, collected, and aggregated by private business activities, governments, public administrations, or online-based organizations such as social media. In addition, the term big data usually includes references to the analysis of huge bulks of data. These large-scale data collections are difficult to analyze using traditional software or database techniques and require new methods to identify patterns in such massive and often incomprehensible amounts of data. The media ecosystem has therefore developed specialized practices and tools not only to generate big data but also to analyze it. One of these practices is called data or data-driven journalism.

Data Journalism
We live in an age of information abundance. One of the biggest challenges for the media industry, and journalism in particular, is to bring order to this data deluge. It is therefore not surprising that the relationship between big data and journalism is becoming stronger, especially because large amounts of data need new and better tools that are able to provide specific context, to explain the data in a clear way, and to verify the information it contains. Data journalism is thus not entirely different from more classic forms of journalism. What makes it special, however, are the new opportunities created by combining traditional journalistic skills like research with innovative forms of investigation based on key information sets, key data, and new processing, analytics, and visualization software that allows journalists to peer through the massive amounts of data available in a digital environment and to show it in a clear and simple way to the public. The importance of data journalism lies in its ability to gather, interrogate, visualize, and mash up data from different sources or services, and it requires an amalgamation of a journalist's "nose for news" and tech-savvy competences.

However, data journalism is not as new as it seems to be. Ever since organizations and public administrations have collected information or built up archives, journalism has been dealing with large amounts of data. As long as journalism has been practiced, journalists have been keen to collect data and to report them accurately. When data display techniques improved in the late eighteenth century, newspapers started to use this know-how to present information in a more sophisticated way. The first example of data journalism can be traced back to 1821 and involved The Guardian, at the time based in Manchester, UK. The newspaper published a leaked table listing the number of students and the costs for each school in the British city. For the first time, it was publicly shown that the number of students receiving free education was higher than what was expected in the population. Another example of early data journalism dates back to 1858, when Florence Nightingale, the social reformer and founder of modern nursing, published a report to the British Parliament about the deaths of soldiers. In her report she revealed, with the help of visual graphics, that the main cause of mortality was preventable disease during care rather than battle.

By the middle of the twentieth century, newsrooms started to systematically use computers to collect and analyze data in order to find and enrich news stories. In the 1950s this procedure was called computer-assisted reporting (CAR) and is perhaps the evolutionary ancestor of what we call data journalism today. Computer-assisted reporting was, for instance, used by the television network CBS in 1952 to predict the outcome of the US presidential election. CBS used a then famous Universal Automatic Computer
(UNIVAC) and programmed it with statistical models based on voting behavior from earlier elections. With just 5% of votes in, the computer correctly predicted the landslide win of former World War II general Dwight D. Eisenhower with a margin of error of less than 1%. After this remarkable success of computer-assisted reporting at CBS, other networks started to use computers in their newsrooms as well, particularly for voting prediction. Not one election has since passed without a computer-assisted prediction. However, computers were introduced into newsrooms only slowly, and only in the late 1960s did they start to be regularly used in news production as well.

In 1967, a journalism professor from the University of North Carolina, Philip Meyer, used for the first time a quicker and better equipped IBM 360 mainframe computer to do statistical analyses on survey data collected during the Detroit riots. Meyer was able to show that not only less educated Southerners were participating in the riots but also people who had attended college. This story, published in the Detroit Free Press, won him a Pulitzer Prize together with other journalists and marked a paradigm shift in computer-assisted reporting. On the grounds of this success, Meyer not only supported the use of computers in journalistic practices but developed a whole new approach to investigative reporting by introducing and using social science research methods in journalism for data gathering, sampling, analysis, and presentation. In 1973 he published his thoughts in the seminal book entitled "Precision Journalism." The fact that computer-assisted reporting entered newsrooms especially in the USA was also revealed through the increased use of computers in news organizations. In 1986, Time magazine wrote that computers were revolutionizing investigative journalism. By analyzing larger databases, journalists were able to offer a broader perspective and much more information about the context of specific events.

The practice of computer-assisted reporting spread further until, at the beginning of the 1990s, it became a standard routine, particularly in bigger newsrooms. The use of computers, together with the application of social science methods, has helped – according to Philip Meyer – to make journalism scientific. Besides, Meyer's approach also tried to tackle some of the common shortcomings of journalism, like the increasing dependence on press releases, shrinking accuracy and trust, or the critique of political bias. An important factor of precision journalism was therefore the introduction and use of statistical software. These programs enabled journalists for the first time to analyze bigger databases such as surveys or public records. This new approach might also be seen as a reaction to alternative journalistic trends that came up in the 1990s, for instance, the concept of new journalism. While precision journalism stood for scientific rigor in data analysis and reporting, new journalism used techniques from fiction to enhance the reading experience.

There are some similarities between data journalism and computer-assisted reporting: both rely on specific software programs that enable journalists to transform raw data into news stories. However, there are also differences between computer-assisted reporting and data journalism, which are due to the context in which the two practices were developed. Computer-assisted reporting tried to introduce both informatics and scientific methods into journalism, given that at the time data was scarce, and many journalists had to generate their own data. The rise of the Internet and new media contributed to the massive expansion of archives and databases, and to the creation of big data. There is no longer a poverty of information; data is now available in abundance. Therefore, data journalism is less about the creation of new databases and more about data gathering, analysis, and visualization, which means that journalists have to look for specific patterns within the data rather than merely seeking information – although recent discussions call for journalists to create their own databases due to an overreliance on public databases. Either way, the success of data journalism has also led to new practices, routines, and mixed teams of journalists working together with programmers, developers, and designers within the same newsrooms, allowing them to tell stories in a different and visually engaging way.
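To make the kind of pattern seeking described above concrete, the short sketch below shows how a reporter might aggregate a hypothetical open spending file, schools.csv, with columns district, students, and annual_cost, using only Python's standard library. The file name, column names, and figures are illustrative assumptions, not data referenced in this entry.

import csv
from collections import defaultdict

def cost_per_student(path="schools.csv"):
    """Rank districts in a hypothetical spending file by cost per student."""
    totals = defaultdict(lambda: {"students": 0, "cost": 0.0})
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            d = totals[row["district"]]
            d["students"] += int(row["students"])
            d["cost"] += float(row["annual_cost"])
    # The "pattern" a journalist looks for: which districts stand out?
    return sorted(
        ((name, v["cost"] / v["students"]) for name, v in totals.items() if v["students"]),
        key=lambda item: item[1],
        reverse=True,
    )

if __name__ == "__main__":
    for district, cost in cost_per_student()[:10]:
        print(f"{district}: {cost:,.0f} per student")

In practice, a finding surfaced this way would still need the traditional verification and context that this entry describes, but the sketch illustrates how little tooling is required to begin interrogating a public dataset.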
Media Organizations and Big Data

Big data is not only a valuable resource for data journalism. Media organizations are data gatherers as well. Many media products, whether news or entertainment, are financed through advertising. In order to satisfy the advertisers' interest in a site's audience, penetration, and visits, media organizations track user behavior on their webpages. Very often, media organizations share this data with external research bodies, which then try to use the data on their behalf. Gathering information about their customers is therefore not only an issue when it comes to the use of social media. Traditional media organizations are also collecting data about their clients.

However, media organizations track user behavior on news websites not only to provide data to their advertisers. Through user data, they also adapt the website's content to the audience's demand, with dysfunctional consequences for journalism and its democratic function within society. Due to web analytics and the generation of large-scale data collections, the audience exerts an increasing influence over the news selection process. This means that journalists – particularly in the online realm – are at risk of increasingly adapting their news selection to audience feedback generated via web analytics. Due to the grim financial situation and their shrinking advertising revenue, some print media organizations, especially in western societies, try to compensate for these deficits through a dominant market-driven discourse, manufacturing cheaper content that appeals to broader masses – publishing more soft news, sensationalism, and articles of human interest without any connection to public policy issues. This is also due to the different competitive environment: while there are fewer competitors in traditional newspaper or broadcast markets, in the online world the next competitor is just one click away. Legacy media organizations, particularly newspapers and their online webpages, offer more soft news to increase traffic, to attract the attention of more readers, and thus to retain their advertisers. A growing body of literature about the consequences of this behavior shows that journalists, in general, are becoming much more aware of the audiences' preferences. At the same time, however, there is also a growing concern among journalists with regard to their professional ethics and the consequences for the function of journalism in society if they base their editorial decision-making processes on real-time data. The results of web analytics not only influence the placement of news on websites; they also have an impact on journalists' beliefs about what the audience wants. Particularly in online journalism, news selection is increasingly grounded in data generated by web analytics and no longer in intrinsic notions such as news values or personal beliefs. Consequently, online journalism becomes highly responsive to the audiences' preferences – serving less what would be in the public interest. As many news outlets are integrated organizations, which means that they apply a crossmedia strategy by joining previously separated newsrooms such as the online and the print staff, it is possible that factors like data-based audience feedback will also affect print newsrooms. As Tandoc Jr. and Thomas state, if journalism continues to view itself as a sort of "conduit through which transient audience preferences are satisfied, then it is no journalism worth bearing the name" (Tandoc and Thomas 2015, p. 253).

While news organizations still struggle with self-gathered data due to the conflicts that can arise in journalism, media organizations active in the entertainment industry rely much more strongly on data about their audiences. Through large amounts of data, entertainment media can collect significant information about the audience's preferences for a TV series or a movie – even before it is broadcast. Particularly for big production companies or film studios, it is essential to observe structured data like ratings, market share, and box office statistics. Unstructured data like comments or videos on social media are equally important for understanding consumer habits, given that they provide information about the potential success or failure of a (new) product.

An example of such use of big data is the launch of the TV show "House of Cards" by the Internet-based on-demand streaming provider
Netflix. Before launching its first original content with the political drama, Netflix was already collecting huge amounts of data about the streaming habits of its customers. From more than 25 million users, it tracked around 30 million views a day (recording also when people pause, rewind, or fast-forward videos), about four million ratings, and three million searches (Carr 2013). On top of that, Netflix also tries to gather unstructured data from social media, and it looks at how customers tag the selected videos with metadata descriptors and whether they recommend the content. Based on these data, Netflix predicted possible preferences and decided to buy "House of Cards." It was a major success for the online-based company.

There are also potential risks associated with the collection of such huge amounts of data: Netflix recommends specific movies or TV shows to its customers based on what they liked or what they have watched before. These recommendation algorithms might well guide the user toward more of Netflix's original content, without taking into account the consumers' actual preferences. In addition, consumers might not be able to discover new TV shows that transcend their usual taste. Given that services like Netflix know so much about their users' habits, another concern arises with regard to privacy.

Big Data Between Social Media, Ethics, and Surveillance

Social media are a main source of big data. Since the first major social media sites were launched in the 2000s, they have collected and stored massive amounts of data. These sites started to gather information about the behavior, preferences, and interests of their users in order to know how their users would both think and act. In general, this process of datafication is used to target and tailor services better to the users' interests. At the same time, social media use these large-scale data collections to help advertisers target users. Big data in social media therefore also have a strong commercial connotation. Facebook's business model, for instance, is entirely based on hyper-targeted display ads. While display ads are a relatively old-fashioned way of addressing customers, Facebook makes up for it with its incredible precision about customers' interests and its ability to target advertising more effectively.

Big data are an integrative part of social media's business model: they possess far more information on their customers given that they have access not only to their surfing behavior but above all to their tastes, interests, and networks. This bears the potential not only to predict users' behavior but also to influence it, particularly as social media such as Facebook and Twitter also adapt their noncommercial content to individual users: the news streams we see on our personal pages are balanced by various variables (differing between social media) such as interactions, posting habits, popularity, the number of friends, and user engagement, which are, however, constantly recombined. Through such opaque algorithms, social media might well use their own data to model voters: in 2010, for example, 61 million users in the USA were shown a banner message on their pages about how many of their friends had already voted in the US Congressional Elections. The study showed that the banner convinced more than 340,000 additional people to cast their vote (Bond et al. 2012). Individually tailored and modeled messaging not only bears the potential to harm civic discourse; it also enhances the negative effects deriving from "asymmetry and secrecy built into this mode of computational politics" (Tufekci 2014).

The amount of data stored on social media will continue to rise, and already today, social media are among the largest data repositories in the world. Since the data collecting mania of social media will not decrease, which is also due to the explorative focus of big data, it raises issues with regard to the specific purpose of the data collection. Particularly if data usage, storage, and transfer remain opaque and are not made transparent, the data collection might be disproportionate. Yet, certain social media allow third parties to access their data, particularly as the trade in data increases because of its economic potential. This policy raises ethical issues with regard to transparency about data protection and privacy.
Particularly in the wake of the Snowden revelations, it has been shown that opaque algorithms and big data practices are increasingly important to surveillance: "[...] Big Data practices are skewing surveillance even more towards a reliance on technological "solutions," and that these both privileges organizations, large and small, whether public or private, reinforce the shift in emphasis toward control rather than discipline and rely increasingly on predictive analytics to anticipate and preempt" (Lyon 2014, p. 10). Overall, the Snowden disclosures have demonstrated that surveillance is no longer limited to traditional instruments in the Orwellian sense but has become ubiquitous and overly reliant on practices of big data – as governmental agencies such as the NSA and GCHQ are allowed to access not only the data of social media and search giants but also to track and monitor the telecommunications of almost every individual in the world. However, the big issue even with the collect-all approach is that data is subject to limitations and bias, particularly if they rely on automated data analysis: "Without those biases and limitations being understood and outlined, misinterpretation is the result" (Boyd and Crawford 2012, p. 668). This might well lead to false accusations or to the failure of predictive surveillance, as could be seen in the Boston Marathon bombing case: first, a picture of the wrong suspect was massively shared on social media, and second, the predictive radar grounded on data gathering was ineffective.

In addition, the use of big data generated by social media also entails ethical issues in reference to scientific research. Normally, when human beings are involved in research, strict ethical rules, such as the informed consent of the people participating in the study, have to be observed. Moreover, in social media there are both "public" and "private" data which can be accessed. An example of such a controversial use of big data is a study carried out by Kramer et al. (2014). The authors deliberately changed the newsfeed of Facebook users: some got more happy news, others more sad ones, because the goal of the study was to investigate whether emotional shifts in those surrounding us – in this case virtually – can change our own moods as well. The issue with the study was that the users in the sample were not aware that their newsfeed was altered. This study shows that the use of big data generated in social media can entail ethical issues, not least because the constructed reality within Facebook can be distorted. Ethical questions with regard to media and big data are thus highly relevant in our society, given that both the privacy of citizens and the protection of their data are at stake.

Conclusion

Big data plays a crucial role in the context of the media. The instruments of computer-assisted reporting and data journalism allow news organizations to engage in new forms of investigation and storytelling. Big data also allow media organizations to better adapt their services to the preferences of their users. While in the news business this may lead to an increase in soft news, the entertainment industry benefits from such data in order to predict the audience's taste with regard to potential TV shows or movies. One of the biggest issues with regard to media and big data is its ethical implications, particularly with regard to data collection, storage, transfer, and surveillance. As long as the urge to collect large amounts of data and the use of opaque algorithms continue to prevail in many already powerful (social) media organizations, the risks of data manipulation and modeling will increase, particularly as big data are becoming even more important in many different aspects of our lives. Furthermore, as the Snowden revelations showed, collect-it-all surveillance already relies heavily on big data practices. It is therefore necessary to increase both the research into and the awareness of the ethical implications of big data in the media context. Only through a critical discourse about the use of big data in our society will we be able to determine "our agency with respect to big data that is generated by us and about us, but is increasingly being used at us" (Tufekci 2014). Being more transparent, accountable, and less opaque about the use and, in particular, the purpose of data collection might be a good starting point.
Cross-References

▶ Advertising Targeting
▶ Big Data Storytelling
▶ Crowdsourcing
▶ Transparency

References

Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489, 295–298.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
Carr, D. (2013, February 24). Giving readers what they want. New York Times. http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html. Accessed 11 July 2016.
Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America, 111(24), 8788–8790.
Lyon, D. (2014, July–December). Surveillance, Snowden, and Big Data: Capacities, consequences, critique. Big Data & Society, 1–13.
Tandoc Jr., E. C., & Thomas, R. J. (2015). The ethics of web analytics: Implications of using audience metrics in news construction. Digital Journalism, 3(2), 243–258.
Tufekci, Z. (2014). Engineering the public: Big data, surveillance and computational politics. First Monday, 19(7). http://journals.uic.edu/ojs/index.php/fm/article/view/4901/4097. Accessed 12 July 2016.
Metadata

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Metadata are data about data or, in a more general sense, data about resources. They provide a snapshot of a resource, such as information about the creator, date, subject, location, time and methods used, etc. There are high-level metadata standards that can provide a general description of a resource. In recent years, community efforts have been undertaken to develop domain-specific metadata schemas and to encode the schemas in machine readable formats for the World Wide Web. Those schemas can be reused and extended to fit the requirements of specific applications. Compared with the long-term archiving of data and metadata in traditional data management and analysis, the velocity of Big Data leads to short-term and quick applications addressing scientific and business issues. Accordingly, there is a metadata life cycle in Big Data applications. Community metadata standards and machine readable formats will be a big advantage in facilitating the metadata life cycle on the Web.

Know Before Use

Few people are able to use a piece of data before knowing its subject, origin, structure, and meaning. A primary functionality of metadata is to help people obtain an overview of some data, and this functionality can be understood through a few real-world examples. If data are comparable with goods in a grocery, then metadata are like the information on the package of an item. A consumer may care more about the ingredients due to allergies to some substances, the nutrition facts due to dietary needs, and/or the manufacturer and date of expiration due to personal preferences. Most people want to know the information about a grocery item before purchasing and consuming it. The information on the package provides a concise and essential introduction to the item inside. Such nutrition and ingredient information on grocery items is mandatory for manufacturers in many countries. Similarly, an ideal situation for data users is that they can receive clear metadata from data providers. However, compared to the food industry, the rules and guidelines for metadata are still less developed.

Another comparable subject is the 5W1H method for storytelling or context description, especially in journalism. The 5W1H represents the question words who, what, when, where, why, and how, which can be used to organize a number of questions about a certain object or event, such as: Who is responsible for a research project? What are the planned output data? Where
will the data be archived? When will the data be open access? Why is a specific instrument needed for data collection? How will the data be maintained and updated? In journalism, the 5W1H is often used to evaluate whether the information covered in a news article is complete or not. Normally, the first paragraph of a news article gives a brief overview of the article and provides concise information to answer the 5W1H questions. By reading the first paragraph, a reader can grasp the key information of an article even before reading through the full text. Metadata is data about data; its functionality is similar to what the first paragraph does for a news article, and the metadata items used for describing a dataset are comparable to the 5W1H question words.

Metadata Hierarchy

Metadata are used for describing resources. The description can be general or detailed according to the actual needs. Accordingly, there is a hierarchy of metadata items corresponding to the actual needs of describing an object. For instance, the abovementioned 5W1H question words can be regarded as a list of general metadata items, and they can also be used to describe datasets. However, the six question words only offer a starting point, and there may be various derived metadata items in actual works. In the early days there was a heterogeneous situation among the metadata provided by different stakeholders. To promote standardization of metadata items, a number of international standards have been developed.

The most well-known standard is the Dublin Core Metadata Element Set (DCMI Usage Board 2012). The name "Dublin" originates from a 1995 workshop at Dublin, OH, USA. The word "Core" means that the elements are generic and broad. The 15 core elements are contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type. Those elements are more specific than the 5W1H question words and can be used for describing a wide range of resources, including datasets. The Dublin Core Metadata Element Set was published as a standard by the International Organization for Standardization (ISO) in 2003 and later revised in 2009. It has also been endorsed by a number of other national or international organizations such as the American National Standards Institute and the Internet Engineering Task Force.

The 15 core elements are part of an enriched specification of metadata terms maintained by the Dublin Core Metadata Initiative (DCMI). The specification includes properties in the core elements, properties in an enriched list of terms, vocabulary encoding schemes, syntax encoding schemes, and classes (including the DCMI Type Vocabulary). The enriched terms include all the 15 core elements and cover a number of more specific properties, such as abstract, access rights, has part, has version, medium, modified, spatial, temporal, valid, etc. In practice, the metadata terms in the DCMI specification can be further extended by combining them with other compatible vocabularies to support various application profiles. With the 15 core elements, one is able to provide rich metadata for a certain resource, and by using the enriched DCMI metadata terms and external vocabularies, one can create an even more specific metadata description for the same object. This can be done in a few ways. For example, one way is to use terms that are not included in the core elements, such as spatial and temporal. Another possible way is to use a refined metadata term that is more appropriate for describing an object. For instance, the term "description" in the core elements has a broad meaning, and it may include an abstract, a table of contents, a graphical representation, or a free-text account of a resource. In the enriched DCMI terms, there is a more specific term, "abstract," which means a summary of a resource. Compared to "description," the term "abstract" is more specific and appropriate if one wants to collect a literal summary of an academic article.
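As a small illustration of the hierarchy described above, the following Python sketch describes a single hypothetical dataset with a handful of DCMI terms and prints it as JSON. The dataset, its values, and the choice of a JSON serialization are assumptions made for this example; DCMI itself only defines the terms and leaves the encoding (RDF, XML, HTML, etc.) to the application.

import json

# Hypothetical description of one dataset using prefixed DCMI terms.
record = {
    "@context": {"dcterms": "http://purl.org/dc/terms/"},
    "dcterms:title": "Monthly river discharge, 2000-2014",
    "dcterms:creator": "Example Hydrology Group",
    "dcterms:description": "Gauge readings aggregated by month.",
    "dcterms:abstract": "Monthly means derived from daily gauge data.",
    "dcterms:spatial": "Snake River basin",
    "dcterms:temporal": "2000-01-01/2014-12-31",
    "dcterms:type": "Dataset",
    "dcterms:modified": "2015-03-02",
}

print(json.dumps(record, indent=2))

Note how the enriched terms refine the core ones: "abstract," "spatial," and "temporal" state precisely what the broader "description" and "coverage" elements would only imply.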
Domain-Specific Metadata Schemas

High-level metadata terms such as those in the Dublin Core Metadata Element Set have broad meaning and are applicable to various resources. However, those metadata elements are too general in meaning and sometimes are implicit. If one wants a more specific and detailed description of the resources, a domain-specific metadata schema is needed. Such a metadata schema is a list of organized metadata items for describing a certain type of resource. For example, there could be a metadata schema for each type defined in the DCMI Type Vocabulary, such as dataset, event, image, physical object, service, etc. There have been various national and international community efforts for building domain-specific metadata schemas. In particular, many schemas developed in recent years are oriented toward data management and exchange on the Web. A few recent works are introduced below.

The data catalog vocabulary (DCAT) (Erickson and Maali 2014) was approved as a World Wide Web Consortium (W3C) recommendation in January 2014. It was designed to facilitate interoperability among data catalogs published on the Web. DCAT defines a metadata schema and provides a number of examples on how to use it. DCAT reuses a number of DCMI metadata terms in combination with terms from other schemas such as the W3C Simple Knowledge Organization System (SKOS). It also defines a few new terms to make the resulting schema more appropriate for describing datasets in data catalogs.

The Darwin Core is a group of standards for biodiversity applications. By extending the Dublin Core metadata elements, the Darwin Core establishes a vocabulary of terms to facilitate the description and exchange of data about the geographic occurrence of organisms and the physical existence of biotic specimens. The Darwin Core itself is also extensible, which provides a mechanism for describing and sharing additional information.

The ecological metadata language (EML) is a metadata standard developed for non-geospatial datasets in the field of ecology. It is a set of schemas encoded in the extensible markup language (XML) format and thus allows structured expression of metadata. EML can be used to describe digital resources and also nondigital resources such as paper maps.

The international geo sample number (IGSN), initiated in 2004, is a sample identification code for the geoscience community. Each registered IGSN identifier is accompanied by a group of metadata providing detailed background information about that sample. Top concepts in the current IGSN metadata schema are sample number, registrant, related resource identifiers, and log. A top concept may include a few child concepts. For example, there are two child concepts for "registrant": registrant name and name identifier.

The ISO 19115 and ISO 19115-2 geographic information metadata standards are regarded as a best practice of metadata schemas for geospatial data. Geospatial data are about objects with some position on the surface of the Earth. The ISO 19115 standards provide guidelines on how to describe geographic information and services. Detailed metadata items cover topics such as contents, spatiotemporal extents, data quality, channels for access, and rights to use. Another standard, ISO 19139, provides an XML schema implementation for ISO 19115. The catalog service for the Web (CSW) is an Open Geospatial Consortium (OGC) standard for describing online geospatial data and services. It adopts ISO 19139, the Dublin Core elements, and items from other metadata efforts. Core elements in CSW include title, format, type, bounding box, coordinate reference system, and association.

Annotating a Web of Data

Recent efforts on metadata standards and schemas, such as the abovementioned Dublin Core, DCAT, Darwin Core, EML, IGSN metadata, ISO 19139, and CSW, show a trend of publishing metadata on the Web. More importantly, by using standard encoding formats, such as XML and the W3C resource description framework (RDF), they are making metadata machine discoverable and readable. This mechanism moves the burden of searching, evaluating, and integrating massive datasets from humans to computers, and for computers such a burden is not a real burden because they can find ways to access various data sources through standardized metadata on the
Web. For example, the project OneGeology aims to enable online access to geological maps across the world. By the end of 2014, OneGeology had 119 participating nations, and most of them share national or regional geological maps through OGC geospatial data service standards. Those map services are maintained by their corresponding organizations, and they also enable standardized metadata services, such as CSW. On the one hand, OneGeology provides technical support to organizations that want to set up geologic map services using common standards. On the other hand, it also provides a central data portal for end users to access various distributed metadata and data services. The OneGeology project presents a successful example of how to rescue legacy data, update them with well-organized metadata, and make them discoverable, accessible, and usable on the Web.

Compared with domain-specific structured datasets, such as those in OneGeology, many other datasets in Big Data are not structured, such as webpages and data streams on social media. In 2011, the search engines Bing, Google, Yahoo!, and Yandex launched an initiative called schema.org, which aims at creating and supporting a common set of schemas for structured data markup on web pages. The schemas are presented as lists of tags in hypertext markup language (HTML). Webmasters can use those tags to mark up their web pages, and search engine spiders and other parsers can recognize those tags and record what a web page is about. This makes it easier for search engine users to find the right web pages. Schema.org adopts a hierarchy to organize the schemas and vocabularies of terms. The concept on the top is thing, which is very generic and is divided into schemas of a number of child concepts, including creative work, event, intangible, medical entity, organization, person, place, product, and review. These schemas are further divided into smaller schemas with specific properties. A child concept inherits characteristics from a parent concept. For example, book is a child concept of creative work. The hierarchy of concepts and properties does not intend to be a comprehensive model that covers everything in the world. The current version of schema.org only represents those entities that the search engines can handle in the short term. Schema.org provides a mechanism for extending the scope of concepts, properties, and schemas. Webmasters and developers can define their own specific concepts, properties, and schemas. Once those extensions are commonly used on the Web, they can also be included as a part of the schema.org schemas.

Linking for Tracking

If the recognition of domain-specific topics is a work of identifying resource types, then the definition of metadata items is a work of annotating those types. The work in schema.org is an excellent reflection of those two works. Various structured and unstructured resources can be categorized and annotated by using metadata and are ready to be discovered and accessed. In a scientific or business procedure, various resources are retrieved and used, and outputs are generated, archived, and perhaps reused elsewhere. In recent years, people have taken a further step to make links among those resources, their types, and properties, as well as the people and activities involved in the generation of those outputs. The work of categorization, annotation, and linking as a whole can be used to describe the origin of a resource, which is called provenance. There have been community efforts developing specifications of commonly usable provenance models.

The Open Provenance Model was initiated in 2006. It includes three top classes, artifact, process, and agent, and their subclasses, as well as a group of properties, such as was generated by, was controlled by, was derived from, and used, for describing the classes and the interrelationships among them. Another earlier effort is the proof markup language, which was used to represent knowledge about how information on the Web was asserted or inferred from other information sources by intelligent agents. Information, inference step/inference rule, and inference engine are the three key building blocks in the proof markup language.
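The sketch below is not part of either specification; it simply shows, with plain Python tuples, how provenance assertions of the kind just listed (an artifact that was generated by a process, which used another artifact and was controlled by an agent) might be recorded and traversed. All resource and agent names are hypothetical.

# Hypothetical provenance assertions as (subject, property, object) triples.
assertions = [
    ("figure_3.png", "was generated by", "plot_run_42"),
    ("plot_run_42", "used", "cleaned_temps.csv"),
    ("plot_run_42", "was controlled by", "analyst:j.doe"),
    ("cleaned_temps.csv", "was derived from", "raw_station_logs.csv"),
]

def lineage(resource, triples):
    """List everything a resource depends on, including the agents
    that controlled the processes along the way."""
    upstream, frontier = set(), [resource]
    while frontier:
        current = frontier.pop()
        for subject, _prop, obj in triples:
            if subject == current and obj not in upstream:
                upstream.add(obj)
                frontier.append(obj)
    return upstream

print(sorted(lineage("figure_3.png", assertions)))

Real systems express the same idea with the formal vocabularies discussed in this section, which is what makes the records interchangeable between tools.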
Works on the Open Provenance Model and the proof markup language have set up the basis for community actions. Most recently, the W3C approved the PROV Data Model as a recommendation in 2013. The PROV Data Model is a generic model for provenance, which allows specific representations of provenance in research domains or applications to be translated into the model and be interchangeable among systems (Moreau and Missier 2013). There are intelligent knowledge systems that can import provenance information from multiple sources, process it, and reason over it to generate clues for potential new findings. The PROV Data Model includes three core classes, entity, activity, and agent, which are comparable to those of the Open Provenance Model and the proof markup language. W3C also approved the PROV Ontology as a recommendation for the expression of the PROV Data Model with semantic Web languages. It can be used to represent machine readable provenance information and can also be specialized to create new classes and properties to represent provenance information of specific applications and domains. The extension and specialization here are similar to the idea of a metadata hierarchy.

A typical application of the PROV Ontology is the Global Change Information System for the US Global Change Research Program (Ma et al. 2014), which captures and presents the provenance of global change research and links to the publications, datasets, instruments, models, algorithms, and workflows that support key research findings. The provenance information in the system increases understanding, credibility, and trust in the works of the US Global Change Research Program and aids in fostering reproducibility of results and conclusions.

A Metadata Life Cycle

Velocity is a unique feature that differentiates Big Data from traditional data. Both traditional data and Big Data can be big, but traditional data have a relatively longer life cycle compared to the social media data streams in Big Data. Big Data life cycles are characterized by short-term and quick deployments to solve specific scientific or business issues. In traditional data management, especially for a single data center or data repository, the metadata life cycle is less addressed. Now, facing the short-lived and quick Big Data life cycles, attention should also be paid to the metadata life cycle.

In general, a data life cycle covers the steps of context recognition, data discovery, data access, data management, data archiving, and data distribution. Correspondingly, a metadata life cycle covers similar steps, but they focus on the description of data rather than the data themselves. Context recognition allows people to study a specific domain or application and reuse any existing metadata standards and schemas. Then, in the metadata discovery step, it is possible to develop applications that automatically harvest machine readable metadata from multiple sources and harmonize them. Commonly used domain-specific metadata standards and machine readable formats will significantly facilitate the metadata life cycle in applications using Big Data, because most of such applications will be on the Web, where interchangeable schemas and formats will be an advantage.

Cross-References

▶ Data Model, Data Modeling
▶ Data Profiling
▶ Data Provenance
▶ Data Sharing
▶ Open Data
▶ Semantic Web

Further Readings

DCMI Usage Board. (2012). DCMI metadata terms. http://dublincore.org/documents/dcmi-terms
Erickson, J., & Maali, F. (2014). Data catalog vocabulary (DCAT). http://www.w3.org/TR/vocab-dcat
Ma, X., Fox, P., Tilmes, C., Jacobs, K., & Waple, A. (2014). Capturing provenance of global change information. Nature Climate Change, 4(6), 409–413.
Moreau, L., & Missier, P. (2013). PROV-DM: The PROV data model. http://www.w3.org/TR/prov-dm
Mobile Analytics

Ryan Eanes
Department of Business Management, Washington College, Chestertown, MD, USA

Analytics, broadly defined, refers to a series of quantitative measures that give marketers, vendors, business owners, advertisers, and other interested parties the ability to gauge consumer engagement and interaction with a property. When properly deployed and astutely analyzed, analytics can help to inform a range of business decisions related to user experience, advertising, budgets, marketing, product development, and more. Mobile analytics, then, refers to the measurement of consumer engagement with a brand, property, or product via a mobile platform, such as a smartphone or tablet computer.

Despite the fact that the mobile Internet and app markets have exploded in growth over the past decade, and despite the fact that more than half of all American adults now own at least one smartphone, according to the Pew Research Center, marketers have been relatively slow to jump into mobile marketing. In fact, American adults spend at least 20% of their time online via mobile devices; the advertising industry has been playing "catch-up" over the past few years in an attempt to chase this market. Even so, analyst Mary Meeker notes that advertising budgets still devote only about a tenth of their expenditures to mobile – though this is a fourfold increase from just a few years ago.

Any entity that is considering the deployment of a mobile strategy must understand consumer behavior as it occurs via mobile devices. Web usability experts have known for years that online browsing behavior can be casual, with people quickly clicking from one site to another and making judgments about content encountered in mere seconds. Mobile users, on the other hand, are far more deliberate in their efforts – generally speaking, a mobile user has a specific task in mind when he or she pulls out his or her phone. Browsing is far less likely to occur in a mobile context. This is due to a number of factors, including screen size, connection speed, and the environmental context in which mobile activity takes place – the middle of the grocery store dairy case, for example, is not the ideal place for one to contemplate the purchase of an eight-person spa for the backyard.

The appropriate route to the consumer must be considered as well. This can be a daunting prospect, particularly for small businesses, businesses with limited IT resources, or businesses with little previous web or tech experience. If a complete end-user experience is desired, there are two primary strategies that a company can employ: an all-in-one web-based solution or a stand-alone app.

All-in-one web-based solutions allow the same HTML5/CSS3-based site to appear elegant and functional in a full-fledged computer-based
browser while simultaneously "degrading" on a mobile device in such a way that no functionality is lost. In other words, the same underlying code provides the user experience regardless of what technological platform one uses to visit a site. There are several advantages to this approach, including singularity of platform (that is, no need to duplicate properties, logos, databases, etc.), ease of update, unified user experience, and relative ease of deployment. However, there are downsides: full implementations of HTML5 and CSS3 are relatively new. As a result, it can be costly to find a developer who is sufficiently knowledgeable to make the solution as seamless as desired, and who can articulate the solution in such a way that non-developers will understand the full vision of the end product. Furthermore, development of a polished finished product can be time-consuming and will likely involve a great deal of compromise from a design perspective.

Mobile analytics tools are relatively easy to deploy when a marketer chooses to take this route, as most modern smartphone web browsers are built on the same technologies that drive computer-based web browsers – in other words, most mobile browsers support both JavaScript and web "cookies," both of which are typically requisites for analytics tools. Web pages can be "tagged" in such a way that mobile analytics can be measured, which will allow for the collection of a variety of information on visitors. This might include device type, browser identification, operating system, GPS location, screen resolution/size, and screen orientation, all of which can provide clues as to the contexts in which users are visiting the website on a mobile device. Some mainstream web analytics tools, such as Google Analytics, already include a certain degree of information pertaining to mobile users (i.e., it is possible to drill down into reports and determine how many mobile users have visited and what types of devices they were using); however, marketing entities that want a greater degree of insight into the success of their mobile sites will likely need to seek out a third-party solution to monitor performance.
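As a rough illustration of what such tagging yields on the back end, the sketch below buckets hypothetical hit records (already reduced to a user-agent string and a reported screen width) into device classes using only Python's standard library. The field names, keyword lists, and width threshold are simplifying assumptions for this example and do not reflect how any particular analytics vendor classifies devices.

from collections import Counter

# Hypothetical hit records as they might arrive from a tagged page.
hits = [
    {"user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)", "screen_width": 320},
    {"user_agent": "Mozilla/5.0 (Linux; Android 4.4; Nexus 7)", "screen_width": 800},
    {"user_agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)", "screen_width": 1920},
]

TABLET_HINTS = ("ipad", "tablet", "nexus 7")
PHONE_HINTS = ("iphone", "ipod", "android", "windows phone")

def classify(hit):
    """Bucket a hit as tablet, phone, or desktop from coarse signals."""
    ua = hit["user_agent"].lower()
    if any(hint in ua for hint in TABLET_HINTS):
        return "tablet"
    if any(hint in ua for hint in PHONE_HINTS) or hit["screen_width"] < 480:
        return "phone"
    return "desktop"

print(Counter(classify(h) for h in hits))

Reports of this kind, broken out by device class, are exactly the sort of mobile-specific detail that generic web analytics packages may only partially expose.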
There are a number of providers of web-based analytics solutions that cover mobile web use. These include, but are not limited to, ClickTale, which offers mobile website optimization tools; comScore, which is known for its audience measurement metrics; Flurry, which focuses on use and engagement metrics; Google, which offers both free and enterprise-level services; IBM, which offers the ability to record user sessions and perform deep analysis of customer actions; Localytics, which offers real-time user tracking and messaging options; Medio, which touts "predictive" solutions that allow for custom content creation; and Webtrends, which incorporates other third-party (e.g., social media) data.

The other primary mobile option is development of a stand-alone smartphone or tablet app. Stand-alone apps are undeniably popular, given that 50 billion apps were downloaded from the Apple App Store between July 2008 and June 2014. A number of retailers have had great success with their apps, including Amazon, Target, Zappos, Groupon, and Walgreens, which speaks to the potential power of the app as a marketing tool. However, consider that there are more than one million apps in the Apple App Store alone, as of this writing – those odds greatly reduce the chances that an individual will simply "stumble across" a company's app in the absence of some sort of viral advertising, breakout product, or buzzworthy word-of-mouth. Furthermore, developing a successful and enduring app can be quite expensive, particularly considering that a marketer will likely want to make versions of the app available for both Apple iOS and Google Android (the two platforms are incompatible with each other). Estimates for app development vary widely, from a few thousand dollars at the low end all the way up to six figures for a complex app, according to Mark Stetler of AppMuse – and these figures do not include ongoing updates, bug fixes, or recurring content updates, all of which require staff with specialized training and know-how.

If a full-fledged app or redesigned website proves too daunting or beyond the scope of what a marketer needs or desires, there are a number of other techniques that can be used to reach consumers, including text and multimedia messaging, email messaging, mobile advertising, and so forth. Each of these techniques can reveal a wealth of data about consumers, so long as the appropriate analytic tools are deployed in advance of the launch of any particular campaign.

Mobile app analytics are quite different from web analytics in a number of ways, including the vocabulary. For example, there are no page views in the world of app analytics – instead, "screen views" are referenced. Likewise, an app "session" is analogous to a web "visit." App analytics often have the ability to access and gauge the use of various features built into a phone or tablet, including the accelerometer, GPS, and gyroscope, which can provide interesting kinesthetic aspects to user experience considerations. App analytics tools are also typically able to record and retain data related to offline usage for transmission when a device has reconnected to the network, which can provide a breadth of environmentally contextual information to developers and marketers alike. Finally, multiple versions of a mobile app can exist "in the wild" simultaneously because users' proclivities differ when it comes to updating apps. Most app analytics packages have the ability to determine which version of an app is in use so that a development team can track interactional differences between versions and confirm that bugs have been "squashed."

As mentioned previously, marketers who choose to forego app development and develop a mobile version of their web page often choose to stick with their existing web analytics provider, and oftentimes these providers do not offer a level of detail regarding mobile engagement that would prove particularly useful to marketers who want to capture a snapshot of mobile users. In many cases, companies simply have not given adequate consideration to mobile engagement, despite the fact that it is a segment of online interaction that is only going to grow, particularly as smartphone saturation continues. However, for those entities that wish to delve further into mobile analytics, there are a growing number of options available, with a few key differences between the major offerings. There are both free and paid mobile analytics platforms available; the key differentiator between these offerings seems to come down to data ownership. A third-party provider that shares the data with you, like Google, is more likely to come at a bargain price, whereas a provider that grants you exclusive ownership of the data is going to come at a premium. Finally, implementation will make a difference in costs: SaaS (software-as-a-service) solutions, which are typically web based, run on the third-party service's own servers, and are relatively easy to install, tend to be less expensive, whereas "on-premises" solutions are both rare and quite expensive.

There are a small but growing number of companies that provide app-specific analytic tools, typically deployed as SDKs (software development kits) that can be "hooked" into apps. These companies include, but are by no means limited to, Adobe Analytics, which has been noted for its scalability and depth of analysis; Artisan Mobile, an iOS-focused analytics firm that allows customers to conduct experiments with live users in real time; Bango, which focuses on ad-based monetization of apps; Capptain, which allows specific user segments to be identified and targeted with marketing campaigns; Crittercism, which is positioned as a transaction-monitoring service; Distimo, which aggregates data from a variety of platforms and app stores to create a fuller picture of an app's position in the larger marketplace; ForeSee, which has the ability to record customer interactions with apps; and Kontagent, which touts itself as a tool for maintaining customer retention and loyalty.

As mobile devices and the mobile web grow increasingly sophisticated, there is no doubt that mobile analytics tools will also grow in sophistication. Nevertheless, it would seem that there is a wide range of promising toolkits already available to the marketer who is interested in better understanding customer behaviors and increasing customer retention, loyalty, and satisfaction.

Cross-References

▶ Data Aggregation
▶ Location Data
▶ Network Data
▶ Telecommunications
Further Readings

Meeker, M. Internet trends 2014. http://www.kpcb.com/insights/2014-internet-trends. Accessed September 2014.
Smith, A. Smartphone ownership 2013. Pew Research Center. http://www.pewinternet.org/2013/06/05/smartphone-ownership-2013/. Accessed September 2014.
Stetler, M. How much does it cost to develop a mobile app? AppMuse. http://appmuse.com/appmusing/how-much-does-it-cost-to-develop-a-mobile-app/. Accessed September 2014.
N

National Association for the Advancement of Colored People

Steven Campbell
University of South Carolina, Lancaster, Lancaster, SC, USA

The National Association for the Advancement of Colored People (NAACP) is an African-American civil rights organization headquartered in Baltimore, MD. Founded in 1909, its membership advocates civil rights by engaging in activities such as mobilizing voters and tracking equal opportunity in government, industry, and communities. Over the past few years, the NAACP has shifted its attention to digital advocacy and the utilization of datasets to better mobilize activists online. In the process, the NAACP has become a leading organization in how it harnesses big data for digital advocacy and related campaigns. The NAACP’s application of specially tailored data to its digital approach, from rapid response to targeted messaging to understanding recipients’ interests, has become an example for other groups to follow. At the same time, the NAACP has challenged other big data practices (both in the public and private sectors), highlighting abuses of such data in ways that can directly impact disadvantaged minority groups.

With more than 425,000 members, the NAACP is the nation’s largest civil rights organization. Administered by a 64-member board headed by a chairperson, various departments within the NAACP govern particular areas of action. The Legal Department tracks court cases with potentially extensive implications for minorities, including recurring discrimination in areas such as education and employment. The Washington, D.C., office lobbies Congress and the Presidency on a wide range of policies and issues, while the Education Department seeks improvements in the sphere of public education. Overall, the NAACP’s mission is to bolster equal rights for all people in political, educational, and economic terms as well as stamp out racial biases and discrimination.

In order to extend this mission into the twenty-first century, the NAACP launched a digital media department in 2011. This entailed a mobile subscriber project that led to 423,000 contacts, 233,000 Facebook supporters, and 1.3 million email subscribers, due in large part to greater social media outreach. The NAACP’s “This is my Vote!” campaign, launched prior to the 2012 presidential election, dramatically advanced the organization’s voter registration and mobilization programs. As a result, the NAACP registered twice as many individuals – over 374,000 – as it did in 2008 and mobilized over 1.2 million voters. In addition, the NAACP conducted an election eve poll that surveyed 1,600 African-American voters. This was done in order to assess their potential influence as well as key issue areas prior to the election results and in looking forward to 2016. Data from the poll
highlighted the predominant role played by African-Americans in major battleground states and divulged openings for the Republican Party in building rapport with the African-American community. In addition, the data signaled to Democrats a message not to assume levels of Black support in 2016 on par with that realized in the 2008 and 2012 elections.

By tailoring its outreach to individuals, the NAACP has been successful in achieving relatively high rates of engagement. The organization segments supporters based on their actions, such as whether they support a particular issue based on past involvement. For instance, many NAACP members view gun violence as a serious problem in today’s society. If such a member connects with NAACP’s online community via a particular webpage or internet advertisement, s/he will be recognized as one espousing stronger gun control laws. Future outreach will entail tailored messages expressing attributes that resonate on a personal level with the supporter, not unlike that from a friend or colleague.

The NAACP also takes advantage of major events that reflect aspects of the organization’s mission statement. Preparation for such moments entails much advance work, as evidenced in the George Zimmerman trial involving the fatal shooting of 17-year-old Trayvon Martin. As the trial was concluding in 2013, the NAACP formed contingency plans in advance of the court’s decision. Website landing pages and prewritten emails were set in place, adapted for whatever result may come. Once the verdict was read, the NAACP sent out emails within 5 minutes that detailed specific actions for supporters to take. This resulted in over a million petition signatures demanding action on the part of the US Justice Department, which it eventually took.

Controversy

While government and commercial surveillance potentially affect all Americans, minorities face these risks at disproportionate rates. Thus, the NAACP has raised concerns about whether big data needs to provide greater protections for minorities in addition to the general privacy protections commonly granted. Such controversy surrounding civil rights and big data may not be self-evident; however, big data often involves the targeting and segmenting of one type of individual from another. This serves as a threat to basic civil rights – which are protected by law – in ways that were inconceivable in recent decades. For instance, the NAACP has expressed alarm regarding the collection of information by credit reporting agencies. Such collections can result in the making of demographic profiles and stereotypical categories, leading to the marketing of predatory financial instruments to minority groups.

The US government’s collection of massive phone records for purposes of intelligence has also drawn harsh criticism from the NAACP as well as other civil rights organizations. They have voiced warnings regarding such big data by highlighting how abuses can uniquely affect disadvantaged minorities. The NAACP supports principles aimed at curtailing the pervasive use of data in areas such as law enforcement and employment. Increasing collections of data are viewed by the NAACP as a threat since such big data could allow for unjust targeting of, and discrimination against, African-Americans. Thus, the NAACP strongly advocates measures such as a stop to “high-tech profiling,” greater pressure on private industry for more open and transparent data, and greater protections for individuals from inaccurate data.

Cross-References

▶ Demographic Data
▶ Discrimination
▶ Facebook
▶ Pattern Recognition
▶ Targeting

Further Reading
Fung, Brian (27 Feb 2014). Why civil rights groups are warning against ‘big data’. Washington Post. http://www.washingtonpost.com/blogs/the-switch/wp/2014/02/27/why-civil-rights-groups-are-warning-against-big-data/. Accessed Sept 2014.
Murray, Ben (3 Dec 2013). What brands can learn about data from the NAACP: Some advocacy groups are ahead of the curve, making smarter data decisions. Advertising Age. http://adage.com/article/datadriven-marketing/brands-learn-data-advocacy-groups/245498/. Accessed Sept 2014.
NAACP. http://www.NAACP.org. Accessed Sept 2014.
National Oceanic and Atmospheric Administration

Steven J. Campbell
University of South Carolina Lancaster, Lancaster, SC, USA

The National Oceanic and Atmospheric Administration (NOAA) is an agency housed within the US Commerce Department that monitors the status and conditions of the oceans and the atmosphere. NOAA oversees a diverse array of satellites, buoys, ships, aircraft, tide gauges, and supercomputers in order to closely track environmental changes and conditions. This network yields valuable and critical data that is crucial for alerting the public to potential harm and protecting the environment nationwide. The vast sums of data collected daily have served as a challenge to NOAA in storing as well as making the information readily accessible and meaningful to the public and interested organizations. In the future, as demand grows for ever-greater amounts and types of climate data, NOAA must be resourceful in meeting the demands of public officials and other interested parties.

First proposed by President Richard Nixon, who wanted a new department in order to better protect citizens and their property from natural dangers, NOAA was founded in October 1970. Its mission is to comprehend and foresee variations in the environment, from the conditions of the oceans to the state of the sun, and to better safeguard and preserve seashores and marine life. NOAA provides alerts to dangerous weather, maps the oceans and atmosphere, and directs the responsible handling and safeguarding of the seas and coastal assets. One key way NOAA pursues its mission is by conducting research in order to further awareness and better management of environmental resources. With a workforce of over 12,000, NOAA consists of six major line offices, including the National Weather Service (NWS), in addition to over a dozen staff offices.

NOAA’s collection and dissemination of vast sums of data on the climate and environment contribute to a multibillion-dollar weather enterprise in the private sector. The agency has sought ways to release extensive new troves of this data, an effort that could be of great service to industry and those engaged in research. NOAA announced a call in early 2014 for ideas from the private sector to assist the agency’s efforts in freeing up a large amount of the 20 terabytes of data that it collects on a daily basis pertaining to the environment and climate change. In exchange, researchers stand to gain critical access to important information about the planet, and private companies can receive help and assistance in advancing new climate tools and assessments.

This request by NOAA shows that it is planning to place large amounts of its data into the cloud, benefitting both the private and public sectors in a number of ways. For instance, climate data collected by NOAA is currently employed
for forecasting the weather over a week in advance. In addition, marine navigation and offshore oil and gas drilling operations are very interested in related data. NOAA has pursued unleashing ever-greater amounts of its ocean and atmospheric data by partnering with groups outside government. This is seen as paramount to NOAA’s data management, where tens of petabytes of information are recorded in various ways, engendering over 15 million results daily – from weather forecasts for US cities to coastal tide monitoring – which totals twice the amount of all the printed collections of the US Library of Congress.

Maneuvering through NOAA’s mountain of weather and climate data has proved to be a great challenge over the years. To help address this issue, NOAA made available, in late 2013, an instrument that helped further open up the data to the public. With a few clicks of a mouse, individuals can create interactive maps illustrating natural and manmade changes in the environment worldwide. For the most part, the data is free to the public, but much of the information has not always been organized in a user-friendly format. NOAA’s objective was to bypass that issue and allow public exploration of environmental conditions from hurricane occurrences to coastal tides to cloud formations. The new instrument, named NOAA View, allows ready access to many of NOAA’s databases, including simulations of future climate models. These datasets grant users the ability to browse various maps and information by subject and time frame. Behind the scenes, numerous computer programs manipulate datasets into maps that can demonstrate environmental attributes and climate change over time. NOAA View’s origins were rooted in data visualization instruments present on the web, and it is operational on tablets and smartphones, which account for 44% of all hours spent online by the US public.

Advances to NOAA’s National Weather Service supercomputers have allowed for much faster calculations of complex computer models, resulting in more accurate weather forecasts. The ability of these enhanced supercomputers to analyze mounds of scientific data proves vital in helping public officials, communities, and industrial groups to better comprehend and prepare for perils linked with turbulent weather and climatic occurrences. Located in Virginia, the supercomputers operate at 213 teraflops (TF) – up from the 90 TF of the computers that came before them. This has helped to produce an advanced Hurricane Weather Research and Forecasting (HWRF) model that the National Weather Service can more effectively employ. By allowing more effective monitoring of violent storms and more accurate predictions regarding the time, place, and intensity of their impact, the HWRF model can result in saved lives.

NOAA’s efforts to build a Weather-Ready Nation have evolved from a foundation of supercomputer advancements that have permitted more accurate storm-tracking algorithms for weather prediction. First launched in 2011, this initiative on the part of NOAA has resulted in advanced services, particularly in ways that data and information can be made available to the public, government agencies, and private industry.

Cross-References

▶ Climate Change, Hurricanes/Typhoons
▶ Cloud or Cloud Computing
▶ Data Storage
▶ Environment
▶ Predictive Analytics

Further Readings

Freedman, A. (2014, February 24). U.S. readies big-data dump on climate and weather. http://mashable.com/2014/02/24/NOAA-data-cloud/. Accessed September 2014.
Kahn, B. (2013). NOAA’s new cool tool puts climate on view for all. http://www.climatecentral.org/news/noaas-new-cool-tool-puts-climate-on-view-for-all-16703. Accessed September 2014.
National Oceanic and Atmospheric Administration (NOAA). www.noaa.gov. Accessed September 2014.
National Organization for Women

Deborah Elizabeth Cohen
Smithsonian Center for Learning and Digital Access, Washington, DC, USA

The National Organization for Women (NOW) is an American feminist organization that is the grassroots arm of the women’s movement and the largest organization of feminist activists in the United States. Since its founding in 1966, NOW has engaged in activity to bring about equality for all women. NOW has been participating in recent dialogues to identify how common big data working methods lead to discriminatory practices against protected classes including women. This entry discusses NOW’s mission and issues related to big data and the activities NOW has been involved with to end discriminatory practices resulting from the usage of big data.

As written in its original statement of purpose, the purpose of NOW is to take action to bring women into full participation in the mainstream of American society, exercising privileges and responsibilities in completely equal partnership with men. NOW strives to make change through a number of activities including lobbying, rallies, marches, and conferences. NOW’s six core issues are economic justice, promoting diversity and ending racism, lesbian rights, ending violence against women, constitutional equality, and access to abortion and reproductive health.

NOW’s current president Terry O’Neill has stated that big data practices can render obsolete the USA’s landmark civil rights and anti-discrimination laws with special challenges for women, the poor, people of color, trans-people, and the LGBT community. While the technologies of automated decision-making are hidden and largely not understood by average people, they are being conducted with an increasing level of pervasiveness and used in contexts that affect individuals’ access to health, education, employment, credit, and products. Problems with big data practices include the following:

• Big data technology is increasingly being used to assign people to ideologically or culturally segregated clusters, profiling them and in doing so leaving room for discrimination.
• Through the practice of data fusion, big data tools can reveal intimate personal details, eroding personal privacy.
• As people are often unaware of this “scoring” activity, it can be hard for individuals to break out of being mislabeled.
• Employment decisions made through data mining have the potential to be discriminatory.
• Metadata collection renders legal protection of civil rights and liberties less enforceable, undoing civil rights law.

Comprehensive US civil rights legislation in the 1960s and 1970s resulted from social actions
organized to combat discrimination. A number of current big data practices are in misalignment with these laws and can lead to discriminatory outcomes.

NOW has been involved with several important actions in response to these recognized problems with big data. In January of 2014, the US White House engaged in a 90-day review of big data and privacy issues, to which NOW as a participating stakeholder provided input. Numerous policy recommendations resulted from this process, especially related to data privacy and the need for the federal government to develop technical expertise to stop discrimination.

The NOW Foundation also belongs to a coalition of 200 progressive organizations named the Leadership Conference on Civil and Human Rights, whose mission is to promote the civil and human rights of all persons in the United States. NOW President Terry O’Neill serves on the Coalition’s Board of Directors. In February 2014, The Leadership Conference released five “Civil Rights Principles for the Era of Big Data” and in August 2014 provided testimony based on their work to the US National Telecommunications and Information Administration’s Request for Public Comment related to Big Data and Consumer Privacy. The five civil rights principles to ensure that big data is designed and used in ways that respect the values of equal opportunity and equal justice include the following:

1. Stop high-tech profiling – ensure that clear limits and audit mechanisms are in place to make sure that data gathering and surveillance tools that can assemble detailed information about a person or group are used in a responsible and fair way.
2. Ensure fairness in automated decisions – require through independent review and other measures that computerized decision-making systems in areas such as employment, health, education, and lending operate fairly for all people and protect the interests of those that are disadvantaged and have historically been discriminated against. Systems that are blind to preexisting disparities can easily reach decisions that reinforce existing inequities.
3. Preserve constitutional principles – government databases must not be allowed to undermine core legal protections, including those of privacy and freedom of association. Independent oversight of law enforcement is particularly important for minorities who often receive disproportionate scrutiny.
4. Enhance individual control of personal information – individuals, and in particular those in vulnerable populations including women and the LGBT community, should have meaningful and flexible control over how a corporation gathers data from them and how it uses and shares that data. Nonpublic information should not be shared with the government without judicial process.
5. Protect people from inaccurate data – government and corporate databases must allow everyone to appropriately ensure the accuracy of personal information used to make important decisions about them. This requires disclosure of the data and the right to correct it when inaccurate.

Big data has been called the civil rights battle of our time. Consistent with its mission, NOW is engaged in this battle, protecting the civil rights of women and others against discriminatory practices that can result from current big data practices.

Cross-References

▶ Data Fusion
▶ Data Mining
▶ Discrimination
▶ National Telecommunication and Information Administration
▶ White House Big Data Initiative

Further Readings

Big data: Seizing opportunities, preserving values. (2014). Washington, DC: The White House. www.whitehouse.gov/sites/default/files/docs/big-data-privacy-report-5.1.1.14-final-print.pdf. Accessed 7 Sep 2014.
Eubanks, V. (2014). How big data could undo our civil-rights laws. The American Prospect. www.prospect.org/article/how-big-data-could-undo-our-civil-rights-laws. Accessed 7 Sep 2014.
Gangadharan, S. P. (2014). The dangers of high-tech profiling, using big data. The New York Times. www.nytimes.com/roomfordebate/204/08/06/Is-big-data-spreading-inequality/the-dangers-of-high-tech-profiling-using-big-data. Accessed 5 Sep 2014.
NOW website. (2014). Who we are. National Organization for Women. http://now.org/about/who-we-are/. Accessed 2 Sep 2014.
The Leadership Conference on Civil and Human Rights. (2014). Civil rights principles for the era of big data. www.civilrights.org/press/2014/civil-rights-principles-big-data.html. Accessed 7 Sep 2014.
Netflix

J. Jacob Jenkins
California State University Channel Islands, Camarillo, CA, USA

Introduction

Netflix is a film and television provider headquartered in Los Gatos, California. Netflix was founded in 1997 as an online movie rental service, using Permit Reply Mail to deliver DVDs. In 2007, the company introduced streaming content, which allowed customers instant access to its online video library. Netflix has since continued its trend toward streaming services by developing a variety of original and award-winning programming. Due to its successful implementation of Big Data, Netflix has experienced exponential growth since its inception. It currently offers over 100,000 titles on DVD and is the world’s largest on-demand streaming service with more than 80 million subscribers in over 190 countries worldwide.

Netflix and Big Data

Software executives Marc Randolph and Reed Hastings founded Netflix in 1997. Randolph was a previous cofounder of MicroWarehouse, a mail-order computer company; Hastings was a previous math teacher and founder of Pure Soft, a software company he sold for $700 million. The idea for Netflix was prompted by Hastings’ experience of paying $40 in overdue fees at a local Blockbuster. Using $2.5 million in start-up money from his sale of Pure Soft, Hastings envisioned a video provider whose content could be returned from the comfort of one’s own home, void of due dates or late fees. Netflix’s website was subsequently launched on August 29, 1997.

Netflix’s original business model used a traditional pay-per-rental approach, charging 0.50 cents per film. Netflix introduced its monthly flat-fee subscription service in September 1999, which led to the termination of its pay-per-rental model by early 2000. Netflix has since built its global reputation on the flat-fee business model, as well as its lack of due dates, late fees, or shipping and handling charges. Netflix delivers DVDs directly to its subscribers using the United States Postal Service and a series of regional warehouses located throughout the United States. Based upon which subscription plan is chosen, users can keep between one and eight DVDs at a time, for as long as they desire. When subscribers return a disc to Netflix using one of its prepaid envelopes, the next DVD on their online rental queue is automatically mailed in its stead. DVD-by-mail subscribers can access and manage their online rental queue through Netflix’s website in order to add and delete titles or rearrange their priority.
In 2007 Netflix introduced streaming content as part of its “Watch Instantly” initiative. When Netflix first introduced streaming video to its website, subscribers were allowed 1 hour of access for every $1 spent on their monthly subscription. This restriction was later removed due to emerging competition from Hulu, Apple TV, Amazon Prime, and other on-demand services. There are substantially fewer titles available through Netflix’s streaming service than its disc library. Despite this limitation, Netflix has become the most widely supported streaming service in the world by partnering with Sony, Nintendo, and Microsoft to allow access through Blu-ray DVD players, as well as the Wii, Xbox, and PlayStation gaming consoles. In subsequent years, Netflix has increasingly turned attention toward its streaming services. In 2008 the company added 2500 new “Watch Instantly” titles through a partnership with Starz Entertainment. In 2010 Netflix inked deals with Paramount Pictures, Metro-Goldwyn-Mayer, and Lions Gate Entertainment; in 2012 it inked a deal with DreamWorks Animation.

Netflix has also bolstered its online library by developing its own programming. In 2011 Netflix announced plans to acquire and produce original content for its streaming service. That same year it outbid HBO, AMC, and Showtime to acquire the production rights for House of Cards, a political drama based on the BBC miniseries of the same name. House of Cards was released on Netflix in its entirety in early 2013. Additional programming released during 2013 included Lilyhammer, Hemlock Grove, Orange is the New Black, and the fourth season of Arrested Development – a series that originally aired on Fox between 2003 and 2006. Netflix later received the first Emmy Award nomination for an exclusively online television series. House of Cards, Hemlock Grove, and Arrested Development received a total of 14 nominations at the 2013 Primetime Emmy Awards; House of Cards received an additional four nominations at the 2014 Golden Globe Awards. In the end, House of Cards won three Emmy Awards for “Outstanding Casting for a Drama Series,” “Outstanding Directing for a Drama Series,” and “Outstanding Cinematography for a Single-Camera Series.” It won one Golden Globe for “Best Actress in a Television Series Drama.”

Through its combination of DVD rentals, streaming services, and original programming, Netflix has grown exponentially since 1997. In 2000, the company had approximately 300,000 subscribers. By 2005 that number grew to nearly 4 million users, and by 2010 it grew to 20 million. During this time, Netflix’s initial public offering (IPO) of $15 per share soared to nearly $500, with a reported annual revenue of more than $6.78 billion in 2015. Today, Netflix is the largest source of Internet traffic in all of North America. Its subscribers stream more than 1 billion hours of media content each month, approximating one-third of total downstream web traffic. Such success has resulted in several competitors for online streaming and DVD rentals. Wal-Mart began its own online rental service in 2002 before acquiring the Internet delivery network, Vudu, in 2010. Amazon Prime, Redbox Instant, Blockbuster @ Home, and even “adult video” services like WantedList and SugarDVD have also entered the video streaming market. Competition from Blockbuster sparked a price war in 2004, yet Netflix remains the industry leader in online movie rentals and streaming.

Netflix owes much of its success to the innovative use of Big Data. Because it is an Internet-based company, Netflix has access to an unprecedented amount of data on viewer behavior. Broadcast networks have traditionally relied on approximated ratings and focus group feedback to make decisions about their content and airtime. In contrast, Netflix can aggregate specified data about customers’ actual viewing habits in real time, allowing it to understand subscriber trends and tendencies at a much more sophisticated level. The type of information Netflix gathers is not limited to what viewers watch and the ratings they ascribe. Netflix also tracks the specific dates and times in which viewers watch particular programming, as well as their geographic locations, search histories, and scrolling patterns; when they use pause, rewind, or fast-forward; the types of streaming devices employed; and so on.

The information Netflix collects allows it to deliver unrivaled personalization to each
individual customer. This customization not only results in better recommendations but also helps to inform what content the company should invest in. Once content has been acquired/developed, Netflix’s algorithms also help to optimize their marketing and to increase renewal rates on original programming. As an example, Netflix created ten distinct trailers to promote their original series House of Cards. Each trailer was designed for a different audience and seen by various customers based on those customers’ previous viewing behaviors. Meanwhile, the renewal rate for original programming on traditional broadcast television is approximately 35%; the current renewal rate for original programming on Netflix is nearly 70%.

As successful as Netflix’s use of Big Data has been, the company strives to keep pace with changes in viewer habits, as well as changes in its own product. When the majority of subscribers used Netflix’s DVD-by-mail service, for instance, those customers consciously added new titles to their queue. Streaming services demand a more instantaneous and intuitive process of generating future recommendations. In response to developments such as this, Netflix initiated the “Netflix Prize” in 2006: a $1 million payout to the first person or group of persons to formulate a superior algorithm for predicting viewer preferences. Over the next 3 years, more than 40,000 teams from 183 countries were given access to over 100 million user ratings. BellKor’s Pragmatic Chaos was able to improve upon Netflix’s existing algorithm by approximately 10% and was announced as the award winner in 2009.
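The Netflix Prize scored that improvement in terms of root-mean-square error (RMSE) between predicted and actual ratings, so a lower score meant better predictions. The following Python sketch only illustrates how such an error score is computed on a handful of made-up ratings; it is not Netflix's recommendation algorithm, and the numbers are invented for the example.

import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and observed ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical 1-5 star ratings for five titles by one subscriber.
actual = [5, 3, 4, 2, 4]
predicted = [4.6, 3.4, 3.9, 2.8, 4.2]

# Smaller is better; the Prize required roughly 10% lower RMSE than Netflix's own system.
print(round(rmse(predicted, actual), 3))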
Conclusion

In summation, Netflix is presently the world’s largest “Internet television network.” Key turning points in the company’s development have included a flat-rate subscription service, streaming content, and original programming. Much of the company’s success has also been due to its innovative implementation of Big Data. An unprecedented level of information about customers’ viewing habits has allowed Netflix to make informed decisions about programming development, promotion, and delivery. As a result, Netflix currently streams more than 1 billion hours of content per month to over 80 million subscribers in 190 countries and counting.

Cross-References

▶ Algorithm
▶ Amazon
▶ Apple
▶ Communications
▶ Consumer Action
▶ Entertainment
▶ Facebook
▶ Internet
▶ Internet Tracking
▶ Microsoft
▶ Social Media
▶ Streaming Data
▶ Streaming Data Analytics
▶ Video

Further Readings

Keating, G. (2013). Netflixed: The epic battle for America’s eyeballs. London: Portfolio Trade.
McCord, P. (2014). How Netflix reinvented HR. Harvard Business Review. http://static1.squarespace.com/static/5666931569492e8e1cdb5afa/t/56749ea457eb8de4eb2f2a8b/1450483364426/How+Netflix+Reinvented+HR.pdf. Accessed 5 Jan 2016.
McDonald, K., & Smith-Rowsey, D. (2016). The Netflix effect: Technology and entertainment in the 21st century. London: Bloomsbury Academic.
Simon, P. Big data lessons from Netflix. Wired. Retrieved from https://www.wired.com/insights/2014/03/big-data-lessons-netflix/
Wingfield, N., & Stelter, B. (2011, October 24). How Netflix lost 800,000 members, and good will. The New York Times. http://faculty.ses.wsu.edu/rayb/econ301/Articles/Netflix%20Lost%20800,000%20Members%20.pdf. Accessed 5 Jan 2016.
company’s success has also been due to its pdf. Accessed 5 Jan 2016.
Network Analytics

Jürgen Pfeffer
Bavarian School of Public Policy, Technical University of Munich, Munich, Germany

Synonyms

Network science; Social network analysis

Much of big data comes with relational information. People are friends with or follow each other on social media platforms, send each other emails, or call each other. Researchers around the world copublish their work, and large-scale technology networks like power grids and the Internet are the basis for worldwide connectivity. Big data networks are ubiquitous and are more and more available for researchers and companies to extract knowledge about our society or to leverage new business models based on data analytics. These networks consist of millions of interconnected entities and form complex socio-technical systems that are the fundamental structures governing our world, yet defy easy understanding. Instead, we must turn to network analytics to understand the structure and dynamics of these large-scale networked systems and to identify important or critical elements or to reveal groups. However, in the context of big data, network analytics is also faced with certain challenges.

Network Analytical Methods

Networks are defined as a set of nodes and a set of edges connecting the nodes. The major questions for network analytics, independent from network size, are “Who is important?” and “Where are the groups?” Stanley Wasserman and Katherine Faust have authored a seminal work on network analytical methods. Even though this work was published in the mid-1990s, it can still be seen as the standard book on methods for network analytics, and it also provides the foundation for many contemporary methods and metrics. With respect to identifying the most important nodes in a given network, a diverse array of centrality metrics has been developed in the last decades. Marina Hennig and her coauthors classified centrality metrics into four groups. “Activity” metrics purely count the number or summarize the volume of connections. For “radial” metrics, a node is important if it is close to other nodes, and “medial” metrics account for being in the middle of flows in networks or for bridging different areas of the network. “Feedback” metrics are based on the idea that centrality can result from the fact that a node is connected (directly or even indirectly) to other central nodes. For the first three groups, Linton C. Freeman has defined “degree centrality,” “closeness centrality,” and “betweenness centrality” as the most intuitive metrics. These metrics are used in almost every network analytical research project nowadays. The fourth metric category comprises mathematically advanced
methods based on eigenvector computation. Phillip Bonacich presented eigenvector centrality, which led to important developments of metrics for web analytics like Google’s PageRank algorithm or the HITS algorithm by Jon Kleinberg, which is incorporated into several search engines to rank search results based on a website’s structural importance on the Internet.
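As a concrete illustration of these metrics, the following Python sketch uses the open-source NetworkX library on its built-in Zachary karate club graph; the library and the small example graph are standing in for whatever big data network is actually being analyzed.

import networkx as nx

# Small example graph shipped with NetworkX (34 members of a karate club).
G = nx.karate_club_graph()

# Activity, radial, medial, and feedback metrics from the classification above.
degree = nx.degree_centrality(G)            # how many connections a node has
closeness = nx.closeness_centrality(G)      # how close a node is to all other nodes
betweenness = nx.betweenness_centrality(G)  # how often a node lies on shortest paths
eigenvector = nx.eigenvector_centrality(G)  # connections to other central nodes

def top3(scores):
    return sorted(scores, key=scores.get, reverse=True)[:3]

for name, scores in [("degree", degree), ("closeness", closeness),
                     ("betweenness", betweenness), ("eigenvector", eigenvector)]:
    print(name, top3(scores))

The same four calls work on graphs of any size, which is exactly where the scaling problems discussed below begin to matter.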
The second big pile of research questions related to networks is about identifying groups. Groups can refer to a broad array of definitions, e.g., nodes sharing certain socioeconomic attributes, membership affiliations, or geographic proximity. When analyzing networks, we are often interested in structurally identifiable groups, i.e., sets of nodes of a network that are more densely connected among themselves and more sparsely connected to all other nodes. The most obvious group of nodes in a network would be a clique – a set of nodes where each node is connected to all other nodes. Other definitions of groups are more relaxed. K-cores are sets of nodes for which every node is connected to at least k other nodes in the set. It turns out that k-cores are more realistic for real-world data than cliques and much faster to calculate. For any form of group identification in networks, we are often interested in evaluating the “goodness” of the identified groups. The most common approach to assess the quality of grouping algorithms is to calculate the modularity index developed by Michelle Girvan and Mark Newman.
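A short sketch of these group concepts, again using NetworkX on its example karate club graph (an assumption made purely for illustration): it extracts a k-core, detects groups with a modularity-based heuristic, and scores the resulting partition with the modularity index.

import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# k-core: the subgraph in which every node has at least k neighbors inside the group.
core3 = nx.k_core(G, k=3)
print("nodes in the 3-core:", sorted(core3.nodes()))

# A modularity-maximizing heuristic for finding densely connected groups.
groups = community.greedy_modularity_communities(G)
print("number of detected groups:", len(groups))

# Modularity index (Girvan/Newman) as a quality score for the partition.
print("modularity:", round(community.modularity(G, groups), 3))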
Algorithmic Challenges

The most widely used algorithms in network analytics were developed in the context of small groups of (less than 100) humans. When we study big networks with millions of nodes, several major challenges emerge. To begin with, most network algorithms run in Θ(n²) time or slower. This means that if we double the number of nodes, the calculation time is quadrupled. For instance, let us assume we have a network with 1,000 nodes and a second network with one million nodes (a thousandfold increase). If a certain centrality calculation with quadratic algorithmic complexity takes 1 min on the first network, the same calculation would take 1 million minutes (approximately 2 years) on the second network (a millionfold increase). This property of many network metrics makes it nearly impossible to apply them to big data networks within reasonable time. Consequently, optimization and approximation algorithms of traditional metrics are developed and used to speed up analysis for big data networks.

A straightforward approach for algorithmic optimization of network algorithms for big data is parallelization. The abovementioned algorithms closeness and betweenness centralities are based on all-pairs shortest path calculation. In other words, the algorithm starts at a node, follows its links, and visits all other nodes in concentric circles. The calculation for one node is independent from the calculation for all other nodes; thus, different processors or different computers can jointly calculate a metric with very little coordination overhead.

Approximation algorithms try to estimate a centrality metric based on a small part of the actual calculations. The calculations of the all-pairs shortest path calculation can be restricted in two ways. First, we can limit the centrality calculation to the k-step neighborhood of nodes, i.e., instead of visiting all other nodes in concentric circles, we stop at a distance k. Second, instead of all nodes, we just select a small proportion of nodes as starting points for the shortest path calculations. Both approaches can speed up calculation time tremendously as just a small proportion of the calculations are needed to create these results. Surprisingly, these approximated results have very high accuracy. This is because real-world networks are far from random and have specific characteristics. For instance, networks created from social interactions among people often have core-periphery structure and are highly clustered. These characteristics facilitate the accuracy of centrality approximation calculations. In the context of optimizing and approximating traditional network metrics, a major future challenge will be to estimate time/fidelity trade-offs (e.g., develop confidence intervals for network metrics) and to build systems that incorporate the constraints of user and infrastructure into the calculations.
This is especially crucial as certain network metrics are very sensitive, and small changes in the data can lead to big changes in the results.
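The second approximation strategy described above – sampling a subset of nodes as starting points for the shortest-path calculations – is available directly in NetworkX and is shown here purely for illustration on a random graph; the graph size and sample size are arbitrary choices for the example.

import networkx as nx

# A random stand-in for a larger network (1,000 nodes, about 5,000 edges).
G = nx.gnm_random_graph(1000, 5000, seed=42)

# Exact betweenness centrality: shortest paths from every node (slow on big graphs).
exact = nx.betweenness_centrality(G)

# Approximate betweenness centrality: shortest paths from only 100 sampled pivot nodes.
approx = nx.betweenness_centrality(G, k=100, seed=42)

# Compare the two scores for a handful of nodes.
for node in list(G.nodes())[:5]:
    print(node, round(exact[node], 4), round(approx[node], 4))

# Because each pivot's shortest-path pass is independent of the others,
# the sampled pivots could also be split across processors or machines.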
New algorithms are especially developed for very large networks. These algorithms have subquadratic complexity so that they are applicable to very large networks. Vladimir Batagelj and Andrej Mrvar have developed a broad array of new metrics and a network analytical tool called “Pajek” to analyze networks with tens of millions of nodes.

However, some networks are too big to fit into the memory of a single computer. Imagine a network with 1 billion nodes and 100 billion edges – social media networks have already reached this size. Such a network would require a computer with about 3,000 gigabytes of RAM to hold the pure network structure with no additional information. Even though supercomputer installations already exist that can cope with these requirements, they are rare and expensive. Instead, researchers make use of computer clusters and analytical software optimized for distributed systems, like Hadoop.

Streaming Data

Most modern big data networks come from streaming data of interactions. Messages are sent among nodes, people call each other, and data flows are measured among servers. The observed data consist of dyadic interactions. As the nodes of the dyads overlap over time, we can extract networks. Even though networks extracted from streaming data are inherently dynamic, the actual analysis of these networks is often done with static metrics, e.g., by comparing the networks created from daily aggregations of data. The most interesting research questions with respect to streaming data are related to change detection. Centrality metrics for every node or network-level indices that describe the structure of the network can be calculated for every time interval. Looking at these values as time series can help to identify structural change in the dynamically changing networks over time.
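The following sketch illustrates that idea of change detection on streaming interaction data. The event list, the daily aggregation, and the simple deviation rule are all assumptions made for the example; NetworkX is again used only as a convenient way to build each daily snapshot.

from collections import defaultdict
import statistics
import networkx as nx

# Hypothetical stream of (day, sender, receiver) interaction events.
events = [
    (1, "a", "b"), (1, "b", "c"), (1, "a", "c"),
    (2, "a", "b"), (2, "c", "d"),
    (3, "a", "b"), (3, "a", "c"), (3, "a", "d"), (3, "b", "d"), (3, "c", "d"),
]

# Aggregate the stream into one network snapshot per day.
per_day = defaultdict(list)
for day, src, dst in events:
    per_day[day].append((src, dst))

# Network-level index (here: density) computed for every time interval.
series = {}
for day in sorted(per_day):
    G = nx.Graph()
    G.add_edges_from(per_day[day])
    series[day] = nx.density(G)

# Flag days whose value deviates strongly from the mean of the series.
mean = statistics.mean(series.values())
stdev = statistics.pstdev(series.values())
for day, value in series.items():
    flag = "possible structural change" if stdev and abs(value - mean) > stdev else ""
    print(day, round(value, 3), flag)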
Visualizing Big Data Networks

Visualizing networks can be a very efficient analytical approach as human perception is capable of identifying complex structures and patterns. To facilitate visual analytics, algorithms are needed that present network data in an interpretable way. One of the major challenges for network visualization algorithms is to calculate the positions of the nodes of the network in a way that reveals the structure of the network, i.e., show communities and put important nodes in the center of the figure. The algorithmic challenges for visualizing big networks are very similar to the ones discussed above. Most commonly used layout algorithms scale very poorly. Ulrich Brandes and Christian Pich developed a layout algorithm based on eigenvector analysis that can be used to visualize networks with millions of nodes. The method that they applied is similar to the aforementioned approximation approaches. As real-world networks normally have a certain topology that is far from random, calculating just a part of the actual layout algorithm can be a good enough approximation to reveal interesting aspects of a network.

Networks are often enriched with additional information about the nodes or the edges. We often know the gender or the location of people. Nodes might represent different types of infrastructure elements. We can incorporate this information by mapping data to visual elements of our network visualization. Nodes can be visualized with different shapes (circles, boxes, etc.) and can be colored with different colors, resulting in multivariate network drawings. Adding contextual information to compelling network visualizations can make the difference between pretty pictures and valuable pieces of information visualization.

Methodological Challenges

Besides algorithmic issues, we also face serious conceptual challenges when analyzing big data networks. Many “traditional” network analytical metrics were developed for groups of tens of people.
Applying the same metrics to very big networks raises questions whether the algorithmic assumptions or the interpretations of results are still valid. For instance, the abovementioned metrics closeness and betweenness centralities just incorporate the shortest paths between every pair of nodes, ignoring possible flow of information on non-shortest paths. Even more, these metrics do not take path length into account. In other words, a node on a shortest path of length two and a node on a shortest path of length eight are treated identically. Most likely this does not reflect real-world assumptions of information flow. All these issues can be addressed by applying different metrics that incorporate all possible paths or a random selection of paths with length k. In general, when accomplishing network analytics, we need to ask which of the existing network algorithms are suitable, and under which assumptions, to be used for very large networks. Moreover, what research questions are appropriate for very large networks? Does being a central actor in a group of high school kids have the same interpretation as being a central user of an online social network with millions of users?

Conclusions

Networks are everywhere in big data. Analyzing these networks can be challenging. Due to the very nature of network data and algorithms, many traditional approaches of handling and analyzing these networks are not scalable. Nonetheless, it is worthwhile coping with these challenges. Researchers from different academic areas have been optimizing existing and developing new metrics and methodologies, as network analytics can provide unique insights into big data.

Cross-References

▶ Algorithmic Complexity
▶ Complex Networks
▶ Data Visualization
▶ Streaming Data

Further Readings

Batagelj, V., Mrvar, A., & de Nooy, W. (2011). Exploratory social network analysis with Pajek (Expanded edition). New York: Cambridge University Press.
Brandes, U., & Pich, C. (2007). Eigensolver methods for progressive multidimensional scaling of large data. Proceedings of the 14th International Symposium on Graph Drawing (GD’06), 42–53.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social Networks, 1(3), 215–239.
Hennig, M., Brandes, U., Pfeffer, J., & Mergel, I. (2012). Studying social networks: A guide to empirical research. Frankfurt: Campus Verlag.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge: Cambridge University Press.
Nutrition

Qinghua Yang¹ and Yixin Chen²
¹Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA
²Department of Communication Studies, Sam Houston State University, Huntsville, TX, USA

Nutrition is a science that helps people to make good choices of foods to keep healthy, by identifying the amount of nutrients they need and the amount of nutrients each food contains. Nutrients are chemicals obtained from diet and are indispensable to people’s health. Keeping a balanced diet containing all essential nutrients can prevent people from diseases caused by nutritional deficiencies such as scurvy and pellagra.

Although the United States has one of the most advanced nutrition sciences in the world, the nutrition status of the U.S. population is not optimistic. While nutritional deficiencies as a result of dietary inadequacies are not very common, many Americans are suffering from overconsumption-related diseases. Due to the excessive intake of sugar and fat, the prevalence of overweight and obesity in the American adult population increased from 47% to over 65% over the past three decades; currently two-thirds of American adults are overweight, and among them 36% are obese. Overweight and obesity are concerns not only for the adult population, but also for the childhood population, with one third of American children being overweight or obese. Obesity kills more than 2.8 million Americans every year, and the obesity-related health problems cost American taxpayers more than $147 billion every year. Thus, reducing the obesity prevalence in the United States has become a national health priority.

Big data research on nutrition holds tremendous promise for preventing obesity and improving population health. Recently, researchers have been trying to apply big data to nutritional research, by taking advantage of the increasing amount of nutritional data and the accumulation of nutritional studies. Big data is a collection of data sets, which are large in volume and complex in structure. For instance, the data managed by America’s leading health care provider Kaiser is more than 4,000 times the amount of information stored in the Library of Congress. As to data structure, nutritional data and ingredients are really difficult to normalize. The volume and complexity of nutritional big data make it difficult to process them using traditional data analytic techniques.

Big data analyses can provide more valuable information than traditional data sets and reveal hidden patterns among variables. In a big data study sponsored by the National Bureau of Economic Research, economists Matthew Harding and Michael Lovenheim analyzed data of over 123 million purchasing decisions on food and beverage made in the U.S. between 2002 and 2007 and simulated the effects of various taxes
on Americans’ buying habits. Their model predicted that an increase of 20% tax on sugar would reduce Americans’ total caloric intake by 18% and reduce sugar consumption by over 16%. Based on their findings, they proposed a new policy of implementing a broad-based tax on sugar to improve public health. In another big-data study on human nutrition, two researchers at West Virginia University tried to understand and monitor the nutrition status of a population. They designed intelligent data collection strategies and examined the effects of food availability on obesity occurrence. They concluded that modifying environmental factors (e.g., availability of healthy food) could be the key in obesity prevention.

Big data can be applied to self-tracking, that is, monitoring one’s nutrition status. An emerging trend in big data studies is the quantified self (QS), which refers to keeping track of one’s nutritional, biological, and physical information, such as calories consumed, glycemic index, and specific ingredients of food intake. By pairing the self-tracking device with a web interface, QS solutions can provide users with nutrient-data aggregation, infographic visualization, and personal recommendations for diet.

Big data can also enable researchers to monitor global food consumption. One pioneering project is the Global Food Monitoring Group conducted by the George Institute for Global Health with participation from 26 countries. With the support of these countries, the Group is able to monitor the nutrition composition of various foods consumed around the world, identify the most effective food reformulation strategies, and explore effective approaches to food production and distribution by food companies in different countries.

Thanks to the development of modern data collection and analytic technologies, the amount of nutritional, dietary, and biochemical data continues to increase at a rapid pace, along with a growing accumulation of nutritional epidemiologic studies during this time. The field of nutritional epidemiology has witnessed a substantial increase in systematic reviews and meta-analyses over the past two decades. There were 523 meta-analyses and systematic reviews within the field of nutritional epidemiology in 2013 versus just 1 in 1985. However, in the era of “big data”, there is an urgent need to translate big-data nutrition research to practice, so that doctors and policymakers can utilize this knowledge to improve individual and population health.

Controversy

Despite the exciting progress of big-data application in nutrition research, several challenges are equally noteworthy. First, to conduct big-data nutrition research, researchers often need access to a complete inventory of foods purchased in all retail outlets. This type of data, however, is not readily available, and gathering such information site by site is a time-consuming and complicated process. Second, information provided by nutrition big data may be incomplete or incorrect. For example, when doing self-tracking for nutrition status, many people fail to do consistent daily documentation or suffer from poor recall of food intake. Also, big data analyses may be subject to systematic biases and generate misleading research findings. Lastly, since an increasing amount of personal data is being generated through quantified self-tracking devices, it is important to consider privacy rights in personal data. That individuals’ personal nutritional data should be well protected and that data shared and posted publicly should be used appropriately are key ethical issues for nutrition researchers and practitioners. In light of these challenges, technical, methodological, and educational interventions are needed to deal with issues related to big-data accessibility, errors, and abuses.

Cross-References

▶ Biomedical Data
▶ Data Mining
▶ Diagnostics
▶ Health Informatics
Further Readings

Harding, M., & Lovenheim, M. (2017). The effect of prices on nutrition: Comparing the impact of product- and nutrient-specific taxes. Journal of Health Economics, 53.
Insel, P., et al. (2013). Nutrition. Boston: Jones and Bartlett Publishers.
Satija, A., & Hu, F. (2014). Big data and systematic reviews in nutritional epidemiology. Nutrition Reviews, 72(12).
Swan, M. (2013). The quantified self: Fundamental disruption in big data science and biological discovery. Big Data, 1(2).
WVU Today. WVU researchers work to track nutritional habits using ‘Big Data’. http://wvutoday.wvu.edu/n/2013/01/11/wvu-researchers-workto-track-nutritional-habits-using-big-data. Accessed Dec 2014.
O

Online Advertising

Yulia A. Strekalova
College of Journalism and Communications, University of Florida, Gainesville, FL, USA

In a broad sense, online advertising means advertising through cross-referencing on a business’s own web portal or on the websites of other online businesses. The goal of online advertising is to attract attention to advertised websites and products and, potentially, lead to an enquiry about a project, mail list subscription, or product purchase. Online advertising creates new cost-saving opportunities for businesses by reducing some of the risks of ineffective advertising resources. Online advertising types include banners, targeted ads, and social media community interactions, and each type requires careful planning and consideration of potential ethical challenges.

Online advertising analytics and measurement is necessary to assess the effectiveness of advertising efforts and the return on the investment of funds. However, measurement is challenged by the fact that advertising across media platforms is increasingly interactive. For example, a TV commercial may lead to an online search, which will result in a relevant online ad, which may lead to a sale. Vast amounts of data and powerful analytics are necessary to allow advertisers to perform high-definition cross-channel analyses of the public and its behaviors, evaluate the return on investments across media, generate predictive models, and modify their campaigns in near-real time. The proliferation of data collection gave rise to increased concerns among Internet users and advocacy groups. As user data are collected by and shared among multiple parties, they may, taken together, become personally identifiable to a particular person.

Types of Online Advertising

Online advertising, a multibillion-dollar industry today, started from a single marketing email offering a new computer system sent in 1978 to 400 users of the Advanced Research Projects Agency Network (ARPAnet). While the reactions to this first online advertising campaign were negative and identified the message as spam, email and forum-based advertising continued to develop and grow. In 1993, a company called Global Network Navigator sold the first clickable online ad. AT&T, one of the early adopters of this advertising innovation, received clicks from almost half of the Internet users who were exposed to its “Have you ever clicked your mouse right HERE? – You will.” banner ad. In the 1990s, the online advertising industry was largely fragmented, but the first ad networks started to appear and offer their customers opportunities to develop advertising campaigns that would place ads across a diverse set of websites and reach particular audience segments. An advertising banner may be placed on
high-traffic sites statically for a predefined period of time. While this method may be the least costly and can be targeted to a niche audience, it does not allow for rich data collection. Banner advertising is a less sophisticated form of online advertising. Banner ads can also be sold using a hybrid of cost per mille (CPM), or cost per thousand, pricing, another advertising option for delivering an ad to website users. This option is usually priced per 1,000 impressions (the number of times an ad was shown), with an additional cost for clicks. It also allows businesses to assess how many times an ad was shown; however, this method is limited in its ability to measure whether the return on an advertising investment covered the costs. Moreover, the proliferation of banners and the overall volume of information on sites lead to "banner blindness" among Internet users. In addition, with the rapid rise of mobile phones as Internet connection devices, the average effectiveness of banners became even lower. The use of banner and pop-up ads increased in the late 1990s and early 2000s, but Internet users started to block these ads with pop-up blockers, and clicks on banner ads dropped to about 0.1%.

The next innovation in online advertising is tied to the growing sophistication of search engines. Search engines began to allow advertisers to place ads relevant to particular keywords. Tying advertising to relevant search keywords gave rise to pay-per-click (PPC) advertising. PPC provides advertisers with the most robust data for assessing whether expended costs generated a sufficient return. PPC advertising means that advertisers are charged per click on an ad. This advertising method ties exposure to advertising to an action from a potential consumer, thus providing advertisers with data on which sites are more effective. Google AdWords is an example of pay-per-click advertising that is linked to the keywords and phrases used in search. AdWords ads are matched to these keywords and shown only to Internet users with relevant searches. By using PPC in conjunction with a search engine, like Google, Bing, or Yahoo, advertisers can also obtain insights on the environment or search terms that led a consumer to the ad in the first place.

Online advertising may also include direct newsletter advertising delivered to potential customers who have purchased before. However, the decision to use this form of advertising should be coupled with an ethical way of employing it. Email addresses have become a commodity and can be bought, but a newsletter sent to users who never bought from a company may backfire and lead to unintended negative consequences. Overall, this low-cost advertising method can be effective in keeping past customers informed about new products and other campaigns run by the company.

Social media is another advertising channel, one that is rapidly growing in popularity. Social media networks have created repositories of psychographic data, which include user-reported demographic information, hobbies, travel destinations, life events, and topics of interest. Social media can be used as a more traditional advertising channel for PPC ad placements, but it can also serve as a base for customer engagement. Social media, although requiring a commitment and time investment from advertisers, may generate brand loyalty. Social media efforts therefore require careful evaluation, as they can be costly both in direct advertising costs and in the time spent by company employees on developing and executing social media campaigns and keeping the flow of communication active. Data collected from social media channels can be analyzed at the individual level, which was nearly impossible with earlier online advertising methods. Companies can collect information about specific user communication and engagement behavior, track communication activities of individual users, and analyze comments shared by social media users. At the same time, aggregate data may allow for general sentiment analysis to assess whether overall comments about a brand are positive or negative and to seek out product-related signals shared by users. Social media evaluation, however, is challenged by the absence of a deep understanding of audience engagement metrics and a lack of industry-wide benchmarks and evaluation standards. As a fairly new area of
advertising, social media evaluation of likes, comments, and shares may be interpreted in a number of ways. Social media networks provide a framework for a new type of advertising, community exchange, but they are also channels of online advertising through real-time advertising targeting. It is likely that focused targeting will continue to be the focus of advertisers, as it increases the effectiveness of advertising efforts. At the same time, tracking of user behavior throughout the Web creates privacy concerns and policy challenges.

Targeting

Innovations in online advertising introduced targeting techniques that base advertising on the past browsing and purchase behaviors of Internet users. The proliferation of data collection enabled advertisers to target potential clients based on a multitude of web activities, like site browsing, keyword searches, past purchasing across different merchants, etc. These targeting techniques led to the development of data collection systems that track user activity in real time and decide whether or not to advertise right as the user is browsing a particular page. Online advertising lacks rigorous standardization, and several targeting typologies have recently been proposed. Reviewing strategies for online advertising, Gabriela Taylor identifies nine distinct targeting methods, which overlap with or complement the discussions of targeting methods proposed by other authors. In general, targeting refers to a situation in which the ads shown to an Internet user are relevant to that user's interests. The latter are determined by the keywords used in searches, the pages visited, or the online purchases made.

Contextual targeting delivers ads to web users based on the content of the sites these users visit. In other words, contextually targeted advertising matches ads to the content of the webpage an Internet user is browsing. Systems managing contextual advertising scan websites for keywords and place ads that match these keywords most closely. For example, a user viewing a website about gardening may see ads for gardening and housekeeping magazines or home improvement stores.

Geo, or local, targeting is focused on determining the geographical location of a website visitor. This information, in turn, is used to deliver ads that are specific to a particular location, country, region or state, city, or metro area. In some cases, targeting can go as deep as the organizational level. The Internet protocol (IP) address assigned to each device participating in a computer network is used as the primary data point in this targeting method. The use of this method may prevent the delivery of ads to users where a product or service is not available – for example, a content restriction for Internet television or region-specific advertising that complies with regional regulations.

Demographic targeting, as implied by its name, tailors ads based on website users' demographic information, like gender, age, income and education level, marital status, ethnicity, language preferences, and other data points. Users may supply this information during social networking site registration. The sites, additionally, may encourage their users to "complete" their profiles after the initial registration to get access to the fullest set of data.

Behavioral targeting looks at users' declared or expressed interests to tailor the content of delivered ads. Web-browsing information, data on the pages visited, the amount of time spent on particular pages, metadata for the links that were clicked, the searches conducted recently, and information about recent purchases are collected and analyzed by advertisement delivery systems to select and display the most relevant ads. In a sense, website publishers can create user profiles based on the collected data and use them to predict future browsing behavior and potential products of interest. This approach, using rich past data, allows advertisers to target their ads more effectively to the page visitors who are more likely to have an interest in these products or services. Combined with other strategies, including contextual, geographic, and demographic targeting, this approach may lead to finely tuned and interest-tailored ads. The approach proves effective: several studies showed that although Internet users
prefer to have no ads on the web pages they visit, they favor relevant ads over random ones.

DayPart, or time-based, targeting runs during specific times of the day or the week, for example, 10 am to 10 pm local time Monday through Friday. Ads targeted with this method are displayed only during these days and times and go off during the off-times. Ads run through DayPart campaigns may focus on time-limited offers and create a sense of urgency among audience members. At the same time, such ads may create an increased sense of monitoring and lack of privacy among the users exposed to them.

Real-time targeting allows ad placement systems to place bids for advertisement placement in real time. Additionally, this advertising method makes it possible to track every unique site user and to collect real-time data to assess the likelihood that each visitor will make a purchase.

Affinity targeting creates a partnership between a product producer and an interest-based organization to promote the use of a third-party product. This method targets customers who share an interest in a particular topic. These customers are assumed to have a positive attitude toward a website they visit and therefore a positive attitude toward more relevant advertising. This method is akin to niche advertising, and its success is based on the close match between the advertising content and the passions and interests of website users.

Look-alike targeting aims to identify prospective customers who are similar to the advertiser's customer base. Original customer profiles are determined based on the website use and previous behaviors of active customers. These profiles are then matched against a pool of independent Internet users who share common attributes and behaviors and are likely targets for an advertised product. Identifying these look-alike audiences is challenged by the large number of possible input data points, which may or may not be defining for a particular behavior or user group.

Act-alike targeting is an outcome of predictive analytics. Advertisers using this method define profiles of customers based on their information consumption and spending habits. Customers and their past behaviors are identified, and customers are segmented into groups to predict their future purchase behavior. The goal of this method is to identify the most loyal group of customers, who generate revenue for the company, and to engage with this group in the most effective and supportive way.

Privacy Concerns

Technology is developing at a speed too rapid for policy-making to catch up. Whichever advertising targeting method is used, each is based on extended collection and analysis of personal and behavioral data for each user. Ongoing and potentially pervasive data collection raises important privacy questions and concerns. Omer Tene and Jules Polonetsky identify several privacy risks associated with big data. First is an incremental adverse effect on privacy from an ongoing accumulation of information. More and more data points are collected about individual Internet users, and once information about a real identity has been linked to a user's virtual identity, anonymity is lost. Furthermore, disassociation of a user from a particular service may be insufficient to break a previously existing link, as other networks and online resources may have already harvested the missing data points. The second area of privacy risk is automated decision-making. Automated algorithms may lead to discrimination and threats to self-determination. Targeting and profiling used in online advertising give ground to potential threats to free access to information and to an open, democratic society. The third area of privacy concern is predictive analysis, which may identify and predict stigmatizing behaviors or characteristics, like susceptibility to disease or undisclosed sexual orientation. In addition, predictive analysis may give ground to social stratification by putting users in like-behaving clusters and ignoring outliers and minority groups. Finally, the fourth area of concern is the lack of access to information and the exclusion of smaller organizations and individuals from the benefits of big data. Large organizations are able to collect and use big data to price products
close to an individual's reservation price or cornering an individual with a deal impossible to resist. At the same time, large organizations are seldom forthcoming with sharing individuals' information with those individuals in an accessible and understandable format.

Cross-References

▶ Advertising Self-Regulatory Council, Council of Better Business Bureaus
▶ Content Management
▶ Data-Driven Marketing
▶ Data-Information-Knowledge-Wisdom (DIKW) Pyramid, Framework, Continuum
▶ Predictive Analytics
▶ Social Media

Further Readings

Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley.
Taylor, G. (2013). Advertising in a digital age: Best practices & tips for paid search and social media advertising. Global & Digital.
Tene, O., & Polonetsky, J. (2013). Privacy in the age of big data: A time for big decisions. Stanford Law Review Online, 11/5.
Turow, J. (2012). The daily you: How the advertising industry is defining your identity and your worth. New Haven: Yale University Press.
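The banner pricing arithmetic described in this entry combines a per-thousand-impressions charge with a per-click charge. The following minimal Python sketch illustrates that computation; the rates, impression count, and click rate are invented for illustration and are not drawn from this entry's sources.

    def campaign_cost(impressions, clicks, cpm_rate, cpc_rate):
        # Hybrid pricing: (impressions / 1,000) * CPM rate plus a per-click charge.
        return (impressions / 1000.0) * cpm_rate + clicks * cpc_rate

    impressions = 250_000                # times the banner was shown (hypothetical)
    clicks = int(impressions * 0.001)    # roughly the 0.1% banner click rate cited above
    total = campaign_cost(impressions, clicks, cpm_rate=2.50, cpc_rate=0.40)
    print(f"{clicks} clicks, total cost ${total:.2f}")
    print(f"effective cost per click ${total / clicks:.2f}")

Comparing the effective cost per click across channels is one rough way an advertiser might judge whether a banner buy or a pay-per-click placement returned more for the same spend.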

Online Identity

Catalina L. Toma
Communication Science, University of Wisconsin-Madison, Madison, WI, USA

Identity refers to the stable ways in which individuals or organizations think of and express themselves. The availability of big data has enabled researchers to examine online communicators' identity using generalizable samples. Empirical research to date has focused on personal, rather than organizational, identity, and on social media platforms, particularly Facebook and Twitter, given that these platforms require users to present themselves and their daily reflections to audiences. Research to date has investigated the following aspects of online identity: (1) expression, or how users express who they are, especially their personality traits and demographics (e.g., gender, age), through social media activity; (2) censorship, or how users suppress their urges to reveal aspects of themselves on social media; (3) detection, or the extent to which it is possible to use computational tools to infer users' identity from their social media activity; (4) audiences, or who users believe accesses their social media postings and whether these beliefs are accurate; (5) families, or the extent to which users include family ties as part of their identity portrayals; and (6) culture, or how users express their identities in culturally determined ways. Each of these areas of research is described in detail below.

Identity Expression

In its early days, the Internet appealed to many users because it allowed them to engage with one another anonymously. However, in recent years, users have overwhelmingly migrated toward personalized interaction environments, where they reveal their real identities and often connect with members of their offline networks. Such is the case with social media platforms. Therefore, research has taken great interest in how users communicate various aspects of their identities to their audiences in these personalized environments.

One important aspect of people's identities is their personality. Big data has been used to examine how personality traits get reflected in people's social media activity. How do people possessing various personality traits talk, connect, and present themselves online? The development of the myPersonality Facebook application was instrumental in addressing these questions. myPersonality administers personality questionnaires to Facebook users and then informs them of their personality typology in exchange for access to all their Facebook data. The application has attracted millions of volunteers on Facebook and has enabled researchers to correlate Facebook activities with personality traits. The application,

used in all the studies summarized below, measures personality using the Big Five Model, which specifies five basic personality traits: (1) extraversion, or an individual's tendency to be outgoing, talkative, and socially active; (2) agreeableness, or an individual's tendency to be compassionate, cooperative, trusting, and focused on maintaining positive social relations; (3) openness to experience, or an individual's tendency to be curious, imaginative, and interested in new experiences and ideas; (4) conscientiousness, or an individual's tendency to be organized, reliable, consistent, and focused on long-term goals and achievement; and (5) neuroticism, or an individual's tendency to experience negative emotions, stress, and mood swings.

One study conducted by Yoram Bachrach and his colleagues investigated the relationship between Big Five personality traits and Facebook activity for a sample of 180,000 users. Results show that individuals high in extraversion had more friends, posted more status updates, participated in more groups, and "liked" more pages on Facebook; individuals high in agreeableness appeared in more photographs with other Facebook users but "liked" fewer Facebook pages; individuals high in openness to experience posted more status updates, participated in more groups, and "liked" more Facebook pages; individuals high in conscientiousness posted more photographs but participated in fewer groups and "liked" fewer Facebook pages; and individuals high in neuroticism had fewer friends but participated in more groups and "liked" more Facebook pages. A related study, conducted by Michal Kosinski and his colleagues, replicated these findings on a sample of 350,000 American Facebook users, the largest dataset to date on the relationship between personality and Internet behavior.

Another study examined the relationship between personality traits and word usage in the status updates of over 69,000 English-speaking Facebook users. Results show that personality traits were indeed reflected in natural word use. For instance, extroverted users used words reflecting their sociable nature, such as "party," whereas introverted users used words reflecting their more solitary interests, such as "reading" and "Internet." Similarly, highly conscientious users expressed their achievement orientation through words such as "success," "busy," and "work," whereas users high in openness to experience expressed their artistic and intellectual pursuits through words like "dreams," "universe," and "music."

In sum, this body of work shows that people's identity, operationalized as personality traits, is illustrated in the actions they undertake and the words they use on Facebook. Given social media platforms' controllable nature, which allows users time to ponder their claims and the ability to edit them, researchers argue that these digital traces likely illustrate users' intentional efforts to communicate their identity to their audience, rather than being unintentionally produced.

Identity Censorship

While identity expression is frequent in social media and, as discussed above, illustrated by behavioral traces, sometimes users suppress identity claims despite their initial impulse to divulge them. This process, labeled "last-minute self-censorship," was investigated by Sauvik Das and Adam Kramer using data from 3.9 million Facebook users over a period of 17 days. Censorship was measured as instances when users entered text in the status update or comment boxes on Facebook but did not post it in the next 10 min. The results show that 71% of the participants censored at least one post or comment during the time frame of the study. On average, participants censored 4.52 posts and 3.20 comments. Notably, 33% of all posts and 13% of all comments written by the sample were censored, indicating that self-censorship is a fairly prevalent phenomenon. Men censored more than women, presumably because they are less comfortable with self-disclosure. This study suggests that Facebook users take advantage of controllable media affordances, such as editability and unlimited composition time, in order to manage their identity claims. These self-regulatory efforts are perhaps a response to the challenging nature of addressing large and diverse audiences, whose
interpretation of the poster's identity claims may be difficult to predict.

Identity Detection

Given that users leave digital traces of their personal characteristics on social media platforms, research has been concerned with whether it is possible to infer these characteristics from social media activity. For instance, can we deduce users' gender, sexual orientation, or personality from their explicit statements and patterns of activity? Is their identity implicit in their social media activity, even though they might not disclose it explicitly?

One well-publicized study by Michal Kosinski and his colleagues sought to predict Facebook users' personal characteristics from their "likes" – that is, Facebook pages dedicated to products, sports, music, books, restaurants, and interests – that users can endorse and with which they can associate by clicking the "like" button. The study used a sample of 58,000 volunteers recruited through the myPersonality application. Results show that, based on Facebook "likes," it is possible to predict a user's ethnic identity (African-American vs. Caucasian) with 95% accuracy, gender with 93% accuracy, religion (Christian vs. Muslim) with 82% accuracy, political orientation (Democrat vs. Republican) with 85% accuracy, sexual orientation among men with 88% accuracy and among women with 75% accuracy, and relationship status with 65% accuracy. Certain "likes" stood out as having particularly high predictive ability for Facebook users' personal characteristics. For instance, the best predictors of high intelligence were "The Colbert Report," "Science," and, unexpectedly, "curly fries." Conversely, low intelligence was indicated by "Sephora," "I Love Being a Mom," "Harley Davidson," and "Lady Antebellum."

In the area of personality, two studies found that users' extraversion can be most accurately inferred from Facebook profile activity (e.g., group membership, number of friends, number of status updates); neuroticism, conscientiousness, and openness to experience can be reasonably inferred; and agreeableness cannot be inferred at all. In other words, Facebook activity renders extraversion highly visible and agreeableness opaque.

Language can also be used to predict online communicators' identity, as shown by Andrew Schwartz and his colleagues in a study of 15.4 million Facebook status updates, totaling over 700 million words. Language choice, including words, phrases, and topics of conversation, was used to predict users' gender, age, and Big Five personality traits with high accuracy.

In sum, this body of research suggests that it is possible to infer many facets of Facebook users' identity through automated analysis of their online activity, regardless of whether they explicitly choose to divulge this identity. While users typically choose to reveal their gender and ethnicity, they can be more reticent in disclosing their relational status or sexual orientation and might themselves be unaware of their personality traits or intelligence quotient. This line of research raises important questions about users' privacy and the extent to which this information, once automatically extracted from Facebook activity, should be used by corporations for marketing or product optimization purposes.

Real and Imagined Audience for Identity Claims

The purpose of many online identity claims is to communicate a desired image to an audience. Therefore, the process of identity construction involves understanding the audience and targeting messages to them. Social media, such as Facebook and Twitter, where identity claims are posted very frequently, pose a conundrum in this regard, because audiences tend to be unprecedentedly large, sometimes reaching hundreds and thousands of members, and diverse. Indeed, "friends" and "followers" are accrued over time and often belong to different social circles (e.g., high school, college, employment). How do users conceptualize their audiences on social media platforms? Are users' mental models of their audiences accurate?
These questions were addressed by Michael Bernstein and his colleagues in a study focusing specifically on Facebook users. The study used a survey methodology, where Facebook users indicated their beliefs about how many of their "friends" viewed their Facebook postings, coupled with large-scale log data for 220,000 Facebook users, where researchers captured the actual number of "friends" who viewed users' postings. Results show that, by and large, Facebook users underestimated their audiences. First, they believed that any specific status update they posted was viewed, on average, by 20 "friends," when in fact it was viewed by 78 "friends." The median estimate of the audience size for any specific post was only 27% of the actual audience size, meaning that participants underestimated the size of their audience by a factor of 4. Second, when asked how many total audience members they had for their profile postings during the past month, Facebook users believed it was 50, when in fact it was 180. The median perceived audience for the Facebook profile, in general, was only 32% of the actual audience, indicating that users underestimated their cumulative audience by a factor of 3. Slightly less than half of Facebook users indicated they wanted a larger audience for their identity claims than they thought they had, ironically failing to understand that they did in fact have this larger audience. About half of Facebook users indicated that they were satisfied with the audience they thought they had, even though their audience was actually much greater than they perceived it to be. Overall, this study highlights a substantial mismatch between users' beliefs about their audiences and their actual audiences, suggesting that social media environments are translucent, rather than transparent, when it comes to audiences. That is, actual audiences are somewhat opaque to users, who as a result may fail to properly target their identity claims to their audiences.

Family Identity

One critical aspect of personal identity is family ties. To what extent do social media users reveal their family connections to their audience, and how do family members publicly talk to one another on these platforms? Moira Burke and her colleagues addressed these questions in the context of parent-child interactions on Facebook. Results show that 37.1% of English-speaking US Facebook users specified either a parent or child relationship on the site. About 40% of teenagers specified at least one parent on their profile, and almost half of users age 50 or above specified a child on their profile. The most common family ties were between mothers and daughters (41.4% of all parent-child ties), followed by mothers and sons (26.8%), fathers and daughters (18.9%), and least of all fathers and sons (13.1%). However, Facebook communication between parents and children was limited, accounting for only 1–4% of users' public Facebook postings. When communication did happen, it illustrated family identities: Parents gave advice to children, expressed affection, and referenced extended family members, particularly grandchildren.

Cultural Identity

Another critical aspect of personal identity is cultural identity. Is online communicators' cultural identity revealed by their communication patterns? Jaram Park and colleagues show that Twitter users create emoticons that reflect an individualistic or collectivistic cultural orientation. Specifically, users from individualistic cultures preferred horizontal and mouth-oriented emoticons, such as :), whereas users from collectivistic cultures preferred vertical and eye-oriented emoticons, such as ^_^. Similarly, a study of self-expression using a sample of four million Facebook users from several English-speaking countries (USA, Canada, UK, Australia) shows that members of these cultures can be differentiated through their use of formal or informal speech, the extent to which they discuss positive personal events, and the extent to which they discuss school. In sum, this research shows that cultural identity is evident in linguistic self-expression on social media platforms.
Cross-References

▶ Anonymity
▶ Behavioral Analytics
▶ Facebook
▶ Privacy
▶ Profiling
▶ Psychology
▶ Twitter

Further Readings

Bachrach, Y., et al. (2012). Personality and patterns of Facebook usage. In Proceedings of the 3rd Annual Web Science Conference (pp. 24–32). Association for Computing Machinery.
Bernstein, M., et al. (2013). Quantifying the invisible audience in social networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 21–30). Association for Computing Machinery.
Burke, M., et al. (2013). Families on Facebook. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM) (pp. 41–50). Association for the Advancement of Artificial Intelligence.
Das, S., & Kramer, A. (2013). Self-censorship on Facebook. In Proceedings of the 2013 Conference on Computer-Supported Cooperative Work (pp. 793–802). Association for Computing Machinery.
Kern, M., et al. (2014). The online social self: An open vocabulary approach to personality. Assessment, 21, 158–169.
Kosinski, M., et al. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110, 5802–5805.
Kramer, A., & Chung, C. (2011). Dimensions of self-expression in Facebook status updates. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM) (pp. 169–176). Association for the Advancement of Artificial Intelligence.
Park, J., et al. (2014). Cross-cultural comparison of nonverbal cues in emoticons on twitter: Evidence from big data analysis. Journal of Communication, 64, 333–354.
Schwartz, A., et al. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS One, 8, e73791.
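The "like"-based prediction studies summarized in this entry can be illustrated, in highly simplified form, with an off-the-shelf classifier. The Python sketch below is hypothetical: the page names, like matrix, and trait labels are invented, and the published studies worked with far larger samples and additional dimensionality-reduction steps rather than raw like indicators.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Rows are users; columns mark whether each user "liked" three toy pages.
    # Hypothetical column order: ["Science", "Curly Fries", "Harley Davidson"]
    likes = np.array([
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
        [0, 0, 0],
    ])
    trait = np.array([1, 1, 0, 0, 1, 0])     # invented binary trait labels

    model = LogisticRegression().fit(likes, trait)
    new_user = np.array([[1, 0, 1]])         # a user who likes the first and third pages
    print(model.predict(new_user), model.predict_proba(new_user).round(2))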

Open-Source Software

Marc-David L. Seidel
Sauder School of Business, University of British Columbia, Vancouver, BC, Canada

Open-source software refers to computer software where the copyright holder provides anybody the right to edit, modify, and distribute the software free of charge. The initial creation of such software spawned the open-source movement. Frequently, the only limitation on the intellectual property rights is that any subsequent changes made by others are required to be made with similarly open intellectual property rights. Such software is often developed in an open collaborative manner by a Community Form (C-form) organization. A large percentage of the internet infrastructure is operated utilizing such software, which handles the majority of networking, web serving, e-mail, and network diagnostics. With the spread of the internet, the volume of user-generated data has expanded exponentially, and open-source software to manage and analyze big data has flourished through open-source big data projects. This entry explains the history of open-source software, the typical organizational structure used to create such software, prominent project examples of the software focused on managing and analyzing big data, and the future evolution suggested by current research on the topic.

History of Open-Source Software

Two early software projects leading to the modern-day open-source software growth were at the Massachusetts Institute of Technology (MIT) and the University of California at Berkeley. The Free Software Foundation, created by Richard Stallman of the MIT Artificial Intelligence Lab, was launched as a nonprofit organization to promote the development of free software. Stallman is credited with creating the term "copyleft" and created the GNU operating system as an operating system composed entirely of free software. The free BSD Unix operating system was developed by Bill Jolitz of the University of California at Berkeley Computer Science Research Group and served as the basis for many later Unix operating system releases. Many open-source software projects were unknown outside of the highly technical computer science community. Stallman's GNU was later popularized by Linus Torvalds, a Finnish computer science student, who released a Linux kernel based upon the earlier work. The release of Linux triggered substantial media attention for the open-source movement when an internal Microsoft strategy document, dubbed the Halloween Documents, was leaked. It outlined Microsoft's perception of the threat of Linux to Microsoft's dominance of the operating system market. Linux was portrayed in the mass media as a free alternative to the Microsoft Windows operating system. Eric S. Raymond and Bruce Perens further formalized open source as a

development method by creating the Open Source Initiative in 1998. By 1998, open-source software routed 80% of the e-mail on the internet. It has continued to flourish to the modern day, being responsible for a large number of software and information-based products produced by the open-source movement.

C-form Organizational Architecture

The C-form organizational architecture is the primary organizational structure for open-source development projects. A typical C-form has four common organizing principles. First, there are informal peripheral boundaries for developers. Contributors can participate as much or as little as they like and join or leave a project on their own. Second, many contributors receive no financial compensation at all for their work, yet some may have employment relationships with more traditional organizations which encourage their participation in the C-form as part of their regular job duties. Third, C-forms focus on information-based products, of which software is a major subset. Since the product of a typical C-form is information based, it can be replicated with minimal effort and cost. Fourth, typical C-forms operate with a norm of open, transparent communication. The primary intellectual property of an open-source C-form is the software code. This, by definition, is made available for any and all to see, use, and edit.

Prominent Examples of Open-Source Big Data Projects

Apache Cassandra is a distributed database management system originally developed by Avinash Lakshman and Prashant Malik at Facebook as a solution to handle searching an inbox. It is now developed by the Apache Software Foundation, a distributed community of developers. It is designed to handle large amounts of data distributed across multiple datacenters. It has been recognized by University of Toronto researchers as having leading scalability capabilities.

Apache CouchDB is a web-focused database system originally developed by Damien Katz, a former IBM developer. Similar to Apache Cassandra, it is now developed by the Apache Software Foundation. It is designed to deal with large amounts of data through multi-master replication across multiple locations.

Apache Hadoop is designed to store and process large-scale datasets using multiple clusters of standardized low-level hardware. This technique allows for parallel processing similar to a supercomputer but using mass market, off-the-shelf commodity computing systems. It was originally developed by Doug Cutting and Mike Cafarella. Cutting was employed at Yahoo, and Cafarella was a Masters student at the University of Washington at the time. It is now developed by the Apache Software Foundation. It serves a similar purpose as Storm.

Apache HCatalog is a table and storage management layer for Apache Hadoop. It is focused on assisting grid administrators with managing large volumes of data without knowing exactly where the data is stored. It provides relational views of the data, regardless of what the source storage location is. It is developed by the Apache Software Foundation.

Apache Lucene is an information retrieval software library which tightly integrates with search engine projects such as ElasticSearch. It provides full-text indexing and searching capabilities. It treats all document formats similarly by extracting textual components and as such is independent of file format. It is developed by the Apache Software Foundation and released under the Apache Software License.

D3.js is a data visualization package originally created by Mike Bostock, Jeff Heer, and Vadim Ogievetsky, who worked together at Stanford University. It is now licensed under the Berkeley Software Distribution (BSD) open-source license. It is designed to graphically represent large amounts of data and is frequently used to generate rich graphs and for map making.

Drill is a framework to support distributed applications for data-intensive analysis of large-scale datasets in a self-serve manner. It is inspired by Google's BigQuery infrastructure service. The
stated goal for the project is to scale to 10,000 or more servers to make low-latency queries of petabytes of data in seconds in a self-service manner. It is currently being incubated by Apache. It is similar to Impala.

ElasticSearch is a search server that provides near real-time, full-text search engine capabilities for large volumes of documents using a distributed infrastructure. It is based upon Apache Lucene and is released under the Apache Software License. It spawned a venture-funded company in 2012, created by the people responsible for ElasticSearch and Apache Lucene, to provide support and professional services around the software.

Impala is an SQL query engine which enables massively parallel processing of search queries on Apache Hadoop. It was announced in 2012 and moved out of beta testing to public availability in 2013. It is targeted at data analysts and scientists who need to conduct analysis on large-scale data without reformatting and transferring the data to a specialized system or proprietary format. It is released under the Apache Software License and has professional support available from the venture-funded Cloudera. It is similar to Drill.

Julia is a technical computing high-performance dynamic programming language with a focus on distributed parallel execution with high numerical accuracy using an extensive mathematical function library. It is designed to use a simple syntax familiar to many developers of older programming languages while being updated to be more effective with big data. The aim is to speed development time by simplifying coding for parallel processing support. It was first released in 2012 under the MIT open-source license after being originally developed starting in 2009 by Alan Edelman (MIT), Jeff Bezanson (MIT), Stefan Karpinski (UCSB), and Viral Shah (UCSB).

Kafka is a distributed, partitioned, replicated message broker focused on commit logs. It can be used for messaging, website activity tracking, operational data monitoring, and stream processing. It was originally developed by LinkedIn and released open source in 2011. It was subsequently incubated by the Apache Incubator and as of 2012 is developed by the Apache Software Foundation.

Lumify is a big data analysis and visualization platform originally targeted to investigative work in the national security space. It provides real-time graphical visualizations of large volumes of data and automatically searches for connections between entities. It was originally created by Altamira Technologies Corporation and then released under the Apache License in 2014.

MongoDB is a NoSQL document-oriented database focused on handling large volumes of data. The software was first developed in 2007 by 10gen. In 2009, the company made the software open source and focused on providing professional services for the integration and use of the software. It utilizes a distributed file storage, load balancing, and replication system to allow quick ad hoc queries of large volumes of data. It is released under the GNU Affero General Public License and uses drivers released under the Apache License.

R is a technical computing high-performance programming language focused on statistical analysis and graphical representations of large datasets. It is an implementation of the S programming language created by Bell Labs' John Chambers. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland. It is designed to allow multiple processors to work on large datasets. It is released under the GNU License.

Scribe is a log server designed to aggregate large volumes of server data streamed in real time from a high volume of servers. It is commonly described as a scaling tool. It was originally developed by Facebook and then released in 2008 using the open-source Apache License.

Spark is a data analytic cluster computing framework designed to integrate with Apache Hadoop. It has the capability to cache large datasets in memory to interactively analyze the data and then extract a working analysis set to analyze further quickly. It was originally developed at the University of California at Berkeley AMPLab and released under the BSD License. Later it was incubated in 2013 at the Apache Incubator and released under the Apache License.
Major contributors to the project include Yahoo and Intel.

Storm is a programming library focused on real-time storage and retrieval of dynamic object information. It allows complex querying across multiple database tables. It handles unbounded streams of data in an instantaneous manner, allowing real-time analytics of big data and continuous computation. The software was originally developed by Canonical Ltd., also known for the Ubuntu Linux operating system, and is released under the GNU Lesser General Public License. It is similar to Apache Hadoop but with a more real-time and less batch-focused nature.

The Future

The majority of open-source software focused on big data applications has primarily been targeting web-based big data sources and corporate data analytics. Current developments suggest a shift toward more analysis of real-world data as sensors spread more widely into everyday use by mass market consumers. As consumers provide more and more data passively through pervasive sensors, the open-source software used to manage and understand big data appears to be shifting toward analyzing a wider variety of big data sources. It appears likely that the near future will provide more open-source software tools to analyze real-world big data such as physical movements, biological data, consumer behavior, health metrics, and voice content.

Cross-References

▶ Apache
▶ Crowdsourcing
▶ Distributed Computing
▶ Global Open Data Initiative
▶ Google Flu
▶ Wikipedia

Further Readings

Bretthauer, D. (2002). Open source software: A history. Information Technology and Libraries, 21(1), 3–11.
Lakhani, K. R., & von Hippel, E. (2003). How open source software works: 'Free' user-to-user assistance. Research Policy, 32(6), 923–943.
Marx, V. (2013). Biology: The big challenges of big data. Nature, 498, 255–260.
McHugh, J. (1998, August). For the love of hacking. Forbes.
O'Mahony, S., & Ferraro, F. (2007). The emergence of governance on an open source project. Academy of Management Journal, 50(5), 1079–1106.
Seidel, M.-D. L., & Stewart, K. (2011). An initial description of the C-form. Research in the Sociology of Organizations, 33, 37–72.
Shah, S. K. (2006). Motivation, governance, and the viability of hybrid forms in open source software development. Management Science, 52(7), 1000–1014.
lyze real-world big data such as physical
P

Participatory Health and Big Data

Muhiuddin Haider, Yessenia Gomez, and Salma Sharaf
School of Public Health Institute for Applied Environmental Health, University of Maryland, College Park, MD, USA

The personal data landscape has changed drastically with the rise of social networking sites and the Internet. The Internet and social media sites have allowed for the collection of large amounts of personal data. Every keystroke typed, website visited, Facebook post liked, Tweet posted, or video shared becomes part of a user's digital history. A large net is cast, collecting all the personal data into big data sets that may be subsequently analyzed. This type of data has been analyzed for years by marketing firms through the use of algorithms that analyze and predict consumer purchasing behavior. The digital history of an individual paints a clear picture of their influence in the community and their mental, emotional, and financial state, and much about an individual can be learned through the tracking of his or her data. When big data is fine-tuned, it can benefit the people and community at large. Big data can be used to track epidemics, and its analysis can be used in the support of patient education, treatment of at-risk individuals, and encouragement of participatory community health. However, with the rise of big data comes concern about the security of health information and privacy.

There are advantages and disadvantages to casting large data nets. Collecting data can help organizations learn about individuals and communities at large. Following online search trends and collecting big data can help researchers understand health problems currently facing the studied communities and can similarly be used to track epidemics. For example, increases in Google searches for the term flu have been correlated with an increase in flu patient visits to emergency rooms. In addition, a 2008 Pew study revealed that 80% of Internet users use the Internet to search for health information. Today, many patients visit doctors after having already searched their symptoms online. Furthermore, more patients are now using the Internet to search for health information, seek medical advice, and make important medical decisions. The rise of the Internet has led to more patient engagement and participation in health.

Technology has also encouraged participatory health through an increase in interconnectedness. Internet technology has allowed for constant access to medical specialists and support groups for people suffering from diseases or those searching for health information. The use of technology has allowed individuals to take control of their own health, through the use of online searches and constant access to online health records and tailored medical information. In the United States, hospitals are connecting

individuals to their doctors through the use of online applications that allow patients to email their doctors, check prescriptions, and look at visit summaries from anywhere they have an Internet connection. The increase in patient engagement has been seen to play a major role in the promotion of health and the improvement of quality of healthcare.

Technology has also helped those at risk of disease seek treatment early or be followed carefully before contracting a disease. Collection of big data has helped providers see health trends in their communities, and technology has allowed them to reach more people with targeted health information. A United Nations International Children's Emergency Fund (UNICEF) project in Uganda asked community members to sign up for U-report, a text-based system that allows individuals to participate in health discussions through weekly polls. This system was implemented to connect and increase communication between the community and the government and health officials. The success of the program helped UNICEF prevent disease outbreaks in the communities and encouraged healthy behaviors. U-report is now used in other countries to help mobilize communities to play active roles in their personal health.

Advances in technology have also created wearable technology that is revolutionizing participatory health. Wearable technology is a category of devices that are worn by individuals and are used to track data about those individuals, such as health information. Examples of wearable technology are wrist bands that collect information about the individual's global positioning system (GPS) location, amount of daily exercise, sleep patterns, and heart rate. Wearable technology enables users to track their health information, and some wearable technology even allows the individual to save their health information and share it with their medical providers. Wearable technology encourages participatory health, and the constant tracking of health information and sharing with medical providers allow for more accurate health data collection and tailored care. The increase in health technology and the collection and analysis of big data have led to an increase in participatory health, better communication between individuals and healthcare providers, and more tailored care.

Big data collected from these various sources, whether Internet searches, social media sites, or participatory health through applications and technology, strongly influences our modern health system. The analysis of big data has helped medical providers and researchers understand health problems facing their communities and develop tailored programs to address health concerns, prevent disease, and increase community participatory health. Through the use of big data technology, providers are now able to study health trends in their communities and communicate with their patients without scheduling any medical visits. However, big data also creates concern for the security of health information.

There are several disadvantages to the collection of big data. One is that not all the data collected is significant, and much of the information collected may be meaningless. Additionally, computers lack the ability to interpret information the way humans do, so something that may have multiple interpretations may be misinterpreted by a computer. Therefore, data may be flawed if simply interpreted based on algorithms, and any decisions regarding the health of communities that were made based on this inaccurate data would also be flawed. Of greater concern is the issue of privacy with regard to big data. Much of the data is collected automatically based on people's online searches and Internet activities, so the question arises as to whether people have the right to choose what data is collected about them. Questions that arise regarding big data and health include: How long is personal health data saved? Will data collected be used against individuals? How will the Health Insurance Portability and Accountability Act (HIPAA) change with the incorporation of big data in medicine? Will data collected determine insurance premiums? Privacy concerns need to be addressed before big health data, health applications, and wearable technology become a security issue.

Today, big data can help health providers better understand their target populations and can lead to an increase in participatory health. However,
concerns arise about the safety of health information that is automatically collected in big data sets. With this in mind, targeted data collection may be a more beneficial method of data collection with regard to health. All these concerns need to be addressed today as the use of big data in health becomes more commonplace.

Cross-References

▶ Epidemiology
▶ Marketing/Advertising
▶ Medical/Health Care
▶ Patient-Centered (Personalized) Health
▶ PatientsLikeMe
▶ Prevention

Further Readings

Eysenbach, G. (2008). Medicine 2.0: Social networking, collaboration, participation, apomediation, and openness. Journal of Medical Internet Research, 10(3), e22. doi:10.2196/jmir.1030.
Gallant, L. M., Irizarry, C., Boone, G., & Kreps, G. (2011). Promoting participatory medicine with social media: New media applications on hospital websites that enhance health education and e-patients' voices. Journal of Participatory Medicine, 3, e49.
Gallivan, J., Kovacs Burns, K. A., Bellows, M., & Eigenseher, C. (2012). The many faces of patient engagement. Journal of Participatory Medicine, 4, e32.
Lohr, S. (2012). The age of big data. The New York Times.
Revolutionizing social mobilization, monitoring and response efforts. (2012). UNICEF [video file]. Retrieved from https://www.youtube.com/watch?v=gRczMq1Dn10
The promise of personalized medicine. (2007, Winter). NIH Medline Plus, pp. 2–3.
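The search-trend example in this entry (flu-related queries tracking emergency-room visits) amounts to a simple correlation between two weekly time series. The Python sketch below shows that computation on invented numbers; it is illustrative only and does not reproduce any study's data.

    import numpy as np

    flu_searches = np.array([120, 150, 210, 340, 500, 610, 580, 430])  # weekly query counts (invented)
    er_visits    = np.array([ 18,  22,  30,  49,  71,  88,  84,  60])  # weekly flu ER visits (invented)

    r = np.corrcoef(flu_searches, er_visits)[0, 1]
    print(f"Pearson correlation: {r:.2f}")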

Patient Records

Barbara Cook Overton
Communication Studies, Southeastern Louisiana University, Hammond, LA, USA

Patient records have existed since the first hospitals were opened. Early handwritten accounts of patients' hospitalizations were recorded for educational purposes, but most records were simply tallies of admissions and discharges used to justify expenditures. Standardized forms would eventually change how patient care was documented. Content shifted from narrative to numerical descriptions, largely in the form of test results. Records became unwieldy as professional guidelines and malpractice concerns required more and more data be recorded. Patient records are owned and maintained by individual providers, meaning multiple records exist for most patients. Nonetheless, the patient record is a document meant to ensure continuity of care and is a communication tool for all providers engaged in a patient's current and future care. Electronic health records may facilitate information sharing, but that goal is largely unrealized.

Modern patient records evolved with two primary goals: facilitating fiscal justification and improving medical education. Early hospitals established basic rules to track patient admissions, diagnoses, and outcomes. The purpose was largely bureaucratic: administrators used patient tallies to justify expenditures. As far back as 1737, Berlin surgeons were required to note patients' conditions each morning and prescribe lunches accordingly (e.g., soup was prescribed for patients too weak to chew). The purpose, according to Volker Hess and Sophie Ledebur, was helping administrators track the hospital's food costs and had little bearing on actual patient care. In 1791, according to Eugenia Siegler in her analysis of early medical recordkeeping, the New York Board of Governors required complete patient logs along with lists of prescribed medications, but no descriptions of the patients' conditions. Formally documenting the care that individual patients received was fairly uncommon in American hospitals at that time. It was not until the end of the nineteenth century that American physicians began recording the specifics of daily patient care for all patients. Documentation in European hospitals, by contrast, was much more complete. From the mid-eighteenth century on, standardized medical forms were widely used to record patients' demographic data, their symptoms, treatments, daily events, and outcomes. By 1820, these forms were collected in preprinted folders with multiple graphs and tables (by contrast, American hospitals would not begin using such forms until the mid-1860s). Each day, physicians in training were tasked with transcribing medical data into meaningful narratives, describing patterns of disease progression. The resulting texts became valuable learning tools. Similar narratives were compiled by American physicians and used for
# Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_160-1
2 Patient Records

medical training as well. In 1805, Dr. David however, the content of patient records still varied
Hosack had suggested recording the specifics of considerably.
particularly interesting cases, especially those Although standardized forms ensured certain
holding the greatest educational value for medical events would be documented, there were no
students. The New York Board of Governors methods to ensure consistency across documenta-
agreed and mandated compiling summary reports tions or between providers. Dr. Larry Weed pro-
in casebooks. As Siegler noted, there were very posed a framework in 1964 to help standardize
few reports written at first: the first casebook recording medical care: SOAP notes. SOAP notes
spanned 1810–1834. Later, as physicians in train- are organized around four key areas: subjective
ing were required to write case reports in order to (what patients say), objective (what providers
be admitted to their respective specialties, the observe, including vital signs and lab results),
number of documented cases grew. Eventually, assessment (diagnosis), and plan (prescribed treat-
reports were required for all patients. The reports, ments). Other standardized approaches have been
however, were usually written retrospectively and developed since then. The most common charting
in widely varying narrative styles. formats today, in addition to SOAP notes, include
Widespread use of templates in American hos- narrative charting, APIE charting, focus charting,
pitals helped standardize patient records, but the and charting by exception. Narrative charting,
resulting quantitative data superseded narrative much as in the early days of patient
content. By the start of the twentieth century, recordkeeping, involves written accounts of
forms guaranteed documentation of specific patients’ conditions, treatments, and responses
tasks like physical exams, histories, orders, and and is documented in chronological order. Charts
test results. Graphs and tables dominated patient include progress notes and flow sheets which are
records and physicians’ narrative summaries multi-column forms for recording dates, times,
began disappearing. The freestyle narrative form and observations that are updated every few
that had previously comprised the bulk of the hours for inpatients and upon each subsequent
patient record allowed physicians to write as outpatient visit. They provide an easy-to-read
much or as little as they wished. Templates left record of change over time; however their limited
little room for lengthy narratives, no more than a space cannot take the place of more complete
few inches, so summary reports gave way to brief assessments, which should appear elsewhere in
descriptions of pertinent findings. As medical the patient record. APIE charting, similar to
technology advanced, according to Siegler, the SOAP notes, involves clustering patient notes
medical record became more complicated and around assessment (both subjective and objective
cumbersome with the addition of yet more forms findings), planning, implementation, and evalua-
for reporting each new type of test (e.g., chemis- tion. Focus charting is a more concise method of
try, hematology, and pathology tests). While most inpatient recording and is organized by keywords
physicians kept working notes on active patients, listed in columns. Providers note their actions and
these scraps of paper notating observations, daily patients’ responses under each keyword heading.
tasks, and physicians’ thoughts seldom made their Charting by exception involves documenting only
way into the official patient record. The official significant changes or events using specially for-
record emphasized tests and numbers, as Siegler matted flow sheets. Computerized charting, or
noted, and this changed medical discourse: inter- electronic health records (EHR), combines several
actions and care became more data driven. Care of the above approaches but proprietary systems
became less about the totality of the patient’s vary widely. Most hospitals and private practices
experience and the physician’s perception of are migrating to EHRs, but the transition has been
it. Nonetheless, patient records had become a expensive, difficult, and slower than expected.
mainstay and they did help ensure continuity of The biggest challenges include interoperability
care. Despite early efforts at a unifying style, issues impeding data sharing, difficult-to-use
Patient Records 3

EHRs, and perceptions that EHRs interfere with objective/quantifiable observations and use quo-
provider-patient relationships. tation marks to set apart patients’ statements, note
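Weed's SOAP framework is, in effect, a simple data structure for a charting entry. The following minimal Python sketch shows one way such an entry might be represented; the SOAPNote class, field names, and clinical values are invented for illustration and do not describe any particular EHR product.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SOAPNote:
    """One charting entry organized around Weed's four SOAP areas."""
    patient_id: str
    author: str
    recorded_at: datetime
    subjective: str                            # what the patient reports
    objective: dict                            # measurable findings: vitals, lab results
    assessment: str                            # the provider's working diagnosis
    plan: list = field(default_factory=list)   # prescribed treatments, follow-up

note = SOAPNote(
    patient_id="12345",
    author="B. Overton, RN",
    recorded_at=datetime(2017, 3, 2, 8, 30),
    subjective="Patient reports worsening cough and fatigue.",
    objective={"temp_c": 38.4, "bp": "128/82", "chest_xray": "right lower lobe infiltrate"},
    assessment="Community-acquired pneumonia",
    plan=["Start oral antibiotics", "Follow-up visit in 7 days"],
)
print(note.assessment)
```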
Today, irrespective of the charting format used, patient records are maintained according to strict guidelines. Several agencies publish recommended guidelines, including the American Nurses Association, the American Medical Association (AMA), the Joint Commission on Accreditation of Healthcare Organizations (JCAHO), and the Centers for Medicare and Medicaid Services (CMS). Each regards the medical record as a communication tool for everyone involved in the patient's current and future care. The primary purpose of the medical record is to identify the patient, justify treatment, document the course of treatment and results, and facilitate continuity of care among providers. Data stored in patient records have other functions; aside from ensuring continuity of care, data can be extracted for evaluating the quality of care administered, released to third-party payers for reimbursement, and analyzed for clinical research and/or epidemiological studies. Each agency's charting guidelines require certain fixed elements in the patient record: the patient's name, address, birthdate, attending physician, diagnosis, next of kin, and insurance provider. The patient record also contains physicians' orders and progress notes, as well as medication lists, X-ray records, laboratory tests, and surgical records. Several agencies require that the patient's full name, birthdate, and a unique patient identification number appear on each page of the record, along with the name of the attending physician, date of visit or admission, and the treating facility's contact information. Every entry must be legibly signed or initialed and date/time stamped by the provider.

The medical record is a protected legal document, and because it could be used in a malpractice case, charting takes on added significance. Incomplete, confusing, or sloppy patient records could signal poor medical care to a jury, even in the absence of medical incompetence. For that reason, many malpractice insurers require additional documentation above and beyond what professional agencies recommend. For example, providers are urged to: write legibly in permanent ink, avoid using abbreviations, write only objective/quantifiable observations and use quotation marks to set apart patients' statements, note communication between all members of the care team while documenting the corresponding dates and times, document informed consent and patient education, record every step of every procedure and medication administration, and chart instances of patients' noncompliance or lack of cooperation. Providers should avoid writing over, whiting out, or attempting to erase entries, even if made in error – mistakes should be crossed through with a single line, dated, and signed. Altering a patient chart after the fact is illegal in many states, so corrections should be made in a timely fashion and dated/signed. Leaving blank spaces on medical forms should be avoided as well; if space is not needed for documenting patient care, providers are instructed to draw a line through the space or write "N/A." The following should also be documented to ensure both good patient care and malpractice defense: the reason for each visit, chief complaint, symptoms, onset and duration of symptoms, medical and social history, family history, both positive and negative test results, justifications for diagnostic tests, current medications and doses, over-the-counter and/or recreational drug use, drug allergies, any discontinued medications and reactions, medication renewals or dosage changes, treatment recommendations and suggested follow-up or specialty care, a list of other treating physicians, a "rule-out" list of considered but rejected diagnoses, final definitive diagnoses, and canceled or missed appointments.

Patient records contain more data than ever before because of professional guidelines, malpractice-avoidance strategies, and the ease of data entry many EHRs make possible. The result is that providers are experiencing data overload. Many have difficulty wading through mounds of data, in either paper or electronic form, to discern important information from insignificant attestations and results. While EHRs are supposed to make searching for data easier, many providers lack the needed skills and time to search for and review patients' medical records. Researchers have found some physicians rely on their own memories or ask patients about previous visits instead of searching for the information themselves. Other researchers have found providers have trouble quickly processing the amount of quantitative data and graphs in most medical records. Donia Scott and colleagues, for example, found that providers given narrative summaries of patient records, culled from both quantitative and qualitative data, performed better on questions about patients' conditions than those providers given complete medical records, and did so in half the time. Their findings highlight the importance of including narrative summaries in patients' records. There is a clear need for balancing numbers with words in ensuring optimal patient care.
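Scott and colleagues' results concern data-to-text summarization: turning the structured, numeric portion of a record into a short narrative a clinician can read quickly. A minimal sketch of that idea is shown below; the record layout, field names, and thresholds are invented for illustration and this is not the system evaluated in their study.

```python
def narrative_summary(record):
    """Render a brief narrative from structured chart data (illustrative only)."""
    lines = [f"{record['name']} is a {record['age']}-year-old patient "
             f"with a history of {', '.join(record['history'])}."]
    for visit in record["visits"]:
        # Flag any lab value above its (invented) reference limit.
        abnormal = [f"{test} {value}" for test, value in visit["labs"].items()
                    if value > visit["reference"][test]]
        status = "abnormal " + ", ".join(abnormal) if abnormal else "unremarkable labs"
        lines.append(f"On {visit['date']} the patient presented with "
                     f"{visit['complaint']}; {status}.")
    return " ".join(lines)

record = {
    "name": "J. Doe", "age": 64, "history": ["type 2 diabetes", "hypertension"],
    "visits": [{"date": "2014-05-02", "complaint": "blurred vision",
                "labs": {"HbA1c": 9.1, "glucose": 214},
                "reference": {"HbA1c": 6.5, "glucose": 140}}],
}
print(narrative_summary(record))
```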
Another important issue is ownership of and access to patient records. For each healthcare provider and/or medical facility involved in a patient's care, there is a unique patient record owned by that provider. With patients' permission, those records are frequently shared among providers. The Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data, but patients, guardians or conservators of minor or incompetent patients, and legal representatives of deceased patients may request access to records. Providers in some states can withhold records if, in the providers' judgment, releasing information could be detrimental to patients' well-being or cause emotional or mental distress. In addition to HIPAA mandates, many states have strict confidentiality laws restricting the release of HIV test results, drug and alcohol abuse treatment, and inpatient mental health records. While HIPAA guarantees patient access to their medical records, providers can charge copying fees. Withholding records because a patient cannot afford to pay for them is prohibited in many states because it could disrupt the continuity of care. HIPAA also allows patients the right to amend their medical records if they believe mistakes have been made. While providers are encouraged to maintain records in perpetuity, there are no requirements that they do so. Given the costs associated with data storage, both on paper and electronically, many providers will only maintain charts on active patients. Many inactive patients, those who have not seen a given provider in 8 years, will likely have their records destroyed. Additionally, many retiring physicians typically only maintain records for 10 years. Better data management capabilities will inevitably change these practices in years to come.

While patient records have evolved to ensure continuity of patient care, many claim the current form records have taken facilitates billing over communication. Many EHRs, for instance, are modeled after accounting systems: providers' checkbox choices of diagnoses and tests are typically categorized and notated in billing codes. Standardized forms are also designed with billing codes in mind. Diagnosis codes are reported in the International Statistical Classification of Diseases and Related Health Problems terminology, commonly referred to as ICD. The World Health Organization maintains this coding system for epidemiological, health management, and research purposes. Billable procedures and treatments administered in the United States are reported in Current Procedural Terminology (CPT) codes. The AMA owns this coding schema, and users must pay a yearly licensing fee for the CPT codes and codebooks, which are updated annually. Critics claim this amounts to a monopoly, especially given that HIPAA, CMS, and most insurance companies require CPT-coded data to satisfy reporting requirements and for reimbursement. CPT-coded data may impact patients' ability to decipher and comprehend their medical records, but the AMA does have a limited search function on its website for non-commercial use allowing patients to look up certain codes.
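To make the billing orientation concrete, the sketch below shows an encounter reduced to ICD diagnosis codes and CPT procedure codes, which is roughly the shape of the data a payer sees. The record layout is invented, and the codes are included only as familiar published examples, not as coding guidance.

```python
# A single outpatient encounter reduced to the coded, billing-oriented form
# many EHRs store. Codes are illustrative examples (ICD-10 E11.9 and I10;
# CPT 99213 and 83036), not billing advice.
encounter = {
    "patient_id": "12345",
    "date": "2014-05-02",
    "diagnoses": [
        {"icd10": "E11.9", "description": "Type 2 diabetes mellitus without complications"},
        {"icd10": "I10",   "description": "Essential (primary) hypertension"},
    ],
    "procedures": [
        {"cpt": "99213", "description": "Office visit, established patient, low complexity"},
        {"cpt": "83036", "description": "Hemoglobin A1c test"},
    ],
}

# What a claim sent to a third-party payer actually carries: codes, not narrative.
claim_lines = [(d["icd10"], p["cpt"]) for d in encounter["diagnoses"]
               for p in encounter["procedures"]]
print(claim_lines)
```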
Patient records are an important tool for ensuring continuity of care, but data-heavy records are cumbersome and often lack the narrative summaries that have been shown to enhance providers' understanding of patients' histories and inform better medical decision-making. Strict guidelines and malpractice concerns produce thorough records that, while ensuring complete documentation, sometimes impede providers' ability to discern important from less significant past findings. Better search and analytical tools are needed for managing patient records and data.
Cross-References

▶ Electronic Health Records (EHR)
▶ Health Care Delivery
▶ Health Informatics
▶ Medical/Health Care
▶ Patient-Centered (Personalized) Health

Further Reading

American Medical Association. CPT – current procedural terminology. http://www.ama-assn.org/ama/pub/physician-resources/solutions-managing-your-practice/coding-billing-insurance/cpt.page. Accessed October 2014.
Christensen, T., & Grimsmo, A. (2008). Instant availability of patient records, but diminished availability of patient information: A multi-method study of GPs' use of electronic health records. BMC Medical Informatics and Decision Making, 8(12).
Hess, V., & Ledebur, S. (2011). Taking and keeping: A note on the emergence and function of hospital patient records. Journal of the Society of Archivists, 32, 1.
Lee, J. Interview with Lawrence Weed, MD – The father of the problem-oriented medical record looks ahead. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2911807/. Accessed October 2014.
Medical Insurance Exchange of California. Medical record documentation for patient safety and physician defensibility. http://www.miec.com/Portals/0/pubs/MedicalRec.pdf. Accessed October 2014.
Scott, D., et al. (2013). Data-to-text summarisation of patient records: Using computer-generated summaries to access patient histories. Patient Education and Counseling, 92.
Siegler, E. (2010). The evolving medical record. Annals of Internal Medicine, 153.

Patient-Centered (Personalized) Health

Barbara Cook Overton
Southeastern Louisiana University, Baton Rouge, LA, USA

Patient-centered health privileges patient participation and results in tailored interventions incorporating patients' needs, values, and preferences. Although this model of care is preferred by patients and encouraged by policy makers, many healthcare providers persist in using a biomedical approach, which prioritizes providers' expertise and downplays patients' involvement. Patient-centered care demands collaborative partnerships and quality communication, both requiring more time than is generally available during medical exams. While big data may not necessarily improve patient-provider communication, it can facilitate individualized care in several important ways.

The concept of patient-centered health, although defined in innumerable ways, has gained momentum in recent years. In 2001, the Institute of Medicine (IOM) issued a report recommending healthcare institutions and providers adopt six basic tenets: safety, effectiveness, timeliness, efficiency, equity, and patient-centeredness. Patient-centeredness, according to the IOM, entails delivering quality health care driven by patients' needs, values, and preferences. The Institute for Patient- and Family-Centered Care expands the IOM definition by including provisions for shared decision-making, planning, delivery, and evaluation of health care that is situated in partnerships comprising patients, their families, and providers. The concept is further elucidated in terms of four main principles: respect, information sharing, participation, and collaboration. According to the Picker Institute, patient-centered care encompasses seven basic components: respect, coordination, information and education, physical comfort, emotional support, family involvement, and continuity of care. All of the definitions basically center on two essential elements: patient participation in the care process and individualized care.

The goal of patient-centered care, put forth by the IOM, is arguably a return to old-fashioned medicine. Dr. Abraham Flexner, instrumental in revamping physician training during the 1910s and 1920s, promoted medical interactions that were guided by both clinical reasoning and compassion. He encouraged a biopsychosocial approach to patient communication, which incorporates patients' feelings, thoughts, and expectations. Scientific and technological advances throughout the twentieth century, however, gradually shifted medical inquiry away from the whole person and towards an ever-narrowing focus on symptoms and diseases. Once the medical interview became constricted, scientific, and objective, collaborative care gave way to a provider-driven approach. The growth of medical specialties (like cardiology and gastroenterology) further compounded the problem by reducing patients to collections of interrelated systems (such as circulatory and digestive). This shift to specialty care coincided with fewer providers pursuing careers in primary care, the specialty most inclined to adopt a patient-centered perspective. The resulting biomedical model downplays patient participation while privileging provider control and expertise. Although a return to patient-centered care is being encouraged, many providers persist in using a biomedical approach. Some researchers fault patients for not actively co-constructing the medical encounter, while others blame medical training that de-emphasizes relationship development and communication skills.

Several studies posit quality communication as the single most important component necessary for delivering patient-centered care. Researchers find patient dissatisfaction is associated with providers who are insensitive to or misinterpret patients' socio-emotional needs, fail to express empathy, do not give adequate feedback or information regarding diagnoses and treatment protocols, and disregard patients' input in decision-making. Patients who are dissatisfied with providers' communication are less likely to comply with treatment plans and typically suffer poorer outcomes. Conversely, patients satisfied with the quality of their providers' communication are more likely to take medications as prescribed and adhere to recommended treatments. Satisfied patients also have lower blood pressure and better overall health. Providers, however, routinely sacrifice satisfaction for efficiency, especially in managed care contexts.

Many medical interactions proceed according to a succinct pattern that does not prioritize patients' needs, values, and preferences. The asymmetrical nature of the provider-patient relationship privileges providers' goals and discourages patient participation. Although patients expect to have all or most of their concerns addressed, providers usually pressure them to focus on one complaint per visit. Providers also encourage patients to get to the point quickly, which means patients rarely speak without interruption or redirection. While some studies note patients are becoming more involved in their health care by offering opinions and asking questions, others find ever-decreasing rates of participation during medical encounters. Studies show physicians invite patients to ask questions in fewer than half of exams. Even when patients do have concerns, they rarely speak up because they report feeling inhibited by asymmetrical relationships: many patients simply do not feel empowered to express opinions, ask questions, or assert goals. Understandably, communication problems stem from these hierarchical differences and competing goals, thereby making patient-centered care difficult.

There are several other obstacles deterring patient-centered communication and care. While medical training prioritizes the development of clinical skills over communication skills, lack of time and insufficient financial reimbursement are the biggest impediments to patient-centered care. The "one complaint per visit" approach to health care means most conversations are symptom specific, with little time left for discussing patients' overall health goals. Visits should encompass much broader health issues, moving away from the problem presentation/treatment model while taking each patient's unique goals into account. The goal of patient-centered care is further compromised by payment structures incentivizing quick patient turnaround over quality communication, which takes more time than is currently available in a typical medical encounter. Some studies, however, suggest that patient-centered communication strategies, like encouraging questions, co-constructing diagnoses, and mutually deciding treatment regimens, do not necessarily lengthen the overall medical encounter. Furthermore, collaboratively decided treatment plans are associated with decreased rates of hospitalization and emergency room use. Despite the challenges that exist, providers are implored to attempt patient-centered communication.

Big data has helped facilitate asynchronous communication between medical providers, namely through electronic health records, which ensure continuity of care, but big data's real promise lies elsewhere. Using the power of predictive analytics, big data can play an important role in advancing patient-centered health by helping shape tailored wellness programs. The provider-driven, disease-focused approach to health care has, heretofore, shaped the kind of health data that exist: data that are largely focused on patients' symptoms and diseases. However, diseases do not develop in isolation. Most conditions develop through a complicated interplay of hereditary, environmental, and lifestyle factors. Expanding health data to include social and behavioral data, elicited via a biopsychosocial/patient-centered approach, can help medical providers build better predictive models. By examining comprehensive rather than disease-focused data, providers can, for example, leverage health data to predict which patients will participate in wellness programs, their level of commitment, and their potential for success. This can be done using data mining techniques, like collaborative filtering. In much the same way Amazon makes purchase recommendations for its users, providers may similarly recommend wellness programs by taking into account patients' past behavior and health outcomes. Comprehensive data could also be useful for tailoring different types of programs based on patients' preferences, thereby facilitating increased participation and retention. For example, programs could be customized for patients in ways that go beyond traditional racial, ethnic, or socio-demographic markers and include characteristics such as social media use and shopping habits. By designing analytics aimed at understanding individual patients and not just their diseases, providers may better grasp how to motivate and support the necessary behavioral changes required for improved health.
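As an illustration of the collaborative-filtering idea described above, the following minimal Python sketch scores wellness programs for one patient by weighting other patients' participation by their similarity to that patient. The program names, participation matrix, and cosine-similarity recommender are invented assumptions for this toy example, not a description of any deployed system.

```python
import numpy as np

# Rows: patients; columns: wellness programs (e.g., walking club, nutrition
# coaching, smoking cessation, diabetes self-management). 1 = participated.
programs = ["walking", "nutrition", "smoking_cessation", "diabetes_mgmt"]
participation = np.array([
    [1, 1, 0, 1],   # patient A
    [1, 0, 0, 1],   # patient B
    [0, 1, 1, 0],   # patient C
    [1, 1, 0, 0],   # patient D, for whom we want a recommendation
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 3  # patient D
# Weight every other patient's participation by their similarity to D.
weights = np.array([cosine(participation[target], participation[i])
                    for i in range(len(participation)) if i != target])
others = np.delete(participation, target, axis=0)
scores = weights @ others

# Suppress programs D has already joined, then rank the rest.
scores[participation[target] > 0] = -np.inf
print(programs[int(np.argmax(scores))])  # -> 'diabetes_mgmt' for this toy data
```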
The International Olympic Committee (IOC), in a consensus meeting on noncommunicable disease prevention, has called for an expansion of the health data collected and a subsequent conversion of that data into information providers and patients may use to achieve better health outcomes. Noncommunicable/chronic diseases, such as diabetes and high blood pressure, are largely preventable. These conditions are related to lifestyle choices: too little exercise, an unhealthy diet, smoking, and alcohol abuse. The IOC recommends capturing data from pedometers and sensors in smart phones, which provide details about patients' physical activity, and combining that with data from interactive smart phone applications (such as calorie counters and food logs) to customize behavior counseling. This approach individualizes not only patient care but also education, prevention, and treatment interventions, and advances patient-centered care with respect to information sharing, participation, and collaboration. The IOC also identifies several other potential sources of health data: social media profiles, electronic medical records, and purchase histories. Collectively, this data can yield a "mass customization" of prevention programs. Given that chronic diseases are responsible for 60 percent of deaths and 80 percent of healthcare spending is dedicated to chronic disease management, customizable programs have the potential to save lives and money.

Despite the potential, big data's impact is largely unrealized in patient-centered care efforts. Although merging social, behavioral, and medical data to improve health outcomes has not happened on a widespread basis, there is still a lot that can be done analyzing medical data alone. There is, however, a clear need for computational/analytical tools that can aid providers in recognizing disease patterns, predicting individual patients' susceptibility, and developing personalized interventions. Nitesh Chawla and Darcy Davis propose aggregating and integrating big data derived from millions of electronic health records to uncover patients' similarities and connections with respect to numerous diseases. This makes a proactive medical model possible, as opposed to the current treatment-based approach. Chawla and Davis suggest that leveraging clinically reported symptoms from a multitude of patients, along with their health histories, prescribed treatments, and wellness strategies, can provide a summary report of possible risk factors, underlying causes, and anticipated concomitant conditions for individual patients. They developed an analytical framework called the Collaborative Assessment and Recommendation Engine (CARE), which applies collaborative filtering using inverse frequency and vector similarity to generate predictions based on data from similar patients. The model was validated using a Medicare database of 13 million patients with two million hospital visits over a 4-year period by comparing diagnosis codes, patient histories, and health outcomes. CARE generates a short list of high-risk diseases and early warning signs that a patient may develop in the future, enabling a collaborative prevention strategy and better health outcomes. Using this framework, providers can improve the quality of care through prevention and early detection and also advance patient-centered health care.
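The published CARE framework is considerably more sophisticated, but the flavor of collaborative filtering with inverse-frequency weighting and vector similarity can be sketched as follows. The diagnosis labels, matrix, and scoring below are invented for illustration and are not the authors' implementation.

```python
import numpy as np

# Toy patient-by-diagnosis matrix (1 = code present in the record).
diagnoses = ["diabetes", "hypertension", "ckd", "retinopathy"]
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],   # target patient: which codes might appear next?
], dtype=float)

# Inverse-frequency weights: rarer diagnoses count more toward similarity.
prevalence = X.sum(axis=0) / len(X)
idf = np.log(1.0 / np.maximum(prevalence, 1e-9))

def weighted_cosine(u, v, w):
    uw, vw = u * w, v * w
    return uw @ vw / (np.linalg.norm(uw) * np.linalg.norm(vw) + 1e-9)

target = 3
sims = np.array([weighted_cosine(X[target], X[i], idf)
                 for i in range(len(X)) if i != target])
others = np.delete(X, target, axis=0)

# Score unseen diagnoses by how often they occur among similar patients.
risk = sims @ others
risk[X[target] > 0] = 0.0
ranked = sorted(zip(diagnoses, risk), key=lambda kv: -kv[1])
print(ranked)  # e.g., 'ckd' and 'retinopathy' flagged for early monitoring
```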
Data security is a factor that merits discussion. Presently, healthcare systems and individual providers exclusively manage patients' health data. Healthcare systems must comply with security mandates set forth by the Health Insurance Portability and Accountability Act of 1996 (HIPAA). HIPAA demands that data servers be firewall and password protected and that data transmission be encrypted. Information sharing is an important component of patient-centered care. Some proponents of the patient-centered care model advocate transferring control of health data to patients, who may then use and share it as they see fit. Regardless of who maintains control of health data, storing and electronically transferring that data pose potential security and privacy risks.

Patient-centered care requires collaborative partnerships and wellness strategies that incorporate patients' thoughts, feelings, and preferences. It also requires individualized care, tailored to meet patients' unique needs. Big data facilitates patient-centered/individualized care in several ways. First, it ensures continuity of care and enhanced information sharing through integrated electronic health records. Second, analyzing patterns embedded in big data can help predict disease. APACHE III, for example, is a prognostic program that predicts hospital inpatient mortality. Similar programs help predict the likelihood of heart disease, Alzheimer's, cancer, and digestive disorders. Lastly, big data accrued not only from patients' health records but also from their social media profiles, purchase histories, and smartphone applications has the potential to predict enrollment in wellness programs and improve behavioral modification strategies, thereby improving health outcomes.

Cross-References

▶ Biomedical Data
▶ Electronic Health Records (EHR)
▶ Epidemiology
▶ Health Care Delivery
▶ Health Informatics
▶ HIPAA
▶ Medical/Health Care
▶ Predictive Analytics

Further Readings

Chawla, N. V., & Davis, D. A. (2013). Bringing big data to personalized healthcare: A patient-centered framework. Journal of General Internal Medicine, 28(3), 660–665.
Duffy, T. P. (2011). The Flexner report: 100 years later. Yale Journal of Biology and Medicine, 84(3), 269–276.
Institute of Medicine. (2001). Crossing the quality chasm. Washington, DC: National Academies Press.
Institute for Patient- and Family-Centered Care. FAQs. http://www.ipfcc.org/faq.html. Accessed Oct 2014.
Matheson, G., et al. (2013). Prevention and management of non-communicable disease: The IOC consensus statement, Lausanne 2013. Sports Medicine, 43, 1075–1088.
Picker Institute. Principles of patient-centered care. http://pickerinstitute.org/about/picker principles/. Accessed Oct 2014.

PatientsLikeMe

Niccolò Tempini
Department of Sociology, Philosophy and Anthropology and Egenis, Centre for the Study of the Life Sciences, University of Exeter, Exeter, UK

Introduction

PatientsLikeMe is a for-profit organization based in Cambridge, Massachusetts, managing a social media-based health network that supports patients in activities of health data self-reporting and socialization. As of January 2015, the network counts more than 300,000 members and 2,300+ associated conditions, and it is one of the most established networks in the health social media space. The web-based system is designed and managed to encourage and enable patients to share data about their health situation and experience.

Business Model

Differently from most prominent social media sites, the network is not ad-supported. Instead, the business model centers on the sale of anonymized data access and medical research services to commercial organizations (mostly pharmaceutical companies). The organization has been partnering with clients in order to develop patient communities targeted on a specific disease or kind of patient experience. In the context of a sponsored project, PatientsLikeMe staff develop the disease-specific tools required for patient health self-reporting (patient-reported outcome measures – PROMs) on a web-based platform, then collect and analyze the patient data, and produce research outputs, either commercial research reports or peer-reviewed studies. Research has addressed a wide range of issues, from drug efficacy discovery for neurodegenerative diseases, or symptom distribution across patient populations, to sociopsychological issues like compulsive gambling.

While the network has produced much of its research on the occasion of sponsored research projects, it has mostly been spared criticism. This is because, for its widespread involvement of patients in medical research, PatientsLikeMe is often seen as a champion of the so-called participatory turn in medicine, of patient empowerment, and more generally of the forces of democratization that several writers argued to be the promise of the social web. While sustaining its operations through partnerships with commercial corporations, PatientsLikeMe also gathers on the platform a number of patient-activism NGOs. The system provides them customized profiles and communication tools with which these organizations can try to improve their reach with the patient population of reference, while the network in return gains a prominent position as the center, or enabler, of health community life.

Patient Members

PatientsLikeMe attracts patient members because the system is designed to allow patients to find others and socialize. This can be particularly useful for patients of rare, chronic, or life-changing diseases: patient experiences for which an individual might find it helpful to learn from the experience of others who, however, might not be easy to find through traditional, "offline" socialization opportunities. The system is also designed to enable self-tracking of a number of health dimensions. The patients record both structured data, about diagnoses, treatments, symptoms, disease-specific patient-reported questionnaires (PROs), or results of specific lab tests, and semi-structured or unstructured data, in the form of comments, messages, and forum posts. All of these data are at the disposal of the researchers that have access to the data. A paradigmatic characteristic of PatientsLikeMe as a social media research network is that the researchers do not learn about the patients in any other way than through the data that the patients share.

Big Data and PatientsLikeMe

As such, it is the approach to data and to research that defines PatientsLikeMe as a representative "Big Data" research network – one that, however, does not manage staggeringly huge quantities of data nor employs extremely complex technological solutions for data storage and analysis. PatientsLikeMe is a big data enterprise because, first, it approaches medical research through an open (to data sharing by anyone and about user-defined medical entities), distributed (relative to availability of a broadband connection, from anywhere and at any time), and data-based (data are all that is transacted between the participating parties) research approach. Second, the data used by PatientsLikeMe researchers are highly varied (including social data, social media user-generated content, browsing session data, and most importantly structured and unstructured health data) and relatively fast, as they are updated, parsed, and visualized dynamically in real time through the website or other data-management technologies. The research process involves practices of pattern detection, analysis of correlations, and investigation of hypotheses through regression and other statistical techniques.
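As a hedged illustration of the kind of analysis this describes, and not of PatientsLikeMe's actual pipeline, the sketch below computes a correlation and an ordinary least-squares fit over invented self-reported measures.

```python
import numpy as np

# Invented self-reported data: weekly treatment adherence (days/week) and
# self-reported symptom severity (0-10) for a handful of members.
adherence = np.array([7, 6, 5, 3, 2, 1], dtype=float)
severity  = np.array([2.1, 2.8, 3.5, 5.0, 6.2, 7.4])

# Correlation between the two self-reported measures.
r = np.corrcoef(adherence, severity)[0, 1]

# Ordinary least-squares fit: severity ~ a * adherence + b.
A = np.vstack([adherence, np.ones_like(adherence)]).T
(a, b), *_ = np.linalg.lstsq(A, severity, rcond=None)

print(f"correlation r = {r:.2f}")            # strongly negative in this toy set
print(f"severity ~ {a:.2f} * adherence + {b:.2f}")
```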
The vision of scientific discovery underlying the PatientsLikeMe project is based on the assumption that, given a broad enough base of users and a granular, frequent, and longitudinal exercise of data collection, new, small patterns ought to emerge from the data and invite further investigation and explanation. This assumption implies that for medical matters to be discovered further, the development of an open, distributed, and data-based socio-technical system that is more sensitive to their forms and differences is a necessary step. But also, the hope is that important lessons can be learned by opening the medical framework to measure and represent a broader collection of entities and events than traditional, profession-bound medical practice accepted. The PatientsLikeMe database includes symptoms and medical entities as described in the terms used by the patients themselves. This involves sensitive and innovative processes of translation from the patient language to expert terminology. Questions about the epistemological consequences of translating the patient voice (until now a neglected form of medical information) into data fields and categories, and the associated concerns about the reliability of patient-generated data, cannot have a simple answer. In any case, from a practice-based point of view, these data are nonetheless being mobilized for research through innovative technological solutions for coordinating the patient user-base. The data can then be analyzed in multiple ways, all of which include the use of computational resources and databases – given the digital nature of the data.

As ethnographic research of the organization has pointed out (see the further readings section, below), social media companies that try to develop knowledge from the aggregation and analysis of the data contributed by their patients are involved in complex efforts to "cultivate" the information lying in the database – as they have to come to grips with the dynamics and trade-offs that are specific to understanding health through social media. Social media organizations try to develop meaningful and actionable information from their database by trying to make data structures more precise in differentiating between phenomena and reporting about them in data records, and to make the system easier and more flexible to use in order to generate more data. Often these demands work at cross-purposes. The development of social media for producing new knowledge through distributed publics involves the engineering of a social environment where sociality and information production are inextricably intertwined. Users need to be steered towards information-productive behaviors as they engage in social interaction of sorts, for information is the worth upon which social media businesses depend. In this respect, it has been argued that PatientsLikeMe is representative of the construction of sociality that takes place in all social media sites, where social interaction unfolds along the paths that the technology continuously and dynamically draws based on the data that the users are sharing.

As such, many see PatientsLikeMe as incarnating an important dimension of the much-expected revolution of personalized medicine. Improvements in healthcare will not be limited to a capillary application of genetic sequencing and other micro and molecular biology tests that try to open up the workings of individual human physiology at unprecedented scale; instead, the information produced by these tests will often be related with the information about subjective patient experience and expectations that new information technology capabilities are increasingly making possible.

Other Issues

Much of the public debate about the PatientsLikeMe network involves issues of privacy and confidentiality of the patient users. The network is a "walled garden," with patient profiles remaining inaccessible to unregistered users by default. However, once logged in, every user can browse all patient profiles and forum conversations. On more than one occasion, unauthorized intruders (including journalists and academics) were detected and found screen-scraping data from the website. Despite the organization employing state-of-the-art techniques to protect patient data from unauthorized exporting, any sensitive data shared on a website remains at risk, given the widespread belief – and public record on other websites and systems – that skilled intruders could always execute similar exploits unnoticed. Patients can have a lot to be concerned about, especially if they have conditions with a social stigma or if they shared explicit political or personal views in the virtual comfort of a forum room. In this respect, even if the commercial projects that the organization has undertaken with industry partners implied the exchange of user data that had been pseudonymised before being handed over, the limits of user profile anonymization are well known. In the case of profiles of patients living with rare diseases, which are a consistent portion of the users in PatientsLikeMe, it can arguably be not too difficult to reidentify individuals, upon determined effort. These issues of privacy and confidentiality remain a highly sensitive topic, as society does not dispose of standard and reliable solutions against the various forms that data misuse can take. As both news and scholars have often reported, the malleability of digital data makes it impossible to stop the diffusion of sensitive data once function creep happens.
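A minimal sketch of what pseudonymization before a data hand-over can look like, and of why rare attribute combinations still carry re-identification risk, is shown below. The salting scheme, member identifiers, and attributes are invented and do not describe PatientsLikeMe's actual procedures.

```python
import hashlib
from collections import Counter

SECRET_SALT = b"replace-with-a-secret-value"  # illustrative placeholder only

def pseudonymize(member_id: str) -> str:
    """Replace a member identifier with a stable pseudonym before data export."""
    return hashlib.sha256(SECRET_SALT + member_id.encode()).hexdigest()[:12]

# Invented exported records: pseudonym plus coarse attributes.
export = [
    {"pid": pseudonymize("u1001"), "condition": "ALS", "region": "NH", "age_band": "30-39"},
    {"pid": pseudonymize("u1002"), "condition": "MS",  "region": "NH", "age_band": "30-39"},
    {"pid": pseudonymize("u1003"), "condition": "MS",  "region": "MA", "age_band": "40-49"},
]

# Re-identification risk check: attribute combinations shared by only one
# record can single a person out even though the identifier is hashed.
combos = Counter((r["condition"], r["region"], r["age_band"]) for r in export)
unique = [r for r in export if combos[(r["condition"], r["region"], r["age_band"])] == 1]
print(f"{len(unique)} of {len(export)} pseudonymized records are unique on "
      "condition/region/age alone")
```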
Moreover, as is often discussed in the social media and big data public debate, data networks increasingly put pressure on the notion of informed consent as an ethically sufficient device for conducting research with user and patient data. The need for moral frameworks of operation that go beyond strict compliance with the law has often been called for, recently in the report on data in biomedical research by the Nuffield Council on Bioethics. In the report, PatientsLikeMe was held up as a paramount example of new kinds of research networks that rely on extensive patient involvement and social (medical) data – these networks are often dubbed citizen science or participatory research.

On another note, some have argued that PatientsLikeMe, like many other prominent social media organizations, has been exploiting the rhetoric of sharing (one's life with a network and its members) to encourage data-productive behaviors. The business model of the network is built around a traditional, proprietary model of data ownership. The network facilitates the data flow inbound and makes it less easy for the data to flow outbound, controlling their commercial application. In this respect, we must notice that current practice in social media management in general is often characterized by data sharing evangelism by the managing organization, which at the same time requires monopoly of the most important data resources that the network generates. In the general public debate, this kind of social media business model has been identified as a factor contributing to the erosion of user privacy.

On a different level, one can notice how the kind of patient-reported data collection and medical research that the network makes possible is a much cheaper and in many respects more efficient model than what professional-laden institutions such as the clinical research hospital, with their specific work loci and customs, could put in place. This way of organising the collection of valuable data operates by including large numbers of end users who are not remunerated. Despite this, running and organizing such an enterprise is expensive and labor-intensive, and as such, critical analysis of this kind of "crowdsourcing" enterprise needs to look beyond the more superficial issue of the absence of a contract to sanction the exchange of a monetary reward for distributed, small task performances. One connected problem in this respect is that since data express their value only when they are re-situated through use, no data have a distinct, intrinsic value upon generation; not all data generated will ever be equal.

Finally, the affluence of medical data that this network makes available can have important consequences on therapy or lifestyle decisions that a patient might take. Sure, patients can make up their minds and take critical decisions without appropriate consultation at any time, as they have always done. Nonetheless, the sheer amount of information that networks such as PatientsLikeMe or search engines such as Google make available at a click's distance is without antecedents, and what this implies for healthcare must still be fully understood. Autonomous decisions by the patients do not necessarily happen for the worst. As healthcare often falls short of providing appropriate information and counseling, especially about everything that is not strictly therapeutic, patients can eventually devise improved courses of action through a consultation of appropriate information-rich web resources. At the same time, risks and harms are not fully appreciated, and there is a pressing need to understand more about the consequences of these networks for individual health and the future of healthcare and health research.

There are other issues besides these more evident and established topics of discussion. As has been pointed out, questions of knowledge translation (from the patient vocabulary to the clinical-professional) remain open, and unclear is also the capacity of these distributed and participative networks to consistently represent and organize the patient populations that they are deemed to serve, as the involvement of patients is limited and relative to specific tasks, most often of a data-productive character. The afore-mentioned issues are neither exhaustive nor exhausted in this essay. They require in-depth treatment; with this introduction the aim has been to give a few coordinates on how to think about the subject.

Further Readings

Angwin, J. (2014). Dragnet nation: A quest for privacy, security, and freedom in a world of relentless surveillance. New York: Henry Holt and Company.
Arnott-Smith, C., & Wicks, P. (2008). PatientsLikeMe: Consumer health vocabulary as a folksonomy. American Medical Informatics Association Annual Symposium Proceedings, 2008, 682–686.
Kallinikos, J., & Tempini, N. (2014). Patient data as medical facts: Social media practices as a foundation for medical knowledge creation. Information Systems Research, 25, 817–833. doi:10.1287/isre.2014.0544.
Lunshof, J. E., Church, G. M., & Prainsack, B. (2014). Raw personal data: Providing access. Science, 343, 373–374. doi:10.1126/science.1249382.
Prainsack, B. (2013). Let's get real about virtual: Online health is here to stay. Genetical Research, 95, 111–113. doi:10.1017/S001667231300013X.
Richards, M., Anderson, R., Hinde, S., Kaye, J., Lucassen, A., Matthews, P., Parker, M., Shotter, M., Watts, G., Wallace, S., & Wise, J. (2015). The collection, linking and use of data in biomedical research and health care: Ethical issues. London: Nuffield Council on Bioethics.
Tempini, N. (2014). Governing social media: Organising information production and sociality through open, distributed and data-based systems (Doctoral dissertation). School of Economics and Political Science, London.
Tempini, N. (2015). Governing PatientsLikeMe: Information production and research through an open, distributed and data-based social media network. The Information Society, 31, 193–211.
Wicks, P., Vaughan, T. E., Massagli, M. P., & Heywood, J. (2011). Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nature Biotechnology, 29, 411–414. doi:10.1038/nbt.1837.
Wyatt, S., Harris, A., Adams, S., & Kelly, S. E. (2013). Illness online: Self-reported data and questions of trust in medical and social research. Theory, Culture & Society, 30, 131–150. doi:10.1177/0263276413485900.
Zuboff, S. (2015). Big other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30, 75–89.

Pharmaceutical Industry

Janelle Applequist
The Zimmerman School of Advertising and Mass Communications, University of South Florida, Tampa, FL, USA

Globally, the pharmaceutical industry is worth more than $1 trillion, encompassing one of the world's most profitable industries, focused on the development, production, and marketing of prescription drugs for use by patients. Over one-third of the pharmaceutical industry is controlled by just ten companies, with six of these companies in the United States alone. The World Health Organization has reported an inherent conflict of interest between the pharmaceutical industry's business goals and the medical needs of the public, attributable to the fact that twice as much is spent on promotion (including advertisements, marketing, and sales representation) as on the research and development of future prescription drugs needed for public health efforts. The average pharmaceutical company in the United States sees a profit of greater than $10 billion annually, while pharmaceutical companies spend 50 times more on promoting and advertising their own products than on public health information initiatives.

Big data can be described as the collection, manipulation, and analysis of massive amounts of data – and the decisions made from that analysis. Having the ability to be described as both a problem and an opportunity, big data and its techniques continue to be utilized in business by thousands of major institutions. The sector of health care is not immune to massive data collection efforts, and pharmaceuticals in particular comprise an industry that relies on aggregating information.

Literature on data mining in the pharmaceutical industry generally points to a disagreement regarding the intended use of health-care information. On the one hand, historically, data mining techniques have proved useful for the research and development (R&D) of current and future prescription drugs. Alternatively, continuing consumerist discourses in health care that have positioned the pharmaceutical industry as a massive and successful corporate entity have acknowledged how this data is used to increase business sales, potentially at the cost of patient confidentiality and trust.

History of Data Mining Used for Pharmaceutical R&D

Proponents of data mining in the pharmaceutical industry have cited its ability to aid in: organizing information pertaining to genes, proteins, diseases, organisms, and chemical substances, allowing predictive models to be built for analyzing the stages of drug development; keeping track of adverse effects of drugs in a neural network during clinical trial stages; listing warnings and known reactions reported during the post-drug production stage; forecasting new drugs needed in the marketplace; providing inventory control and supply chain management information; and managing inventories. Data mining was first used in the pharmaceutical industry as early as the 1960s, alongside the increase in prescription drug patenting. With over 1,000 drug patents a year being introduced at that time, data collection assisted pharmaceutical scientists in keeping up with patents being proposed. At this time, information was collected and published in an editorial-style bulletin categorized according to areas of interest in an effort to make relevant issues easier for scientists to navigate. Early in the 1980s, technologies allowed biological sequences to be identified and stored, such as the Human Genome Project, which led to the increased use and publishing of databanks. Occurring alongside the popularity of personal computer usage, bioinformatics was born, which allowed biological sequence data to be used for discovering and studying new prescription drug targets. Ten years later, in the 1990s, microarray technology developed, posing a problem for data collection, as this technology permitted the simultaneous measurement of large numbers of genes and the collection of experimental data on a large scale. As the ability to sequence a genome arrived in the 2000s, the ability to manage large volumes of raw data was still maturing, creating a continued problem for data mining in the pharmaceutical industry. As the challenges presented for data mining in relation to R&D have continued to increase since the 1990s, the opportunities for data mining in order to increase prescription drug sales have steadily grown.

Data Mining in the Pharmaceutical Industry as a Form of Controversy

Since the early 1990s, health-care information companies have been purchasing the electronic records of prescriptions from pharmacies and other data collection resources in order to strategically link this information with specific physicians.

Prescription tracking refers to the collection of data from prescriptions as they are filled at pharmacies. When a prescription gets filled, data miners are able to collect the name of the drug, the date of the prescription, and the name or licensing number of the prescribing physician. Yet, it is simple for the prescription drug industry to identify specific physicians through protocol in place by the American Medical Association (AMA). The AMA has a "Physician Masterfile" that includes all US physicians, whether or not they belong to the AMA, and this file allows the physician licensing numbers collected by data miners to be connected to a name. Information distribution companies (such as IMS Health, Dendrite, Verispan, Wolters Kluwer, etc.) purchase records from pharmacies. What many consumers do not realize is that most pharmacies have these records for sale and are able to do so legally by not including patient names and only providing a physician's state licensing number and/or name. While pharmacies cannot release a patient's name, they can provide data miners with a patient's age, sex, geographic location, medical conditions, hospitalizations, laboratory tests, insurance copays, and medication use. This has caused a significant area of concern on behalf of patients, as it not only may increase instances of prescription detailing, but may also compromise the interests of patients. Data miners do not have access to patient names when collecting prescription data; however, data miners assign unique numbers to individuals so that future prescriptions for the patient can be tracked and analyzed together. This means that data miners can determine how long a patient remains on a drug, whether the drug treatment is continued, and which new drugs become prescribed for the patient.
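A rough sketch of the longitudinal linking described here: fills that share a data-miner-assigned patient number are grouped so that persistence and switching can be computed. All identifiers, drug names, and dates below are invented for illustration; real prescription data sets are far larger and messier.

```python
from datetime import date
from collections import defaultdict

# Invented fill records of the kind described above: no patient name, just a
# data-miner-assigned number, the drug, the prescriber ID, and the fill date.
fills = [
    {"patient_no": 881, "drug": "statin_A", "prescriber": "NH-4432", "filled": date(2013, 1, 5)},
    {"patient_no": 881, "drug": "statin_A", "prescriber": "NH-4432", "filled": date(2013, 2, 4)},
    {"patient_no": 881, "drug": "statin_B", "prescriber": "NH-4432", "filled": date(2013, 4, 1)},
    {"patient_no": 902, "drug": "statin_A", "prescriber": "ME-1187", "filled": date(2013, 1, 9)},
]

# Group fills by the assigned patient number, in chronological order.
history = defaultdict(list)
for f in sorted(fills, key=lambda f: f["filled"]):
    history[f["patient_no"]].append(f)

for patient, rows in history.items():
    drugs = [r["drug"] for r in rows]
    span_days = (rows[-1]["filled"] - rows[0]["filled"]).days
    switched = len(set(drugs)) > 1
    print(patient, "observed_days:", span_days, "switched:", switched,
          "sequence:", " -> ".join(dict.fromkeys(drugs)))
```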

As information concerning a patient's health is highly sensitive, data mining techniques used by the pharmaceutical industry have perpetuated the notion that personal information carries a substantial economic value. With data mining companies paying pharmacies to extract prescription drug information, the relationships between patients and their physicians and/or pharmacists are being exploited. The American Medical Association (AMA) established the Physician Data Restriction Program in 2006, giving any physician the opportunity to opt out of data mining initiatives. To date, no such program exists for patients that would give them the opportunity to have their records removed from data collection procedures and subsequent analyses. Three states have enacted statutes that do not permit data mining of prescription records. With its Prescription Confidentiality Act of 2006, New Hampshire was the first state to decide that prescription information could not be sold or used for any advertising, marketing, or promotional purposes. However, if the information is de-identified, meaning that the physician and patient names cannot be accessed, then the data can be aggregated by geographical region or zip code, meaning that data mining companies could still provide an overall, more generalized report for small geographic areas but could not target specific physicians. Maine and Vermont have statutes that limit the presence of data mining. Physicians in Maine can register with the state to prevent data mining companies from obtaining their prescribing records. Data miners in Vermont must obtain consent from the physician whose records they are analyzing prior to using "prescriber-identifiable" information for marketing or promotional purposes.

The number one customer for information distribution companies is the pharmaceutical industry, which purchases the prescribing data to identify the highest prescribers and also to track the effects of their promotional efforts. Physicians are given a value, a ranking from one to ten, which identifies how often they prescribe drugs. A sales training guide for Merck even states that this value is used to identify which products are currently in favor with the physician in order to develop a strategy to change those prescriptions into Merck prescriptions. For example, as a result of data mining in the pharmaceutical industry, pharmaceutical sales representatives could determine which physicians are already prescribing specific drugs in order to reinforce already-existent preferences, or could learn when a physician switches from a drug to a competing drug, so that the representative can attempt to encourage the physician to switch back to the original prescription.

The Future of Data Mining in the Pharmaceutical Industry

As of 2013, only 18% of pharmaceutical companies work directly with social media to promote their prescription drugs, but this number is expected to increase substantially in the next year. As more individuals tweet about their medical concerns, symptoms, the drugs they take, and respective side effects, pharmaceutical companies have noticed that social media has become an integrated part of personalized medicine for individuals. Pharmaceutical companies are already in the process of hiring data miners to collect and analyze various forms of public social media in an effort to discover unmet needs, recognize new adverse events, and determine what types of drugs consumers would like to see enter the market.
strategy to change those prescriptions into Merck there are evident problems associated with pre-
prescriptions. The empirical evidence provided by scription drug data mining, the US Supreme Court
information distribution companies offers a has continued to recognize that the pharmaceuti-
glimpse into the personality, behaviors, and cal industry has a first amendment right to adver-
beliefs of a physician, which is why these num- tise and solicit clients for goods and future
bers are so valued by the drug industry. services. The Court has argued that legal safe-
By collecting and analyzing this data, pharma- guards, such as the Health Information Portability
ceutical sales representatives are able to better and Accountability Act (HIPAA), are put in place
target their marketing activities toward to combat the very concerns posed by practices
4 Pharmaceutical Industry

such as pharmaceutical industry data mining. Additionally, the Court has found that by stripping pharmaceutical records of patient information that could lead to personal identification (e.g., name, address), patients have their confidentiality adequately protected. The law therefore leaves it to the discretion of the physician to decide whether to associate with pharmaceutical sales representatives and the various data collection procedures.

An ongoing element to address in analyzing the pharmaceutical industry's use of data mining techniques will be the level of transparency maintained with patients while the collected information is being used. Research shows that the majority of patients in the United States are not only unfamiliar with data mining by the pharmaceutical industry but are also against any personal information (e.g., prescription usage information and personal diagnoses) being sold and shared with outside entities, namely corporations. As health care continues to change in the United States, it will be important for patients to understand the ways in which their personal information is being shared and used, in an effort to increase national understanding of how privacy laws are connected to the pharmaceutical industry.

Cross-References

▶ Electronic Health Records (EHR)
▶ Food and Drug Administration (FDA)
▶ Health Care Industry
▶ Patient Records
▶ Privacy
Pollution, Air

Zerrin Savasan
Department of International Relations, Sub-Department of International Law, Faculty of Economics and Administrative Sciences, Selcuk University, Konya, Turkey

The air contains many different substances: gases, aerosols, particulate matter, trace metals, and a variety of other compounds. When these substances change in concentration, across space, and over time to an extent that air quality deteriorates, contaminants or pollutant substances are said to be present in the air. The release of these air pollutants causes harmful effects to the environment and to humans and all other organisms; this is regarded as air pollution.

The air is a common, shared resource of all human beings. Once released, air pollutants can be carried by natural processes such as winds and rains. Some pollutants, e.g., lead or chloroform, often contaminate more than one environmental medium, so many air pollutants can also be water or land pollutants. They can combine with other pollutants, undergo chemical transformations, and eventually be deposited in different locations, and their effects can therefore emerge in places far from their original sources. Thus, they can detrimentally affect all organisms on local or regional scales and also the climate on a global scale.

Hence, concern about air pollution and its influence on the earth, and efforts to prevent and mitigate it, have increased greatly on a global scale. Today, however, it still stands as one of the primary challenges that must be addressed globally on the basis of international cooperation. It therefore becomes necessary to promote widespread understanding of air pollution, its pollutants, sources, and impacts.

Sources of Air Pollution

Air pollutants can be produced by natural causes (e.g., fires from burning vegetation, forest fires, volcanic eruptions) or by anthropogenic (human-caused) activity. For outdoor air pollution – referring to pollutants found outdoors – the smokestacks of industrial plants are an example of a human-made source, although natural processes, e.g., volcanic eruptions, also produce outdoor air pollution. The main causes of indoor air pollution likewise arise largely from human activity, e.g., the technologies used for cooking, heating, and lighting; nonetheless, there are also natural indoor air pollutants, like radon, as well as chemical pollutants from building materials and cleaning products.

Among these, human-based sources, particularly since industrialization, have produced a wide variety of sources of air pollution and have thus contributed most to global air pollution. They can
emanate from point and nonpoint sources, or from mobile and stationary sources. A point source is a specific location from which large quantities of pollutants are discharged, e.g., a coal-fired power plant. A nonpoint source, on the other hand, is more diffuse, often involving many small sources spread across a wide area, e.g., automobiles. Automobiles are also known as mobile sources, and the combustion of gasoline is responsible for the emissions released from mobile sources. Industrial activities are known as stationary sources, and the combustion of fossil fuels (coal) accounts for their emissions.

The pollutants produced from these distinct sources may cause harm directly or indirectly. If they are emitted from the source directly into the atmosphere, and so cause harm directly, they are called primary pollutants, e.g., carbon oxides, carbon monoxide, hydrocarbons, nitrogen oxides, sulfur dioxide, and particulate matter. If they are produced by chemical reactions in the atmosphere that also involve primary pollutants, they are known as secondary pollutants, e.g., ozone and sulfuric acid.

The Impacts of Air Pollution

Air pollutants result in a wide range of impacts on both humans and the environment. Their detrimental effects on humans can be briefly summarized as follows: health problems resulting particularly from toxicological stress, such as respiratory diseases (e.g., emphysema and chronic bronchitis), chronic lung diseases, pneumonia, cardiovascular troubles, and cancer, as well as immune system disorders that increase susceptibility to infection. Their adverse effects on the environment include acid deposition, climate change resulting from the emission of greenhouse gases, degradation of air resources, deterioration of air quality, noise, photooxidant formation (smog), reduction in the overall productivity of crop plants, stratospheric ozone (O3) depletion, threats to the survival of biological species, and more.

In determining the extent and degree of harm caused by these pollutants, it becomes necessary to know enough about the features of the pollutant in question. Some pollutants that cause environmental or health problems in the air can be essential in the soil or water: nitrogen, for example, is harmful in the air because it can form ozone, yet it is necessary for the soil, where it acts beneficially as a fertilizer. Additionally, if toxic substances are present below a certain threshold, they are not necessarily harmful.

New Technologies for Air Pollution: Big Data

Before the industrialization period, the components of pollution were thought to be primarily smoke and soot; with industrialization, they have expanded to include a broad range of emissions, including toxic chemicals and biological or radioactive materials. Even today, six conventional pollutants (or criteria air pollutants) are identified by the US Environmental Protection Agency (EPA): carbon monoxide, lead, nitrogen oxides, ozone, particulate matter, and sulfur oxides. It is therefore to be expected that new sources of air pollution, and so new threats to the earth, may appear soon. Indeed, very recently, through the Kigali (Rwanda) Amendment to the Montreal Protocol (14 October 2016), adopted at the 28th Meeting of the Parties (MOP 28), it was agreed to address hydrofluorocarbons (HFCs) – greenhouse gases with a very high global warming potential, even if less harmful to the ozone layer than CFCs and HCFCs – under the Protocol, in addition to chlorofluorocarbons (CFCs) and hydrochlorofluorocarbons (HCFCs).

Air pollution first became an international issue with the Trail Smelter Arbitration (1941) between Canada and the United States; prior to the Tribunal's decision, disputes over air pollution between two countries had never been settled through arbitration. Since this arbitration case – and especially with increasing efforts since the early 1990s – attempts to measure, reduce, and address the rapidly growing impacts of air pollution have been continuing.
Developing new technologies, like Big Data, is one of those attempts.

Big Data has no uniform definition (ELI 2014; Keeso 2014; Simon 2013; Sowe and Zettsu 2014). In fact, it is defined and understood in diverse ways by different researchers (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Gogia 2012; Mayer-Schönberger and Cukier 2013; Manyika et al. 2011) and by interested companies such as Experian, Forrester, Forte Wares, Gartner, and IBM. It was initially characterized by 3Vs – volume (data amount), velocity (data speed), and variety (data types and sources) (Laney 2001). Over time, further Vs have been added: veracity (data accuracy) (IBM) and variability (the susceptibility of data quality to structural variation) (Gogia 2012); a fifth V, value (the capability of data to be turned into value), together with veracity (Marr); and a sixth, vulnerability (data security and privacy) (Experian 2016). Big Data can also be defined by veracity and value together with visualization (the visual representation of data) as an additional 3Vs (Sowe and Zettsu 2014), or by volume, velocity, and variety requiring specific technologies and analytical methods for their transformation into value (De Mauro et al. 2016). In general, however, the term refers to data sets and processing applications so large and complex that conventional systems cannot cope with them.

Because air pollution has many aspects that must be measured, as noted above, it requires massive amounts of data collected at different spatial and temporal levels. It is therefore observed in practice that Big Data sets and analytics are increasingly used in the field of air pollution: for monitoring, for predicting possible consequences and responding to them in a timely way, for controlling and reducing impacts, and for mitigating the pollution itself.

They can be used by different kinds of organizations, such as governmental agencies, private firms, and nongovernmental organizations (NGOs). To illustrate, under the US Environmental Protection Agency (EPA), examples of Big Data use include:

• Air Quality Monitoring (collaborating with NASA on the DISCOVER-AQ initiative, it involves research on Apps and Sensors for Air Pollution (ASAP), National Ambient Air Quality Standards (NAAQS) compliance, and data fusion methods)
• Village Green Project (on improving air quality monitoring and awareness in communities)
• Environmental Quality Index (EQI) (a dataset consisting of an index of environmental quality based on air, water, land, built environment, and sociodemographic space)

There are also examples generated by local governments, like "E-Enterprise for the Environment"; by environmental organizations, like "Personal Air Quality Monitoring"; by citizen science, like "Danger Maps"; and by private firms, like "Aircraft Emissions Reductions" (ELI 2014) or the Green Horizons Project (IBM 2015).

The Environmental Performance Index (EPI) is another platform – using Big Data compiled from a great number of sensors and models – that provides a country and issue ranking of how each country manages environmental issues, as well as a Data Explorer allowing users to investigate the global data by comparing environmental performance with GDP, population, land area, or other variables.

Despite all this, as the potential benefits and costs of the use of Big Data are still under discussion (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares n.d.; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014), various concerns can be raised about the use of Big Data to monitor, measure, and forecast air pollution as well. Further research is therefore required to identify gaps, challenges, and solutions for "making the right data (not just higher volume) available to the right people (not just higher variety) at the right time (not just higher velocity)" (Forte Wares n.d.).

Cross-References

▶ Climate Change
▶ Environment
▶ Pollution, Land
▶ Pollution, Water
References

Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh, 29 Apr 2010. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878. Accessed 3 Feb 2017.
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Retrieved from https://www.researchgate.net/publication/299379163_A_formal_definition_of_Big_Data_based_on_its_essential_features. Accessed 3 Feb 2017.
Environmental Law Institute (ELI). (2014). Big data and environmental protection: An initial survey of public and private initiatives. Washington, DC: Environmental Law Institute. Retrieved from https://www.eli.org/sites/default/files/eli-pubs/big-data-and-environmental-protection.pdf. Accessed 3 Feb 2017.
Environmental Performance Index (EPI). (n.d.). Available at: http://epi.yale.edu/. Accessed 3 Feb 2017.
Experian. (2016). A data powered future. White Paper. Retrieved from http://www.experian.co.uk/assets/resources/white-papers/data-powered-future-2016.pdf. Accessed 3 Feb 2017.
Gartner. (2011). Gartner says solving 'big data' challenge involves more than just managing volumes of data. 27 June 2011. Retrieved from http://www.gartner.com/newsroom/id/1731916. Accessed 3 Feb 2017.
Gogia, S. (2012). The big deal about big data for customer engagement. 1 June 2012. Retrieved from http://www.iab.fi/media/tutkimus-matskut/130822_forrester_the_big_deal_about_big_data.pdf. Accessed 3 Feb 2017.
IBM. (2015). IBM expands green horizons initiative globally to address pressing environmental and pollution challenges. Retrieved from http://www-03.ibm.com/press/us/en/pressrelease/48255.wss. Accessed 3 Feb 2017.
IBM. (n.d.). What is big data? Retrieved from https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html. Accessed 3 Feb 2017.
Keeso, A. (2014). Big data and environmental sustainability: A conversation starter. Smith School Working Paper Series, Working Paper 14-04, December 2014. Retrieved from http://www.smithschool.ox.ac.uk/library/working-papers/workingpaper%2014-04.pdf. Accessed 3 Feb 2017.
Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. Meta Group. Retrieved from https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Accessed 3 Feb 2017.
Manyika, J., et al. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. Accessed 3 Feb 2017.
Marr, B. (n.d.). Big data: The 5 Vs everyone must know. Retrieved from https://www.linkedin.com/pulse/20140306073407-64875646-big-data-the-5-vs-everyone-must-know. Accessed 3 Feb 2017.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
Wares, F. (n.d.). Failure to launch: From big data to big decisions: Why velocity, variety and volume is not improving decision making and how to fix it. White Paper. A Forte Consultancy Group Company. Retrieved from http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares–pro-active-reporting_EN.pdf. Accessed 3 Feb 2017.
Pollution, Land

Zerrin Savaşan
Department of International Relations, Sub-Department of International Law, Faculty of Economics and Administrative Sciences, Selçuk University, Konya, Turkey

Pollution, in all its types (air, water, land), means the entrance into the natural environment, beyond threshold concentration levels, of substances that do not naturally belong there, resulting in its degradation and causing harmful effects both to humans and all other living organisms and to the environment. In land pollution, likewise, solid or liquid waste materials are deposited on land and further degrade and deteriorate the quality and productive capacity of the land surface. The term is sometimes used as a substitute for, or together with, soil pollution, in which the upper layer of the soil is destroyed; in fact, however, soil pollution is just one of the causes of land pollution.

Like the other types, land pollution arises as a global environmental problem, specifically associated with urbanization and industrialization, that should be dealt with through globally concerted environmental policies. As a first and foremost step, however, it needs to be understood well, in all its dimensions, by all humankind, and particularly by the researchers studying it.

What Causes Land Pollution?

The degradation of land surfaces is caused directly or indirectly by human (anthropogenic) activities. Several factors temporarily or permanently change the land structure and thereby cause land pollution. Three main causes are generally identified – industrialization, overpopulation, and urbanization – and the others can be counted as stemming from these. Some of them are as follows: improper waste disposal (agricultural, domestic, industrial, solid, or radioactive waste) and littering; mining, which pollutes the land by removing the topsoil that forms the fertile layer of soil or by leaving behind waste products and the chemicals used in the process; misuse of land (deforestation, land conversion, desertification); soil pollution (pollution of the topmost layer of the land); soil erosion (loss of the upper, most fertile layer of the soil); and the chemicals (pesticides, insecticides, and fertilizers) applied to the land for crop enhancement.

Regarding these chemicals used for crop enhancement, it should be underlined that, while they enhance crop yields, they can also kill insects, mosquitoes, and other small animals, and so can harm the bigger animals that feed on them. In addition, most of these chemicals can remain in the soil, or accumulate there, for many years. DDT (dichlorodiphenyltrichloroethane) is one such pesticide. It is now widely banned, in great part owing to the
effect of Rachel Carson's famous book, Silent Spring (1962), which documented the detrimental effects of pesticides on the environment, particularly on birds. Nonetheless, as it is not ordinarily biodegradable – it is known as a persistent organic pollutant – it has remained in the environment ever since it was first used.

Consequences of Land Pollution

All types of pollution are interrelated, and their consequences cannot be restricted to the place where the pollution is first discharged. This is particularly because of atmospheric deposition, in which existing pollution in the air (atmosphere) creates pollution in water or on land as well.

Since the types of pollution are interrelated, their impacts are also similar to one another. Like the others, land pollution has serious consequences for humans, for animals and other living organisms, and for the environment. First of all, all living things depend on the resources of the earth and on the plants growing from the land to survive, so anything that damages or destroys the land ultimately has an impact on the survival of humankind itself and of all other living things on the earth. Damage to the land also leads to health problems such as respiratory problems, skin problems, and various kinds of cancers.

Its effects on the environment also require attention, as land pollution is one of the most important contributors to global warming, a phenomenon that has become very prominent but is still not adequately understood. This emerges from a natural circulation: land pollution leads to deforestation, deforestation leads to less rain, and eventually to problems such as the greenhouse effect and global warming/climate change. Biomagnification is the other major concern stemming from land pollution. It occurs when certain substances, such as pesticides or heavy metals, are taken up through feeding by aquatic organisms such as fish, which in turn are eaten by large birds, animals, or humans. These substances become concentrated in internal organs as they move up the food chain, and the concentration of these toxic compounds tends to increase. This process threatens both the particular species involved and all the other species above and below them in the food chain. All of this, combined with the massive extinction of certain species – primarily because of the disturbance of their habitats – also induces massive reductions in biodiversity.

Control Measures for Land Pollution

Land pollution, along with the other types of pollution, poses a threat to the sustainability of the world's resources. However, while the other types have some capacity for self-purification with the help of natural events, polluted land stays polluted until it is cleaned up. This fact is easier to appreciate when one considers the time needed for plastics to disappear in nature (hundreds of years) or for radioactive waste to decay (almost forever). Land pollution thus becomes one of the serious concerns of humankind.

When asking what should be done to deal with it, it is first of all essential to remember that it is a global problem with no boundaries, and so it must be handled collectively. While working collectively, it is necessary to set serious environmental objectives and best-practice measures. A wide range of measures – varying according to the cause of the pollution – can be considered to prevent, reduce, or stop land pollution, such as adopting and encouraging organic farming instead of using chemical herbicides and pesticides, restricting or forbidding their use, developing effective methods for recycling and reusing waste materials, ensuring the proper disposal of all wastes (domestic, industrial, etc.) into secured landfill sites, and creating public awareness of and support for all environmental issues.

Apart from all these measures, the use of Big Data technologies can also be seen as a way of addressing the rapidly increasing and wide-ranging consequences of land pollution.

Some of the cases in which Big Data technologies are used in relation to one or more aspects of land pollution can be illustrated as follows (ELI 2014):
• Located under the US Department of the Interior (DOI), the National Integrated Land System (NILS) aims to provide the principal data source for land surveys and status by combining Bureau of Land Management (BLM) and Forest Service data into a joint system.
• The New York City Open Accessible Space Information System (OASIS) is another sample case; as an online open space mapping tool, it involves a huge amount of data concerning public lands, parks, community gardens, coastal storm impact areas, and zoning and land use patterns.
• Online access by state Departments of Natural Resources (DNRs) and other agencies to Geographic Information Systems (GIS) data on environmental concerns contributes to the effective management of land, water, forests, and wildlife, and essentially requires the use of Big Data to make this contribution.
• Alabama's State Water Program is another example, providing geospatial data related to hydrologic, soil, geological, land use, and land cover issues.
• The National Ecological Observatory Network (NEON) is an environmental organization providing the collection of site-based data on the effects of climate change, invasive species, and land use from 160 sites throughout the USA.
• The Tropical Ecology Assessment and Monitoring Network (TEAM) is a global network facilitating the collection and integration of publicly shared data on patterns of biodiversity, climate, ecosystems, and land use.
• The Danger Maps project is another sample case for the use of Big Data, as it provides the mapping of government-collected data on over 13,000 polluting facilities in China, allowing users to search by area or by type of pollution (water, air, radiation, soil).

The US Environmental Protection Agency (EPA) and the Environmental Performance Index (EPI) are other platforms using Big Data compiled from a great number of sensors on environmental issues, including land pollution and the other types of pollution. That is, Big Data technologies can be seen as a way of addressing the consequences of all types of pollution, not just land pollution, particularly because each type of pollution is deeply interconnected with the others, so their consequences cannot be restricted to the place where the pollution is first discharged, as mentioned above. For all types of pollution, therefore, reliance on satellite technology, data, and data visualization is essential in order to monitor them regularly, to forecast and reduce their possible impacts, and to mitigate the pollution itself. Nonetheless, serious concerns have been raised about different aspects of the use of Big Data in general (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares n.d.; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014). Further investigation and analysis are therefore needed to clarify the relevant gaps and challenges regarding the use of Big Data specifically for land pollution.

Cross-References

▶ Climate Change
▶ Earth Sciences
▶ Environment
▶ Natural Sciences
▶ Pollution, Air
▶ Pollution, Water

Further Readings

Alloway, B. J. (2001). Soil pollution and land contamination. In R. M. Harrison (Ed.), Pollution: Causes, effects and control (pp. 352–377). Cambridge: The Royal Society of Chemistry.
Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh, 29 Apr 2010. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878. Accessed 3 Feb 2017.
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Retrieved from https://www.researchgate.net/publication/299379163_A_formal_definition_of_Big_Data_based_on_its_essential_features. Accessed 3 Feb 2017.
Environmental Law Institute (ELI). (2014). Big data and environmental protection: An initial survey of public and private initiatives. Washington, DC: Environmental Law Institute. Retrieved from https://www.eli.org/sites/default/files/eli-pubs/big-data-and-environmental-protection.pdf. Accessed 3 Feb 2017.
Environmental Performance Index (EPI). (n.d.). Available at: http://epi.yale.edu/. Accessed 3 Feb 2017.
Forte Wares. (n.d.). Failure to launch: From big data to big decisions: Why velocity, variety and volume is not improving decision making and how to fix it. White Paper. A Forte Consultancy Group Company. Retrieved from http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares–pro-active-reporting_EN.pdf. Accessed 3 Feb 2017.
Hill, M. K. (2004). Understanding environmental pollution. New York: Cambridge University Press.
Keeso, A. (2014). Big data and environmental sustainability: A conversation starter. Smith School Working Paper Series, Working Paper 14-04, December 2014. Retrieved from http://www.smithschool.ox.ac.uk/library/working-papers/workingpaper%2014-04.pdf. Accessed 3 Feb 2017.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
Mirsal, I. A. (2008). Soil pollution: Origin, monitoring & remediation. Berlin/Heidelberg: Springer.
Raven, P. H., & Berg, L. R. (2006). Environment. Danvers: Wiley.
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
Withgott, J., & Brennan, S. (2011). Environment. San Francisco: Pearson.
Pollution, Water

Zerrin Savaşan
Faculty of Economics and Administrative Sciences, Department of International Relations, Sub-Department of International Law, Selçuk University, Konya, Turkey

Water pollution can be defined as the contamination of water bodies by the entrance of large amounts of materials or substances into those bodies, resulting in physical or chemical changes in the water, modifying its natural features, degrading water quality, and adversely affecting humans and the environment.

Particularly in recent decades, it has become widely accepted that water pollution is a global environmental problem interrelated with all other environmental challenges. At the national level, water pollution control generally requires financial resources, technological improvement, policy measures, the necessary legal and administrative framework, and the institutional and staff capacity to implement these policy measures in practice. More importantly, however, at the global level it requires the cooperation of all related parties at all levels. Despite efforts at both the national and global levels, substantially reducing pollution continues to pose a challenge. This is particularly because, even though the world is becoming increasingly globalized, it is still mostly regarded as having unlimited resources. Hence, it becomes essential to explain that the world's resources are in fact limited and should not be polluted, and to have adequate information on all the types of pollution that result in environmental deterioration, including water pollution.

What Causes Water Pollution?

This question has many answers, but basically two main causes can be mentioned: natural causes and human-driven causes. All waters are subject to some degree of natural (or ecological) pollution, caused by nature rather than by human activity, through algal blooms, forest fires, floods, sedimentation stemming from rainfall, volcanic eruptions, and other natural events. However, the greater part of water pollution arises from human activities, particularly from massive industrialization. Accidental spills (e.g., a disaster like the wreck of an oil tanker, which, unlike the others, is unpredictable); domestic discharges; industrial discharges; the use of large amounts of herbicides, pesticides, and chemical fertilizers; sediments in the waterways of agricultural fields; improper disposal of hazardous chemicals down the sewers; and the failure to construct adequate waste disposal systems are some, though not all, of the human-made causes of water pollution.

The causes mentioned above vary greatly because a complex variety of pollutants, lying
earth’s surface, get involved in water bodies and has serious widespread effects. In fact, adverse
result in water quality degradation. Indeed, there alteration of water quality produces costs both to
are many different types of water pollutants spill- humans (e.g., large-scale diseases and deaths) and
ing into waterways causing water pollution. They to environment (e.g., biodiversity reduction, spe-
all can be divided up into various categories: cies mortality). Its impact differs depending on the
chemical, physical, pathogenic pollutants, radio- type of water body affected (groundwater, lakes,
active substances, organic pollutants, inorganic rivers, streams, and wetlands). However, it can be
fertilizers, metals, toxic pollutants, biological pol- prevented, lessened, and even eliminated in many
lutants, and so on. Conventional, non- different ways. Some of these different treatment
conventional, and toxic pollutants are some of methods, aiming to keep the pollutants from dam-
these divisions which are regulated by the US aging the waterways, can be relied on the use of
Clean Water Act. The conventional pollutants techniques reducing water use, reducing the usage
are as follows: dissolved oxygen, biochemical of highly water soluble pesticide and herbicide
oxygen demand (BOD), temperature, pH (acid compounds, and reducing their amounts, control-
deposition), sewage, pathogenic agents, animal ling rapid water runoff, physical separation of
wastes, bacteria, nutrients, turbidity, sediment, pollutants from the water, or on the management
total suspended solids (TSS), fecal coliform, oil, practices in the field of urban design and
and grease. Nonconventional (or nontoxic) pollut- sanitation.
ants are not identified as either conventional or There are also some other attempts to measure,
priority, like aluminum, ammonia, chloride, col- reduce, and address rapidly growing impacts of
ored effluents, exotic species, instream flow, iron, water pollution, such as the use of Big Data. Big
radioactive materials, and total phenols. Toxic Data technologies can provide ways of achieving
pollutants, metals, dioxin, and lead can be counted better solutions for the challenges of water pollu-
as examples of priority pollutants. Each group of tion. To illustrate, EPA databases can be accessed
these pollutants has its own specific ways of enter- and maps can be generated from them including
ing the water bodies and its own specific risks. information on environmental activities affecting
water and also on air and land in the context of
EnviroMapper. Under US Department of the Inte-
Water Pollution Control rior (DOI), National Water Information System
(NWIS) monitors surface and underground water
In order to control all these pollutants, it is bene- quantity, quality, distribution, and movement.
ficial to determine from where they are Under National Oceanic and Atmospheric
discharged. So, the following categories can be Administration (NOAA), California Seafloor
identified to find out where they originate from: Mapping Program (CSMP) works for creating a
point and nonpoint sources of pollution. If the comprehensive base map series of coastal/marine
sources causing pollution come from single iden- geology and habitat for all waters of the USA.
tifiable points of discharge, they are point sources Additionally, the Hudson River Environmental
of pollution, e.g., domestic discharges, ditches, Conditions Observing System comprises 15 mon-
pipes of industrial facilities, and ships discharging itoring stations – located between Albany and the
toxic substances directly into a water body. Non- New York Harbor – automatically collecting sam-
point sources of pollution are characterized by ples every 15 min that are used to monitor water
dispersed, not easily identifiable discharge points, quality, assess flood risk, and assist in pollution
e.g., runoff of pollutants into a waterway, like cleanup and fisheries management. Contamina-
agricultural runoff, stormwater runoff. As it is tion Warning System Project, conducted by the
harder to identify them, it is nearly impossible to Philadelphia Water Department, is a combination
collect, trace, and control them precisely, whereas of new data technologies with existing manage-
point sources can be easily controlled. ment systems. It provides a visual representation
Pollution, Water 3

of data streams containing geospatial, water quality, customer concern, operations, and public health information. Creek Watch is another sample case of the use of Big Data in the field of water pollution. Developed by IBM and the California State Water Resources Control Board's Clean Water Team, it is a free app that allows users to rate a waterway on three criteria: amount of water, rate of flow, and amount of trash. The collected data are extensive enough to track pollution and manage water resources. The Danger Maps project maps government-collected data on over 13,000 polluting facilities in China and lets users search by area or by type of pollution (water, air, radiation, soil). Developing technology on farm performance can also be cited as an example of the use of Big Data compiled from yield information, sensors, high-resolution maps, and databases for the water pollution issue. For example, the machine-to-machine (M2M) agricultural technology produced by the Canadian startup company Semios allows farmers to improve yields and the efficiency of their farm operations, but it also provides information for reducing polluted runoff by increasing the efficient use of water, pesticides, and fertilizers (ELI 2014).

The Environmental Performance Index (EPI) is another platform using Big Data to display how each country manages environmental issues and to allow users to investigate the data by comparing environmental performance with GDP, population, land area, or other variables.

As the example cases above show, Big Data technologies are increasingly applied in the water field, in its different aspects from management to pollution. However, further research is still required for their effective use in order to eliminate the related concerns. This is particularly because there is still debate on the use of Big Data, even regarding its general scope and terms (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares n.d.; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014).

Cross-References

▶ Climate Change
▶ Earth Sciences
▶ Environment
▶ Natural Sciences
▶ Pollution, Air
▶ Pollution, Land

Further Readings

Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh, 29 Apr 2010. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878. Accessed 3 Feb 2017.
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Retrieved from https://www.researchgate.net/publication/299379163_A_formal_definition_of_Big_Data_based_on_its_essential_features. Accessed 3 Feb 2017.
Environmental Law Institute (ELI). (2014). Big data and environmental protection: An initial survey of public and private initiatives. Washington, DC: Environmental Law Institute. Retrieved from https://www.eli.org/sites/default/files/eli-pubs/big-data-and-environmental-protection.pdf. Accessed 3 Feb 2017.
Environmental Performance Index (EPI). (n.d.). Available at: http://epi.yale.edu/. Accessed 3 Feb 2017.
Forte Wares. (n.d.). Failure to launch: From big data to big decisions: Why velocity, variety and volume is not improving decision making and how to fix it. White Paper. A Forte Consultancy Group Company. Retrieved from http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares–pro-active-reporting_EN.pdf. Accessed 3 Feb 2017.
Hill, M. K. (2004). Understanding environmental pollution. New York: Cambridge University Press.
Keeso, A. (2014). Big data and environmental sustainability: A conversation starter. Smith School Working Paper Series, Working Paper 14-04, December 2014. Retrieved from http://www.smithschool.ox.ac.uk/library/working-papers/workingpaper%2014-04.pdf. Accessed 3 Feb 2017.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
Raven, P. H., & Berg, L. R. (2006). Environment. Danvers: Wiley.
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
The Open University. (2007). T210 – Environmental control and public health. Milton Keynes: The Open University.
Vaughn, J. (2007). Environmental politics. Belmont: Thomson Wadsworth.
Vigil, K. M. (2003). Clean water: An introduction to water quality and water pollution control. Oregon State University Press.
Withgott, J., & Brennan, S. (2011). Environment. San Francisco: Pearson.
Predictive Analytics

Anamaria Berea
Center for Complexity in Business, University of Maryland, College Park, MD, USA

Predictive analytics is a methodology in data mining that uses a set of computational and statistical techniques to extract information from data in order to predict trends and behavior patterns. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown data, whether it lies in the past, present, or future (Siegel 2013). In other words, predictive analytics can be applied not only to time series data but to any data where there is some unknown that can be inferred; it is therefore also a powerful set of tools for inferring lost past data.

The core of predictive analytics in data science relies on capturing relationships between explanatory variables and predicted variables from past occurrences, and exploiting them to predict the unknown outcome. It is important to note, however, that the accuracy and usability of the results will depend greatly on the level of data analysis and the quality of the assumptions (Tukey 1977).

Predictive Analytics and Forecasting

Prediction, in general, is about forecasting the future or forecasting the unknown. In the past, before the scientific method was invented, predictions were based on astrological observations, witchcraft, foretelling, oral folklore, and, in general, on random observations or on associations of observations that happened at the same time. For example, if a conflict happened during an eclipse, then all eclipses would become "omens" of wars and, in general, of bad things. For a long period of our civilization, events were simply separated into two classes, good or bad: associations of events that preceded a major conflict, an epidemic, or a natural catastrophe were categorized as "bad" omens from then on, while associations of events that preceded peace, prosperity, and, in general, "good" major events were categorized as "good" omens or good predictors.

The idea that associations of events can be predictive of another event is actually at the core of some of the statistical methods we use today, such as correlation. The fallacy of using these methods metaphorically rather than in a quantitative, systematic analysis is that a single set of observations cannot be predictive of the future. That was true in the past and it remains true now, no matter how sophisticated the techniques we use. Predictive analytics uses a series of events or associations of events, and the longer the series,
the more informative the predictive analysis can be.

Unlike past good or bad omens, the results of predictive analytics are probabilistic. This means that predictive analytics informs the probability of a certain data point or the probability of a hypothesis being true.

While true prediction can be achieved only by clearly determining cause and effect in a set of data – a task that is usually hard to do – most predictive analytics techniques output probabilistic values and error term analyses.

Predictive Modeling Methods

Predictive modeling statistically shows the underlying relationships in historical, time series data in order to explain the data and make predictions, forecasts, or classifications about future events.

In general, predictive analytics uses a series of statistical and computational techniques in order to forecast future outcomes from past data. Traditionally, the most used method has been linear regression, but lately, with the emergence of the Big Data phenomenon, many other techniques have been developed to support businesses and forecasters, such as machine learning algorithms and probabilistic methods. Some classes of techniques include:

1. Applications of both linear and nonlinear mathematical programming algorithms, in which one objective is optimized within a set of constraints.
2. Advanced "neural" systems, which learn complex patterns from large datasets to predict the probability that a new individual will exhibit certain behaviors of business interest. Neural networks (also known as deep learning) are biologically inspired machine learning models that have been used to achieve recent record-breaking performance on speech recognition and visual object recognition.
3. Statistical techniques for analysis and pattern detection within large datasets.

Some techniques in predictive analytics are borrowed from traditional forecasting, such as moving averages, linear regression, logistic regression, probit regression, multinomial regression, time series models, and random forests. Other techniques, such as supervised learning, A|B testing, correlation ranking, and the k-nearest neighbor algorithm, are closer to machine learning and newer computational methods.

One of the most used techniques in predictive analytics today is supervised learning, or supervised segmentation (Provost and Fawcett 2013). Supervised segmentation includes the following steps:

– Selection of informative attributes – particularly in large datasets, the selection of the variables that are most likely to be informative for the goal of prediction is crucial; otherwise the prediction can yield spurious results.
– Information gain and entropy reduction – these two measures quantify the information carried by the selected attributes.
– Selection is done based on tree induction, which fundamentally means subsetting the data while searching for these informative attributes.
– The resulting tree-structured model partitions the space of all data into segments with different predicted values.

Supervised learning/segmentation has been popular because it is computationally and algorithmically simple.
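The following is a minimal, hedged sketch of these steps on a toy dataset. The records, the attribute names (age_group, income), and the churn labels are hypothetical and invented purely for illustration, and scikit-learn's DecisionTreeClassifier is used only as one convenient implementation of entropy-based tree induction, not as the specific tool discussed above.

```python
# Minimal sketch of supervised segmentation: rank attributes by information
# gain, then induce a small entropy-based decision tree over the same data.
# All data and attribute names below are hypothetical.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute_values):
    """Entropy reduction obtained by splitting `labels` on `attribute_values`."""
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical customer records: two candidate attributes and a target label.
age_group = ["young", "young", "middle", "senior", "senior", "middle", "young", "senior"]
income    = ["low",   "high",  "high",   "low",    "high",   "low",    "low",   "low"]
churned   = [1,        0,       0,        1,        0,        1,        1,       1]

# Attribute selection step: rank candidate attributes by information gain.
for name, values in [("age_group", age_group), ("income", income)]:
    print(name, round(information_gain(churned, values), 3))

# Tree induction step: the fitted tree partitions the records into segments
# with different predicted values.
X = np.array([[a == "young", a == "middle", a == "senior", i == "low"]
              for a, i in zip(age_group, income)], dtype=int)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, churned)
print(tree.predict(X))
```

Ranking candidate attributes by information gain, and splitting on them recursively, is what makes the segmentation "supervised": every split is evaluated against the target variable rather than against the structure of the data alone.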

predictive analytics in general, temporal data is In any predictive model or analytics technique,
required for the visual (spatial) data (Maciejewski the model can do only what the data is. In other
et al. 2011). This technique is particularly useful words, it is impossible to assess a predictive
in determining hotspots and areas of conflict with model of the heart disease incidence based on
a high dynamics. Some of the techniques used in the travel habits if no data regarding travel is
spatiotemporal analysis are kernel density estima- included.
tion for event distribution and seasonal trend Another important point to remember is that
decomposition by loess smoothing (Maciejewski the accuracy of the model also depends on the
et al. 2011). accuracy measure, and using multiple accuracy
measures is desired (i.e., mean squared error,
p-value, R-squared).
In general, any predictive analytic technique
Predictive Analytics Example
will output a dataset of created variables, called
predictive values, and the newly created dataset.
A good example for using predictive analytics is
Therefore a good technique for verification and
in healthcare. The problem of understanding the
validation of the methods used is to partition the
probability of an upcoming epidemics or the prob-
real dataset in two sets and use one to “train” the
ability of increase in incidence of various dis-
model and the second one to validate the model’s
eases, from flu to heart disease and cancer.
results.
For example, given a dataset that contains data
The success of the model ultimately depends
with respect to the past incidence of heart disease
on how real events will unfold and that is one of
in the USA, demographic data (gender, average
the reasons why longer time series are better at
income, age, etc.), exercise habits, eating habits,
informing predictive modeling and giving better
traveling habits, and other variables, a predictive
accuracy for the same set of techniques.
model would follow these steps:

1. Descriptive statistics – the first step in doing predictive analytics or building a predictive model is always an understanding of the data: what the variables represent, what ranges they fall into, how long the time series is, and so on – essentially a summary statistics of the data.
2. Data cleaning and treatment – it is very important to understand not only what the data contains but also what the data is missing.
3. Build the model(s) – in this step, several techniques can be explored, used comparatively, and, based on their results, the best one chosen. For example, both a general regression and a random forest can be used and compared, or a supervised segmentation based on demographics can be built and the resulting segments compared.
4. Performance and accuracy estimation – in this final step, the probabilities or measurements of forecasting accuracy are computed and interpreted.

In any predictive model or analytics technique, the model can only do what the data allows. In other words, it is impossible to assess a predictive model of heart disease incidence based on travel habits if no data regarding travel is included.
Another important point to remember is that the accuracy of the model also depends on the accuracy measure, and using multiple accuracy measures is desirable (e.g., mean squared error, p-value, R-squared).
In general, any predictive analytic technique will output a set of created variables, the predicted values, alongside the newly created dataset. Therefore a good technique for verification and validation of the methods used is to partition the real dataset into two sets and use one to “train” the model and the second one to validate the model’s results.
The success of the model ultimately depends on how real events unfold, which is one of the reasons why longer time series are better at informing predictive modeling and give better accuracy for the same set of techniques.
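Steps 3 and 4, together with the train/validate partition just described, can be sketched roughly as follows; the feature matrix and outcome below are invented stand-ins for the heart disease dataset, and the two models simply illustrate the regression-versus-random-forest comparison mentioned in step 3.

```python
# Hypothetical illustration of steps 3-4: compare two models on a held-out
# validation split and report more than one accuracy measure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # stand-in predictors (demographics, habits, ...)
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)  # stand-in incidence

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    print(type(model).__name__,
          "MSE:", round(mean_squared_error(y_valid, pred), 3),
          "R^2:", round(r2_score(y_valid, pred), 3))
```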
Predictive Analytics Fallacies

Cases of “spurious correlations” tend to be quite famous, such as the correlation between the number of people who died tangled in their bed sheets and the consumption of cheese per capita (http://www.tylervigen.com/spurious-correlations). These examples rest on the same fallacy as the “bad”/“good” omen one: observing two events at the same time does not imply that there is a causal relationship between them. Another classic example is to assume that correlations in general show a causal relationship; predictions based on correlation analyses alone therefore tend to fail often.
Some other fallacies of predictive analytics techniques include an insufficient analysis of the errors, relying on the p-value alone, relying on a Poisson distribution of the current data, and many more.
Predictive/Descriptive/Prescriptive

There is a clear distinction between descriptive vs. predictive vs. prescriptive analytics in Big Data (Shmueli 2010). Descriptive analytics shows how past or current data can be analyzed in order to determine patterns and extract meaningful observations out of the data. Predictive analytics is generally based on a model that is informed by descriptive analytics and gives various outcomes based on past data and the model. Prescriptive analytics is closely related to predictive analytics, as it takes the predictive values, puts them in a decision model, and informs the decision-makers about the future course of action (Shmueli and Koppius 2010).
Predictive Analytics Applications

In practice, predictive analytics can be applied to almost all disciplines – from predicting the failure of mechanical engines in the hard sciences, to predicting customers’ buying power in the social sciences and business (Gandomi and Haider 2015).
Predictive analytics is especially used in business and marketing forecasting. Hair Jr. (2007) shows the importance of predictive analytics for marketing and how it has become more relevant with the emergence of the Big Data phenomenon. He argues that survival in a knowledge-based economy is derived from the ability to convert information to knowledge. Data mining identifies and confirms relationships between explanatory and criterion variables. Predictive analytics uses confirmed relationships between variables to predict future outcomes. The predictions are most often values suggesting the likelihood that a particular behavior or event will take place in the future.
Hair also argues that, in the future, we can expect predictive analytics to be increasingly applied to databases in all fields and to revolutionize the ability to identify, understand, and predict future developments; data analysts will increasingly rely on mixed-data models that examine both structured (numbers) and unstructured (text and images) data; statistical tools will be more powerful and easier to use; future applications will be global and real time; demand for data analysts will increase, as will the need for students to learn data analysis methods; and scholarly researchers will need to improve their quantitative skills so the large amounts of information available can be used to create knowledge instead of information overload.

Predictive Modeling and Other Forecasting Techniques

Some predictive modeling techniques do not necessarily involve Big Data. For example, Bayesian networks and Bayesian inference methods, while they can be informed by Big Data, cannot be applied granularly to each data point because of the computational complexity that can arise from calculating thousands of conditional probability tables. But Bayesian models and inferences can certainly be used in combination with statistical predictive modeling techniques in order to bring the analysis closer to a cause-and-effect type of inference (Pearl 2009).
Another forecasting technique that does not rely on Big Data, but harnesses the power of crowds, is the prediction market. Just like Bayesian modeling, prediction markets can be used as a complement to Big Data and predictive modeling in order to augment the likelihood value of the predictions (Arrow et al. 2008).
References

Arrow, K. J., et al. (2008). The promise of prediction markets. Science, 320(5878), 877.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144.
Hair Jr., J. F. (2007). Knowledge creation in marketing: The role of predictive analytics. European Business Review, 19(4), 303–315.
Maciejewski, R., et al. (2011). Forecasting hotspots – A predictive analytics approach. IEEE Transactions on Visualization and Computer Graphics, 17(4), 440–453.
Pearl, J. (2009). Causality. Cambridge: Cambridge University Press.
Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. Sebastopol: O’Reilly Media.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.
Shmueli, G., & Koppius, O. (2010). Predictive analytics in information systems research. Robert H. Smith School Research Paper No. RHS 06-138.
Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley.
Tukey, J. (1977). Exploratory data analysis. New York: Addison-Wesley.

Privacy

Joanna Kulesza
Department of International Law and International Relations, University of Lodz, Lodz, Poland

Origins and Definition

Privacy is a universally recognized human right, subject to state protection from arbitrary or unlawful interference and unlawful attacks. The age of Big Data has brought it to the foreground of all technology-related debates, as the amount of information aggregated online, generated by various sources, together with the computing capabilities of modern networks, makes it easy to connect an individual to a particular piece of information about them, possibly causing a direct threat to their privacy. Yet international law grants every person the right to legal safeguards against any interference with one’s right or attacks upon it. The right to privacy covers, although is not limited to, one’s identity, integrity, intimacy, autonomy, communication, and sexuality and results in legal protection for one’s physical integrity; health information, including sexual orientation and gender; reputation; image; personal development; personal autonomy; and self-determination as well as family, home, and correspondence that are to be protected by the state from arbitrary or unlawful interferences by its organs or third parties. This catalogue is meant to remain an open one, enabling protection of ever new categories of data, such as geographical location data or arguably a “virtual personality.” As such, the term covers also information about an individual that is produced, generated, or needed for the purpose of rendering electronic services, such as a telephone, an IMEI or an IP number, an e-mail address, a website address, geolocation data, or search terms, as long as such information may be linked to an individual and allows for their identification. Privacy is not an absolute right and may be limited for reasons considered necessary in a democratic society. While there is no numerus clausus of such limitative grounds, they usually include reasons of state security and public order or the rights of others, such as their freedom of expression. States are free to introduce certain limitations on the individual privacy right as long as those are introduced by specific provisions of law, communicated to the individuals whose privacy is impacted, and applied solely when necessary in particular circumstances. This seemingly clear and precise concept suffers practical limitations, as states differ in their interpretations of the “necessity” of interference as well as the “specificity” of legal norms required and the scope of their application. As a consequence, the concept of privacy varies strongly throughout the world’s regions and countries. This is a particular challenge at the time of Big Data, as various national and regional perceptions of privacy need to be applied to the very same vast catalogue of online information.
This inconsistency in privacy perceptions results from the varied cultural and historical background of individual states as well as their differing political and economic situation. In countries recognizing values reflected in universal human rights treaties, including Europe, large parts of the Americas, and some Asian states, the right to privacy covers numerous elements of individual autonomy and is strongly protected by comprehensive legal safeguards. On the other hand, in rapidly developing countries, as well as in ones with an unstable political or economic situation, primarily located in Asia and Africa, the significance of the right to one’s private life subsides to the urgent needs of protecting life and personal or public security. As a consequence, the undisputed right to privacy, subject to numerous international treaties and rich international law jurisprudence, remains highly ambiguous, an object of conflicting interpretations by national authorities and their agents. This is one of the key challenges to finding the appropriate legal norms governing Big Data. In the unique Big Data environment, it is not only the traditional jurisdictional challenges, specific to all online interactions, that must be faced but also the tremendously varying perceptions of privacy, all finding their application to the vast and varied Big Data resource.

History

The idea of privacy rose simultaneously in various cultures. Contemporary authors most often refer to the works of American and European legal writers of the late nineteenth century to identify its origins. In US doctrine it was Warren and Brandeis who introduced in their writings “the right to be let alone,” a notion still often used to describe the essential content of privacy. Yet at roughly the same time, the German legal scholar Kohler published a paper covering a similar concept. It was also in the mid nineteenth century that French courts issued their first decisions protecting the right to private life. The right to privacy was introduced to grant individuals protection from undesired intrusions into their private affairs and home life, be it by nosy journalists or governmental agents. Initially the right was used to limit the rapidly evolving press industry; with time, as individual awareness and recognition of the right increased, the right to privacy primarily introduced limits on the individual information that state or local authorities may obtain and process. As any new idea, the right to privacy initially provoked much skepticism, yet by the mid twentieth century it became a necessary element of the rising human rights law. In the twenty-first century, it gained increased attention as a side effect of the growing, global information society. International online communications allowed for easy and cheap mass collection of data, creating the greatest threat to privacy so far. What followed was an eager debate on the limits of allowed privacy intrusions and the actions required from states aimed at safeguarding the rights of an individual. A satisfactory compromise is not easy to find, as states and communities view privacy differently, based on their history, culture, and mentality. The existing consensus on human rights seems to be the only starting point of a successful search for an effective privacy compromise, much needed in the era of transnational companies operating on Big Data. With the modern notions of “the right to be forgotten” or “data portability” referring to new facets of the right to protect one’s privacy, the Big Data phenomenon is one of the deciding factors of this ongoing evolution.

Privacy as a Human Right

The first document of international human rights law recognizing the right to privacy was the 1948 Universal Declaration on Human Rights (UDHR). The nonbinding political middle ground was not too difficult to find with the greatest horrors in human history of World War II still vivid in the minds of the world’s politicians and citizens alike. With horrid memories fading away and the Iron Curtain drawing a clear line between differing values and interests, a binding treaty on the very issue took almost 20 more years. Irreconcilable differences between communist and capitalist countries covered the scope and implementation of individual property, free speech, or privacy.
The eventual 1966 compromise in the form of the two fundamental human rights treaties, the International Covenant on Civil and Political Rights (ICCPR) and the International Covenant on Economic, Social and Cultural Rights (ICESCR), allowed for a conciliatory wording on hard law obligations for different categories of human rights, yet left the crucial details to future state practice and international jurisprudence. Among the rights to be put into detail by future state practice, international courts, and organizations was the right to privacy, established as a human right in Article 12 UDHR and Article 17 ICCPR. They both granted every individual freedom from “arbitrary interference” with their “privacy, family, home, or correspondence” as well as from any attacks upon their honor and reputation. While neither document defines “privacy,” the UN Human Rights Committee (HRC) has gone into much detail on delimitating its scope for the international community. All 168 ICCPR state parties are obliged per the Covenant to reflect HRC recommendations on the scope and enforcement of the treaty in general and privacy in particular. Over time the HRC produced detailed instruction on the scope of privacy protected by international law, discussing the thin line with state sovereignty, security, and surveillance. According to Article 12 UDHR and Article 17 ICCPR, privacy must be protected against “arbitrary or unlawful” intrusions or attacks through national laws and their enforcement. Those laws are to detail limits for any justified privacy invasions. Those limits of the individual privacy right are generally described in Article 29 para. 2, which allows for limitations of all human rights determined by law solely for the purpose of securing due recognition and respect for the rights and freedoms of others and of meeting the just requirements of morality, public order, and the general welfare in a democratic society. Although proposals for including a similar restraint in the text of the ICCPR were rejected by the negotiating parties, the right to privacy is not an absolute one. Following HRC guidelines and state practice surrounding the ICCPR, privacy may be restrained according to national laws which meet the general standards present in human rights law. The HRC confirmed this interpretation in its 1988 General Comment No. 16 as well as in recommendations and observations issued thereafter. Before Big Data became, among its other functions, an effective tool for mass surveillance, the HRC took a clear stand on the question of legally permissible limits of state inspection. It clearly stated that any surveillance, whether electronic or otherwise; interceptions of telephonic, telegraphic, and other forms of communication; wiretapping; and recording of conversations should be prohibited. It confirmed that individual limitations upon privacy must be assessed on a case-by-case basis and follow a detailed legal guideline, containing precise circumstances when privacy may be restricted by actions of local authorities or third parties. The HRC specified that even interference provided for by law should be in accordance with the provisions, aims, and objectives of the Covenant and reasonable in the particular circumstances, where “reasonable” means justified by those particular circumstances. Moreover, as per the HRC interpretation, states must take effective measures to guarantee that information about an individual’s life does not reach those not authorized by law to obtain, store, or process it. Those general guidelines are to be considered the international standard of protecting the human right to privacy and need to be respected regardless of the ease that Big Data services offer in connecting pieces of information available online with the individuals they relate to. Governments must ensure that Big Data is not used in a way that infringes individual privacy, regardless of the economic benefits and technical accessibility of Big Data services.
The provisions of Article 17 ICCPR resulted in similar stipulations in other international treaties. Those include Article 8 of the European Convention on Human Rights (ECHR), binding upon its 48 member states, and Article 11 of the American Convention on Human Rights (ACHR), agreed upon by 23 parties to the treaty.
The African Charter on Human and Peoples’ Rights (Banjul Charter) does not contain a specific stipulation regarding privacy, yet its provisions of Article 4 on the inviolability of human rights, Article 5 on human dignity, and Article 16 on the right to health serve as a basis to grant individuals within the jurisdiction of its 53 state parties the protection recognized by European or American states as inherent to the right of privacy. While no general human rights document exists among Australasian states, the general guidelines provided by the HRC and the work of the OECD are often reflected in national laws on privacy, personal rights, and personal data protection.

Privacy and Personal Data

The notion of personal data is closely related to that of privacy, yet their scopes differ. While personal data is a term relatively well defined, privacy is a broader and more ambiguous notion. As Kuner rightfully notes, the concept of privacy protection is a broader one than personal data regulations, where the latter provides a more detailed framework for individual claims. The influential Organization for Economic Co-operation and Development (OECD) Forum identified personal data as a component of the individual right to privacy, yet its 34 members differ on the effective methods of privacy protection and the extent to which such protection should be granted. Nevertheless, the nonbinding yet influential 1980 OECD Guidelines on the Protection of Privacy and Transborder Flow of Personal Data (Guidelines), together with their 2013 update, have so far encouraged data protection laws in over 100 countries, justifying the claim that, thanks to its detailed yet unified character and national enforceability, personal data protection is the most common and effective legal instrument safeguarding individual privacy. The Guidelines identify universal privacy protection through eight personal data processing principles. The definition of “personal data” contained in the Guidelines, usually directly adopted by national legislations, covers any information relating to an identified or identifiable individual, referred to as the “data subject.” The basic eight principles of privacy and data protection include (1) the collection limitation principle, (2) the data quality principle, (3) the individual participation principle, (4) the purpose specification principle, (5) the use limitation principle, (6) the security safeguards principle, (7) the openness principle, and (8) the accountability principle. They introduce certain obligations upon “data controllers,” that is, parties “who, according to domestic law, are competent to decide about the contents and use of personal data regardless of whether or not such data are collected, stored, processed or disseminated by that party or by an agent on their behalf.” They oblige data controllers to respect limits made by national laws pertaining to the collection of personal data. As already noted, this is of particular importance to Big Data operators, who must be aware of and abide by the varying national regimes. Personal data must be obtained by “lawful and fair” means and with the knowledge or consent of the data subject, unless otherwise provided by relevant law. Collecting or processing personal data may only be done when it is relevant to the purposes for which it will be used. Data must be accurate, complete, and up to date. The purposes for data collection ought to be specified no later than at the time of data collection. The use of the data must be limited to the purposes so identified. Data controllers, including those operating on Big Data, are not to disclose personal data at their disposal for purposes other than those initially specified and agreed upon by the data subject, unless such use or disclosure is permitted by law. All data processors are to show due diligence in protecting their collected data, by introducing reasonable security safeguards against the loss of or unauthorized access to data and its destruction, use, modification, or disclosure. This last obligation may prove particularly challenging for Big Data operators, with regard to the multiple locations of data storage and their continuous changeability. Consequently, each data subject enjoys the right to obtain information on the fact of the data controller having data relating to him, to have any such data communicated within a reasonable time, to be given reasons if a request for such information is denied, as well as to be able to challenge such denial and any data relating to him.
Further, each data subject enjoys the right to have their data erased, rectified, completed, or amended, and the data controller is to be held accountable under national laws for the lack of effective measures ensuring all of those personal data rights.
Therewith the OECD principles form a practical standard for the privacy protection represented in the human rights catalogue, applicable also to Big Data operators, given that the data in their disposal relates directly or indirectly to an individual. While their effectiveness may come to depend upon jurisdictional issues, the criteria for the identification of data subjects and the obligations of data processors are clear.

Privacy as a Personal Right

Privacy is recognized not only by international law treaties and international organizations but also by national laws, from constitutions to civil and criminal law codes and acts. Those regulations hold great practical significance, as they allow for direct remedies against privacy infractions from private parties, rather than those enacted by state authorities. Usually privacy is considered an element of the larger catalogue of personal rights and granted equal protection. It allows individuals whose privacy is under threat to have the threatening activity seized (e.g., infringing information deleted or a press release stopped). It also allows for pecuniary compensation or damages should a privacy infringement already take place.
Originating from German-language civil law doctrine, privacy protection may be well described by the theory of concentric spheres. Those include the public, private, and intimate spheres, with different degrees of protection from interference granted to each of them. The strongest protection is granted to intimate information; activities falling within the public sphere are not protected by law and may be freely collected and used. All individual information may be qualified as falling into one of the three spheres, with the activities performed in the public sphere being those performed by an individual as a part of their public or professional duties and obligations and deprived of privacy protection. This sphere would differ per individual, with “public figures” enjoying the least protection. An assessment of the limits of one’s privacy when compared with their public function would always be made on a case-by-case basis. Any information that may not be considered public is to be granted privacy protection and may only be collected or processed with permission granted by the one it concerns. The need to obtain consent from the individual the information concerns is also required for the intimate sphere, where the protection is even stronger. Some authors argue that information on one’s health, religious beliefs, sexual orientation, or history should only be distributed in pursuit of a legitimate aim, even when permission for its distribution was granted by the one it concerns.
With the civil law scheme for privacy protection being relatively simple, its practical application relies on a case-by-case basis and therefore may prove challenging and unpredictable in practice, especially when international court practice is at issue.

Privacy and Big Data

Big Data is a term that directly refers to information about individuals. It may be defined as gathering, compiling, and using large amounts of information enabling marketing or policy decisions. With large amounts of data being collected by international service providers, in particular ones offering telecommunication services, such as Internet access, the scope of data they may collect and the use to which they may put it is of crucial concern to all their clients but also to their competitors and to state authorities interested in participating in this valuable resource. In the light of the analysis presented above, any information falling within the scope of Big Data that is collected and processed while rendering online services may be considered subject to privacy protection when it refers to an identified or identifiable individual, that is, a physical person who may either be directly identified or whose identification is possible.
When determining whether a particular category or piece of information constitutes private data, account must be taken of the means likely reasonably to be used by any person to identify the individual, in particular the costs, time, and labor needed to identify such a person. When private information has been identified, the procedures required for privacy protection described above ought to be applied by the entities dealing with such information. In particular, the guidelines described by the HRC in its comments and observations may serve as a guideline for handling personal data falling within the Big Data resource. Initiatives such as the Global Network Initiative, a bottom-up initiative of the biggest online service providers aimed at identifying and applying universal human rights standards for online services, or the UN Protect, Respect and Remedy Framework for business, defining the human rights obligations of private parties, present a useful tool for introducing enhanced privacy safeguards for all Big Data resources. With users’ growing awareness of the value of their privacy, company privacy policies prove to be a significant element of the marketing game, inciting Big Data operators to convince ever more users to choose their privacy-oriented services.

Summary

Privacy, recognized as a human right, requires certain precautions to be taken by state authorities and private business alike. Any information that may allow for the identification of an individual ought to be subjected to particular safeguards allowing for its collection or processing solely based on the consent of the individual in question or on a particular norm of law applicable in a case where the inherent privacy invasion is reasonable and necessary to achieve a justifiable aim. In no case may private information be collected or processed in bulk, with no judicial supervision or without the consent of the individual it refers to. Big Data offers new possibilities for collecting and processing personal data. When designing Big Data services or using the information they provide, all business entities must address the international standards of privacy protection, as identified by international organizations and good business practice.

Cross-References

▶ Data Processing
▶ Data Profiling
▶ Data Quality Management
▶ Data Security
▶ Data Security Management

Further Readings

Kuner, C. (2009). An international legal framework for data protection: Issues and prospects. Computer Law and Security Review, 25(263), 307.
Kuner, C. (2013). Transborder data flows and data privacy law. Oxford: Oxford University Press.
UN Human Rights Committee. General Comment No. 16: Article 17 (Right to Privacy), The Right to Respect of Privacy, Family, Home and Correspondence, and Protection of Honour and Reputation. 8 Apr 1988. http://www.refworld.org/docid/453883f922.html.
UN Human Rights Council. Report of the Special Rapporteur on the promotion and protection of human rights and fundamental freedoms while countering terrorism, Martin Scheinin. U.N. Doc. A/HRC/13/37.
Warren, S. D., & Brandeis, L. D. (1890). The right to privacy. Harvard Law Review, 4, 193.
Weber, R. H. (2013). Transborder data transfers: Concepts, regulatory approaches and new legislative initiatives. International Data Privacy Law, v. 1/3–4.

Psychology

Daniel N. Cassenti and Katherine R. Gamble
U.S. Army Research Laboratory, Adelphi, MD, USA

Wikipedia introduces big data as “a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” The field of psychology is interested in big data in two ways: (1) at the level of the data, that is, how much data there are to be processed and understood, and (2) at the level of the user, or how the researcher analyzes and interprets the data. Thus, psychology can serve the role of helping to improve how researchers analyze big data and provide data sets that can be examined or analyzed using big data principles and tools.

Psychology

Psychology may be divided into two overarching areas: clinical psychology, with a focus on individuals, and the fields of experimental psychology, with foci on the more general characteristics that apply to the majority of people. Allen Newell classifies the fields of experimental psychology by time scale, to include biological at the smallest time scale, cognitive (the study of mental processes) at the scale of hundreds of milliseconds to tens of seconds, rational (the study of decision making and problem solving) at minutes to hours, and social at days to months. The cognitive, rational, and social bands can all be related to big data in terms of both the researcher analyzing data and the data itself. Here, we describe how psychological principles can be applied to the researcher to handle data in the cognitive and rational fields and demonstrate how psychological data in the social field can be big data.

Cognitive and Rational Fields

One of the greatest challenges of big data is its analysis. The principles of cognitive and rational psychology can be applied to improve how the big data researcher evaluates and makes decisions about the data. The first step in analysis is attention to the data, which often involves filtering out irrelevant from relevant data. While many software programs can provide automated filtering of data, the researcher must still give attention and critical analysis to the data as a check on the automated system, which operates within rigid criteria preset by the researcher that are not sensitive to the context of the data. At this early level of analysis, the researcher’s perception of the data, ability to attend and retain attention, and working memory capacity (i.e., the quantity of information that an individual can store while working on a task) are all important to success.
That is, the researcher must efficiently process and highlight the most important information, stay attentive enough to do this for a long period of time, and, because of limited working memory capacity and a lot of data to be processed, effectively manage the data, such as by chunking information, so that it is easier to filter and store in memory.
The goal of analysis is to lead to decisions or conclusions about data, the scope of the rational field. If all principles from cognitive psychology have been applied correctly (e.g., only the most relevant data are presented and only the most useful information stored in memory), tenets of rational psychology must next be applied to make good decisions about the data. Decision making may be aided by programming the analysis software to present decision options to the researcher. For example, in examining educational outcomes of children who come from low-income families, the researcher’s options may be to include children who are or are not part of a state-sponsored program, or are of a certain race. Statistical software could be designed to present these options to the researcher, which may reveal results or relationships in the data that the researcher may not have otherwise discovered. Option presentation may not be enough, however, as researchers must also be aware of the consequences of their decisions. One possible solution is the implementation of associate systems for big data software. An associate system is automation that attempts to advise the user, in this case to aid decision making. Because these systems are knowledge based, they have situational awareness and are able to recommend courses of action and the reasoning behind those recommendations. Associate systems do not make decisions themselves, but instead work semiautonomously, with the user imposing supervisory control. If the researcher deems recommended options to be unsuitable, then the associate system can present what it judges to be the next best options.

Social Field

The field of social psychology provides good examples of methods of analysis that can be used with big data, especially with big data sets that include groups of individuals and their relationships with one another, the scope of social psychology. The field of social psychology is able to ask questions and collect large amounts of data that can be examined and understood using these big data-type analyses, including, but not limited to, the following types of analyses.
Linguistic analysis offers the ability to process transcripts of communications between individuals, or to groups as in social media applications, such as tweets from a Twitter data set. A linguistic analysis may be applied in a multitude of ways, including analyzing the qualities of a relationship between individuals or how communications to groups may differ based on the group. These analyses can determine qualities of these communications, which may include trust, attribution of personal characteristics, or dependencies, among other considerations.
Sentiment analysis is a type of linguistic analysis that takes communications and produces ratings of the emotional valence individuals direct to the topic. This is of value for social data researchers who must find those with whom alliances may be formed and whom to avoid. A famous example is the strategy shift taken by United States Armed Forces commanders to ally with Iraqi residents. Sentiment analysis indicated which residential leaders would give their cooperation for short-term goals of mutual interest.
The final social psychological big data analysis technique under consideration here is social-network analysis or SNA. With SNA, special emphasis is not on the words spoken, as in linguistic and sentiment analysis, but on the directionality and frequency of communication between individuals. SNA creates a type of network map that uses nodes and ties to connect members of groups or organizations to one another. This visualization tool allows a researcher to see how individuals are connected to one another, with factors like the thickness of a line indicating frequency of communication, or the number of lines coming from a node indicating the number of nodes to which it is connected.
Psychological Data as Big Data

Each field of psychology potentially includes big data sets for analysis by a psychological researcher. Traditionally, psychologists have collected data on a smaller scale using controlled methods and manipulations analyzable with traditional statistical analyses. However, with the advent of big data principles and analysis techniques, psychologists can expand the scope of data collection to examine larger data sets that may lead to new and interesting discoveries. The following section discusses each of the aforementioned fields.
In clinical psychology, big data may be used to diagnose an individual. In understanding an individual or attempting to make a diagnosis, the person’s writings and interview transcripts may be analyzed in order to provide insight into his or her state of mind. To thoroughly analyze and treat a person, a clinical psychologist’s most valuable tool may be this type of big data set.
Biological psychology includes the subfields of psychophysiology and neuropsychology. Psychophysiological data may include hormone collection (typically salivary), blood flow, heart rate, skin conductance, and other physiological responses. Neuropsychology includes multiple technologies for collecting information about the brain, including electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and functional near infrared spectroscopy (fNIRS), among other lesser used technologies. Measures in biological psychology are generally taken near-continuously across a certain time range, so much of the data collected in this field could be considered big data.
Cognitive psychology covers all mental processing. That is, this field includes the initiation of mental processing from internal or external stimuli (e.g., seeing a stoplight turn yellow), the actual processing of this information (e.g., understanding that a yellow light means to slow down), and the initiation of an action (e.g., knowing that you must step on the brake in order to slow your car). For each action that we take, and even actions that may be involuntary (e.g., turning your head toward an approaching police siren as you begin to slow your car), cognitive processing must take place at the levels of perception, information processing, and initiation of action. Therefore, any behavior or thought process that is measured in cognitive psychology will yield a large amount of data for even the simplest of these, such that complex processes or behaviors measured for their cognitive process will yield data sets of the magnitude of big data.
Another clear case of a field with big data sets is rational psychology. In rational psychological paradigms, researchers who limit experimental participants to a predefined set of options often find themselves limiting their studies to the point of not capturing naturalistic rational processing. The rational psychologist instead typically confronts big data in the form of imaginative solutions to problems, and many forms of data, such as verbal protocols (i.e., transcripts of participants explaining their reasoning), require big data analysis techniques.
Finally, with the large time band under consideration, social psychologists must often consider days’ worth of data in their studies. One popular technique is to have participants use wearable technology to periodically remind them to record how they are doing, thinking, and feeling during the day. These types of studies lead to big data sets not just because of the frequency with which the data is collected, but also due to the enormous number of possible activities, thoughts, and feelings that participants may have experienced and recorded at each prompted time point.

The Unique Role of Psychology in Big Data

As described above, big data plays a large role in the field of psychology, and psychology can play an important role in how big data are analyzed and used. One aspect of this relationship is the necessity of the role of the psychology researcher on both ends of big data. That is, psychology is a theory-driven field, where data are collected in light of a set of hypotheses and analyzed as either supporting or rejecting those hypotheses.
Big data offers endless opportunities for exploration and discovery in other fields, such as creating word clouds from various forms of social media to determine what topics are trending, but solid psychological experiments are driven by a priori ideas, rather than data exploration. Thus, psychology is important to help big data researchers learn how to best process their data, and many types of psychological data can be big data, but the importance of theory, hypotheses, and the role of the researcher will always be integral in how psychology and big data interact.

Cross-References

▶ Artificial Intelligence
▶ Communications
▶ Decision Theory
▶ Social Media
▶ Social Network Analysis
▶ Social Sciences
▶ Spatial Analytics
▶ Visualization

Further Readings

Cowan, N. (2004). Working memory capacity. New York: Psychology Press.
Endsley, M. R. (2000). Theoretical underpinnings of situation awareness: A critical review. In Situation awareness analysis and measurement. Mahwah, NJ: Lawrence Erlbaum Associates.
Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis. Cambridge, MA: MIT Press.
Lewis, T. G. (2011). Network science: Theory and applications. Hoboken: Wiley.
Neisser, U. (1976). Cognition and reality: Principles and implications of cognitive psychology. San Francisco: W.H. Freeman and Co.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
Newell, A., & Simon, H. (1972). Human problem solving. Englewood Cliffs: Prentice-Hall.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–35.
Pentland, A. (2014). Social physics: How good ideas spread – The lessons from a new science. New York: Penguin Press.
Yarkoni, T. (2012). Psychoinformatics: New horizons at the interface of the psychological and computing sciences. Current Directions in Psychological Science, 21(6), 391–397.

Regression

Qinghua Yang
Department of Communication Studies, Texas Christian University, Fort Worth, TX, USA

Regression is a statistical tool to estimate the relationship(s) between a dependent variable (y, or outcome variable) and one or more independent variables (x, or predicting variables; Fox 2008). More specifically, regression analysis helps in understanding the variation in a dependent variable using the variation in independent variables, with other confounding variable(s) controlled. Regression analysis is widely used to make predictions and to estimate the conditional expectation of the dependent variable given the independent variables, where its use overlaps with the field of machine learning. Figure 1 shows how crime rate is related to residents’ poverty level and predicts the crime rate of a specific community. We know from this regression that there is a positive linear relationship between the crime rate (y axis) and residents’ poverty level (x axis). Given the poverty index of a specific community, we are able to make a prediction of the crime rate in that area.

Regression, Figure 1  Linear regression of crime rate and residents’ poverty level (y axis: crime; x axis: poverty_sqrt)

Linear Regression

The estimation target of regression is a function that predicts the dependent variable based upon values of the independent variables, which is called the regression function. For simple linear regressions, the function can be represented as y_i = a + b*x_i + e_i. The function of multiple linear regression is y_i = b_0 + b_1*x_1 + b_2*x_2 + ... + b_k*x_k + e_i, where k is the number of independent variables. Regression estimation using ordinary least squares (OLS) selects the line with the lowest total sum of squared residuals. The proportion of the total variation (SST) that is explained by the regression (SSR) is known as the coefficient of determination, often referred to as R², a value ranging between 0 and 1, with a higher value indicating a better regression model (Keith 2015).
Nonlinear Regression

In the real world, there are many more nonlinear functions than linear ones. For example, the relationship between x and y can be fitted with a quadratic function, as shown in Figure 2. There are in general two ways to deal with nonlinear models. First, nonlinear models can be approximated with linear functions. Both nonlinear functions in Figure 2 can be approximated by two linear functions according to the slope: the first linear regression function runs from the beginning of the semester to the final exam, and the second function runs from the final to the end of the semester.
Similarly, cubic, quartic, and more complicated regressions can also be approximated with a sequence of linear functions. However, analyzing nonlinear models in this way can produce large residuals and leave considerable variance unexplained. The second way, considered better than the first in this respect, is to include nonlinear terms in the regression function, as in ŷ = a + b_1*x + b_2*x². As the graph of a quadratic function is a parabola, if b_2 < 0 the parabola opens downward, and if b_2 > 0 the parabola opens upward. Instead of having x² in the model, the nonlinearity can also be represented in many other ways, such as √x, ln(x), sin(x), cos(x), and so on. However, which nonlinear model to choose should be based on both theory or former research and the R².

Regression, Figure 2  Nonlinear regression models (two panels, Anxiety and Confidence in the Subject, plotted from semester begins through mid-term and final to semester ends)
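The second approach can be sketched with a quadratic term included directly in the design matrix; the data below are synthetic, and the coefficients are recovered by ordinary least squares.

```python
# Sketch of fitting y-hat = a + b1*x + b2*x^2 by including x^2 as a column.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 120)
y = 3 + 2.5 * x - 0.2 * x**2 + rng.normal(scale=0.8, size=x.size)

# Design matrix with columns [1, x, x^2]; least squares gives a, b1, b2.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a={a:.2f}, b1={b1:.2f}, b2={b2:.2f}",
      "(parabola opens downward)" if b2 < 0 else "(parabola opens upward)")
```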
Logistic Regression

When the outcome variable is dichotomous (e.g., yes/no, success/failure, survived/died, accept/reject), logistic regression is applied to make predictions of the outcome variable. In logistic regression, we predict the odds or log-odds (logit) that a certain condition will or will not happen. Odds range from 0 to infinity and are a ratio of the chance of an event (p) divided by the chance of the event not happening, that is, p/(1 − p). Log-odds (logits) are transformed odds, ln[p/(1 − p)], and range from negative to positive infinity. The relationship predicting probability using x follows an S-shaped curve, as shown in Figure 3; this shape is called a “logistic curve.” It is defined as p(y_i) = exp(b_0 + b_1*x_i + e_i) / (1 + exp(b_0 + b_1*x_i + e_i)). In this logistic regression, the value predicted by the equation is a log-odds or logit. This means that when we run a logistic regression and get coefficients, the values the equation produces are logits. Odds are computed as exp(logit), and probability is computed as exp(logit) / (1 + exp(logit)).

Regression, Figure 3  Logistic regression models (y axis: pass, 0.00–1.00; x axis: X, 0–10)
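The conversion from the logits a fitted model produces to odds and probabilities follows directly from the formulas above; the coefficients in this sketch are invented for illustration.

```python
# Converting logistic regression output: the model produces logits, which can
# be turned into odds and probabilities. Coefficient values are examples only.
import math

def logit_to_probability(logit: float) -> float:
    odds = math.exp(logit)       # odds = exp(logit)
    return odds / (1 + odds)     # probability = exp(logit) / (1 + exp(logit))

b0, b1 = -2.0, 0.8               # illustrative coefficients
for x in (0, 2, 5):
    logit = b0 + b1 * x
    print(f"x={x}: logit={logit:+.2f}, p={logit_to_probability(logit):.3f}")
```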
Another model used to predict a binary outcome is the probit model, with the difference between logistic and probit models lying in the assumption about the distribution of errors: while the logit model assumes a standard logistic distribution of errors, the probit model assumes a normal distribution of errors (Chumney & Simpson 2006). Despite the difference in assumption, the predictive results using these two models are very similar. When the outcome variable has multiple categories, multinomial logistic regression or ordered logistic regression should be implemented, depending on whether the dependent variable is nominal or ordinal.
Regression in Big Data

Due to the advanced technologies that have been increasingly used in data collection and the vast amount of user-generated data, the amount of data will continue to increase at a rapid pace, along with a growing accumulation of scholarly works. The explosion of knowledge makes big data one of the new research frontiers, with an extensive number of application areas affected by big data, such as public health, social science, finance, geography, and so on. The high volume and complex structure of big data bring statisticians both opportunities and challenges. Generally speaking, big data is a collection of large-scale and complex data sets that are difficult to process and analyze using traditional data analytic tools. Inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, including supervised and unsupervised statistical learning (James, Witten, Hastie, & Tibshirani, 2013). Supervised statistical learning refers to a set of approaches for estimating the function f based on the observed data points, to understand the relationship between Y and X = (X_1, X_2, ..., X_p), which can be represented as Y = f(X) + e. Since the two main purposes of the estimation are to make predictions and inferences, which regression modeling is widely used for, many classical statistical learning methods use regression models, such as linear, nonlinear, and logistic regression, with the selection of the specific regression model based on the research question and data structure.
In contrast, for unsupervised statistical learning, there is no response variable to predict for every observation that can supervise our analysis (James et al. 2013). Additionally, more methods have been developed recently, such as Bayesian and Markov chain Monte Carlo (MCMC) approaches. The Bayesian approach, distinct from the frequentist approach, treats model parameters as random and models them via distributions. MCMC refers to statistical sampling investigations that involve sample data generation to obtain empirical sampling distributions, based on constructing a Markov chain that has the desired distribution (Bandalos & Leite 2013).

Cross-References

▶ Data Mining Algorithms
▶ Machine Learning
▶ Statistical Analysis
▶ Statistics

Further Readings

Bandalos, D. L., & Leite, W. (2013). Use of Monte Carlo studies in structural equation modeling research. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 625–666). Charlotte, NC: Information Age Publishing.
Chumney, E. C., & Simpson, K. N. (2006). Methods and designs for outcomes research. Bethesda, MD: ASHP.
Fox, J. (2008). Applied regression analysis and generalized linear models. Thousand Oaks, CA: Sage.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). New York, NY: Springer.
Keith, T. Z. (2015). Multiple regression and beyond: An introduction to multiple regression and structural equation modeling. New York, NY: Routledge.

Religion

Matthew Pittman and Kim Sheehan
School of Journalism & Communication, University of Oregon, Eugene, OR, USA

In his work on the changing nature of religion in our modern mediated age, Stewart Hoover notes that religion today is much more commodified, therapeutic, public, and personalized than it has been for most of history. He also notes that, because media are coming together to create an environment in which our personal projects of identity, meaning, and self are worked out, religion and media are actually converging. As more people around the globe obtain devices capable of accessing the Internet, their everyday religious practices are leaving digital traces for interested companies and institutions to pick up on. The age of big data is usually thought to affect institutions like education, mass media, or law, but religion is undergoing dynamic shifts as well.
Though religious practice was thought to be in decline through the end of the twentieth century, there has been a resurgence of interest through the beginning of the twenty-first. A Google Ngram viewer (which tracks a word’s frequency in published books and general literature over time) shows that “data” surpassed “God” for the first time in 1973. Yet, by about 2004, God once again overtook data (and its synonym “information”), indicating that despite incredible scientific and technological advances, people still wrestle with spiritual or existential matters.
While the term “big data” seems commonplace now, it is a fairly recent development. Several researchers and authors claim to have coined the term, but its modern usage took off in the mid-1990s and only really became mainstream in 2012, when the White House and the Davos World Economic Forum identified it as a serious issue worth tackling. Big data is a broad term, but it generally has two main precepts: humans are now producing information at an unprecedented rate, and new methods of analysis are needed to make sense of that information. Religious practices are changing in both of these areas. Faith-based activity is creating new data streams even as churches, temples, and mosques are figuring out what to do with all that data. On an institutional level, the age of big data is giving religious groups new ways to learn about the individuals who adhere to their teachings. On an individual level, technology is changing how people across the globe learn about, discuss, and practice their faiths.

Institutional Religion

It is now common for religious institutions to use digital technology to reach their believers.
Like any other business or group that needs members to survive, most seek to utilize or leverage new devices and trends into opportunities to strengthen existing members or recruit potential new ones. Of course, depending on a religion’s stance toward culture, they may (like the Amish) eschew some technology. However, for most mosques, churches, and synagogues, it has become standard for each to have its own website or Facebook page. Email newsletters and Twitter accounts have replaced traditional newsletters and event reminders.
New opportunities are constantly emerging that create novel space for leaders to engage practitioners. Religious leaders can communicate directly with followers through social media, adding a personal touch to digital messages, which can sometimes feel distant or cold. Rabbi Shmuley Boteach, “America’s Rabbi,” has 29 best-selling books but often communicates daily through his Twitter account, which has over a hundred thousand followers. On the flip side, people can thoroughly vet potential religious leaders or organizations before committing to them. If concerned that a particular group’s ideology might not align with one’s own, a quick Internet search or trip to the group’s website should identify any potential conflicts. In this way, providing data about their identity and beliefs helps religious groups differentiate themselves.
In a sense, big data makes it possible for religious institutions to function more like – and take their cues from – commercial enterprises. Tracking streams of information about its followers can help religious groups be more in tune with the wants and needs of these “customers.” Some religious organizations implement the retail practice of “tweets and seats”: by ensuring that members always have available places to sit, rest, or hang out, and that wifi (wireless Internet connectivity) is always accessible, they hope to keep people present and engaged. Not all congregations embrace this change, but the clear cultural trend is toward ubiquitous smart phone connectivity. Religious groups that take advantage of this may provide several benefits to their followers: members could immediately identify and download any worship music being played; interested members could look up information about a local religious leader; members could sign up for events and groups as they are announced in the service; or those using online scripture software could access texts and take notes. These are just a few possibilities.
There are other ways religious groups can harness big data. Some churches have begun analyzing liturgies to assess and track length and content over time. For example, a dip in attendance during a given month might be linked to the sermons being 40% longer in that same time frame. Many churches make their budgets available to members for the sake of transparency, and in a digital age it is not difficult to create financial records that are clear and accessible to laypeople. Finally, learning from a congregant’s social media profiles and personal information, a church might remind a parishioner of her daughter’s upcoming birthday, the approaching deadline for an application to a family retreat, or when other congregants are attending a sporting event of which she is a fan. The risk of overstepping boundaries is real and, just like with Facebook or similar entities, privacy settings should be negotiated beforehand.
As with other commercial entities, religious institutions utilizing big data must learn to differentiate information they need from information they don’t. The sheer volume of available data makes distinguishing desired signal from irrelevant noise an increasingly important task. Random correlations may lead to false positive causation. A mosque may benefit from learning that members with the highest income are not actually its biggest givers, or from testing for a relationship between how far away its members live and how often they attend. Each religious group must determine how big data may or may not benefit its operation in any given endeavor, and the opportunities are growing.

Individual Religion

The everyday practice of religion is becoming easier to track as it increasingly utilizes digital technology.
behavior over time. Producers and advertisers use individuals is unprecedented. With over a billion
this data to promote products, events, or websites opens and/or uses, YouVersion statistically pro-
to people who might be interested. Currently com- ved several phenomena. The data demonstrated
panies like Amazon have more incentive than, the most frequent activity for users is looking up a
say, a local synagogue in keeping tabs on what favorite verse for encouragement. Despite the ste-
websites one visits, but the potential exists for reotype of shirtless men at football games, the
religious groups to access the same data that most popular verse was not John 3:16, but Philip-
Facebook, Amazon, Google, etc. already possess. pians 4:13: “I can do all things through him who
Culturally progressive religious groups antici- gives me strength.” Religious adherents have
pate mutually beneficial scenarios: they provide a always claimed that their faith gives them strength
data service that benefits personal spiritual and hope, but big data has now provided a brief
growth, and in turn the members generate fields insight into one concrete way this actually
of data that are of great value to the group. A Sikh happens.
coalition created the FlyRights app in 2012 to help The YouVersion data also reveal that people
with quick reporting of discriminatory TSA pro- used the bible to make a point in social media.
filing while travelling. The Muslim’s Prayer Verses were sought out and shared in an attempt to
Times app provides a compass, calendar (with support views on marriage equality, gender roles,
moon phases), and reminders for Muslims about or other divisive topics. Tracking how individuals
when and in what direction to pray. Apple’s claim to have their beliefs supported by scripture
app store has also had to ban other apps from may help religious leaders learn more about how
fringe religious groups or individuals for being these beliefs are formed, how they change over
too irreverent or offensive. time, and which interpretations of scripture are
The most popular religious app to date simply most influential. Finally, YouVersion data reveal
provides access to scripture. In 2008 LifeChurch. that Christian users like verses with simple mes-
tv launched “the Bible app,” also called sages, but chapters with profound ideas. Verses
YouVersion, and it currently has over 151 million are easier to memorize when they are short and
installations worldwide on smartphones and tab- unique, but when engaging in sustained reading,
lets. Users can access scripture (in over 90 differ- believers prefer chapters with more depth.
ent translations) while online or download it for Whether large data sets confirm suspicions or
access offline. An audio recording of each chapter shatter expectations, they continue to change the
being read aloud can also be downloaded for some way religion is practiced and understood.
of the translations. A user can search through
scripture by keyword, phrase, or book of the
Bible, or there are reading plans of varying levels
Numerous or Numinous
of intensity and access to related videos or
movies. A “live” option lets users search out
In the past, spiritual individuals had a few reli-
churches and events in surrounding geographic
gions to choose from, but the globalizing force of
areas, and a sharing option lets users promote the
technology has dramatically increased the avail-
app, post to social media what they have read, or
able options. While the three big monotheisms
share personal notes directly to friends. The digi-
(Christianity, Judaism, and Islam) and
tal highlights or notes made, even when using the
pan/polytheisms (Hinduism and Buddhism) are
app offline, will later upload to one’s account and
still the most popular, the Internet has made it
remain in one’s digital “bible” permanently.
possible for people of any faith, sect, or belief to
All this activity has generated copious amounts
find each other and validate their practice. Though
of data for YouVersion’s producers. In addition to
pluralism is not embraced in every culture, there is
using the data to improve their product they also
at least increasing awareness of the many ways
released it to the public. This kind of insight into
religion is practiced across the globe.
the personal religious behavior of so many
Additionally, more and more people are identifying themselves as "spiritual but not religious," indicating a desire to seek out spiritual experiences and questions outside the confines of a traditional religion. Thus, for discursive activities centered on religion, Daniel Stout advocates the use of another term in addition to "religion": numinous. Because "religious" can have negative or limiting connotations, looking for the "numinous" in cultural texts or trends can broaden the search for and dialogue about a given topic. To be numinous, something must meet several criteria: stir deep feeling (affect), spark belief (cognition), include ritual (behavior), and be done with fellow believers (community). This four-part framework is a helpful tool for identifying numinous activity in a society where it once might have been labeled "religious."

By this definition, the Internet (in general) and entertainment media (in particular) all contain numinous potential. The flexibility of the Internet makes it relevant to the needs of most; while the authority of some of its sources can be dubious, the ease of social networking and multi-mediated experiences provides all the elements of traditional religion (community, ritual, belief, feeling). Entertainment media, which produce at least as much data as – and may be indistinguishable from – religious media, emphasize universal truths through storytelling. The growing opportunities of big data (and its practical analysis) will continue to offer new possibilities for those who engage in numinous and religious behavior.

Cross-References

▶ Data Monetization
▶ Digitization
▶ Entertainment
▶ Internet
▶ Text Analytics

Further Readings

Campbell, H. A. (Ed.). (2012). Digital religion: Understanding religious practice in new media worlds. Abingdon: Routledge.
Hjarvard, S. (2008). The mediatization of religion: A theory of the media as agents of religious change. Northern Lights: Film & Media Studies Yearbook, 6(1), 9–26.
Hoover, S. M., & Lundby, K. (Eds.). (1997). Rethinking media, religion, and culture (Vol. 23). Thousand Oaks: Sage.
Kuruvilla, C. Religious mobile apps changing the faith-based landscape in America. Retrieved from http://www.nydailynews.com/news/national/gutenberg-moment-mobile-apps-changing-america-religious-landscape-article-1.1527004. Accessed Sep 2014.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
Taylor, B. (2008). Entertainment theology (cultural exegesis): New-edge spirituality in a digital democracy. Baker Books.
Risk Analysis

Jonathan Z. Bakdash
Human Research and Engineering Directorate, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD, USA

Definition and Introduction

Society is becoming increasingly interconnected, with networks linking people, the environment, information, and technology. This rising complexity is a challenge for risk analysis. Risk analysis is the identification and evaluation of the probability of an adverse outcome, its associated risk factors, and the potential impact if that outcome occurs. Successfully modeling risk within interdependent and complex systems requires access to considerably more data than traditional, simple risk models. The increasing availability of big data offers enormous promise for improving risk analysis through more detailed, comprehensive, faster, and accurate predictions of risks and their impacts than small data alone.

However, risk analysis is not purely a computational challenge that can be solved by more data. Big data does not eliminate the importance of data quality and modeling assumptions; it is not necessarily a replacement for small data. Furthermore, traditional risk analysis methods typically underestimate the probability and impact of risks (e.g., terrorist attacks, power failures, and natural disasters such as hurricanes) because normal data and independent observations are assumed. Traditional methods also typically do not account for cascading failures, which are not uncommon in complex systems. For example, a hurricane may cause a power failure, which in turn results in flooding.

The blessing and curse of risk analysis with big data are illustrated by the example of Google Flu Trends (GFT). Initially, it was highly successful in estimating flu rates in real time, but over time it became inaccurate due to external factors, lack of continued validation, and incorrect modeling assumptions.

Interdependencies

Globalization and advances in technology have led to highly networked and interdependent social, economic, political, natural, and technological systems (Helbing 2013). Strong interdependencies are potentially dangerous because small or gradual changes in a single system can cause cascading failures throughout multiple systems. For example, climate change is associated with food availability, food availability with economic disparity, and economic disparity with war. In interconnected systems, risks often spread quickly in a cascading process, so early detection and mitigation of risks is critical to stopping failures before they become uncontrollable. Helbing (2013) contends that big data is necessary to model risks in interconnected and complex systems: capturing interdependent dynamics and other properties of systems requires vast amounts of heterogeneous data over space and time.

Interdependencies are also critical to risk analysis because, even when risks are mitigated, they may still cause amplifying negative effects because of human risk perception. Perceived risk refers to the public, social, political, and economic impacts of unrealized (and realized) risks. An example of the impact of a perceived risk is the nuclear power accident at Three Mile Island. In this accident, minimal radiation was released, so the real risk was mitigated. Nevertheless, the near miss of a nuclear meltdown had immense social and political consequences that continue to negatively impact the nuclear power industry in the United States. The realized consequences of perceived risk mean that "real" risk should not necessarily be separated from "perceived" risk.

Data: Quality and Sources

Many of the analysis challenges for big data are not unique but are pertinent to the analysis of all data (Lazer et al. 2014). Regardless of the size of the dataset, it is important for analysts and policymakers to understand how, why, when, and where the data were collected and what the data contain and do not contain. Big data may be "poor data" because rules, causality, and outcomes are far less clear compared to small data.

More specifically, Vose (2008) describes characteristics of data quality for risk analysis. The highest quality data are obtained using a large sample of direct and independent measurements, collected and analyzed using established best practices over a long period of time and continually validated to correct errors. The second highest quality data use proxy measures with a widely used method for collection and analysis and some validation. Characteristics of decreasing data quality are a smaller sample of objective data, agreement among multiple experts, and a single expert opinion; quality is weakest with speculation. While there may be some situations in which expert opinions are the only data source, general findings indicate this type of data has poor predictive accuracy. Additional reasons to question experts are situations or systems with a large number of unknown factors and potentially catastrophic impacts of erroneous estimations. Big data can be an improvement over small data and over one or several expert opinions. However, volume is not necessarily the same as quality. Multidimensional aspects of data quality, whether the data are big or small, should always be considered.

Risk Analysis Methods

Vose (2008) explains the general techniques for conducting risk analysis. A common, descriptive method for risk analysis is Probability-Impact (P-I). P-I is the probability of a risk occurring multiplied by the impact of the risk if it materializes: Probability × Impact = Weighted Risk. All values may be either qualitative (e.g., low, medium, and high likelihood or severity) or quantitative (e.g., 10% or one million dollars). The Probability may be a single value or multiple values, such as a distribution of probabilities. The Impact may also be a single value or multiple values and is usually expressed as money. A similar weighted model to P-I, Threat × Vulnerability × Consequence = Risk, is frequently used in risk analysis. However, a significant weakness with P-I and related models with fixed values is that they tend to systematically underestimate the probability and impact of rare events that are interconnected, such as natural hazards (e.g., floods), protection of infrastructure (e.g., the power grid), and terrorist attacks. Nevertheless, the P-I method can be effective for quick risk assessments.
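As a minimal numeric illustration of the P-I calculation, the sketch below multiplies each risk's probability by its monetary impact. The risk names and values are invented for the example; they are not drawn from Vose (2008).

```python
# Hypothetical risks: annual probability and monetary impact (in dollars).
risks = {
    "river flood": {"probability": 0.02, "impact": 5_000_000},
    "power grid failure": {"probability": 0.10, "impact": 750_000},
    "data breach": {"probability": 0.25, "impact": 200_000},
}

# P-I: Probability x Impact = Weighted Risk (expected loss per year).
for name, r in risks.items():
    weighted = r["probability"] * r["impact"]
    print(f"{name}: weighted risk = ${weighted:,.0f} per year")
```

The Threat × Vulnerability × Consequence variant works the same way, with the probability split into separate threat and vulnerability factors. As noted above, fixed values of this kind tend to understate rare, interconnected events.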
Probabilistic Risk Assessment
P-I is a foundation for Probabilistic Risk Assessment (PRA), an evaluation of the probabilities for multiple potential risks and their respective impacts. The US Army's standardized risk matrix is an example of qualitative PRA (see Fig. 1; also see Level 5 of risk analysis below).

Risk Analysis, Fig. 1 Risk analysis (Source: Safety Risk Management, Pamphlet 385-30, Headquarters, Department of the Army, 2014, p. 8: www.apd.army.mil/pdffiles/p385_30.pdf)

The risk matrix is constructed by:
Step 1: Identifying possible hazards (i.e., potential risks)
Step 2: Estimating the probabilities and impacts of each risk and using the P-Is to categorize weighted risk
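A toy version of these two steps in code, in the spirit of the qualitative matrix in Fig. 1: the category labels, the scoring rule, and the hazards below are illustrative assumptions rather than the Army's actual table.

```python
# Illustrative qualitative categories (assumptions, not the published matrix).
PROBABILITY = ["unlikely", "occasional", "likely", "frequent"]
SEVERITY = ["negligible", "moderate", "critical", "catastrophic"]

def weighted_risk(probability: str, severity: str) -> str:
    """Step 2: combine a qualitative probability and severity into a risk level."""
    score = PROBABILITY.index(probability) + SEVERITY.index(severity)
    if score >= 5:
        return "extremely high"
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"

# Step 1 output: hazards identified for a hypothetical operation.
hazards = [("heat injury", "likely", "moderate"),
           ("vehicle rollover", "occasional", "critical")]
for name, p, s in hazards:
    print(name, "->", weighted_risk(p, s))
```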
Risk analysis informs risk reduction, but they are not one and the same. After the risk matrix is constructed, appropriate risk tolerance and mitigation strategies are considered. The last step is ongoing supervision and evaluation of risk as conditions and information change, updating the risk matrix as needed and providing feedback to improve the accuracy of future risk matrices.

Other widely used techniques include inferential statistical tests (e.g., regression) and the more comprehensive approach of what-if data simulations, which are also used in catastrophe modeling. Big data may improve the accuracy of probability and impact estimates, particularly the upper bounds in catastrophe modeling, leading to more accurate risk analysis.
From a statistical perspective, uncertainty and variability tend to be interchangeable. If uncertainty can be attributed to random variability, there is no distinction. However, in risk analysis, uncertainty can arise from incomplete knowledge (Paté-Cornell 1996). Uncertainty in risk may be due to a lack of data (particularly for rare events), not knowing relevant risks and/or impacts, and unknown interdependencies among risks and/or impacts.

Levels of Risk Analysis
There are six levels for understanding uncertainty, ranging from qualitative identification of risk factors (Level 0) to multiple risk curves constructed using different PRAs (Level 5) (Paté-Cornell 1996). Big data are relevant to Level 2 and beyond. The specific levels are as follows (adapted from Paté-Cornell 1996):

Level 0: Identification of a hazard or failure modes. Level 0 is primarily qualitative. For example, does exposure to a chemical increase the risk of cancer?
Level 1: Worst case. Level 1 is also qualitative, with no explicit probability. For example, if individuals are exposed to a cancer-causing chemical, what is the highest number that could develop cancer?
Level 2: Quasi-worst case (probabilistic upper bound). Level 2 introduces subjective estimation of probability based on reasonable expectation(s). Using the example from Level 1, this could be the 95th percentile for the number of individuals developing cancer.
Level 3: Best and central estimates. Rather than a worst case, Level 3 aims to model the most likely impact using central values (e.g., mean or median).
Level 4: Single-curve PRA. Previous levels were point estimates of risk; Level 4 is a type of PRA. For example, what is the number of individuals that will develop cancer across a probability distribution?
Level 5: Multiple-curve PRA. Level 5 has more than one probabilistic risk curve. Using the cancer risk example, different probabilities from distinct data can be represented using multiple curves, which are then combined using the average or another measure. A generic example of Level 5, for qualitative values, was illustrated with the above risk matrix. When implemented quantitatively, Level 5 is similar to what-if simulations in catastrophe modeling (see the sketch below).
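To make Levels 4 and 5 concrete, the following sketch (using NumPy; the rates, population size, and number of trials are invented) simulates one risk curve for a single estimate of the underlying probability and then combines it with a second curve from a distinct, hypothetical data source by simple averaging, one of the combination rules mentioned for Level 5.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def risk_curve(p_illness, exposed=10_000, trials=5_000):
    """Level 4: simulate a distribution (risk curve) of the number of affected individuals."""
    return np.sort(rng.binomial(n=exposed, p=p_illness, size=trials))

# Level 4: a single curve from one estimate of the underlying probability.
curve_a = risk_curve(0.002)
print("median:", np.median(curve_a), "95th percentile:", np.percentile(curve_a, 95))

# Level 5: a second curve from a distinct (hypothetical) data source,
# combined here by crude element-wise averaging of the sorted draws.
curve_b = risk_curve(0.004)
combined = (curve_a + curve_b) / 2
print("combined 95th percentile:", np.percentile(combined, 95))
```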
Catastrophe Modeling
Big data may improve risk analysis at Level 2 and above but may be particularly informative for modeling multiple risks at Level 5. Using catastrophe modeling, big data can allow for a more comprehensive analysis of the combinations of P-Is while taking into account interdependencies among systems. Catastrophe modeling involves running a large number of simulations to construct a landscape of risk probabilities and their impacts for events such as terrorist attacks, natural disasters, and economic failures. Insurance, finance, other industries, and governments are increasingly relying on big data to identify and mitigate interconnected risks using catastrophe modeling.

Beiser (2008) describes the high level of data detail in catastrophe modeling. For risk analysis of a terrorist attack in a particular location, interconnected variables taken into account may include the proximity to high-profile targets (e.g., government buildings, airports, and landmarks), the city, and details of the surrounding buildings (e.g., construction materials), as well as the potential size and impact of an attack. Simulations are run under different assumptions, including the likelihood of acquiring materials to carry out a particular type of attack (e.g., a conventional bomb versus a biological weapon) and the probability of detecting the acquisition of such materials. Big data is informative for the wide range of possible outcomes and their impacts in terms of projected loss of life and property damage. However, risk analysis methods are only as good as their assumptions, regardless of the amount of data.
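The role of interdependence in such what-if simulations can be seen in a deliberately small example. All probabilities below are invented; the point is only that modeling the cascade (storm, then power failure, then flooding) yields a noticeably higher flood risk than the independence assumption discussed in the next section.

```python
import random

random.seed(7)
TRIALS = 100_000

def simulate(dependent: bool) -> float:
    """Fraction of simulated years with major flooding."""
    floods = 0
    for _ in range(TRIALS):
        storm = random.random() < 0.05                 # hypothetical annual storm probability
        power_out = storm and random.random() < 0.60   # power failure far more likely in a storm
        if dependent:
            p_flood = 0.50 if power_out else 0.01      # pumps offline -> flooding much more likely
        else:
            p_flood = 0.01                             # independence assumption ignores the cascade
        floods += random.random() < p_flood
    return floods / TRIALS

print("assuming independence:", simulate(False))
print("with the cascade modeled:", simulate(True))
```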
Assumptions: Cascading Failures
Even with big data, risk analysis can be flawed due to inappropriate model assumptions. In the case of Hurricane Katrina, the model assumptions
for a Category 3 hurricane did not specify a large, slow-moving storm system with heavy rainfall, nor did they account for the interdependencies in infrastructure systems. This storm caused early loss of electrical power, so many of the pumping stations for levees could not operate. Consequently, water overflowed, causing breaches and resulting in widespread flooding. Because of cascading effects in interconnected systems, risk probabilities and impacts are generally far greater than in independent systems and therefore will be substantially underestimated when incorrectly treated as independent.

Right Then Wrong: Google Flu Trends

GFT is an example of both success and failure for risk analysis using big data. The information provided by an effective disease surveillance tool can help mitigate disease spread by reducing illnesses and fatalities. Initially, GFT was a successful real-time predictor of flu prevalence, but over time, it became inaccurate. This is because the model assumptions did not hold over time, validation with small data was not ongoing, and it lacked transparency. GFT used a data-mining approach to estimate real-time flu rates: hundreds of millions of possible models were tested to determine the best fit of millions of Google searches to traditional weekly surveillance data. The traditional weekly surveillance data consisted of the proportion of reported doctor visits for flu-like symptoms. At first, GFT was a timely and accurate predictor of flu prevalence, but it began to produce systematic overestimates, sometimes by a factor of two or greater compared with the gold standard of traditional surveillance data. The erroneous estimates from GFT resulted from a lack of continued validation (thus assuming relevant search terms only changed as a result of flu symptoms) and a lack of transparency in the data and algorithms used.

Lazer et al. (2014) called the inaccuracy of GFT a parable for big data, highlighting several key points. First, a key cause of the misestimates was that the algorithm assumed that influences on search patterns were the same over time and primarily driven by the onset of flu symptoms. In reality, searches were likely influenced by external events such as media reporting of a possible flu pandemic, seasonal increases in searches for cold symptoms that were similar to flu symptoms, and the introduction of suggestions in Google Search. Therefore, GFT wrongly assumed the data were stationary (i.e., no trends or changes in the mean and variance of the data over time). Second, Google did not provide sufficient information for understanding the analysis, such as all selected search terms and access to the raw data and algorithms. Third, big data is not necessarily a replacement for small data. Critically, the increased volume of data does not necessarily make it the highest quality source. Despite these issues, GFT was at the second highest level of data quality using criteria from Vose (2008) because GFT initially used:

1. Proxy measures: search terms originally correlated with local flu reports over a finite period of time
2. A common method: search terms were already used for Internet advertising, although their use for disease surveillance was novel (with limited validation)

In the case of GFT, the combination of big and small data, by continuously recalibrating the algorithms for the big data using the small (surveillance) data, would have been much more accurate than either alone. Moreover, big data can make powerful predictions that are impossible with small data alone. For example, GFT could provide estimates of flu prevalence in local geographic areas using detailed spatial and temporal information from searches; this would be impossible with only the aggregated traditional surveillance data.
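The recalibration idea can be read as a periodic refit of the big-data signal against the small surveillance series. The sketch below assumes a single multiplicative correction and invented weekly numbers; it is not the actual GFT algorithm.

```python
# Weekly values: a search-based estimate (big data) and lagged surveillance data (small data).
search_estimate = [2.0, 2.2, 2.6, 3.4, 4.4, 5.0, 5.8]    # % flu-like activity implied by searches
surveillance    = [1.9, 2.1, 2.4, 2.9, 3.5, None, None]  # doctor-visit data arrive with a delay

def recalibrate(estimates, truth):
    """Fit a single scaling factor on the weeks where surveillance data already exist."""
    pairs = [(e, t) for e, t in zip(estimates, truth) if t is not None]
    return sum(t for _, t in pairs) / sum(e for e, _ in pairs)

scale = recalibrate(search_estimate, surveillance)
nowcast = [round(e * scale, 2) for e in search_estimate[-2:]]
print("scaling factor:", round(scale, 3), "recalibrated nowcasts:", nowcast)
```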
Conclusions

Similar to GFT, many popular techniques for analyzing big data use data mining to automatically uncover hidden structures. Data mining techniques are valuable for identifying patterns in big data but should be interpreted with caution. The dimensions of big data do not obviate considerations of data quality, the need for continuous validation, and the importance of modeling assumptions (e.g., non-normality, non-stationarity, and non-independence). While big data has enormous potential to improve the accuracy and insights of risk analysis, particularly for interdependent systems, it is not necessarily a replacement for small data.

Cross-References

▶ Complex Networks
▶ Google Flu
▶ Military Operations (Counter-Intelligence and Counter-Terrorism)
▶ Small Data
▶ Statistical Analysis

References

Beiser, V. (2008). Pricing terrorism: Insurers gauge risks, costs. Wired. Permanent link: http://web.archive.org/save/_embed/http://www.wired.com/2008/06/pb-terrorism/
Helbing, D. (2013). Globally networked risks and how to respond. Nature, 497(7447), 51–59. doi:10.1038/nature12047.
Lazer, D. M., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google flu: Traps in big data analysis. Science, 343(6176), 1203–1206. doi:10.1126/science.1248506.
Paté-Cornell, M. E. (1996). Uncertainties in risk analysis: Six levels of treatment. Reliability Engineering & System Safety, 54(2), 95–111. doi:10.1016/S0951-8320(96)00067-1.
Vose, D. (2008). Risk analysis: A quantitative guide (3rd ed.). West Sussex: Wiley.
Upturn

Katherine Fink
Department of Media, Communications, and Visual Arts, Pace University, Pleasantville, NY, USA

Introduction

Upturn is a think tank that focuses on the impact of big data on civil rights. Founded in 2011 as Robinson + Yu, the organization announced a name change in 2015 and an expansion of its staff from two to five people. The firm's work addresses issues such as criminal justice, lending, voting, health, free expression, employment, and education. Upturn recommends policy changes with the aim of ensuring that institutions use technology in accordance with shared public values. The firm has published white papers, academic articles, and an online newsletter targeting policymakers and civil rights advocates.

Background

Principals of Upturn include experts in law, public policy, and software engineering. David Robinson was formerly the founding Associate Director of Princeton University's Center for Information Technology Policy, which conducts interdisciplinary research in computer science and public policy. Robinson holds a JD from Yale University's Law School and has reported for the Wall Street Journal and The American, an online magazine published by the American Enterprise Institute. Harlan Yu holds a PhD in Computer Science from Princeton University, where he developed software to make court records more accessible online. He has also advised the US Department of Labor on open government policies and analyzed privacy, advertising, and broadband access issues for Google. Aaron Rieke has a JD from the University of California Berkeley's Law School and has worked for the Federal Trade Commission and the Center for Democracy and Technology on data security and privacy issues.

Cofounders Robinson and Yu began their collaboration at Princeton University as researchers on government transparency and civic engagement. They were among four coauthors of the 2009 Yale Journal of Law & Technology article "Government Data and the Invisible Hand," which argued that the government should prioritize opening access to more of its data rather than creating websites. The article suggested that "private parties in a vibrant marketplace of engineering ideas" were better suited to develop websites that could help the public access government data. In 2012, Robinson and Yu coauthored the UCLA Law Review article "The New Ambiguity of 'Open Government,'" in which they argued that making data more available to the public did not by itself make government more accountable. The article recommended separating the notion of
open government from the technologies of open data in order to clarify the potential impacts of public policies on civic life.

Criminal Justice

Upturn has worked with the Leadership Conference, a coalition of civil rights and media justice organizations, to evaluate police department policies on the use of body-worn cameras. The organizations, noting increased interest in the use of such cameras following police-involved deaths in communities such as Ferguson (Missouri), New York City, and Baltimore, also cautioned that body-worn cameras could be used for surveillance, rather than protection, of vulnerable individuals. The organizations released a scorecard on the body-worn camera policies of 25 police departments in November 2015. The scorecard included criteria such as whether body-worn camera policies were publicly available, whether footage was available to people who file misconduct complaints, and whether the policies limited the use of biometric technologies to identify people in recordings.

Lending

Upturn has warned of the use of big data by predatory lenders to target vulnerable consumers. In a 2015 report, "Led Astray," Upturn explained how businesses used online lead generation to sell risky payday loans to desperate borrowers. In some cases, Upturn found that the companies violated laws against predatory lending. Upturn also found some lenders exposed their customers' sensitive financial data to identity thieves. The report recommended that Google, Bing, and other online platforms tighten restrictions on payday loan ads. It also called on the lending industry to promote best practices for online lead generation and for greater oversight of the industry by the Federal Trade Commission and Consumer Financial Protection Bureau.

Robinson + Yu researched the effects of the use of big data in credit scoring in a guide for policymakers titled "Knowing the Score." The guide endorsed the most widely used credit scoring methods, including FICO, while acknowledging concerns about disparities in scoring among racial groups. The guide concluded that the scoring methods themselves were not discriminatory, but that the disparities rather reflected other underlying societal inequalities. Still, the guide advocated some changes to credit scoring methods. One recommendation was to include "mainstream alternative data" such as utility bill payments in order to allow more people to build their credit files. The guide expressed reservations about "nontraditional" data sources, such as social network data and the rate at which users scroll through terms of service agreements. Robinson + Yu also called for more collaboration among financial advocates and the credit industry, since much of the data on credit scoring is proprietary. Finally, Robinson + Yu advocated that government regulators more actively investigate "marketing scores," which are used by businesses to target services to particular customers based on their financial health. The guide suggested that marketing scores appeared to be "just outside the scope" of the Fair Credit Reporting Act, which requires agencies to notify consumers when their credit files have been used against them.

Voting

Robinson + Yu partnered with Rock the Vote in 2013 in an effort to simplify online voter registration processes. The firm wrote a report, "Connected OVR: A Simple, Durable Approach to Online Voter Registration." At the time of the report, nearly 20 states had passed online voter registration laws. Robinson + Yu recommended that all states allow voters to check their registration statuses in real time. It also recommended that online registration systems offer alternatives to users who lack state identification, and that the systems be responsive to devices of various sizes
and operating systems. Robinson + Yu also suggested that states streamline and better coordinate their online registration efforts. Robinson + Yu recommended that states develop a simple, standardized platform for accepting voter data and allow third-party vendors (such as Rock the Vote) to design interfaces that would accept voter registrations. Outside vendors, the report suggested, could use experimental approaches to reach new groups of voters while still adhering to government registration requirements.

Big Data and Civil Rights

In 2014, Robinson + Yu advised The Leadership Conference on "Civil Rights Principles for the Era of Big Data." Signatories of the document included the American Civil Liberties Union, Free Press, and the NAACP. The document offered guidelines for developing technologies with social justice in mind. The principles included an end to "high-tech profiling" of people through the use of surveillance and sophisticated data-gathering techniques, which the signatories argued could lead to discrimination. Other principles included fairness in algorithmic decision-making; the preservation of core legal principles such as the right to privacy and freedom of association; individual control of personal data; and protections from data inaccuracies.

The "Civil Rights Principles" were cited by the White House in its report, "Big Data: Seizing Opportunities, Preserving Values." John Podesta, Counselor to President Barack Obama, cautioned in his introduction to the report that big data had the potential "to eclipse longstanding civil rights protections in how personal information is used." Following the White House report, Robinson + Yu elaborated upon four areas of concern in the white paper "Civil Rights, Big Data, and Our Algorithmic Future." The paper included four chapters: Financial Inclusion, Jobs, Criminal Justice, and Government Data Collection and Use.

The Financial Inclusion chapter argued that the era of big data could result in new barriers for low-income people. The automobile insurance company Progressive, for example, installed devices in customers' vehicles that allowed for the tracking of high-risk behaviors. Such behaviors included nighttime driving. Robinson + Yu argued that many lower-income workers commuted during nighttime hours and thus might have to pay higher rates, even if they had clean driving records. The report also argued that marketers used big data to develop extensive profiles of consumers based on their incomes, buying habits, and English-language proficiency, and such profiling could lead to predatory marketing and lending practices. Consumers often are not aware of what data has been collected about them and how that data is being used, since such information is considered to be proprietary. Robinson + Yu also suggested that credit scoring methods can disadvantage low-income people who lack extensive credit histories.

The report found that big data could impair job prospects in several ways. Employers used the federal government's E-Verify database, for example, to determine whether job applicants were eligible to work in the United States. The system could return errors if names had been entered into the database in different ways. Foreign-born workers and women have been disproportionately affected by such errors. Resolving errors can take weeks, and employers often lack the patience to wait. Other barriers to employment arise from the use of automated questionnaires some applicants must answer. Some employers use the questionnaires to assess which potential employees will likely stay in their jobs the longest. Some studies have suggested that longer commute times correlate to shorter-tenured workers. Robinson + Yu questioned whether asking the commuting question was fair, particularly since it could lead to discrimination against applicants who lived in lower-income areas. Finally, Robinson + Yu raised concerns about "subliminal" effects on employers who conducted web searches for job applicants. A Harvard researcher, they noted, found that Google algorithms were more likely to show advertisements for arrest
records in response to web searches of "black-identifying names" rather than "white-identifying names."

Robinson + Yu found that big data had changed approaches to criminal justice. Municipalities used big data in "predictive policing," or anti-crime efforts that targeted ex-convicts and victims of crimes as well as their personal networks. Robinson + Yu warned that these systems could lead to police making "guilt by association" mistakes, punishing people who had done nothing wrong. The report also called for greater transparency in law enforcement tactics that involved surveillance, such as the use of high-speed cameras that can capture images of vehicle license plates, and so-called stingray devices, which intercept phone calls by mimicking cell phone towers. Because of the secretive nature with which police departments procure and use these devices, the report contended that it was difficult to know whether they were being used appropriately. Robinson + Yu also noted that police departments were increasingly using body cameras and that early studies suggested the presence of the cameras could de-escalate tension during police interactions.

The Government Data Collection and Use chapter suggested that big data tools developed in the interest of national security were also being used domestically. The DEA, for example, worked closely with AT&T to develop a secret database of phone records for domestic criminal investigations. To shield the database's existence, agents avoided mentioning it by name in official documents. Robinson + Yu warned that an abundance of data and a lack of oversight could result in abuse, citing cases in which law enforcement workers used government data to stalk people they knew socially or romantically. The report also raised concerns about data collection by the US Census Bureau, which sought to lower the cost of its decennial count by collecting data from government records. Robinson + Yu cautioned that the cost-cutting measure could result in undercounting some populations.

Newsletter

Equal Future, Upturn's online newsletter, began in 2013 with support from the Ford Foundation. The newsletter has highlighted news stories related to social justice and technology. For instance, Equal Future has covered privacy issues related to the FBI's Next Generation Identification system, a massive database of biometric and other personal data. Other stories have included a legal dispute in which a district attorney forced Facebook to grant access to the contents of nearly 400 user accounts. Equal Future also wrote about an "unusually comprehensive and well-considered" California law that limited how technology vendors could use educational data. The law was passed in response to parental concerns about sensitive data that could compromise their children's privacy or limit their future educational and professional prospects.

Cross-References

▶ American Civil Liberties Union
▶ Biometrics
▶ Criminology and Law Enforcement
▶ Data-Driven Marketing
▶ e-commerce
▶ Federal Bureau of Investigation (FBI)
▶ Financial Services
▶ Google
▶ Governance
▶ Marketing/Advertising
▶ National Association for the Advancement of Colored People
▶ Online Advertising

Further Readings

Civil Rights Principles for the Era of Big Data. (2014, February). http://www.civilrights.org/press/2014/civil-rights-principles-big-data.html
Robinson, D., & Yu, H. (2014, October). Knowing the score: New data, underwriting, and marketing in the consumer credit marketplace. https://www.teamupturn.com/static/files/Knowing_the_Score_Oct_2014_v1_1.pdf
Robinson + Yu. (2013). Connected OVR: A simple, durable approach to online voter registration. Rock the Vote. http://www.issuelab.org/resource/connected_ovr_a_simple_durable_approach_to_online_voter_registration
Robinson, D., Yu, H., Zeller, W. P., & Felten, E. W. (2008). Government data and the invisible hand. Yale JL & Tech., 11, 159.
The Leadership Conference on Civil and Human Rights & Upturn. (2015, November). Police body worn cameras: A policy scorecard. https://www.bwcscorecard.org/static/pdfs/LCCHR_Upturn-BWC_Scorecard-v1.04.pdf
Upturn. (2014, September). Civil rights, big data, and our algorithmic future. https://bigdata.fairness.io/
Upturn. (2015, October). Led Astray: Online lead generation and payday loans. https://www.teamupturn.com/reports/2015/led-astray
Yu, H., & Robinson, D. G. (2012). The new ambiguity of 'open government'. UCLA L. Rev. Disc., 59, 178.
Salesforce

Jason Schmitt
Communication and Media, Clarkson University, Potsdam, NY, USA

Salesforce is a global enterprise software company, with Fortune 100 standing, most well known for its role in linking cloud computing to on-demand customer relationship management (CRM) products. Salesforce CRM and marketing products work together to make corporations more functional and ultimately more efficient. Founded in 1999 by Marc Benioff, Parker Harris, Dave Moellenhoff, and Frank Dominguez, Salesforce's varied platforms allow organizations to understand the consumer and the varied media conversations revolving around a business or brand. According to Forbes (April 2011), which conducted an assessment of businesses focused on value to shareholders, Marc Benioff of Salesforce was the most effective CEO in the world.

Salesforce provides a cloud-based centralized location to track data. Contacts, accounts, sales deals, and documents, as well as corporate messaging and the varied social media conversations, are all archived and retrievable within the Salesforce architecture from any web or mobile device without the use of any tangible software. Salesforce's quickly accessible information has an end goal to optimize profitability, revenue, and customer satisfaction by orienting the organization around the customer. This ability to track and message correctly highlights Salesforce's unique approach to a management practice known in software development as Scrum.

Scrum is an incremental software development framework for managing product development by a development team that works as a unit to reach a common goal. A key principle of Salesforce's Scrum direction is the recognition that during a project the customers can change their minds about what they want and need, often called churn, and that predictive understanding is hard to accomplish. As such, Salesforce takes an empirical approach in accepting that an organization's problem cannot be fully understood or defined and instead focuses on maximizing the team's ability to deliver messaging quickly and respond to emerging requirements.

Salesforce provides a fully customizable user interface for custom adoption and access for a diverse array of organization employees. Further, Salesforce has the ability to integrate into existing websites and allows for building additional web pages through the cloud-based service. Salesforce has the ability to link with Outlook and other mail clients to sync calendars and associate emails with the proper contact, and it provides the functionality to keep a record every time a contact or data entry is accessed or amended. Similarly, Salesforce keeps track of and organizes customer support issues and tracks them through to resolution, with the ability to escalate individual cases based on time sensitivity and the hierarchy of various clients. Extensive reporting is a value of Salesforce's offerings, which provides management an ability to track problem areas within an organization to a distinct department, area, or tangible product offering.

Salesforce has been a key leader in evolving marketing within this digital era through the use of specific marketing strategy aimed at creating and tracking marketing campaigns as well as measuring the success of online campaigns. These services are part of another growing segment available within Salesforce offerings in addition to the CRM packaging. Marketing departments leveraging Salesforce's Buddy Media, Radian6, or ExactTarget give users the ability to conduct demographic, regional, or national searches on keywords and themes across all social networks, which creates a more informed and accurate marketing direction. Further, Salesforce's dashboard, which is the main user interactive page, allows the creation of specific marketing-directed tasks that can be customized and shared for differing organizational roles or personal preferences.

The Salesforce marketing dashboard utilizes widgets, which are custom, reusable page elements that can be housed on individual users' pages. When a widget is created, it is added to a widgets view where all team members can easily be assigned access. This allows companies and organizations to share appropriate widgets defined and created to serve the target market or industry-specific groups. The shareability of widgets allows the most pertinent and useful tasks to be replicated by many users within a single organization.

Types of Widgets

The Salesforce Marketing Cloud "River of News" is a widget that allows users to scroll through specific search results, within all social media conversations, and utilizes user-defined keywords. Users have the ability to see original posts that were targeted from keyword searches and are provided a source link to the social media platform the post or message originated from. The "River of News" displays posts with many different priorities, such as newest post first, number of Twitter followers, social media platform used, physical location, and Klout score. This tool provides strong functionality for marketers or corporations wishing to hone in on, or take part in, industry, customer, or competitor conversations.

"Topic analysis" is a widget that is most often used to show share of voice, or the percentage of conversation happening about your brand or organization in relation to competitor brands. It is displayed as a pie chart and can be segmented multiple ways based on user configuration. Many use this feature as a quick visual assessment to see the conversations and interest revolving around specific initiatives or product launches.
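Share of voice, as described above, is simply each brand's fraction of the monitored conversation. The sketch below computes it from a hypothetical list of posts that have already been tagged with the brand they mention; the names and counts are invented, and this is not Salesforce's own code or API.

```python
from collections import Counter

# Hypothetical stream of social posts, each tagged with the brand it mentions.
posts = ["BrandA", "BrandA", "BrandB", "BrandA", "BrandC", "BrandB", "BrandA"]

mentions = Counter(posts)
total = sum(mentions.values())

# Share of voice: each brand's percentage of the monitored conversation.
for brand, count in mentions.most_common():
    print(f"{brand}: {count / total:.0%}")
```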
"Topic trends" is a widget that provides the ability to display the volume of conversation over time through graphs and charts. This feature can be used to understand macro day, week, or month data. This widget is useful when tracking crisis management or brand sentiment. With a line graph display, users can see spikes of activity and conversation around critical areas. Further, users can then click and hone in on spikes, which can open a "Conversation Cloud" or "River of News" that allows users to see the catalyst behind the spike of social media activity. This tool is used as a way to better understand reasons for increased interest or conversation across broad social media platforms.
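The spikes that the "Topic trends" widget surfaces can be approximated with a rolling baseline: flag any day whose post volume is well above the recent average. The daily counts, window, and threshold below are illustrative assumptions.

```python
# Hypothetical daily counts of posts mentioning an organization.
daily_counts = [120, 115, 130, 125, 118, 122, 410, 390, 140, 128]

window = 5        # days used for the rolling baseline
threshold = 2.0   # flag days with more than 2x the baseline average

for day, count in enumerate(daily_counts):
    history = daily_counts[max(0, day - window):day]
    if history and count > threshold * (sum(history) / len(history)):
        print(f"day {day}: spike ({count} posts) -- drill into the underlying conversations")
```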
Salesforce Uses

Salesforce offers wide-ranging data inference from its varied and evolving products. As CRM integration within the web and mobile has increased, the broad interest in better understanding and leveraging social media marketing campaigns has risen as well, allowing Salesforce a leading push within this industry's market share. The diverse array of businesses, nonprofits, municipalities, and other organizations that utilize Salesforce illustrates the importance of this software within daily business and marketing strategy. Salesforce clients include the American Red Cross, the City of San Francisco, Philadelphia's 311 system, Burberry, H&R Block, Volvo, and Wiley Publishing.

Salesforce Service Offerings

Salesforce is a leader among other CRM and media marketing-oriented companies such as Oracle, SAP, Microsoft Dynamics CRM, Sage CRM, Goldmine, Zoho, Nimble, Highrise, Insight.ly, and Hootsuite. Salesforce's offerings can be purchased individually or as a complete bundle. It offers current breakdowns of services and access in its varied options, which are referred to as Sales Cloud, Service Cloud, ExactTarget Marketing Cloud, Salesforce1 Platform, Chatter, and Work.com.

Sales Cloud allows businesses to track customer inquiries, escalate issues requiring specialized support, and monitor employee productivity. This product provides customer service teams with the answers to customers' questions and the ability to make the answers available on the web so consumers can find answers for themselves.

Service Cloud offers active and real-time information directed toward customer service. This service provides functionality such as the Agent Console, which offers relevant information about customers and their media profiles. This service also provides businesses the ability to give customers access to live agent web chats from the web to ensure customers can have access to information without a phone call.

ExactTarget Marketing Cloud focuses on creating closer relationships with customers through directed email campaigns, in-depth social marketing, data analytics, mobile campaigns, and marketing automation.

Salesforce1 Platform is geared toward mobile app creation. It gives users access to create and promote mobile apps, with over four million apps created utilizing this service.

Chatter is a social and collaborative function that relates to the Salesforce platform. Similar to Facebook and Twitter, Chatter allows users to form a community within their business that can be used for secure collaboration and knowledge sharing.

Work.com is a corporate performance management platform for sales representatives. The platform targets employee engagement in three areas: alignment of team and personal goals with business goals, motivation through public recognition, and real-time performance feedback.

Salesforce has more than 5,500 employees, revenues of approximately $1.7 billion, and a market value of approximately $17 billion. The company regularly conducts over 100 million transactions a day and has over 3 million subscribers.

Headquartered in San Francisco, California, Salesforce also maintains regional offices in Dublin, Singapore, and Tokyo, with secondary locations in Toronto, New York, London, Sydney, and San Mateo, California. Salesforce operates with over 170,000 companies and 17,000 nonprofit organizations. In June 2004, Salesforce was offered on the New York Stock Exchange under the symbol CRM.

Cross-References

▶ Customer Service
▶ Data Aggregation
▶ Social Media
▶ Streaming Data

Further Readings

Denning, S. (2011). Successfully implementing radical management at Salesforce.com. Strategy & Leadership, 39(6), 4.
Scientometrics

Jon Schmid
Georgia Institute of Technology, Atlanta, GA, USA

Scientometrics refers to the study of science through the measurement and analysis of researchers' productive outputs. These outputs include journal articles, citations, books, patents, data, and conference proceedings. The impact of big data analytics on the field of scientometrics has primarily been driven by two factors: the emergence of large online bibliographic databases and a recent push to broaden the evaluation of research impact beyond citation-based measures.

Large online databases of articles, conference proceedings, and books allow researchers to study the manner in which scholarship develops and to measure the impact of researchers, institutions, and even countries on a field of scientific knowledge. Using data on social media activity, article views, downloads, social bookmarking, and the text posted on blogs and other websites, researchers are attempting to broaden the manner in which scientific output is measured.

Bibliometrics, a subdiscipline of scientometrics that focuses specifically on the study of scientific publications, witnessed a boom in research due to the emergence of large digital bibliographic databases such as Web of Science, Scopus, Google Scholar, and PubMed. The utility of increased digital indexing is enhanced by the recent surge in total scientific output. Lutz Bornmann and Ruediger Mutz find that global scientific output has grown at a rate of 8–9% per year since World War II (equivalent to a doubling every 9 years) (Bornmann and Mutz 2015).

Bibliometric analysis using large data sets has been particularly useful in research that seeks to understand the nature of research collaboration. Because large bibliographic databases contain information on coauthorships, the institutions that host authors, journals, and publication dates, text mining software can be used in combination with social network analysis to understand the nature of collaborative networks. Visualizations of these networks are increasingly used to show patterns of collaboration, ties between scientific disciplines, and the impact of scientific ideas. For example, Hanjun Xian and Krishna Madhavan analyzed over 24,000 journal articles and conference proceedings from the field of engineering education in an effort to understand how the literature was produced (Xian and Madhavan 2014). These data were used to map the network of collaborative ties in the discipline. The study found that cross-disciplinary scholars played a critical role in linking isolated network segments.
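The kind of co-authorship analysis described above can be sketched with the NetworkX library (assumed to be available; the papers and author names below are invented): authors become nodes, every co-authored paper adds edges between its authors, and betweenness centrality serves as a rough indicator of the bridging role played by cross-disciplinary scholars.

```python
from itertools import combinations
import networkx as nx

# Hypothetical bibliographic records: each paper's author list.
papers = [
    ["Ahmed", "Baker", "Chen"],
    ["Chen", "Dlamini"],
    ["Dlamini", "Evans", "Fischer"],
    ["Ahmed", "Baker"],
    ["Evans", "Fischer"],
]

G = nx.Graph()
for authors in papers:
    for a, b in combinations(authors, 2):   # every co-author pair shares an edge
        G.add_edge(a, b)

# Authors with high betweenness sit on the paths linking otherwise separate clusters.
bridging = nx.betweenness_centrality(G)
for author, score in sorted(bridging.items(), key=lambda item: -item[1]):
    print(author, round(score, 2))
```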
Besides studying authorship and collaboration, big data analytics have been used to analyze citations to measure the impact of research, researchers, and research institutions. Citations are a common proxy for the quality of research. Important papers will generally be highly cited as subsequent research relies on them to advance knowledge.

One prominent metric used in scientometrics is the h-index, which was proposed by Jorge Hirsch in 2005. The h-index considers the number of publications produced by an individual or organization and the number of citations these publications receive. An individual can be said to have an h-index of h when she produces h publications that each receive at least h citations, while her remaining publications each receive no more than h citations.
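The h-index definition translates directly into a short function. The citation counts below are invented; with a real export from a bibliographic database, the same function applies unchanged.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position
        else:
            break
    return h

# Hypothetical citation counts for one researcher's publications.
print(h_index([25, 17, 12, 9, 8, 4, 3, 1, 0]))  # -> 5
```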
The advent of large databases and big data factors other than citations. PLOS Article-Level
analytics has greatly facilitated the calculation of Metrics pulls in data on article downloads,
the h-index and similar impact metrics. For exam- commenting and sharing via services such
ple, in a 2013 study, Filippo Radicchi and Claudio CiteuLike, Connotea, and Facebook, to broaden
Castellano utilized the Google Scholar Citations the way in which a scholar’s contribution is
data set to evaluate the individual scholarly con- measured.
tribution of over 35,000 scholars (Radicchi and Certain academic fields, such as the humanities,
Castellano 2013). The researchers found that the that rely on under-indexed forms of scholarship
number of citations received by a scientist is a such as book chapters and monographs have proven
strong proxy for that scientist’s h-index, whereas difficult to study using traditional scientometrics
the number of publications is a less precise proxy. techniques. Because they do not depend on online
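To make the calculation concrete, the following short Python sketch (illustrative only, and not drawn from the studies cited above) computes an h-index from a list of per-publication citation counts:

```python
def h_index(citations):
    """Return the largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited papers first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:  # the rank-th paper still has at least `rank` citations
            h = rank
        else:
            break
    return h

# Example: five papers cited 10, 8, 5, 2, and 1 times give an h-index of 3.
print(h_index([10, 8, 5, 2, 1]))  # -> 3
```

In the example, three papers have at least three citations each, while no four papers have four or more.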
The same principles behind citation analysis can be applied to measure the impact or quality of patents. Large patent databases such as PATSTAT allow researchers to measure the importance of individual patents using forward citations. Forward citations come from the "prior art" section of the patent documents, which describes the technologies that were deemed critical to their innovation by the patent applicants. Scholars use patent counts, weighted by forward citations, to derive measures of national innovative productivity.

Until recently, measurement of research impact has been almost exclusively based on citation-based measures. However, citations are slow to accumulate and ignore the influence of research on the broader public. Recently there has been a push to include novel data sources in the evaluation of research impact. Gunther Eysenbach has found that tweets about a journal article within the first 3 days of publication are a strong predictor of eventual citations for highly cited research articles (Eysenbach 2011). The direction of causality in this relationship – i.e., whether strong papers lead to a high volume of tweets or whether the tweets themselves cause subsequent citations – is unclear. However, the author suggests that the most promising use of social media data lies not in its use as a predictor of traditional impact measures but as a means of creating novel metrics of the social impact of research.

Indeed, the development of an alternative set of measurements – often referred to as "altmetrics" – based on data gleaned from the social web represents a particularly active field of scientometrics research. Toward this end, services such as PLOS Article-Level Metrics use big data techniques to develop metrics of research impact that consider factors other than citations. PLOS Article-Level Metrics pulls in data on article downloads, commenting, and sharing via services such as CiteULike, Connotea, and Facebook, to broaden the way in which a scholar's contribution is measured.

Certain academic fields, such as the humanities, that rely on under-indexed forms of scholarship such as book chapters and monographs have proven difficult to study using traditional scientometrics techniques. Because they do not depend on online bibliographic databases, altmetrics may prove useful in studying such fields. Björn Hammarfelt uses data from Twitter and Mendeley – a web-based citation manager that has a social networking component – to study scholarship in the humanities (Hammarfelt 2014). While his study suggests that coverage gaps still exist using altmetrics, as these applications become more widely used, they will likely become a useful means of studying neglected scientific fields.

See Also

▶ Bibliometrics
▶ Social Media
▶ Text Analytics
▶ Thomson Reuters

Further Readings

Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222. arXiv:1402.4578 [Physics, Stat].
Eysenbach, G. (2011). Can tweets predict citations? Metrics of social impact based on Twitter and correlation with traditional metrics of scientific impact. Journal of Medical Internet Research, 13, e123.
Hammarfelt, B. (2014). Using altmetrics for assessing research impact in the humanities. Scientometrics, 101, 1419–1430.
Radicchi, F., & Castellano, C. (2013). Analysis of bibliometric indicators for individual scholars in a large data set. Scientometrics, 97(3), 627–637. https://doi.org/10.1007/s11192-013-1027-3.
Xian, H., & Madhavan, K. (2014). Anatomy of scholarly collaboration in engineering education: A big-data bibliometric analysis. Journal of Engineering Education, 103, 486–514.
Semantic/Content Analysis/Natural Language Processing

Paul Nulty
Centre for Research in Arts Social Science and Humanities, University of Cambridge, Cambridge, United Kingdom

Introduction

One of the most difficult aspects of working with big data is the prevalence of unstructured data, and perhaps the most widespread source of unstructured data is the information contained in text files in the form of natural language. Human language is in fact highly structured, but although major advances have been made in automated methods for symbolic processing and parsing of language, full computational language understanding has yet to be achieved, and so a combination of symbolic and statistical approaches to machine understanding of language are commonly used. Extracting meaning or achieving understanding from human language through statistical or computational processing is one of the most fundamental and challenging problems of artificial intelligence. From a practical point of view, the dramatic increase in availability of text in electronic form means that reliable automated analysis of natural language is an extremely useful source of data for many disciplines.

Big data is an interdisciplinary field, of which natural language processing (NLP) is a fragmented and interdisciplinary subfield. Broadly speaking, researchers use approaches somewhere on a continuum between representing and parsing the structures of human language in a symbolic, rule-based fashion, or feeding large amounts of minimally preprocessed text into more sophisticated statistical machine learning systems. In addition, various substantive research areas have developed overlapping but distinct methods for computational analysis of text.

The question of whether NLP tasks are best approached with statistical, data-driven methods or symbolic, theory-driven models is an old debate. In 1957, Noam Chomsky wrote:

it must be recognized that the notion of "probability of a sentence" is an entirely useless one, under any known interpretation of this term.

However, at present the best methods we have for translating, searching, and classifying natural language text use flexible machine-learning algorithms that learn parameters probabilistically from relatively unprocessed text. On the other hand, some applications, such as the IBM Watson question answering system (Ferrucci et al. 2010), make good use of a combination of probabilistic learning and modules informed by linguistic theory to disambiguate nuanced queries.
The field of computational linguistics originally had the goal of improving understanding of human language using computational methods. Historically, this meant implementing rules and structures inspired by the cognitive structures proposed by Chomskyan generative linguistics. Over time, computational linguistics has broadened to include diverse methods for machine processing of language irrespective of whether the computational models are plausible cognitive models of human language processing. As practiced today, computational linguistics is closer to a branch of computer science than a branch of linguistics. The branch of linguistics that uses quantitative analysis of large text corpora is known as corpus linguistics.

Research in computational linguistics and natural language processing involves finding solutions for the many subproblems associated with understanding language, and combining advances in these modules to improve performance on general tasks. Some of the most important NLP subproblems include part-of-speech tagging, syntactic parsing, identifying the semantic roles played by verb arguments, recognizing named entities, and resolving references. These feed into performance on more general tasks like machine translation, question answering, and summarization.

In the social sciences, the terms quantitative content analysis, quantitative text analysis, or "text as data" are all used. Content analysis may be performed by human coders, who read and mark-up documents. This process can be streamlined with software. Fully automated content analysis, or quantitative text analysis, typically employs statistical word-frequency analysis to discover latent traits from text, or scale documents of interest on a particular dimension of interest in social science or political science.

Tools and Resources

Text data does not immediately challenge computational resources to the same extent as other big data sources such as video or sensor data. For example, the entire proceedings of the European parliament from 1996 to 2005, in 21 languages, can be stored in 5.4 gigabytes – enough to load into main memory on most modern machines. While techniques such as parallel and distributed processing may be necessary in some cases, for example, global streams of social media text or applying machine learning techniques for classification, typically the challenge of text data is to parse and extract useful information from the idiosyncratic and opaque structures of natural language, rather than overcoming computational difficulties simply to store and manipulate the text. The unpredictable structure of text files means that general purpose programming languages are commonly used, unlike in other applications where the tabular format of the data allows the use of specialized statistical software.

Original Unix command line tools such as grep, sed, and awk are still extremely useful for batch processing of text documents. Historically, Perl has been the programming language of choice for text processing, but recently Ruby and Python have become more widely used. These are scripting languages, designed for ease of use and flexibility rather than speed. For more computationally intensive tasks, NLP tools are implemented in Java or C/C++.

The Python libraries spaCy and gensim and the Java-based Stanford Core NLP software are widely used in industry and academia. They provide implementations and guides for the most widely used text processing and statistical document analysis methods.
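As a minimal illustration of such off-the-shelf tools, a few lines of Python with spaCy are enough to tokenize, lemmatize, and part-of-speech tag a sentence (this sketch assumes the small English model en_core_web_sm has been installed alongside the library):

```python
import spacy

# Load a pretrained English pipeline (downloaded separately,
# e.g., via `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")

doc = nlp("The committee debated the new data policy.")
for token in doc:
    # Surface form, dictionary form, and part-of-speech tag for each token.
    print(token.text, token.lemma_, token.pos_)
```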
Preprocessing

The first step in approaching a text analysis dataset is to successfully read the document formats and file encodings used. Most programming languages provide libraries for interfacing with Microsoft Word and pdf documents. The ASCII coding system represents unaccented English upper and lowercase letters, numbers, and punctuation, using one byte per character. This is no longer sufficient for most purposes, and modern documents are encoded in a diverse set of character encodings. The Unicode system defines code points which can represent characters and symbols from all writing systems. The UTF-8 and UTF-16 encodings implement these code points in 8 bit or 16 bit encoded files.

Words are the most apparent units of written text, and most text processing tasks begin with tokenization – dividing the text into words. In many languages, this is relatively uncomplicated: whitespace delimits words, with a few ambiguous cases such as hyphenation, contraction, and the possessive marker. Within languages written in the Roman alphabet there is some variance; for example, agglutinative languages like Finnish and Hungarian tend to use long compound terms disambiguated by case markers, which can make the connection between space-separated words and dictionary-entry meanings tenuous. For languages with a different orthographic system, such as Chinese, Japanese, and Arabic, it is necessary to use a customized tokenizer to split text into units suitable for quantitative analysis.

Even in English, the correspondence between space-separated word and semantic unit is not exact. The fundamental unit of vocabulary – sometimes called the lexeme – may be modified or inflected by the addition of morphemes indicating tense, gender, or number. For many applications, it is not desirable to distinguish between the inflected forms of words; rather, we want to sum together counts of equivalent words. Therefore, it is common to remove the inflected endings of words and count only the root, or stem. For example, a system to judge the sentiment of a movie review need not distinguish between the words "excite," "exciting," "excites," and "excited." Typically the word ending is removed and the terms are treated equivalently.

The Porter stemmer (Porter 1980) is one of the most frequently used algorithms for this purpose. A slightly more sophisticated method is lemmatization, which also normalizes inflected words, but uses a dictionary to match irregular forms such as "be"/"is"/"are". In addition to stemming and tokenizing, it may be useful to remove very common words that are unlikely to have semantic content related to the task. In English, the most common words are function words such as "of," "in," and "the." These "stopwords" largely serve a grammatical rather than semantic function, and some NLP systems simply remove them before proceeding with a statistical analysis.
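A minimal sketch of these preprocessing steps in Python is shown below; it uses a simple regular-expression tokenizer, the Porter stemmer implementation from NLTK, and a small illustrative stopword list (real systems typically use much longer lists):

```python
import re
from nltk.stem import PorterStemmer  # implementation of Porter (1980)

STOPWORDS = {"of", "in", "the", "a", "an", "and", "is", "it", "to"}  # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase and split on non-letter characters (a crude tokenizer).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords and reduce the remaining words to their stems.
    return [stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS]

print(preprocess("The exciting film excites the excited reviewers."))
# -> ['excit', 'film', 'excit', 'excit', 'review']
```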
After the initial text preprocessing, there are several simple metrics that may be used to assess the complexity of language used in the documents. The type-token ratio, a measure of lexical diversity, gives an estimate of the complexity of the document by comparing the total number of words in the document to the number of unique words (i.e., the size of the vocabulary). The Flesch-Kincaid readability metric uses the average sentence length and the average number of syllables per word combined with coefficients calibrated with data from students to give an estimate of the grade-level reading difficulty of a text.
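Both metrics are straightforward to compute from raw text. The rough Python sketch below illustrates this; note that the syllable counter is a simple approximation (it just counts groups of consecutive vowels) rather than a dictionary-based count:

```python
import re

def type_token_ratio(tokens):
    # Unique word forms (types) divided by total running words (tokens).
    return len(set(tokens)) / len(tokens)

def count_syllables(word):
    # Crude approximation: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level coefficients.
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = "The cat sat on the mat. It was a sunny day."
print(type_token_ratio(re.findall(r"[a-z']+", sample.lower())))
print(flesch_kincaid_grade(sample))
```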
Document-Term Matrices

After tokenization and other preprocessing steps, most text analysis methods work with a matrix that stores the frequency with which each word in the vocabulary occurs in each document. This is the simplest case, known as the "bag-of-words" model, and no information about the ordering of the words in the original texts is retained. More sophisticated analysis might involve extracting counts of complex features from the documents. For example, the text may be parsed and tagged with part-of-speech information as part of the preprocessing stage, which would allow for words with identical spellings but different part-of-speech categories or grammatical roles to be counted as separate features.

Often, rather than using only single words, counts of phrases are used. These are known as n-grams, where n is the number of words in the phrase; for example, trigrams are three-word sequences. N-gram models are especially important for language modeling, used to predict the probability of a word or phrase given the preceding sequence of words. Language modeling is particularly important for natural language generation and speech recognition problems.
Once each document has been converted to a row of counts of terms or features, a wide range of automated document analysis methods can be employed. The document-term matrix is usually sparse and uneven – a small number of words occur very frequently in many documents, while a large number of words occur rarely, and most words do not occur at all in a given document. Therefore, it is common practice to smooth or weight the matrix, either using the log of the term frequency or with a measure of term importance like tf-idf (term frequency × inverse document frequency) or mutual information.
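A short sketch using scikit-learn (one common off-the-shelf implementation, not the only option) builds such a weighted document-term matrix from a toy corpus; setting `sublinear_tf=True` applies the log of the term frequency mentioned above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Each row is a document, each column a term; cells hold log-scaled tf-idf weights.
vectorizer = TfidfVectorizer(sublinear_tf=True)
dtm = vectorizer.fit_transform(docs)

print(dtm.shape)                            # (3 documents, size of the vocabulary)
print(sorted(vectorizer.vocabulary_)[:5])   # first few terms in the learned vocabulary
```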
Matrix Analysis

Supervised classification methods attempt to automatically categorize documents based on the document-term matrix. One of the most familiar of such tasks is the email spam detection problem. Based on the frequencies of words in a corpus of emails, the system must decide if an email is spam or not. Such a system is supervised in the sense that it requires as a starting point a set of documents that have been correctly labeled with the appropriate category, in order to build a statistical model of which terms are associated with each category. One simple and effective algorithm for supervised document classification is Naive Bayes, which gives a new document the class that has the maximum a posteriori probability given the term counts and the independent associations between the terms and the categories in the training documents. In political science, a similar algorithm – "wordscores" – is widely used, which sums Naive-Bayes-like word parameters to scale new documents based on reference scores assigned to training texts with extreme positions (Laver et al. 2003).
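A minimal supervised classification sketch in Python, using scikit-learn's multinomial Naive Bayes on a tiny invented spam-detection corpus (the example texts and labels are illustrative assumptions, not real data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled training corpus (purely illustrative).
train_texts = [
    "win a free prize now", "cheap meds click here",
    "meeting agenda for tomorrow", "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Convert texts to term counts, then fit a Naive Bayes model on the labeled examples.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize meeting"]))  # assigns the most probable class
```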
Other widely used supervised classifiers include support vector machines, logistic regression, and nearest neighbor models. If the task is to predict a continuous variable rather than a class label, then a regression model may be used. Statistical learning and prediction systems that operate on text data very often face the typical big data problem of having more features (word types) than observed or labeled documents. This is a high dimensional learning problem, where p (the number of parameters) is much larger than n (the number of observed examples). In addition, word frequencies are extremely unevenly distributed (an observation known as Zipf's law) and are highly correlated with one another, resulting in parameter vectors that make less than ideal examples for regression models. It may therefore be necessary to use regression methods designed to mitigate this problem, such as lasso and ridge regression, or to prune the feature space to avoid overtraining, using feature subset selection or a dimensionality reduction technique like principal components analysis or singular value decomposition. With recent advances in neural network research, it has become more common to use unprocessed counts of n-grams, tokens, or even characters as input to a neural network with many intermediate layers. With sufficient training data, such a network can learn the feature extraction process better than hand-curated feature extraction systems, and these "deep learning" networks have improved the state of the art in machine translation and image labeling.

Unsupervised methods can cluster documents or reveal the distribution of topics in documents in a data-driven fashion. For unsupervised scaling and clustering of documents, methods include k-means clustering, or the Wordfish algorithm, a multinomial Poisson scaling model for political documents (Slapin and Proksch 2008).

Another goal of unsupervised analysis is to measure what topics comprise the text corpus, and how these topics are distributed across documents. Topic modeling (Blei 2012) is a widely used generative technique to discover a set of topics that influence the generation of the texts, and explore how they are associated with other variables of interest.
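As a sketch of how such a model can be fit in practice, the example below uses scikit-learn's implementation of latent Dirichlet allocation on a small invented corpus (the documents, number of topics, and random seed are arbitrary assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "budget taxes spending economy",
    "election votes campaign party",
    "economy jobs growth trade",
    "party leader campaign debate",
]

counts = CountVectorizer().fit_transform(docs)            # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)                     # per-document topic proportions

print(doc_topic.round(2))   # each row gives the document's mixture over the 2 topics
```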
hypothesis,” as described by JR Firth (Firth 1957), over the last few decades, and with the prepon-
is the idea that “you shall know a word by the derance of online training data and advances in
company it keeps.” The co-occurrence vectors of machine learning methods, it is likely that further
words have been shown to be useful for noun gains will be made in the coming years. For
phrase disambiguation, semantic relation extrac- researchers intending to make use of rather than
tion, and analogy resolution. Many systems now advance these methods, a fruitful approach is a
use the factorization of the co-occurrence matrices good working knowledge of a general purpose
as the initial input to statistical learners, allowing a programming language, combined with the ability
fine-grained representation of lexical semantics. to configure and execute off-the-shelf machine
Vector semantics also allows for word sense dis- learning packages.
ambiguation – it is possible to distinguish the
different senses of a word by clustering the vector
representations of its occurrences. Cross-References
These vectors may count instances of words
co-occurring with the same context (syntagmatic ▶ Artificial Intelligence
relations) or compare the similarity of the contexts ▶ Biomedical Natural Language Processing
of words as a measure of their substitutability ▶ Python Scripting Language
(paradigmatic relations) (Turney and Pantel ▶ Supervised Machine Learning
2010). The use of neural networks or dimension- ▶ Text Analytics
ality reduction techniques allows researchers to ▶ Unstructured Data
produce a relatively low dimensional space in
which to compare word vectors, sometimes called
word embeddings.
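The following minimal sketch builds a word-by-word co-occurrence matrix from a toy corpus and factorizes it with truncated singular value decomposition to obtain small, dense word vectors of the kind described above (the window size, the number of dimensions, and the corpus itself are arbitrary assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of +/-2 words.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                cooc[index[w], index[sent[j]]] += 1

# Factorize the co-occurrence matrix into low-dimensional word vectors ("embeddings").
svd = TruncatedSVD(n_components=3, random_state=0)
vectors = svd.fit_transform(cooc)
print(vectors[index["cat"]])
```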
Machine learning has long been used to perform classification of documents or to aid the accuracy of NLP subtasks described above. However, as in many other fields, the recent application of neural networks with many hidden layers (Deep Learning) has led to large improvements in accuracy rates on many tasks. These opaque but computationally powerful techniques require only a large volume of training data and a differentiable target function to model complex linguistic behavior.

Conclusion

Natural language processing is a complex and varied problem that lies at the heart of artificial intelligence. The combination of statistical and symbolic methods has led to huge leaps forward over the last few decades, and with the preponderance of online training data and advances in machine learning methods, it is likely that further gains will be made in the coming years. For researchers intending to make use of rather than advance these methods, a fruitful approach is a good working knowledge of a general purpose programming language, combined with the ability to configure and execute off-the-shelf machine learning packages.

Cross-References

▶ Artificial Intelligence
▶ Biomedical Natural Language Processing
▶ Python Scripting Language
▶ Supervised Machine Learning
▶ Text Analytics
▶ Unstructured Data

References

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Chomsky, N. (2002). Syntactic structures. Berlin: Walter de Gruyter.
Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A., Lally, A., Murdock, J., Nyberg, E., Prager, J., Schlaefer, N., & Welty, C. A. (2010). Building Watson: An overview of the DeepQA project. AI Magazine, 31(3), 59–79.
Firth, J. R. (1957). A synopsis of linguistic theory. In Studies in linguistic analysis. Oxford: Blackwell.
Laver, M., Benoit, K., & Garry, J. (2003). Extracting policy positions from political texts using words as data. American Political Science Review, 97(02), 311–331.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Slapin, J. B., & Proksch, S.-O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705–722.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.
Semi-structured Data

Yulia A. Strekalova1 and Mustapha Bouakkaz2
1College of Journalism and Communications, University of Florida, Gainesville, FL, USA
2University Amar Telidji Laghouat, Laghouat, Algeria

More and more data become available electronically every day, and they may be stored in a variety of data systems. Some data entries may reside in unstructured document file systems, and some data may be collected and stored in highly structured relational databases. The data itself may represent raw images and sounds or come with a rigid structure as strictly entered entities. However, a lot of data currently available through public and proprietary data systems is semi-structured.

Definition

Semi-structured data is data that resembles structured data by its format but is not organized with the same restrictive rules. This flexibility allows collecting data even if some data points are missing or contain information that is not easily translated in a relational database format. Semi-structured data carries the richness of human information exchange, but most of it cannot be automatically processed and used. Developments in markup languages and software applications allow the collection and evaluation of semi-structured data, but the richness of natural text contained in semi-structured data still presents challenges for analysts.

Structured data has been organized into a format that makes it easier to access and process, such as databases where data is stored in columns, which represent the attributes of the database. In reality, very little data is completely structured. Conversely, unstructured data has not been reformatted, and its elements are not organized into a data structure. Semi-structured data combines some elements of both data types. It is not organized in a complex manner that supports immediate analyses; however, it may have information associated with it, such as metadata tagging, that allows the elements it contains to be addressed through more sophisticated access queries. For example, a word document is generally considered to be unstructured data. However, when metadata tags in the form of keywords that represent the document content are added, the data becomes semi-structured.

Data Analysis

The volume and unpredictable structure of the available data present challenges in analysis. To get meaningful insights from semi-structured data, analysts need to pre-analyze it to ask questions that can be answered with the data.
The fact that a large number of correlations can be found does not necessarily mean that an analysis is reliable and complete. One of the preparation measures before the actual data analysis is data reduction. While a large number of data points may be available for collection, not all of these data points should be included in an analysis of every question. Instead, a careful consideration of data points is likely to produce a more reliable and explainable interpretation of observed data. In other words, just because the data is available, it does not mean it needs to be included in the analysis. Some elements may be random and will not add substantively to the answer to a particular question. Some other elements may be redundant and not add any new information compared to that already provided by other data points.

Jules Berman suggests nine steps to the analysis of semi-structured data. Step 1 includes formulation of a question which can and will be subsequently answered with data. A Big Data approach may not be the best strategy for questions that can be answered with other traditional research methods. Step 2 evaluates data resources available for collection. Data repositories may have "blind spots" or data points that are systematically excluded or restricted for public access. At step 3, a question is reformulated to adjust for the resources identified in step 2. Available data may be insufficient to answer the original question despite the access to large amounts of data. Step 4 involves evaluation of possible query outputs. Data mining may return a large number of data points, but these data points most frequently need to be filtered to focus on the analysis of the question at hand. At step 5, data should be reviewed and evaluated for its structure and characteristics. Returned data may be quantitative or qualitative, or it may have data points which are missing for a substantial number of records, which will impact future data analysis. Step 6 requires a strategic and systematic data reduction. Although it may sound counterintuitive, Big Data analysis can provide the most powerful insights when the data set is condensed to the bare essentials needed to answer a focused question. Some collected data may be irrelevant or redundant to the problem at hand and will not be needed for the analysis. Step 7 calls for the identification of analytic algorithms, should they be deemed necessary. Algorithms are analytic approaches to data, which may be very sophisticated. However, establishing a reliable set of meaningful metrics to answer a question may be a more reliable strategy. Step 8 looks at the results and conclusions of the analysis and calls for conservative assessment of possible explanations and models suggested by the data, assertions of causality, and possible biases. Finally, step 9 calls for validation of the results of step 8 using comparable data sets. Invalidation of predictions may suggest necessary adjustments to any of the steps in the data analysis and make conclusions more robust.

Data Management

Semi-structured data includes both database characteristics and incorporates documents and other file types, which cannot be fully described by a standard database entry. Data entries in structured data sets follow the same order; all entries in a group have the same descriptions, defined format, and predefined length. In contrast, semi-structured data entries are organized in semantic entities, similar to structured data, which may not have the same attributes in the same order or of the same length. Early digital databases were organized based on the relational model of data, where data is recorded into one or more tables with a unique identifier for each entry. The data for such databases needs to be structured uniformly for each record. Semi-structured data instead relies on tags or other markers to separate data elements. Semi-structured data may miss data elements or have more than one data point in an element. Overall, while semi-structured data has a predefined structure, the data within this structure is not entered with the same rigor as in traditional relational databases. This data management situation arises from the practical necessity to handle the user-generated and widely interactional data brought about by Web 2.0. The data contained in emails, blog posts, PowerPoint presentation files, images, and videos may have very different sets of attributes, but they also offer a possibility to assign metadata systematically. Metadata may include information about author and time and may create the structure to assign the data to semantic groups. Unstructured data, on the other hand, is data that cannot be readily organized in tables to capture its full extent. Semi-structured data, as the name suggests, carries some elements of structured data. These elements are metadata tags that may list the author or sender, entry creation and modification times, the length of a document, or the number of slides in a presentation. Yet, these data also have elements that cannot be described in a traditional relational database. For example, a traditional database structure, which would require an initial infrastructure design, cannot easily handle information such as a sent email and all of the responses it received, because it is unknown whether respondents will use one or all names in their replies, whether anyone will be added or omitted, whether the original message will be modified, or whether attachments will be added to subsequent messages.

Semi-structured data allows programmers to nest data or create hierarchies that represent complex data models and relationships among entries. However, the robustness of the traditional relational data model forces more thoughtful implementation of data applications and possible subsequent ease in analysis. Handling of semi-structured data is associated with some challenges. The data itself may present a problem by being embedded in natural text, which cannot always be extracted automatically with precision. Natural text is based on sentences that may not have easily identifiable relationships and entities, which are necessary for data collection, and there is a lack of widely accepted standards for vocabularies. A communication process may involve different models to transfer the same information or require the richer data transfer available through natural text rather than a structured exchange of keywords. For example, an email exchange can capture data about senders and recipients, but automated filtering and analysis of the body of the email are limited.

Two main types of semi-structured data formats are Extensible Markup Language (XML) and JavaScript Object Notation (JSON). XML, developed in the mid-1990s, is a markup language that sets rules for data interchange. XML, although an improvement over earlier markup languages, has been critiqued for being bulky and cumbersome in implementation. JSON is viewed as a possible successor format for digital architecture and database technologies. JSON is an open standard format that transmits data between an application and a server. Data objects in JSON format consist of attribute-value pairs stored in databases like MongoDB and Couchbase. The data, which is stored in a database like MongoDB, can be pulled with a software network for more efficient and faster processing. Apache Hadoop is an example of an open-source framework that provides both storage and processing support. Other multi-platform query processing applications suitable for enterprise-level use are Apache Spark and Presto.
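A small illustration (with invented field names and addresses) of why such records are called semi-structured: the two JSON objects below describe the same kind of entity, yet one has an extra, multi-valued element that a fixed relational schema would not anticipate. In Python:

```python
import json

records = json.loads("""
[
  {"from": "alice@example.com", "to": ["bob@example.com"],
   "subject": "Quarterly report", "sent": "2016-05-01T09:30:00"},
  {"from": "bob@example.com", "to": ["alice@example.com", "carol@example.com"],
   "subject": "Re: Quarterly report", "attachments": ["draft.docx", "figures.xlsx"]}
]
""")

for message in records:
    # Elements may be missing or repeated; .get() supplies a default when a tag is absent.
    print(message["subject"], "-", len(message.get("attachments", [])), "attachment(s)")
```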
See Also

▶ Big Data Storytelling, Digital Storytelling
▶ Discovery Analytics
▶ Hadoop
▶ MongoDB
▶ Text Analytics

Further Readings

Abiteboul, S., et al. (2012). Web data management. New York: Cambridge University Press.
Foreman, J. W. (2013). Data smart: Using data science to transform information into insight. Indianapolis: Wiley.
Miner, G., et al. (2012). Practical text mining and statistical analysis for non-structured text data applications. Waltham: Academic.
Sentiment Analysis

Francis Dalisay1, Matthew J. Kushin2 and Masahiro Yamamoto3
1Communication & Fine Arts, College of Liberal Arts & Social Sciences, University of Guam, Mangilao, GU, USA
2Department of Communication, Shepherd University, Shepherdstown, WV, USA
3Department of Communication, University at Albany – SUNY, Albany, NY, USA

Sentiment analysis is defined as the computational study of opinions, or sentiment, in text. Sentiment analysis typically intends to capture an opinion holder's evaluative response (e.g., positive, negative, or neutral, or a more fine-grained classification scheme) toward an object. The evaluative response reflects an opinion holder's attitudes, or affective feelings, beliefs, thoughts, and appraisals.

According to scholars Erik Cambria, Björn Schuller, Yunqing Xia, and Catherine Havasi, sentiment analysis is a term typically used interchangeably with opinion mining to refer to the same field of study. The scholars note, however, that opinion mining generally involves the detection of the polarity of opinion, also referred to as the sentiment orientation of a given text (i.e., whether the expressed opinion is positive, negative, or neutral). Sentiment analysis focuses on the recognition of emotion (e.g., emotional states such as "sad" or "happy"), but also typically involves some form of opinion mining. For this reason, and since both fields rely on natural language processing (NLP) to analyze opinions from text, sentiment analysis is often couched under the same umbrella as opinion mining.

Sentiment analysis has gained popularity as a social data analytics tool. Recent years have witnessed the widespread adoption of social media platforms as outlets to publicly express opinions on nearly any subject, including those relating to political and social issues, sporting and entertainment events, weather, and brand and consumer experiences. Much of the content posted on sites such as Twitter, Facebook, YouTube, customer review pages, and news article comment boards is public. As such, businesses, political campaigns, universities, and government entities, among others, can collect and analyze this information to gain insight into the thoughts of key publics.

The ability of sentiment analysis to measure individuals' thoughts and feelings has a wide range of practical applications. For example, sentiment analysis can be used to analyze online news content and to examine the polarity of news coverage of particular issues. Also, businesses are able to collect and analyze the sentiment of comments posted online to assess consumers' opinions toward their products and services, evaluate the effectiveness of advertising and PR campaigns, and identify customer complaints.
Gathering such market intelligence helps guide decision-making in the realms of product research and development, marketing and public relations, crisis management, and customer relations. Although businesses have traditionally relied on surveys and focus groups, sentiment analysis offers several unique advantages over such conventional data collection methods. These advantages include reduced cost and time, increased access to much larger samples and hard-to-reach populations, and real-time intelligence. Thus, sentiment analysis can be a useful market research tool. Indeed, sentiment analysis is now commonly offered by many commercial social data analysis services.

Approaches

Broadly speaking, there exist two approaches to the automatic extraction of sentiment from textual material: the lexicon-based approach and the machine learning-based approach. In the lexicon-based approach, a sentiment orientation score is calculated for a given text unit based on a predetermined set of opinion words with positive (e.g., good, fun, exciting) and negative (e.g., bad, boring, poor) sentiments. In a simple form, a list of words, phrases, and idioms with known sentiment orientations is built into a dictionary, or an opinion lexicon. Each word is assigned specific sentiment orientation scores. Using the lexicon, each opinion word extracted receives a predefined sentiment orientation score, which is then aggregated for a text unit.
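A minimal sketch of this lexicon-based scoring in Python, using a tiny invented opinion lexicon (production systems rely on lexicons with thousands of scored entries):

```python
# Tiny illustrative opinion lexicon: word -> sentiment orientation score.
LEXICON = {"good": 1, "fun": 1, "exciting": 2, "excellent": 2,
           "bad": -1, "boring": -1, "poor": -2, "terrible": -2}

def lexicon_score(text):
    words = text.lower().split()
    # Sum the predefined scores of every opinion word found in the text unit.
    return sum(LEXICON.get(word, 0) for word in words)

review = "the plot was boring but the acting was excellent"
score = lexicon_score(review)
print(score, "positive" if score > 0 else "negative" if score < 0 else "neutral")
```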
The machine learning-based approach, also called the text classification approach, builds a sentiment classifier to determine whether a given text about an object is positive, negative, or neutral. Using the ability of machines to learn, this approach trains a sentiment classifier using a large set of examples, or training corpus, that have sentiment categories (e.g., positive, negative, or neutral). The sentiment categories are manually annotated by humans according to predefined rules. The classifier then applies the properties of the training corpus to classify data into sentiment categories.

Levels of Analysis

The classification of an opinion in text as positive, negative, or neutral (or a more fine-grained classification scheme) is impacted by and thus requires consideration of the level at which the analysis is conducted. There are three levels of analysis: document, sentence, and aspect and/or entity. First, document-level sentiment classification addresses a whole document as the unit of analysis. The task of this level of analysis is to determine whether an entire document (e.g., a product review, a blog post, an email, etc.) is positive, negative, or neutral about an object. This level of analysis assumes that the opinions expressed in the document are targeted toward a single entity (e.g., a single product). As such, this level is not particularly useful for documents that discuss multiple entities.

The second, sentence-level sentiment classification, focuses on the sentiment orientation of individual sentences. This level of analysis is also referred to as subjectivity classification and is comprised of two tasks: subjective classification and sentence-level classification. In the first task, the system determines whether a sentence is subjective or objective. If it is determined that the sentence expresses a subjective opinion, the analysis moves to the second task, sentence-level classification. This second task involves determining whether the sentence is positive, negative, or neutral.

The third type of classification is referred to as entity- and aspect-level sentiment analysis. Also called feature-based opinion mining, this level of analysis focuses on sentiments directed at entities and/or their aspects. An entity can include a product, service, person, issue, or event. An aspect is a feature of the entity, such as its color or weight.
“processing speed.” This sentence is negative sentiment if said sincerely but implies negative
about one aspect, “design,” and positive about sentiment if said sarcastically. Similarly, words
the other aspect, “processing speed.” Entity- and such as “sick,” “bad,” and “nasty” may have
aspect-level sentiment analysis is not limited to reversed sentiment orientation depending on con-
analyzing documents or sentences alone. Indeed, text and how they are used. For example, “My
although a document or sentence may contain new car is sick!” implies positive sentiment
opinions regarding multiple entities and their toward the car. These issues can also contribute
aspects, the entity- and aspect-level sentiment to inaccuracies in sentiment analysis.
analysis has the ability to identify the specific Altogether, despite these limitations, the com-
entities and/or aspects that the opinions on the putational study of opinions provided by senti-
document or sentence are referring to and then ment analysis can be beneficial for practical
determine whether the opinions are positive, neg- purposes. So long as individuals continue to
ative, or neutral. share their opinions through online user-generated
media, the possibilities for entities seeking to gain
meaningful insights into the opinions of key pub-
Challenges and Limitations lics will remain. Yet, challenges to sentiment,
analysis such as those discussed above, pose sig-
Extracting opinions from texts is a daunting task. nificant limitations to its accuracy and thus its
It requires a thorough understanding of the seman- usefulness in decision-making.
tic, syntactic, explicit, and implicit rules of a lan-
guage. Also, because sentiment analysis is carried
out by a computer system with a typical focus on Cross-References
analyzing documents on a particular topic, off-
topic passages containing irrelevant information ▶ Competitive Monitoring
may also be included in the analyses (e.g., a doc- ▶ Consumer Products
ument may contain information on multiple ▶ Data Mining
topics). This could result in creating inaccurate ▶ Facebook
global sentiment polarities about the main topic ▶ Internet
being analyzed. Therefore, the computer system ▶ LinkedIn
must be able to adequately screen and distinguish ▶ Marketing/Advertising
opinions that are not relevant to the topic being ▶ Online Identity
analyzed. Relatedly, for the machine learning- ▶ Real-Time Analytics
based approach, a sentiment classifier trained on ▶ SalesForce
a certain domain (e.g., car reviews) may perform ▶ Social Media
well on the particular topic, but may not when ▶ Twitter
applied to another domain (e.g., computer
review). The issue of domain independence is
another important challenge. Further Reading
Also, the complexities of human communica-
tion limit the capacity of sentiment analysis to Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013).
capture nuanced, contextual meanings that opin- New avenues in opinion mining and sentiment analysis.
IEEE Intelligent Systems, 28, 15–21.
ion holders actually intend to communicate in
Liu, B. (2011). Sentiment analysis and opinion mining. San
their messages. Examples include the use of sar- Rafael: Morgan & Claypool.
casm, irony, and humor in which context plays a Pang, B., & Lee, L. (2008). Opinion mining and sentiment
key role in conveying the intended message, par- analysis. Foundations and Trends in Information
Retrieval, 2(1–2), 1–135.
ticularly in cases when an individual says one
Pang, B., Lee, L., & Vaithyanathan S. (2002). Thumbs up?
thing but means the opposite. For example, some- Sentiment classification using machine learning tech-
one may say “nice shirt,” which implies positive niques. In Proceedings of the Conference on Empirical
4 Sentiment Analysis

Methods in Natural Language Processing (EMNLP) washingtonpost.com/politics/the-secret-service-wants-


(pp. 79–86). software-that-detects-sarcasm-yeah-good-luck/2014/
Zezima, K. The secret service wants software that detects 06/03/35bb8bd0-eb41-11e3-9f5c-9075d5508f0a_
sarcasm (Yeah, good luck.) The Washington Post. story.html
Retrieved 11 Aug 2014 from http://www.
Smart Cities

Jan Lauren Boyles
Greenlee School of Journalism and Communication, Iowa State University, Ames, IA, USA

Definition/Introduction

Smart cities are built upon aggregated, data-driven insights that are obtained directly from the urban infrastructure. These data points translate into actionable information that can guide municipal development and policy (Albino et al. 2015). Building on the emergent Internet of Things movement, networked sensors (often physically embedded into the built environment) create rich data streams that uncover how city resources are used (Townsend 2013; Komninos 2015; Sadowski and Pasquale 2015). Such intelligent systems, for instance, can send alerts to city residents when demand for urban resources outpaces supply or when emergency conditions exist within city limits. By analyzing these data flows (often in real time), elected officials, city staff, civic leaders, and average citizens can more fully understand resource use and allocation, thereby optimizing the full potential of municipal services (Hollands 2008; de Lange and de Waal 2013; Campbell 2013; Komninos 2015). Over time, the integration of such intelligent systems into metropolitan life acts to better inform urban policy making and better direct long-term municipal planning efforts (Batty 2013; Komninos 2015; Goldsmith and Crawford 2014). Despite this promise of more effective and responsive governance, however, achieving a truly smart city often requires the redesign (and in many cases, the physical rebuilding) of structures to harvest and process big data from the urban environment (Campbell 2013). As a result, global metropolitan leaders continue to experiment with cost-effective approaches to constructing smart cities in the late-2010s.

Heralded as potentially revolutionizing citizen-government interactions within cities, the initial integration of Internet Communication Technologies (ICTs) into the physical city in the late 1990s was viewed as the first step toward today's smart cities (Caragliu et al. 2011; Albino et al. 2015). In the early 2000s, the burgeoning population growth of global cities mandated the use of more sophisticated computational tools to effectively monitor and manage metropolitan resources (Campbell 2013; Meijer and Bolívar 2015). The rise of smart cities in the early 2010s can, in fact, be traced to a trio of technological advances: the adoption of cloud computing, the expansion of wireless networks, and the acceleration of processing power. At the same time, the societal uptick in mobile computing by everyday citizens enables more data to be collected on user habits and behaviors of urban residents (Batty 2013).
The most significant advance in smart city adoption rests, however, in geolocation – the concept that data can be linked to physical space (Batty 2013; Townsend 2013). European metropolises, in particular, have been early adopters of intelligent systems (Vanolo 2013).

The Challenges of Intelligent Governance

Tactically, most smart cities attempt to tackle wicked problems – the types of dilemmas that have historically puzzled city planners (Campbell 2013; Komninos 2015). The integration of intelligent systems into the urban environment has accelerated the time horizon for policymaking for these issues (Batty 2013). Data that once took years to gather and assess can now be accumulated and analyzed in mere hours, or in some cases, in real time (Batty 2013). Within smart cities, crowdsourcing efforts often also enlist residents, who voluntarily provide data to fuel collective and collaborative solutions (Batty 2013). Operating in this environment of heightened responsiveness, municipal leaders within smart cities are increasingly expected to integrate open data initiatives that provide public access to the information gathered by the data-driven municipal networks (Schrock 2016). City planners, civic activists, and urban technologists must also jointly consider the needs of city dwellers throughout the process of designing smart cities, directly engaging residents in the building of smart systems (de Lange and de Waal 2013). At the same time, urban officials must be increasingly cognizant that as more user behaviors within city limits are tracked with data, the surveillance required to power smart systems may also concurrently challenge citizen notions of privacy and security (Goldsmith and Crawford 2014; Sadowski and Pasquale 2015). Local governments must also ensure that the data collected will be safe and secure from hackers, who may wish to disrupt essential smart systems within cities (Schrock 2016).

Conclusion

The successful integration of intelligent systems into the city is centrally predicated upon financial investment in overhauling aging urban infrastructure (Townsend 2013; Sadowski and Pasquale 2015). Politically, investment decisions are further complicated by fragmented municipal leadership, whose priorities for smart city implementation may shift between election cycles and administrations (Campbell 2013). Rather than encountering these challenges in isolation, municipal leaders are beginning to work together to develop global solutions to shared wicked problems. Intelligent system advocates argue that developing collaborative approaches to building smart cities will drive the growth of smart cities into the next decade (Goldsmith and Crawford 2014).

Cross-References

▶ Internet of Things
▶ Open Data
▶ Semantic Web

Further Reading

Albino, V., Berardi, U., & Dangelico, R. M. (2015). Smart cities: Definitions, dimensions, performance, and initiatives. Journal of Urban Technology, 22(1), 3–21.
Batty, M. (2013). Big data, smart cities and city planning. Dialogues in Human Geography, 3(3), 274–279.
Campbell, T. (2013). Beyond smart cities: How cities network, learn and innovate. New York: Routledge.
Caragliu, A., Del Bo, C., & Nijkamp, P. (2011). Smart cities in Europe. Journal of Urban Technology, 18(2), 65–82.
de Lange, M., & de Waal, M. (2013). Owning the city: New media and citizen engagement in urban design. First Monday, 18(11). doi:10.5210/fm.v18i11.4954.
Goldsmith, S., & Crawford, S. (2014). The responsive city: Engaging communities through data-smart governance. San Francisco: Jossey-Bass.
Hollands, R. G. (2008). Will the real smart city please stand up? Intelligent, progressive or entrepreneurial? City, 12(3), 303–320.
Komninos, N. (2015). The age of intelligent cities: Smart environments and innovation-for-all strategies. New York: Routledge.
Meijer, A., & Bolívar, M. P. R. (2015). Governing the smart city: A review of the literature on smart urban governance. International Review of Administrative Sciences. doi:10.1177/0020852314564308.
Sadowski, J., & Pasquale, F. A. (2015). The spectrum of control: A social theory of the smart city. First Monday, 20(7). doi:10.5210/fm.v20i7.5903.
Schrock, A. R. (2016). Civic hacking as data activism and advocacy: A history from publicity to open government data. New Media & Society, 18(4), 581–599.
Townsend, A. (2013). Smart cities: Big data, civic hackers, and the quest for a new utopia. New York: W.W. Norton.
Vanolo, A. (2013). Smartmentality: The smart city as disciplinary strategy. Urban Studies, 51(5), 883–898.
Social Media

Dimitra Dimitrakopoulou
School of Journalism and Mass Communication, Aristotle University of Thessaloniki, Thessaloniki, Greece

Social media and networks are based on the technological tools and the ideological foundations of Web 2.0 and enable the production, distribution, and exchange of user-generated content. They transform the global media landscape by transposing the power of information and communication to the public that had until recently a passive role in the mass communication process.

Web 2.0 tools refer to the sites and services that emerged during the early 2000s, such as blogs (e.g., Blogspot, Wordpress), wikis (e.g., Wikipedia), microblogs (e.g., Twitter), social networking sites (e.g., Facebook, LinkedIn), video (e.g., YouTube), image (e.g., Flickr), file-sharing platforms (e.g., We, Dropbox), and related tools that allow participants to create and share their own content. Though the term was originally used to identify the second coming of the Web after the dotcom burst and restore confidence in the industry, it became inherent in the new WWW applications through its widespread use.

The popularity of Web 2.0 applications demonstrates that, regardless of their levels of technical expertise, users can wield technologies in more active ways than had been apparent previously to traditional media producers and technology innovators. In addition to referring to various communication tools and platforms, including social networking sites, social media also hint at a cultural mindset that emerged in the mid-2000s as part of the technical and business phenomenon referred to as Web 2.0.

It is important to distinguish between social media and social networks. Whereas the two terms are often used interchangeably, it is important to understand that social media are based on user-generated content produced by active users who can now act as producers as well. Social media have been defined on multiple levels, starting from more operational definitions that underline that social media indicate a shift from HTML-based linking practices of the open Web to linking and recommendation, which happen inside closed systems. Web 2.0 has three distinguishing features: it is easy to use, it facilitates sociality, and it provides users with free publishing and production platforms that allow them to upload content in any form, be it pictures, videos, or text. Social media are often contrasted to traditional media by highlighting their distinguishing features, as they refer to a set of online tools that supports social interaction between users. The term is often used to contrast with more traditional media such as television and books that deliver content to mass populations but do not facilitate the creation or sharing of content by users, as well as to highlight their ability to blur the distinction between personal communication and the broadcast model of messages.
Theoretical Foundations of Social Media

Looking into the role of the new interactive and empowering media, it is important to study their development as techno-social systems, focusing on the dialectic relation of structure and agency. As Fuchs (2014) describes, media are techno-social systems, in which information and communication technologies enable and constrain human activities that create knowledge that is produced, distributed, and consumed with the help of technologies in a dynamic and reflexive process that connects technological structures and human agency. The network infrastructure of the Internet allows multiple and multi-way communication and information flow between agents, combining interpersonal (one-to-one), mass (one-to-many), and complex, yet dynamically equal, communication (many-to-many).

The discussion on the role of social media and networks finds its roots in the emergence of the network society and the evolution of the Internet as a result of the convergence of the audiovisual, information technology, and telecommunications sectors. Contemporary society is characterized by what can be defined as convergence culture (Jenkins 2006), in which old and new media collide, where grassroots and corporate media intersect, and where the power of the media producer and the power of the media consumer interact in unpredictable ways.

The work of Manuel Castells (2000) on the network society is central, emphasizing that the dominant functions and processes in the Information Age are increasingly organized around networks. Networks constitute the new social morphology of our societies, and the diffusion of networking logic substantially modifies the operation and outcomes in processes of production, experience, power, and culture. Castells (2000) introduces the concept of "flows of information," underlining the crucial role of information flows in networks for economic and social organization.

In the development of the flows of information, the Internet holds the key role as a catalyst of a novel platform for public discourse and public communication. The Internet consists of both a technological infrastructure and (inter)acting humans, in a technological system and a social subsystem that both have a networked character. Together these parts form a techno-social system. The technological structure is a network that produces and reproduces human actions and social networks and is itself produced and reproduced by such practices.

The specification of the online platforms, such as Web 1.0, Web 2.0, or Web 3.0, marks distinctively the social dynamics that define the evolution of the Internet. Fuchs (2014) provides a comprehensive approach to the three "generations" of the Internet, founding them on the idea of knowledge as a threefold dynamic process of cognition, communication, and cooperation. The (analytical) distinction indicates that all Web 3.0 applications (cooperation) and processes also include aspects of communication and cognition and that all Web 2.0 applications (communication) also include cognition. The distinction is based on the insight that knowledge is a threefold process: all communication processes require cognition, but not all cognition processes result in communication, and all cooperation processes require communication and cognition, but not all cognition and communication processes result in cooperation.

In many definitions, the notions of collaboration and collective action are central, stressing that social media are tools that increase our ability to share, to cooperate with one another, and to take collective action, all outside the framework of traditional institutions and organizations. Social media enable users to create their own content and decide on the range of its dissemination through the various available and easily accessible platforms. Social media can serve as online facilitators or enhancers of human networks – webs of people that promote connectedness as a social value.
activities. Boyd and Ellison (2007) provide a robust and articulated definition of SNS, describing them as Web-based services that allow individuals to (1) construct a public or semipublic profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site. As the social media and user-generated content phenomena grew, websites focused on media sharing began implementing and integrating SNS features and becoming SNSs themselves.

The emancipatory power of social media is crucial to understanding the importance of networking, collaboration, and participation. These concepts, directly linked to social media, are key to understanding the real impact and dimensions of contemporary participatory media culture. According to Jenkins (2006), the term participatory culture contrasts with older notions of passive media consumption. Rather than talking about media producers and consumers occupying separate roles, we might now see them as participants who interact with each other and contribute actively and, prospectively, equally to social media production.

Participation is a key concept that addresses the main differences between the traditional (old) media and the social (new) media and focuses mainly on the empowerment of the audience/users of media toward a more active information and communication role. The changes transform the relation between the main actors in political communication, namely, political actors, journalists, and citizens. Social media and networks enable any user to participate in the mediation process by actively searching, sharing, and commenting on available content. The distributed, dynamic, and fluid structure of social media enables them to circumvent professional and political restrictions on news production and has given rise to new forms of journalism defined as citizen, alternative, or participatory journalism, but also to new forms of propaganda and misinformation.

The Emergence of Citizen Journalism

The rise of social media and networks has a direct impact on the types and values of journalism and the structures of the public sphere. The transformation of interactions between political actors, journalists, and citizens through the new technologies has created the conditions for the emergence of a form distinct from professional journalism, often called citizen, participatory, or alternative journalism. The terms used to identify the new journalistic practices on the Web range from interactive or online journalism to alternative journalism, participatory journalism, citizen journalism, or public journalism. The level and the form of the public's participation in the journalistic process determine whether it is a synergy between journalists and the public or an exclusively journalistic activity of the citizens.

However, the phenomenon of alternative journalism is not new. Already in the nineteenth century, the first forms of alternative journalism made their appearance with the development of the radical British press. The radical socialist press in the USA in the early twentieth century followed, as did the marginal and feminist press between 1960 and 1970. Fanzines and zines appeared in the 1970s and were succeeded by pirate radio stations. At the end of the twentieth century, however, the attention moved to new media and Web 2.0 technologies.

The evolution of social networks with the new paradigm shift is currently defining to a great extent the type, the impact, and the dynamics of action, reaction, and interaction of the participants involved in a social network. According to Atton (2003), alternative journalism is an ongoing effort to review and challenge the dominant approaches to journalism. The structure of this alternative journalistic practice appears as the counterbalance to traditional and conventional media production and disrupts its dominant forms, namely, the institutional dimension of mainstream media, the phenomena of capitalization and commercialization, and the growing concentration of ownership.

Citizen journalism is based on the assumption that the public space is in crisis (institutions,
politics, journalism, political parties). It appears as an effort to democratize journalism and thereby questions the added value of the objectivity that is upheld by professional journalism.

The debate on a counterweight to professional, conventional, mainstream journalism intensified around 1993, when signs of fatigue and the loss of the public's trust in journalism became visible and overlapped with the innovative potential of the new interactive technologies. The term public journalism appeared in the USA in 1993 as part of a movement that expressed concerns about the detachment of journalists and news organizations from citizens and communities, as well as of US citizens from public life. However, the term citizen journalism has been defined on various levels. If there is one core thing on which both its supporters and critics agree, it is that it means different things to different people.

The developments that Web 2.0 has introduced and the subsequent explosive growth of social media and networks mark the third phase of public journalism and its transformation into alternative journalism. The field of information and communication is transformed into a more participatory media ecosystem, which turns the news into social experiences. News is transformed into a participatory activity to which people contribute their own stories and experiences and their reactions to events.

Citizen journalism proposes a different model of the selection and use of sources and of news practices, and a redefinition of journalistic values. Atton (2003) traces the conflict with traditional, mainstream journalism in three key points: (a) power does not come exclusively from the official institutions and the professional category of journalists, (b) reliability and validity can derive from descriptions of lived experience and not only from objectively detached reporting, and (c) it is not mandatory to separate facts from subjective opinion. Although Atton (2003) does not consider lived experience an absolute value, he believes it can constitute the added value of alternative journalism, combined with the capability of recording it through documented reports.

The purpose of citizen journalism is to reverse the "hierarchy of access" as it was identified by the Glasgow University Media Group, giving voice to the ones marginalized by the mainstream media. While mainstream media rely extensively on elite groups, alternative media can offer a wider range of "voices" waiting to be heard. The practices of alternative journalism provide "first-hand" evidence, as well as collective and anti-hierarchical forms of organization and a participatory, radical approach to citizen journalism. This form of journalism is identified by Atton as native reporting.

To determine the moving boundary between news producers and the public, Bruns (2005) used the term produsers, combining the words and concepts of producers and users. These changes determine the way in which power relations in the media industry and journalism are changing, shifting the power from journalists to the public.

Social Movements

In the last few years, we have witnessed a growing heated debate among scholars, politicians, and journalists regarding the role of the Internet in contemporary social movements. Social media tools such as Facebook, Twitter, and YouTube, which facilitate and support user-generated content, have taken up a leading role in the development and coordination of a series of recent social movements, such as the student protests in Britain at the end of 2010 as well as the outbreak of revolution in the Arab world, the so-called Arab Spring.

The open and decentralized character of the Internet has inspired many scholars to envisage a rejuvenation of democracy, focusing on the (latent) democratic potentials of the new media as interactive platforms that can motivate and fulfill the active participation of citizens in the political process. On the other hand, Internet skeptics suggest that the Internet will not itself alter traditional politics. On the contrary, it can generate a very fragmented public sphere based on isolated private discussions, while the abundance of information, in combination with the
vast amounts of offered entertainment and the options for personal socializing, can lead people to refrain from public life. The Internet actually offers a new venue for information provision to the citizen-consumer. At the same time, it allows politicians to establish direct communication with the citizens, free from the norms and structural constraints of traditional journalism.

Social media aspire to create new opportunities for social movements. Web 2.0 platforms allow protestors to collaborate so that they can quickly organize and disseminate a message across the globe. By enabling the fast, easy, and low-cost diffusion of protest ideas, tactics, and strategies, social media and networks allow social movements to overcome problems historically associated with collective mobilization.

Over the last years, the center of attention has not been the Western societies, which used to be the technology-literate and information-rich part of the world, but the Middle Eastern ones. Especially after 2009, there is considerable evidence advocating in favor of the empowering, liberating, and yet engaging potentials of online social media and networks, as in the case of the protesters in Iran who actively used Web services like Facebook, Twitter, Flickr, and YouTube to organize, attract support, and share information about street protests after the June 2009 presidential elections. More recently, a revolutionary wave of demonstrations has swept the Arab countries as the so-called Arab Spring, using again the social media as means for raising awareness, communication, and organization, while facing at the same time strong Internet censorship. Though this neglects the complexity of these transformations, the uprisings were widely described as "the Facebook revolution," demonstrating the power of networks.

In the European continent, we have witnessed the recent development of the Indignant Citizens Movement, whose origin was largely attributed to the social movements that started in Spain and then spread to Portugal, the Netherlands, the UK, and Greece. In these cases, the digital social networks have proved powerful means to convey demands for a radical renewal of politics based on a stronger and more direct role of citizens and on a critique of the functioning of Western democratic systems.

See Also

▶ Digital Literacy
▶ Open Data
▶ Social Network Analysis
▶ Twitter

Further Reading

Atton, C. (2003). What is 'alternative' journalism? Journalism: Theory, Practice and Criticism, 4(3), 267–272.
Boyd, D. M., & Ellison, N. B. (2007). Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1), 210–230.
Bruns, A. (2005). Gatewatching: Collaborative online news production. New York: Peter Lang.
Castells, M. (2000). The rise of the network society. The information age: Economy, society and culture (Vol. I). Oxford: Blackwell.
Fuchs, C. (2014). Social media: A critical introduction. London: Sage.
Jenkins, H. (2006). Convergence culture: Where old and new media collide. New York: New York University Press.

Social Sciences

Ines Amaral
University of Minho, Minho, Portugal
Autonomous University of Lisbon, Lisbon, Portugal

Social Science is an academic discipline concerned with the study of humans through their relations with society and culture. Social Science disciplines analyze the origins, development, organization, and operation of human societies and cultures. Technological evolution has strengthened the Social Sciences since it enables empirical studies developed through quantitative means, allowing the scientific reinforcement of many theories about the behavior of man as a social actor. The rise of big data represents an opportunity for the Social Sciences to advance the understanding of human behavior using massive sets of data.

The issues related to the Social Sciences began to take on a scientific nature in the eighteenth century with the first studies on the actions of humans in society and their relationships with each other. It was by this time that Political Economy emerged. Most of the subjects belonging to the fields of the Social Sciences, such as Anthropology, Sociology, and Political Science, arose in the nineteenth century.

The Social Sciences can be divided into disciplines that are dedicated to the study of the evolution of societies (Archeology, History, Demography), social interaction (Political Economy, Sociology, Anthropology), or the cognitive system (Psychology, Linguistics). There are also applied Social Sciences (Law, Pedagogy) and other Social Sciences classified in the generic group of the Humanities (Political Science, Philosophy, Semiotics, Communication Sciences). The anthropologist Claude Lévi-Strauss, the philosopher and political scientist Antonio Gramsci, the philosopher Michel Foucault, the economist and philosopher Adam Smith, the economist John Maynard Keynes, the psychoanalyst Sigmund Freud, the sociologist Émile Durkheim, the political scientist and sociologist Max Weber, and the philosopher, sociologist, and economist Karl Marx are some of the leading social scientists of the last centuries.

The social scientist studies phenomena, structures, and relationships that characterize social and cultural organizations; analyzes movements and population conflicts, the construction of identities, and the formation of opinions; researches behaviors and habits and the relationship between individuals, families, groups, and institutions; and develops and uses a wide range of techniques and research methods to study human collectivities and understand the problems of society, politics, and culture.

The study of humans through their relations with society and culture has relied on "surface data" and "deep data." "Surface data" was used in the disciplines that adopted quantitative methods, like Economics. "Deep data" about individuals or small groups was used in disciplines that analyze society through qualitative methods, such as Sociology.

Data collection has always been a problem for social research because of its inherent subjectivity, as the Social Sciences have traditionally relied on small samples using methods and tools that gather information based on people. In fact, one of the critical issues of Social Science is the need to develop research methods that ensure the objectivity of the results. Moreover, the objects of study of the Social Sciences do not fit into the models and methods used by other sciences and do not allow the performance of experiments under controlled laboratory conditions. The quantification of information is possible because there are several techniques of analysis that transform ideas, social capital, relationships, and other variables from social systems into numerical data. However, the object of study always interacts with the culture of the social scientist, making it very difficult to achieve real impartiality.

Big data is not self-explanatory. Consequently, it requires new research paradigms across multiple disciplines, and for social scientists it is a major challenge as it enables interdisciplinary studies and the intersection between computer science, statistics, data visualization, and the social sciences. Furthermore, big data empowers the use of real-time data at the level of whole populations to test new hypotheses and study social phenomena on a larger scale. In the context of the modern Social Sciences, large datasets allow scientists to understand and study different social phenomena, from the interactions of individuals and the emergence of self-organized global movements to political decisions and the reactions of economic markets.

Nowadays, social scientists have more information on interaction and communication patterns than ever. Computational tools allow understanding of the meaning of what those patterns reveal. The models built about social systems within the analysis of large volumes of data must be coherent with theories of human actors and their behavior. The advantages of large datasets and of scaling up the size of data are that it becomes possible to make sense of the temporal and spatial dimensions. What makes big data so interesting to the Social Sciences is the possibility to reduce data, apply filters that help identify relevant patterns of information, aggregate sets in a way that helps identify temporal scales and spatial resolutions, and segregate streams and variables in order to analyze social systems.

As big data is dynamic, heterogeneous, and interrelated, social scientists are facing new challenges due to the existence of computational and statistical tools which allow extracting and analyzing large datasets of social information. Big data is being generated in multiple and interconnecting disciplinary fields. Within the social domain, data is being collected from transactions and interactions through multiple devices and digital networks. The analysis of large datasets is not within the field of a single scientific discipline or approach. In this regard, big data can change Social Science because it requires an intersection of sciences within different research traditions and a convergence of methodologies and techniques. The scale of the data and the methods required to analyze them need to be developed by combining expertise with scholars from other scientific disciplines. Within this collaboration with data scientists, social scientists must have an essential role in order to read the data and understand the social reality.

The era of big data implies that the Social Sciences rethink and update theories and theoretical questions such as the small world phenomenon, the complexity of urban life, relational life, social networks, the study of communication and public opinion formation, collective effervescence, and social influence. Although computerized databases are not new, the emergence of an era of big data is critical as it creates a radical paradigm shift in social research. Big data reframes key issues on the foundation of knowledge, the processes and techniques of research, the nature of information, and the classification of social reality.

The new forms of social data have interesting dimensions: volume, variety, velocity, exhaustive, indexical, relational, flexible, and scalable. Big data consists of relational information at large scale that can be created in or near real time with different structures, extensive in scope, capable of
identifying and indexing information distinctively, flexible, and able to expand in size quickly. The datasets can be created from personal data or nonpersonal data. Personal data can be defined as information relating to an identified person. This definition includes online user-generated content, online social data, online behavioral data, location data, sociodemographic data, and information from an official source (e.g., police records). All data collected that do not directly identify individuals are considered nonpersonal data. Personal data can be collected from different sources with three techniques: voluntary data that is created and shared online by individuals; observed data, which records the actions of the individual; and data inferred about individuals based on voluntary or observed information.

The disciplinary outlines of the Social Sciences in the age of big data are in constant readjustment because of the speed of change in the data landscape. Some authors have argued that the new data streams could reconfigure and constitute social relations and populations. Academic researchers attempt to handle the methodological challenges presented by the growth of big social data, and new scientific trends arise, despite the diversity of the philosophical foundations of the Social Science disciplines. Objectivity of the data does not result directly in their interpretation. The scientific method postulated by Durkheim attempts to remove itself from the subjective domain. Nevertheless, the author stated that objectivity is made by subjects and is based on subjective observations and selections of individuals.

A new empiricist epistemology emerged in the Social Sciences and goes against the deductive approach that is hegemonic within modern science. According to this new epistemology, big data can capture an entire social reality and provide its full understanding. Therefore, there is no need for theoretical models or hypotheses. This perspective assumes that patterns and relationships within big data are characteristically significant and accurate. Thus, the application of data analytics transcends the context of a single scientific discipline or a specific domain of knowledge and can be interpreted by anyone who can interpret statistics or data visualization.

Several scholars, who believe that the new empiricism operates as a discursive rhetorical device, criticize this approach. Kitchin argues that while data may be interpreted free of context and domain-specific expertise, such an epistemological interpretation is likely to be unconstructive as it fails to be embedded in broader debates.

As large datasets are highly distributed and present complex data, a new model of data-driven science is emerging within the Social Science disciplines. Data-driven science uses a hybrid combination of abductive, inductive, and deductive methods for the understanding of a phenomenon. This approach assumes theoretical frameworks and seeks to generate scientific hypotheses from the data by incorporating a mode of induction into the research design. Therefore, the epistemological strategy adopted within this model is to devise techniques to identify potential problems and questions, which may be worthy of further analysis, testing, and validation.

Although big data enhances the set of data available for analysis and enables new approaches and techniques, it does not replace traditional small data studies. Because big data cannot answer specific social questions, more targeted studies are required. Computational Social Science can be the interface between computer science and the traditional social sciences. This interdisciplinary and emerging scientific field uses computational methods to model social reality and analyze phenomena, as well as social structures and collective behavior. The main computational approaches from the Social Sciences to study big data are social network analysis, automated information extraction systems, social geographic information systems, complexity modeling, and social simulation models.
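As a minimal illustration of the first of these approaches, the short Python sketch below computes a basic social network analysis measure (degree centrality) with the networkx library on a toy friendship network; the names and ties are invented for illustration and are not drawn from any real dataset.

import networkx as nx

# Toy friendship network; nodes and ties are purely illustrative.
g = nx.Graph()
g.add_edges_from([
    ("Ana", "Bruno"), ("Ana", "Carla"), ("Bruno", "Carla"),
    ("Carla", "Diego"), ("Diego", "Eva"),
])

# Degree centrality: the share of other actors each actor is directly tied to.
for person, score in sorted(nx.degree_centrality(g).items(), key=lambda x: -x[1]):
    print(person, round(score, 2))

The same kind of measure can, in principle, be computed over much larger graphs extracted from social media or communication records, which is where the computational challenges discussed above arise.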
Computational Social Science is an intersection of Computer Science, Statistics, and the Social Sciences which uses large-scale demographic, behavioral, and network data to analyze individual activity, collective behaviors, and relationships. Computational Social Science can be the methodological approach for the Social Sciences' study of big data because of its use of mathematical methods to model social phenomena and its ability to handle large datasets.

The analysis of big volumes of data opens up new perspectives of research and makes it possible to answer questions that were previously incomprehensible. Though big data itself is relative, analyzing it within the theoretical tradition of the Social Sciences to build a context for the information will enable its understanding, and its intersection with smaller studies will help explain specific data variables.

Big data may have a transformational impact as it can transform policy making, by helping to improve communication and governance in several policy domains. Big social data also raise significant ethical issues for academic research and call for an urgent debate and a wider critical reflection on the epistemological implications of data analytics.

Cross-References

▶ Anthropology
▶ Communications
▶ Complex Networks
▶ Computational Social Sciences
▶ Computer Science
▶ Data Science
▶ Network Analytics
▶ Network Data
▶ Psychology
▶ Social Network Analysis (SNA)
▶ Sociology
▶ Visualization

Further Readings

Allison, P. D. (2002). Missing data: Quantitative applications in the social sciences. British Journal of Mathematical and Statistical Psychology, 55(1), 193–196.
Berg, B. L., & Lune, H. (2004). Qualitative research methods for the social sciences (Vol. 5). Boston: Pearson.
Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662–679.
Coleman, J. S. (1990). Foundations of social theory. Cambridge, MA: Belknap Press of Harvard University Press.
Floridi, L. (2012). Big data and their epistemological challenge. Philosophy & Technology, 25, 435–437.
González-Bailón, S. (2013). Social science in the era of big data. Policy & Internet, 5(2), 147–160.
Lohr, S. (2012). The age of big data. New York Times, 11.
Lynch, C. (2008). Big data: How do your data grow? Nature, 455(7209), 28–29.
Oboler, A., et al. (2012). The danger of big data: Social media as computational social science. First Monday, 17(7-2). Retrieved from http://firstmonday.org/ojs/index.php/fm/article/view/3993/3269

Spatial Data

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Geographic information; Geospatial data; Geospatial information

Introduction

Spatial property is an almost pervasive component in the big data environment because everything happening on the Earth happens somewhere. Spatial data can be grouped into raster or vector according to the methods used in their representation. Web-based services facilitate the publication and use of spatial data legacies, and crowdsourcing approaches enable people to be both contributors and users of spatial data. Semantic technologies further enable people to link and query the spatial data available on the Web, find patterns of interest, and use them to tackle scientific and business issues.

Raster and Vector Representations

Spatial data are representations of facts that contain positional values, and geospatial data are spatial data about facts happening on the surface of the Earth. Almost everything on the Earth has location properties, so geospatial data and spatial data are regarded as synonyms. Spatial data can be seen almost everywhere in the big data deluge, such as social media data streams, traffic control, environmental sensor monitoring, and supply chain management. Accordingly, there are various applications of spatial data in the actual world. For example, one may find a preferred restaurant based on the grading results on Twitter. A driver may adjust his route based on real-time local traffic information. An engineer may identify the best locations for new buildings in an area with regular earthquakes. A forest manager may optimize timber production using data on soil and tree species distribution while considering a few constraints such as biodiversity requirements and market prices.

Spatial data can be divided into two groups: raster representations and vector representations. A raster representation can be regarded as a group of mutually exclusive cells which form the representation of a partition of space. There are two types of raster representations: regular and irregular. The former has cells with the same shape and size, and the latter has cells of varying shape and size. Raster representations do not store coordinate pairs. In contrast, vector representations use coordinate pairs to explicitly describe a geographic phenomenon. There are several types of vector representations, such as points, lines, areas, and triangulated irregular networks. A point is a single coordinate pair in a two-dimensional space or a coordinate triplet in a three-dimensional space. A line is defined by two end points and zero or more internal points that define its shape. An area is a partition of space defined by a boundary (Huisman and de By 2009).

Raster representations have simple but less compact data structures. They enable simple implementation of overlays but pose difficulties for the representation of interrelations among geographic phenomena, as the cell boundaries are independent of feature boundaries. However, raster representations are efficient for image processing. In contrast, vector representations have complex data structures but are efficient for representing spatial interrelations. Vector representations work well under scale changes but make overlays harder to implement. They also allow the representation of networks and enable easy association with attribute data.
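The contrast between the two representations can be sketched with plain Python structures; the coordinate values, cell size, and attributes below are illustrative assumptions only, not values taken from any particular dataset.

# Raster representation: a regular grid of mutually exclusive cells.
# Each cell stores a value (here, elevation in meters); no coordinate
# pairs are stored, only the grid origin and the cell size.
elevation_raster = {
    "origin": (-116.0, 46.7),   # upper-left corner (lon, lat)
    "cell_size": 0.01,          # degrees per cell
    "values": [
        [812, 815, 819],
        [810, 813, 818],
        [808, 811, 816],
    ],
}

# Vector representation: coordinate pairs explicitly describe features,
# and attributes can be attached to each feature.
well_point = {"type": "Point", "coordinates": (-116.99, 46.73),
              "attributes": {"name": "well-1", "depth_m": 120}}
road_line = {"type": "LineString",
             "coordinates": [(-117.0, 46.7), (-116.98, 46.72), (-116.95, 46.73)],
             "attributes": {"surface": "paved"}}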
The collection, processing, and output of spatial data often involve a number of platforms and systems, among which the most well-known are the geographic information system, remote sensing, and the global positioning system. A geographic information system is a computerized system that facilitates the phases of data collection, data processing, and data output, especially for spatial data. Remote sensing is the use of satellites to capture information about the surface and atmosphere of the Earth. Remote sensing data are normally stored in raster representations. The global positioning system is a space-based satellite navigation system that provides direct measurement of position and time on the surface of the Earth. Remote sensing images and global positioning system signals can be regarded as primary data sources for the geographic information system.

Spatial Data Service

Various proprietary and public formats for raster and vector representations have been introduced since computers were first used for spatial data collection, analysis, and presentation. Plenty of remote sensing images, digital maps, and sensor data form a massive spatial data legacy. On the one hand, they greatly facilitate the progress of using spatial data to tackle scientific and social issues. On the other hand, the heterogeneities caused by the numerous data formats, conceptual models, and software platforms bring huge challenges for data integration and reuse from multiple sources. The Open Geospatial Consortium (OGC) (2016) was formed in 1994 to promote a worldwide consensus process for developing publicly available interface standards for spatial data. By early 2015, the consortium consisted of more than 500 members from industry, government agencies, and academia. Standards developed by the OGC have been implemented to promote interoperability in spatial data collection, sharing, service, and processing. Well-known standards include the Geography Markup Language, Keyhole Markup Language, Web Map Service, Web Feature Service, Web Processing Service, Catalog Service for the Web, and Observations and Measurements.

Community efforts such as the OGC service standards offer a solution for publishing the multisource, heterogeneous spatial data legacy on the Web. A number of best practices have emerged in recent years. OneGeology is an international initiative among the geological surveys across the world. It was launched in 2007, and by early 2015 it had 119 participating member nations. Most members in OneGeology share national and/or regional geological maps through the OGC service standards, such as the Web Map Service and Web Feature Service. The OneGeology Portal provides a central node for the various distributed data services. The Portal is open and easy to use. Anyone with an Internet browser can view the maps registered on the portal. People can also use the maps in their own applications, as many software programs now provide interfaces to access the spatial data services. Another, more comprehensive project is the GEO Portal of the Global Earth Observation System of Systems, which is coordinated by the Group on Earth Observations. It acts as a central portal and clearinghouse providing access to spatial data in support of the whole system. The portal provides a registry for both data services and the standards used in data services. It allows users to discover, browse, edit, create, and save spatial data from members of the Group on Earth Observations across the world.
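Requests to a Web Map Service such as those shared through OneGeology follow a common pattern defined by the OGC standard. The short Python sketch below assembles a GetMap request; the endpoint URL and layer name are hypothetical assumptions, and a real client would take them from the service's GetCapabilities response.

from urllib.parse import urlencode

# Hypothetical WMS endpoint and layer name.
base_url = "https://example.org/geoserver/wms"
params = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "geology:bedrock",
    "STYLES": "",
    "CRS": "EPSG:4326",
    # WMS 1.3.0 with EPSG:4326 uses latitude/longitude axis order:
    # minLat, minLon, maxLat, maxLon.
    "BBOX": "45.0,-120.0,49.0,-110.0",
    "WIDTH": "800",
    "HEIGHT": "400",
    "FORMAT": "image/png",
}
print(base_url + "?" + urlencode(params))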
Another popular spatial data service is the virtual globe, which provides a three-dimensional representation of the Earth or another world. It allows users to navigate in a virtual environment by changing the position, viewing angle, and scale. A virtual globe has the capability to represent various different views of the surface of the Earth by adding spatial data as layers on the surface of a three-dimensional globe. Well-known virtual globes include Google Earth, NASA World Wind, and ESRI ArcGlobe. Besides spatial data browsing, most virtual globe programs also enable interactions with users. For example, Google Earth can be extended with many add-ons encoded in the Keyhole Markup Language, such as geological map layers exported from OneGeology.

Open-Source Approaches

There are already widely used free and open-source software programs serving different purposes in spatial data handling (Steiniger and Hunter 2013). Those programs can be grouped into a number of categories:

(1) Standalone desktop geographic information systems such as GRASS GIS, QGIS, and ILWIS
(2) Mobile and light geographic information systems such as gvSIG Mobile, QGIS for Android, and tangoGPS
(3) Libraries with capabilities for spatial data processing, such as GeoScript, CGAL, and GDAL
(4) Data analysis and visualization tools such as GeoVISTA Studio, R, and PySAL
(5) Spatial database management systems such as PostgreSQL, Ingres Geospatial, and JASPA
(6) Web-based spatial data publication and processing servers such as GeoServer, MapServer, and 52n WPS
(7) Web-based spatial data service development frameworks such as OpenLayers, GeoTools, and Leaflet

An international organization, the Open Source Geospatial Foundation, was formed in 2006 to support the collaborative development of open-source geospatial software programs and promote their widespread use.

Companies such as Google, Microsoft, and Yahoo! already provide free map services. One can browse maps on the service website, but the spatial data behind the service are not open. In contrast, the free and open-source spatial data approach requires not only freely available datasets but also details about the data, such as the format, conceptual structure, and vocabularies used. A well-known open-source spatial data project is OpenStreetMap, which aims at creating a free, editable map of the world. The project was launched in 2004. It adopts a crowdsourcing approach, that is, it solicits contributions from a large community of people. By the middle of 2014, the OpenStreetMap project had more than 1.6 million contributors. Compared with the maps, the data generated by OpenStreetMap are considered the primary output. Due to the crowdsourcing approach, data quality currently varies across different regions. Besides OpenStreetMap, there are numerous similar open-source and collaborative spatial data projects addressing the needs of different communities, such as GeoNames for geographical names and features, OpenSeaMap for a worldwide nautical chart, and the eBird project for real-time data about bird distribution and abundance.

Open-source spatial data formats have also received increasing attention in recent years, especially Web-based formats. A typical example is GeoJSON, which enables the encoding of simple geospatial features and their attributes using JavaScript Object Notation (JSON). GeoJSON is now supported by various spatial data software packages and libraries, such as OpenLayers, GeoServer, and MapServer. The map services of Google, Yahoo!, and Microsoft also support GeoJSON in their application programming interfaces.
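A minimal GeoJSON document, written here as a Python dictionary and serialized with the standard json module, looks like the following; the coordinates and properties are illustrative values only.

import json

# A minimal GeoJSON Feature wrapped in a FeatureCollection.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-116.99, 46.73]},  # [lon, lat]
    "properties": {"name": "Moscow, Idaho", "category": "city"},
}
feature_collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(feature_collection, indent=2))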
Spatial Intelligence

The Semantic Web brings innovative ideas to the geospatial community. The Semantic Web is a web of data, as compared to the traditional web of documents. A solid enablement of the Semantic Web is Linked Data, a group of methodologies and technologies for publishing structured data on the Web so that they can be annotated, interlinked, and queried to generate useful information. The Web-based capabilities of linking and querying are specific features of Linked Data, which help people to find patterns in data and use them in scientific or business activities. To make full use of Linked Data, the geospatial community is developing standards and technologies to (1) transform spatial data into Semantic Web compatible formats such as the Resource Description Framework (RDF), (2) organize and publish the transformed data using triple stores, and (3) explore patterns in the data using new query languages such as GeoSPARQL.

RDF uses a simple triple structure of subject, predicate, and object. The structure is robust enough to support linked spatial data consisting of billions of triples. Building on the basis of RDF, there are a number of specific schemas for representing locations and spatial relationships in triples, such as GeoSPARQL. Triple stores offer functionalities to manage spatial data RDF triples and query them, very similar to what traditional relational databases are capable of. As mentioned above, spatial data have two major sources: the conventional data legacy and crowdsourcing data. While the technologies for transforming both of them into triples are maturing, crowdsourcing data provide a more flexible mechanism for the Linked Data approach and for data exploration, as they are fully open. For example, work has already been done to transform data from OpenStreetMap and GeoNames into RDF triples. For pattern exploration, there are already initial results, such as those in the GeoKnow project (Athanasiou et al. 2014). The project built a prototype called the GeoKnow Generator, which provides functions to link, enrich, query, and visualize RDF triples of spatial data and to build lightweight applications addressing specific requests in the actual world.
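A sketch of what such a query can look like is given below, written as a Python string; the prefixes are those defined by the OGC GeoSPARQL vocabulary, while the polygon coordinates and the SPARQL endpoint are illustrative assumptions rather than part of any published dataset.

# A sketch of a GeoSPARQL query selecting places whose geometry lies
# within an illustrative bounding polygon (WKT coordinates are lon lat).
query = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?place ?wkt
WHERE {
  ?place geo:hasGeometry ?geom .
  ?geom  geo:asWKT ?wkt .
  FILTER(geof:sfWithin(?wkt,
    "POLYGON((-117.2 46.5, -116.7 46.5, -116.7 46.9, -117.2 46.9, -117.2 46.5))"^^geo:wktLiteral))
}
"""
# The query string would then be sent to a triple store's SPARQL endpoint,
# for example: requests.post("https://example.org/sparql", data={"query": query})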
Linked spatial data are still far from mature. More efforts are needed on the annotation and accreditation of shared spatial RDF data, on their integration and fusion, on efficient RDF querying in a big data environment, and on innovative ways to visualize and present the results.

Cross-References

▶ Geography
▶ Location Data
▶ Spatial Analytics
▶ Spatio-Temporal Analytics

References

Athanasiou, S., Hladky, D., Giannopoulos, G., Rojas, A. G., & Lehmann, J. (2014). GeoKnow: Making the web an exploratory place for geospatial knowledge. ERCIM News, 96. http://ercim-news.ercim.eu/en96/special/geoknow-making-the-web-an-exploratory-place-for-geospatial-knowledge. Accessed 29 Apr 2016.
Huisman, O., & de By, R. A. (Eds.). (2009). Principles of geographic information systems. Enschede: ITC Educational Textbook Series.
Open Geospatial Consortium (2016). About OGC. http://www.opengeospatial.org/ogc. Accessed 29 Apr 2016.
Steiniger, S., & Hunter, A. J. S. (2013). The 2012 free and open source GIS software map: A guide to facilitate research, development, and adoption. Computers, Environment and Urban Systems, 39, 136–150.
them into triples, the crowdsourcing data provide
T

Transparency

Anne L. Washington
George Mason University, Fairfax, VA, USA

Transparency is a policy mechanism that encourages organizations to disclose information to the public. Scholars of big data and transparency recognize the inherent power of information and share a common intellectual history. Government and corporate transparency, which is often implemented by releasing open data, increases the amount of material available for big data projects. Furthermore, big data has its own need for transparency, as data-driven algorithms support essential decisions in society with little disclosure about operations and procedures. Critics question whether information can be used as a control mechanism in an industry that functions as a distributed network.

Definition

Transparency is defined as a property of glass or any object that lets in light. As a governance mechanism, transparency discloses the inner mechanisms of an organization. Organizations implement or are mandated to abide by transparency policies that encourage the release of information about how they operate. Hood and Heald (2006) distinguish several directions of transparency. Upward and downward transparency refer to disclosure within an organization. Supervisors observing subordinates is upward transparency, while subordinates observing the hierarchy above is downward transparency. Inward and outward transparency refer to disclosure beyond organizational boundaries. An organization aware of its environment is outward transparency, while citizen awareness of government activity is inward transparency. Transparency policies encourage the visibility of operating status and standard procedures.

First, transparency may compel information on operating status. When activities may impact others, organizations disclose what they are doing in frequent updates. For example, the US government required regular reports from stock exchanges and other financial markets after the stock market crash in 1929. Operating status information gives any external interest an ability to evaluate the current state of the organization and supports auditing.

Second, transparency efforts may distribute standard procedures in order to enforce ideal behaviors. This type of transparency holds people entrusted with the public trust accountable. For example, cities release open data with transportation schedules and actual arrival times. The planned information is compared to the actual information to evaluate behaviors and resource distribution. Procedural transparency assumes that organizations can and should operate predictably.
Disclosures allow comparison and review. Detailed activity disclosure of operations answers questions of who, what, when, and where. Conversely, disclosures can also answer questions about influential people or wasteful projects. Disclosure may emphasize predictive trends and retrospective measurement, while other disclosures may emphasize narrative interpretation and explanation.

Implementation

Transparency is implemented by disclosing timely information to meet specific needs. This assumes that stakeholders will discover the disclosed information, comprehend its importance, and subsequently use it to change behavior. Organizations, including corporations and governments, often implement transparency using technology, which creates digital material used in big data.

Corporations release information about how their actions impact communities. The goal of corporate transparency is to improve services, share financial information, reduce harm to the public, or reduce reputation risks. The veracity of corporate disclosures has been debated by management science scholars (Bennis et al. 2008). On the one hand, mandatory corporate reporting fails if the information provided does not solve the target issue (Fung et al. 2007). On the other hand, organizations that are transparent to employees, management, stockholders, regulators, and the public may have a competitive advantage. In any case, there are real limits to what corporations can disclose and still remain both domestically and internationally competitive.

Governments release information as a form of accountability. From the creation of the postal code system to social security numbers, governments have inadvertently provided core categories for big data analytics (Washington 2014). Starting in the mid-twentieth century, legislatures around the world began to write freedom of information laws that supported the release of government materials on request. Subsequently, electronic government projects developed technology capabilities in public sector organizations.

Advances in computing have increased the use of big data techniques to automatically review transparency disclosures. Transparency can be implemented without technology, but often the two are intrinsically linked. One impact technology has on transparency is that information now comes in multiple forms. Disclosure before technology was the static production of documents and regularly scheduled reports that could be released on paper by request. Disclosure with technology is the dynamic streaming of real-time data available through machine-readable search and discovery. Transparency is often implemented by releasing digital material as open data that can be reused with few limitations. Open data transparency initiatives disclose information in formats that can be used with big data methods.

Intellectual History

Transparency has its origins in economic and philosophical ideas about disclosing the activities of those in authority. In Europe, the intellectual history spans from Aristotle in fourth-century BCE Greece to Immanuel Kant in eighteenth-century Prussia. Debates on big data can be positioned within these conversations about the dynamics of information and power. An underlying assumption of transparency is that there are hidden and visible power relationships in the exchange of information. Transparency is often an antidote to situations where information is used as power to control others.

Michel Foucault, the twentieth-century French philosopher, considered how rulers used statistics to control populations in his lecture on Governmentality. Foucault engaged with Jeremy Bentham's eighteenth-century descriptions of the ideal prison and the ideal government, both of which require full visibility. This philosophical position argues that complete surveillance will result in complete cooperation. While some research suggests that people will continue bad behavior under scrutiny, transparency is still seen as a method of enforcing good behavior.

Big data extends concerns about the balance of authority, power, and information. Those who collect, store, and aggregate big data have more control than those generating data. These conceptual foundations are useful in considering both the positive and negative aspects of big data.

Big Data Transparency

Big data transparency discloses the transfer and transformation of data across networks. Big data transparency brings visibility to the embedded power dynamic in predicting human behavior. Analysis of digital material can be done without explicit acknowledgment or agreement. Furthermore, the industry that exchanges consumer data is easily obscured because transactions are all virtual. While a person may willingly agree to free services from a platform, it is not clear if users know who owns, sees, collects, or uses their data. The transparency of big data is described from three perspectives: sources, organizations, and the industry.

Transparency of sources discloses information about the digital material used in big data. Disclosure of sources explains which data generated on which platforms were used in which analysis. The flip side of this disclosure is that those who create user-generated content would be able to trace their digital footprint. User-generated content creators could detect and report errors and also be aware of their overall data profile. Academic big data research on social media was initially questioned because of opaque sources from private companies. Source disclosure increases confidence in data quality and reliability.

Transparency of platforms considers organizations that provide services that create user-generated content. Transparency within the organization allows for internal monitoring. While part of normal business operations, someone with command and control is able to view personally identifiable information about the activities of others. The car ride service Uber was fined in 2014 because employees used the internal customer tracking system inappropriately. Some view this as a form of corporate surveillance because it includes monitoring customers and employees.

Transparency of the analytics industry discloses how the big data market functions. Industry transparency of operations might establish technical standards or policies for all participating organizations. The World Wide Web Consortium's data provenance standard provides a technical solution by automatically tracing where data originated. Multi-stakeholder groups, such as those for Internet governance, are a possible tool to establish self-governing policy solutions. The intent is to heighten awareness of the data supply chain from upstream content quality to downstream big data production. Industry transparency of procedure might disclose algorithms and research designs that are used in data-driven decisions.

Big data transparency makes it possible to compare data-driven decisions to other methods. It faces particular challenges because its production process is distributed across a network of individuals and organizations. The process flows from an initial data capture to secondary uses and finally into large-scale analytic projects. Transparency is often associated with fighting potential corruption or attempts to gain unethical power. Given the influence of big data in many aspects of society, the same ideas apply to the transparency of big data.

Criticism

A frequent criticism of transparency is that its unintended consequences may thwart the anticipated goals. In some cases, the trend toward visibility is reversed as those under scrutiny stop creating findable traces and turn to informal mechanisms of communication.

It is important to note that a transparency label may be used to legitimize authority without any substantive information exchange. Large amounts of information released under the name of transparency may not, in practice, provide the intended result. Helen Margetts (1999) questions whether unfiltered data dumps obscure more than they reveal. Real-time transparency may lack meaningful engagement because it requires intermediary interpretation. This complaint has been lodged at open data transparency initiatives that did not release crucial information.

Implementation of big data transparency is constrained by complex technical and business issues. Algorithms and other technologies are layered together, each with its own embedded assumptions. Business agreements about the exchange of data may be private, and release may impact market competition. Scholars question how to analyze and communicate what drives big data, given these complexities.

Other critics question whether what is learned through disclosure is looped back into the system for reform or learning. Information disclosed for transparency may not be channeled to the right places or people. Without any feedback mechanism, transparency can be a failure because it does not drive change. Ideally, either organizations improve performance or individuals make new consumer choices.

Summary

Transparency is a governance mechanism for disclosing activities and decisions that profoundly enhances confidence in big data. It builds on existing corporate and government transparency efforts to monitor the visibility of operations and procedures. Transparency scholarship builds on earlier research that examines the relationship between power and information. Transparency of big data evaluates the risks and opportunities of aggregating sources for large-scale analytics.

Cross-References

▶ Algorithmic Accountability
▶ Business Process
▶ Data Governance
▶ Economics
▶ Enterprise Data
▶ Privacy
▶ Standardization

Further Readings

Bennis, W. G., Goleman, D., & O'Toole, J. (2008). Transparency: How leaders create a culture of candor. San Francisco: Jossey-Bass.
Fung, A., Graham, M., & Weil, D. (2007). Full disclosure: The perils and promise of transparency. New York: Cambridge University Press.
Hood, C., & Heald, D. (Eds.). (2006). Transparency: The key to better governance? Oxford/New York: Oxford University Press.
Margetts, H. (1999). Information technology in government: Britain and America. London: Routledge.
Washington, A. L. (2014). Government information policy in the era of big data. Review of Policy Research, 31(4). doi:10.1111/ropr.12081.
U

United Nations Educational, Scientific and Cultural Organization (UNESCO)

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

United Nations Educational, Scientific and Cultural Organization (UNESCO), founded in 1945, is an agency of the United Nations (UN) which specializes in education, natural sciences, social and human sciences, culture, and communications and information. With 195 members, 9 associate members, and 50 field offices, working with over 300 international NGOs, UNESCO carries out activities in all of these areas, with the post-2015 development agenda underpinning its overall agenda.

As the only UN agency with a mandate to address all aspects of education, it proffers that education is at the heart of development, with a belief that education is fundamental to human, social, and economic development. It coordinates the "Education for All" movement, a global commitment to provide quality basic education for all children, youth, and adults, monitoring trends in education and, where possible, making attempts to raise the profile of education on the global development agenda. For the natural sciences, UNESCO acts as an advocate for science as it focuses on encouraging international cooperation in science as well as promoting dialogue between scientists and policy-makers. In doing so, it acts as a platform for dissemination of ideas in science and encourages efforts on crosscutting themes including disaster risk reduction, biodiversity, engineering, science education, climate change, and sustainable development. Within the social and human sciences, UNESCO plays a large role in promoting heritage as a source of identity and cohesion for communities. It actively contributes by developing cultural conventions that provide mechanisms for international cooperation. These international agreements are designed to safeguard natural and cultural heritage across the globe, for example, through designation as UNESCO World Heritage sites. The development of communication and the sharing of information is embedded in all its activities.

UNESCO has five key objectives: to attain quality education for all and lifelong learning; mobilize science knowledge and policy for sustainable development; address emerging social and ethical challenges; foster cultural diversity, intercultural dialogue, and a culture of peace; and build inclusive knowledge societies through information and communication. Like other UN agencies, UNESCO has been involved in debates about the data revolution for development and the role that big data can play.

The data revolution for sustainable development is an international initiative designed to improve the quality of data and information that is generated and made available.
It recognizes that societies need to take advantage of new technologies and crowd-sourced data and improve digital connectivity in order to empower citizens with information that can contribute towards progress towards wider development goals. While there are many data sets available about the state of global education, it is argued that better data could be generated, even around basic measures such as the number of schools. In fact, rather than focus on "big data," which has captured the attention of many leaders and policy-makers, more efforts should focus on "little data," i.e., data that is both useful and relevant to particular communities. Discussions are now shifting to identify which indicators and data should be prioritized.

The UNESCO Institute for Statistics is the organization's own statistical arm; however, much of the data collection and analysis that takes place there relies on much more conventional management and information systems, which in turn rely on national statistical agencies that in many developing countries are often unreliable or heavily focused on administrative data (UNESCO 2012). This means that the data used by UNESCO is often out of date or not detailed enough. While digital technologies have become widely used in many societies, more potential sources of data are generated (Pentland 2013). For example, mobile phones are now used as banking devices as well as for standard communications. Official statistics organizations in many countries and international organizations are still behind in that they have not developed ways to adapt and make use of this data alongside the standard administrative data already collected.

There are a number of innovative initiatives to make better use of survey data and mobile phone-based applications to collect data more efficiently and provide more timely feedback to schools, communities, and ministries on target areas such as enrolment, attendance, and learning achievement. UNESCO could make a significant contribution to a data revolution in education by investing resources in collecting these innovations and making them more widely available to countries.

Access to big data for development, as with all big data sources, presents a number of ethical considerations based around the ownership of data and privacy. This is an area the UN recognizes that policy-makers will need to address to ensure that data will be used safely to address their objectives while still protecting the rights of the people whom the data is about or generated from. Furthermore, there are a number of critiques of big data which make more widespread use of big data for UNESCO problematic: first, claims that big data are objective and accurate representations are misleading; not all data produced can be used comparably; there are important ethical considerations necessary about the use of big data; and limited access to big data is exacerbating existing digital divides.

The Scientific Advisory Board of the Secretary-General of the United Nations, which is hosted by UNESCO, provided comments on the report on the data revolution in sustainable development. It highlighted concerns over equity and access to data, noting that the data revolution should lead to equity in access and use of data for all. Furthermore, it suggested that a number of global priorities should be included in any agenda related to the data revolution: countries should seek to avoid contributing to a data divide between rich and poor countries; there should be some form of harmonization and standardization of data platforms to increase accessibility internationally; there should be national and regional capacity building efforts; and there should be a series of training institutes and training programs in order to develop skills and innovation in areas related to data generation and analysis (Manyika et al. 2011). A key point made here is that the quality and integrity of the data generated need to be addressed, as it is recognized that big data often plays an important role in political and economic decision-making. Therefore a series of standards and methods for analysis and evaluation of data quality should be developed.
In the journal Nature, Hubert Gijzen of the UNESCO Regional Science Bureau for Asia and the Pacific calls for more big data to help secure a sustainable future (Gijzen 2013). He argues that more data should be collected which can be used to model different scenarios for sustainable societies concerning a range of issues, from energy consumption and improving water conditions to poverty eradication. Big data, according to Gijzen, has the potential, if coordinated globally between countries, regions, and relevant institutions, to have a big impact on the way societies address some of these global challenges. The United Nations has begun to take action to do this through the creation of the Global Pulse initiative, bringing together experts from the government, academic, and private sectors to consider new ways to use big data to support development agendas. Global Pulse is a network of innovation labs which conduct research on Big Data for Development via collaborations between the government, academic, and private sectors. The initiative is designed especially to make use of the digital data flood that has developed in order to address the development agendas that are at the heart of UNESCO, and the UN more broadly.

The UN Secretary-General's Independent Expert Advisory Group on the Data Revolution for Sustainable Development produced the report "A World That Counts" in November 2014, which suggested a number of key principles that should be sought with regard to the use of data: data quality and integrity, to ensure clear standards for the use of data; data disaggregation, to provide a basis for comparison; data timeliness, to encourage a flow of high-quality data for use in evidence-based policy-making; data transparency, to encourage systems which allow data to be made freely available; data usability, to ensure data can be made user-friendly; data protection and privacy, to establish international and national policies and legal frameworks for regulating data generation and use; data governance and independence; data resources and capacity, to ensure all countries have effective national statistical agencies; and finally data rights, to ensure human rights remain a core part of any legal or regulatory mechanisms that are developed with respect to big data (United Nations 2014). These principles are likely to influence UNESCO's engagement with big data in the future.

UNESCO, and the UN more broadly, acknowledge that technology has been, and will continue to be, a driver of the data revolution and a wider variety of data sources. For big data that is derived from this technology to have an impact, these data sources need to be leveraged in order to develop a greater understanding of the issues related to the development agenda.

Cross-References

▶ History
▶ International Development
▶ United Nations Global Pulse
▶ United Nations
▶ World Bank

Further Readings

Gijzen, H. (2013). Development: Big data for a sustainable future. Nature, 52, 38.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. New York. http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation. Accessed 12 Nov 2014.
Pentland, A. (2013). The data driven society. Scientific American, 309, 78–83.
UNESCO (2012). Learning analytics. UNESCO Institute for Information Technologies Policy Brief. Available from http://iite.unesco.org/pics/publications/en/files/3214711.pdf. Accessed 11 Nov 2014.
United Nations (2014). A world that counts. United Nations. Available from http://www.unglobalpulse.org/IEAG-Data-Revolution-Report-A-World-That-Counts. Accessed 28 Nov 2014.
Visualization

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA

Synonyms

Data visualization; Information visualization; Visual representation

Introduction

People use visualization for information communication. Data visualization is the study of creating visual representations of data, which bears two levels of meaning: the first is to make information visible and the second is to make it obvious and easy to understand. Visualization is pervasive throughout the data life cycle, and a recent trend is to promote the use of visualization in data analysis rather than use it only as a way to present the result. Community standards and open source libraries set the foundation for visualization of Big Data, and domain expertise and creative ideas are needed to put standards into innovative applications.

Visualization and Data Visualization

Visualization, in its literal meaning, is the procedure to form a mental picture of something that is not present to the sight (Cohen et al. 2002). People can also illustrate such mental pictures by using various visible media such as paper and computer screens. Seen as a way to facilitate information communication, the meaning of visualization can be understood at two levels. The first level is to make something visible, and the second level is to make it obvious so it is easy to understand (Tufte 1983). People's daily experience shows that graphics are easier to read and understand than words and numbers, such as the use of maps in automotive navigation systems to show the location of an automobile and the road to the destination. This daily experience is supported by scientific findings. Studies on visual object perception explain such differentiation in reading graphics and texts/numbers: the human brain deciphers image elements simultaneously and decodes language in a linear and sequential manner, where the linear process takes more time than the simultaneous process.

Data are representations of facts, and information is the meaning worked out from data. In the context of Big Data, visualization is a crucial method to tackle the considerable needs of extracting information from data and presenting it. Data visualization is the study of creating visual representations of data.
In practice, data visualization means to visually display one or more objects by the combined use of words, numbers, symbols, points, lines, color, shading, coordinate systems, and more. While there are various choices of visual representations for the same piece of data, there are a few general guidelines that can be applied to establish effective and efficient data visualization. The first is to avoid distorting what the data have to say. That is, the visualization should not give a false or misleading account of the data. The second is to know the audience and serve a clear purpose. For instance, the visualization can be a description of the data, a tabulation of the records, or an exploration of the information that is of interest to the audience. The third is to make large datasets coherent. A few artistic designs will be required to present the data and information in an orderly and consistent way. The presidential, Senate, and House elections of the United States have been reported with well-presented data visualization, such as those on the website of The New York Times. The visualization on that website is underpinned by dynamic datasets and can show the latest records simultaneously.

Visualization in the Data Life Cycle

Visualization is crucial in the process from data to information. However, information retrieval is just one of the many steps in the data life cycle, and visualization is useful through the whole data life cycle. In conventional understanding, a data life cycle begins with data collection and continues with cleansing, processing, archiving, and distribution. Those are from the perspective of data providers. Then, from the perspective of data users, the data life cycle continues with data discovery, access, analysis, and then repurposing. From repurposing, the life cycle may go back to the collection or processing step, restarting the cycle. Recent studies show that there is another step, called concept, before the step of data collection. The concept step covers work such as conceptual models, logical models, and physical models for relational databases, and ontologies and vocabularies for Linked Data in the Semantic Web.

Visualization, or more specifically data visualization, provides support to different steps in the data life cycle. For example, the Unified Modeling Language (UML) provides a standard way to visualize the design of information systems, including the conceptual and logical models of databases. Typical relationships in UML include association, aggregation, and composition at the instance level, generalization and realization at the class level, and general relationships such as dependency and multiplicity. For ontologies and vocabularies in the Semantic Web, concept maps are widely used for organizing concepts in a subject domain and the interrelationships among those concepts. In this way a concept map is the visual representation of a knowledge base. Concept maps are more flexible than UML because they cover all the relationships defined in UML and allow people to create new relationships that apply to the domain under study (Ma et al. 2014). For example, there are concept maps for the ontology of the Global Change Information System led by the US Global Change Research Program. The concept maps are able to show that a report is a subclass of a publication and that there are several components in a report, such as chapter, table, figure, array, and image. Recent work in information technologies also enables online visualized tools to capture and explore concepts underlying collaborative science activities, which greatly facilitates the collaboration between domain experts and computer scientists.
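The kind of subclass and component relationships just described can be written down compactly as subject-predicate-object statements. The short Python sketch below is only an illustration of that idea; the terms echo the report/publication example above and are not taken from the actual Global Change Information System ontology.

```python
# Minimal sketch: a concept map stored as subject-predicate-object triples.
# The terms are illustrative only, not the real GCIS vocabulary.
concept_map = [
    ("Report", "subclass_of", "Publication"),
    ("Report", "has_component", "Chapter"),
    ("Report", "has_component", "Table"),
    ("Report", "has_component", "Figure"),
    ("Report", "has_component", "Image"),
]

def related(subject, predicate, triples):
    """Return all objects linked to a subject by a given relationship."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(related("Report", "has_component", concept_map))
# -> ['Chapter', 'Table', 'Figure', 'Image']
```

A concept-map editor essentially draws such triples as labeled nodes connected by labeled arrows.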
Visualization is also used to facilitate data archiving, distribution, and discovery. For instance, the Tetherless World Constellation at Rensselaer Polytechnic Institute recently developed the International Open Government Dataset Catalog, which is a Web-based faceted browsing and search interface to help users find datasets of interest. A facet represents a part of the properties of a dataset, so faceted classification allows the assignment of a dataset to multiple taxonomies, and datasets can then be classified and ordered in different ways. On the user interface of a data center, the faceted classification can be visualized as a number of small windows and options, which allows the data center to hide the complexity of data classification, archiving, and search on the server side.
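As a rough sketch of the faceted browsing idea, the server-side filtering can be as simple as matching metadata fields against the facet values a user has selected. The dataset records and facet names below are invented for illustration and do not come from the International Open Government Dataset Catalog.

```python
# Minimal sketch of faceted classification over dataset metadata.
# Records and facet names are hypothetical examples.
datasets = [
    {"title": "School enrolment 2014", "topic": "education", "format": "CSV", "country": "Uganda"},
    {"title": "Rainfall observations", "topic": "climate", "format": "netCDF", "country": "Indonesia"},
    {"title": "Literacy survey", "topic": "education", "format": "JSON", "country": "Indonesia"},
]

def facet_filter(records, **facets):
    """Keep only records whose metadata matches every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]

print([d["title"] for d in facet_filter(datasets, topic="education", country="Indonesia")])
# -> ['Literacy survey']
```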
Visual Analytics

The pervasive existence of visualization in the data life cycle shows that visualization can be applied broadly in data analytics. Yet, in actual practice, visualization is often treated as a method to show the result of data analysis rather than as a way to enable interactions between users and complex datasets. That is, the visualization as a result is separated from the datasets upon which the result is generated. Many of the data analysis and visualization tools scientists use nowadays do not allow dynamic and live linking between visual representations and datasets, and when a dataset changes, the visualization is not updated to reflect the changes. In the context of Big Data, many socioeconomic challenges and scientific problems facing the world are increasingly linked to interdependent datasets from multiple fields of research, organizations, instruments, dimensions, and formats. Interactions are becoming an inherent characteristic of data analytics with Big Data, which requires new methodologies and technologies of data visualization to be developed and deployed.

Visual analytics is a field of research that addresses the requirements of interactive data analysis. It combines many existing techniques from data visualization with those from computational data analysis, such as those from statistics and data mining. Visual analytics is especially focused on the integration of interactive visual representations with the underlying computational process. For example, the IPython Notebook provides an online collaborative environment for interactive and visual data analysis and report drafting. IPython Notebook uses JavaScript Object Notation (JSON) as its document format, and each notebook is a JSON document that contains a sequential list of input/output cells. There are several types of cells to contain different contents, such as text, mathematics, plots, code, and even rich media such as video and audio. Users can design a workflow of data analysis through the arrangement and update of cells in a notebook. A notebook can be shared with others as a normal file, or it can also be shared with the public using online services such as the IPython Notebook Viewer. A completed notebook can be converted into a number of standard output formats, such as the HyperText Markup Language (HTML), HTML presentation slides, LaTeX, Portable Document Format (PDF), and more. The conversion is done through a few simple operations, so that once a notebook is complete, a user only needs to press a few buttons to generate a scientific report. The notebook can be reused to analyze other datasets, and the cells inside it can also be reused in other notebooks.
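Strictly speaking, JSON is the notebook's storage format rather than a scripting language: the cells hold the code and text, while the file that stores them is a JSON document. The following Python sketch writes a heavily simplified, hypothetical notebook file; real notebook documents carry additional metadata (kernel information, cell outputs, and so on).

```python
import json

# A heavily simplified sketch of a notebook document; actual .ipynb files
# include more required fields (outputs, execution counts, kernel metadata).
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Data analysis report"]},
        {"cell_type": "code", "source": ["total = sum([1, 2, 3])\n", "total"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# Write the cells out as a JSON document on disk.
with open("report.ipynb", "w") as f:
    json.dump(notebook, f, indent=2)
```

Converting such a file to HTML or PDF is then a single operation with a conversion tool such as nbconvert, which is the "press a few buttons" step mentioned above.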
Standards and Best Practices

Any application of Big Data will face the challenges caused by the four dimensions of Big Data: volume, variety, velocity, and veracity. Commonly accepted standards or community consensus are a proven way to reduce the heterogeneities between the datasets under study. Various standards have already been used in applications tackling scientific, social, and business issues, such as the aforementioned JSON for transmitting data as human-readable text, the Scalable Vector Graphics (SVG) format for two-dimensional vector graphics, and GeoJSON for representing collections of georeferenced features. There are also organizations coordinating the work on community standards. The World Wide Web Consortium (W3C) coordinates the development of standards for the Web. For example, SVG is an output of the W3C. Other W3C standards include the Resource Description Framework (RDF), the Web Ontology Language (OWL), and the Simple Knowledge Organization System (SKOS). Many of them are used for data in the Semantic Web. The Open Geospatial Consortium (OGC) coordinates the development of standards relevant to geospatial data. For example, the Keyhole Markup Language (KML) is developed for presenting geospatial features in Web-based maps and virtual globes such as Google Earth. The Network Common Data Form (netCDF) is developed for encoding array-oriented data. Most recently, GeoSPARQL has been developed for encoding and querying geospatial data in the Semantic Web.
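To illustrate one of these standards, a GeoJSON feature is simply a JSON object that pairs a geometry with free-form properties. The feature below is a made-up example written from Python, not data drawn from any of the systems discussed in this entry.

```python
import json

# A minimal, hypothetical GeoJSON Feature: a point geometry plus properties.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [5.12, 52.09]},  # longitude, latitude
    "properties": {"name": "Example monument", "category": "archaeology"},
}

print(json.dumps(feature, indent=2))
```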
Standards just enable the initial elements for data visualization, and domain expertise and novel ideas are needed to put standards into practice (Fox and Hendler 2011). For example, Google Motion Chart adapts the fresh idea of motion charts to extend traditional static charts, and the aforementioned IPython Notebook allows the use of several programming languages and data formats through the use of cells. There are various programming libraries developed for data visualization, and many of them are made available on the Web. D3.js is a typical example of such open source libraries (Murray 2013). The D3 here stands for Data-Driven Documents. It is a JavaScript library using digital data to drive the creation and running of interactive graphics in Web browsers. D3.js-based visualization uses JSON as the format of the input data and SVG as the format for the output graphics. The OneGeology data portal provides a platform to browse geological map services across the world, using standards developed by both OGC and W3C, such as SKOS and the Web Map Service (WMS). GeoSPARQL is a relatively newer standard for geospatial data, but there are already featured applications. The demo system of the Dutch Heritage and Location shows the linked open dataset of the National Cultural Heritage, with more than 13 thousand archaeological monuments in the Netherlands. Besides GeoSPARQL, GeoJSON and a few other standards and libraries are also used in that demo system.

Cross-References

▶ Data Visualization
▶ Data-Information-Knowledge-Action Model
▶ Interactive Data Visualization
▶ Pattern Recognition

References

Cohen, L., Lehericy, S., Chochon, F., Lemer, C., Rivaud, S., & Dehaene, S. (2002). Language-specific tuning of visual cortex? Functional properties of the visual word form area. Brain, 125(5), 1054–1069.
Fox, P., & Hendler, J. (2011). Changing the equation on scientific data visualization. Science, 331(6018), 705–708.
Ma, X., Fox, P., Rozell, E., West, P., & Zednik, S. (2014). Ontology dynamics in a data life cycle: Challenges and recommendations from a geoscience perspective. Journal of Earth Science, 25(2), 407–412.
Murray, S. (2013). Interactive data visualization for the web. Sebastopol: O'Reilly.
Tufte, E. (1983). The visual display of quantitative information. Cheshire: Graphics Press.
White House Big Data Initiative

Gordon Alley-Young
Department of Communications & Performing Arts, Kingsborough Community College – City University of New York, New York, NY, USA

Synonyms

The Big Data Research and Development Initiative (TBDRDI)

Introduction

On March 29, 2012, the White House introduced The Big Data Research and Development Initiative (TBDRDI) at a cost of $200 million. Big data (BD) refers to the collection and interpretation of enormous datasets, using supercomputers running smart algorithms to rapidly uncover important features (e.g., interconnections, emerging trends, anomalies, etc.). The Obama Administration developed TBDRDI because having the large amounts of instantaneous data that are continually being produced by research and development (R&D) and emerging technology go unprocessed hurts the US economy and society. President Obama requested an all-hands-on-deck effort for TBDRDI including the public (i.e., government) and private (i.e., business) sectors to maximize economic growth, education, health, clean energy, and national security (Raul 2014; Savitz 2012). The administration stated that the private sector would lead by developing BD while the government will promote R&D, facilitate private sector access to government data, and shape public policy. Several government agencies made the initial investment in this initiative to advance the tools/techniques required to analyze and capitalize on BD. TBDRDI has been compared by the Obama Administration to previous administrations' investments in science and technology that led to innovations such as the Internet. Critics of the initiative argue that administration BD efforts need to be directed elsewhere.

History of the White House Big Data Initiative

TBDRDI is the White House's $200 million federal agency funded initiative that seeks to secure the US's position as the world's most powerful and influential economy by channeling the information power of BD into social and economic development (Raul 2014). BD is an all-inclusive name for the nonstop supply of sophisticated electronic data that is being produced by a variety of technologies and by scientific inquiry. In short, BD includes any digital file, tag, or data that is created whenever we interact with technology, no matter how briefly (Carstensen 2012).
The dilemma posed by BD to the White House, as well as to other countries, organizations, and businesses worldwide, is that so much of it goes unanalyzed due to its sheer volume and the limits of our current technological tools to effectively store, organize, and analyze it. Processing BD is not so simple because it requires supercomputing capabilities, some of which are still emerging. Experts argue that up until 2003, only 5 exabytes (EB) of data were produced; that number has since exploded to over five quintillion bytes of data (approximately 4.3 EB) every 2 days.

The White House Office of Science and Technology Policy (WHOSTP) announced TBDRDI in March 2012 in conjunction with the National Science Foundation (NSF), National Institutes of Health (NIH), US Geological Survey (USGS), and the Department of Defense (DoD) and Department of Energy (DoE). Key concerns to be addressed by TBDRDI are to manage BD by significantly increasing the speed of scientific inquiry and discovery, bolstering national security, and overhauling US education. TBDRDI is the result of recommendations in 2011 by the President's Council of Advisors on Science and Technology and represents the US government's wish to get ahead of the BD wave and prevent a cultural lag by revamping its BD practices (Executive Office of the President 2014). John Holdren, Director of WHOSTP, compared the $200 million being invested in BD to prior federal investments in science and technology that are responsible for our current technological age (Scola 2013). The innovations of the technology age ironically have created the BD that makes initiatives such as these necessary.

In addition to the US government agencies that helped to unveil TBDRDI, several other federal agencies had been requested to develop BD management strategies in the time leading up to and following this initiative. A US government fact sheet listed between 80 and 85 BD projects across a dozen federal agencies including, in addition to the departments previously mentioned, the Department of Homeland Security (DHS), Department of Health and Human Services (DHHS), and the Food and Drug Administration (FDA) (Henschen 2012). The White House referred to TBDRDI as the government placing its bet on BD, meaning that the financial investment in this initiative is expected to yield a significant return for the country in coming years. To this end, President Obama has sought the involvement of public, private, and other (e.g., academia, nongovernmental organizations) experts and organizations to work in a way that emphasizes collaboration. For spearheading TBDRDI and for choosing to stake the future of the country on BD, President Barack Obama has been dubbed the BD president by the media.

Projects of the White House Big Data Initiative

The projects included under the umbrella of TBDRDI are diverse, but they share common themes of emphasizing collaboration (i.e., to maximize resources and eliminate data overlap) and making data openly accessible for its social and economic benefits. One project undertaken with the co-participation of NIH and Amazon, the world's largest online retailer, aims to provide public access to the 1,000 Genomes Project using cloud computing (Smith 2012). The 1,000 Genomes Project involved scientists and researchers sequencing the genomes of over 1,000 anonymous and ethnically diverse people between 2008 and 2012 in order to better treat illness and predict medical conditions that are genetically influenced. The NIH will deposit 200 terabytes (TB) of genomic data into Amazon's Web Services. According to the White House, this is currently the world's largest collection of human genetic data. In August 2014, the UK reported that it would undertake a 100,000 genomes project that is slated to finish in 2017. The NIH and NSF will cooperate to fund 15–20 research projects for a cost of $25 million. Other collaborations include the DoE's and University of California's creation of a new facility as part of their Lawrence Berkeley National Laboratory called the Scalable Data Management, Analysis, and Visualization Institute ($25 million) and the NSF and University of California, Berkeley's geosciences Earth Cube BD project ($10 million).
The CyberInfrastructure for Billions of Electronic Records (CIBER) project is a co-initiative of the National Archives and Records Administration (NARA), the NSF, and the University of North Carolina Chapel Hill. The project will assemble decades of historical and digital-era documents on demographics and urban development/renewal. The project draws on citizen-led sourcing, or citizen sourcing, meaning that the project will build a participative archive fueled by engaged community members and not just by professional archivists and/or governmental experts. Elsewhere, the NSF will partner with NASA on its Global Earth Observation System of Systems (GEOSS), an international project to share and integrate Earth observation data. Similarly, the National Oceanic and Atmospheric Administration (NOAA) and NASA, who collectively oversee hundreds of thousands of environmental sensors producing reams of climate data, have partnered with Computer Science Corporation (CSC) to manage this climate data using their ClimatEdge™ risk management suite of tools. CSC will collect and interpret the climate data and make it available to subscribers in the form of monthly reports that anticipate how climate changes could affect global agriculture, global energy demand/production, sugar/soft commodities, grain/oilseeds, and energy/natural gas. These tools are promoted to help companies and consumers make better decisions. For example, fluctuating resource prices caused by climate changes will allow a consumer/business to find new supplies/suppliers in advance of natural disasters and weather patterns. Future goals include providing streaming data to advanced users of the service and expanding this service to other sectors including disease and health trends (Eddy 2014).

The DoD argues that it will spend $250 million annually on BD. Several of its initiatives promote cybersecurity, like its Cyber-Insider Threat program for quick and precise targeting of cyber espionage threats to military computer networks. The DoD's cybersecurity projects also include developing cloud-computing capabilities that would retain function in the midst of an attack, programming languages that stay encrypted whenever in use, and security programs suitable for BD supercomputer networks. In keeping with TBDRDI's maxim to collaborate and share, the DoD has partnered with Lockheed Martin Corporation to provide the military and its partners with time-sensitive intelligence, surveillance, and reconnaissance data in what is being called a Distributed Common Ground System (DCGS). This project is touted as having the potential to save individual soldiers' lives on the battlefield. Other defense-oriented initiatives under TBDRDI include how the Pentagon is working to increase its ability to extract information from texts to over 100 times its current rates and the Defense Advanced Research Projects Agency's (DARPA) development of XDATA (Raul 2014), a $100 million program for sifting BD.

Influences of the Initiative and Expected Outcomes

The United Nations' (UN) Global Pulse Initiative (GPI) may have shaped TBDRDI (UN Global Pulse 2012). Realizing in 2009–2010 that the data it relied upon to respond to global crises was outdated, the UN created its GPI to provide real-time data. In 2011, the proof of concept (i.e., primary project) phase began with the analysis of 2 years' worth of US and Irish social media data for mood scores/conversation indicators that could, in some cases, predict economic downturns 5 months out and economic upturns 2 months out. Success in this project justified opening GPI labs in Jakarta, Indonesia, and Kampala, Uganda. Similarly, in 2010, President Obama's Council of Advisors on Science and Technology urged focused investment in information technology (IT) to avoid overlapping efforts (Henschen 2012). This advice fit with 2010's existing cost-cutting efforts that were moving government work to less expensive Internet-based applications. TBDRDI, emerging from IT recommendations and after a period of economic downturn, differs from the so-called reality-based community (i.e., studying what has happened) of the Bush Administration by focusing instead on what will happen in the future.
Some also argue that an inkling of TBDRDI can be seen as early as 2008, when then Senator Obama cosponsored a bipartisan online federal spending database bill (i.e., for USAspending.gov) and as a presidential candidate actively utilized BD techniques (Scola 2013).

TBDRDI comes at a time when International Data Corporation (IDC) predicts that by 2020, over a third of digital information will generate value if analyzed. Making BD open and accessible will bring businesses an estimated three trillion dollars in profits. Mark Weber, President of US Public Sector for NetApp and a government IT commentator, argues that the value of BD lies in transforming it into quality knowledge for increasing efficiency and better informed decision-making (CIO Insight 2012). TBDRDI is also said to benefit national security. Kaigham Gabriel, a Google executive and the next CEO and President of Draper Laboratory, argued that the cluttered nature of the BD field allows America's adversaries to hide, and that field is becoming increasingly cluttered: it is estimated that government agencies generated one petabyte (PB), or one quadrillion bytes, of data from 2012 to 2014 (CIO Insight 2012). One would need almost 14,552 64-gigabyte (GB) iPhones in order to store this amount of data. Experts argue that the full extent of the technology/applications required to successfully manage the amounts of BD that TBDRDI could produce now and in the future remains to be seen.
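The storage figures quoted in this entry can be checked with a few lines of arithmetic. The sketch below simply reproduces the conversions under one set of assumptions (a quadrillion and a quintillion read as decimal 10^15 and 10^18 bytes, device capacity and exabytes read as binary units); it is an illustration, not part of the initiative itself.

```python
# Rough check of the storage figures quoted above. Assumptions are mine:
# "one quadrillion bytes" = 1e15, "five quintillion bytes" = 5e18,
# a "64-gigabyte" phone holds 64 * 2**30 bytes, and an exabyte is 2**60 bytes.
PETABYTE = 10**15
FIVE_QUINTILLION = 5 * 10**18
PHONE_BYTES = 64 * 2**30
EXBIBYTE = 2**60

print(round(PETABYTE / PHONE_BYTES))          # -> 14552 phones for 1 PB
print(round(FIVE_QUINTILLION / EXBIBYTE, 1))  # -> 4.3 "EB" every 2 days
```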
President Obama promised that TBDRDI would stimulate the economy and save taxpayer money, and there is evidence to indicate this. The employment outlook for individuals trained in mathematics, science, and technology is strong as the US government attempts to hire sufficient staff to carry out the work of TBDRDI. Hiring across governmental agencies requires the skilled work of deriving actionable knowledge from BD. This responsibility falls largely on a subset of highly trained professionals known as quantitative analysts, or the quants for short. Currently these employees are in high demand and thus can be difficult to source, as the US government must compete alongside private sector businesses for talent when the latter may be able to provide larger salaries and higher profile positions (e.g., Wall Street firms). Some have argued for the government to invest more money in the training of quantitative analysts to feed initiatives such as this (Tucker 2012).

In terms of cutting overspending, cloud computing (platform-as-a-service technologies) has been identified under TBDRDI as a means to consolidate roughly 1,200 unneeded federal data centers (Tucker 2012). The Obama Administration has stated that it will eliminate 40 % of federal data centers by 2015. This is estimated to generate $5 billion in savings. Some in the media applaud the effort and corresponding savings, while some critics of the plan argue that the data centers should be streamlined and upgraded instead. As of 2014, the US government reports that 750 data centers have been eliminated.

In January 2014, after classified information leaks by former NSA contractor Edward Snowden, President Obama asked the White House for a comprehensive review of BD that some argue dampened the enthusiasm for TBDRDI (Raul 2014). The US does not have a specific BD privacy law, leading critics to claim a policy deficit. Others point to the Federal Trade Commission (FTC) Act, Section 5, which prohibits unfair or deceptive acts or practices in or affecting commerce, as being firm enough to handle any untoward business practices that might emerge from BD while flexible enough to not hinder the economy (Raul 2014). Advocates note that the European Union (EU) has adopted a highly detailed privacy policy that has done little to foster commercial innovation and economic growth (Raul 2014).

Conclusion

Other criticism argues that TBDRDI, and the Obama Administration by default, actually serves big business instead of individual consumers and citizens. In support of this argument, critics argue that the administration pressured communications companies to provide more affordable and higher speeds of mobile broadband.
As of the summer of 2014, Hong Kong has the world's fastest mobile broadband speeds, which are also some of the most affordable, with South Korea second and Japan third; the US and its neighbor Canada are not even in the top ten list of fastest mobile broadband speed countries. Supporters of the administration cite that the Obama Administration has instead chosen to emphasize its unprecedented open data initiatives under TBDRDI. The US Open Data Action Plan emphasizes making high-priority US government data both mobile and publicly accessible, while Japan is reported to have fallen behind in open-sourcing its BD, specifically in providing access to its massive stores of state/local data, costing its economy trillions of yen.

Cross-References

▶ Big Data
▶ Cloud or Cloud Computing
▶ Cyberinfrastructure
▶ Defense Advanced Research Projects Agency (DARPA)
▶ Department of Homeland Security
▶ Food and Drug Administration (FDA)
▶ NASA
▶ National Oceanic and Atmospheric Administration
▶ National Science Foundation
▶ Office of Science and Technology Policy
▶ United Nations Global Pulse (Development)
▶ United States Geological Survey (USGS)

References

Carstensen, J. (2012). Berkeley group digs in to challenge of making sense of all that data. Retrieved from http://www.nytimes.com/2012/04/08/us/berkeley-group-tries-to-make-sense-of-big-data.html?_r=0.
CIO Insight (2012). Can government IT meet the big data challenge? Retrieved from http://www.cioinsight.com/c/a/Latest-News/Big-Data-Still-a-Big-Challenge-for-Government-IT-651653/.
Eddy, N. (2014). Big data proves alluring to federal IT pros. Retrieved from http://www.eweek.com/enterprise-apps/big-data-proves-alluring-to-federal-it-pros.html.
Executive Office of the President (2014). Big data: Seizing opportunities, preserving values. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf.
Henschen, D. (2012). Big data initiative or big government boondoggle? Retrieved from http://www.informationweek.com/software/information-management/big-data-initiative-or-big-government-boondoggle/d/d-id/1103666?
Raul, A.C. (2014). Don't throw the big data out with the bath water. Retrieved from http://www.politico.com/magazine/story/2014/04/dont-throw-the-big-data-out-with-the-bath-water-106168_full.html?print#.U_PA-lb4bFI.
Savitz, E. (2012). Big data in the enterprise: A lesson or two from big brother. Retrieved from http://www.forbes.com/sites/ciocentral/2012/12/26/big-data-in-the-enterprise-a-lesson-or-two-from-big-brother/.
Scola, N. (2013). Obama, the 'big data' president. Retrieved from http://www.washingtonpost.com/opinions/obama-the-big-data-president/2013/06/14/1d71fe2e-d391-11e2-b05f-3ea3f0e7bb5a_story.html.
Smith, J. (2012). White House aims to tap power of government data. Retrieved from https://www.yahoo.com/news/white-house-aims-tap-power-government-data-093701014.html?ref=gs.
Tucker, S. (2012). Budget pressures will drive government IT change. Retrieved from http://www.washingtonpost.com/business/capitalbusiness/budget-pressures-will-drive-government-it-change/2012/08/24/ab928a1e-e898-11e1-a3d2-2a05679928ef_story.html.
UN Global Pulse. (2012). Big data for development: Challenges & opportunities. Retrieved from UN Global Pulse, Executive Office of the Secretary-General, United Nations, New York, NY at http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-UNGlobalPulseJune2012.pdf.
White House BRAIN Initiative

Gordon Alley-Young
Department of Communications & Performing Arts, Kingsborough Community College, City University of New York, New York, NY, USA

Synonyms

Brain Research Through Advancing Innovative Neurotechnologies

Introduction

The White House BRAIN Initiative (TWHBI) includes an acronym where BRAIN stands for Brain Research Through Advancing Innovative Neurotechnologies. The goal of the initiative is to spur brain research, such as mapping the brain's circuitry, and technology that will lead to treatments and preventions for common brain disorders. President Barack Obama first announced the initiative in his February 2013 State of the Union Address (SOTUA). More than 200 leaders from universities, research institutes, national laboratories, and federal agencies were invited to attend when President Obama formally unveiled TWHBI on April 2, 2013. The Obama administration identified this initiative as one of the grand challenges of the twenty-first century. The $100 million initiative is funded via the National Institutes of Health (NIH), the Defense Advanced Research Projects Agency (DARPA), and the National Science Foundation (NSF), with matching support for the initiative reported to come from private research institutions and foundations. TWHBI has drawn comparisons to the Human Genome Project (HGP) for the potential scientific discovery that the project is expected to yield. The HGP and TWHBI are also big data projects for the volume of data that they have already produced and will produce in the future.

History and Aims of the Initiative

TWHBI aims to provide opportunities to map, study, and thus treat brain disorders including Alzheimer's disease, epilepsy, autism, and traumatic brain injuries. The NIH will lead efforts under the initiative to map brain circuitry, measure electrical/chemical activity along those circuits, and understand the role of the brain in human behavioral and cognitive output. The initiative is guided by eight key goals. The first is to make various types of brain cells available for experimental researchers to study their role in illness and well-being. The second is to create multilayered maps of the brain's different circuitry levels as well as a map of the whole organ. The third would see the creation of a dynamic picture of the brain through large-scale monitoring of neural activity.
Fourth is to link brain activity to behavior with tools that could intervene in and change neural circuitry. A fifth goal is to increase understanding of the biological basis for mental processes by theory building and developing new data analysis tools. The sixth is to innovate technology to better understand the brain so as to better treat disorders. The seventh is to establish and sustain interconnected networks of brain research. Finally, the last goal is to integrate the outcomes of the other goals to discover how dynamic patterns of neural activity get translated into human thought, emotion, perception, and action in illness and in health.

NIH Director Dr. Francis Collins echoed President Obama in publicly stating that TWHBI will change the way we treat the brain and grow the economy (National Institutes of Health 2014). During his 2013 SOTUA, President Obama drew an analogy to the Human Genome Project (HGP), arguing that for every dollar the USA invested in the project, the US economy gained $140. Estimates suggest that the HGP created $800 billion in economic activity. The HGP was estimated to cost $3 billion and take 15 years (i.e., 1990–2005). The project finished 2 years early and under cost at $2.7 billion in 1991 dollars. The HGP project is estimated to have cost $3.39–$5 billion in 2003 dollars. TWHBI has a budget of $100 million allocated in budget year 2014, with comparable funds ($122 million) contributed by private investors. A US federal report calls for $4.5 billion in funding for brain research over the next 12 years.

Projects Undertaken by the Initiative

The first research paper believed to be produced under the TWHBI was published on June 19, 2014, by principal investigator Dr. Karl Deisseroth of Stanford University. The research described Deisseroth and his team's innovation of the CLARITY technique, which can remove fat from the brain without damaging its wiring and enable the imaging of a whole transparent brain. Data from the study is being used by international biomedical research projects.

TWHBI was undertaken because it addresses what science, society, and government consider one of the grand challenges of the twenty-first century (i.e., the HGP was previously deemed a grand challenge). Unlocking the secrets of the brain will tell us how the brain can record, process, utilize, retain, and recall large amounts of information. Dr. Geoffrey Ling, deputy director of the Defense Sciences Office at the Defense Advanced Research Projects Agency (DARPA), states that TWHBI is needed to attract young and intelligent people into the scientific community. Ling cites a lack of available funding as a barrier to persuading students to pursue research careers (Vallone 2013). Current NIH director and former HGP director Dr. Francis Sellers Collins notes the potential of TWHBI to create jobs while potentially curing diseases of the brain and the nervous system, for instance, Alzheimer's disease (AD). In 2012 Health and Human Services Secretary Kathleen Sebelius stated the Obama administration's goal to cure AD by 2025. The Alzheimer's Association (AA) estimates that AD/dementia health and care cost $203 billion in 2013 ($142 billion of it paid by Medicare/Medicaid); this will reach $1.2 trillion by 2050 (Alzheimer's Association 2013).

Dr. Ling argues that for scientists to craft and validate the hypotheses that build on their knowledge and potentially lead to medical breakthroughs, they need access to the latest research tools. Ling states that some of today's best clinical brain research tools are nonetheless limited and outdated in light of the TWHBI work that remains to be done. To bolster his case for better research tools, Ling uses an analogy whereby the physical brain is hardware and the dynamic processes across the brain's circuits are software. Ling notes that cutting-edge tools can help identify bugs in the brain's software caused by a physical trauma (i.e., to the hardware) that, once found, might be repairable. The tools necessary for medical research will need to be high-speed tools with a much greater capacity for recording signals from brain cells. TWHBI, by bringing together scientists and researchers from a variety of fields such as nanoscience, imaging, engineering, and informatics, has the greatest opportunity to develop these tools.
Earlier Efforts and Influences

Brain research was emphasized prior to TWHBI by the previous two administrations. The Clinton administration held a White House conference on early childhood development and learning focused on insights gleaned from the latest brain research in 1997. In 2002 the Bush administration's National Drug Control Policy Director John Walters donated millions of dollars of drug-war money to purchase dozens of MRI machines. Their goal was a decade-long, $100 million brain-imaging initiative to study the brain to better understand addiction.

Publicity surrounding TWHBI brings attention to how much science has learned about the brain in a relatively short period of time. In the nineteenth century, brain study focused mostly on what happens when parts of the brain are damaged/removed. For instance, Phineas Gage partially lost his prefrontal cortex in an 1848 accident, and scientists noted how Mr. Gage changed from easygoing and dependable before to angry and irresponsible afterward. From the late eighteenth to mid-nineteenth centuries, pseudoscientists practiced phrenology, or reading a person's mind by handling a person's skull.

Phillip Low, a director of San Diego-based NeuroVigil Inc. (NVI), states that the White House talked to many scientists and researchers while planning TWHBI but did not reveal to these individuals that they were talking to many others, all of whom potentially believed they were the parent of TWHBI. However, the originators of the idea that led to TWHBI are said to be six scientists, whose journal article in the June 2012 issue of Neuron proposed a brain-mapping project. The six are A. Paul Alivisatos (University of California Berkeley), Miyoung Chun (The Kavli Foundation), George M. Church (Harvard University), Ralph J. Greenspan (The Kavli Institute), Michael L. Roukes (Kavli Nanoscience Institute), and Rafael Yuste (Columbia University) (Alivisatos et al. 2012). Journalist Steve Connor says the roots of TWHBI occurred 10 years earlier, when Microsoft cofounder and philanthropist Paul G. Allen established a brain science institute in Seattle with a $300 million investment. Similarly, with a $500 million investment, billionaire philanthropist Fred Kavli funded brain institutes at Yale, Columbia, and the University of California (Broad 2014). It was primarily scientists from these two institutes that crafted the TWHBI blueprint. Connor states that there are benefits and downsides to TWHBI's connections to private philanthropy. Connor acknowledges that philanthropists are able to invest in risky initiatives in a way that the government cannot, but that this can lead to a self-serving research focus, the privileging of affluent universities at the expense of poorer ones, and a US government that is following the lead of private interests rather than setting the course itself (Connor 2013).

The $100 million for the first phase of TWHBI in fiscal year 2014 comes from three government agencies' budgets, specifically NIH, DARPA, and NSF. The NIH Blueprint for Neuroscience Research will lead with contributions specifically geared to projects that would lead to the development of cutting-edge, high-speed tools, training, and other resources. The next generation of tools is viewed as vital to the advancement of this initiative. Contributor DARPA will invest in programs that aim to understand the dynamic functions of the brain, noted in Dr. Ling's analogy as the software of the brain, showing breakthrough applications based on the dynamic function insights gained. DARPA also seeks to develop new tools for capturing and processing dynamic neural and synaptic activities. DARPA develops applications for improving the diagnosis and treatment of post-traumatic stress, brain injury, and memory loss sustained through war and battle. Such applications would include generating new information processing systems related to the information processing system in the brain and mechanisms of functional restoration after brain injury. DARPA is mindful that advances in neurotechnology, such as those outlined above, will entail ethical, legal, and social issues that it will oversee via its own experts. Ethics are also at the forefront of TWHBI. Specifically, President Obama identified adhering to the highest standards of research protections as a prime focus.
Oversight of ethical issues related to this as well as any other neuroscience initiative will fall to the administration's Commission for the Study of Bioethical Issues.

The NSF's strength as a contributor to TWHBI is that it will sponsor interdisciplinary research that spans the fields of biology, physics, engineering, computer science, social science, and behavioral science. The NSF's contribution to TWHBI again emphasizes the development of tools and equipment, specifically molecular-scale probes that can sense and record the activity of neural networks. Additionally, the NSF will also seek to address the innovations that will be necessary in the field of big data in order to store, organize, and analyze the enormous amounts of data that will be produced. Finally, NSF projects under TWHBI will seek a better understanding of how thoughts, emotions, actions, and memories get represented in the brain.

In addition to federal government agencies, at least four private institutes and foundations have pledged an estimated $122 million to support TWHBI: The Allen Institute (TAI), the Howard Hughes Medical Institute (HHMI), The Kavli Foundation (TKF), and The Salk Institute for Biological Studies (TSI). TAI's strengths lie in large-scale brain research, tools, and data sharing, which is necessary for a big data project like the one TWHBI represents. Starting in March 2012, TAI undertook a 10-year project to unlock the neural code (i.e., how brain activity leads to perception, decision-making, and action). HHMI, by comparison, is the largest nongovernmental funder of basic biomedical research and has long supported neuroscience research. TKF anticipates drawing on the endowments of existing Kavli Institutes (KI) to fund its participation in TWHBI. This includes funding new KIs. Finally, the TSI, under its dynamic BRAIN initiative, will support cross-boundary research in neuroscience. For example, TSI researchers will map the brain's neural networks to determine their interconnections. TSI scientists will lay the groundwork for solving neurological puzzles such as Alzheimer's/Parkinson's by studying age-related brain differences (The White House 2013).

The work of TWHBI will be spread across affiliated research institutions and laboratories across the USA. The NIH is said to be establishing a bicoastal cochaired working group under Dr. Cornelia Bargmann, a former UCSF Professor now with the Rockefeller University in New York City, and Dr. William Newsome from California's Stanford University, to specify goals for the NIH's investment and create a multiyear plan for achieving these goals with timelines and costs (University of California San Francisco 2013). On the east coast of the USA, the NIH Blueprint for Neuroscience Research, which draws on 15 of the 27 NIH Institutes and Centers headquartered in Bethesda, MD, will be a leading NIH contributor to TWHBI. Research will occur in nearby Virginia at HHMI's Janelia Farm Research Campus, which focuses on developing new imaging technologies and finding out how information is stored and processed in neural networks. Imaging technology furthers TWHBI's goals of mapping the brain's structures by allowing researchers to create dynamic brain pictures down to the level of single brain cells as they interact with complex neural circuits at the speed of thought.

Conclusion

Contributions to and extensions of TWHBI are also happening on the US west coast and internationally. San Diego State University (SDSU) is contributing to TWHBI via its expertise in clinical and cognitive neuroscience, specifically its investigations to understand and treat brain-based disorders like autism, aphasia, fetal alcohol spectrum (FAS) disorders, and AD. San Diego's NVI, founded in 2007 and advised by Dr. Stephen Hawking, and its founder, CEO, and Director Dr. Phillip Low, helped to shape the TWHBI. NVI is notable for its iBrain™ single-channel electroencephalograph (EEG) device that noninvasively monitors the brain (Keshavan 2013). Dr. Low has also taken the message of the TWHBI international, as he was asked to go to Israel and help them develop their own BRAIN initiative. To this end, Dr. Low delivered one of two keynotes for Israel's first International Brain Technology Conference in Tel Aviv in October 2013.
neuroscience research collaboration and increased hosting of the NSF's US research fellows for collaborating on relevant research projects.

Cross-References

▶ Australia
▶ Big Data
▶ Data Sharing
▶ Defense Advanced Research Projects Agency (DARPA)
▶ Engineering
▶ Human Genome Project
▶ Medicare
▶ Medical/Health Care
▶ Medicaid
▶ National Institutes of Health
▶ National Science Foundation
▶ Neuroscience

References

Alivisatos, A. P., Chun, M., Church, G. M., Greenspan, R. J., Roukes, M. L., & Yuste, R. (2012). The brain activity map project and the challenge of functional connectomics. Neuron, 74(6), 970–974.
Alzheimer's Association. (2013). Alzheimer's Association applauds White House brain mapping initiative. Chicago, IL: Alzheimer's Association National Office. http://www.alz.org/news_and_events_alz_association_applauds_white_house.asp
Broad, W. J. (2014). Billionaires with big ideas are privatizing American science. The New York Times. http://www.nytimes.com/2014/03/16/science/billionaires-with-big-ideas-are-privatizing-american-science.html
Connor, S. (2013). One of the biggest mysteries in the universe is all in the head. The Independent. http://www.independent.co.uk/voices/comment/one-of-the-biggest-mysteries-in-the-universe-is-all-in-the-head-8791565.html
Keshavan, M. (2013). BRAIN Initiative will tap our best minds. San Diego Business Journal, 34(15), 1.
National Institutes of Health. (2014). NIH embraces bold, 12-year scientific vision for BRAIN Initiative. Bethesda, MD: National Institutes of Health. http://www.nih.gov/news/health/jun2014/od-05.htm
The White House. (2013). Fact sheet: BRAIN Initiative. Washington, DC: The White House Office of the Press Secretary. http://www.whitehouse.gov/the-press-office/2013/04/02/fact-sheet-brain-initiative
University of California San Francisco. (2013). President Obama unveils brain mapping project. http://www.ucsf.edu/news/2013/04/104826/president-obama-unveils-brain-mapping-project
Vallone, J. (2013). Federal initiative takes aim at treating brain disorders. Investors Business Daily, p. A04.
WikiLeaks

Kim Lacey
Saginaw Valley State University, University Center, MI, USA

WikiLeaks is a nonprofit organization devoted to sharing classified, highly secretive, and otherwise controversial documents to promote transparency among global superpowers. These shared documents are commonly referred to as "leaks." WikiLeaks has received both highly positive and negative attention for this project, particularly because of its mission to share leaked information. WikiLeaks is operated by the Icelandic Sunshine Press, and Julian Assange is often named the founder of the organization.

WikiLeaks began in 2006, and its founding is largely attributed to Australian Julian Assange, often described as an Internet activist and hacker. The project, which aims to share government documents usually kept from citizens, is a major source of division between individuals and officials. The perspective on this division differs depending on the viewpoint. From the perspective of its opponents, the WikiLeaks documents are obtained illegally, and their distribution is potentially harmful for national security purposes. From the perspective of its supporters, the documents point to egregious offenses perpetrated, and ultimately stifled, by governments. On its website, WikiLeaks notes that it is working toward what it calls "open governance," the idea that leaks are not only for international, bureaucratic diplomacy but, more importantly, for clarity of citizens' consciousness.

In 2010, Chelsea (born Bradley) Manning leaked roughly 400,000 United States military files regarding the Iraq War. According to Andy Greenberg, this leak, together with the release of US diplomatic cables later that year that became known as Cablegate, marked the largest leak of United States government information since Daniel Ellsberg photocopied the Pentagon Papers. After chatting for some time, Manning confessed to former hacker Adrian Lamo; eventually, Lamo turned Manning over to the army authorities, leading to her arrest. United States government officials were outraged by the leak of classified documents and viewed Manning as a traitor. The leak eventually led to Manning's detention, and officials kept her detained for more than 1,000 days without a trial. Because of this delay, supporters of WikiLeaks were outraged at Manning's denial of a swift trial. Manning was eventually acquitted of aiding the enemy but, in August 2013, was sentenced to 35 years for various crimes, including violations of the Espionage Act.

One of the most well-known documents Manning shared put WikiLeaks on the map for many who were previously unfamiliar with the
organization. This video, known familiarly as "Collateral Murder," shows a United States Apache helicopter shooting Reuters reporters and individuals helping these reporters, and seriously injuring two children. Two versions of the video have been released: a shorter, 17-min video and a more detailed 39-min video. Both videos were leaked by WikiLeaks and remain on its website.

WikiLeaks uses a number of different drop boxes in order to obtain documents and maintain the anonymity of the leakers. Many leakers are well versed in anonymity-protecting programs such as Tor, which uses what its developers call "onion routing": several layers of encryption to avoid detection. However, in order to make leaking less complicated, WikiLeaks provides instructions on its website for users to skirt around regular detection through normal identifiers. Users are instructed to submit documents in one of many anonymous drop boxes to avoid detection.
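The layered-encryption idea behind "onion routing" can be illustrated with a few lines of code. The sketch below is purely conceptual and is not Tor's actual protocol; the relay keys and message are hypothetical, and it only shows how each relay can peel off exactly one layer of encryption.

```python
# Conceptual sketch of onion routing: wrap a message in several encryption
# layers; each relay removes only its own layer. Not Tor's real protocol.
from cryptography.fernet import Fernet

relay_keys = [Fernet.generate_key() for _ in range(3)]  # exit, middle, entry
message = b"document to submit anonymously"

# The sender wraps the message, innermost layer first.
onion = message
for key in relay_keys:
    onion = Fernet(key).encrypt(onion)

# Each relay, in reverse order, peels off one layer.
for key in reversed(relay_keys):
    onion = Fernet(key).decrypt(onion)

assert onion == message  # the final relay recovers the original payload
```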
In order to verify the authenticity of a document, WikiLeaks performs several forensic tests, including weighing the price of forgery as well as possible motives for falsifying information. On its website, WikiLeaks explains that it verified the now infamous "Collateral Murder" video by sending journalists to interview individuals affiliated with the attack. WikiLeaks also states that, once it publishes a document, the fact that it has been published is verification enough. By making information more freely available, WikiLeaks aims to start a larger conversation within the press about access to authentic documents and democratic information.

Funding for WikiLeaks has been a contentious issue since its founding. Since 2009, Assange has noted several times that WikiLeaks is in danger of running out of funding. A major cause of these funding shortages is that many corporations (including Visa, MasterCard, and PayPal) have ceased to allow their customers to donate money to WikiLeaks. On the WikiLeaks website, this action is described as the "banking blockade." To work around this banking blockade, many mirror sites (websites that are hosted separately but contain the same information) have appeared, allowing users to access WikiLeaks documents and also donate with "blocked" payment methods. WikiLeaks also sells paraphernalia on its website, but it is unclear if these products fall under the banking blockade restrictions.

Because of his affiliation with WikiLeaks, Julian Assange was granted political asylum in Ecuador in 2012. Prior to his asylum, he had been accused of molestation and rape in Sweden but evaded arrest. In June 2013, Edward Snowden, a former National Security Agency (NSA) contractor, leaked evidence of the United States spying on its citizens to the UK's The Guardian. On many occasions, WikiLeaks has supported Snowden, helping him apply for political asylum, providing funding, and also providing him with escorts on flights (most notably Sarah Harrison accompanying Snowden from Hong Kong to Russia).

WikiLeaks has been nominated for multiple awards for reporting. Among the awards it has won are the Economist Index on Censorship Freedom of Expression award (2008) and the Amnesty International human rights reporting award (2009, New Media). In 2011, Norwegian citizen Snorre Valen publicly announced that he had nominated Julian Assange for the Nobel Peace Prize, although Assange did not win.

Cross-References

▶ Anonymization
▶ National Security Agency (NSA)
▶ Transparency

Further Readings

Dwyer, D. (n.d.). WikiLeaks' Assange for Nobel Prize? ABC News. http://abcnews.go.com/Politics/wikileaks-julian-assange-nominated-nobel-peace-prize/story?id=12825383. Accessed 28 Aug 2014.
Greenberg, A. (2012). This machine kills secrets: How wikileakers, cypherpunks, and hacktivists aim to free the world's information. New York: Dutton.
Sifry, M. L. (2011). WikiLeaks and the age of transparency. New York: O/R Books.
Tate, J. (n.d.). Bradley Manning sentenced to 35 years in WikiLeaks case. Washington Post. http://www.washingtonpost.com/world/national-security/judge-to-sentence-bradley-manning-today/2013/08/20/85bee184-09d0-11e3-b87c-476db8ac34cd_story.html. Accessed 26 Aug 2014.
Wikileaks.org. (n.d.). https://www.wikileaks.org/. Accessed 28 Aug 2014.
WikiRebels: The documentary. (n.d.). https://www.youtube.com/watch?v=z9xrO2Ch4Co. Accessed 1 Sept 2012.
Wikipedia

Ryan McGrady
North Carolina State University, Raleigh, NC, USA

Wikipedia is an open-access online encyclopedia hosted and operated by the Wikimedia Foundation (WMF), a San Francisco-based nonprofit organization. Unlike traditional encyclopedias, Wikipedia is premised on an open editing model whereby everyone using the site is allowed and encouraged to contribute content and make changes. Since its launch in 2001, it has grown to over 40 million articles across nearly three hundred languages, constructed almost entirely by unpaid pseudonymous and anonymous users. Since its infancy, Wikipedia has attracted researchers from many disciplines to its vast collection of user-generated knowledge, unusual production model, active community, and open approach to data.

Wikipedia works on a type of software called a wiki, a popular kind of web application designed to facilitate collaboration. Wiki pages can be modified directly using a built-in text editor. When a user saves his or her changes, a new version of the article is created and immediately visible to the next visitor. Part of what allows Wikipedia to maintain standards for quality is the meticulous record-keeping of changes provided by wiki software, storing each version of a page permanently in a way that is easily accessible. If someone makes changes that are not in the best interest of the encyclopedia, another user can easily see the extent of those changes and if necessary restore a previous version or make corrections. Each change is timestamped and attributed to either a username or, if made anonymously, an IP address. Although Wikipedia is transparent about what data it saves and draws little criticism on privacy matters, any use of a wiki requires self-awareness given that one's actions will be archived indefinitely.

Article histories largely comprise the Wikipedia database, which the WMF makes available to download for any purpose compatible with its Creative Commons license, including mirroring, personal and institutional offline use, and data mining. The full English-language database download amounts to more than ten terabytes, with several smaller subsets available that, for example, exclude discussion pages and user profiles or only include the most current version of each page.
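Working with these XML dumps usually means stream-parsing them rather than loading them into memory. The sketch below, a minimal example using only the Python standard library, counts revisions per contributor; the file name is hypothetical, and real dumps use a namespaced schema, which is why only the local tag name is inspected.

```python
# Minimal sketch: stream-parse a Wikipedia XML dump and count edits per user.
import xml.etree.ElementTree as ET
from collections import Counter

def local(tag):
    return tag.rsplit("}", 1)[-1]  # strip the XML namespace, if any

edits_per_user = Counter()
current_user = None

for event, elem in ET.iterparse("enwiki-pages-meta-history.xml", events=("end",)):
    name = local(elem.tag)
    if name in ("username", "ip") and elem.text:
        current_user = elem.text          # contributor of the revision being read
    elif name == "revision":
        edits_per_user[current_user or "unknown"] += 1
        current_user = None
        elem.clear()                      # keep memory bounded on huge dumps

print(edits_per_user.most_common(10))
```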
As with any big data project, there is a challenge in determining not just what questions to ask but how to use the data to convey meaningful answers. Wikipedia presents an incredible amount of knowledge and information, but it is widely dispersed and collected in a database organized around articles and users, not structured data. One way the text archive is rendered intelligible is through visualization, wrangling the unwieldy information by expressing statistics and patterns through visuals like graphs, charts, or histograms. Given the multi-language and international nature of Wikipedia, as well as the disproportionate size and activity of the English version in particular, geography is important in its critical discourse. Maps are thus popular visuals to demonstrate disparities, locate concentrations, and measure coverage or influence. Several programs have been developed to create visualizations using Wikipedia data as well. One of the earliest, the IBM History Flow tool, produces images based on stages of an individual article's development over time, giving a manageable, visual form to an imposingly long edit history and the disagreements, vandalism, and controversies it contains.

The Wikipedia database has been and continues to be a valuable resource, but there are limitations to what can be done with its unstructured data. It is downloaded as a relational database filled with text and markup, but the machines that researchers use to process data are not able to understand text like a human, limiting what tasks they can be given. It is for this reason that there have been a number of attempts to extract structured data as well. DBPedia is a database project started in 2007 to put as much of Wikipedia into the Resource Description Framework (RDF) as possible. Whereas content on the web typically employs HTML to display and format text, multimedia, and links, RDF emphasizes not what a document looks like but how its information is organized, allowing for arbitrary statements and associations which effectively make the items meaningful to machines. The article for the film Moonrise Kingdom may contain the textual statement "it was shot in Rhode Island," but a machine would have a difficult time extracting the desired meaning, instead preferring to see a subject "Moonrise Kingdom" with a standard property "filming location" set to the value "Rhode Island."
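The subject-property-value pattern described here is exactly what an RDF triple expresses. The following sketch uses the rdflib package; the namespace and property name are illustrative rather than DBPedia's exact vocabulary.

```python
# Minimal RDF sketch: encode "Moonrise Kingdom was shot in Rhode Island"
# as a machine-readable triple. Namespace and property are illustrative.
from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.org/")
g = Graph()

film = URIRef(EX["Moonrise_Kingdom"])
g.add((film, EX["filmingLocation"], Literal("Rhode Island")))

# A machine can now answer "where was it shot?" without parsing prose.
for _, _, location in g.triples((film, EX["filmingLocation"], None)):
    print(location)
```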
In 2012, WMF launched Wikidata, its own structured database. In addition to Wikipedia, WMF operates a number of other sites like Wiktionary, Wikinews, Wikispecies, and Wikibooks. Like Wikipedia, these sites are available in many languages, each more or less independent from the others. To solve redundancy issues and to promote resource sharing, the Wikimedia Commons was introduced in 2004 as a central location for images and other media for all WMF projects. Wikidata works on a similar premise with data. Its initial task was to centralize inter-wiki links, which connect, for example, the English article "Cat" to the Portuguese "Gato" and Swedish "Katt." Inter-language links had previously been handled locally, creating links at the bottom of an article to its counterparts at every other applicable version. Since someone adding links to the Tagalog Wikipedia is not likely to speak Swedish, and because someone who speaks Swedish is not likely to actively edit the Tagalog Wikipedia and vice versa, this process frequently resulted in inaccurate translations, broken links, one-way connections, and other complications. Wikidata helps by acting as a single junction for each topic.

A topic, or an item, on Wikidata is given its own page which includes an identification number. Users can then add a list of alternative terms for the same item and a brief description in every language. Items also receive statements connecting values and properties. For example, The Beatles' 1964 album A Hard Day's Night is item Q182518. The item links to the album's Wikipedia articles in 49 languages and includes 17 statements with properties and values. The very common instance of property has the value "album," a property called record label has the value "Parlophone Records," and four statements connect the property genre with "rock and roll," "beat music," "pop music," and "rock music." Other statements describe its recording location, personnel, language, and chronology, and many applicable properties are not yet filled in. Like Wikipedia, Wikidata is an open community project and anybody can create or modify statements.
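An item's statements can be thought of as a simple mapping from properties to one or more values, as in the sketch below for the A Hard Day's Night example. Property names are written out for readability; on Wikidata itself they are identifiers such as P31 for "instance of."

```python
# Minimal sketch of a Wikidata item as property -> values statements.
item = {
    "id": "Q182518",
    "label": "A Hard Day's Night",
    "statements": {
        "instance of": ["album"],
        "record label": ["Parlophone Records"],
        "genre": ["rock and roll", "beat music", "pop music", "rock music"],
    },
}

# Multiple (even conflicting) values are simply extra entries in each list,
# which is how the neutrality principle discussed below is accommodated.
for prop, values in item["statements"].items():
    print(prop, "->", ", ".join(values))
```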
Other properties given to items include names, stage names, pen names, dates, birth dates, death dates, demographics, genders, professions, geographic coordinates, addresses, manufacturers, alma maters, spouses, running mates, predecessors, affiliations, capitals, awards won, executives, parent companies, taxonomic orders, and architects, among many others. So as to operate according to the core Wikipedia tenet of neutrality, multiple conflicting values are allowed. Property-value pairs can furthermore be assigned their own property-value pairs, such that the record sales property and its value can have the qualifier as of and another value to reflect when the sales figure was accurate. Each property-value pair along the way can be assigned references akin to cited sources on Wikipedia.

Some Wikipedia metadata is easy to locate and parse as fundamental elements of wiki technology: timestamps, usernames, and article titles, for example. Other data is incidental, like template parameters. Design elements that would otherwise be repeated in many articles are frequently copied into a separate template which can then be invoked when relevant, using parameters to customize it for the particular page on which it is displayed. For example, in the top-right corner of articles about books there is typically a neatly formatted table called an infobox which includes standardized information input as template parameters like author, illustrator, translator, awards received, number of pages, Dewey decimal classification, and ISBN number. A fundamental part of DBPedia, and the second goal for Wikidata, is the collection of data based on these relatively few structured fields that exist in Wikipedia.

Standardizing the factual information in Wikipedia holds incredible potential for research. Wikidata and DBPedia, used in conjunction with the Wikipedia database, make it possible to, for example, assess article coverage of female musicians as compared to male musicians in different parts of the world. Since they use machine-readable formats, they can also interface with one another and with many other sources like GeoNames, Library of Congress Subject Headings, Internet Movie Database, MusicBrainz, and Freebase, allowing for richer, more complex queries. Likewise, just as these can be used to support Wikipedia research, Wikipedia can be used to support other forms of research and even enhance commercial products. Google, Facebook, IBM, and many others regularly make use of data from Wikipedia and Wikidata in order to improve search results or provide better answers to questions. By creating points of informational intersection and interpretation for hundreds of languages, Wikidata also has potential for use in translation applications and to enhance cultural education. The introduction of Wikidata in 2012, built on an already impressively large knowledge base, and its ongoing development have opened many new areas for exploration and accelerated the pace of experimentation, incorporating the data into many areas of industry, research, education, and entertainment.

Cross-References

▶ Anonymity
▶ Crowdsourcing
▶ Open Data
▶ Semantic Web

Further Reading

Jemielniak, D. (2014). Common knowledge: An ethnography of Wikipedia. Stanford: Stanford University Press.
Krötzsch, M., et al. (2007). Semantic Wikipedia. Web Semantics: Science, Services and Agents on the World Wide Web, 5(4), 251–261.
Leetaru, K. (2012). A big data approach to the humanities, arts, and social sciences: Wikipedia's view of the world through supercomputing. Research Trends, 30, 17–30.
Stefaner, M., et al. (n.d.). Notability – Visualizing deletion discussions on Wikipedia. http://www.notabilia.net/
Viégas, F., et al. (2004). Studying cooperation and conflict between authors with history flow visualizations. Paper presented at CHI 2004, Vienna.
queries. Likewise, just as these can be used to
W

World Bank

Jennifer Ferreira
Centre for Business in Society, Coventry University, Coventry, UK

The World Bank, part of the World Bank Group established in 1944, is the international financial institution responsible for promoting economic development and reducing poverty. The World Bank has two key objectives: to end extreme poverty by reducing the proportion of the world's population living on less than $1.25 a day and to promote shared prosperity by fostering income growth in the lowest 40% of the population.

A core activity for the World Bank is the provision of low-interest loans and zero- to low-interest grants to developing countries. This could be to support a wide range of activities from education and health care to infrastructure, agriculture, or natural resource management. In addition to the financial support, the World Bank provides policy advice, research, analysis, and technical assistance to various countries in order to inform its own investments and ultimately to work toward its key objectives. Part of its activities relate to the provision of tools to research and address development challenges, some of which are in the form of providing access to data, for example, the Open Data website, which includes a comprehensive range of downloadable data sets related to different issues. This shows its recognition of the demand for access to quantitative data to inform development strategies (Lehdonvirta and Ernkvist 2011).

A significant amount of the data hosted and disseminated by the World Bank is drawn from national statistical organizations, and it recognizes that the quality of global data is therefore reliant on the capacity and effectiveness of these national statistical organizations. The World Bank has ten key principles with respect to its statistical activities (in line with the Fundamental Principles of Official Statistics and the Principles Governing International Statistical Activities of the United Nations Statistical Division): quality, innovation, professional integrity, partnership, country ownership, client focus, results, fiscal responsibility, openness, and good management.
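The Open Data site mentioned above is backed by a public Indicators API, which makes the same series programmatically accessible. The sketch below is illustrative only: the indicator code shown (a poverty headcount series) and the query parameters should be checked against the current data catalog before being relied on.

```python
# Minimal sketch: pull one development indicator from the World Bank
# Indicators API. Indicator code and parameters are illustrative.
import requests

url = "https://api.worldbank.org/v2/country/all/indicator/SI.POV.DDAY"
resp = requests.get(url, params={"format": "json", "date": "2010", "per_page": 500})
resp.raise_for_status()

metadata, rows = resp.json()  # the JSON response is [metadata, records]
for row in rows[:5]:
    print(row["country"]["value"], row["date"], row["value"])
```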
The world is now experiencing unprecedented capacity to generate, store, process, and interact with data (McAfee and Brynjolfsson 2012), a phenomenon that has been recognized by the World Bank, like other international institutions. For the World Bank, data is seen as critical for the design, implementation, and evaluation of efficient and effective development policy recommendations. In 2014, Jim Yong Kim, the President of the World Bank, discussed the importance of efforts to invest in infrastructure, including data systems. Big data is recognized as a new advancement which has the potential to enhance efforts to address development, although the Bank recognizes there are a series of challenges associated with this. In 2013, the World Bank hosted an event where over 150 experts, data scientists, civil society groups, and development practitioners met to analyze various forms of big data and consider how it could be used to tackle development issues. The event was a public acknowledgement of how the World Bank viewed the importance of expanding awareness of how big data can help combine various data sets to generate knowledge which can in turn foster development solutions.

A report produced in conjunction with the World Bank, Big Data in Action for Development, highlights some of the potential ways in which big data can be used to work toward development objectives and some of the challenges associated with doing so. The report sets out a conceptual framework for using big data in the development sector, highlighting the potential transformative capacity of big data, particularly in relation to raising awareness, developing understanding, and contributing to forecasting.

Using big data to develop and enhance awareness of different issues has been widely acknowledged. Examples of this include using demographic data in Afghanistan to detect impacts of small-scale violence outbreaks, using social media content to indicate unemployment rises or crisis-related stress, or using tweets to recognize where cholera outbreaks were appearing at a much faster rate than was recognized in official statistics. This ability to gain awareness of situations, experiences, and sentiments is seen to have the potential to reduce reaction times and improve processes which deal with such situations.

Big data can also be used to develop understanding of societal behaviors (LaValle et al. 2011). Examples include investigation of Twitter data to explore the relationship between food and fuel price tweets and changes in official price indexes in Indonesia; after the 2010 earthquake in Haiti, mobile phone data was used to track population displacement after the event, and satellite rainfall data was used in combination with qualitative data sources to understand how rainfall affects migration.

Big data is also seen to have potential for contributing to modelling and forecasting. Examples include the use of GPS-equipped vehicles in Stockholm to provide real-time traffic assessments, which are used in conjunction with other data sets such as weather to make traffic predictions, and the use of mobile phone data to predict mobility patterns.

The World Bank piloted some activities in Central America to explore the potential of big data to impact on development agendas. This region has historically experienced low frequencies of data collection for traditional data forms, such as household surveys, and so other forms of data collection were viewed as particularly important. One of these pilot studies used Google Trends data to explore the potential to forecast price changes for commodities. Another study, in conjunction with the UN Global Pulse, explored the use of social media content to analyze public perceptions of policy reforms, in particular a gas subsidy reform in El Salvador, highlighting the potential for this form of data to complement other studies on public perception (United Nations Global Pulse 2012).

The report from the World Bank, Big Data in Action for Development, presents a matrix of different ways in which big data could be used in transformational ways toward the development agenda: using mobile data (e.g., reduced mobile phone top-ups as an indicator of financial stress), financial data (e.g., increased understanding of customer preferences), satellite data (e.g., to crowdsource information on damage after an earthquake), internet data (e.g., to collect daily prices), and social media data (e.g., to track parents' perceptions of vaccination). The example of examining the relationship between food and fuel prices and corresponding changes in official price index measures using Twitter data (by the UN Global Pulse Lab) is outlined in detail, explaining how it was used to provide an indication of social and economic conditions in Indonesia. This was done by extracting tweets mentioning food and fuel prices between 2011 and 2013 (around 100,000 relevant tweets after filtering for location and language) and analyzing these against corresponding changes from official data sets. The analysis indicated a clear relationship between official food inflation statistics and the number of tweets about food price increases. This study was cited as an example of how big data could be used to analyze public sentiment, in addition to objective economic conditions. The examples mentioned here are just some of the activities undertaken by the World Bank to embrace the world of big data.
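The mechanics of such a comparison can be sketched in a few lines: monthly counts of food-price tweets are set against an official food inflation series and correlated. The numbers below are made up solely to show the mechanics; they are not the UN Global Pulse data.

```python
# Minimal sketch: correlate monthly food-price tweet counts with official
# food inflation. Values are synthetic, for illustration only.
import pandas as pd

monthly = pd.DataFrame(
    {
        "food_price_tweets": [310, 280, 455, 620, 590, 740],
        "official_food_inflation": [4.1, 3.9, 5.2, 6.8, 6.5, 7.4],
    },
    index=pd.period_range("2012-01", periods=6, freq="M"),
)

corr = monthly["food_price_tweets"].corr(monthly["official_food_inflation"])
print(f"Pearson correlation: {corr:.2f}")  # a correlation, not causation
```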
As with many other international institutions which recognize the potential uses for big data, the World Bank also recognizes there are a range of challenges associated with the generation, analysis, and use of big data.

One of the most basic challenges for many organizations (and individuals) is gaining access to data, from both government institutions and the private sector. A new ecosystem needs to be developed where data is made openly available and sharing incentives are in place. It is acknowledged by the World Bank that international agencies will need to address this challenge not only by promoting the availability of data but by promoting collaboration and mechanisms for sharing data. In particular, a shift in business models will be required in order to ensure the private sector is willing to share data, and governments will need to design policy mechanisms to ensure the value of big data is captured and shared across departments. Related to this, there need to be considerations of how to engage the public with this data.

Thinking particularly about the development agenda at the heart of the World Bank, there is a paradox: countries where poverty is high or where development agendas require the most attention are often countries where data infrastructures or technological systems are insufficient. Because the generation of big data relies largely on technological capabilities, relying on those who use or interact with digital sources may be systematically unrepresentative of the larger population that forms the focus of the research.

The ways in which data are recorded have implications for the results which are interpreted. Where data is passively recorded, there is less potential for bias in the results generated; where data is actively recorded, there is greater potential for the results to be susceptible to selection bias. Furthermore, processing data into a more structured form from often very large and unstructured data sets requires expertise to both clean the data and, where necessary, aggregate it (e.g., if one set of data is collected every hour and another every day). The medium through which data is collected is also an important factor to consider: mobile phones, for example, produce highly sensitive data, satellite images produce highly unstructured data, and social media platforms produce a lot of unstructured text which requires filtering and codifying, which in itself requires specific analytic capabilities.

In order to make effective use of big data, those using it also need to consider elements of the data itself. The generation of big data has been driven by advances in technology, yet these advances are not alone sufficient to be able to understand the results which can be gleaned from big data. Transforming vast data sets into meaningful results requires effective human capabilities. Depending on how the data is generated, and by whom, there is scope for bias and therefore misleading conclusions. With large amounts of data, there is a tendency for patterns to be observed where there may be none; because of its nature, big data can give rise to significant statistical correlations, and it is important to remember that correlation does not imply causation. Just because a large amount of data is available does not necessarily mean it is the right data for the question or issue being investigated.

The World Bank acknowledges that for big data to be made effective for development, there will need to be collaboration between practitioners, social scientists, and data scientists in order to ensure that the understanding of real-world conditions, data generation mechanisms, and methods of interpretation are effectively combined. Beyond this, there will need to be cooperation between public and private sector bodies in order to foster greater data sharing and incentivize the use of big data across different sectors. Even when data has been accessed, on nearly all occasions it needs to be filtered and made suitable for analysis. Filters require human input and need to be applied carefully, as their use may preclude information and affect the results. Data also needs to be cleaned. Mobile data is received
in unstructured form as millions of files, which require time-intensive processing to obtain data suitable for analysis. Likewise, analysis of text from social media requires a decision-making process to filter out suitable search terms.

Finally, there are a series of concerns about how privacy is ensured with big data, given that often there are elements of big data which can be sensitive in nature (either to the individual or commercially). This is made more complicated as each country will have different regulations about data privacy, which poses particular challenges for institutions working across national boundaries, like the World Bank.

For the World Bank, the use of big data is seen to have potential for improving and changing the international development sector. Underpinning the ideas of the World Bank's approach to big data is the recognition that while the technological capacities for generation, storage, and processing of data continue to develop, these also need to be accompanied by institutional capabilities to enable big data analysis to contribute to effective actions that can contribute to development, whether this is through strengthening of warning systems, raising awareness, or developing understanding of social systems or behaviors.

The World Bank has begun to consider an underlying conceptual framework around the use of big data, in particular considering the challenges it presents in terms of using big data for development. In the report Big Data in Action for Development, it is acknowledged that there is great potential for big data to provide a valuable input for designing effective development policy recommendations but also that big data is no panacea (Coppola et al. 2014). The World Bank has made clear efforts to engage with the use of big data and has begun to explore areas of clear potential for big data use. However, questions remain about how it can support countries to take ownership and create, manage, and maintain their own data, contributing to their own development agendas in effective ways.

Cross-References

▶ Bank of America
▶ Citigroup Inc
▶ International Development
▶ United Nations
▶ United Nations Global Pulse
▶ World Health Organization

Further Reading

Coppola, A., Calvo-Gonzalez, O., Sabet, E., Arjomand, N., Siegel, R., Freeman, C., & Massarat, N. (2014). Big data in action for development. Washington, DC: World Bank and Second Muse. http://live.worldbank.org/sites/default/files/Big%20Data%20for%20Development%20Report_final%20version.pdf.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–31.
Lehdonvirta, V., & Ernkvist, M. (2011). Converting the virtual economy into development potential: Knowledge map of the virtual economy. InfoDev/World Bank White Paper, 1, 5–17.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–66.
United Nations Global Pulse. (2012). Big data for development: Challenges & opportunities. New York: UN Global Pulse.
Zappos

Jennifer J. Summary-Smith
Culver-Stockton College, Canton, MO, USA

As one of the largest online retailers of shoes, Zappos (derived from the Spanish word zapatos, meaning shoes) is a company that is setting an innovative trend in customer service and management style. According to Zappos' website, one of its primary goals is to provide the best online service. The company envisions a world where online customers will make 30% of all retail transactions in the United States. Zappos hopes to be the company that leads the market in online sales, setting itself apart from other online retail competitors by offering the best customer service and selection.

History of the Company

Zappos was founded in 1999 by Nick Swinmurn, who developed the idea for the company while walking around a mall in San Francisco, California, looking for a pair of shoes. After spending an hour in the mall searching from store to store for the right color and shoe size, he left the mall empty-handed and frustrated. Upon arriving home, Swinmurn turned to the Internet to continue his search for his preferred shoes, which again was unsuccessful. Swinmurn realized that there were no major online retailers specializing in shoes. It was at this point that Swinmurn decided to quit his full-time job and start an online shoe retailer named Zappos. Over time the company has evolved, focusing on making the speed of its customers' online purchases central to its business model. In order to achieve this, Zappos' warehouses stock everything it sells. As the company grew, it reached new heights in 2009 when Zappos and Amazon joined forces, combining their passion for strong customer service. Since then, Zappos has grown significantly and restructured into ten separate companies.

Security Breach

Unfortunately, Zappos has not been without a few missteps. In 2012, the company experienced a security breach, compromising as many as 24 million customers. Ellen Messmer reports that cyberhackers successfully gained access to the company's internal network and systems. To address this security breach, Zappos CEO Tony Hsieh announced that existing customer passwords would be terminated as a result of the breach. Still, the cyberhackers likely gained access to names, phone numbers, the last four digits of credit card numbers, cryptographically scrambled passwords, email, billing information, and shipping addresses. After Zappos CEO Tony Hsieh posted an open letter explaining the breach and how the company would head off resulting
problems, there were mixed responses to how the company had handled the situation. As part of its response to the breach, the company sent out emails informing its customers of the problem and urging them to change their passwords. Zappos also provided an 800-number phone service to help its customers through the process of choosing a new password.

However, some experts familiar with the online industry have criticized the moves by Zappos. In an article by Ellen Messmer, she interviewed an Assistant Professor of Information Technology from the University of Notre Dame, who argued that the response strategy by Zappos was not appropriate. Professor John D'Arcy posits that the company's decision to terminate customers' passwords promotes a panic mode, creating a sense of panic in its customers. In contrast, other analysts claim that Zappos' public response to the situation was the right move, communicating with its customers publicly.

Nevertheless, Zappos did a good job of getting information about the security breach out to the public as soon as possible, according to Professor John D'Arcy, and this typically benefits the customers, creating favorable reactions. In terms of the cost of security breaches, the Ponemon Institute estimates that, on average, a data breach costs $277 per compromised record.

Lawsuits

After the security breach, dozens of lawsuits were filed. Zappos attempted to send the lawsuits to arbitration, citing its user agreement. In the fall of 2012, a federal court struck down Zappos.com's user agreement, according to Eric Goldman. Eric Goldman is a professor of law at Santa Clara University School of Law who writes about Internet law, intellectual property, and advertising law. He states that Zappos made mistakes that are easily avoidable. The courts typically divide user agreements into one of three groups: "clickwraps" or "click-through agreements," "browsewraps," and "clearly not a contract." Eric Goldman argues that click-through agreements are effective in courts, unlike browsewraps. Browsewraps are user agreements that bind users simply for browsing the website. The courts ruled that Zappos presented its user agreement as a browsewrap. Furthermore, Zappos claimed on its website that the company reserved the right to amend the contract whenever it saw fit. Despite other companies using this language online, it is detrimental to a contract. The courts ruled that, because Zappos can amend the terms of the user agreement at any time, the arbitration clause is susceptible to change as well, which makes the clause unenforceable. Eric Goldman posits that the court ruling left Zappos in a bad position because all of its risk management provisions are ineffective. In other words, losing the contract left Zappos without the following: its waiver of consequential damages, its disclaimer of warranties, its clause restricting class actions in arbitration, and its reduced statute of limitations. Conversely, companies that use click-through agreements and remove clauses that state they can amend the contract unilaterally are in a better legal position, according to Eric Goldman.

Holacracy

Zappos CEO Tony Hsieh announced in November 2013 that his company would be implementing the management style known as Holacracy. With Holacracy, there are two key elements that Zappos will follow: distributed authority and self-organization. According to an article by Nicole Leinbach-Reyhle, distributed authority allows employees to evolve the organization's structure by responding to real-world circumstances. In regard to self-organization, employees have the authority to engage in useful action to express their purpose as long as it does not "violate the domain of another role." There is a common misunderstanding that Holacracy is nonhierarchical when in fact it is strongly hierarchical, distributing power within the organization. This approach to management creates an atmosphere where employees can speak up, evolving into leaders rather than followers. Zappos CEO Tony Hsieh states that he is trying to structure Zappos less like a bureaucratic corporation and
more like a city, resulting in increased productivity and innovation. To date, with 1,500 employees, Zappos is the largest company to adopt the management model, Holacracy.

Innovation

The work environment at Zappos has become known for its unique corporate culture, which incorporates fun and humor into daily work. As stated on Zappos.com, the company has a total of ten core values: "deliver WOW through service, embrace and drive change, create fun and a little weirdness, be adventurous, creative, and open-minded, pursue growth and learning, build open and honest relationships with communication, build a positive team and family spirit, do more with less, be passionate and determined, and be humble." Nicole Leinbach-Reyhle writes that Zappos' values help to encourage its employees to think outside of the box.

To date, Zappos is a billion-dollar online retailer, expanding beyond selling shoes. The company is also making waves in its corporate culture and hierarchy. Additionally, information technology plays a huge role in the corporation, serving its customers and the business. Based upon the growing success of Zappos, it is keeping true to its mission statement "to provide the best customer service possible." It is evident that Zappos will continue to make positive changes for the corporation and its corporate headquarters in Las Vegas. In 2013, Zappos CEO Tony Hsieh committed $350 million to rebuild and renovate the downtown Las Vegas region. As Sara Corbett notes in her article, he hopes to change the area into a start-up fantasyland.

Cross-References

▶ Bureau of Consumer Protection: Data Breach
▶ Legal Issues
▶ Small Business Enterprises

Further Reading

Corbett, S. (n.d.). How Zappos' CEO turned Las Vegas into a startup fantasyland. http://www.wired.com/2014/01/zappos-tony-hsieh-las-vegas/
Goldman, E. (n.d.). How Zappos' user agreement failed in court and left Zappos legally naked. http://www.forbes.com/sites/ericgoldman/2012/10/10/how-zappos-user-agreement-failed-in-court-and-left-zappos-legally-naked/. Accessed Jul 2014.
Leinbach-Reyhle, N. (n.d.). Shedding hierarchy: Could Zappos be setting an innovative trend? http://www.forbes.com/sites/nicoleleinbachreyhle/2014/07/15/shedding-hierarchy-could-zappos-be-setting-an-innvoative-trend/. Accessed Jul 2014.
Messmer, E. (n.d.). Zappos data breach response a good idea or just panic mode? Online shoe and clothing retailer Zappos has taken assertive steps after breach, but is it enough? http://www.networkworld.com/article/2184860/malware-cybercrime/zappos-data-breach-response-a-good-idea-or-just-panic-mode-.html. Accessed Jul 2014.
Ponemon Group. (n.d.). 2013 cost of data breach study: Global analysis. http://www.ponemon.org. Accessed Jul 2014.
Zappos. (n.d.). http://www.zappos.com. Accessed Jul 2014.
Zillow

Matthew Pittman and Kim Sheehan
School of Journalism & Communication, University of Oregon, Eugene, OR, USA

Overview and Business Model

Like most industries, real estate is undergoing dynamic shifts in the age of big data. Real estate information, once in the hands of a few agents or title companies, is being democratized for any and all interested consumers. What were previously physical necessities (real estate agents, showings, and physical homes) are being obsolesced by digital platforms like Zillow. Real estate developers can use technology to track how communities flow and interact with one another, which will help build smarter, more efficient neighborhoods in the future. The companies that succeed in the future will be the ones who, like Zillow, find innovative, practical, and valuable ways to navigate and harness the massive amounts of data that are being produced in and around their field.

Founded in Seattle in 2005, Zillow is a billion-dollar real estate database that uses big data to help consumers learn about home prices, rent rates, market trends, and more. It provides estimates for most housing units in the United States. It acquired its closest competitor, Trulia, in 2014 for $3.5 billion, and it is the most-viewed real estate destination in the country. Now with Trulia, it accounts for 48% of Web traffic for real estate listings, though that number is diminished to around 15% if individual realtor sites and local MLS (multiple listing service) listings are factored in. The company's chief economist Stan Humphries created a tool that processes 1.2 million proprietary statistical models three times per week on the county and state real estate data it is constantly gathering. In 2011, the company shifted from an in-house computer cluster to renting space in the Amazon cloud to help with the massive computing load.

On the consumer side, Zillow is a web site or mobile app that is free to use. Users can enter a city or zip code and search, filtering out home types, sizes, or prices that are undesirable. There are options to see current homes for sale, recently sold properties, foreclosures, rental properties, and even Zillow "zestimates" (the company's signature feature) of a home's current value based on similar homes in the area, square footage, amenities, and more. Upon clicking on a house of interest, the user can see a real estate agent's description of the home, how long it has been on the market (along with any price fluctuations), as well as photos, similarly priced nearby houses, proposed mortgage rates on the home, the agents associated with it, the home's sale history, and facts and features.

Zillow makes money from real estate firms and agents that advertise through the site and by providing subscriptions to real estate professionals.
It can charge more for ads that appear during a search for homes in Beverly Hills than in Bismarck, North Dakota. Some 57,000 agents spend an average of $4,000 every year for leads to get new buyers and sellers. Zillow keeps a record of how many times a listing has been viewed, which may help in negotiating the price among agents, buyers, and sellers. Real estate agents can subscribe to silver, gold, or platinum programs to get CRM (customer relationship management) tools, their photo in listings, a web site, and more. Basic plans start at 10 dollars a month.

Zillow's mortgage marketplace also earns the company revenue. Potential homebuyers can find and engage with mortgage brokers and firms. The mortgage marketplace tells potential buyers what their monthly payment would be and how much they can afford, and lets them submit loan requests and get quotes from various lenders. In the third quarter of 2013, Zillow's mortgage marketplace received 5.9 million loan requests from borrowers (more than in all of 2011), which grew its revenue stream 120% to $5.7 million. A majority of Zillow's revenue comes from the real estate segment that lets users browse homes for sale and for rent; this earned over $35 million in 2013's third quarter.
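The "what would my monthly payment be" feature mentioned above rests on the standard fixed-rate amortization formula. The sketch below is illustrative only; the loan figures are hypothetical and the function is not Zillow's implementation.

```python
# Minimal sketch of the fixed-rate amortization formula behind a monthly
# mortgage payment estimate. Figures are illustrative.
def monthly_payment(principal, annual_rate, years):
    """Monthly payment for a fully amortizing fixed-rate loan."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # total number of payments
    if r == 0:
        return principal / n
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

print(round(monthly_payment(300_000, 0.045, 30), 2))  # roughly 1520 per month
```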
Analysts and shareholders have voiced some concerns over Zillow's business model. Zillow now spends over 70% of its revenues on sales and marketing, as opposed to 33% for LinkedIn and between 21% and 23% for IBM and Microsoft. Spending money on television commercials and online ads for its services seems to have diminishing returns for Zillow, which is spending more and more on marketing for the same net profit. What once seemed like a sure-fire endeavor (making money by connecting customers to agents through relevant and concise management of huge amounts of data) is no longer a sure thing. Zillow will have to continually evolve its business model if it is to stay afloat.

Zillow and the Real Estate Industry

Zillow has transformed the real estate industry by finding new and practical ways to make huge amounts of data accessible to common people. Potential buyers no longer need to contact a real estate agent before searching for homes; they can start a detailed search on just about any house in the country from their own mobile or desktop device. This is empowering for consumers, but it shakes up an industry that has long relied on human agents. These agents made it their business to know specific areas, learn the ins and outs of a given community, and then help connect interested buyers to the right home. Sites that give users a tool to peer into huge amounts of data (like Zillow) are useful to a point, but some critics feel only a human being who is local and present in a community can really serve potential buyers.

Because it takes an aggregate of multiple national and MLS listing sites, Zillow is rarely perfect. Any big data computing service that works with offline or subjective entities (and real estate prices certainly fit this description) will have to make logical (some would say illogical) leaps where information is scarce. When Zillow does not have exact or current data on a house or neighborhood, it "guesses." When prices come in too high, sellers have unrealistic expectations of the potential price of their home; buyers, too, may end up paying more for a home than it is actually worth.

A human expert, the real estate agent, has traditionally been the expert in this area, yet people are still surprised when too much stock is put into an algorithm. Zillow zestimates tend to work best for midrange homes in an area where there are plenty of comparable houses. Zestimates are less accurate for low- and high-end homes because there are fewer comps (comparable houses for sale or recently sold). Similarly, zestimates of rural, unique, or fixer-upper homes are difficult to gauge. Local MLS sites may have more detail on a specific area, but Zillow has broader, more general information over a larger area. The company estimates its coverage of American homes to be around 57%.
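A naive comparable-sales ("comps") estimate of the kind alluded to above can be sketched in a few lines. This is purely illustrative, with made-up sale data, and is far simpler than Zillow's proprietary models.

```python
# Minimal sketch: estimate a home's value from the median price per square
# foot of nearby comparable sales. Data is hypothetical.
from statistics import median

comps = [(415_000, 1_900), (452_000, 2_100), (398_000, 1_850), (470_000, 2_300)]

price_per_sqft = median(price / sqft for price, sqft in comps)
subject_sqft = 2_000
estimate = price_per_sqft * subject_sqft
print(f"Estimated value: ${estimate:,.0f}")
```

With few or unrepresentative comps, the median is unstable, which is one concrete reason zestimates degrade for rural, unique, or very high-end homes.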
Real estate data is more difficult to come by in some areas. Texas doesn't provide public records of housing transaction prices, so Zillow had to access sales data from property databases through real estate brokers. Because of the high number of
cooperative buildings, New York City is another difficult area in which to gauge real estate prices. Tax assessments are made on the co-ops, not the individual units, which negates that factor in zestimate calculations. Additional information, like square footage or amenities, is also difficult to come by, forcing Zillow to seek out alternative sources.

Of course, zestimates can be accurate as well. As previously noted, when the house is midrange and in a neighborhood with plenty of comps (and thus plenty of data), zestimates can be very good indicators of the home's actual worth. As Zillow zestimates, and the sources from which they draw factoring information, continue to evolve, the service may continue growing in popularity. The more popular Zillow becomes, the more incentive real estate agents will have to list all of their housing database information with the service. Agents know that, in a digital society, speed is key: 74% of buyers and 76% of sellers will work with the first agent with whom they talk.

Recently, Zillow has recognized a big shift to mobile: about 70% of Zillow's usage now occurs on mobile platforms. This trend is concurrent with other platforms' shift to mobile usage; Facebook, Instagram, Zynga, and others have begun to recognize and monetize users' access from smartphones and tablets. For real estate, this mobile activity is about more than just convenience: users can find information on homes in real time as they drive around a neighborhood, looking directly at the potential homes, and contact the relevant agent before they get home. This sort of activity bridges the traditional brick-and-mortar house hunting of the past with the instant big data access of the future (and increasingly, the present). Zillow has emerged as a leader in its field of real estate by connecting its customers, not just to big data but to the right data at the right time and places.

Cross-References

▶ Data-Driven Marketing
▶ Digitization
▶ E-Commerce
▶ Real Estate/Housing
▶ Utilities Industry

Further Readings

Arribas-Bel, D. (2014). Accidental, open and everywhere: Emerging data sources for the understanding of cities. Applied Geography, 49, 45–53.
Cranshaw, J., Schwartz, R., Hong, J. I., & Sadeh, N. M. (2012). The livelihoods project: Utilizing social media to understand the dynamics of a city. In ICWSM.
Hagerty, J. R. (2007). How good are Zillow's estimates? Wall Street Journal.
Huang, H., & Tang, Y. (2012). Residential land use regulation and the US housing price cycle between 2000 and 2009. Journal of Urban Economics, 71(1), 93–99.
Wheatley, M. (n.d.). Zillow-Trulia merger will create boundless new big data opportunities. http://siliconangle.com/blog/2014/07/31/zillow-trulia-merger-will-create-boundless-new-big-data-opportunities/. Accessed Sept 2014.
AgInformatics

Andrea De Montis1, Giuseppe Modica2 and Claudia Arcidiacono3
1Dipartimento di Agraria, University of Sassari, Sassari, Sardinia, Italy
2Dipartimento di Agraria, Università degli Studi Mediterranea di Reggio Calabria, Reggio Calabria, Italy
3Dipartimento di Agricoltura, Alimentazione e Ambiente, University of Catania, Catania, Italy

Synonyms

E-agriculture; Precision agriculture; Precision farming

Definition

The term stems from the blending of the two words agriculture and informatics and refers to the application of informatics to the analysis, design, and development of agricultural activities. It overarches expressions such as Precision Agriculture (PA), Precision Livestock Farming (PLF), and Agricultural landscape analysis and planning. The adoption of AgInformatics can accelerate agricultural development by providing farmers and decision makers with more accessible, complete, timely, and accurate information. However, it is still hindered by a number of important yet unresolved issues, including big data handling, multiple data sources and limited standardization, data protection, and lack of optimization models. Development of knowledge-based systems in the farming sector would require key components supported by the Internet of things (IoT), data acquisition systems, ubiquitous computing and networking, machine-to-machine (M2M) communications, effective management of geospatial and temporal data, and ICT-supported cooperation among stakeholders.

Generalities

This relatively new expression derives from a combination of the two terms agriculture and informatics, hence alluding to the application of informatics to the analysis, design, and development of agricultural activities. It broadly involves the study and practice of creating, collecting, storing and retrieving, manipulating, classifying, and sharing information concerning both natural and engineered agricultural systems. The domains of application are mainly agri-food and environmental sciences and technologies, while sectors include biosystems engineering, farm management, crop production, and environmental monitoring. In this respect, it encompasses the management of the information coming from applications and advances of information and communication technologies (ICTs) in agriculture (e.g., global navigation satellite system, GNSS; remote sensing, RS; wireless sensor networks, WSN; and radio-frequency identification, RFID) and performed through specific agriculture information systems, models, and methodologies (e.g., farm management information systems, FMIS; GIScience analyses; Data Mining; decision support systems, DSS).
remote sensing, RS; wireless sensor networks, WSN; and radio-frequency identification, RFID) and performed through specific agriculture information systems, models, and methodologies (e.g., farm management information systems, FMIS; GIScience analyses; Data Mining; decision support systems, DSS).

AgInformatics is an umbrella concept that includes and overlaps issues covered in precision agriculture (PA), precision livestock farming (PLF), and agricultural landscape analysis and planning, as follows.

Precision Agriculture (PA)

PA was coined in 1929 and later defined as “a management strategy that uses information technologies to bring data from multiple sources to bear on decisions associated with crop production” (Li and Chung 2015). The concept evolved since the late 1980s due to new fertilization equipment, dynamic sensing, crop yield monitoring technologies, and GNSS technology for automated machinery guidance.

Therefore, PA technology has provided farmers with the tools (e.g., built-in sensors in farming machinery, GIS tools for yield monitoring and mapping, WSNs, satellite and low-altitude RS by means of unmanned aerial systems (UAS), and recently robots) and information (e.g., weather, environment, soil, crop, and production data) needed to optimize and customize the timing, amount, and placement of inputs including seeds, fertilizers, pesticides, and irrigation, activities that were later applied also inside closed environments, buildings, and facilities, such as for protected cultivation.

To accomplish the operational functions of a complex farm, FMISs for PA are designed to manage information about processes, resources (materials, information, and services), procedures and standards, and characteristics of the final products (Sørensen et al. 2010). Nowadays dedicated FMISs operate on networked online frameworks and are able to process a huge amount of data. The execution of their functions implies the adoption of various management systems, databases, software architectures, and decision models. Relevant examples of information management between different actors are supply chain information systems (SCIS) including those specifically designed for traceability and supply chain planning.

Recently, PA has evolved to predictive and prescriptive agriculture. Predictive agriculture regards the activity of combining and using a large amount of data to improve knowledge and predict trends, whereas prescriptive agriculture involves the use of detailed, site-specific recommendations for a farm field. Today PA embraces new terms such as precision citrus farming, precision horticulture, precision viticulture, precision livestock farming, and precision aquaculture (Li and Chung 2015).

Precision Livestock Farming (PLF)

The increase in activities related to livestock farming triggered the definition of the new term precision livestock farming (PLF), namely, the real-time monitoring technologies aimed at managing the smallest manageable production unit’s temporal variability, known as “the per animal approach” (Berckmans 2004). PLF consists in the real-time gathering of data related to livestock animals and their close environment, applying knowledge-based computer models, and extracting useful information for automatic monitoring and control purposes. It implies monitoring animal health, welfare, behavior, and performance and the early detection of illness or a specific physiological status and unfolds in several activities including real-time analysis of sounds, images, and accelerometer data, live weight assessment, condition scoring, and online milk analysis. In PLF, continuous measurements and a reliable prediction of variation in animal data or animal response to environmental changes are integrated in the definition of models and algorithms that allow for taking control actions (e.g., climate control, feeding strategies, and therapeutic decisions).

Agricultural Landscape Analysis and Planning

Agricultural landscape analysis and planning is increasingly based on the development of interoperable spatial data infrastructures (SDIs) that
integrate heterogeneous multi-temporal spatial datasets and time-series information.

Nearly all agricultural data has some form of spatial component, and GISs make it possible to visualize information that might otherwise be difficult to interpret (Pierce and Clay 2007).

Land use/land cover (LU/LC) change detection methods are widespread in several research fields and represent an important issue dealing with the modification analysis of agricultural uses. In this framework, RS imagery plays a key role and involves several steps dealing with the classification of continuous radiometric information remotely surveyed into tangible information, often exposed as thematic maps in GIS environments, and that can be utilized in conjunction with other data sets. Among classification techniques, object-based image analysis (OBIA) is one of the most powerful techniques and gained popularity since the early 2000s in extracting meaningful objects from high-resolution RS imagery.

Proprietary data sources are integrated with social data created by citizens, i.e., volunteered geographic information (VGI). VGI includes crowdsourced geotagged information from social networks (often provided by means of smart applications) and geospatial information on the Web (GeoWeb). Spatial decision support systems (SDSSs) are computer-based systems that help decision makers in the solution of complex problems, such as in agriculture, land use allocation, and management. SDSSs implement diverse forms of multi-criteria decision analysis (MCDA). GIS-based MCDA can be considered as a class of SDSS. Implementing GIS-MCDA within the World Wide Web environment can help to bridge the gap between the public and experts and favor public participation.

Conclusion

Technologies have the potential to change modes of producing agri-food and livestock. ICTs can accelerate agricultural development by providing more accessible, complete, timely, or accurate information at the appropriate moment to decision makers. Concurrently, management concepts, such as PA and PLF, may play an important role in driving and accelerating adoption of ICT technologies. However, the application of PA solutions has been slow due to a number of important yet unresolved issues including big data handling, limited standardization, data protection, and lack of optimization models and depends as well on infrastructural conditions such as availability of broadband internet in rural areas. The adoption of FMISs in agriculture is hindered by barriers connected to poor interfacing, interoperability and standardized formats, and dissimilar technological equipment adoption. Development of knowledge-based systems in the farming sector would require key components, supported by IoT, data acquisition systems, ubiquitous computing and networking, M2M communications, effective management of geospatial and temporal data, traceability systems along the supply chain, and ICT-supported cooperation among stakeholders. Recent designs and prototypes using cloud computing and the future Internet generic enablers for inclusion in FMIS have recently been proposed and lay the groundwork for future applications. A modification, which is underway, from proprietary tools to Internet-based open systems supported by cloud hosting services will enable a more effective cooperation between actors of the supply chain. One of the limiting factors in the adoption of SCIS is a lack of interoperability, which would require implementation of virtual supply chains based on the virtualization of physical objects such as containers, products, and trucks. Recent and promising developments of the spatial decision-making deal with the interaction and the proactive involvement of the final users, implementing the so-called collaborative or participative Web-based GIS-MCDA systems. Computer science and IT advances affect the developments of RS in agriculture, leading to the need for new methods and solutions to the challenges of big data in a cloud computing environment.

Cross-References

▶ Agriculture, Forestry, Fishery, Hunting
▶ Cloud
▶ Data Processing
▶ Information Technology
▶ Radio-Frequency Identification (RFID)
▶ Satellite Imagery/Remote Sensing
▶ Semantic Web
▶ Sensor Technologies
▶ Spatial Analytics
▶ Spatial Data
▶ Volunteered Geographic Information (VGI)

Further Readings

Berckmans, D. (2004). Automatic on-line monitoring of animals by precision livestock farming. In Proceedings of the ISAH conference on animal production in Europe: The way forward in a changing world. Saint-Malo, pp. 27–31.
Li, M., & Chung, S. (2015). Special issue on precision agriculture. Computers and Electronics in Agriculture, 112, 1.
Pierce, F. J., & Clay, D. (Eds.). (2007). GIS applications in agriculture. Boca Raton: CRC Press Taylor and Francis Group.
Sørensen, C. G., Fountas, S., Nash, E., Pesonen, L., Bochtis, D., Pedersen, S. M., Basso, B., & Blackmore, S. B. (2010). Conceptual model of a future farm management information system. Computers and Electronics in Agriculture, 72(1), 37–47.

Big Data Quality

Subash Thota
Synectics for Management Decisions, Inc., Arlington, VA, USA

Introduction

Data is the most valuable asset for any organization. Yet in today’s world of big and unstructured data, more information is generated than can be collected and properly analyzed. The onslaught of data presents obstacles to creating data-driven decisions. Data quality is an essential characteristic of data that determines the reliability of data for making decisions in any organization or business. Errors in data can cost a company millions of dollars, alienate customers, and make implementing new strategies difficult or impossible (Redman 1995).

In practically every business instance, project failures and cost overruns are due to fundamental misunderstanding about the data quality that is essential to the initiative. A global data management survey by PricewaterhouseCoopers of 600 companies across the USA, Australia, and Britain showed that 75% of reported significant problems were a result of data quality issues, with 33% of those saying the problems resulted in delays in getting new business intelligence (BI) systems running or in having to scrap them altogether (Capehart and Capehart 2005). The importance and complexity related to data and its quality compounds incrementally and could potentially challenge the very growth of the business that acquired the data. This paper is intended to showcase challenges related to data quality and approaches to mitigating data quality issues.

Data Defined

Data is “. . . language, mathematical or other symbolic surrogates which are generally agreed upon to represent people, objects, events and concepts” (Liebenau and Backhouse 1990). Vayghan et al. (2007) argued that most enterprises deal with three types of data: master data, transactional data, and historical data. Master data are the core data entities of the enterprise, i.e., customers, products, employees, vendors, suppliers, etc. Transactional data describe an event or transaction in an organization, such as sales orders, invoices, payments, claims, deliveries, and storage records. Transactional data is time bound and changes to historical data once the transaction has ended. Historical data contain facts, as of a certain point in time (e.g., database snapshots), and version information.

Data Quality

Data quality is the capability of data to fulfill and satisfy the stated business, framework, system and technical requirements of an enterprise. A classic
definition of data quality is “fitness for use,” or more specifically, the extent to which some data successfully serve the purposes of the user (Tayi and Ballou 1998; Cappiello et al. 2003; Lederman et al. 2003; Watts et al. 2009).

To be able to correlate data quality issues to business impacts, we must be able to both classify our data quality expectations as well as our business impact criteria. In order to do that, it is valuable to understand these common data quality dimensions (Loshin 2006):

– Completeness: Is all the requisite information available? Are data values missing, or in an unusable state? In some cases, missing data is irrelevant, but when the information that is missing is critical to a specific business process, completeness becomes an issue.
– Conformity: Are there expectations that data values conform to specified formats? If so, do all the values conform to those formats? Maintaining conformance to specific formats is important in data representation, presentation, aggregate reporting, search, and establishing key relationships.
– Consistency: Do distinct data instances provide conflicting information about the same underlying data object? Are values consistent across data sets? Do interdependent attributes always appropriately reflect their expected consistency? Inconsistency between data values plagues organizations attempting to reconcile different systems and applications.
– Accuracy: Do data objects accurately represent the “real-world” values they are expected to model? Incorrect spellings of products, personal names or addresses, and even untimely or not current data can impact operational and analytical applications.
– Duplication: Are there multiple, unnecessary representations of the same data objects within your data set? The inability to maintain a single representation for each entity across your systems poses numerous vulnerabilities and risks.
– Integrity: What data is missing important relationship linkages? The inability to link related records together may actually introduce duplication across your systems. Not only that, as more value is derived from analyzing connectivity and relationships, the inability to link related data instances together impedes this valuable analysis.

Causes and Consequences

The “Big Data” era comes with new challenges for data quality management. Beyond volume, velocity, and variety lies the importance of the fourth “V” of big data: veracity. Veracity refers to the trustworthiness of the data. Due to the sheer volume and velocity of some data, one needs to embrace the reality that when data is extracted from multiple datasets at a fast and furious clip, determining the semantics of the data – and understanding correlations between attributes – becomes of critical importance.

Companies that manage their data effectively are able to achieve a competitive advantage in the marketplace (Sellar 1999). On the other hand, bad data can put a company at a competitive disadvantage (Greengard 1998). It is therefore important to understand some of the causes of bad data quality:

• Lack of data governance standards or validation checks.
• Data conversion, which usually involves transfer of data from an existing data source to a new database.
• Increasing complexity of data integration and enterprise architecture.
• Unreliable and inaccurate sources of information.
• Mergers and acquisitions between companies.
• Manual data entry errors.
• Upgrades of infrastructure systems.
• Multidivisional or line-of-business usage of data.
• Misuse of data for purposes different from the capture reason.

Different people performing the same tasks have a different understanding of the data being processed, which leads to inconsistent data making its way into the source systems. Poor data
quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits (Friedman and Smith 2011). Marsh (2005) summarizes the consequences in one of his articles:

• Eighty-eight percent of all data integration projects either fail completely or significantly overrun their budgets.
• Seventy-five percent of organizations have identified costs stemming from dirty data.
• Thirty-three percent of organizations have delayed or canceled new IT systems because of poor data.
• $611B per year is lost in the USA to poorly targeted bulk mailings and staff overheads.
• According to Gartner, bad data is the number one cause of customer-relationship management (CRM) system failure.
• Less than 50% of companies claim to be very confident in the quality of their data.
• Business intelligence (BI) projects often fail due to dirty data, so it is imperative that BI-based business decisions are based on clean data.
• Only 15% of companies are very confident in the quality of external data supplied to them.
• Customer data typically degenerates at 2% per month or 25% annually.

According to Marsh, organizations typically overestimate the quality of their data and underestimate the cost of data errors. Business processes, customer expectations, source systems and compliance rules are constantly changing – and data quality management systems must reflect this. Vast amounts of time and money are spent on custom coding and “firefighting” to dampen an immediate crisis rather than dealing with the long-term problems that bad data can present to an organization.

Data Quality: Approaches

Due to the large variety of sources from which data is collected and integrated, and to its sheer volume and changing nature, it is impossible to manually specify data quality rules. Below are a few approaches to mitigating data quality issues:

1. Enterprise Focus and Discipline

Enterprises should be more focused and engaged toward data quality issues; views toward data cleansing must evolve. Clearly defining roles and outlining the authority, accountability and responsibility for decisions regarding enterprise data assets provides the necessary framework for resolving conflicts and driving a business forward as the data-driven organization matures. Data quality programs are most efficient and effective when they are implemented in a structured, governed environment.

2. Implementing MDM and SOA

The goal of a master data management (MDM) solution is to provide a single source of truth of data, thus providing a reliable foundation for that data across the organization. This prevents business users across an organization from using different versions of the same data. Another approach of big data and big data governance is the deployment of cloud-based models and service-oriented architecture (SOA). SOA enables the tasks associated with a data quality program to be deployed as a set of services that can be called dynamically by applications. This allows business rules for data quality enforcement to be moved outside of applications and applied universally at a business process level. These services can either be called proactively by applications as data is entered into an application system, or by batch after the data has been created.

3. Implementing Data Standardization and Data Enrichment

Data standardization usually covers reformatting of user-entered data without any loss of information or enrichment of information. Such solutions are most suitable for applications that integrate data. Data enrichment covers the reformatting of data with additional enrichment or addition of useful referential and analytical information.

Data Quality: Methodology in Profiling

Data profiling provides a proactive way to manage and comprehend an organization’s data. Data profiling is explicitly about discovering and reviewing the underlying data available to determine the characteristics, patterns, and essential statistics about the data. Data profiling is an important diagnostic phase that furnishes quantifiable and tangible facts about the strength of the organization’s data. These facts not only help in establishing what data is available in the organization but also how accurate, valid, and usable the data is. Data profiling covers numerous techniques and processes:

– Data Ancestry: This covers the lineage of the dataset. It describes the source from which the data is acquired or derived and the method of acquisition.
– Data Accuracy: This is the closeness of the attribute data associated with an object or feature, to the true value. It is usually recorded as the percentage correctness for each topic or attribute.
– Data Latency: This is the level at which the data is current or accurate to date. This can be measured by having appropriate data reconciliation procedures to gauge any unintended delays in acquiring the data due to technical issues.
– Data Consistency: This is the fidelity or integrity of the data within data structures or interfaces.
– Data Adherence: This is a measure of compliance or adherence of the data to the intended standards or logical rules that govern the storage or interpretation of data.
– Data Duplicity: This is a measure of duplicate records or fields in the system that can be consolidated to reduce maintenance costs and improve the efficiency of the system storage processes.
– Data Completeness: This is a measure of the correspondence between the real world and the specified dataset.

In assessing a dataset for veracity, it is important to answer core questions about it:

• Do the patterns of the data match expected patterns?
• Do the data adhere to appropriate uniqueness and null value rules?
• Are the data complete?
• Are they accurate?
• Do they contain information that is easily understood and unambiguous?
• Do the data adhere to specified required key relationships across columns and tables?
• Are there inferred relationships across columns, tables, or databases?
• Are there redundant data?

Data in an enterprise is often derived from different sources, resulting in data inconsistencies and nonstandard data. Data profiling helps analysts dig deeper to look more closely at each of the individual data elements and establish which data values are inaccurate, incomplete, or ambiguous. Data profiling allows analysts to link data in disparate applications based on their relationships to each other or to a new application being developed. Different pieces of relevant data spread across many individual data stores make it difficult to develop a complete understanding of an enterprise’s data. Therefore, data profiling helps one understand how data sources interact with other data sources.

Metadata

Metadata is used to describe the characteristics of a data field in a file or a table and contains information that indicates the data type, the field length, whether the data should be unique, and if a field can be missing or null. Pattern matching determines if the data values in a field are in the likely format. Basic statistics about data such as minimum and maximum values, mean, median, mode, and standard deviation can provide insight into the characteristics of the data.

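To make the profiling techniques and basic metadata statistics described above more concrete, here is a brief sketch using the pandas library; the table, column names, and expected ZIP-code pattern are hypothetical and stand in for whatever source system is being profiled:

```python
import pandas as pd

# Hypothetical customer extract; in practice this would be read from a source system.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "zip_code":    ["30301", "3030", "30309", None],
    "balance":     [250.0, 99.5, 99.5, 1200.0],
})

profile = {}
for col in df.columns:
    profile[col] = {
        "null_count": int(df[col].isna().sum()),          # completeness
        "distinct":   int(df[col].nunique(dropna=True)),  # uniqueness / duplication
    }

# Conformity: share of non-null values matching an expected 5-digit ZIP pattern.
profile["zip_code"]["pattern_match_rate"] = float(
    df["zip_code"].dropna().str.fullmatch(r"\d{5}").mean()
)

# Basic statistics for a numeric field (minimum, maximum, mean, median, standard deviation).
stats = df["balance"].describe()
profile["balance"].update(
    {"min": stats["min"], "max": stats["max"], "mean": stats["mean"],
     "median": stats["50%"], "std": stats["std"]}
)

print(profile)
```

Output of this kind is what a profiling exercise feeds back to analysts: which fields violate null or uniqueness expectations, how often values fail an expected pattern, and whether the numeric distributions look plausible.
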
Conclusion

Ensuring data quality is one of the most pressing challenges today for most organizations. With applications constantly receiving new data and undergoing incremental changes, achieving data quality cannot be a onetime event. As organizations’ appetite for big data grows daily in their quest to satisfy customers, suppliers, investors, and employees, the common impediment is data quality. Improving data quality is the lynchpin to a better enterprise, better decision-making, and better functionality.

Data quality can be improved, and there are methods for doing so that are rooted in logic and experience. On the market are commercial off-the-shelf (COTS) products which are simple, intuitive methods to manage and analyze data – and establish business rules for an enterprise. Some can implement a data quality layer that filters any number of sources for quality standards; provide real-time monitoring; and enable the profiling of data prior to absorption and aggregation with a company’s core data. At times, however, it will be necessary to bring in objective, third-party subject-matter experts for an impartial analysis and solution of an enterprise-wide data problem.

Whatever path is chosen, it is important for an organization to have a master data management (MDM) plan no differently than it might have a recruiting plan or a business development plan. A sound MDM creates an ever-present return on investment (ROI) that saves time, reduces operating costs, and satisfies both clients and stakeholders.

Further Readings

Capehart, B. L., & Capehart, L. C. (2005). Web based energy information and control systems: Case studies and applications, 436–437.
Cappiello, C., Francalanci, C., & Pernici, B. (2003). Time-related factors of data quality in multi-channel information systems. Journal of Management Information Systems, 20(3), 71–91.
Friedman, T., & Smith, M. (2011). Measuring the business value of data quality (Gartner ID# G00218962). Available at: http://www.data.com/export/sites/data/common/assets/pdf/DS_Gartner.pdf
Greengard, S. (1998). Don’t let dirty data derail you. Workforce, 77(11), 107–108.
Knolmayer, G., & Röthlin, M. (2006). Quality of material master data and its effect on the usefulness of distributed ERP systems. Lecture Notes in Computer Science, 4231, 362–371.
Lederman, R., Shanks, G., & Gibbs, M. R. (2003). Meeting privacy obligations: The implications for information systems development. Proceedings of the 11th European Conference on Information Systems. Paper presented at ECIS: Naples, Italy.
Liebenau, J., & Backhouse, J. (1990). Understanding information: An introduction. Information systems. London: Palgrave Macmillan.
Loshin, D. (2006). The data quality business case: Projecting return on investment (White paper). Available at: http://knowledge-integrity.com/Assets/data_quality_business_case.pdf
Marsh, R. (2005). Drowning in dirty data? It’s time to sink or swim: A four-stage methodology for total data quality management. Database Marketing & Customer Strategy Management, 12(2), 105–112. Available at: http://link.springer.com/article/10.1057/palgrave.dbm.3240247.
Redman, T. C. (1995). Improve data quality for competitive advantage. MIT Sloan Management Review, 36(2), 99–109.
Sellar, S. (1999). Dust off that data. Sales and Marketing Management, 151(5), 71–73.
Tayi, G. K., & Ballou, D. P. (1998). Examining data quality. Communications of the ACM, 41(2), 54–57.
Vayghan, J. A., Garfinkle, S. M., Walenta, C., Healy, D. C., & Valentin, Z. (2007). The internal information transformation of IBM. IBM Systems Journal, 46(4), 669–684.
Watts, S., Shankaranarayanan, G., & Even, A. (2009). Data quality assessment in context: A cognitive perspective. Decision Support Systems, 48(1), 202–211.

Core Curriculum Issues (Big Data Research/Analysis)

Rochelle E. Tractenberg
Collaborative for Research on Outcomes and Metrics, Washington, DC, USA
Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics; and Rehabilitation Medicine, Georgetown University, Washington, DC, USA

Definition

A curriculum is defined as the material and content that comprises a course of study within a school or college, i.e., a formal teaching program. The construct of “education” is differentiated from “training” based on the existence of a curriculum, through which a learner must progress in an evaluable, or at least verifiable, way. In this sense, a fundamental issue about a “big data curriculum” is what exactly is meant by the expression. “Big data” is actually not a sufficiently concrete construct to support a curriculum, nor even the integration of one or more courses into an existing curriculum. Therefore, the principal “core curriculum issue” for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced through the curriculum. A second core issue is how to appropriately integrate those key knowledge, skills, and abilities (KSAs) into the curricula of those who will not obtain degrees or certificates in disciplines related to big data – but for whom training or education in these KSAs is still desired or intended. A third core issue is how to construct the curriculum – whether the degree is directly related to big data or some key KSAs relating to big data are proposed for integration into another curriculum – in such a way that it is evaluable. Since the technical attributes of big data and its management and analysis are evolving nearly constantly, any curriculum developed to teach about big data must be evaluated periodically (e.g., annually) to ensure that what is being taught is relevant; this suggests that core underpinning constructs must be identified so that learners in every context can be encouraged to adapt to new knowledge rather than requiring retraining or reeducation.

Role of the Curriculum in “Education” Versus “Training”

Education can be differentiated from training by the existence of a curriculum in the former and its absence in the latter. The Oxford English Dictionary defines education as “the process of educating or being educated, the theory and practice of teaching,” whereas training is defined as “teaching a particular skill or type of behavior through regular practice and instruction.” The United Nations Educational, Scientific and Cultural Organization (UNESCO) highlights the fact that there may be an
articulated curriculum (“intended”) but the curriculum that is actually delivered (“implemented”) may differ from what was intended. There are also the “actual” curriculum, representing what students learn, and the “hidden” curriculum, which comprises all the bias and unintended learning that any given curriculum achieves (http://www.unesco.org/new/en/education/themes/strengthening-education-systems/quality-framework/technical-notes/different-meaning-of-curriculum/). These types of curricula are also described by the Netherlands Institute for Curriculum Development (SLO, http://international.slo.nl/) and worldwide in multiple books and publications on curriculum development and evaluation.

When a curriculum is being developed or evaluated with respect to its potential to teach about big data, each of these dimensions of that curriculum (intended, implemented, actual, hidden) must be considered. These features, well known to instructors and educators who receive formal training to engage in the kindergarten–12th grade (US) or preschool/primary/secondary (UK/Europe) education, are less well known among instructors in tertiary/higher education settings whose training is in other domains – even if their main job will be to teach undergraduate, graduate, postgraduate, and professional students. It may be helpful, in the consideration of curricular elements around big data, for those in the secondary education/college/university setting to consider what attributes characterize the curricula that their incoming students have experienced relating to the same content or topics.

Many modern researchers in the learning domains reserve the term “training” to mean “vocational training.” For example, Gibbs et al. (2004) identify training as specifically “skills acquisition” to be differentiated from instruction (“information acquisition”); together with socialization and the development of thinking and problem-solving skills, this information acquisition is the foundation of education overall. The vocational training is defined as a function of skills or behaviors to be learned (“acquired”) by practice in situ. When considering big data trainees, defined as individuals who participate in any training around big data that is outside of a formal curriculum, it is important to understand that there is no uniform cognitive schema, nor other contextual support, that the formal curriculum typically provides. Thus, it can be helpful to consider “training in big data” as appropriate for those who have completed a formal curriculum in data-related domains. Otherwise, skills that are acquired in such training, intended for deployment currently and specifically, may actually limit the trainees’ abilities to adapt to new knowledge, and thereby, lead to a requirement for retraining or reeducation.

Determining the Knowledge, Skills, and Abilities Relating to Big Data That Should Be Taught

The principal core curricular issue for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced through the curriculum. As big data has become an increasingly popular construct (since about 2010), different stakeholders in the education enterprise have articulated curricular objectives in computer science, statistics, mathematics, and bioinformatics for undergraduate (e.g., De Veaux et al. 2017) and graduate students (e.g., Greene et al. 2016). These stakeholders include longstanding national or international professional associations and new groups seeking to establish either their own credibility or to define the niche in “big data” where they plan to operate. However, “big data” is not a specific domain that is recognized or recognizable; it has been described as a phenomenon (Boyd and Crawford 2012) and is widely considered not to be a domain for training or education on its own. Instead, knowledge, skills, and abilities relating to big data are conceptualized as belonging to the discipline of data science; this discipline is considered as existing at the intersection of mathematics, computer science, and statistics. This is practically implemented as the articulation of foundational aspects of each of these disciplines together with their formal and purposeful integration into a formal curriculum.

With respect to data science, then, generally, there is agreement that students must develop abilities to reason with data and to adapt to a changing environment, or changing characteristics of data (preferably both). However, there is not agreement on how to achieve these abilities. Moreover, because existing undergraduate course requirements are complex and tend to be comprehensive for “general education” as well as for the content making up a baccalaureate, associate, or other terminal degree in the postsecondary context, in some cases just a single course may be considered for incorporation into either required or elective course lists. This would represent the least coherent integration of big data into a college/university undergraduate curriculum. The construction of a program that would award a certificate, minor, or major – if it seeks to successfully prepare students for work in or with big data, or statistics and data science, or analytics, or of other programs intended to train or prepare people for jobs that either focus on, or simply “know about,” big data – must follow the same curricular design principles that every formal educational enterprise should follow. If they do not, they risk underperforming on their advertising and promises.

It is important to consider the role of training in the development, or consideration of development, of curricula that feature big data. In addition to the creation of undergraduate degrees and minors, Master’s degrees, post-baccalaureate certificate programs, and doctoral programs, all of which must be characterized by the curricula they are defined and created to deliver, many other “training” opportunities and workforce development initiatives also exist. These are being developed in corporate and other human resource-oriented domains, as well as in more open (open access) contexts. Unlike traditional degree programs, training and education around big data are unlikely to be situated specifically within a single disciplinary context – at least not exclusively. People who have specific skills, or who have created specific tools, often create free or easily accessible representations of the skills or tool – e.g., instructional videos on YouTube or as formal courses of varying lengths that can be read (slides, documentation) or watched as webinars. Examples can be found online at sites including Big Data University (bigdatauniversity.com), created by IBM and freely available, and Coursera (coursera.org), which offers data science, analytics, and statistics courses as well as eight different specializations, comprising curated series of courses – but also many other topics. Coursera has evolved many different educational opportunities and some curated sequences that can be completed to achieve “certification,” with different costs depending on the extent of student engagement/commitment. The Open University (www.open.ac.uk) is essentially an online version of regular university courses and curricula (and so is closer to “education” than “training”) – degree and certificate programs all have costs associated and also can be considered to follow a formal curriculum to a greater extent than any other option for widely accessible training/learning around big data. These examples represent a continuum that can be characterized by the attention to the curricular structure from minimal (Big Data University) to complete (The Open University). The individual who selects a given training opportunity, as well as those who propose and develop training programs, must articulate exactly what knowledge, skills, and abilities are to be taught and practiced. The challenge for individuals making selections is to determine how correctly an instructor or program developer has described the achievements the training is intended to provide. The challenge for those curating or creating programs of study is to ensure that the learning objectives of the curriculum are met, i.e., that the actual curriculum is as high a match to the intended curriculum as possible. Basic principles of curriculum design can be brought to bear for acceptable results in this matching challenge. The stronger the adherence to these basic principles, the more likely a robust and evaluable curriculum, with demonstrable impact, will result. This is not specific to education around big data, but with all the current interest in data and data science,
these challenges rise to the level of “core curriculum issues” for this domain.

Utility of Training Versus a Curriculum Around Big Data

De Veaux et al. (2017) convened a consensus panel to determine the fundamental requirements for an undergraduate curriculum in “data science.” They articulated that the main topical areas that comprise – and must be leveraged for appropriate baccalaureate-level training in – this domain are as follows: data description and curation, mathematical foundations, computational thinking, statistical thinking, data modeling, communication, reproducibility, and ethics. Since computational and statistical thinking, as well as data modeling, all require somewhat different mathematical foundations, this list shows clearly the challenges in selecting specific “training opportunities” to support development of new skills in “big data” for those who are not already trained in quantitative sciences to at least some extent. Moreover, arguments are arising in many quarters (science and society, philosophy/ethics/bioethics, and professional associations like the Royal Statistical Society, American Statistical Association, and Association for Computing Machinery) that “ethics” is not a single entity but, with respect to big data and data science, is a complex – and necessary – type of reasoning that cannot be developed in a single course or training opportunity. The complexity of reasoning that is required for competent work in the domain referred to exchangeably as “data analytics,” “data science,” and “big data,” which includes this ability to reason ethically, underscores the point that piecemeal training will be unsuccessful unless the trainee possesses the ability to organize the new material together with extant (high level) reasoning abilities, or at least a cognitive/mental schema within which the diverse training experiences can be integrated for a comprehensive understanding of the domain.

However, the proliferation of training opportunities around big data suggests a pervasive sense that a formal curriculum is not actually needed – just training is. This may arise from a sense that the technology is changing too fast to create a whole curriculum around it. Training opportunity creators are typically experts in the domain, but may not necessarily be sufficiently expert in teaching and learning theories, or the domains from which trainees are coming, to successfully translate their expertise into effective “training.” This may lead to the development of new training opportunities that appear to be relevant, but which can actually contribute only minimally to an individual trainee’s ability to function competently in a new domain like big data, because they do not also include or provide contextualization or schematic links with prior knowledge.

An example of this problem is the creation of “competencies” by subject matter expert consensus committees, which are then used to create “learning plans” or checklists. The subject matter experts undoubtedly can articulate what competencies are required for functional status in their domain. However, (a) a training experience developed to fill in a slot within a competency checklist often fails to support teaching and learning around the integration of the competencies into regular practice; and (b) curricula created in alignment with competencies often do not promote the actual development and refinement of these competencies. Instead, they may tend to favor the checking-off of “achievement of competency X” from the list.

Another potential challenge arises from the opposite side of the problem, learner-driven training development. “What learners want and need from training” should be considered together with what experts who are actually using the target knowledge, skills, and abilities believe learners need from training. However, the typical trainee will not be sufficiently knowledgeable to choose the training that is in fact most appropriate for their current skills and learning objectives. The construct of “deliberate practice” is instructive here. In their 2007 Harvard Business Review article, “The making of an expert,” Ericsson, Prietula, and Cokely summarize Ericsson’s prior work on expertise and its acquisition, commenting that “(y)ou need a particular kind of practice – deliberate practice – to develop expertise” (emphasis in
original, p. 3). Deliberate practice is practice where weaknesses are specifically identified and targeted – usually by an expert both in the target skillset and perhaps more particularly in identifying and remediating specific weaknesses. If a trainee is not (yet) an expert, determining how best to address a weakness that one has self-identified can be another limitation on the success of a training opportunity, if it focuses on what the learner wants or believes they need without appeal to subject matter experts. This perspective argues for the incorporation of expert opinion into the development, descriptions, and contextualizations of training, i.e., the importance of deliberate practice in the assurance that as much as possible of the intended curriculum becomes the actual curriculum. Training opportunities around big data can be developed to support, or fill in gaps in, a formal curriculum; without this context, training in big data may not be as successful as desired.

Conclusions

A curriculum is a formal program of study, and basic curriculum development principles are essential for effective education in big data – as in any other domains. Knowledge, skills, and abilities, and the levels to which these will be both developed and integrated, must be articulated in order to structure a curriculum to optimize the match between the intended and the actual curricula. The principal core curricular issue for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced. A second core issue is that the “big data” knowledge, skills, and abilities may require more foundational support for training of those who will not obtain, or have not obtained, degrees or certificates in disciplines related to big data. A third core issue is how to construct the curriculum in such a way that the alignment of the intended and the actual objectives is evaluable and modifiable as appropriate. Since the technical attributes of big data and its management and analysis are evolving nearly constantly, any curriculum developed to teach about big data must be evaluated periodically to ensure the relevance of the content; however the alignment of the intended and actual curricula must also be regularly evaluated to ensure learning objectives are achieved and achievable.

Further Readings

Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication, & Society, 15(5), 662–679.
De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4, 2.1–2.16. doi:10.1146/annurev-statistics-060116-053930. Downloaded from http://www.amstat.org/asa/files/pdfs/EDU-DataScienceGuidelines.pdf. 2 Jan 2017.
Ericsson, K. A., Prietula, M. J., & Cokely, E. T. (2007). The making of an expert. Harvard Business Review, 85(7–8), 114–121, 193. Downloaded from https://hbr.org/2007/07/the-making-of-an-expert. 5 June 2010.
Gibbs, T., Brigden, D., & Hellenberg, D. (2004). The education versus training and the skills versus competency debate. South African Family Practice, 46(10), 5–6. doi:10.1080/20786204.2004.10873146.
Greene, A. C., Giffin, K. A., Greene, C. S., & Moore, J. H. (2016). Adapting bioinformatics curricula for big data. Briefings in Bioinformatics, 17(1), 43–50. doi:10.1093/bib/bbv018.

Data Exhaust

Daniel E. O’Leary¹ and Veda C. Storey²
¹Marshall School of Business, University of Southern California, Los Angeles, CA, USA
²J. Mack Robinson College of Business, Georgia State University, Atlanta, GA, USA

Overview

Data exhaust is a type of big data that is often generated unintentionally by users from normal Internet interaction. It is generated in large quantities and appears in many forms, such as the results from web searches, cookies, and temporary files. Initially, data exhaust has limited, or no, direct value to the original data collector. However, when combined with other data for analysis, data exhaust can sometimes yield valuable insights.

Description

Data exhaust is passively collected and consists of random online searches or location data that is generated, for example, from using smart phones with location dependent services or applications (Gupta and George 2016). It is considered to be “noncore” data that may be generated when individuals use technologies that passively emit information in daily life (e.g., making an online purchase, accessing healthcare information, or interacting in a social network). Data exhaust can also come from information-seeking behavior that is used to make inferences about an individual’s needs, desires, or intentions, such as Internet searches or telephone hotlines (George et al. 2014).

Additional Terminology

Data exhaust is also known as ambient data, remnant data, left over data, or even digital exhaust (Mcfedries 2013). A digital footprint or a digital dossier is the data generated from online activities that can be traced back to an individual. The passive traces of data from such activities are considered to be data exhaust. The big data that interests many companies is called “found data.” Typically data is extracted from random Internet searches and location data is generated from smart or mobile phone usage. Data exhaust should not be confused with community data that is generated by users in online social communities, such as Facebook and Twitter.

In the age of big data, one can, thus, view data as a messy collage of data points, which includes found data, as well as the data exhaust extracted from web searches, credit card payments, and mobile devices. These data points are collected for disparate purposes (Harford 2014).

Generation of Data Exhaust

Data exhaust is normally generated autonomously from transactional, locational, positional, text, voice, and other data signatures. It typically is gathered in real time. Data exhaust might not be purposefully collected, or is collected for other purposes and then used to derive insights.

Example of Data Exhaust

An example of data exhaust is backend data. Davidson (2016) provides an example from a real-time transit information application called Transit App. The Transit App provides a travel service to users. The App shows the coming departures of nearby transit services. It also has information on bike share, car share, and other ride services, which appear when the user simply opens the app. The app is intended to be useful for individuals who know exactly where they are going and how to get there, but want real-time information on schedules. The server, however, retains data on the origin, destination, and device data for every search result. The usefulness of this backend data was assessed by comparing the results obtained from using the backend data to predict trips to survey data of actual trips, which revealed a very similar origin-destination pattern.

Sources of Data Exhaust

The origin of data exhaust may be passive, digital, or transactional. Specifically, data exhaust can be passively collected as transactional data from people’s use of digital services such as mobile phones, purchases, web searches, etc. These digital services are then used to create networked sensors of human behavior.

Potential Value

Data exhaust is accessed either directly in an unstructured format or indirectly as backend data. The value of data exhaust often is in its use to improve online experiences and to make predictions about consumer behavior. However, the value of the data exhaust can depend on the particular application and context.

Challenges

There are practical and research challenges to deriving value from data exhaust (technical, privacy and security, and managerial). A major technical challenge is the acquisition of data exhaust. Because it is often generated without the user’s knowledge, this can lead to issues of privacy and security. Data exhaust is often unstructured data for which there is, technically, no known, proven way to consistently extract its potential value from a managerial perspective. Furthermore, data mining and other tools that deal with unstructured data are still at a relatively early stage of development.

From a research perspective, traditionally, research studies of humans have focused on data collected explicitly for a specific purpose. Computational social science increasingly uses data that is collected for other purposes. This can result in the following (Altman 2014):

1. Access to “data exhaust” cannot easily be controlled by a researcher. Although a researcher may limit access to their own data, data exhaust may be available from commercial sources or from other data exhaust sources. This increases the risk that any sensitive information linked with a source of data exhaust can be reassociated with an individual.
2. Data exhaust often produces fine-grained observations of individuals over time. Because of regularities in human behavior, patterns in data exhaust can be used to “fingerprint” an individual, thereby enabling potential reidentification, even in the absence of explicit identifiers or quasi-identifiers.

Evolution

As ubiquitous computing continues to evolve, there will be a continuous generation of data exhaust from sensors, social media, and other sources (Nadella and Woodie 2014). Therefore, the amount of unstructured data will continue to grow and, no doubt, attempts to extract value from data exhaust will grow as well.

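Returning to the Transit App example above, the following sketch suggests how backend search logs of the kind Davidson (2016) describes might be aggregated into an origin–destination table and compared against survey results; all records, place names, and counts here are hypothetical:

```python
from collections import Counter

# Hypothetical backend records retained by the app server: one row per trip search.
search_log = [
    {"origin": "Downtown", "destination": "Airport"},
    {"origin": "Downtown", "destination": "Airport"},
    {"origin": "Midtown",  "destination": "University"},
]

# Hypothetical trip counts from a conventional travel survey for the same period.
survey_counts = Counter({("Downtown", "Airport"): 180, ("Midtown", "University"): 95})

# Build an origin-destination table from the data exhaust.
od_from_exhaust = Counter((r["origin"], r["destination"]) for r in search_log)

# Compare the two sources as shares of total trips, pair by pair.
total_exhaust = sum(od_from_exhaust.values())
total_survey = sum(survey_counts.values())
for pair in sorted(set(od_from_exhaust) | set(survey_counts)):
    print(pair,
          f"exhaust share = {od_from_exhaust[pair] / total_exhaust:.2f}",
          f"survey share = {survey_counts[pair] / total_survey:.2f}")
```

A close match between the two distributions, as Davidson reports, is what suggests that passively retained backend data can substitute for, or supplement, a purpose-built survey.
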
Conclusion

As the demand for capture and use of real-time data continues to grow and evolve, data exhaust may play an increasing role in providing value to organizations. Much communication, leisure, and commerce occur on the Internet, which is now accessible from smartphones, cars, and a multitude of devices (Harford 2014). As a result, activities of individuals can be captured, recorded, and represented in a variety of ways, most likely leading to an increase in efforts to capture and use data exhaust.

Further Readings

Altman, M. (2014). Navigating the changing landscape of information privacy. http://informatics.mit.edu/blog/2014/10/examples-big-data-and-privacy-problems
Bhushan, A. (2013). “Big data” is a big deal for development. In Higgins, K. (Ed.), International development in a changing world, 34. Ottawa: The North-South Institute.
Davidson, A. (2016). Big data exhaust for origin-destination surveys: Using mobile trip-planning data for simple surveying. Proceedings of the 95th Annual Meeting of the Transportation Research Board.
George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of Management Journal, 57(2), 321–326.
Gupta, M., & George, J. F. (2016). Toward the development of a big data analytics capability. Information & Management, 53(8), 1049–1064.
Harford, T. (2014). Big data: A big mistake? Significance, 11(5), 14–19.
Mcfedries, P. (2013). Tracking the quantified self [Technically speaking]. IEEE Spectrum, 50(8), 24–24.
Nadella, A., & Woodie, A. (2014). Data ‘exhaust’ leads to ambient intelligence, Microsoft CEO says. https://www.datanami.com/2014/04/15/data_exhaust_leads_to_ambient_intelligence_microsoft_ceo_says/

D

Data Fusion

Carolynne Hultquist
Geoinformatics and Earth Observation Laboratory, Department of Geography and Institute for CyberScience, The Pennsylvania State University, University Park, PA, USA

Definition/Introduction

Data fusion is a process that joins together different sources of data. The main concept of using a data fusion methodology is to synthesize data from multiple sources in order to create collective information that is more meaningful than if only using one form or type of data. Data from many sources can corroborate information, and, in the era of big data, there is an increasing need to ensure data quality and accuracy. Data fusion involves managing uncertainty and conflicting data at a large scale. The goal of data fusion is to create useful representations of reality that are more complete and reliable than a single source of data.

Integration of Data

Data fusion is a process that integrates data from many sources in order to generate more meaningful information. Data fusion is very domain-dependent, and therefore, tasks and the development of methodologies are dependent on the field for diverse purposes (Bleiholder and Naumann 2008). In general, the intention is to fuse data from many sources in order to increase value. Data from different sources can support each other, which decreases uncertainty in the assessment, or conflict, which raises questions of validity. Castanedo (2013) groups the data fusion field into three major methodological categories of data association, state estimation, and decision fusion. Analyzing the relationships between multiple data sources can help to provide an understanding of the quality of the data as well as identify potential inconsistencies.

Modern technologies have made data easier to collect and more accessible. The development of sensor technologies and the interconnectedness of the Internet of Things (IoT) have linked together an ever-increasing number of sensors and devices which can be used to monitor phenomena. Data is accessible in large quantities, and multiple sources of data are sometimes available for an area of interest. Fusing data from a variety of forms of sensing technologies can open new doors for research and address issues of data quality and uncertainty.

Multisensor data fusion can be done for data collected for the same type of phenomena. For example, environmental monitoring data such as air quality, water quality, and radiation measurements can be compared to other sources and models to test the validity of the measurements that were collected. Geospatial data is fused with data collected in different forms and is sometimes also known in this domain as data integration. Geographical information from such sources as satellite remote sensing, UAVs (unmanned aerial vehicles), geolocated social media, and citizen science data can be fused to give a picture that any one source cannot provide. Assessment of hazards is an application area in which data fusion is used to corroborate the validity of data from many sources. The data fusion process is often able to fill some of the information gaps that exist and could assist decision-makers by providing an assessment of real-world events.
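State estimation, one of the methodological categories noted above, often combines redundant measurements of the same phenomenon. The following sketch fuses a hypothetical calibrated monitoring-station reading with a noisier citizen-sensor reading by inverse-variance weighting, under the assumption that the two measurements are independent and their variances are known; all names and values are illustrative.

```python
def fuse(measurements):
    """Inverse-variance weighted fusion of independent measurements.

    Each measurement is a (value, variance) pair; the fused estimate
    weights precise sources more heavily and reports a combined variance.
    """
    weights = [1.0 / var for _, var in measurements]
    fused_value = sum(w * val for (val, _), w in zip(measurements, weights)) / sum(weights)
    fused_variance = 1.0 / sum(weights)
    return fused_value, fused_variance

# Illustrative PM2.5 readings (micrograms per cubic meter) for one location:
# a calibrated government monitor and a noisier low-cost citizen sensor.
government_station = (35.0, 4.0)    # (value, variance)
citizen_sensor     = (42.0, 25.0)

value, variance = fuse([government_station, citizen_sensor])
print(f"fused estimate: {value:.1f} (variance {variance:.1f})")
```

The fused estimate stays close to the more precise source while still incorporating the second reading, which is the behavior that makes this weighting a common building block in multisensor state estimation.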
Conclusion

The process of data fusion directly seeks to address challenges of big data. The methodologies are directed at considering the veracity of large volumes and many varieties of data. The goal of data fusion is to create useful representations of reality that are more complete and reliable than trusting data that is only from a single source.

Cross-References

▶ Big Data Quality
▶ Big Data Volume
▶ Big Variety Data
▶ Data Integration
▶ Data Veracity
▶ Disaster Planning
▶ Internet of Things (IoT)
▶ Sensor Technologies

Further Readings

Bleiholder, J., & Naumann, F. (2008). Data fusion. ACM Computing Surveys, 41, 1:1–1:41.
Castanedo, F. (2013). A review of data fusion techniques. The Scientific World Journal, 2013, 1–19, Article ID 704504.
M

Middle East

Feras A. Batarseh
College of Science, George Mason University, Fairfax, VA, USA

Synonyms

Mid-East; The Middle East and North Africa (MENA)

Definition

The Middle East is a transcontinental region in Western Asia and North Africa. Countries of the Middle East are ones extending from the shores of the Mediterranean Sea, south towards Africa, east towards Asia, and sometimes beyond, depending on the context (political, geographical, etc.). The majority of the countries of the region speak Arabic.

Introduction

The term “Middle East” has evolved with time. It originally referred to the countries of the Ottoman Empire, but by the mid-twentieth century, a more common definition of the Middle East included the following states (countries): Turkey, Jordan, Cyprus, Lebanon, Iraq, Syria, Israel, Iran, the West Bank and the Gaza Strip (Palestine), Egypt, Sudan, Libya, Saudi Arabia, Kuwait, Yemen, Oman, Bahrain, Qatar, and the United Arab Emirates (UAE). Subsequent political and historical events have tended to include more countries in the mix (such as Tunisia, Algeria, Morocco, Afghanistan, and Pakistan).

The Middle East is often referred to as the cradle of civilization. By studying the history of the region, it is clear why the first human civilizations were established in this part of the world (particularly the Mesopotamia region around the Tigris and Euphrates rivers). The Middle East is where humans made their first transitions from nomadic life to agriculture, invented the wheel, and where the beginnings of the written word first existed. It is well known that this region is an active political, economic, historic, and religious part of the world (Encyclopedia Britannica 2017). For the purposes of this encyclopedia, the focus of this entry is on technology, data, and software of the Middle East.

The Digital Age in the Middle East

Since the beginning of the 2000s, the Middle East has been one of the leading regions in the world in terms of adoption of social media; in certain countries (such as the United Arab Emirates, Qatar, and Bahrain), social technologies have been adopted by 70% of the population (a higher percentage than in the United States). While citizens are
jumping on the bandwagon of social media, governments still struggle to manage, define, or guide the usage of such technologies.

The McKinsey Middle East Digitization Index is one of the main metrics to assess the level and impact of digitization across the Middle East. Only 6% of the Middle Eastern public lives under a digitized smart or electronic government (the UAE, Jordan, Israel, and Saudi Arabia are among the few countries that have some form of e-government) (Elmasri et al. 2016). However, many new technology startups are coming from the Middle East with great success. The most famous technology startup companies coming out of the Middle East include: (1) Maktoob (from Jordan), one that stands out. The company represents a major trophy on the list of Middle Eastern tech achievements. It made global headlines when it was bought by Yahoo, Inc. for $80 million in 2009, symbolizing a worldwide important step by a purely Middle Eastern company. (2) Yamli (from Lebanon): one of the most popular web apps for Arabic speakers today. (3) GetYou (from Israel): a famous social media application. (4) Digikala (from Iran): an online retailer application. (5) ElWafeyat (from Egypt): an Arabic-language social media site for honoring deceased friends and family. (6) Project X (from Jordan): a mobile application that allows for 3D printing of prosthetics, inspired by wars in the region. These examples are assembled from multiple sources; many other exciting projects exist as well (such as Souq, which was acquired by Amazon in 2017, Masdar, Namshi, Sukar, and many others).

Software Arabization: The Next Frontier

The first step towards invoking more technology in a region is to localize the software, content, and its data. Localizing a software system is accomplished by supporting a new spoken language (the Arabic language in this context, hence the name, Arabization). A new term is presented in this entry of the Encyclopedia, Arabization: it is the overall concept that includes the process of making the software available and reliable across the geographical borders of the Arab states. Different spoken languages have different orientations and fall into different groups. Dealing with these groups is accomplished by using different code pages and Unicode fonts. Languages fall into two main families, single-byte (such as French, German, and Polish) and double-byte (such as Japanese, Chinese, and Korean). Another categorization that is more relevant to Middle Eastern languages is based on their orientation. Most Middle Eastern languages are right-to-left (RTL) (such as Arabic and Hebrew), while other world languages are left-to-right (LTR) (such as English and Spanish). For all languages, however, a set of translated strings should be saved in a bundle file that indexes all the strings and assigns them IDs so the software program can locate them and display the right string in the language of the user. Furthermore, to accomplish software Arabization, character encoding should be enabled. The default encoding for a given system is determined by the runtime locale set on the machine’s operating system. The most commonplace character encoding format is UTF (UCS Transformation Format), where UCS is the Universal Character Set defined by the international standard ISO/IEC 10646. UTF has three common forms, UTF-8, UTF-16, and UTF-32, of which UTF-8 is designed to be backward compatible with ASCII. It is important to note that the process of Arabization is not a trivial process; engineers cannot merely inject translated language strings into the system, or hardcode cultural, date, or numerical settings into the software; rather, the process is done by obtaining different files based on the settings of the machine and the desires of the user, and applying the right locales. An Arabization package needs to be developed to further the digital, software, and technological evolution in the Middle East.
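A minimal sketch of the string-bundle lookup described above is shown below; the message IDs, locale codes, and bundle contents are illustrative, and a production Arabization package would load such bundles from resource files selected by the runtime locale rather than hardcoding them.

```python
# -*- coding: utf-8 -*-
# Illustrative string bundles keyed by locale; real systems load these from files.
MESSAGE_BUNDLES = {
    "en_US": {"greeting": "Welcome", "direction": "ltr"},
    "ar_JO": {"greeting": "أهلاً وسهلاً", "direction": "rtl"},
}

def localize(message_id, locale="en_US"):
    """Look up a translated string by its ID, falling back to English."""
    bundle = MESSAGE_BUNDLES.get(locale, MESSAGE_BUNDLES["en_US"])
    return bundle.get(message_id, MESSAGE_BUNDLES["en_US"][message_id])

text = localize("greeting", locale="ar_JO")
print(text)                     # Arabic string; the UI layer renders it right-to-left
print(text.encode("utf-8"))     # UTF-8 bytes: each code point maps to 1-4 bytes
```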
Bridging the Digital Divide

Information presented in this entry showed how the Middle East is speeding towards catching up with industrialized nations in terms of software technology adoption and utilization (i.e., bridging the digital divide between third world and first world countries). Figure 1 shows which countries are investing towards leading that transformation; numbers in the figure illustrate venture capital funding as a share of GDP (Elmasri et al. 2016).
(Middle East, Fig. 1 Middle Eastern investments in technology (Elmasri et al. 2016))
However, according to Cisco’s 2015 Visual Networking Index (VNI), the world is looking towards a new digital divide, beyond software and mobile apps. By 2019, the number of people connecting to the Internet is going to rise to 3.9 billion users, reaching over 50% of the global population. That will accelerate the new wave of big data, machine learning, and the Internet of Things (IoT). That will be the main new challenge for technology innovators in the Middle East. Middle Eastern countries need to first lay the “data” infrastructure (such as the principle of software Arabization presented above) that would enable the people of the Middle East to reach higher adoption rates of future trends (big data and IoT). Such a shift would greatly influence economic growth in countries all across the region; however, the impacts of technology require minimum adoption thresholds before those impacts begin to materialize; the wider the intensity and use of big data, the Internet of Things (IoT), and machine learning, the greater the impacts.

Conclusion

The Middle East is known for many historical and political events, conflicts, and controversies; however, it is not often referred to as a technological and software-startup hub. This entry of the Encyclopedia presents a brief introduction to the Middle East and draws a simple picture about its digitization, and claims that Arabization of software could lead to many advancements across the region and eventually the world – for startups and creativity, the Middle East is an area worth watching (Forbes 2017).
References

Elmasri, T., Benni, E., Patel, J., & Moore, J. (2016). Digital Middle East: Transforming the region into a leading digital economy. McKinsey & Company.
Encyclopedia Britannica. (2017). Available at https://www.britannica.com/place/Middle-East
Forbes reports on the Middle East. (2017). Available at http://www.forbes.com/sites/natalierobehmed/2013/08/22/forget-oil-tech-could-be-the-next-middle-east-goldmine/
S

Sensor Technologies

Carolynne Hultquist
Geoinformatics and Earth Observation Laboratory, Department of Geography and Institute for CyberScience, The Pennsylvania State University, University Park, PA, USA

Definition/Introduction

Sensor technologies are developed to detect specific phenomena, behavior, or actions. The origin of the word sensor comes from the Latin root “sentire,” a verb defined as “to perceive” (Kalantar-zadeh 2013). Sensors are designed to identify certain phenomena as a signal but not record anything else, as that would create noise in the data. Sensors are specified by purpose to identify or measure the presence or intensity of different types of energy: mechanical, gravitational, thermal, electromagnetic, chemical, and nuclear. Sensors have become part of everyday life and continue to grow in importance in modern applications.

Prevalence of Sensors

Sensors are used in everyday life to detect phenomena, behavior, or actions such as force, temperature, pressure, flow, etc. The type of sensor utilized is based on the type of energy that is being sensed, be it gravitational, mechanical, thermal, electromagnetic, chemical, or nuclear. The activity of interest is typically measured by a sensor and converted by a transducer into a signal as a quantity (McGrath and Scanaill 2013). Sensors have been integrated into daily life so that we use them without considering tactile sensors such as elevator buttons, touchscreen devices, and touch-sensing lamps. Typical vehicles contain numerous sensors for driving functions, safety, and the comfort of the passengers. Mechanical sensors measure motion, velocity, acceleration, and displacement through such sensors as strain gauges, pressure, force, ultrasonic, acoustic wave, flow, displacement, accelerometers, and gyroscopes (McGrath and Scanaill 2013). Chemical and thermal biometric sensors are often used for healthcare, from traditional forms like temperature monitoring and blood pressure cuffs to glucose meters, pacemakers, defibrillators, and HIV testing.

New sensor applications are developing which produce individual, home, and environmental data. There are many sensor types that were developed years ago but are finding new applications. Navigational aids, such sensors as gyroscopes, accelerometers, and magnetometers, have existed for many years in flight instruments for aircraft and more recently in smartphones. Sensors internal to smartphone devices are intended to monitor the device but can be repurposed to monitor many things such as extreme exposure to heat or movement for health applications.
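As a small illustration of such repurposing, the sketch below turns hypothetical three-axis accelerometer samples from a phone into a coarse movement-intensity flag; the sampling rate, readings, and threshold are illustrative rather than taken from any real device API.

```python
import math

# Hypothetical 3-axis accelerometer samples (in g) from a smartphone,
# sampled at 50 Hz; real devices expose similar streams through mobile APIs.
samples = [(0.01, -0.02, 1.00), (0.20, 0.05, 1.10),
           (0.60, -0.30, 1.40), (0.05, 0.02, 0.98)]

def movement_intensity(xyz):
    """Deviation of the acceleration magnitude from 1 g (gravity only)."""
    x, y, z = xyz
    return abs(math.sqrt(x * x + y * y + z * z) - 1.0)

# Flag samples whose intensity exceeds an illustrative threshold,
# e.g., to repurpose the sensor as a rough activity monitor.
THRESHOLD = 0.3
for i, s in enumerate(samples):
    if movement_intensity(s) > THRESHOLD:
        print(f"sample {i}: vigorous movement detected")
```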
The interconnected network of devices to promote automation and efficiency is often referred to as the Internet of Things (IoT). Sensors are becoming more prevalent and cheap enough that the public can make use of personal sensors that already exist in their daily lives or can be easily acquired.

Personal Health Monitoring
Health-monitoring applications are becoming increasingly common and produce very large volumes of data. Biophysical processes such as heart rate, breathing rate, sleep patterns, and restlessness can be recorded continuously using devices kept in contact with the body. Health-conscious and athletic communities, such as runners, have particularly taken to personal monitoring by using technology to track their current condition and progress. Pedometers, weight scales, and thermometers are commonplace. Heart rate, blood pressure, and muscle fatigue are now monitored by affordable devices in the form of bracelets, rings, adhesive strips, and even clothing. Brands of smart clothing are offering built-in sensors for heart rate, respiration, skin temperature and moisture, and electrophysiological signals that are sometimes even recharged by solar panels. There are even wireless sensors for the insole of shoes to automatically adjust for the movements of the user in addition to providing health and training analysis.

Wearable health technologies are often used to provide individuals with private personal information; however, certain circumstances call for system-wide monitoring for medical or emergency purposes. Medical patients, such as those with diabetes or hypertension, can use continuously testing glucose meters or blood pressure monitors (Kalantar-zadeh 2013). Bluetooth-enabled devices can transmit data from monitoring sensors and contact the appropriate parties automatically if there are health concerns. Collective health information can be used to gain a better understanding of such health concerns as cardiac issues, extreme temperatures, and even crisis information.

Smart Home
Sensors have long been a part of modern households, from smoke and carbon monoxide detectors to security systems and motion sensors. Increasingly, smart home sensors are being used for everyday monitoring in order to have more efficient energy consumption with smart lighting fixtures and temperature controls. Sensors are often placed to inform on activities in the house such as a door or window being opened. This integrated network of house monitoring promises efficiency, automation, and safety based on personal preferences. There is significant investment in smart home technologies, and big data analysis can play a major role in determining appropriate settings based on feedback.

Environmental Monitoring
Monitoring of the environment from the surface to the atmosphere is traditionally a function performed by the government through remotely sensed observations and broad surveys. Remote sensing imagery from satellites and airborne flights can create large datasets on global environmental changes for use in such applications as agriculture, pollution, water, climatic conditions, etc. Government agencies also employ static sensors and make on-site visits to check sensors which monitor environmental conditions. These sensors are sometimes integrated into networks which can communicate observations to form real-time monitoring systems.

In addition to traditional government sources of environmental data, there are growing collections of citizen science data that are focused primarily on areas of community concern such as air quality, water quality, and natural hazards. Air quality and water quality have long been monitored by communities concerned about pollution in their environment, but a recent development after the 2011 Fukushima nuclear disaster is radiation sensing. Safecast is a radiation monitoring project that seeks to empower people with information on environmental safety and openly distributes measurements under Creative Commons rights (McGrath and Scanaill 2013). Radiation is not visibly observable, so it is considered a “silent” environmental harm, and the risk needs to be considered in light of validated data (Hultquist and Cervone 2017).
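A toy sketch of such validation is shown below: crowdsourced readings for one location are screened against a reference value from an official monitor, in the spirit of the comparison just described; the units, numbers, and tolerance are purely illustrative.

```python
# Hypothetical validation of crowdsourced radiation readings against a
# reference value for the same location (all values are illustrative).
reference_dose_rate = 0.12              # microsieverts/hour from an official monitor
citizen_readings = [0.11, 0.13, 0.10, 0.45, 0.12]

TOLERANCE = 0.10  # accept readings within +/- 0.10 of the reference

validated = [r for r in citizen_readings if abs(r - reference_dose_rate) <= TOLERANCE]
flagged   = [r for r in citizen_readings if abs(r - reference_dose_rate) >  TOLERANCE]

print("validated:", validated)          # [0.11, 0.13, 0.10, 0.12]
print("flagged for review:", flagged)   # [0.45]
```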
Citizen science projects for sensing natural hazards from flooding, landslides, earthquakes, wildfires, etc. have come online with support from both governments and communities. Open-source environmental data is a growing movement as people get engaged with their environment and become more educated about their health.

Conclusion

The development and availability of sensor technologies is a part of the big data paradigm. Sensors are able to produce an enormous amount of data, very quickly, with real-time uploads, and from diverse types of sensors. Many questions still remain of how to use this data and whether connected sensors will lead to smart environments that will be a part of everyday modern life. The Internet of Things (IoT) is envisioned to connect communication across domains and applications in order to enable the development of smart cities. Sensor data can provide useful information for individuals and generalized information from collective monitoring. Services often offer personalized analysis in order to keep people engaged using the application. Yet, most analysis and interest from researchers in sensor data is at a generalized level. Despite mostly generalized data analysis, there is public concern related to data privacy from individual and home sensors. The privacy level of the data is highly dependent on the system used and the terms of service agreement if a service is being provided related to the sensor data.

Analysis of sensor data is often complex, messy, and hard to verify. Nonpersonal data can often be checked or referenced against a comparable dataset to see if it makes sense. However, large datasets produced by personal sensors for such applications as health are difficult to independently verify at an individual level. For example, an environmental event could cause a natural reaction that is medically safe, such as a rapid increase in heart rate when the user is awoken by an earthquake. Individual inspection of data for such noise is fraught with problems, as it is complicated to identify causes in the raw data from an individual, but at a generalized level such data can be valuable for research and can appropriately take into account variations in the data.

Sensor technologies are integrated into everyday life and are used in numerous applications to monitor conditions. The usefulness of technological sensors should be no surprise, as every living organism has biological sensors which serve similar purposes to indicate the regulation of internal functions and conditions of the external environment. The integration of sensor technologies is a natural step that goes from individual measurements to collective monitoring, which highlights the need for big data analysis and validation.

Cross-References

▶ AgInformatics
▶ Air Pollution
▶ Biometrics
▶ Biosurveillance
▶ Crowdsourcing
▶ Drones
▶ Environment
▶ Health Informatics
▶ Land Pollution
▶ Participatory Health and Big Data
▶ Patient-Centered (Personalized) Health
▶ Remote Sensing
▶ Water Pollution

Further Readings

Hultquist, C., & Cervone, G. (2017). Citizen monitoring during hazards: Validation of Fukushima radiation measurements. GeoJournal. http://doi.org/10.1007/s10708-017-9767-x.
Kalantar-zadeh, K. (2013). Sensors: An introductory course (1st ed.). Boston: Springer US.
McGrath, M. J., & Scanaill, C. N. (2013). Sensor technologies: Healthcare, wellness, and environmental applications. New York: Apress Open.
S

“Small” Data

Rochelle E. Tractenberg1,2 and Kimberly F. Sellers3
1Collaborative for Research on Outcomes and Metrics, Washington, DC, USA
2Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics; and Rehabilitation Medicine, Georgetown University, Washington, DC, USA
3Department of Mathematics and Statistics, Georgetown University, Washington, DC, USA

Synonyms

Data; Statistics

Introduction

Big data are often characterized by “the 3 Vs”: volume, velocity, and variety. This implies that “small data” lack these qualities, but that is an incorrect conclusion about what defines “small” data. Instead, we define “small data” to be simply “data” – specifically, data that are finite but not necessarily “small” in scope, dimension, or rate of accumulation. The characterization of data as “small” is essentially dependent on the context and use for which the data are intended. In fact, disciplinary perspectives vary on how large “big data” need to be to merit this label, but small data are not characterized effectively by the absence of one or more of these “3 Vs.” Most statistical analyses require some amount of vector and matrix manipulation for efficient computation in the modern context. Data sets may be considered “big” if they are so large, multidimensional, and/or quickly accumulating in size that the typical linear algebraic manipulations cannot converge or yield true summaries of the full data set. The fundamental statistical analyses, however, are the same for data that are “big” or “small”; the true distinction arises from the extent to which computational manipulation is required to map and reduce the data (Dean and Ghemawat 2004) such that a coherent result can be derived. All analyses share common features, irrespective of the size, complexity, or completeness of the data – the relationship between statistics and the underlying population; the association between inference, estimation, and prediction; and the dependence of interpretation and decision-making on statistical inference. To expand on the lack of distinguishability between “small” data and “big” data, we explore each of these features in turn. By doing so, we expound on the assertion that a characterization of a dataset as “small” depends on the users’ intention and the context in which the data, and results from its analysis, will be used.
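A toy sketch of that map-and-reduce pattern follows, under the assumption that the data already arrive split into partitions: each partition is mapped to a partial summary, and the partials are reduced into one coherent result, so no step ever requires the full data set at once.

```python
from functools import reduce

# Minimal map/reduce sketch in the spirit of Dean and Ghemawat (2004):
# the partitions stand in for chunks of a data set too large to summarize whole.
partitions = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0, 7.0, 9.0, 11.0]]

def map_partition(values):
    return (sum(values), len(values))          # partial (sum, count) for one chunk

def reduce_partials(a, b):
    return (a[0] + b[0], a[1] + b[1])          # combine two partial summaries

total, count = reduce(reduce_partials, map(map_partition, partitions))
print("mean of the full data set:", total / count)   # 48.0 / 9 = 5.333...
```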
Understanding “Big Data” as “Data”

An understanding of why some datasets are characterized as “big” and/or “small” requires some juxtaposition of these two descriptors. “Big data” are thought to expand the boundary of data science because innovation has been ongoing to promote ever-increasing capacity to collect and analyze data with high volume, velocity, and/or variety (i.e., the 3 Vs). In this era of technological advances, computers are able to maintain and process terabytes of information, including records, transactions, tables, files, etc. However, the ability to analyze data has always depended on the methodologies, tools, and technology available at the time; thus the reliance on computational power to collect or process data is not new or specific to the current era and cannot be considered to delimit “big” from “small” data.

Data collection and analyses date back to ancient Egyptian civilizations that collected census information; the earliest Confucian societies collected this population-spanning data as well. These efforts were conducted by hand for centuries, until a “tabulating machine” was used to complete the analyses required for the 1890 United States Census; this is possibly the first time so large a dataset was analyzed with a nonhuman “computer.” Investigations that previously took years to achieve were suddenly completed in a fraction of the time (months!). Since then, technology continues to be harnessed to facilitate data collection, management, and analysis. In fact, when it was suggested to add “data science” to the field of statistics (Bickel 2000; Rao 2001), “big data” may have referred to a data set of up to several gigabytes in size; today, petabytes of data are not uncommon. Therefore, neither the size nor the need for technological advancements are inherent properties of either “big” or “small” data.

Data are sometimes called “big” if the data collection process is fast(-er), not finite in time or amount, and/or inclusive of a wide range of formats and quality. These features may be contrasted with experimental, survey, epidemiologic, or census data where the data structure, timing, and format are fixed and typically finite. Technological advances allow investigators to collect batches of experimental, survey, or other traditional types of data in near-real or real time, or in online or streaming fashion; such information has been incorporated to ask and answer experimental and epidemiologic questions, including testing hypotheses in physics, climate, chemistry, and both social and biomedical sciences, since the technology was developed. It is inappropriate to distinguish “big” from “small” data along these characteristics; in fact, two analysts simultaneously considering the same data set may each perceive it to be “big” or “small”; these labels must be considered to be relative.

Analysis and Interpretation of “Big Data” Is Based on Methods for “Small Data”

Considering analysis, manipulation, and interpretation of data can support a deeper appreciation for the differences and similarities of “big” and “small” data. Large(r) and higher-dimensional data sets may require computational manipulation (e.g., Dean and Ghemawat 2004), including grouping and dimension reduction, to derive an interpretable result from the full data set. Further, whenever a larger/higher-dimension dataset is partitioned for analysis, the partitions or subsets are analyzed using standard statistical methods. The following sections explicate how standard statistical analytic methods (i.e., for “small” data) are applied to a dataset whether it is described as “small” or “big.” These methods are selected, employed, and interpreted specifically to support the user’s intention for the results and do not depend inherently on the size or complexity of the data itself. This underscores the difficulty of articulating any specific criterion/a for characterizing data as “big” or “small.”

Sample Versus Population
Statistical analysis and summarization of “big” data are the same as for data generally; the description, confidence/uncertainty, and coherence of the results may vary with the size and completeness of the data set. Even the largest
and most multidimensional dataset is presumably an incomplete (albeit massive) representation of the entire universe of values – the “population.” Thus, the field of statistics has historically been based on long-run frequencies or computed estimates of the true population parameters. For example, in some current massive data collection and warehousing enterprises, the full population can never be obtained because the data are continuously streaming in and collected. In other massive data sets, however, the entire population is captured; examples include the medical records for a health insurance company, sales on Amazon.com, or weather data for the detection of an evolving storm or other significant weather pattern. The fundamental statistical analyses would be the same for either of these data types; however, they would result in estimates for the (essentially) infinite data set, while actual population-descriptive values are possible whenever finite/population data are obtained. Importantly, it is not the size or complexity of the data that results in either estimation or population description – it is whether or not the data are finite. This underscores the reliance of any and all data analysis procedures on statistical methodologies; assumptions about the data are required for the correct use and interpretation of these methodologies for data of any size and complexity. It further blurs qualifications of a given data set as “big” or “small.”

Inference, Estimation, and Prediction
Statistical methods are generally used for two purposes: (1) to estimate “true” population parameters when only sample information is available, and (2) to make or test predictions about either future results or about relationships among variables. These methods are used to infer “the truth” from incomplete data and are the foundations of nearly all experimental designs and tests of quantitative hypotheses in applied disciplines (e.g., science, engineering, and business). Modern statistical analysis generates results (i.e., parameter estimates and tests of inferences) that can be characterized with respect to how rare they are given the random variability inherent in the data set. In frequentist statistical analysis (based on long-run results), this characterization typically describes how likely the observed result would be if there were, in truth, no relationship between (any) variables, or if the true parameter value was a specific value (e.g., zero). In Bayesian statistical analysis (based on current data and prior knowledge), this characterization describes how likely it is that there is truly no relationship given the data that were observed and prior knowledge about whether such a relationship exists.

Whenever inferences are made about estimates and predictions about future events, relationships, or other unknown/unobserved events or results, corrections must be made for the multitude of inferences that are made, for both frequentist and Bayesian methods. Confidence and uncertainty about every inference and estimate must accommodate the fact that more than one has been made; these “multiple comparisons corrections” protect against decisions that some outcome or result is rare/statistically significant when, in fact, the variability inherent in the data make that result far less rare than it appears. Numerous correction methods exist, with modern (since the mid-1990s) approaches focusing not on controlling for “multiple comparisons” (which are closely tied to experimental design and formal hypothesis testing), but on controlling the “false discovery rate” (which is the rate at which relationships or estimates will be declared “rare given the inherent variability of the data” when they are not, in fact, rare). Decisions made about inferences, estimates, and predictions are classified as correct (i.e., the event is rare and is declared rare, or the event is not rare and is declared not rare) or incorrect (i.e., the event is rare but is declared not rare – a false negative/Type II error; or the event is not rare but is declared rare – a false positive/Type I error); controls for multiple comparisons or false discoveries seek to limit Type I errors.
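The false discovery rate idea can be made concrete with the Benjamini-Hochberg step-up rule; the sketch below is a minimal illustration applied to made-up p-values and is not a substitute for a vetted statistical library.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at false discovery rate q.

    Benjamini-Hochberg step-up rule: sort the p-values, find the largest
    rank k with p_(k) <= (k / m) * q, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# Illustrative p-values from several simultaneous tests.
p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p, q=0.05))   # [0, 1]: the two smallest are "discoveries"
```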
Decisions are made based on the data analysis, which holds for “big” or “small” data. While multiple comparisons corrections and false discovery rate controls have long been accepted as representing competent scientific practice, they are also essential features of the analysis of big data, whether or not these data are analyzed for scientific or research purposes.

Analysis, Interpretation, and Decision Making
Analyses of data are either motivated by theory or prior evidence (“theory-driven”), or they are unplanned and motivated by the data themselves (“data-driven”). Both types of investigations can be executed on data of any size, complexity, or completeness. While the motivations for data analysis vary across disciplines, evidence that supports decisions is always important. Statistical methods have been developed, validated, and utilized to support the most appropriate analysis, given the data and its properties, so that defensible and reproducible interpretations and inferences result. Thus, decisions that are made based on the analysis of data, whether “big” or “small,” are inherently dependent on the quality of the analysis and associated interpretations.

Conclusion

As has been the case for centuries, today’s “big” data will eventually be perceived as “small”; however, the statistical methodologies for analyzing and interpreting all data will also continue to evolve, and these will become increasingly interdependent on the methods for collecting, manipulating, and storing the data. Because of the constant evolution and advancement in technology and computation, the notion of “big data” may be best conceptualized as representing the processes of data collection, storage, and manipulation for interpretable analysis, and not the size, utility, or complexity of the data itself. Therefore, the characterization of data as “small” depends critically on the context and use for which the data are intended.

Further Reading

Bickel, P. J. (2000). Statistics as the information science. Opportunities for the mathematical sciences, 9, 11.
Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified data processing on large clusters. In OSDI’04: Sixth symposium on operating system design and implementation. San Francisco. Downloaded from https://research.google.com/archive/mapreduce.html on 21 Dec 2016.
Rao, C. R. (2001). Statistics: Reflections on the past and visions for the future. Communications in Statistics – Theory and Methods, 30(11), 2235–2257.
T

Time Series Analytics

Erik Goepner
George Mason University, Arlington, VA, USA

Synonyms

Time series analysis; Time series data

Introduction

Time series analytics utilize data observations recorded over time at certain intervals. Subsequent values of time-ordered data often depend on previous observations. Time series analytics is, therefore, interested in techniques that can analyze this dependence (Box et al. 2015; Zois et al. 2015). Up until the second half of the twentieth century, social scientists largely ignored the possibility of dependence within time series data (Kirchgässner et al. 2012). Statisticians have since demonstrated that adjacent observations are frequently dependent in a time series and that previous observations can often be used to accurately predict future values (Box et al. 2015).

Time series data abound and are of importance to many. Physicists and geologists investigating climate change, for example, use annual temperature readings, economists study quarterly gross domestic product and monthly employment reports, and policy makers might be interested in before-and-after annual traffic accident data to determine the efficacy of safety legislation. Time series analytics can be used to forecast, determine the transfer function, assess the effects of unusual intervention events, analyze the relationships between variables of interest, and design control schemes (Box et al. 2015). Preferably, observations have been recorded at fixed time intervals. If the time intervals vary, interpolation can be used to fill in the gaps (Zois et al. 2015).

Of critical importance is whether the variables are stationary or nonstationary. Stationary variables are not time dependent (i.e., mean, variance, and covariance remain constant over time). However, time series data are quite often nonstationary. The trend of nonstationary variables can be deterministic (e.g., following a time trend), stochastic (i.e., random), or both. Addressing nonstationarity is a key requirement for those working with time series and is discussed further under “Challenges” (Box et al. 2015; Kirchgässner et al. 2012).

Time series are frequently comprised of four components. There is the trend over the long term and, often, a cyclical component that is normally understood to be a year or more in length. Within the cycle, there can be a seasonal variation. And finally, there is the residual, which includes all variation not explained by the trend, cycle, and seasonal components. Prior to the 1970s, only the residual was thought to include random impact, with trend, cycle, and seasonal change understood to be deterministic. That has changed, and now it
is assumed that all four components can be stochastically modeled (Kirchgässner et al. 2012).

The Evolution of Time Series Analytics

In the first half of the 1900s, fundamentally different approaches were pursued by different disciplines. Natural scientists, mathematicians, and statisticians generally modeled the past history of the variable of interest to forecast future values of the variable. Economists and other social scientists, however, emphasized theory-driven models with their accompanying explanatory variables. In 1970, Box and Jenkins published an influential textbook, followed in 1974 by a study from Granger and Newbold, that has substantially altered how social scientists interact with time series data (Kirchgässner et al. 2012).

The Box Jenkins approach, as it has been frequently called ever since, relies on extrapolation. Box Jenkins focuses on the past behavior of the variable of interest rather than a host of explanatory variables to predict future values. The variable of interest must be transformed so that it becomes stationary and its stochastic properties time invariant. At times, the terms Box Jenkins approach and time series analysis have been used interchangeably (Kennedy 2008).

Time Series Analytics and Big Data

Big Data has stimulated interest in efficient querying of time series data. Both time series and Big Data share similar characteristics relating to volume, velocity, variety, veracity, and volatility (Zois et al. 2015). The unprecedented volume of data can overwhelm computer memory and prevent processing in real time. Additionally, the speed at which new data arrives (e.g., from sensors) has also increased. The variety of data includes the medium from which it comes (e.g., audio and video) as well as differing sampling rates, which can prove problematic for data analysis. Missing data and incompatible sampling rates are discussed further in the “Challenges” section below. Veracity includes issues relating to inaccurate, missing, or incomplete data. Before analysis, these issues should be addressed via duplicate elimination, interpolation, data fusion, or an influence model (Zois et al. 2015).

Contending with Massive Amounts of Data
Tremendous amounts of time series data exist, potentially overwhelming computer memory. In response, solutions are needed to lessen the effects on secondary memory access. Sliding windows and time series indexing may help. Both are commonly used; however, newer users may find the learning curve unhelpfully steep for time series indexing. Similarly, consideration should be given to selecting management schemes and query languages simple enough for common users (Zois et al. 2015).

Analysis and Forecasting

Time series are primarily used for analysis and forecasting (Zois et al. 2015). A variety of potential models exist, including autoregressive (AR), moving average (MA), mixed autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA). ARMA models are used with stationary processes and ARIMA models for nonstationary ones (Box et al. 2015). Forecasting options include regression and nonregression based models. Model development should follow an iterative approach, often executed in three steps: identification, estimation, and diagnostic checking. Diagnostic checks examine whether the model is properly fit, and the checks analyze the residuals to determine model adequacy. Generally, 100 or more observations are preferred. If fewer than 50 observations exist, development of the initial model will require a combination of experience and past data (Box et al. 2015; Kennedy 2008).
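A minimal sketch of that identification, estimation, and diagnostic-checking loop is shown below, assuming NumPy, pandas, and statsmodels are available; the series is synthetic and the chosen ARIMA(1, 1, 1) order is only a starting point, not a recommendation.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

# Synthetic monthly series with a stochastic trend (illustrative data only).
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 120)),
              index=pd.date_range("2010-01", periods=120, freq="MS"))

# Identification/estimation: fit an ARIMA(1, 1, 1); one differencing operation
# (d = 1) is used to remove the nonstationary trend.
result = ARIMA(y, order=(1, 1, 1)).fit()

# Diagnostic checking: residuals should resemble white noise.
print(acorr_ljungbox(result.resid, lags=[12]))   # large p-value suggests an adequate fit
print(result.forecast(steps=6))                  # six-step-ahead forecast
```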
Autoregressive, Moving Average, and Mixed Autoregressive Moving Average Models
An autoregressive model predicts the value of the variable of interest based on its values from one or more previous time periods (i.e., its lagged value). If, for instance, the model only relied on the value of the immediately preceding time period, then it would be a first-order autoregression. Similarly, if the model included the values for the prior two time periods, then it would be referred to as a second-order autoregression, and so on. A moving average model also uses lagged values, but of the error term rather than the variable of interest (Kennedy 2008). If neither an autoregressive nor moving average process succeeds in breaking off the autocorrelation function, then a mixed autoregressive moving average approach may be preferred (Kirchgässner et al. 2012). AR, MA, and ARMA models are used with stationary time series, to include time series made stationary through differencing. However, the potential loss of vital information during differencing operations should be considered (Kirchgässner et al. 2012).

ARMA models produce unconditional forecasts, using only the past and current values of the variable. Because such forecasts frequently perform better than traditional econometric models, they are often preferred. However, blended approaches, which transform linear dynamic simultaneous equation systems into ARMA models or the inverse, are also available. These blended approaches can retain information provided by explanatory variables (Kirchgässner et al. 2012).

Autoregressive Integrated Moving Average (ARIMA) Models
In ARIMA models, also known as ARIMA(p, d, q), p indicates the number of lagged values of Y*, which represents the variable of interest after it has been made stationary by differencing. d indicates the number of differencing operations required to transform Y into its stationary version, Y*. The number of lagged values of the error term is represented by q. ARIMA models can forecast for univariate and multivariate time series (Kennedy 2008).

Vector Autoregressive (VAR) Models
VAR models blend the Box Jenkins approach with traditional econometric models. They can be quite helpful in forecasting. VAR models express a single vector (of all the variables) as a linear function of the vector’s lagged values combined with an error vector. The single vector is derived from the linear function of each variable’s lagged values and the lagged values for each of the other variables. VAR models are used to investigate the potential causal relationship between different time series, yet they are controversial because they are atheoretical and include dubious assertions (e.g., orthogonal innovation of one variable is assumed to not affect the value of any other variable). Despite the controversy, many scholars and practitioners view VAR models as helpful, particularly VAR’s role in analysis and forecasting (Kennedy 2008; Kirchgässner et al. 2012; Box et al. 2015).

Error Correction Models
These models attempt to harness positive features of both ARIMA and VAR models, accounting for the dynamic feature of time series data while also taking advantage of the contributions explanatory variables can make. Error correction models add theory-driven exogenous variables to a general form of the VAR model (Kennedy 2008).

Challenges

Nonstationarity
Nonstationarity can be caused by deterministic and stochastic trends (Kirchgässner et al. 2012). To transform nonstationary processes into stationary ones, the deterministic and/or stochastic trends must be eliminated. Measures to accomplish this include differencing operations and regression on a time trend. However, not all nonstationary processes can be transformed (Kirchgässner et al. 2012).

The Box Jenkins approach assumes that differencing operations will make nonstationary variables stationary. A number of unit root tests have been developed to test for nonstationarity, but their lack of power remains an issue. Additionally, differencing (as a means of eliminating unit roots and creating stationarity) comes with the undesirable effect of eliminating any theory-driven information that might otherwise contribute to the model.
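As an illustration of a unit root test followed by differencing, the sketch below applies the augmented Dickey-Fuller test from statsmodels to a synthetic random-walk series; the data and the informal reading of the p-values are illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Random-walk series: nonstationary by construction (illustrative data only).
rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(size=200)))

adf_stat, p_value = adfuller(y)[:2]
print(f"level series: ADF p-value = {p_value:.3f}")        # large: unit root not rejected

# First differencing typically removes a stochastic trend of this kind.
dy = y.diff().dropna()
adf_stat, p_value = adfuller(dy)[:2]
print(f"differenced series: ADF p-value = {p_value:.3f}")  # small: stationarity plausible
```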
Granger and colleagues developed cointegrated procedures to address this challenge (Kirchgässner et al. 2012). When nonstationary variables are cointegrated, that is, the variables remain relatively close to each other as they wander over time, procedures other than differencing can be used. Examples of cointegrated variables include prices and wages and short- and long-term interest rates. Error correcting models may be an appropriate substitute for differencing operations (Kennedy 2008). Cointegration analysis has helped shrink the gap between traditional econometric methods and time series analytics, facilitating the inclusion of theory-driven explanatory variables into the modeling process (Kirchgässner et al. 2012).

Autocorrelation
Time series data are frequently autocorrelated and, therefore, violate the assumption of randomly distributed error terms. When autocorrelation is present, the current value of a variable serves as a good predictor of its next value. Autocorrelation can disrupt models such that the analysis incorrectly concludes the variable is statistically significant when, in fact, it is not (Berman and Wang 2012). Autocorrelation can be detected visually or with statistical techniques like the Durbin-Watson test. If present, autocorrelation can be corrected with differencing or by adding a trend variable, for instance (Berman and Wang 2012).
Angeles: CQ Press.
Box, G., Jenkins, G., Reinsel, G., & Ljung, G. (2015). Time
Missing Data and Incompatible Sampling series analysis: Forecasting and control. Hoboken:
Rates Wiley.
Missing data occur for any number of reasons. Kennedy, P. (2008). A guide to econometrics (6th ed.).
Malden: Blackwell.
Records may be lost, destroyed, or otherwise
Kirchgässner, G., Wolters, J., & Hassler, U. (2012). Intro-
unavailable. At certain points, sampling rates duction to modern time series analysis (2nd ed.). Hei-
may fail to follow the standard time measurement delberg: Springer Science & Business Media.
of the data series. Specialized algorithms may be Zois, V., Chelmis, C., & Prasanna, V. (2015). Querying of
time series for big data analytics. In L. Yan (Ed.),
necessary. Interpolation can be used as a tech-
Handbook of research on innovative database query
nique to fill in missing data or to smooth the processing techniques (pp. 364–391). Hershey: IGI
gaps between intervals (Zois et al. 2015). Global.
W

Web Scraping

Bo Zhao
College of Earth, Ocean, and Atmospheric Sciences, Oregon State University, Corvallis, OR, USA

Web scraping, also known as web extraction or harvesting, is a technique to extract data from the World Wide Web (WWW) and save it to a file system or database for later retrieval or analysis. Commonly, web data is scraped utilizing Hypertext Transfer Protocol (HTTP) or through a web browser. This is accomplished either manually by a user or automatically by a bot or web crawler. Due to the fact that an enormous amount of heterogeneous data is constantly generated on the WWW, web scraping is widely acknowledged as an efficient and powerful technique for collecting big data (Mooney et al. 2015; Bar-Ilan 2001). To adapt to a variety of scenarios, current web scraping techniques have become customized, from smaller ad hoc, human-aided procedures to fully automated systems that are able to convert entire websites into well-organized data sets. State-of-the-art web scraping tools are not only capable of parsing markup languages or JSON files but also integrate with computer visual analytics (Butler 2007) and natural language processing to simulate how human users browse web content (Yi et al. 2003).

The process of scraping data from the Internet can be divided into two sequential steps: acquiring web resources and then extracting desired information from the acquired data. Specifically, a web scraping program starts by composing an HTTP request to acquire resources from a targeted website. This request can be formatted either as a URL containing a GET query or as a piece of HTTP message containing a POST query. Once the request is successfully received and processed by the targeted website, the requested resource will be retrieved from the website and then sent back to the web scraping program. The resource can be in multiple formats, such as web pages that are built from HTML, data feeds in XML or JSON format, or multimedia data such as image, audio, or video files. After the web data is downloaded, the extraction process continues to parse, reformat, and organize the data in a structured way. There are two essential modules of a web scraping program – a module for composing an HTTP request, such as Urllib2 or Selenium, and another one for parsing and extracting information from raw HTML code, such as Beautiful Soup or Pyquery. Here, the Urllib2 module defines a set of functions for dealing with HTTP requests, such as authentication, redirections, cookies, and so on, while Selenium is a web browser wrapper that builds up a web browser, such as Google Chrome or Internet Explorer, and enables users to automate the process of browsing a website by programming.
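A minimal sketch of the request-composition step is shown below, using urllib.request (the Python 3 successor to urllib2); the endpoint and query parameters are placeholders rather than a real service.

```python
import urllib.parse
import urllib.request

# GET: the query is encoded directly in the URL (endpoint is a placeholder).
params = urllib.parse.urlencode({"q": "big data"})
with urllib.request.urlopen("https://example.com/search?" + params) as resp:
    page = resp.read()            # raw bytes of the requested resource

# POST: the same query travels in the body of the HTTP message instead.
data = urllib.parse.urlencode({"q": "big data"}).encode("utf-8")
with urllib.request.urlopen("https://example.com/search", data=data) as resp:
    page = resp.read()
```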
Regarding data extraction, Beautiful Soup is designed for scraping HTML and other XML documents. It provides convenient Pythonic functions for navigating, searching, and modifying a parse tree, and a toolkit for decomposing an HTML file and extracting desired information via lxml or html5lib. Beautiful Soup can automatically detect the encoding of the document being parsed and convert it to a client-readable encoding. Similarly, Pyquery provides a set of jQuery-like functions to parse XML documents. But unlike Beautiful Soup, Pyquery only supports lxml for fast XML processing.
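The acquisition and extraction steps can be combined in a few lines, as in the sketch below, which fetches a placeholder page with urllib.request and pulls out its title and links with Beautiful Soup; a real scraper would also need to handle errors, encodings, and site-specific structure.

```python
import urllib.request
from bs4 import BeautifulSoup

# Acquisition step: compose the HTTP GET request and retrieve the resource.
url = "https://example.com/"
with urllib.request.urlopen(url) as response:
    html = response.read()

# Extraction step: parse the raw HTML and pull out the desired information.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())          # page title
for link in soup.find_all("a"):       # all hyperlinks on the page
    print(link.get("href"))
```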
Of the various types of web scraping programs, may raise legal questions related to copyright
some are created to automatically recognize the (O’Reilly 2006), terms of service (ToS) (Fisher
data structure of a page, such as Nutch or Scrapy, et al. 2010), and “trespass to chattels” (Hirschey
or to provide a web-based graphic interface that 2014). A web scraper is free to copy a piece of data
eliminates the need for manually written web in figure or table form from a web page without
scraping code, such as Import.io. Nutch is a robust any copyright infringement because it is difficult
and scalable web crawler, written in Java. It to prove a copyright over such data since only a
enables fine-grained configuration, paralleling specific arrangement or a particular selection of
harvesting, robots.txt rule support, and machine the data is legally protected. Regarding the ToS,
learning. Scrapy, written in Python, is an reusable although most web applications include some
web crawling framework. It speeds up the process form of ToS agreement, their enforceability usu-
of building and scaling large crawling projects. In ally lies within a gray area. For instance, the
addition, it also provides a web-based shell to owner of a web scraper that violates the ToS
simulate the website browsing behaviors of a may argue that he or she never saw or officially
human user. To enable nonprogrammers to har- agreed to the ToS. Moreover, if a web scraper
vest web contents, the web-based crawler with a sends data acquiring requests too frequently, this
graphic interface is purposely designed to mitigate is functionally equivalent to a denial-of-service
the complexity of using a web scraping program. attack, in which the web scraper owner may be
Among them, Import.io is a typical crawler for refused entry and may be liable for damages under
extracting data from websites without writing any the law of “trespass to chattels,” because the
code. It allows users to identify and convert owner of the web application has a property inter-
unstructured web pages into a structured format. est in the physical web server which hosts the
Import.io’s graphic interface for data identifica- application. An ethical web scraping tool will
tion allows user to train and learn what to extract. avoid this issue by maintaining a reasonable
The extracted data is then stored in a dedicated requesting frequency.
cloud server, and can be exported in CSV, JSON, A web application may adopt one of the fol-
and XML format. A web-based crawler with a lowing measures to stop or interfere with a web
graphic interface can easily harvest and visualize scrapping tool that collects data from the given
real-time data stream based on SVG or WebGL website. Those measures may identify whether an
engine but fall short in manipulating a large data operation was conducted by a human being or a
set. bot. Some of the major measures include the fol-
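To make the contrast with the single-page example above concrete, the sketch below shows what a small Scrapy crawler can look like; the spider class, the public demo site, and the CSS selectors are illustrative assumptions, not a prescribed configuration.

```python
# Minimal sketch of a Scrapy spider (assumes the scrapy package is installed);
# it can be run with:  scrapy runspider quotes_spider.py -o quotes.json
# The start URL points at a public demo site and the selectors are examples.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]      # hypothetical seed page

    def parse(self, response):
        # Yield one structured item per quote block found on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination so the crawl scales across many pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```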
Web scraping can be used in a wide variety of scenarios, such as contact scraping, price change monitoring and comparison, product review collection, gathering of real estate listings, weather data monitoring, website change detection, and web data integration. For example, at a micro scale, the price of a stock can be regularly scraped in order to visualize the price change over time (Case et al. 2005), and social media feeds can be collectively scraped to investigate public opinions and identify opinion leaders (Liu and Zhao 2016). At a macro level, the metadata of nearly every website is constantly scraped to build up Internet search engines, such as Google Search or Bing Search (Snyder 2003).

Although web scraping is a powerful technique for collecting large data sets, it is controversial and may raise legal questions related to copyright (O'Reilly 2006), terms of service (ToS) (Fisher et al. 2010), and "trespass to chattels" (Hirschey 2014). A web scraper is generally free to copy a piece of data in figure or table form from a web page without any copyright infringement, because it is difficult to prove a copyright over such data: only a specific arrangement or a particular selection of the data is legally protected. Regarding the ToS, although most web applications include some form of ToS agreement, its enforceability usually lies within a gray area. For instance, the owner of a web scraper that violates the ToS may argue that he or she never saw or officially agreed to the ToS. Moreover, if a web scraper sends data-acquiring requests too frequently, this is functionally equivalent to a denial-of-service attack, in which case the web scraper owner may be refused entry and may be liable for damages under the law of "trespass to chattels," because the owner of the web application has a property interest in the physical web server that hosts the application. An ethical web scraping tool will avoid this issue by maintaining a reasonable requesting frequency.
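A minimal sketch of such a reasonable requesting frequency is shown below; the two-second delay and the page URLs are assumptions chosen only for illustration, and a production crawler would usually also honor robots.txt and back off on errors.

```python
# Minimal sketch of polite, rate-limited scraping: a fixed pause between
# requests keeps the load on the target server far below anything that
# could resemble a denial-of-service pattern. URLs and delay are placeholders.
import time
from urllib.request import urlopen

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
REQUEST_DELAY_SECONDS = 2.0                    # assumed courtesy delay

for url in urls:
    html = urlopen(url).read()
    print(url, len(html), "bytes")
    time.sleep(REQUEST_DELAY_SECONDS)          # throttle before the next request
```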
A web application may adopt one of the following measures to stop or interfere with a web scraping tool that collects data from the given website. These measures try to identify whether an operation was conducted by a human being or a bot. Some of the major measures include the following: HTML "fingerprinting," which investigates the HTML headers to identify whether a visitor is malicious or safe (Acar et al. 2013); IP reputation determination, whereby IP addresses with a recorded history of use in website assaults are treated with suspicion and are more likely to be heavily scrutinized (Sadan and Schwartz 2012); behavior analysis, which reveals abnormal behavioral patterns such as placing a suspiciously high rate of requests or adhering to anomalous browsing patterns; and progressive challenges that filter out bots with a set of tasks, such as cookie support, JavaScript execution, and CAPTCHA (Doran and Gokhale 2011).
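The behavior-analysis measure can be pictured with the small sketch below, which flags client addresses whose request rate within a time window exceeds a threshold; the log entries, the window, and the limit are invented for illustration, and real defenses combine such signals with fingerprinting and progressive challenges.

```python
# Minimal sketch of server-side behavior analysis: flag client IPs whose
# request rate in a short window exceeds an assumed policy threshold.
from collections import Counter

request_log = [                               # (client_ip, unix_timestamp) pairs
    ("203.0.113.7", 1000), ("203.0.113.7", 1001), ("203.0.113.7", 1001),
    ("198.51.100.2", 1000), ("203.0.113.7", 1002), ("198.51.100.2", 1060),
]
WINDOW_SECONDS, MAX_REQUESTS = 60, 3          # assumed policy

window_start = min(ts for _, ts in request_log)
counts = Counter(ip for ip, ts in request_log
                 if ts <= window_start + WINDOW_SECONDS)
suspected_bots = [ip for ip, n in counts.items() if n > MAX_REQUESTS]
print(suspected_bots)                         # ['203.0.113.7']
```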
Further Readings

Acar, G., Juarez, M., Nikiforakis, N., Diaz, C., Gürses, S., Piessens, F., & Preneel, B. (2013). FPDetective: Dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on computer & communications security. New York: ACM.
Bar-Ilan, J. (2001). Data collection methods on the web for infometric purposes – A review and analysis. Scientometrics, 50(1), 7–32.
Butler, J. (2007). Visual web page analytics. Google Patents.
Case, K. E., Quigley, J. M., & Shiller, R. J. (2005). Comparing wealth effects: The stock market versus the housing market. The BE Journal of Macroeconomics, 5(1), 1.
Doran, D., & Gokhale, S. S. (2011). Web robot detection techniques: Overview and limitations. Data Mining and Knowledge Discovery, 22(1), 183–210.
Fisher, D., Mcdonald, D. W., Brooks, A. L., & Churchill, E. F. (2010). Terms of service, ethics, and bias: Tapping the social web for CSCW research. Computer Supported Cooperative Work (CSCW), Panel discussion.
Hirschey, J. K. (2014). Symbiotic relationships: Pragmatic acceptance of data scraping. Berkeley Technology Law Journal, 29, 897.
Liu, J. C.-E., & Zhao, B. (2016). Who speaks for climate change in China? Evidence from Weibo. Climatic Change, 140(3), 413–422.
Mooney, S. J., Westreich, D. J., & El-Sayed, A. M. (2015). Epidemiology in the era of big data. Epidemiology, 26(3), 390.
O'Reilly, S. (2006). Nominative fair use and Internet aggregators: Copyright and trademark challenges posed by bots, web crawlers and screen-scraping technologies. Loyola Consumer Law Review, 19, 273.
Sadan, Z., & Schwartz, D. G. (2012). Social network analysis for cluster-based IP spam reputation. Information Management & Computer Security, 20(4), 281–295.
Snyder, R. (2003). Web search engine with graphic snapshots. Google Patents.
Yi, J., Nasukawa, T., Bunescu, R., & Niblack, W. (2003). Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003). Melbourne, FL: IEEE.
B

Big Geo-Data

Song Gao
Department of Geography, University of California, Santa Barbara, CA, USA

Synonyms

Big georeferenced data; Big geospatial data; Geospatial big data; Spatial big data

Definition/Introduction

Big geo-data is an extension of the concept of big data with an emphasis on the geospatial component and under the context of geography or the geosciences. It is used to describe the phenomenon that large volumes of georeferenced data (including structured, semi-structured, and unstructured data) about various aspects of the Earth environment and society are captured by millions of environmental and human sensors in a variety of formats such as remote sensing imagery, crowdsourced maps, geotagged videos and photos, transportation smart card transactions, mobile phone data, location-based social media content, and GPS trajectories. Big geo-data is "big" not only because it involves a huge volume of georeferenced data but also because of the high velocity of its generation streams, its high dimensionality, the high variety of data forms, the veracity (uncertainty) of the data, and the complex interlinkages with (small) datasets that cover multiple perspectives, topics, and spatiotemporal scales. It poses grand research challenges across the life cycle of large-scale georeferenced data collection, access, storage, management, analysis, modeling, and visualization.

Theoretical Aspects

Geography has a long-standing tradition of duality in its research methodologies: the law-seeking approach and the descriptive or explanatory approach. With the increasing popularity of data-driven approaches in geography, a variety of statistical and machine learning methods have been applied in geospatial knowledge discovery and in modeling for prediction. Miller and Goodchild (2015) discussed the major challenges (i.e., populations not samples, messy not clean data, and correlations not causality) and the role of theory in data-driven geographic knowledge discovery and spatial modeling, addressing the tensions between idiographic and nomothetic knowledge in geography. Big geo-data is leading to new research methodologies that capture the complex spatiotemporal dynamics of the Earth and society directly at multiple spatial and temporal scales instead of just snapshots. The data streams play a driving-force role in data-driven methods rather than a test or calibration role behind the theory or models of conventional geographic analyses.
While data-driven science and predictive analytics evolve in geography and provide new insights, it is sometimes still very challenging for humans to interpret the meaning of machine learning or analytical results or to relate findings to underlying theory. To address this problem, Janowicz et al. (2015) proposed a semantic cube to illustrate the need for semantic technologies and domain ontologies that address the role of diversity, synthesis, and definiteness in big data research.

Social and Human Aspects

The emergence of big geo-data brings new opportunities for researchers to understand our socioeconomic and human environments. In the journal Dialogues in Human Geography (volume 3, issue 3, November 2013), several human geographers and GIScience researchers discussed a series of theoretical and practical challenges and risks to geographic scholarship and raised a number of epistemological, methodological, and ethical questions related to the study of big data in geography. With the advancements in location-awareness technology, information and communication technology, and mobile sensing technology, researchers have employed emerging big geo-data to investigate the geographical perspective of human dynamics research within such contexts in the special issue on Human Dynamics in the Mobile and Big Data Era of the International Journal of Geographical Information Science (Shaw et al. 2016). By synthesizing multiple sources of big data, such research can uncover interesting human behavioral patterns that are difficult or impossible to uncover with traditional datasets. However, challenges still exist in the scarcity of demographics and in cross-validating or identifying individual behaviors rather than aggregated patterns. Moreover, location-privacy concerns and discussions arise in both the academic world and society, and there are social tensions between big data accessibility and privacy protection.

Technical Aspects

Cloud computing technologies and their distributed deployment models offer scalable computing paradigms that enable big geo-data processing for scientific research and applications. In the geospatial research world, cloud computing has attracted increasing attention as a way of solving data-intensive, computing-intensive, and access-intensive geospatial problems and challenges, such as supporting climate analytics, land-use and land-cover change analysis, and dust storm forecasting (Yang et al. 2017). Geocomputation facilitates fundamental geographical science studies by synthesizing high-performance computing capabilities with spatial analysis operations, providing a promising solution to the aforementioned geospatial research challenges.

There are a variety of big data analytics platforms and parallelized database systems emerging in the new era. They can be classified into two categories: (1) massively parallel processing data warehousing systems like Teradata, which are designed for holding large-scale structured data and support standard SQL queries, and (2) distributed file storage systems and cluster-computing frameworks like Apache Hadoop and Apache Spark. The advantages of Hadoop-based systems mainly lie in their high flexibility, scalability, low cost, and reliability for managing and efficiently processing large volumes of structured and unstructured datasets, as well as in providing job schedules for balancing data, resources, and task loads. The MapReduce computation paradigm on Hadoop takes advantage of a divide-and-conquer strategy to improve processing efficiency. However, big geo-data has additional complexity in its spatial and temporal components and requires new analytical frameworks and functionalities compared with nonspatial big data. Gao et al. (2017) built a scalable Hadoop-based geoprocessing platform (GPHadoop) and ran big geo-data analytical functions to solve crowdsourced gazetteer harvesting problems. Recently, more efforts have been made to connect the traditional GIS analysis research community to the cloud computing research community for the next frontier of big geo-data analytics.
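The divide-and-conquer idea behind MapReduce can be illustrated with the minimal, pure-Python sketch below, which bins georeferenced points into grid cells (the map step) and counts points per cell (the reduce step); the coordinates, the one-degree cell size, and the function names are assumptions made only for illustration, and an actual platform such as GPHadoop would run equivalent map and reduce functions across a Hadoop or Spark cluster.

```python
# Minimal sketch of the MapReduce divide-and-conquer pattern applied to
# georeferenced points: map each (lat, lon) record to a coarse grid cell,
# then reduce by counting records per cell. Points and cell size are made up.
from collections import defaultdict

points = [(34.41, -119.85), (34.42, -119.84), (40.71, -74.01), (40.72, -74.00)]

def map_to_cell(lat, lon, cell_size=1.0):
    # Map step: assign each record a grid-cell key (1-degree cells here).
    return (int(lat // cell_size), int(lon // cell_size))

def reduce_counts(cell_keys):
    # Reduce step: aggregate all records that share the same cell key.
    counts = defaultdict(int)
    for key in cell_keys:
        counts[key] += 1
    return dict(counts)

density = reduce_counts(map_to_cell(lat, lon) for lat, lon in points)
print(density)   # {(34, -120): 2, (40, -75): 2}
```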
In a special issue on big data in the journal Annals of GIS (volume 20, issue 4, 2014), researchers further discussed several key technologies (e.g., cloud computing and high-performance geocomputation cyberinfrastructures) for dealing with the quantitative and qualitative dynamics of big geo-data. Advanced spatiotemporal big data mining and geoprocessing methods should be developed by optimizing elastic storage, balanced scheduling, and parallel computing resources in high-performance geocomputation cyberinfrastructures.

Conclusion

With the advancements in location-awareness technology and mobile distributed sensor networks, large-scale, high-resolution spatiotemporal datasets about the Earth and society have become available for geographic research. Research on big geo-data involves interdisciplinary collaborative efforts. There are at least three research areas that require further work: (1) the systematic integration of various big geo-data sources in geospatial knowledge discovery and spatial modeling, (2) the development of advanced spatial analysis functions and models, and (3) the advancement of quality assurance for big geo-data. Finally, there will still be ongoing comparisons between data-driven and theory-driven research methodologies in geography.

Further Readings

Gao, S., Li, L., Li, W., Janowicz, K., & Zhang, Y. (2017). Constructing gazetteers from volunteered big geo-data based on Hadoop. Computers, Environment and Urban Systems, 61, 172–186.
Janowicz, K., van Harmelen, F., Hendler, J., & Hitzler, P. (2015). Why the data train needs semantic rails. AI Magazine, Association for the Advancement of Artificial Intelligence (AAAI), pp. 5–14.
Miller, H. J., & Goodchild, M. F. (2015). Data-driven geography. GeoJournal, 80(4), 449–461.
Shaw, S. L., Tsou, M. H., & Ye, X. (2016). Editorial: Human dynamics in the mobile and big data era. International Journal of Geographical Information Science, 30(9), 1687–1693.
Yang, C., Huang, Q., Li, Z., Liu, K., & Hu, F. (2017). Big data and cloud computing: Innovation opportunities and challenges. International Journal of Digital Earth, 10(1), 13–53.
I

Integrated Data System

Ting Zhang
Department of Finance and Economics, Merrick School of Business, University of Baltimore, Baltimore, MD, USA

Definition/Introduction

Integrated Data Systems (IDS) typically link individual-level administrative records collected by multiple agencies such as k–12 schools, community colleges, other colleges and universities, departments of labor, justice, human resources, human and health services, police, housing, and community services. The systems can be used for a quick knowledge-to-practice development cycle (Actionable Intelligence for Social Policy 2017), case management, program or service monitoring, tracking, and evaluation (National Neighborhood Indicators Partnership 2017), research and policy analysis, strategic planning and performance management, and so on. They can also help evaluate how different programs, services, and policies affect individual persons or individual geographic units. The linkages between different agency records are often made through a common individual personal identification number, a shared case number, or a geographic unit.

Purpose of an IDS

With the rising attraction of big data and the exploding need to share existing data, the need to link the various administrative records that have already been collected rises. These systems allow government agencies to integrate various databases and bridge the gaps that have traditionally formed within individual agency databases. They can be used for a quick knowledge-to-practice development cycle that addresses the often interconnected needs of citizens efficiently and effectively (Actionable Intelligence for Social Policy 2017), for case management (National Neighborhood Indicators Partnership 2017), for program or service monitoring, tracking, and evaluation, for developing and testing an intervention and monitoring its outcomes (Davis et al. 2014), for research and policy analysis, for strategic planning and performance management, and so on. An IDS can test social policy innovations through high-speed, low-cost randomized controlled trials and quasi-experimental approaches, can be used for continuous quality improvement efforts and benefit-cost analysis, and can also help provide a complete account of how different programs, services, and policies affect individual persons or individual geographic units, so that the often interconnected needs of citizens are addressed more efficiently and effectively (Actionable Intelligence for Social Policy 2017).

Key Elements to Build an IDS

According to Davis et al. (2014) and Zhang and Stevens (2012), typical crucial factors for a successful IDS include:

• A broad and steady institutional commitment to administer the system
• Individual-level data (whether on individual persons or individual geographic units) to measure outcomes
• The necessary data infrastructure
• Linkable data fields, such as Social Security numbers, business identifiers, shared case numbers, and addresses
• The capacity to match various administrative records
• A favorable state interpretation of the data privacy requirements, consistent with federal regulations
• The funding, knowhow, and analytical capacity to work with and maintain the data
• Successfully obtaining participation from multiple data-providing agencies, with clearance to use those data

Maintenance

Administrative data records are typically collected by public and private agencies. An IDS often requires extracting, transforming, cleaning, and linking information from various source administrative databases and loading it into a data warehouse. Many data warehouses offer a tightly coupled architecture in which it usually takes little time to resolve queries and extract information (Widom 1995).
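A minimal sketch of this extract-transform-load step is shown below; it uses Python's built-in sqlite3 module as a stand-in for both the agency source databases and the warehouse, and the table names, columns, and cleaning rules are illustrative assumptions rather than any real IDS schema.

```python
# Minimal ETL sketch: extract rows from two hypothetical agency tables,
# apply a light transformation (trimming the shared person identifier),
# and load everything into one warehouse table. Schemas are illustrative.
import sqlite3

source = sqlite3.connect(":memory:")        # stand-in for agency databases
source.executescript("""
    CREATE TABLE education(person_id TEXT, school TEXT, grad_year INTEGER);
    INSERT INTO education VALUES (' P001 ', 'Lincoln High', 2015);
    CREATE TABLE wages(person_id TEXT, quarter TEXT, earnings REAL);
    INSERT INTO wages VALUES ('P001', '2016Q1', 8250.0);
""")

warehouse = sqlite3.connect(":memory:")     # stand-in for the integrated warehouse
warehouse.execute(
    "CREATE TABLE person_history(person_id TEXT, source TEXT, payload TEXT)")

for table in ("education", "wages"):
    for row in source.execute(f"SELECT * FROM {table}"):   # extract
        person_id = str(row[0]).strip()                     # transform/clean
        warehouse.execute(                                   # load
            "INSERT INTO person_history VALUES (?, ?, ?)",
            (person_id, table, repr(row[1:])))

print(warehouse.execute("SELECT * FROM person_history").fetchall())
```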
Challenges

Identity Management and Data Quality
One challenge in building an IDS is to have effective and appropriate individual record identity management diagnostics that include consideration of the consequences of gaps in common identifier availability and accuracy. This is the first key step for the data quality of IDS information. However, some of the relevant databases, particularly student records, do not include a universally linkable personal identifier, that is, a Social Security number; some databases are unable to ensure that a known-to-be-valid Social Security number is paired with one individual, and only that individual, consistently over time; and some databases are unable to ensure that each individual is associated with only one Social Security number over time (Zhang and Stevens 2012). Zhang and Stevens (2012) included an ongoing collection of case studies documenting how SSNs can be extracted, validated, and securely stored offline. With the established algorithms required for electronic financial transactions, the spreading adoption of electronic medical records, and rising interest in big data, there is an extensive and rapidly growing literature illustrating probabilistic matching solutions and various software designs that address the identity management challenge. Often the required accuracy threshold is application specific; assurance of an exact match may not be required for some anticipated longitudinal data system uses (Zhang and Stevens 2012).
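The sketch below illustrates, under stated assumptions, what such a two-stage approach can look like: a deterministic match on a shared identifier where one exists, with a probabilistic fallback on name similarity plus birth date. The sample records, the 0.85 similarity threshold, and the use of Python's difflib are illustrative choices, not the methods of Zhang and Stevens (2012).

```python
# Minimal sketch of two-stage record linkage across agency extracts:
# deterministic match on a shared SSN first, then a probabilistic fallback
# on name similarity plus date of birth. Records and threshold are made up.
from difflib import SequenceMatcher

education = [{"ssn": "123-45-6789", "name": "Ana Gomez", "dob": "1990-02-01"},
             {"ssn": None,          "name": "Jon Smith", "dob": "1988-07-12"}]
workforce = [{"ssn": "123-45-6789", "name": "Ana M. Gomez", "dob": "1990-02-01"},
             {"ssn": "987-65-4321", "name": "John Smith",   "dob": "1988-07-12"}]

def link(edu_records, wf_records, threshold=0.85):
    matches = []
    for e in edu_records:
        for w in wf_records:
            if e["ssn"] and e["ssn"] == w["ssn"]:            # deterministic match
                matches.append((e, w, 1.0))
            elif e["dob"] == w["dob"]:                        # probabilistic fallback
                score = SequenceMatcher(None, e["name"], w["name"]).ratio()
                if score >= threshold:
                    matches.append((e, w, score))
    return matches

for e, w, score in link(education, workforce):
    print(e["name"], "<->", w["name"], round(score, 2))
```

Raising or lowering the threshold trades false matches against missed matches, which is one reason the required accuracy level is application specific.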
Data Privacy
To build and use an IDS, issues related to the privacy of personal information within the system are important. Many government agencies have relevant regulations. For example, a nationally known law is the Family Educational Rights and Privacy Act (FERPA), which defines when student information can be disclosed and establishes data privacy practices (U.S. Department of Education 2017). Similarly, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) addresses the use and disclosure of health information (U.S. Department of Health & Human Services 2017).

Ethics
Most IDS tap individual persons' information. When using IDS information, extra caution is needed in order not to misuse personal information. Institutional review boards are often needed when conducting research involving human subjects.

Data Sharing
To build an IDS, a favorable state interpretation of the data privacy requirements, consistent with federal regulations, and clearance to use the data for the IDS are critical. For example, some state education agencies have been reluctant to share their education records, largely due to narrow state interpretations of the confidentiality provisions of FERPA and its implementing regulations (Davis et al. 2014). Corresponding data sharing agreements need to be in place.

Data Security
During the process of building, transferring, maintaining, and using IDS information, data security in an IDS center is particularly important. Measures to ensure data security and information privacy and confidentiality become key factors in an IDS's vigor and sustainability. Fortunately, many current US IDS centers have had experience maintaining confidential administrative records for years or even decades. However, given the convenience of web access, maintaining continued data security and sustainability often requires updated data protection techniques. Federal, state, and local governments have important roles in safeguarding data and data use.

Examples

Examples of IDS in the United States include:

Chapin Hall's Planning for Human Service Reform Using Integrated Administrative Data
Jacob France Institute's database for education, employment, human resources, and human services
Juvenile Justice and Child Welfare Data Crossover Youth Multi-Site Research Study
Actionable Intelligence for Social Policy's integrated data systems initiatives for policy analysis and program reform
Florida's Common Education Data Standards (CEDS) Workforce Workgroup and the later Florida Education & Training Placement Information Program
Louisiana Workforce Longitudinal Data System (WLDS), housed at the Louisiana Workforce Commission
Minnesota's iSEEK data, managed by an organization called iSEEK Solutions
Heldrich Center data at Rutgers University
Ohio State University's workforce longitudinal administrative database
University of Texas Ray Marshall Center database
Virginia Longitudinal Data System
Washington's Career Bridge, managed by the Workforce Training and Education Coordinating Board
Connecticut's Preschool through Twenty and Workforce Information Network (P-20 WIN)
Delaware Department of Education's Education Insight Dashboard
Georgia Department of Education's Statewide Longitudinal Data System and Georgia's Academic and Workforce Analysis and Research Data System (GA AWARDS)
Illinois Longitudinal Data System
Indiana Network of Knowledge (INK)
Maryland Longitudinal Data System
Missouri Comprehensive Data System
Ohio Longitudinal Data Archive (OLDA)
South Carolina Longitudinal Information Center for Education (SLICE)
Texas Public Education Information Resource (TPEIR) and Texas Education Research Center (ERC)
Washington P-20W Statewide Longitudinal Data System
Conclusion

Integrated Data Systems (IDS) typically link individual-level administrative records collected by multiple agencies. The systems can be used for case management, program or service monitoring, tracking, and evaluation, research and policy analysis, and more. A successful IDS often requires a broad and steady institutional commitment to administer the system, individual-level data, the necessary data infrastructure, linkable data fields, the capacity and knowhow to match various administrative records and maintain them, data access permission, and data privacy procedures. The main challenges in building a sustainable IDS include identity management, data quality, data privacy, ethics, data sharing, and data security. There are many IDS in the United States.

Further Readings

Actionable Intelligence for Social Policy. (2017). Integrated Data Systems (IDS). Retrieved in March 2017 from https://www.aisp.upenn.edu/integrated-data-systems/.
Davis, S., Jacobson, L., & Wandner, S. (2014). Using workforce data quality initiative databases to develop and improve consumer report card systems. Washington, DC: Impaq International.
National Neighborhood Indicators Partnership. (2017). Resources on Integrated Data Systems (IDS). Retrieved in March 2017 from http://www.neighborhoodindicators.org/resources-integrated-data-systems-ids.
U.S. Department of Education. (2017). Family Educational Rights and Privacy Act (FERPA). Retrieved on May 14, 2017 from https://ed.gov/policy/gen/guid/fpco/ferpa/index.html.
U.S. Department of Health & Human Services. (2017). Summary of the HIPAA Security Rule. Retrieved on May 14, 2017 from https://www.hhs.gov/hipaa/for-professionals/security/laws-regulations/.
Widom, J. (1995). Research problems in data warehousing. In CIKM '95: Proceedings of the fourth international conference on information and knowledge management (pp. 25–30). Baltimore.
Zhang, T., & Stevens, D. (2012). Integrated data system person identification: Accuracy requirements and methods. Jacob France Institute. Available at SSRN: https://ssrn.com/abstract=2512590 or http://dx.doi.org/10.2139/ssrn.2512590 and http://www.workforcedqc.org/sites/default/files/images/JFI%20wdqi%20research%20report%20January%202014.pdf.
S

State Longitudinal Data System

Ting Zhang
Department of Finance and Economics, Merrick School of Business, University of Baltimore, Baltimore, MD, USA

Definition

State Longitudinal Data Systems (SLDS) connect databases across two or more state-level agencies of early learning, K–12, postsecondary, and workforce. An SLDS is a state-level Integrated Data System that focuses on tracking individuals longitudinally.

Purpose of the SLDS

SLDS are intended to enhance the ability of states to capture, manage, develop, analyze, and use student education records; to support evidence-based decisions to improve student learning; to facilitate research to increase student achievement and close achievement gaps (National Center for Education Statistics 2010); to address potential recurring impediments to student learning; to measure and document the long-term return on investment in education; to support education accountability systems; and to simplify the processes used by state educational agencies to make education data transparent through federal and public reporting (US Department of Education 2015). The Statewide Longitudinal Data Systems Grant Program funds states' efforts to develop and implement these data systems in response to legislative initiatives (US Department of Education 2015).

Information Offered

The data system aligns p-12 student education records with secondary and postsecondary education and workforce records, using linkable student and teacher identification numbers and student- and teacher-level information (National Center for Education Statistics 2010). The student education records include information on enrollment, demographics, program participation, test records, transcript information, college readiness test scores, successful transition to postsecondary programs, enrollment in postsecondary remedial courses, and entries and exits from various levels of the education system (National Center for Education Statistics 2010).
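What such an alignment enables can be pictured with the minimal sketch below, which joins p-12 exit records to later wage records through a linkable student identifier and summarizes an employment outcome by graduation cohort; the identifiers, years, earnings figures, and the two-year follow-up window are illustrative assumptions, not data or rules from any actual SLDS.

```python
# Minimal sketch of a longitudinal alignment: link graduation records to
# later wage records by a shared student identifier, then summarize an
# outcome by graduation cohort. All values are invented for illustration.
from collections import defaultdict

graduates = [{"student_id": "S1", "grad_year": 2014},
             {"student_id": "S2", "grad_year": 2014},
             {"student_id": "S3", "grad_year": 2015}]
wages = [{"student_id": "S1", "year": 2016, "earnings": 31000},
         {"student_id": "S3", "year": 2017, "earnings": 28000}]

employed = defaultdict(set)                  # cohort -> linked student IDs
for grad in graduates:
    for wage in wages:
        # Longitudinal link: same person, observed two years after exit.
        if (wage["student_id"] == grad["student_id"]
                and wage["year"] == grad["grad_year"] + 2):
            employed[grad["grad_year"]].add(grad["student_id"])

cohort_size = defaultdict(int)
for grad in graduates:
    cohort_size[grad["grad_year"]] += 1

for year in sorted(cohort_size):
    rate = len(employed[year]) / cohort_size[year]
    print(year, f"{rate:.0%} with observed earnings two years after exit")
```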
Statewide Longitudinal Data Systems Grant Program

According to the US Department of Education (2015), the Statewide Longitudinal Data Systems Program awards grants to state educational agencies to design, develop, and implement SLDS that efficiently and accurately manage, analyze, disaggregate, and use individual student data. As authorized by the Educational Technical Assistance Act of 2002, Title II of the statute that created the Institute of Education Sciences (IES), the SLDS Grant Program has awarded competitive, cooperative agreement grants to almost all states since 2005; in addition to the grants, the program offers many services and resources to assist education agencies with SLDS-related work (US Department of Education 2016).

Challenges

In addition to the challenges that any Integrated Data System faces, SLDS have the following main challenges:

Training/Education Provider Participation
In spite of recent years' progress, participation by training/education providers has not been universal. To improve training and education coverage, a few states have taken effective action. For example, the Texas state legislature has tied a portion of the funding of state technical colleges to their ability to demonstrate high levels of program completion and employment in occupations related to training (Davis et al. 2014).

Privacy Issues and State Longitudinal Data Systems
To ensure data privacy and protect personal information, the Family Educational Rights and Privacy Act (FERPA), the Pupil Protection Rights Act (PPRA), and the Children's Online Privacy Protection Act (COPPA) have been issued (Parent Coalition for Student Privacy 2017). However, the related issues and rights are complex, and the privacy rights provided by law are often not provided in practice (National Center for Education Statistics 2010). For a sustained SLDS, a push to uphold established privacy rights is important.

FERPA Interpretation
Another challenge is that some state education agencies have been reluctant to share their education records, largely due to narrow state interpretations of the confidentiality provisions of FERPA and its implementing regulations (Davis et al. 2014). Many states have overcome potential FERPA-related obstacles in their own unique ways, for example: (1) obtaining legal advice recognizing that the promulgation of amended FERPA regulations was intended to facilitate the use of individual-level data for research purposes, (2) maintaining the workforce data within the state's education agency, and (3) creating a special agency that holds both the education and workforce data (Davis et al. 2014).

Maintaining Longitudinal Data
Many states' SLDS already have linked student records, but decision making based on a short-term return on education investment is not necessarily useful; the word "longitudinal" is the keystone needed for developing a strong business case for sustained investment in an SLDS (Stevens and Zhang 2014). "Longitudinal" means the capability to link information about individuals across defined segments and through time. While there is no evidence that the length of data retention increases identity disclosure risk, public concern about data retention is escalating (Stevens and Zhang 2014).

Examples

Examples of US SLDS include:

Florida Education & Training Placement Information Program
Louisiana Workforce Longitudinal Data System (WLDS)
Minnesota's iSEEK data
Heldrich Center data at Rutgers University
Ohio State University's workforce longitudinal administrative database
University of Texas Ray Marshall Center database
Virginia Longitudinal Data System
Washington's Career Bridge
Connecticut's Preschool through Twenty and Workforce Information Network
Delaware Education Insight Dashboard
Georgia Statewide Longitudinal Data System and Georgia Academic and Workforce Analysis and Research Data System (GA AWARDS)
Illinois Longitudinal Data System
Indiana Network of Knowledge (INK)
Maryland Longitudinal Data System
Missouri Comprehensive Data System
Ohio Longitudinal Data Archive (OLDA)
South Carolina Longitudinal Information Center for Education (SLICE)
Texas Public Education Information Resource (TPEIR) and Texas Education Research Center (ERC)
Washington P-20W Statewide Longitudinal Data System

Conclusion

SLDS connect databases across two or more state agencies spanning p-20 education and the workforce. An SLDS is a US state-level Integrated Data System that focuses on tracking individuals longitudinally. SLDS are intended to enhance the ability of states to capture, manage, design, develop, analyze, and use student education records, to support data-driven decisions to improve student learning, and to facilitate research to increase student achievement and close achievement gaps. The Statewide Longitudinal Data Systems (SLDS) Grant Program funds states' efforts to develop and implement these data systems in response to legislative initiatives. The main challenges for SLDS include training/education provider participation, privacy issues, FERPA interpretation, and maintaining longitudinal data. There are many SLDS examples nationwide.

Cross-References

▶ Integrated Data System

Further Readings

Davis, S., Jacobson, L., & Wandner, S. (2014). Using workforce data quality initiative databases to develop and improve consumer report card systems. Washington, DC: Impaq International.
National Center for Education Statistics. (2010). Data stewardship: Managing personally identifiable information in student education records. SLDS technical brief. Available at http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2011602
Parent Coalition for Student Privacy. (2017). Federal student privacy rights: FERPA, PPRA and COPPA. Retrieved on May 14, 2017 from https://www.studentprivacymatters.org/ferpa_ppra_coppa/.
Stevens, D., & Zhang, T. (2014). Toward a business case for sustained investment in State Longitudinal Data Systems. Jacob France Institute. Available at http://www.jacob-france-institute.org/wp-content/uploads/JFI-WDQI-Year-Three-Research-Report1.pdf
US Department of Education. (2015). Applications for new awards; Statewide Longitudinal Data Systems Program. Federal Register. Available at https://www.federalregister.gov/documents/2015/03/12/2015-05682/applications-for-new-awards-statewide-longitudinal-data-systems-program
US Department of Education. (2016). Agency information collection activities; Comment request; State Longitudinal Data System (SLDS) Survey 2017–2019. Federal Register. Available at https://www.federalregister.gov/documents/2016/10/07/2016-24298/agency-information-collection-activities-comment-request-state-longitudinal-data-system-slds-survey
