The American Library Association (ALA) is a voluntary organization that represents libraries and librarians around the world. Worldwide, the ALA is the largest and oldest professional organization for libraries, librarians, information science centers, and information scientists. The association was founded in 1876 in Philadelphia, Pennsylvania. Since its inception, the ALA has provided leadership for the development, promotion, and improvement of libraries, information access, and information science. The ALA is primarily concerned with learning enhancement and information access for all people. The organization strives to advance the profession through its initiatives and divisions. The primary action areas for the ALA are advocacy, education, lifelong learning, intellectual freedom, organizational excellence, diversity, equitable access to information and services, expansion of all forms of literacy, and library transformation to maintain relevance in a dynamic and increasingly digitized global environment.

While the ALA is composed of several different divisions, there is no single division devoted exclusively to big data; rather, a number of different divisions are involved with it. At this time, the Association of College & Research Libraries (ACRL) is a primary division of the ALA concerned with big data issues. The ACRL has published a number of papers, guides, and articles related to the use of, the promise of, and the risks associated with big data. Several other ALA divisions are also involved with big data. The Association for Library Collections & Technical Services (ALCTS) division discusses issues related to the management, organization, and cataloging of big data and its sources. The Library and Information Technology Association (LITA) is an ALA division involved with the technological and user services activities that advance the collection, access, and use of big data and big data sources.

Big Data Activities of the Association of College & Research Libraries (ACRL)

The Association of College & Research Libraries (ACRL) is actively involved with the opportunities and challenges presented by big data. As science and technology advance, our world
# Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_6-1
released, or in some instances analyzed, the sensitive personal information needs to be altered. The challenge comes in deciding upon a method that can achieve anonymity and preserve data integrity.

Johannes Gehrke and Muthuramakrishnan Venkitasubramaniam define it as follows:

A q*-block is l-diverse if it contains at least l well-represented values for the sensitive attribute S. A table is l-diverse if every q*-block is l-diverse.

Noise Addition

To overcome this, it would be necessary to apply minimum noise so that the average income before and after would not be representative of the change, while at the same time maintaining the computational integrity of the data. The amount of noise, and whether an exponential or Laplacian mechanism is used, is still the subject of ongoing research and discussion. It should be noted, however, that the distance metric may differ depending on the data type; possible distance measures include numerical, equal, and hierarchical.
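To make the two notions above concrete, here is a minimal, hypothetical sketch in Python. The function names and the income example are illustrative assumptions, not part of this entry, and "well-represented" is simplified to "distinct" (the weakest instantiation of l-diversity): a Laplacian noise mechanism calibrated so that a released average does not reveal an individual change, and a direct check of the quoted l-diversity condition.

```python
import random
from collections import Counter

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value with Laplacian noise of scale sensitivity/epsilon,
    the standard calibration for epsilon-differentially private answers."""
    scale = sensitivity / epsilon
    # A Laplace(0, scale) variate is the difference of two exponentials.
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_value + noise

def is_l_diverse(q_blocks, l, sensitive="S"):
    """Check the quoted definition: every q*-block must contain at least
    l values of the sensitive attribute S ("well-represented" simplified
    here to "distinct")."""
    return all(len(Counter(row[sensitive] for row in block)) >= l
               for block in q_blocks)
```

For the income example above, `sensitivity` would be the largest change any single individual can cause to the released average; a larger `epsilon` means less noise and weaker privacy.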
It is extremely important to mitigate such risks through the use of effective de-identification techniques so as to protect sensitive personal information. As data becomes more abundant and accessible, it becomes increasingly important to continuously modify and refine existing anonymization techniques.

Further Reading

Machanavajjhala, A., et al. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), Article 3, 1–12.
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5).
The European Parliament and of the Council Working Party. (2014). Opinion 05/2014 on anonymisation techniques. http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf. Retrieved 29 Dec 2014.
characteristic of Big Data, as opposed to the “e-Science/Grid Computing” paradigm of the 2000s. Whereas the latter was primarily concerned with “big infrastructure,” anticipating the need for scientists to deal with a “deluge” of monolithic data emerging from massive projects such as the Large Hadron Collider, as described by Tony Hey and Anne Trefethen, Big Data is concerned with the mass of information which grows organically as the result of the ubiquity of computing in everyday life and in everyday science. In the case of archaeology, it may be considered more as a “complexity deluge,” where small data, produced on a daily basis, forms part of a bigger picture.

There are exceptions: some individual projects in archaeology are concerned with terabyte-scale data. The most obvious example in the UK is North Sea Paleolandscapes, led by the University of Birmingham, a project which has reconstructed the Early Holocene landscape of the bed of the North Sea, which was an inhabitable landscape until its inundation between 20,000 and 8,000 BP – so-called Doggerland. As Vince Gaffney and others describe, drawing on 3D seismic data gathered during the process of oil prospection, this project has used large-scale data analytics and visualization to reconstruct the topography of the preinundation land surface spanning an area larger than the Netherlands, and thus to allow inferences as to what environmental factors might have shaped human habitation of it; although it must be stressed that there is no direct evidence at all of that human occupation. While such projects demonstrate the potential of Big Data technologies for conducting large-scale archaeological research, they remain the exception. Most applications in archaeology remain relatively small scale, at least in terms of the volume of data that is produced, stored, and preserved.

However, this is not to say that approaches characteristic of Big Data are not changing the picture significantly in archaeology, especially in the field of landscape studies. Data from geophysics, the science of scanning subterranean features using techniques such as magnetometry and resistivity, typically produce relatively large datasets, which require holistic analysis in order to be understood and interpreted. This trend is accentuated by the rise of more sophisticated data capture techniques in the field, which is increasing the capacity of data that can be gathered and analyzed. Although still not “big” in the literal sense of “Big Data,” this class of material undoubtedly requires the kinds of approaches in thinking and interpretation familiar from elsewhere in the Big Data agenda. Recent applications in landscape archaeology have highlighted the need both for large capacity and for interoperation. For example, the integration of data in the Stonehenge Hidden Landscapes project, also directed by Gaffney, provides for “seamless” capture of reams of geophysical data from remote sensing, visualizing the Neolithic landscape beneath modern Wiltshire to a degree of clarity and comprehensiveness that would hitherto only have been possible with expensive and laborious manual survey. Due to improved capture techniques, this project succeeded in gathering a quantity of data in its first two weeks equivalent to that of the landmark Wroxeter survey project of the 1990s.

These early achievements of big data in an archaeological context fall against a background of falling hardware costs, lower barriers to usage, and the availability of generic web-based platforms where large-scale distributed research can be conducted. This combination of affordability and usability is bringing about a revolution in applications such as those described above, where remote sensing is reaching new concepts and applications. For example, coverage of freely available satellite imagery is now near-total; graphical resolution is finer for most areas than ever before (1 m or less); and pre-georeferenced satellite and aerial images are delivered to the user’s desktop, removing the costly and highly specialized process of locating imagery of the Earth’s surface. Such platforms also allow access to imagery of archaeological sites in regions which are practically very difficult or impossible to survey, such as Afghanistan, where declassified CORONA spy satellite data are now being employed to construct inventories of the region’s (highly vulnerable) archaeology. If these developments cannot be said to have removed the boundaries within which archaeologists can produce,
access, and analyze data, then it has certainly made them more porous.

As in other domains, strategies for the storage and preservation of data in archaeology have a fundamental relationship with relevant aspects of the Big Data paradigm. Much archaeological information lives on the local servers of institutions, individuals, and projects; this has always constituted an obvious barrier to its integration into a larger whole. However, weighing against this is the ethical and professional obligation to share, especially in a discipline where the process of gathering the data (excavation) destroys its material context. National strategies and bodies encourage the discharge of this obligation. In the UK, as well as data standards and collections held by English Heritage, the main repository for archaeological data is the Archaeology Data Service (ADS), based at the University of York. The ADS considers for accession any archaeological data produced in the UK in a variety of formats. This includes most of the data formats used in day-to-day archaeological workflows: Geographic Information System (GIS) databases and shapefiles, images, numerical data, and text. In the latter case, particular note should be given to the “Grey Literature” library of archaeological reports from surveys and excavations, which typically present archaeological information and data in a format suitable for rapid publication, rather than for the linking and interoperation of that data. Currently, the Library contains over 27,000 such reports, and the total volume of the ADS’s collections stands at 4.5 Tb (I thank Michael Charno for this information). While this could be considered “big” in terms of any collection of data in the humanities, it is not of a scale which would overwhelm most analysis platforms; what is key here, however, is that it is most unlikely to be useful to perform any “global”-scale analysis across the entire collection. The individual datasets therein relate to each other only inasmuch as they are “archaeological.” In the majority of cases, there is only fragmentary overlap in terms of content, topic, and potential use. A 2007 ADS/English Heritage report on the challenges of Big Data in archaeology identified four types of data format potentially relevant to Big Data in the field: LIDAR (Light Detection and Ranging, or Laser Imaging Detection and Ranging) data, which models terrain elevation from airborne sensors, 3D laser scanning, maritime survey, and digital video. At first glance this appears to underpin an assumption that the primary focus is data formats which convey larger individual data objects, such as images and geophysics data, with the report noting that “many formats have the potential to be Big Data, for example, a digital image library could easily be gigabytes in size. Whilst many of the conclusions reached here would apply equally to such resources this study is particularly concerned with Big Data formats in use with technologies such as lidar surveys, laser scanning and maritime surveys.”

However, the report also acknowledges that “If long term preservation and reuse are implicit goals data creators need to establish that the software to be used or toolsets exist to support format migration where necessary.” It is true that any “Big Data” which is created from an aggregation of “small data” must interoperate. In the case of “social data” from mobile devices, for example, location is a common and standardizable attribute that can be used to aggregate Tb-scale datasets: heat maps of mobile device usage can be created which show concentrations of particular kinds of activity in particular places at particular times. In more specific contexts, hashtags can be used to model trends and exchanges between large groups. Similarly intuitive attributes that can be used for interoperation, however, elude archaeological data, although there is much emerging interest in Linked Data technologies, which allow the creation of linkages between web-exposed databases, provided they conform (or can be configured to conform) to predefined specifications in descriptive languages such as RDF. Such applications have proved immensely successful in areas of archaeology concerned with particular data types, such as geodata, where there is a consistent base reference (such as latitude and longitude). However, this raises a question which is fundamental to archaeological data in any sense. Big Data approaches here, even if the data is not “Big” in terms relative to the social and natural sciences, potentially allow an
“n=all” picture of the data record. As noted above, however, this record represents only a tiny fragment of the entire picture. A key question, therefore, is: does “Big Data” thinking risk technological determinism, constraining what questions can be asked? This is a point which has concerned archaeologists since the very earliest days of computing in the discipline. In 1975, a skeptical Sir Moses Finley noted that “It would be a bold archaeologist who believed he could anticipate the questions another archaeologist or a historian might ask a decade or a generation later, as the result of new interests or new results from older researchers. Computing experience has produced examples enough of the unfortunate consequences . . . of insufficient anticipation of the possibilities at the coding stage.”

Conclusion

Such questions probably cannot be predicted, but big data is (also) not about predicting questions. The kind of critical framework that Big Data is advancing, in response to the ever-more linkable mass of pockets of information, each themselves becoming larger in size as hardware and software barriers lower, allows us to go beyond what is available “just” from excavation and survey. We can look at the whole landscape in greater detail and at new levels of complexity. We can harvest public discourse about cultural heritage in social media and elsewhere and ask what that tells us about that heritage’s place in the contemporary world. We can examine what the fundamental building blocks of our knowledge about the past are and ask what we gain, as well as lose, by putting them into a form that the World Wide Web can read.

References

Archaeology Data Service. http://archaeologydataservice.ac.uk. Accessed 25 May 2017.
Austin, T., & Mitcham, J. (2007). Preservation and management strategies for exceptionally large data formats: ‘Big Data’. Archaeology Data Service & English Heritage: York, 28 Sept 2007.
Gaffney, V., Thompson, K., & Finch, S. (2007). Mapping Doggerland: The Mesolithic landscapes of the Southern North Sea. Oxford: Archaeopress.
Gaffney, C., Gaffney, V., Neubauer, W., Baldwin, E., Chapman, H., Garwood, P., Moulden, H., Sparrow, T., Bates, R., Löcker, K., Hinterleitner, A., Trinks, I., Nau, W., Zitz, T., Floery, S., Verhoeven, G., & Doneus, M. (2012). The Stonehenge Hidden Landscapes Project. Archaeological Prospection, 19(2), 147–155.
Tudhope, D., Binding, C., Jeffrey, S., May, K., & Vlachidis, A. (2011). A STELLAR role for knowledge organization systems in digital archaeology. Bulletin of the American Society for Information Science and Technology, 37(4), 15–18.
Asian Americans Advancing Justice

Francis Dalisay
Communication & Fine Arts, College of Liberal Arts & Social Sciences, University of Guam, Mangilao, GU, USA

Asian Americans Advancing Justice (AAAJ) is a national nonprofit organization founded in 1991. It was established to empower Asian Americans, Pacific Islanders, and other underserved groups, ensuring a fair and equitable society for all. The organization’s mission is to promote justice, unify local and national constituents, and empower communities. To this end, AAAJ dedicates itself to developing public policy, educating the public, litigating, and facilitating the development of grassroots organizations. Some of its recent accomplishments have included increasing Asian Americans and Pacific Islanders’ voter turnout and access to polls, enhancing immigrants’ access to education and employment opportunities, and advocating for greater protections of rights as they relate to the use of “big data.”

The Civil Rights Principles for the Era of Big Data

In 2014, AAAJ joined a diverse coalition comprising civil, human, and media rights groups, such as the ACLU, the NAACP, and the Center for Media Justice, to propose, sign, and release the “Civil Rights Principles for the Era of Big Data.” The coalition acknowledged that progress and advances in technology would foster improvements in the quality of life of citizens and help mitigate discrimination and inequality. However, because various types of “big data” tools and technologies – namely, digital surveillance, predictive analytics, and automated decision-making – could potentially ease the degree to which businesses and governments are able to encroach upon the private lives of citizens, the coalition found it critical that such tools and technologies be developed and employed with the intention of respecting equal opportunity and equal justice.

According to civilrights.org (2014), the Civil Rights Principles for the Era of Big Data propose five key principles: (1) stop high-tech profiling, (2) guarantee fairness in automated decisions, (3) maintain constitutional protections, (4) enhance citizens’ control of their personal information, and (5) protect citizens from inaccurate data. These principles were intended to inform law enforcement, companies, and policymakers about the impact of big data practices on racial justice and the civil and human rights of citizens.

1. Stop high-tech profiling. New and emerging surveillance technologies and techniques have made it possible to piece together comprehensive details on any citizen or group, resulting in
an increased risk of profiling and discrimination. For instance, it was alleged that police in New York had used license plate readers to document vehicles that were visiting certain mosques; this allowed the police to track where the vehicles were traveling. The accessibility and convenience of this technology meant that this type of surveillance could happen without policy constraints. The principle of stopping high-tech profiling was thus intended to limit such acts by setting clear limits and establishing auditing procedures for surveillance technologies and techniques.

2. Ensure fairness in automated decisions. Today, computers are responsible for making critical decisions that have the potential to affect citizens’ lives in the areas of health, employment, education, insurance, and lending. For example, major auto insurers are able to use monitoring devices to track drivers’ habits, and as a result, insurers could potentially deny the best coverage rates to those who often drive when and where accidents are more likely to occur. The principle of ensuring fairness in automated decisions advocates that computer systems should operate fairly in situations and circumstances such as the one described. The coalition recommended, for instance, that independent reviews be employed to assure that systems are working fairly.

3. Preserve constitutional protections. This principle advocates that government databases must be prohibited from undermining core legal protections, including those concerning citizens’ privacy and their freedom of association. Indeed, it has been argued that data from warrantless surveillance conducted by the National Security Agency have been used by federal agencies, including the DEA and the IRS, even though such data were gathered outside the policies that rule those agencies. Individuals with access to government databases could also potentially use them for improper purposes. The principle of preserving constitutional protections is thus intended to limit such instances from occurring.

4. Enhance citizens’ control of their personal information. According to this principle, citizens should have direct control over how corporations gather data from them, and how corporations use and share such data. Indeed, personal and private information known and accessible to a corporation can be shared with companies and the government. For example, unscrupulous companies can find vulnerable customers by accessing and using highly targeted marketing lists, such as one that might contain the names and contact information of citizens who have cancer. In this case, the principle of enhancing citizens’ control of personal information ensures that the government and companies should not be able to disclose private information without a legal process for doing so.

5. Protect citizens from inaccurate data. This principle advocates that when it comes to making important decisions about citizens – particularly the disadvantaged (the poor, persons with disabilities, the LGBT community, seniors, and those who lack access to the Internet) – corporations and the government should work to ensure that their databases contain accurate personal information about citizens. To ensure the accuracy of data, this could require disclosing the underlying data and granting citizens the right to correct information that is inaccurate. For instance, government employment verification systems have had higher error rates for legal immigrants and individuals with multiple surnames (including many Hispanics) than for other legal workers; this has created a barrier to employment. In addition, some individuals have lost job opportunities because of inaccuracies in their criminal history information, or because their information had been expunged.

The five principles above continue to help inspire subsequent movements highlighting the growing need to strengthen and protect civil rights in the face of technological change. Asian Americans Advancing Justice and the other members of the coalition also continue to advocate for these rights and protections.
Cross-References

▶ American Civil Liberties Union
▶ Center for Democracy and Technology
▶ Center for Digital Democracy
▶ National Hispanic Media Coalition

Further Reading

Civil rights and big data: Background material. http://www.civilrights.org/press/2014/civil-rights-and-big-data.html. Accessed 20 June 2016.
selected from the short-listed candidates in the second stage. The dual decision model not only facilitates greater insights; it also eliminates the fatigue that can seriously dampen the capacity for effective decisions. Yet this discipline comes at a cost. Goals, values, and biases that are part of the early phase of a project can leave a lasting imprint. Any realization later in the project that was not deliberately or accidentally situated in the earlier context becomes more difficult to incorporate into the decision. In the context of recruitment, if the skills desired of the selected candidate change after the first stage, it is unlikely that the short-listed pool will rank highly in that skill. The more unique the requirement that emerges in the later stage, the greater the likelihood that it will not be sufficiently fulfilled. This tradeoff suggests that an improvement in our understanding of the choices comes at the cost of limited maneuverability within an established decision context.

In addition to the benefits and costs of early decisions in the data generation cycle, big data allows access to information at a much more granular level than was possible in the past. Behaviors, attitudes, and preferences can now be tracked in extensive detail, fairly continuously, and over longer periods of time. They can in turn be combined with other sources of data to develop a broader understanding of consumers, suppliers, employees, and competitors. Not only can we understand in much more depth the activities and processes that pertain to various social and economic landscapes; the higher level of granularity also makes decisions more informed and, as a result, more effective. Unfortunately, granularity also brings with it the potential for distraction. All data that pertains to a choice may not be necessary for the decision, and excessive understanding can overload our capacity to make inferences. Imagine the human skin, which is continuously sensing and discarding thermal information generated from our interaction with the environment. What if we had to consciously respond to every signal detected by the skin? It is the loss of granularity that comes through the human mind, responsive only to significant changes in temperature, that saves us from being overwhelmed by data. Even though information granularity makes it possible to know what was previously impossible, information overload can lead us astray toward inappropriate choices, and at worst, it can incapacitate our ability to make effective decisions.

The third implication of big data is the potential for objectivity. When a planned and comprehensive examination of alternatives is combined with a deeper understanding of the data, the result is more accurate information. This makes it less likely for individuals to come to an incorrect conclusion. It eliminates the personal biases that can prevail in the absence of sufficient information. Since the traditional response to overcoming the effect of personal bias is to rely on individuals with greater experience, big data predicts an elimination of the critical role of experience. In this vein, Andrew McAfee and Erik Brynjolfsson (2012) find that regardless of the level of experience, firms that extensively rely on data for decision making are, on average, 6% more profitable than their peers. This suggests that as decisions become increasingly imbued with an objective orientation, prior knowledge becomes a redundant element. This, however, does not eliminate the value of domain-level experts. Their role is expected to evolve into individuals who know what to look for (by asking the right questions) and where to look (by identifying the appropriate sources of data). Domain expertise, and not just experience, is the mantra for identifying the people who are likely to be the most valuable in this new information age. However, it needs to be acknowledged that this belief in objectivity is based on a critical assumption: individuals endowed with identical information that is sufficient and relevant to the context reach identical conclusions. Yet anyone watching the same news story reported by different media outlets knows the fallacy of this assumption. The variations that arise when identical facts lead individuals to contrasting conclusions are a manifestation of the differences in the way humans work with information. Human cognitive machinery associates meanings to concepts based on personal history. As a result, even while being cognizant of our biases, the translation of information into
Automated Modeling/Decision Making, Table 1  Opportunities and challenges for the decision implications of big data

Big data implication          Opportunity                                 Challenge
1. Dual decision model        Comprehensive examination of alternatives   Early choices can constrain later considerations
2. Granularity                In-depth understanding                      Critical information can be lost due to information overload
3. Objectivity                Lack of dependence on experience            Inflates the effect of variations in translation
4. Transparency               Free flow of ideas                          Difficult to validate
5. Bottom-up decision making  Prompt decisions                            Impairment of vision
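The dual decision model's tradeoff (Table 1, row 1) can be sketched in a few lines of Python. The candidates, skills, and scoring functions below are invented for illustration, not taken from the entry; the sketch shows how a requirement that emerges only after shortlisting can no longer be well served.

```python
def shortlist(candidates, score, k):
    """Stage 1: screen the full pool down to the top-k candidates
    under the criteria fixed at the start of the project."""
    return sorted(candidates, key=score, reverse=True)[:k]

def select(candidates, score):
    """Stage 2: choose the single best candidate from the shortlist."""
    return max(candidates, key=score)

# Hypothetical candidates rated 0-10 on two skills.
pool = [
    {"name": "Ana",  "modeling": 9, "communication": 2},
    {"name": "Ben",  "modeling": 8, "communication": 3},
    {"name": "Cara", "modeling": 3, "communication": 9},
]

# Stage 1 screens on modeling skill only; Cara is eliminated early.
finalists = shortlist(pool, score=lambda c: c["modeling"], k=2)

# If communication becomes the desired skill after stage 1, the best
# remaining finalist is still weaker in it than the eliminated Cara:
# the early choice constrains the later consideration.
hire = select(finalists, score=lambda c: c["communication"])
```

Here the stage 2 pick is weaker in the newly emerged skill than a candidate already screened out, which is exactly the limited maneuverability of an established decision context described above.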
visitors may help users fulfill their informational of face-to-face communication, behavioral analyt-
needs so that they can apply the information to ics allows commercial marketers to examine e-
improve decisions they make about their health. consumers through additional lenses apart from
the traditional demographic and traffic tracking.
In approaching the selling process from a relation-
Applications ship standpoint, behavioral analytics uses data
collected via web-based behavior to increase
According to Kokel and colleagues, the largest understanding of consumer motivations and
behavioral databases can be found at Internet goals, and fulfill their needs. Examples of these
technology companies such as Google as well as online gaming communities. The sheer size of these datasets is giving rise to new methods, such as data visualization, for behavioral analytics. Fox and Hendler note the opportunity in implementing data visualization as a tool for exploratory research and argue for a need to create a greater role for it in the process of scientific discovery. For example, Carneiro and Mylonakis explain how Google Flu relies on data visualization tools to predict outbreaks of influenza by tracking online search behavior and comparing it to geographical data. Similarly, Mitchell notes how Google Maps analyzes traffic patterns through data provided via real-time cell phone location to provide recommendations for travel directions. In the realm of social media, Bollen and colleagues have also demonstrated how analysis of Twitter feeds can be used to predict public sentiments.

According to Jou, the value of behavioral analytics has perhaps been most notably observed in the area of commercial marketing. The consumer marketing space has borne witness to the progress made through extracting actionable and profitable insights from user behavioral data. For example, between recommendation engines at Amazon and teams of data scientists at LinkedIn, behavioral analytics has allowed these companies to transform their plethora of user data into increased profits. Similarly, advertising efforts have turned toward the use of behavioral analytics to glean further insights into consumer behavior. Yamaguchi discusses several tools on which digital marketers rely that go beyond examining data from site traffic.

Nagaitis notes observations that are consistent with Jou's view of behavioral analytics' impact on marketing. According to Nagaitis, in the absence [. . .] sources of data include keyword searches, navigation paths, and click-through patterns. By inputting data from these sources into machine learning algorithms, computational social scientists are able to map human factors of consumer behavior as it unfolds during purchases. In addition, behavioral analytics can use web-based behaviors of consumers as proxies for cues typically conveyed through in-person, face-to-face communication. Previous research suggests that web-based dialogs can capture rich data pointing toward behavioral cues, the analysis of which can yield highly accurate predictions comparable to data collected during face-to-face interactions. The significance of this ability to capture communication cues is reflected in marketers' increased ability to speak to their consumers with greater personalization that enhances the consumer experience.

Behavioral analytics has also enjoyed increasingly widespread application in game development. El-Nasr and colleagues discuss the growing significance of assessing and uncovering insights related to player behavior, both of which have emerged as essential goals for the game industry and catapulted behavioral analytics into a central role with commercial and academic implications for game development. A combination of evolving mobile device technology and shifting business models that focus on game distribution via online platforms has created a situation for behavioral analytics to make important contributions toward building profitable businesses.

Increasingly available data on user behavior has given rise to the use of behavioral analytic approaches to guide game development. Fields and Cotton note the premium placed in this industry on data mining techniques that decrease behavioral datasets in complexity while extracting
knowledge that can drive game development. However, determining cutting-edge methods in behavioral analytics within the game industry is a challenge due to reluctance on the part of various organizations to share analytic methods. Drachen and colleagues observe a difficulty in assessing both the data and the analytical methods applied in this area, due to a perception that these approaches represent a form of intellectual property. Sifa further notes that to the extent that data mining, behavioral analytics, and the insights derived from these approaches provide a competitive advantage over rival organizations in an industry that already exhibits fierce competition in the entertainment landscape, organizations will not be motivated to share knowledge about these methods.

Another area receiving attention for its application of behavioral analytics is business management. Noting that much interest in applying behavioral analytics has focused on modeling and predicting consumer experiences, Géczy and colleagues observe a potential for applying these techniques to improve employee usability of internal systems. More specifically, Géczy and colleagues describe the use of behavioral analytics as a critical first step toward user-oriented management of organizational information systems through identification of relevant user characteristics. Through behavioral analytics, organizations can observe characteristics of usability and interaction with information systems and identify patterns of resource underutilization. These patterns are important in providing implications for designing streamlined and efficient user-oriented processes and services. Behavioral analytics can also offer prospects for increasing personalization during the user experience by drawing from user information provided in user profiles. These profiles contain information about how the user interacts with the system, and the system can adjust accordingly based on clustering of users.

Despite advances made in behavioral analytics within the commercial marketing and game industries, several areas are ripe with opportunities for integrating behavioral analytics to improve performance and decision-making practices. One area that has not yet reached its full potential for capitalizing on the use of behavioral analytics is security. Although Brown reports on exploration in the use of behavioral analytics to track cross-border smuggling activity in the United Kingdom through vehicle movement, the application of these techniques under the broader umbrella of security remains understudied. Along these lines, and in the context of an enormous amount of available data, Jou discusses the possibilities for implementing behavioral analytics techniques to identify insider threats posed by individuals within an organization. Inputting data from a variety of sources into behavioral analytics platforms can offer organizations an opportunity to continuously monitor users and machines for early indicators and detection of anomalies. These sources may include email data, network activity via browser activity and related behaviors, intellectual property repository behaviors related to how content is accessed or saved, end-point data showing how files are shared or accessed, and other less conventional sources such as social media or credit reports. Connecting data from various sources and aggregating them under a comprehensive data plane can provide enhanced behavioral threat detection. Through this, robust behavioral analytics can be used to extract insights into patterns of behavior consistent with an imminent threat. At the same time, the use of behavioral analytics can also measure, accumulate, verify, and correctly identify real insider threats while preventing inaccurate classification of nonthreats. Jou concludes that implementing behavioral analytics in an ethical manner can provide practical and operative intelligence, while raising the question as to why implementation in this field has not occurred more quickly.

In conclusion, behavioral analytics has been previously defined as a process in which large datasets consisting of behavioral data are analyzed for the purpose of deriving insights that can serve as actionable knowledge. This definition includes three goals underlying the use of behavioral analytics, namely, to enhance organizational performance, improve decision-making, and generate insights into user behavior. Given the burgeoning presence of big data and the spread of data mining techniques to analyze these data, several fields have
begun to integrate behavioral analytics into their approaches for problem-solving and performance-enhancing actions. While concerns related to the accuracy and ethical use of these insights remain to be addressed, behavioral analytics can present organizations and businesses with unprecedented opportunities to enhance business, management, and operations.

Cross-References

▶ Big Data
▶ Business Analytics
▶ Data Mining
▶ Data Science
▶ Data Scientist
▶ Data-Driven Decision-Making

Further Readings

Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. Proceedings of the Fifth International Association for Advancement of Artificial Intelligence Conference on Weblogs and Social Media.
Brown, G. M. (2007). Use of Kohonen self-organizing maps and behavioral analytics to identify cross-border smuggling activity. Proceedings of the World Congress on Engineering and Computer Science.
Carneiro, H. A., & Mylonakis, E. (2009). Google trends: A web-based tool for real-time surveillance of disease outbreaks. Clinical Infectious Diseases, 49(10).
Davenport, T., & Harris, J. (2007). Competing on analytics: The new science of winning. Boston: Harvard Business School Press.
Drachen, A., Sifa, R., Bauckhage, C., & Thurau, C. (2012). Guns, swords and data: Clustering of player behavior in computer games in the wild. Proceedings of the IEEE Computational Intelligence and Games.
El-Nasr, M. S., Drachen, A., & Canossa, A. (2013). Game analytics: Maximizing the value of player data. New York: Springer Publishers.
Fields, T. (2011). Social game design: Monetization methods and mechanics. Boca Raton: Taylor & Francis.
Fox, P., & Hendler, J. (2011). Changing the equation on scientific data visualization. Science, 331(6018).
Géczy, P., Izumi, N., Shotaro, A., & Hasida, K. (2008). Toward user-centric management of organizational information systems. Proceedings of the Knowledge Management International Conference, Langkawi, Malaysia (pp. 282–286).
Kohavi, R., Rothleder, N., & Simoudis, E. (2002). Emerging trends in business analytics. Communications of the ACM, 45(8).
Mitchell, T. M. (2009). Computer science: Mining our reality. Science, 326(5960).
Montibeller, G., & Durbach, I. (2013). Behavioral analytics: A framework for exploring judgments and choices in large data sets. Working Paper LSE OR13.137. ISSN 2041-4668.
Negash, S., & Gray, P. (2008). Business intelligence. Berlin/Heidelberg: Springer.
Sifa, R., Drachen, A., Bauckhage, C., Thurau, C., & Canossa, A. (2013). Behavior evolution in Tomb Raider: Underworld. Proceedings of the IEEE Computational Intelligence and Games.
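The continuous-monitoring idea discussed in this entry, watching user and machine activity for early indicators of insider threats, reduces in its simplest form to flagging users whose behavior deviates sharply from their own historical baseline. A minimal sketch in Python; the users, counts, and threshold are invented for illustration, and real platforms combine many more signals than a single z-score:

```python
from statistics import mean, stdev

def flag_anomalies(baseline, observed, threshold=3.0):
    """Flag users whose observed activity deviates from their own
    historical baseline by more than `threshold` standard deviations."""
    flagged = []
    for user, history in baseline.items():
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # no variation in history; skip rather than divide by zero
        if abs((observed[user] - mu) / sigma) > threshold:
            flagged.append(user)
    return flagged

# Hypothetical daily file-access counts per user from a repository log.
baseline = {
    "alice": [10, 12, 11, 9, 13, 10, 12],
    "bob": [20, 22, 19, 21, 20, 23, 18],
}
observed = {"alice": 11, "bob": 95}  # bob suddenly touches 95 files

print(flag_anomalies(baseline, observed))  # ['bob']
```

Aggregating several such per-source scores under one "data plane," as the entry describes, is what turns isolated deviations into a usable threat indicator.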
3. The development from humanities 1.0 to humanities 2.0 (Davidson 2008, pp. 707–717) marks the transition from the digital development of methods within the "Enhanced Humanities" to the "Social Humanities," which use the possibilities of web 2.0 to construct the research infrastructure. The social humanities draw on the interdisciplinarity of scientific knowledge by making use of software for open access, social reading, and open knowledge, and by enabling cooperative and collaborative online work on research and development. On the basis of the new digital infrastructure of the social web (hypertext systems, wiki tools, crowdfunding software, etc.), these products carry the computer-based processes of the early phase of the digital humanities into the network culture of the social sciences. Today it is the Blogging Humanities (work on digital publications and mediation in peer-to-peer networks) and the Multimodal Humanities (presentation and representation of knowledge within multimedia software environments) that stand for the technical modernization of academic knowledge (McPherson 2008). On this basis, the Big Social Humanities claim to represent a paradigmatically alternative form of knowledge production. In this context, one should reflect on the technical fundamentals of computer-based knowledge production in the humanities and cultural studies, critically considering data, the genealogy of knowledge, and media history, in order to properly evaluate their role in the context of digital knowledge production and distribution (Thaller 2012, pp. 7–23).

History of Big Humanities

The Big Humanities have been considered only occasionally from the perspective of science and media history over the last few years (Hockey 2004). Historical approaches to the interdependent relation between the humanities and cultural studies and the use of computer-based processes relativize the claim of digital methods to evidence and truth, and they support the argument that the digital humanities developed out of a network of historical cultures of knowledge and media technologies whose roots lie at the end of the nineteenth century.

In the research literature on the historical context and genesis of the Big Humanities, a concordance of Thomas Aquinas based on punch cards, produced by Roberto Busa, is regarded as one of the first projects of genuinely humanistic use of the computer (Vanhoutte 2013, p. 126). Roberto Busa (1913–2011), an Italian Jesuit priest, is considered a pioneer of the Digital Humanities. This project gave the early historiography of computational science a point of uniformity (Schischkoff 1952). Busa, who in 1949 developed the linguistic corpus of the "Index Thomisticus" together with Thomas J. Watson, the founder of IBM (Busa 1951, 1980, pp. 81–90), is regarded as a founder of the intersection between the humanities and IT. The first digital edition on punch cards initiated a series of subsequent philological projects: "In the 60s the first electronic version of the 'Modern Language Association International Bibliography' (MLAIB) came up, a specific periodical bibliography of all modern philologies, which could be searched through with a telephone coupler. The retrospective digitalization of cultural heritage started after that, having had ever more works and lexicons such as the German vocabulary by the Grimm brothers, historical vocabularies as the Krünitz or regional vocabularies" (Lauer 2013, p. 104).

At first, a large number of other disciplines and non-philological areas took shape, such as literature, library, and archive studies. They had a longer epistemological history in the field of philological case studies and practical information studies. Since the introduction of punch card methods, they have been dealing with quantitative and IT procedures for the facilities of knowledge management. As one can see, neither the research question nor Busa's methodological procedure was without predecessors, so both can be seen as part of a larger and longer history of knowledge and media archeology. Sketch models of a mechanical knowledge apparatus capable of
combining information were found in the manuscripts of the Swiss archivist Karl Wilhelm Bührer (1861–1917; Bührer 1890, pp. 190–192). This figure of thought of the flexible, modularized information unit became a conceptual core of mechanical data processing. Archive and library studies took a direct part in the historical paradigm shift of information processing. It was John Shaw Billings, the physician and later director of the National Medical Library, who worked further on the development of an apparatus for machine-driven processing of statistical data, a machine developed by Hermann Hollerith in 1886 (Krajewski 2007, p. 43). The technology of punch cards traces its roots to the technical pragmatics of library knowledge organization, even if the librarian's working procedure was automated in specific areas only later, within the rationalization movement of the 1920s. Other data processing projects show that the automated production of an index or a concordance marks the beginning of computer-based humanities and cultural studies for the lexicography and catalogue apparatus of libraries. Until the late 1950s, it was the automated method of processing large text data with the punch card system after the Hollerith procedure that stood at the center of the first applications. The technical procedure of punch cards changed the reading practice of text analysis by transforming a book into a database and by turning the linear-syntagmatic structure of a text into a factual, term-based system. As early as 1951, an academic debate among contemporaries started in scholarly journals. This debate saw the possible applications of the punch card system as largely positive and placed them in the context of economically motivated rationality. Between December 13 and 16, 1951, the German Society for Documentation and the Advisory Board of the German Economic Chamber organized a working conference on the mechanization and automation of the documentation process, which was enthusiastically discussed by the philosopher Georgi Schischkoff. He spoke of a "significant simplification and acceleration [. . .] by mechanical remembrance" (Schischkoff 1952, p. 290). The representatives of the computer-based humanities saw in "literary computing," starting in the early 1950s, the first autonomous research area that could provide an "objective analysis of exact knowledge" (Pietsch 1951). In the 1960s, the first studies in the field of computer linguistics concerning the automated indexing of large text corpora appeared, publishing computer-based analyses of word indexing, word frequency, and word groups.

The automated evaluation of texts for editorial work within literary studies was described already in the early stages of "humanities computing" (mostly within its areas of "computer philology" and "computer linguistics") on the basis of two discourse figures that remain relevant today. The first figure of discourse describes the achievements of the new tools in terms of the instrumental availability of data ("helping tools"); the other focuses on the economic disclosure of data and emphasizes the efficiency and effectiveness of machine methods of documentation. The media figure of automation was finally combined with the expectation that interpretative and subjective influences could be systematically removed from the processing and analysis of information. In the 1970s and 1980s, computer linguistics was established as an institutionally positioned area of research with its university facilities, specialist journals (Journal of Literary and Linguistic Computing, Computing in the Humanities), discussion panels (HUMANIST), and conference activities. Computer-based work in historical-sociological research had its first large rise, but in work reports it was regarded less as an autonomous method than as a tool for critical text examination and as a simplification measure for quantifying the prospective subjects (Jarausch 1976, p. 13).

A sustainable media turn, both in the field of production and in the field of reception aesthetics, appeared with the application of standardized markup for texts, such as the Standard Generalized Markup Language established in 1986, and software-driven programs for text processing. They made available additional series of digital modules, analytical tools, and text functions and transformed the text into the model of a database. The texts could be loaded as structured
information and were available as (relational) databases. In the 1980s and 1990s, technical development and text reception were dominated by the paradigm of the database.

With the rise of the World Wide Web, research and teaching practices changed drastically: specialized communication gained a lively dynamic through the digital network culture of publicly accessible online resources, e-mail distribution, chats, and forums, and it became largely responsive through the media-driven feedback mentality of rankings and voting. With its aspiration to go beyond the hierarchical structures of the academic system through the reengineering of scientific knowledge, the Digital Humanities 2.0 made the ideals of equality, freedom, and omniscience appear attainable again. As opposed to their beginnings in the 1950s, the Digital Humanities today also aspire to reorganize the knowledge of society. Therefore, they regard themselves "both as a scientific as well as a socioutopistic project" (Hagner and Hirschi 2013, p. 7). With the use of social media in the humanities and cultural studies, the technological possibilities and the scientific practices of the Digital Humanities not only developed further but also brought to life new phantasmagorias of scientific distribution, quality evaluation, and transparency in the World Wide Web (Haber 2013, pp. 175–190). In this context, Bernhard Rieder and Theo Röhle identified five central problematic perspectives for the current Digital Humanities in their 2012 text "Digital methods: Five challenges": the temptation of objectivity, the power of visual evidence, black-boxing (fuzziness, problems of random sampling, etc.), institutional turbulences (rivaling service facilities and teaching subjects), and the claim of universality. Computer-based research is usually dominated by the evaluation of data, so that some researchers see advanced analysis within the research process even as a substitute for substantial theory construction. This means that research interests become almost completely data driven. This evidence-based concentration on the possibilities of data can lead researchers to neglect the heuristic aspects of their own subject. Since the social net is not a neutral, powerless channel for reading, writing, and publication resources but also a governmental power structure of scientific knowledge, the epistemological probing of the social, political, and economic contexts of the Digital Humanities also includes a data-critical and historical questioning of their computer-based reform agenda (Schreibman 2012, pp. 46–58).

What did the use of computer technology change for cultural studies and the humanities at the level of theoretical essentials? Computers reorganized and accelerated the quantification and calculation of scientific knowledge; they entrenched the metrical paradigm in cultural studies and the humanities and confronted the hermeneutic-interpretative approaches with a mathematical formalization of the respective subject field. In addition to these epistemological shifts, research practices within the Big Humanities have shifted as well: research and development are now seen as project-related, collaborative, and network-formed, and on the network horizon they become the subject of network analysis. Network analysis itself aims to reveal the correlations and relation patterns of the digital communication of scientific networks and to make the Big Humanities themselves the subject of reflection within a social-constructivist actor-network theory.

Further Readings

Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2012). Digital_Humanities. Cambridge, MA: MIT Press. Online: http://mitpress.mit.edu/sites/default/files/titles/content/9780262018470_Open_Access_Edition.pdf
Bührer, K. W. (1890). Ueber Zettelnotizbücher und Zettelkatalog. Fernschau, 4, 190–192.
Busa, R. (1951). S. Thomae Aquinatis Hymnorum Ritualium Varia Specimina Concordantiarum. Primo saggio di indici di parole automaticamente composti e stampati da macchine IBM a schede perforate. Milano: Bocca.
Busa, R. (1980). The annals of humanities computing: The index Thomisticus. Computers and the Humanities, 14(2), 83–90.
Davidson, C. N. (2008). Humanities 2.0: Promise, perils, predictions. Publications of the Modern Language Association (PMLA), 123(3), 707–717.
Gold, M. K. (Ed.). (2012). Debates in the digital humanities. Minneapolis: University of Minnesota Press.
Haber, P. (2013). 'Google Syndrom'. Phantasmagorien des historischen Allwissens im World Wide Web. Zürcher Jahrbuch für Wissensgeschichte, 9, 175–190.
Hagner, M., & Hirschi, C. (2013). Editorial Digital Humanities. Zürcher Jahrbuch für Wissensgeschichte, 9, 7–11.
Hockey, S. (2004). History of humanities computing. In S. Schreibman, R. Siemens, & J. Unsworth (Eds.), A companion to digital humanities. Oxford: Blackwell.
Jarausch, K. H. (1976). Möglichkeiten und Probleme der Quantifizierung in der Geschichtswissenschaft. In ders., Quantifizierung in der Geschichtswissenschaft. Probleme und Möglichkeiten (pp. 11–30). Düsseldorf: Droste.
Krajewski, M. (2007). In Formation. Aufstieg und Fall der Tabelle als Paradigma der Datenverarbeitung. In D. Gugerli, M. Hagner, M. Hampe, B. Orland, P. Sarasin, & J. Tanner (Eds.), Nach Feierabend. Zürcher Jahrbuch für Wissenschaftsgeschichte (Vol. 3, pp. 37–55). Zürich/Berlin: Diaphanes.
Lauer, G. (2013). Die digitale Vermessung der Kultur. Geisteswissenschaften als Digital Humanities. In H. Geiselberger & T. Moorstedt (Eds.), Big Data. Das neue Versprechen der Allwissenheit (pp. 99–116). Frankfurt/M: Suhrkamp.
McCarty, W. (2005). Humanities computing. London: Palgrave.
McPherson, T. (2008). Dynamic vernaculars: Emergent digital forms in contemporary scholarship. Lecture presented to HUMLab Seminar, Umeå University, 4 Mar. http://stream.humlab.umu.se/index.php?streamName=dynamicVernaculars
Pietsch, E. (1951). Neue Methoden zur Erfassung des exakten Wissens in Naturwissenschaft und Technik. Nachrichten für Dokumentation, 2(2), 38–44.
Ramsey, S., & Rockwell, G. (2012). Developing things: Notes toward an epistemology of building in the digital humanities. In M. K. Gold (Ed.), Debates in the digital humanities (pp. 75–84). Minneapolis: University of Minnesota Press.
Rieder, B., & Röhle, T. (2012). Digital methods: Five challenges. In D. M. Berry (Ed.), Understanding digital humanities (pp. 67–84). London: Palgrave.
Schischkoff, G. (1952). Über die Möglichkeit der Dokumentation auf dem Gebiete der Philosophie. Zeitschrift für Philosophische Forschung, 6(2), 282–292.
Schreibman, S. (2012). Digital humanities: Centres and peripheries. In M. Thaller (Ed.), Controversies around the digital humanities (Historical Social Research, Vol. 37(3), pp. 46–58). Köln: Zentrum für Historische Sozialforschung.
Svensson, P. (2010). The landscape of digital humanities. Digital Humanities Quarterly (DHQ), 4(1). Online: http://www.digitalhumanities.org/dhq/vol/4/1/000080/000080.html
Thaller, M. (Ed.). (2012). Controversies around the digital humanities: An agenda. Historical Social Research, 37(3), 7–23.
Vanhoutte, E. (2013). The gates of hell: History and definition of digital | humanities. In M. Terras, J. Nyhan, & E. Vanhoutte (Eds.), Defining digital humanities (pp. 120–156). Farnham: Ashgate.
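The punch-card concordances and word-frequency counts at the origin of this history are straightforward to reproduce today. A minimal keyword-in-context (KWIC) sketch in Python; the sample sentence and the three-word context window are invented for illustration:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic word tokens."""
    return re.findall(r"[a-zA-Z]+", text.lower())

def concordance(words, keyword, width=3):
    """Keyword-in-context (KWIC) index: each occurrence of `keyword`
    with up to `width` words of context on either side."""
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            hits.append((" ".join(words[max(0, i - width):i]),
                         w,
                         " ".join(words[i + 1:i + 1 + width])))
    return hits

sample = ("The punch card changed the practice of text analysis, "
          "transforming a book into a database and the text into "
          "a term-based system.")
words = tokenize(sample)
print(Counter(words).most_common(2))      # word-frequency list
for left, kw, right in concordance(words, "text"):
    print(f"{left:>20} [{kw}] {right}")   # aligned KWIC lines
```

Busa's Index Thomisticus did essentially this at corpus scale, with the sorting and collation performed by punch-card machinery rather than in memory.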
(b) the dissemination of data analysis methods and software,
(c) the training in biomedical big data and data science,
(d) the establishment of centers of excellence in data science (Margolis et al. 2014)

First, the BD2K initiative fosters the emergence of data science as a discipline relevant to biomedicine by developing solutions to specific high-need challenges confronting the research community. For instance, the Centers of Excellence in Data Science initiated the first BD2K Funding Opportunity to test and validate new ideas in data science. Second, BD2K aims to enhance the training of methodologists and practitioners in data science by improving their skills in the areas in demand under the data science "umbrella," such as computer science, mathematics, statistics, biomedical informatics, biology, and medicine. Third, given the complex questions posed by the generation of large amounts of data requiring interdisciplinary teams, the BD2K initiative facilitates the development of investigators in all parts of the research enterprise for interdisciplinary collaboration to design studies and perform subsequent data analyses (Margolis et al. 2014).

Besides these promotive initiatives proposed by national research institutes such as the NIH, great endeavors in improving biomedical big data processing and analysis have also been made by biomedical researchers and for-profit organizations. National cyberinfrastructure has been suggested by biomedical researchers as one of the systems that could efficiently handle many of the big data challenges facing the medical informatics community. In the United States, the national cyberinfrastructure (CI) refers to an existing system of research supercomputer centers and the high-speed networks that connect them (LeDuc et al. 2014). CI has been widely used by physical and earth scientists, and more recently biologists, yet little used by biomedical researchers. It has been argued that more comprehensive adoption of CI could help address many challenges in the biomedical area. One example of an innovative biomedical big data technique provided by for-profit organizations is GENALICE MAP, next-generation sequencing (NGS) DNA processing software launched by the Dutch software company GENALICE. Processing biomedical big data one hundred times faster than conventional data analytic tools, MAP demonstrated robustness and spectacular performance and raised NGS data processing and analysis to a new level.

Challenges

Despite the opportunities brought by biomedical big data, certain noteworthy challenges also exist. First, to use big biomedical data effectively, it is imperative to identify the potential sources of healthcare information and to determine the value of linking them together (Weber et al. 2014). The "bigness" of biomedical data sets is multidimensional: some big data, such as EHRs, provide depth by including multiple types of data (e.g., images, notes, etc.) about individual patient encounters; others, such as claims data, provide longitudinality, which refers to patients' medical information over a period of time. Moreover, social media, credit cards, census records, and a variety of other types of data can help assemble a holistic view of a patient and shed light on social and environmental factors that may be influencing health.

The second technical obstacle in linking big biomedical data results from the lack of a national unique patient identifier (UPI) in the United States (Weber et al. 2014). To address the absence of a UPI and enable precise linkage, hospitals and clinics have developed sophisticated probabilistic linkage algorithms based on other information, such as demographics. By requiring enough variables to match, hospitals and clinics are able to reduce the risk of linkage errors to an acceptable level even when two different patients share the same characteristics (e.g., name, age, gender, zip code). In addition, the same techniques used to match patients across different EHRs can be extended to data sources outside of health care, which is an advantage of probabilistic linkage.

Third, besides the technical challenges, privacy and security concerns turn out to be a social
Biomedical Data 3
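The probabilistic linkage described above can be sketched in a few lines: score each candidate record pair by the weighted agreement of its demographic fields and accept only pairs above a decision threshold. A minimal Python sketch in the spirit of Fellegi-Sunter scoring; the field weights, records, and threshold are invented for illustration, and production systems additionally handle typos, missing values, and near matches:

```python
def match_score(rec_a, rec_b, weights):
    """Sum the weights of the demographic fields on which two
    patient records agree exactly."""
    return sum(w for field, w in weights.items()
               if rec_a.get(field) == rec_b.get(field))

def link(records_a, records_b, weights, threshold):
    """Pair each record in A with its best-scoring record in B,
    keeping only pairs at or above the decision threshold."""
    pairs = []
    for a in records_a:
        best = max(records_b, key=lambda b: match_score(a, b, weights))
        if match_score(a, best, weights) >= threshold:
            pairs.append((a["id"], best["id"]))
    return pairs

# Weights reflect how discriminating each identifier is (hypothetical).
weights = {"name": 4, "birth_year": 3, "zip": 2, "gender": 1}
ehr = [{"id": "A1", "name": "lee", "birth_year": 1970, "zip": "02115", "gender": "F"}]
claims = [
    {"id": "B1", "name": "lee", "birth_year": 1970, "zip": "02115", "gender": "F"},
    {"id": "B2", "name": "lee", "birth_year": 1988, "zip": "10001", "gender": "M"},
]
print(link(ehr, claims, weights, threshold=7))  # [('A1', 'B1')]
```

Raising the threshold trades missed links for fewer false links, which is exactly the linkage-error tradeoff the entry describes.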
increasingly assumes itself as a place for the open production of meaning and permanent negotiation, by providing comment functions, hypertext systems, and ranking and voting procedures through collective framing processes. This is the case of the sports app Runtastic, which monitors different sports activities using GPS, mobile devices, and sensor technology, and makes information such as distance, time, speed, and burned calories accessible and visible to friends and acquaintances in real time. The Eatery app is used for weight control and demands of its users the ability to self-optimize through self-tracking. Considering that health apps also aim to influence the attitudes of their users, they can additionally be understood as persuasive media of Health Governance. With their feedback technologies, the apps facilitate not only issues related to healthy lifestyles but also multiply the social control over compliance with health regulations in peer-to-peer networks. Taking into consideration the networking of information technology equipment, as well as the commercial availability of biometric tools (e.g., "Nike Fuel," "Fitbit," "iWatch") and infrastructure (apps), biosurveillance is frequently associated, in public debates, with dystopian ideas of a biometrically organized society of control.

Organizations and networks for health promotion, health information, and health education and training observed with great interest that, every day, millions of users worldwide search for information about health using the Google search engine. During the influenza season, the searches for flu increase considerably, and the frequency of certain search terms can provide good indicators of flu activity. Back in 2006, Eysenbach evaluated, in a study on "Infodemiology" or "Infoveillance," the Google AdSense click quotas, with which he analyzed the indicators of the spread of influenza and observed a positive correlation between increasing search engine entries and increased influenza activity. Further studies on the volume of search patterns have found that there is a significant correlation between the number of flu-related search queries and the number of people showing actual flu symptoms (Freyer-Dugas et al. 2012). This epidemiological correlation structure was subsequently extended to provide early warning of epidemics in cities, regions, and countries through Google Flu Trends, established in 2008 in collaboration with the US authority for the surveillance of epidemics (CDC). On the Google Flu Trends website, users can visualize the development of influenza activity both geographically and chronologically. Some studies criticize that the predictions of the Google project lie far above the actual numbers of flu cases.

Ginsberg et al. (2009) point out that in the case of an epidemic, it is not clear whether the search-engine behavior of the public remains constant and thus whether the significance of Google Flu Trends is secured or not. They refer to the medialized presence of the epidemic as a distorting cause, an "Epidemic of Fear" (Eysenbach 2006, p. 244), which can lead to miscalculations concerning the impending influenza activity. Subsequently, the prognostic reliability of the correlation between increasing search engine entries and increased influenza activity has been questioned. In recent publications on digital biosurveillance, communication processes in online networks are analyzed more intensively. Especially in the field of Twitter research (Paul and Dredze 2011), researchers have developed specific techniques and knowledge models for the study of future disease development and, backed by context-oriented sentiment analysis and social network analysis, hold out the prospect of a socially and culturally differentiated biosurveillance.

Further Readings

Albrechtslund, A. (2008). Online social networking as participatory surveillance. First Monday, 13(3). Online: http://firstmonday.org/ojs/index.php/fm/article/viewArticle/2142/1949
Brownstein, J. S., et al. (2009). Digital disease detection – Harnessing the web for public health surveillance. The New England Journal of Medicine, 360(21), 2153–2157.
Burkom, H. S., et al. (2008). Decisions in biosurveillance: Tradeoffs driving policy and research. Johns Hopkins Technical Digest, 27(4), 299–311.
Biosurveillance 3
Eysenbach, G. (2006). Infodemiology: Tracking flu-related Paul, M. J., & Dredze, P. (2011). You are what you Tweet:
searches on the Web for syndromic surveillance. In Analyzing Twitter for public health. In Proceedings of
AMIA Annual Symposium, Proceedings 8/2, 244–248. the Fifth International AAAI Conference on Weblogs
Freyer-Dugas, A., et al. (2012). Google Flu Trends: Cor- and Social Media. Online: www.aaai.org/ocs/index.
relation with emergency department influenza rates and php/ICWSM/ICWSM11/paper/.../3264
crowding metrics. Clinical Infectious Diseases, 54(15), Walters, R. A., et al. (2010). Data sources for
463–469. biosurveillance. In G. Voeller John (Ed.), Wiley hand-
Ginsberg, J., et al. (2009). Detecting influenza epidemics book of science and technology for homeland security
using search engine query data. In Nature. Interna- (Vol. 4, pp. 2431–2447). Hoboken: Wiley.
tional weekly journal of science (Vol. 457, pp.
1012–1014).
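The query-activity correlation at the heart of these studies can be sketched numerically. The weekly counts below are invented for illustration only (they are not data from any of the cited studies), and `pearson_r` is a plain implementation of the standard Pearson correlation coefficient.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented weekly data: flu-related search queries vs. reported flu cases.
queries = [120, 180, 260, 400, 650, 610, 430, 250]
cases = [40, 60, 95, 150, 240, 230, 160, 90]

r = pearson_r(queries, cases)
# A value of r close to 1 indicates the kind of strong positive
# correlation reported between query volume and influenza activity.
```

As the Ginsberg et al. critique notes, such a correlation computed on past seasons says nothing about whether search behavior stays stable during a medialized epidemic, which is exactly why the prognostic value of the correlation was questioned.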
Cancer

Cancer is an umbrella term that encompasses more than 100 unique diseases related to the uncontrolled growth of cells in the human body. Cancer is not completely understood by scientists, but it is generally accepted to be caused by both internal genetic factors and external environmental factors. The US National Cancer Institute describes cancer on a continuum, with points of significance that include prevention, early detection, diagnosis, treatment, survivorship, and end-of-life care. This continuum provides a framework for research priorities. Cancer prevention includes lifestyle interventions such as tobacco control, diet, physical activity, and immunization. Detection includes screening tests that identify atypical cells. Diagnosis and treatment involve informed decision making, the development of new treatments and diagnostic tests, and outcomes research. Finally, end-of-life care includes palliative treatment decisions and social support. Large data sets can be used to uncover patterns, view trends, and examine associations between variables. Searching, aggregating, and cross-referencing large data sets is beneficial at all points along this continuum.

Cancer Prevention and Early Detection

Epidemiology is the study of the causes and patterns of human diseases. Aggregated data allows epidemiologists to study why and how cancer forms. Researchers study the causes of cancer and ultimately make recommendations about how to prevent it. Data provides medical practitioners with information about populations at risk, which can facilitate proactive and preventive action. Data is used by expert groups, including the American Cancer Society and the United States Preventive Services Task Force, to write recommendations about screening for detection. Screening tests, including mammography and colonoscopy, have advantages and disadvantages. Evidence-based results from large representative samples can be used to recommend screening for those who will gain the largest benefit and sustain the fewest harms. Data can also be used to identify where public health education and resources should be disseminated.

At the individual level, aggregated information can guide lifestyle choices. With the help of technology, people have the ability to quickly and easily measure many aspects of their daily lives. Gary Wolf and Kevin Kelly coined the term "quantified self" for this rapid accumulation of personal data. Individual-level data can be collected through wearable devices, activity trackers, and smartphone applications, and the accumulated data is valuable for cancer prevention and early detection. Individuals can track their physical activity and diet over time. These wearable devices and applications also allow individuals to become involved in cancer research: individuals can play a direct role by contributing genetic data and information about their health. Health care providers and researchers can view genetic and activity data to understand the connections between health behaviors and outcomes.

Diagnosis and Treatment

Aggregated data collected over long periods of time has made a significant contribution to research on the diagnosis and treatment of cancer. The Human Genome Project, completed in 2003, was one of the first research endeavors to harness large data sets. Researchers have used information from the Human Genome Project to develop new medicines that can target genetic changes, or drivers, of cancer growth. The ability to sequence the DNA of large numbers of tumors has allowed researchers to model the genetic changes underlying certain cancers.

Genetic data is stored in biobanks, repositories in which samples of human DNA are kept for testing and analysis. Researchers draw from these samples and analyze genetic variation to observe differences in the genetic material of someone with a specific disease compared to a healthy individual. Biobanks are run by hospitals, research organizations, universities, and other medical centers. Many biobanks do not meet the needs of researchers due to an insufficient number of samples. The burgeoning ability to aggregate data across biobanks, within the United States and internationally, is invaluable and has the potential to lead to new discoveries.

Data is also being used to predict which medications may be good candidates to move forward into clinical research trials. Clinical trials are scientific studies designed to determine whether new treatments and diagnostic procedures are safe and effective. Margaret Mooney and Musa Mayer estimate that only 3% of adult cancer patients participate in clinical trials, so much of what is known about cancer treatment is based on data from this small segment of the larger population. Data from patients who do not participate in clinical trials exists, but it is unconnected, stored on paper and in electronic medical records. New techniques in big data aggregation have the potential to facilitate patient recruitment for clinical trials. Thousands of studies are in progress worldwide at any given time, and the traditional, manual process of matching patients with appropriate trials is both time consuming and inefficient. Big data approaches can allow for the integration of medical records and clinical trial data from across multiple organizations, and this aggregation can facilitate the identification of patients for inclusion in an appropriate clinical trial. Nicholas LaRusso writes that IBM's supercomputer Watson will soon be used to match cancer patients with clinical trials. Patient data can be mined for lifestyle and genetic factors, allowing faster identification of participants who meet inclusion criteria. Watson and other supercomputers can shorten the patient identification process considerably, matching patients in seconds. This has the potential to increase enrollment in clinical trials and ultimately advance cancer research.

Health care providers' access to large data sets can improve patient care. When making a diagnosis, providers can access information from patients exhibiting similar symptoms, lifestyle choices, and demographics to form more accurate conclusions. Aggregated data can also improve a patient's treatment plan and reduce the costs of conducting unnecessary tests. Knowing a patient's prognosis helps a provider decide how aggressively to treat cancer and what steps to take after treatment. If aggregate data from large and diverse groups of patients were available in a single database, providers would be better equipped to predict long-term outcomes for patients. Aggregate data can help providers select the best treatment plan for each patient, based on the experiences of similar patients, and can allow providers to uncover patterns to improve care. Providers can also compare their patient outcomes to those of their peers. Harlan Krumholz, a professor at the Yale School of Medicine, argued that the best way to study cancer is to learn from everyone who has cancer.

Survivorship and End-of-Life Care

Cancer survivors face physical, psychological, social, and financial difficulties after treatment and for the remaining years of their lives. As science advances, people are surviving cancer and living in remission. A comprehensive database on cancer survivorship could be used to develop, test, and maintain patient navigation systems to facilitate optimal care for cancer survivors.

Treating or curing cancer is not always possible. Health care providers typically base patient assessments on past experiences and the best data available for a given condition. Aggregate data can be used to create algorithms that model the severity of illness and predict outcomes. This can assist doctors and families who are making decisions about end-of-life care. Detailed information, based on a large number of cases, allows for more informed decision making. For example, if a provider is able to tell a patient's family with confidence that the patient is extremely unlikely to survive, even with radical treatment, this eases the discussion about palliative care.

Challenges and Limitations

First, researchers are limited by the data that is available. The data set will always be incomplete and will fail to cover the entire population. Data from diverse sources will vary in quality; self-reported survey data will appear alongside data from randomized clinical trials. Second, the major barrier to using big data for diagnosis and treatment is the task of integrating information from diverse sources. Allen Lichter explained that 1.6 million Americans are diagnosed with cancer every year, but in more than 95% of cases, details of their treatments are in paper medical records, file drawers, or electronic systems that are not connected to each other. Often, the systems in which useful information is currently stored cannot be easily integrated. The American Society of Clinical Oncology is working to overcome this barrier and has developed software that can accept information from multiple formats of electronic health records; a prototype system has collected 100,000 breast cancer records from 27 oncology groups. Third, traditional laboratory research is necessary to understand the context and meaning of the information that comes from the analysis of big data. Large data sets allow researchers to explore correlations, or relationships between variables of interest. Danah Boyd and Kate Crawford point out that data are often reduced to what can fit into a mathematical model; taken out of context, results lose meaning and value. The experimental designs of clinical trials will ultimately allow researchers to show causation and identify variables that cause cancer. Bigger data, in this case more data, is not always better. Fourth, patient privacy and security of information must be prioritized at all levels. Patients are, and will continue to be, concerned with how genetic and medical profiles are secured and who will have access to their personal information.

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_32-1
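The trial-matching step described in this entry amounts to filtering patient records against a trial's inclusion criteria. The sketch below is a toy illustration: the patient records, field names, and criteria are all invented, and real matching systems handle far richer (and messier) eligibility rules.

```python
# Toy sketch of matching patients to a clinical trial by inclusion
# criteria. All records and criteria here are hypothetical.
patients = [
    {"id": "p1", "age": 54, "diagnosis": "breast cancer"},
    {"id": "p2", "age": 71, "diagnosis": "breast cancer"},
    {"id": "p3", "age": 48, "diagnosis": "lung cancer"},
]

trial = {
    "diagnosis": "breast cancer",  # required diagnosis
    "max_age": 65,                 # upper age limit
}

def eligible(patient, trial):
    """Return True if the patient meets the trial's inclusion criteria."""
    return (patient["diagnosis"] == trial["diagnosis"]
            and patient["age"] <= trial["max_age"])

matches = [p["id"] for p in patients if eligible(p, trial)]
# Only p1 satisfies both criteria.
```

Run over millions of aggregated records and thousands of open trials, this same filter-and-match idea is what systems like the ones described above are meant to automate.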
service may be internal or external, and fiscal responsibility is shared between the organizations. Hybrid clouds are a grouping of two or more clouds (public, private, or community), where the cloud service comprises a combination that extends the capacity of the service through aggregation, integration, or customization with another cloud service. Sometimes a hybrid cloud is used on a temporary basis to meet short-term data needs that cannot be fulfilled by the private cloud. The ability to use a hybrid cloud lets the organization pay for extra resources only when they are needed, a fiscal incentive for organizations to adopt a hybrid cloud service.

The other aspect to consider when evaluating cloud services is the specific service model offered to the consumer or organization. Cloud computing offers three different levels of service: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). SaaS provides a specific application or service subscription for the customer (e.g., Dropbox, Salesforce.com, and QuickBooks). With SaaS, the service provider handles the installation, setup, and running of the application, with little to no customization. PaaS gives businesses an integrated platform on which they can create and deploy custom apps, databases, and line-of-business services (e.g., Microsoft Windows Azure, IBM Bluemix, Amazon Web Services (AWS) Elastic Beanstalk, Heroku, Force.com, Apache Stratos, Engine Yard, and Google App Engine). The PaaS model includes the operating system, programming-language execution environment, database, and web server, designed for a specific framework with a high level of customization. With Infrastructure as a Service (IaaS), businesses can purchase infrastructure from providers as virtual resources. Components include servers, memory, firewalls, and more, but the organization provides the operating system. IaaS providers include Amazon Elastic Compute Cloud (Amazon EC2), GoGrid, Joyent, AppNexus, Rackspace, and Google Compute Engine.

Once the correct cloud service configuration is determined, the next step is to match user needs with the correct service level. When looking at cloud services, it is important to examine four different aspects: application requirements, business expectations, capacity provisioning, and cloud information collection and processing. These four areas complicate the process of selecting a cloud service. First, application requirements refer to features such as data volume, data production rate, data transfer and updating, communication, and computing intensities. These factors are important because differences in them will affect the CPU (central processing unit), memory, storage, and network bandwidth for the user. Second, business expectations fluctuate depending on the applications and potential users, which, in turn, affect the cost; the pricing model depends on the level of service required (e.g., voicemail, a dedicated service, amount of storage required, additional software packages, and other custom services). Third, capacity provisioning is based on the concept that different IT technologies are employed according to need and that each technology has its own unique strengths and weaknesses; the downside for the consumer is the steep learning curve required. The final challenge is that consumers must invest a substantial amount of time to investigate individual websites, collect information about each cloud service offering, collate their findings, and apply their own assessments to determine their best match. If an organization has an internal IT department or employs an IT consultant, the decision is easier to make; for the individual consumer without an IT background, the choice may be considerably more difficult.

Cloud Safety and Security

For the consumer, two primary issues are relevant to cloud usage: a check-and-balance system on usage versus the service level purchased, and data safety. The on-demand computation model of cloud computing is processed through large virtual data centers (clouds), offering storage and computation for all types of cloud users, based on service level agreements. Although cloud services are relatively low cost, there is no way for consumers to know whether the services they are purchasing are equivalent to the service level purchased. Verifying that a consumer's usage matches the service level purchased is one concern, but the more serious concern for consumers is data safety. Because users do not have physical possession of their data, public cloud services are underutilized due to trust issues. Larger organizations use privately held clouds, but a company that does not have the resources to develop its own cloud service is still unlikely to use public cloud services, due to safety concerns. Currently, there is no global standardization of data encryption between cloud services, and experts have raised the concern that there is no way to be completely sure that data, once moved to the cloud, remains secure. With most cloud services, control of the encryption keys is retained by the cloud service, making your data vulnerable to a rogue employee or a governmental request to see your data.

The Electronic Frontier Foundation (EFF) is a privacy advocacy group that maintains a section on its website (Who Has Your Back) rating the largest Internet companies on their data protections. The EFF uses six criteria to rate the companies: requires a warrant for content, tells users about government data requests, publishes transparency reports, publishes law enforcement guidelines, fights for user privacy rights in courts, and fights for user privacy rights in Congress. Another consumer and corporate data protection project is the Tahoe Least Authority File System (Tahoe-LAFS), a free, open-source storage system created and developed by Zooko Wilcox-O'Hearn with the goal of data security and protection from hardware failure. The strength of this storage system is its encryption and integrity checks: data first goes through gateway servers and, after processing, is stored on a secondary set of servers that cannot read or modify the data.

Security for data storage via cloud services is a global concern, whether for individuals or organizations. From a legal perspective, there is a great deal of variance in how different countries and regions deal with security issues. Until there are universal rules or legislation specifically addressing data privacy, consumers must take responsibility for their own data. There are five strategies for keeping data secure in the cloud, beyond what the cloud services themselves offer. First, consider storing crucial information somewhere other than the cloud; for this type of information, local hardware storage may be a better solution than a cloud service. Second, when choosing a cloud service, take the time to read the user agreement, which should clearly delineate the parameters of the service level and so help with decision-making. Third, take creating passwords seriously. Often the easy route for passwords is familiar information such as dates of birth, hometowns, and pets' or children's names; with the advances in hardware and software designed specifically to crack passwords, it is particularly important to use robust, unique passwords for each of your accounts. Fourth, the best way to protect data is through encryption: use encryption software on a file before you move the file to the cloud, so that without the encryption password no one will be able to read the file's content. Fifth, when considering a cloud service, investigate its encryption services. Some cloud services encrypt and decrypt user files locally as well as provide storage and backup. Using this type of service ensures that data is encrypted before it is stored in the cloud and after it is downloaded from the cloud, providing the optimal safety net for consumer data.

Cloud Services

Cross-References

▶ Cloud
▶ Cloud Computing
▶ Cloud Safety
▶ Cloud Storage
▶ Computer Network Storage
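The data-safety concern raised above can be made concrete with a simple integrity check: record a cryptographic fingerprint of a file before upload and compare it after download, so that any change made while the data was out of your possession is detected. This sketch uses Python's standard hashlib; the "upload" and "download" are simulated in memory, and the file contents are invented.

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()

original = b"quarterly-report contents"
fingerprint = digest(original)          # recorded before upload

# Simulated round trip through a cloud provider.
downloaded = original                   # unmodified copy
tampered = original + b" (altered)"     # modified copy

intact = digest(downloaded) == fingerprint    # unchanged data matches
corrupted = digest(tampered) != fingerprint   # any change is detected
```

A check like this detects tampering but does not hide the data; for confidentiality, the file would additionally be encrypted client-side before upload, as the fourth strategy above recommends.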
draws from Communications' association with linguistics and modern languages. Natural language processing is an attempt to build communication into computers so they can understand and provide more sender-tailored messages to users.

The field of communication has also been outspoken about the promises of big data analytics as well as the ethics of big data use. Recognizing that the field is still early in its development, scholars point to the lifespans of other technologies and innovations as examples of how optimism early in a lifecycle often turns into critique. Pierre Levy is one Communications scholar who explains that although new datasets and technologies are viewed early in their trajectory as positive changes with big promises, as more is learned about their effects, scholars often begin to challenge their use and their ability to provide insight.

Communications scholars often refer to big data as the "datafication" of society: turning everyday interactions and experiences into quantifiable data that can be segmented and analyzed using broad techniques. This refers in particular to analyzing data that had not previously been viewed as data at all. Although this is partially where the value of big data comes from, for Communications researchers it complicates the ability to think holistically or qualitatively.

Specifically, big datasets in Communications research include information taken from social media sites, health records, media texts, political polls, and brokered language transcriptions. The wide variety of datasets reflects the truly broad nature of the discipline and its subfields.

Malcolm Parks offers suggestions on the future of big data research within the field of Communications. First, the field must situate big data research within larger theoretical contexts. One critique of the data revolution is the false identification of this form of analysis as new; rather than considering big data an entirely new phenomenon, situating it within the larger history of Communications theory allows more direct comparisons between past and present datasets. Second, the field requires more attention to the topic of validity in big data analysis. While quantitative and statistical measurements can support the reliability of a study, validity asks researchers to provide examples or other forms of support for their conclusions. This greatly challenges the ethical notions of anonymity in big data, as well as the consent process for individual protections. This is one avenue in which the quality of big data research needs more work within the field of Communications.

Communications asserts that big data is an important technological and methodological advancement within research; however, due to its newness, researchers need to exercise caution when considering its future. Specifically, researchers must focus on the ethics of inclusion in big datasets, along with the quality of analysis and the long-term effects of this type of dataset on society.

Further Readings

Burns, R. W. (2003). Communications: An international history of the formative years. New York: IEE History of Technology Series.
Levy, P. (1997). Collective intelligence: Mankind's emerging world in cyberspace. New York: Perseus Books.
Parks, M. R. (2014). Big data in communication research: Its contents and discontents. Journal of Communication, 64, 355–360.
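The "datafication" idea, re-expressing everyday interactions as quantifiable records that can be segmented and analyzed, can be sketched in a few lines. The interaction log below is invented for illustration.

```python
from collections import Counter

# Invented log of everyday platform interactions: (user, action) pairs.
interactions = [
    ("ana", "like"), ("ben", "share"), ("ana", "comment"),
    ("ana", "like"), ("cam", "like"), ("ben", "like"),
]

# Datafication: the same interactions re-expressed as counts that can
# be segmented by user or by action type.
per_user = Counter(user for user, _ in interactions)
per_action = Counter(action for _, action in interactions)
```

The critique above applies directly to this step: once interactions are reduced to counts like these, everything about their context is gone, which is why Communications scholars caution against treating such quantifications as complete accounts of social behavior.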
in the social sciences. Joshua M. Epstein developed, with Robert Axtell, the first large-scale agent-based computational model, which aims to explore the role of social experiences such as seasonal migrations, pollution, and transmission of disease.

As an instrument-based discipline, computational social science enables the observation and empirical study of phenomena through computational methods and quantitative datasets. Quantitative methods such as dynamical systems, artificial intelligence, network theory, social network analysis, data mining, agent-based modeling, computational content analysis, social simulation (macrosimulation and microsimulation), and statistical mechanics are often combined to study complex social systems.

Technological developments are constantly changing society: ways of communicating, behavioral patterns, the principles of social influence, and the formation and organization of groups and communities, enabling the emergence of self-organized movements. As technology-mediated behaviors and collectives are primary elements in the dynamics and design of social structures, computational approaches are critical to understanding the complex mechanisms behind many social phenomena in contemporary society. Big data can be used to understand many complex phenomena, as it offers new opportunities to work toward a quantitative understanding of our complex social systems. Technologically mediated social phenomena emerging over multiple scales are available in complex datasets. Twitter, Facebook, Google, and Wikipedia have shown that it is possible to relate, compare, and predict opinions, attitudes, social influences, and collective behaviors. Online and offline big data can provide insights into social phenomena such as the diffusion of information, polarization in politics, the formation of groups, and the evolution of networks.

Big data is dynamic, heterogeneous, and interrelated, but it is also often noisy and unreliable. Even so, big data may be more valuable to the social sciences than small samples, because the overall statistics obtained from frequent-pattern and correlation analysis often disclose hidden patterns and more reliable knowledge. Furthermore, when big data is connected, it forms large networks of heterogeneous information with data redundancy that can be exploited to compensate for the lack of data, to validate trust relationships, to disclose inherent groups, and to discover hidden patterns and models. Several methodologies and applications in the context of modern social science datasets allow scientists to understand and study different social phenomena, from political decisions to the reactions of economic markets to the interactions of individuals and the emergence of self-organized global movements.

Trillions of bytes of data can be captured by instruments or generated by simulation. Through better analysis of these large volumes of data, there is the potential to make further advances in many scientific disciplines, improve social knowledge, and increase the success of many companies. More than ever, science is now a collaborative activity. Computational systems and techniques have created new ways of collecting, crossing, and interconnecting data. The analysis of big data is now at the disposal of the social sciences, allowing the study of cases at macro- and microscales in connection with other scientific fields.

Cross-References

▶ Computer Science
▶ Data Visualization
▶ Network Analytics
▶ Network Data
▶ Physics
▶ Social Network Analysis (SNA)
▶ Sociology
▶ Visualization

Computational Social Sciences

Further Readings

Bainbridge, W. S. (2007). Computational sociology. In The Blackwell Encyclopedia of Sociology. Malden, MA: Blackwell Publishing.
Bankes, S., Lempert, R., & Popper, S. (2002). Making computational social science effective: Epistemology, methodology, and technology. Social Science Computer Review, 20(4), 377–388.
Cioffi-Revilla, C. (2010). Computational social science. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 259–271.
Conte, R., et al. (2012). Manifesto of computational social science. The European Physical Journal Special Topics, 214(1), 325–346.
Lazer, D., et al. (2009). Computational social science. Science, 323(5915), 721–723.
Miller, J. H., & Page, S. E. (2009). Complex adaptive systems: An introduction to computational models of social life. Princeton: Princeton University Press.
Oboler, A., et al. (2012). The danger of big data: Social media as computational social science. First Monday, 17(7). Retrieved from http://firstmonday.org/article/view/3993/3269/
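Social network analysis, one of the quantitative methods listed in this entry, can be illustrated with a minimal degree-centrality computation. The friendship graph below is invented for illustration; degree centrality is a standard measure defined as a node's degree divided by the maximum possible degree.

```python
# Invented undirected friendship graph, as an adjacency list.
graph = {
    "ana": {"ben", "cam", "dee"},
    "ben": {"ana", "cam"},
    "cam": {"ana", "ben"},
    "dee": {"ana"},
}

def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(neigh) / (n - 1) for node, neigh in graph.items()}

centrality = degree_centrality(graph)
# "ana" is connected to every other node, so her centrality is 1.0.
```

On real platform-scale data, measures like this (and richer ones, such as betweenness or eigenvector centrality) are what allow the study of influence, group formation, and the evolution of networks described above.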
volunteers and was typically based on the enforce- These media firms began to employ a variety of
ment of local rules of engagement around com- techniques to combat what they viewed as the
munity norms and user behavior. Moderation misappropriation of the comments spaces, using
practices and style therefore developed locally among communities and their participants and could inform the flavor of a given community, from the highly rule-bound to the anarchic: the Bay Area-based online community the WELL famously banned only three users in its first 6 years of existence, and then only temporarily (Turner 2005, p. 499).

In social communities on the early text-based Internet, mechanisms to enact moderation were often direct and visible to the user and could include demanding that a user alter a contribution to eliminate offensive or insulting material, the deletion or removal of posts, the banning of users (by username or IP address), the use of text filters to disallow posting of specific types of words or content, and other overt moderation actions. Examples of sites of this sort of content moderation include many Usenet groups, BBSes, MUDs, listservs, and various early commercial services.

Motives for people participating in voluntary moderation activities varied. In some cases, users carried out content moderation duties for prestige, status, or altruistic purposes (i.e., for the betterment of the community); in others, moderators received non-monetary compensation, such as free or reduced-fee access to online services, e.g., AOL (Postigo 2003). The voluntary model of content moderation persists today in many online communities and platforms; one high-profile site where volunteer content moderation is used exclusively to control site content is Wikipedia.

As the Internet has grown into large-scale adoption and a massive economic engine, the desire of major mainstream platforms to control the UGC that they host and disseminate has also grown exponentially. Early in the proliferation of so-called Web 2.0 sites, newspapers and other news media outlets, in particular, began noticing a significant problem with their online comments areas, which often devolved into unreadable spaces filled with invective, racist and sexist diatribes, name-calling, and irrelevant postings. Outlets responded by hiring in-house moderators, turning to firms that specialized in the large-scale management of such interactive areas, and deploying technological interventions such as word filter lists or disallowing anonymous posting to bring the comments sections under control. Some media outlets went the opposite way, preferring instead to close their comments sections altogether.

Commercial Content Moderation and the Contemporary Social Media Landscape

The battle with text-based comments was just the beginning of a much larger issue. The rise of Friendster, MySpace, and other social media applications in the early part of the twenty-first century has given way to more persistent social media platforms of enormous scale and reach. As of the second quarter of 2016, Facebook alone approached two billion users worldwide, all of whom generate content by virtue of their participation on the platform. YouTube reported receiving upwards of 100 hours of UGC video per minute as of 2014.

The contemporary social media landscape is therefore characterized by vast amounts of UGC uploads made by billions of users to massively popular commercial Internet sites and social media platforms with a global reach. Mainstream platforms, often owned by publicly traded firms responsible to shareholders, simply cannot afford the risk – legal, financial, and reputational – that unchecked UGC could cause. Yet contending with the staggering amounts of data transmitted from users to platforms is not a task that can currently be addressed reliably and at large scale by computers. Indeed, making nuanced decisions about what UGC is acceptable and what is not currently exceeds the abilities of machine-driven processes, save for the application of some algorithmically informed filters or bit-for-bit or hash value matching, which occur at relatively low levels of computational complexity.
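The low-complexity machine-driven checks named above, word filter lists and bit-for-bit or hash value matching, amount to simple lookups. The sketch below is illustrative only: the filter terms and the blocked sample are invented, and production systems operate at vastly larger scale.

```python
import hashlib

# Hypothetical word filter list; real platforms maintain far larger,
# frequently updated lists.
BANNED_TERMS = {"bannedword1", "bannedword2"}

# Digests of files previously judged unacceptable; hash matching only
# catches exact (bit-for-bit) re-uploads of known material.
BLOCKED_DIGESTS = {hashlib.sha256(b"previously removed file bytes").hexdigest()}

def passes_word_filter(text: str) -> bool:
    """Reject text containing any banned term (case-insensitive)."""
    return set(text.lower().split()).isdisjoint(BANNED_TERMS)

def passes_hash_check(payload: bytes) -> bool:
    """Reject uploads whose SHA-256 digest matches a known-bad file."""
    return hashlib.sha256(payload).hexdigest() not in BLOCKED_DIGESTS

def auto_screen(text: str, payload: bytes) -> str:
    """Cheap automated screening; nuanced judgments still go to humans."""
    if not (passes_word_filter(text) and passes_hash_check(payload)):
        return "rejected"
    return "needs_human_review"
```

Anything that survives these cheap lookups still requires nuanced, context-sensitive judgment, which is precisely the gap that human moderators fill.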
The need for adjudication of UGC – video- and image-based content, in particular – therefore calls on human actors who rely upon their own linguistic and cultural knowledge and competencies to make decisions about UGC's appropriateness for a given site or platform. Specifically, "they must be experts in matters of taste of the site's presumed audience, have cultural knowledge about location of origin of the platform and of the audience (both of which may be very far removed, geographically and culturally, from where the screening is taking place), have linguistic competency in the language of the UGC (that may be a learned or second language for the content moderator), be steeped in the relevant laws governing the site's location of origin and be experts in the user guidelines and other platform-level specifics concerning what is and is not allowed" (Roberts 2016). These human workers make up the legions of commercial content moderators: moderators who work in an organized way, for pay, on behalf of the world's largest social media firms, apps, and websites that solicit UGC.

CCM processes may take place before material is submitted for inclusion or distribution on a site, or after material has already been uploaded, particularly on high-volume sites. Content moderation may be triggered by complaints about material from site moderators or other site administrators, from external parties (e.g., companies alleging misappropriation of material they own, law enforcement, or government actors), or from other users who are disturbed or concerned by what they have seen and then invoke protocols or mechanisms on a site, such as the "flagging" of content, to prompt a review by moderators (Crawford and Gillespie 2016). In this regard, moderation practices are often uneven, and the removal of UGC may reasonably be likened to censorship, particularly when it is undertaken in order to suppress speech, political opinions, or other expressions that threaten the status quo.

CCM workers are called upon to match and adjudicate volumes of content, typically at rapid speed, against the specific rules or community guidelines of the platform for which they labor. They must also be aware of the laws and statutes that may govern the geographic or national location from where the content emanates, for which the content is destined, and where the platform or site is located – all of which may be distinct places in the world. They must be aware of the platform's tolerance for risk, as well as the platform's expectations for whether or how CCM workers should make their presence known.

In many cases, CCM workers may work at organizational arm's length from the platforms they moderate. Some labor arrangements in CCM have workers located at great distances from the headquarters of the platforms for which they are responsible, in places such as the Philippines and India. The workers may be structurally removed from those firms as well, via outsourcing companies who take on CCM contracts and then hire the workers under their auspices, in call center (often called BPO, or business process outsourcing) environments. Such outsourcing firms may also recruit CCM workers using digital piecework sites such as Amazon Mechanical Turk or Upwork, in which the relationships between the social media firm, the outsourcing company, and the CCM worker can be as ephemeral as one review.

Even when CCM workers are located on-site at a headquarters of a social media firm, they often are brought on as contract laborers and are not afforded the full status, or pay, of regular full-time employees. In this regard, CCM work, wherever it takes place in the world and by whatever name, often shares the characteristic of being relatively low wage and low status as compared to other jobs in tech. These arrangements of institutional and geographic removal can pose a risk for workers, who can be exposed to disturbing and shocking material as a condition of their CCM work, but can be a benefit to the social media firms who require their labor, as the firms can distance themselves from the impact of the CCM work on the workers. Further, the working conditions, practices, and existence of CCM workers in social media are little known to the general public, a fact that is often by design. CCM workers are frequently compelled to sign NDAs, or
nondisclosure agreements, that preclude them from discussing the work that they do or the conditions in which they do it. While social media firms often gesture at the need to maintain secrecy surrounding the exact nature of their moderation practices and the mechanisms they use to undertake them, claiming that users could game the system and beat the rules if armed with such knowledge, the net result is that CCM workers labor in secret. The conditions of their work – its pace, the nature of the content they screen, the volume of material to be reviewed, and the secrecy – can lead to feelings of isolation, burnout, and depression among some CCM workers. Such feelings can be compounded by the fact that few people know such work exists, assuming, if they think of it at all, that algorithmically driven computer programs take care of social media's moderation needs. It is a misconception that the industry has been slow to correct.

Conclusion

Despite claims and conventional wisdom to the contrary, content moderation has likely always existed in some form on the social Internet. As the Internet's many social media platforms grow and their financial, political, and social stakes increase, the undertaking of organized control of user expression through such practices as CCM will likewise only increase. Nevertheless, CCM remains a little discussed and little acknowledged aspect of the social media production chain, despite its mission-critical status in almost every case in which it is employed. The existence of a globalized CCM workforce abuts many difficult, existential questions about the nature of the Internet itself and the principles long thought to undergird it, particularly the free expression and circulation of material, thought, and ideas. These questions are further complicated by pressures related to contested notions of jurisdiction, borders, and the application and enforcement of laws, social norms, and mores that frequently vary and often conflict with each other. Acknowledgment and understanding of the history of content moderation and the contemporary reality of large-scale CCM are central to many of these core questions of what the Internet has been, is now, and will be in the future; yet the continued invisibility and lack of acknowledgment of CCM workers by the firms for which their labor is essential means that such questions cannot fully be addressed. Nevertheless, discussions of moderation practices and the people who undertake them are essential to the end of more robust, nuanced understandings of the state of the contemporary Internet and to better policy and governance based on those understandings.

Cross-References

▶ Algorithm
▶ Facebook
▶ Internet
▶ Social Media
▶ Wikipedia
▶ YouTube

Further Readings

Crawford, K., & Gillespie, T. (2016). What is a flag for? Social media reporting tools and the vocabulary of complaint. New Media & Society, 18(3), 410–428.

Galloway, A. R. (2006). Protocol: How control exists after decentralization. Cambridge, MA: MIT Press.

Postigo, H. (2003). Emerging sources of labor on the internet: The case of America Online volunteers. International Review of Social History, 48(S11), 205–223.

Roberts, S. T. (2016). Commercial content moderation: Digital laborers' dirty work. In S. U. Noble & B. Tynes (Eds.), The intersectional internet: Race, sex, class and culture online (pp. 147–160). New York: Peter Lang.

Turner, F. (2005). Where the counterculture met the new economy: The WELL and the origins of virtual community. Technology and Culture, 46(3), 485–512.
Some tasks depend on other tasks for completion, while others stand alone. Some tasks require but a few seconds, while others demand more time and mental energy. More specifically, tasks might include finding and managing information, analyzing information, solving problems, and producing content. With big data, crowds may enter, clean, and validate data. The crowds may even collect data, particularly geospatial data, which prove useful for search and rescue, land management, disaster response, and traffic management. Other tasks might include transcription of audio or visual data and tagging.

When bringing crowdsourcing to big data, the crowd offers skills involving judgment, context, and visuals – skills that exceed computational models. In terms of judgment, people can determine the relevance of items that appear within a data set, identify similarities among items, or fill in holes within the set. In terms of context, people can identify the situations surrounding the data and how those situations influence them. For example, a person can determine the difference between the Statue of Liberty on Liberty Island in New York and the replica on The Strip in Las Vegas. The context then allows determination of accuracy or ranking, such as in this case differentiating the original from the replica. People also can determine more in-depth relationships among data within a set. For example, people can better decide the accuracy of matches between search engine terms and results, better determine the top search result, or even predict other people's preferences.

Properly managed crowdsourcing begins within an organization that has clear goals for its big data. These organizations can include governments, corporations, and nonprofit organizations. Their goals can include improving business practices, increasing innovation, decreasing project completion times, developing issue awareness, and solving social problems. These goals frequently involve partnerships across multiple entities, such as governments or corporations partnering with not-for-profit initiatives.

At the federal level, and managed through the Massachusetts Institute of Technology's Center for Collective Intelligence, Climate CoLab brings together crowds to analyze issues related to global climate change, registering more than 14,000 members who participate in a range of contests. Within the contests, members create and refine proposals that offer climate change solutions. The proposals then are evaluated by the community and, through voting, recommended for implementation. Contest winners presented their proposals at a conference to those who might implement them. Some contests build their initiatives on big data, such as Smart Mobility, which relies on mobile data for tracking transportation and traveling patterns in order to suggest ways for people to reduce their environmental impacts while still getting where they want to go.

Another government example comes from the city of Boston, where a mobile app called Street Bump tracks and maps potential potholes throughout the city in order to guide crews toward fixing them. The crowdsourcing for this initiative operates on two levels. First, the information gathered from the app helps city crews do their work more efficiently. Second, the app's first iteration reported too many false positives, leading crews to places where no potholes existed; the city worked with a crowd drawn together through InnoCentive to improve the app and its efficiency, with the top suggestions coming from a hacker group, a mathematician, and a software engineer.

Corporations also use crowdsourcing to work with their big data. AOL needed help cataloging the content on its hundreds of thousands of web pages, specifically the videos and their sources, and turned to crowdsourcing as a means to expedite the project and streamline its costs. Between 2006 and 2010, Netflix, an online streaming and mail DVD distributor, sought help with perfecting its algorithm for predicting user ratings of films. The company developed a contest with a $1 million prize, and for the contest, it offered data sets consisting of many millions of units for analysis. The goal was to beat Netflix's current algorithm by 10%, which one group achieved, taking home the prize.

Not-for-profit groups also incorporate crowdsourcing as part of their initiatives. AARP Foundation, which works on behalf of older Americans, used crowdsourcing to tackle such
issues as eliminating food insecurity and food deserts (areas where people do not have convenient or close access to grocery stores). Humanitarian Tracker crowdsources data from people "on the ground" about issues such as disease, human rights violations, and rape. Focusing particularly on Syria, Humanitarian Tracker aggregates these data into maps that show the impacts of systematic killings, civilian targeting, and other human tolls.

Not all crowdsourcing and big data projects originate within these organizations. For example, Galaxy Zoo demonstrates the expanses of both big data and crowds. The project asked people to classify a data set of one million galaxies into three categories: elliptical, merger, and spiral. By the project's completion, 150,000 people had contributed 50 million classifications. The data feature multiple independent classifications as well, adding reliability. The largest crowdsourcing project involved searching satellite images for wreckage from Malaysia Airlines flight MH370, which went missing in March 2014. Millions of people searched for signs among the images made available by Colorado-based DigitalGlobe; the amount of crowdsourcing traffic even crashed websites.

Not all big data crowdsourced projects succeed, however. One example is Google Flu Trends, which included a map to show the disease's spread throughout the season. It was later revealed that the tracker overestimated the expanse of the flu's spread, predicting twice as much as actually occurred.

Beyond the potential for failure, another drawback to these projects is their overall management, which tends to be time-consuming and difficult. Several companies attempt to fill this role. InnoCentive and Kaggle use crowds to tackle challenges brought to them by industries, governments, and nonprofit organizations. Kaggle in particular offers almost 150,000 data scientists – statisticians – to help companies develop more efficient predictive models, such as deciding the best order in which to show hotel rooms for a travel app or guessing which customers would leave an insurance company within a year. Both InnoCentive and Kaggle run their crowdsourcing activities as contests or competitions, as these are often tasks that require a higher time and mental commitment than others.

Crowds bring wisdom to crowdsourced tasks on big data through their diversity of skills and knowledge. Determining the makeup of that crowd proves more challenging, but one study of Mechanical Turk offers some interesting findings. It found that US females outnumber males by 2 to 1 and that many of the workers hold bachelor's and even master's degrees. Most live in small households of two or fewer people, and most use the crowdsourcing work to supplement their household incomes rather than as the primary source of income.

Crowd members choose the projects on which they want to work, and multiple factors contribute to their motivations for joining a project and staying with it. For those working on projects that offer no further incentive to participate, the project needs to align with their interests and experience so that they feel they can make a contribution. Others enjoy connecting with other people, engaging in problem-solving activities, seeking something new, learning more about the data at hand, or even developing a new skill. Some projects offer incentives such as prize money or top-contributor status. For some, entertainment motivates participation, in that the tasks offer a diversion. For others, though, working on crowdsourced projects can become an addiction as well.

While crowdsourcing offers multiple benefits for the processing of big data, it also draws criticism. A primary critique centers on the notion of labor, wherein the crowd contributes knowledge and skills for little to no pay, while the organization behind the data stands to gain much more financially. Some crowdsourcing sites offer low cash incentives for the crowd participants, and in doing so, they sidestep labor laws requiring minimum wage and other worker benefits. Opponents of this critique counter that the labor involved frequently requires menial tasks and that the laborers face no obligation to complete the assigned tasks. They also note that crowd participants engage in the tasks because they enjoy doing so.

Ethical concerns come back to the types of crowdsourced big data projects and the intentions behind them, such as information gathering,
Curriculum, Higher Education, and Social Sciences

Stephen T. Schroth
Department of Early Childhood Education, Towson University, Baltimore, MD, USA

however, the use of big data by social sciences departments at colleges and universities seems likely to increase.

Background

and using the numerous sources of information in ways that could benefit organizations and individuals. Infonomics, the study of how information can be used for economic gain, grew in importance as companies and organizations worked to make better use of the information they possessed, with the end goal being to use it in ways that increased profitability. A variety of consulting firms and other organizations began working with large corporations and organizations in an effort to accomplish this. They defined big data as consisting of three "V"s: volume, variety, and velocity.

Volume, as used in this context, refers to the increase in data volume caused by technological innovation. This includes transaction-based data that has been gathered by corporations and organizations over time but also includes unstructured data that derives from social media and other sources, as well as increasing amounts of sensor and machine-to-machine data. For years, excessive data volume was a storage issue, as the cost of keeping much of this information was prohibitive. As storage costs have decreased, however, cost has diminished as a concern. Today, how best to determine relevance within large volumes of data and how best to analyze data to create value have emerged as the primary issues facing those wishing to use it.

Velocity refers to data streaming in at great speed, which raises the issue of how best to deal with it in an appropriate way. Technological developments, such as sensors and smart meters, and client and patient needs emphasize the necessity of overseeing and handling inundations of data in near real time. Responding to data velocity in a timely manner represents an ongoing struggle for most corporations and other organizations.

Variety in the formats in which data come to organizations today presents a problem for many. Data today include structured numeric forms stored in traditional databases but have grown to include information created from business applications, e-mails, text documents, audio, video, financial transactions, and a host of others. Many corporations and organizations struggle with governing, managing, and merging different forms of data.

Some have added two additional criteria to these: variability and complexity. Variability concerns the potential inconsistency that data can demonstrate at times, which can be problematic for those who analyze the data. Variability can hamper the process of managing and handling the data. Complexity refers to the intricate process that data management involves, in particular when large volumes of data come from multiple and disparate sources. For analysts and other users to fully understand the information contained in these data, the data must first be connected, correlated, and linked in a way that helps users make sense of them.

Big Data Comes to the Social Sciences

Colleges, universities, and other research centers have tracked the efforts of the business world to use big data in a way that helped to shape organizational decisions and increase profitability. Many working in the social sciences were intrigued by this process, as they saw it as a useful tool for their own research. The typical program in these areas, however, did not provide students, be they at the undergraduate or graduate level, the training necessary to engage in big data research projects. As a result, many programs in the social sciences have altered their curricula in an effort to ensure that researchers will be able to carry out such work. For many programs across the social sciences that have pursued curricular changes to enable students to engage in big data research, these changes have resulted in more coursework in statistics, networking, programming, analytics, database management, and other related areas. As many programs already required a substantial number of courses in other areas, the drive toward big data competency has required many departments to reexamine the work required of their students.

This move toward more coursework that supports big data has not been without its critics. Some have suggested that changes in curricular offerings have come at a high cost, with students now being able to perform certain operations involved with handling data but unable to
competently perform other tasks, such as establishing a representative sample or composing a valid survey. These critics also suggest that while big data analysis has been praised for offering tremendous promise, in truth the analysis performed is shallow, especially when compared to that done with smaller data sets. Indeed, representative sampling would negate the need for, and expense of, many big data projects. Such critics suggest that increased emphasis in the curriculum should focus on finding quality, rather than big, data sources and that efforts to train students to extract, transform, and load data are sublimating other more important skills.

Despite these criticisms, changes to the social sciences curriculum are occurring at many institutions. Many programs now require students to engage in work that examines the practices and paradigms of data science, which provides students with a grounding in the core concepts of data science, analytics, and data management. Work in algorithms and modeling, which provides proficiency in basic statistics, classification, cluster analysis, data mining, decision trees, experimental design, forecasting, linear algebra, linear and logistic regression, market basket analysis, predictive modeling, sampling, text analytics, summarization, time series analysis, unsupervised learning, and constrained optimization, is also an area of emphasis in many programs. Students require exposure to tools and platforms, which provides proficiency in the modeling, development, and visualization tools to be used on big data projects, as well as knowledge about the platforms used for execution, governance, integration, and storage of big data. Finally, many programs stress work with applications and outcomes, which addresses the primary applications of data science to one's field and how it interacts with disciplinary issues and concerns.

Some programs have embraced big data tools but suggested that not every student needs mastery of them. Instead, these programs have suggested that big data has emerged as a field of its own and that certain students should be trained in these skills so that they can work with others within the discipline to provide support for those projects that require big data analysis. This approach offers more incremental changes to social science curricular offerings, as it would require fewer changes for most students yet still enable departments to produce scholars who are equipped to engage in research projects involving big data.

Cross-References

▶ Big Data Quality
▶ Correlation vs. Causation
▶ Curriculum, Higher Education, Business
▶ Curriculum, Higher Education, Humanities
▶ Education
▶ Public Administration/Government
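The critics' point about representative sampling is statistical at heart: for many descriptive questions, a modest random sample estimates a population value nearly as well as a full scan of the data. A minimal sketch with synthetic data (the population size, sample size, and distribution are arbitrary choices for illustration):

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

# Synthetic "big data": one million observations
population = [random.gauss(50.0, 10.0) for _ in range(1_000_000)]
full_scan_mean = sum(population) / len(population)

# A simple random sample of 1,000 observations
sample = random.sample(population, k=1_000)
sample_mean = sum(sample) / len(sample)

# With sigma = 10 and n = 1,000, the standard error is about 0.32,
# so the sample estimate should land well within one unit of the
# full-scan answer.
assert abs(full_scan_mean - sample_mean) < 1.5
```

The sample uses a thousandth of the data yet recovers the mean to within a fraction of a unit, which is the critics' argument that careful sampling can often substitute for the cost of a full big data pipeline.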
data science, the idea of what constitutes data science remains nebulous.

Controversy in Defining the Field

According to Provost and Fawcett, one reason why data science is difficult to define relates to its conceptual overlap with big data and data-driven decision making. Data-driven decision making represents an approach characterized by the use of insights gleaned through data analysis for deciding on a course of action. This form of decision making may also incorporate varying amounts of intuition but does not rely solely on it for moving forward. For example, a marketing manager faced with a decision about how much promotional effort should be invested in a particular product has the option of relying solely on intuition and past experience, or using a combination of intuition and knowledge gained from data analysis. The latter represents the basis for data-driven decision making. At times, however, data science not only enables data-driven decision making but also overlaps with it. The case of automated online recommendations of products based on user ratings, preferences, and past consumer behavior is an example of where the distinction between data science and data-driven decision making is less clear.

Similarly, differentiating between the concepts of big data and data science becomes murky when considering that approaches used for processing big data overlap with the techniques and principles used to extract knowledge and espoused by data science. This conceptual intersection exists where big data technologies meet data mining techniques. For example, technologies such as Apache Hadoop, which are designed to store and process large-scale data, can also be used to support a variety of data science efforts related to solving business problems, such as fraud detection, and social problems, such as unemployment reduction. As the technologies associated with big data are also often used to apply and bolster approaches to data mining, the boundary between where big data ends and data science begins continues to be imprecise.

Another source of confusion in defining data science stems from the absence of formalized academic programs in higher education. The lack of these programs exists in part due to challenges in launching novel programs that cross disciplines and the natural pace at which such programs are implemented within the academic environment. Although several institutions within higher education now recognize the importance of this emerging field and the need to develop programs that fulfill industry's need for practitioners of data science, the result up to now has been to leave the task of defining the field to data scientists.

Data scientists currently occupy an enviable position as among the most coveted employees for twenty-first-century hiring, according to Davenport and Patil. They describe data scientists as professionals, usually of senior-level status, who are driven by curiosity and guided by creativity and training to prepare and process big data. Their efforts are geared toward uncovering findings that solve problems in both the private and public sectors. As businesses and organizations accumulate greater volumes of data at faster speeds, Davenport and Patil predict the need for data scientists will continue on a steep upward trajectory.

Opportunities in Data Science

Several sectors stand to gain from the explosion in big data and the acquisition of data scientists to analyze and extract insights from it. Chen, Chiang, and Storey note the opportunities inherent in data science for various areas. Beginning with e-commerce and the collection of market intelligence, Chen and colleagues focus on the development of product recommendation systems by e-commerce vendors such as Amazon that are built from consumer-generated data. These product recommendation systems allow for real-time access to consumer opinion and behavior data in record quantities. New data analytic techniques to
harness consumer opinions and sentiments have accompanied these systems, which can help businesses become better able to adjust and adapt quickly to the needs of consumers. Similarly, in the realm of e-government and politics, a multitude of data science opportunities exist for increasing the likelihood of achieving a range of desirable outcomes, including political campaign effectiveness, political participation among voters, and support for government transparency and accountability. Data science methods used to achieve these goals include opinion mining, social network analysis, and social media analytics.

Public safety and security represents another area that Chen and colleagues observe has prospects for implementing data science. Security remains an important issue for businesses and organizations in a post-September 11, 2001, era. Data science offers unique opportunities to provide additional protections in the form of security informatics against terrorist threats to transportation and key pieces of infrastructure (including cyberspace). Security informatics uses a three-pronged approach coordinating organizational, technological, and policy-related efforts to develop data techniques designed to promote international and domestic security. The use of data science techniques such as crime data mining, criminal network analysis, and advanced multilingual social media analytics can be instrumental in preventing attacks as well as pinpointing the whereabouts of suspected terrorists.

Another sector flourishing with the rise of data science is science and technology (S&T). Chen and colleagues note that several areas within S&T, such as astrophysics, oceanography, and genomics, regularly collect data through sensor systems and instruments. The result has been an abundance of data in need of analysis, and the recognition that information sharing and data analytics must be supported. In response, the National Science Foundation (NSF) now requires the submission of a data management plan with every funded project. Data-sharing initiatives such as the 2012 NSF Big Data program are examples of government endeavors to advance big data analytics for science and technology research. The iPlant Collaborative represents another NSF-funded initiative that relies on cyberinfrastructure to instill skills related to computational techniques that address evolving complexities within the field of plant biology among emerging biologists.

The health field is also flush with opportunities for advances using data science. According to Chen and colleagues, opportunities for this field are rising in the form of massive amounts of health- and healthcare-related data. In addition to data collected from patients, data are also generated through advanced medical tools and instrumentation, as well as online communities formed around health-related topics and issues. Big data within the health field is primarily comprised of genomics-based data and payer-provider data. Genomics-based data encompasses genetic-related information such as DNA sequencing. Payer-provider data comprises information collected as part of encounters or exchanges between patients and the healthcare system, and includes electronic health records and patient feedback. Despite these opportunities, Miller notes that application of data science techniques to health data remains behind that of other sectors, in part due to a lack of initiatives that leverage scalable analytical methods and computational platforms. In addition, research and ethical considerations surrounding privacy and the protection of patients' rights in the use of big data present some challenges to full utilization of existing health data.

Challenges to Data Science

Despite the enthusiasm for data science and the potential application of its techniques for solving important real-world problems, there are some challenges to full implementation of tools from this emerging field. Finding individuals with the right training and combination of skills to become data scientists represents one challenge. Davenport and Patil discuss the shortage of data scientists as a case in which demand has grossly
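The entry's examples of big data platforms supporting data science, such as Hadoop-style processing applied to fraud detection, can be illustrated with a minimal sketch. This is a toy in plain Python rather than Hadoop itself: the transaction records, account IDs, and flagging threshold are all hypothetical, and a real deployment would distribute the map, shuffle, and reduce phases across a cluster.

```python
from collections import defaultdict

# Hypothetical transaction records: (account_id, amount).
transactions = [
    ("acct-1", 120.0), ("acct-2", 40.0), ("acct-1", 950.0),
    ("acct-3", 15.0), ("acct-2", 60.0), ("acct-1", 30.0),
]

# "Map" phase: emit one (key, value) pair per record.
mapped = [(acct, amount) for acct, amount in transactions]

# "Shuffle" phase: group values by key.
grouped = defaultdict(list)
for acct, amount in mapped:
    grouped[acct].append(amount)

# "Reduce" phase: aggregate each group; here, total spend per account.
totals = {acct: sum(amounts) for acct, amounts in grouped.items()}

# A naive screening rule: flag accounts whose total exceeds a threshold.
FLAG_THRESHOLD = 1000.0
flagged = sorted(acct for acct, total in totals.items() if total > FLAG_THRESHOLD)

print(totals)
print(flagged)  # ['acct-1']
```

The point of the sketch is the shape of the computation, not the rule itself: because each phase works on independent (key, value) groups, the same logic scales from a list in memory to records spread across many machines.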
Industrial and Commercial Bank of China

Jing Wang¹ and Aram Sinnreich²
¹School of Communication and Information, Rutgers University, New Brunswick, NJ, USA
²School of Communication, American University, Washington, DC, USA

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data, DOI 10.1007/978-3-319-32001-4_113-1

The Industrial and Commercial Bank of China (ICBC)

The Industrial and Commercial Bank of China (ICBC) was the first state-owned commercial bank of the People's Republic of China (PRC). It was founded on January 1st, 1984, and is headquartered in Beijing. In line with Deng Xiaoping's economic reform policies launched in the late 1970s, the State Council (the chief administrative authority of China) decided to relay all the financial businesses related to the industrial and commercial sectors from the central bank (People's Bank of China) to ICBC (China Industrial Map Committee 2016). This decision, made in September 1983, is considered a landmark event in the evolution of China's increasingly specialized banking system (Fu and Hefferman 2009). While the government retains control over ICBC, the bank began to take on public shareholders in October 2006. As of May 2016, ICBC was ranked as the world's largest public company by the Forbes "Global 2000" (Forbes Ranking 2016). With its combination of state and private ownership, state governance, and commercial dealings, ICBC serves as a perfect case study to examine the transformation of China's financial industry.

Big data collection and database construction are fundamental to ICBC's management strategies. Beginning in the late 1990s, ICBC paid unprecedented attention to the application of information technology (IT) in its daily operations. Several branches adopted computerized input and internet communication of transactions, which had previously relied upon manual practices by bank tellers. Technological upgrades increased work efficiency and also helped to save labor costs. More importantly, compared to the labor-driven mechanism, the computerized system was more effective for retrieving data from historical records and analyzing these data for business development. At the same time, it became easier for the headquarters to control the local branches by checking digitalized information records. Realizing the benefits of these informatization and centralization tactics, the head company assigned its Department of Information Management to develop a centralized database collecting data from every single branch. This database is controlled and processed by ICBC headquarters but is also available for use by local branches with the permission of top executives.

In this context, "big data" refers to all the information collected from ICBC's daily operations and can be divided into two general
categories: "structured data" (which is organized according to preexisting database categories) and "unstructured data" (which is not) (Davenport and Kim 2013). For example, a customer's account information is typically structured data. The branch has to input the customer's gender, age, occupation, etc., into the centralized network. This information then flows into the central database, which is designed specifically to accommodate it. Any data other than the structured data will be stored as raw data and preserved without processing. For example, the video recorded at a local branch's business hall will be saved with only a date and a location label. Though "big data" in ICBC's informational projects refers to both structured and unstructured data, the former is the core of ICBC's big data strategy and is primarily used for data mining.

Since the late 1990s, ICBC has invested in big data development with increasingly large economic and human resources. On September 1st, 1999, ICBC inaugurated its "9991" project, which aimed at centralizing the data collected from ICBC branches nationwide. This project took more than 3 years to accomplish its goal. Beginning in 2002, all local branches were connected to ICBC's Data Processing Center in Shanghai – a data warehouse with a 400 terabyte (TB) capacity. The center's prestructured database enables ICBC headquarters to process and analyze data as soon as they are generated, regardless of the location. With its enhanced capability in storing and managing data, ICBC also networked and digitized its local branch operations. Tellers are able to input customer information (including their profiles and transaction records) into the national Data Center through their computers at local branches. These two-step strategies of centralization and digitization allow ICBC to converge local operations on one digital platform, which intensifies the headquarters' control over national businesses. In 2001, ICBC launched another data center in Shenzhen, China, which is in charge of the big data collected from its overseas branches. ICBC's database thus enables the headquarters' control over business and daily operations globally and domestically.

By 2014, ICBC's Data Center in Shanghai had collected more than 430 million individual customers' profiles and more than 600,000 commercial business records. National transactions – exceeding 215 million on a daily basis – have all been documented at the Data Center. Data storage and processing on such a massive scale cannot be fulfilled without a powerful and reliable computer system. The technology infrastructure supporting ICBC's big data strategy consists of three major elements: hardware, software, and cloud computing. Suppliers are both international and domestic, including IBM, Teradata, and Huawei.

Further, ICBC has also invested in data backup to secure its database infrastructure and data records. The Shanghai Data Center has a backup system in Beijing which can record data when the main server fails to work properly. The Beijing data center serves as a redundant system in case the Shanghai Data Center fails. It takes less than 30 seconds to switch between the two centers. To speed data backup and minimize data loss in significant disruptive events, ICBC undertakes multiple disaster recovery (DR) tests on a regular basis.

The accumulation and construction of big data is significant for ICBC's daily operation in three respects. First of all, big data allows ICBC to develop its customers' business potential through a so-called "single-view" approach. A customer's business data collected from one of ICBC's 35 departments are available for all the other departments. By mining the shared database, ICBC headquarters is able to evaluate both a customer's comprehensive value and the overall quality of all existing customers. Cross-departmental business has also been propelled (e.g., the Credit Card Department may share business opportunities with the Savings Department). Second, the ICBC marketing department has been using big data for email-based marketing (EBM). Based on the data collected from branches, the Marketing and Business Development Department is able to locate their target customers and follow up with customized marketing/advertising information via email communications. This data-driven marketing approach is increasingly popular among financial institutions in China. Third, customer
management systems rely directly on big data. All customers have been segmented into six levels, ranging from "one star" to "seven stars" (one star and two stars fall into a single segment), which indicates the customers' savings or investment levels at ICBC. "Seven star" clients have the highest level of credit and enjoy the best benefits provided by ICBC.

Big data has influenced ICBC's decision-making on multiple levels. For local branches, market insights are available at a lower cost. Consumer data generated and collected at local branches have been stored on a single platform provided and managed by the national data center. For example, a branch in an economically developing area may predict demand for financial products by checking the purchase data from branches in more developed areas. The branch could also develop greater insights regarding the local consumer market by examining data from multiple branches in the geographic area. For ICBC headquarters, big data fuels a dashboard through which it monitors ICBC's overall business and is alerted to potential risks. Previously, individual departments used to manage their financial risk through their own balance sheets. This approach was potentially misleading and even dangerous for ICBC's overall risk profile. A given branch providing many loans and mortgages may be considered to be performing well, but if a large number of branches overextended themselves, the emergent financial consequences might create a crisis for ICBC or even for the financial industry at large. Consequently, today, a decade after its data warehouse construction, ICBC considers big data indispensable in providing a holistic perspective, mitigating risk for its business and development strategies.

To date, ICBC has been a pioneer in big data construction among all the financial enterprises in China. It was the first bank to have all local data centralized in a single database. As the Director of ICBC's Informational Management Department claimed in 2014, ICBC has the largest Enterprise Database (EDB) in China.

Parallel to its aggressive strategies in big data construction, the issue of privacy protection has always been a challenge in ICBC's customer data collection and data mining. The governing policies primarily regulate the release of data from ICBC to other institutions, yet the protection of customer privacy within ICBC itself has rarely been addressed. According to the central bank's Regulation on the Administration of the Credit Investigation Industry, issued by the State Council in 2013, interbank sharing of customer information is forbidden. Further, a bank is not eligible to release customer information to its nonbanking subsidiaries. For example, the fund management company (ICBCCS) owned by ICBC is not allowed to access customer data collected from ICBC banks. The only situation in which ICBC could release customer data to a third party is when such information has been linked to an official inquiry by law enforcement. These policies prevent consumer information from leaking to other companies for business purposes. Yet, the policies have also affirmed the fact that ICBC has full ownership of the customer information, thus giving ICBC greater power to use the data in its own interests.

Cross-References

▶ Data Driven Marketing
▶ Data Mining
▶ Data Warehouse
▶ Hardware
▶ Structured Data

Further Reading

China Industrial Map Editorial Committee, China Economic Monitoring & Analysis Center, & Xinhua Holdings. (2016). Industrial map of China's financial sectors, Chapter 6. World Scientific Publishing.
Davenport, T., & Kim, J. (2013). Keeping up with the quants: Your guide to understanding and using analytics. Boston: Harvard Business School Publishing.
Forbes Ranking (2016). The world's biggest public company. Retrieved from https://www.forbes.com/companies/icbc/
Fu, M., & Hefferman, S. (2009). The effects of reform on China's bank structure and performance. Journal of Banking & Finance, 33(1), 39–52.
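The distinction the ICBC entry draws between structured data (fields known in advance, queryable by a schema) and unstructured raw data (stored with only minimal labels) can be sketched as follows. This is an illustrative sketch only: the classes, field names, and values are hypothetical, not ICBC's actual schema.

```python
from dataclasses import dataclass
from datetime import date

# Structured data: fields are known in advance, so a database
# schema can be designed to accommodate each one.
@dataclass
class CustomerRecord:
    customer_id: str
    gender: str
    age: int
    occupation: str

# Unstructured data: the content itself is opaque raw bytes; only
# minimal metadata (here, a date and a location label) is attached.
@dataclass
class RawAsset:
    payload: bytes
    recorded_on: date
    location: str

# Hypothetical examples of each category.
profile = CustomerRecord("cust-001", "F", 42, "engineer")
video = RawAsset(b"<binary video frames>", date(2014, 5, 1), "branch business hall")

# A structured record can be queried field by field...
print(profile.occupation)
# ...while the raw asset can only be filtered by its labels.
print(video.location)
```

The asymmetry is the point: analytic queries (and hence data mining) run naturally over the structured records, while the raw assets stay searchable only through whatever labels were attached at capture time.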
access to that data for the individuals to which it related. In order to comply with the Act, a data controller must comply with the following eight principles: "data should be processed fairly and lawfully; should be obtained only for specified and lawful purposes; should be adequate, relevant, and not excessive; should be accurate and, where necessary, kept up to date; should not be kept longer than is necessary for the purposes for which it is processed; should be processed in accordance with the rights of the data subject under the Act; appropriate technical and organisational measures should be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data; and should not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data."

In 1995, the EU formally adopted the General Directive on Data Protection. In 1997, DUIS, the Data User Information System, was implemented, and the Register of Data Users was published on the internet. In 2000, the majority of the Data Protection Act came into force. The name of the office was changed from the Data Protection Registrar to the Data Protection Commissioner. Notification replaced the registration scheme established by the 1984 Act. Revised regulations implementing the provisions of the Data Protection Telecommunications Directive 97/66/EC came into effect. In January 2001, the office was given the added responsibility of the Freedom of Information Act and changed its name to the Information Commissioner's Office. On 1 January 2005, the Freedom of Information Act 2000 was fully implemented. The Act was intended to improve the public's understanding of how public authorities carry out their duties, why they make the decisions they do, and how they spend their money. Placing more information in the public domain would ensure greater transparency and trust and widen participation in policy debate. In October 2009, the ICO adopted a new mission statement: "The ICO's mission is to uphold information rights in the public interest, promoting openness by public bodies and data privacy for individuals." In 2011, the ICO launched the "data sharing code of practice" at the House of Commons and was enabled to impose monetary penalties of up to £500,000 for serious breaches of the Privacy and Electronic Communications Regulations.

Cross-References

▶ Data Protection
▶ Open Data

Further Readings

Data Protection Act 1984. http://www.out-law.com/page-413. Accessed Aug 2014.
Data Protection Act 1984. http://www.legislation.gov.uk/ukpga/1984/35/pdfs/ukpga_19840035_en.pdf?view=extent. Accessed Aug 2014.
Smartt, U. (2014). Media & entertainment law (2nd ed.). London: Routledge.
visually and the ways that most people are doing it without much success. Also around this time, William Cleveland extended and refined data visualization techniques for statisticians. At the end of the century, the term information visualization was proposed. In 1999, Stuart Card, Jock Mackinlay, and Ben Shneiderman published their book entitled "Readings in Information Visualization: Using Vision to Think." Moving to the twenty-first century, Colin Ware published two books, "Information Visualization: Perception for Design" (2004) and "Visual Thinking for Design" (2008), in which he compiled, organized, and explained what we have learned from several scientific disciplines about visual thinking and cognition and applied that knowledge to data visualization (Few 2013).

Since the turn of the twenty-first century, data visualization has been popularized, and it has reached the masses through commercial software products that are distributed through the web. Many of these data visualization products promote superficially appealing esthetics and neglect useful and effective data exploration, sense-making, and communication. Nevertheless, there are a few serious contenders that offer products which help users fulfill data visualization's potential in practical and powerful ways.

From Static to Interactive

Visualization can be categorized into static and interactive. In the case of static visualization, there is only one view of the data, and on many occasions, multiple views are needed in order to fully understand the available information. Also, the number of dimensions of the data is limited. Thus representing multidimensional datasets fairly in static images is almost impossible. Static visualization is ideal when alternate views are neither needed nor desired, and it is specially suited for a static medium (e.g., print) (Knaffic 2015). It is worth mentioning that infographics are also part of static visualization. Infographics (or information graphics) are graphic visual representations of data or knowledge, which are able to present complex information quickly and clearly. Infographics have been used for many years, and recently the availability of many easy-to-use free tools has made the creation of infographics available to every Internet user (Murray 2013).

Of course, static visualizations can also be published on the World Wide Web in order to disseminate more easily and rapidly. Publishing on the web is considered to be the quickest way to reach a global audience. An online visualization is accessible by any Internet user who employs a recent web browser, regardless of the operating system (Windows, Mac, Linux, etc.) and device type (laptop, desktop, smartphone, tablet). But the true capabilities of the web are being exploited in the case of interactive data visualization.

Dynamic, interactive visualizations can empower people to explore data on their own. The basic functions of most interactive visualization tools were set out in 1996, when Ben Shneiderman proposed a "Visual Information-Seeking Mantra" (overview first, zoom and filter, and then details on demand). These functions allow data to be accessible by every user, from the one who is just browsing or exploring the dataset to the one who approaches the visualization with a specific question in mind. This design pattern is the basic guide for every interactive visualization today.

An interactive visualization should initially offer an overview of the data, but it must also include tools for discovering details. Thus it will be able to facilitate different audiences, from those who are new to the subject to those who are already deeply familiar with the data. Interactive visualization may also include animated transitions and well-crafted interfaces in order to engage the audience in the subject it covers.

User Control

In the case of interactive data visualization, users interact with the visualization by introducing a number of input types. Users can zoom in on a particular part of an existing visualization, pinpoint an area that interests them, select an option from an offered list, choose a path, and input numbers or text that customize the visualization. All the
previously mentioned input types can be accomplished by using a keyboard, mouse, touch screen, and other more specialized input devices. With the help of these input actions, users can control both the information being represented on the graph and the way that the information is being presented. In the second case, the visualization is usually part of a feedback loop. In most cases the actual information remains the same, but the representation of the information does change. One other important parameter in interactive data visualizations is the time it takes for the visualization to be updated after the user has introduced an input. A delay of more than 20 ms is noticeable by most people. The problem is that when large amounts of data are involved, this immediate rendering is impossible.

Interactive framerate is a term that is often used to measure the frequency with which a visualization system generates an image. In the case that the rapid response time required for interactive visualization is not feasible, there are several approaches that have been explored in order to provide people with rapid visual feedback based on their input. These approaches include:

Parallel rendering: in this case the image is rendered simultaneously by two or more computers (or video cards). Different frames are rendered at the same time by different computers, and the results are transferred over the network for display on the user's computer.
Progressive rendering: in this case a framerate is guaranteed by rendering some subset of the information to be presented. It also provides progressive improvements to the rendering when the visualization is no longer changing.
Level-of-detail (LOD) rendering: in this case simplified representations of information are rendered in order to achieve the desired frame rate while a user is providing input. When the user has finished manipulating the visualization, the full representation is used in order to generate a still image.
Frameless rendering: in this type of rendering, the visualization is not presented as a time series of images. Instead, a single image is generated where different regions are updated over time.

Types of Interactive Data Visualizations

Information, and more specifically statistical information, is abstract, since it describes things that are not physical. It can concern education, sales, diseases, and various other things. But everything can be displayed visually, if a way is found to give it a suitable form. The transformation of the abstract into physical representation can only succeed if we understand a bit about visual perception and cognition. In other words, in order to visualize data effectively, one must follow design principles that are derived from an understanding of human perception.

Heer, Bostock, and Ogievetsky (2010) defined the types (and also their subcategories) of data visualization:

(i) Time series data (index charts, stacked graphs, small multiples, horizon graphs)
(ii) Statistical distributions (stem-and-leaf plots, Q-Q plots, scatter plot matrix (SPLOM), parallel coordinates)
(iii) Maps (flow maps, choropleth maps, graduated symbol maps, cartograms)
(iv) Hierarchies (node-link diagrams, adjacency diagrams, enclosure diagrams)
(v) Networks (force-directed layout, arc diagrams, matrix views)

Tools

There are a lot of tools that can be used for creating interactive data visualizations. All of them are either free or offer a free version (plus a paid version that includes more features). According to datavisualization.ch, the list of the tools that most users employ includes: Arbor.js, CartoDB, Chroma.js, Circos, Cola.js, ColorBrewer, Cubism.js, Cytoscape, D3.js, Dance.js, Data.js, DataWrangler, Degrafa, Envision.js, Flare, GeoCommons, Gephi, Google Chart Tools, Google Fusion Tables, I Want
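The level-of-detail (LOD) approach described above can be sketched in a few lines: draw a simplified, decimated copy of the data while the user is interacting, and the full-resolution data once interaction stops. This is a toy illustration, not code from any particular visualization system; the series, the decimation factor, and the function names are hypothetical.

```python
def decimate(series, factor):
    """Keep every `factor`-th point: a crude simplified representation."""
    return series[::factor]

def render(series, interacting, lod_factor=10):
    """Return the points that would be drawn for the current frame.

    While the user is providing input, draw the decimated series to
    hold the desired frame rate; once interaction stops, draw the
    full representation as a still image.
    """
    if interacting:
        return decimate(series, lod_factor)
    return series

# Hypothetical data: 1,000 samples.
data = list(range(1000))

frame_while_dragging = render(data, interacting=True)
final_frame = render(data, interacting=False)

print(len(frame_while_dragging))  # 100
print(len(final_frame))           # 1000
```

In a real system the same idea applies per frame: the renderer trades fidelity for latency only while input events are arriving, keeping updates under the noticeable-delay budget discussed above.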
behavior in the developing world. As the functionality provided by mobile money services extends into loans, money transfers from abroad, cash withdrawal, and the purchase of goods, the data yielded by these platforms will become even richer.

The data produced by mobile devices have already been used to glean insights into complex economic or social systems in the developing world. In many cases, the insights into local economic conditions that result from the analysis of mobile device data can be produced more quickly than national statistics. For example, in Indonesia the UN Global Pulse monitored tweets about the price of rice and found them to be highly correlated with national spikes in food prices. The same study found that tweets could be used to identify trends in other types of economic behavior such as borrowing. Similarly, research by Nathan Eagle has shown that reductions in additional airtime purchases are associated with falls in income. Researchers Han Wang and Liam Kilmartin examined Call Detail Record (CDR) data generated from mobile devices in Uganda and identified differences in the way that wealthy and poor individuals respond to price discounts. The researchers also used the data to identify centers of economic activity within Uganda.

Besides providing insight into how individuals respond to price changes, big data analytics allows researchers to explore the complex ways in which the economic lives of the poor are organized. Researchers at Harvard's Engineering Social Systems lab have used mobile phone data to explore the behavior of inhabitants of slums in Kenya. In particular, the authors tested theories of rural-to-urban migration against spatial data emitted by mobile devices. Some of the same researchers have used mobile data to examine the role of social networks in economic development and found that diversity in individuals' network relationships is associated with greater economic development. Such research supports the contention that insular networks – i.e., highly clustered networks with few ties to outside nodes – may limit the economic opportunities that are available to members.

Big data analytics are also being used to enhance understanding of international development assistance. In 2009, the College of William and Mary, Brigham Young University, and Development Gateway created AidData (aiddata.org), a website that aggregates data on development projects to facilitate project coordination and provide researchers with a centralized source for development data. AidData also maps development projects geospatially and links donor-funded projects to feedback from the projects' beneficiaries.

Big Data in Practice

Besides expanding the evidence base available to international development scholars and practitioners, large data sets and big data analytic techniques have played a direct role in promoting international development. Here the term "development" is considered in its broad sense as referring not to a mere increase in income, but to improvements in variables such as health and governance.

The impact of infectious diseases on developing countries can be devastating. Besides the obvious humanitarian toll of outbreaks, infectious diseases prevent the accumulation of human capital and strain local resources. Thus there is great potential for big data-enabled applications to enhance epidemiological understanding, mitigate transmission, and allow for geographically targeted relief. Indeed, it is in the tracking of health outcomes that the utility of big data analytics in the developing world has been most obvious. For example, Amy Wesolowski and colleagues used mobile phone data from 15 million individuals in Kenya to understand the relationship between human movement and malaria transmission. Similarly, after noting in 2008 that search trends could be used to track flu outbreaks, researchers at Google.org have used data on searches for symptoms to predict outbreaks of the dengue virus in Brazil, Indonesia, and India. In Haiti, researchers from Columbia University and the Karolinska Institute used SIM card data to track the dispersal of people following a cholera outbreak. Finally, the Centers for Disease Control
and Prevention used mobile phone data to direct resources to appropriate areas during the 2014 Ebola outbreak.

Big data applications may also prove useful in improving and monitoring aspects of governance in developing countries. In Kenya, India, and Pakistan, witnesses of public corruption can report the incident online or via text message to a service called "I Paid A Bribe." The provincial government in Punjab, Pakistan, has created a citizens' feedback model, whereby citizens are solicited for feedback regarding the quality of government services they received via automated calls and texts. In an effort to discourage absenteeism in India and Pakistan, certain government officials are provided with cell phones and required to text geocoded pictures of themselves at jobsites. These mobile government initiatives have created a rich source of data that can be used to improve government service delivery, reduce corruption, and more efficiently allocate resources.

Applications that exploit data from social media have also proved useful in monitoring elections in sub-Saharan Africa. For example, Aggie, a social media tracking software designed to monitor elections, has been used to monitor elections in Liberia (2011), Ghana (2012), Kenya (2013), and Nigeria (2011 and 2014). The Aggie system is first fed with a list of predetermined keywords, which are established by local subject matter experts. The software then crawls social media feeds – Twitter, Facebook, Google+, Ushahidi, and RSS – and generates real-time trend visualizations based on keyword matches. The reports are monitored by a local Social Media Tracking Center, which identifies instances of violence or election irregularities. Flagged incidents are passed on to members of the election commission, police, or other relevant stakeholders.

The history of international economic development initiatives is fraught with would-be panaceas that failed to deliver. White elephants – large-scale capital investment projects for which the social surplus is negative – are strewn across poor countries as reminders of the preferred development strategies of the past. While more recent approaches to reducing poverty that have focused on improving institutions and governance within poor countries may produce positive development effects, the history of development policy suggests that optimism should be tempered. The same caution holds in regard to the potential role of big data in international economic development. Martin Hilbert's 2016 systematic review article rigorously enumerates both the causes for optimism and the reasons for concern. While big data may assist in understanding the nature of poverty or lead to direct improvements in health or governance outcomes, the availability and ability to process large data sets are not a panacea.

Cross-References

▶ Economics
▶ Epidemiology
▶ U.S. Agency International Development
▶ United Nations Global Pulse (Development)
▶ World Bank

Further Reading

Hilbert, M. (2016). Big data for development: A review of promises and challenges. Development Policy Review, 34(1), 135–174.
Wang, H., & Kilmartin, L. (2014). Comparing rural and urban social and economic behavior in Uganda: Insights from mobile voice service usage. Journal of Urban Technology, 21(2), 61–89.
Wesolowski, A., et al. (2012). Quantifying the impact of human mobility on malaria. Science, 338(6104), 267–270.
World Economic Forum. (2012). Big data, big impact: New possibilities for international development. Cologny/Geneva: World Economic Forum. http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf
employment, their key agenda, which has dominated activities in recent decades, is "decent work."

"Decent work" refers to an aspiration for people to have work that is productive, provides a fair income with security and social protection, safeguards basic rights, and offers equal opportunities and treatment, opportunities for personal development, and a voice in society. "Decent work" is central to efforts to reduce poverty and is a path to achieving equitable, inclusive, and sustainable development; ultimately it is seen as a feature which underpins peace and security in communities and societies (ILO 2014a).

The "decent work" concept was formulated by the ILO in order to identify the key priorities on which to focus its efforts. "Decent work" is designed to reflect priorities on the social, economic, and political agenda of countries as well as the international system. In a relatively short time, this concept has formed an international consensus among government, employers, workers, and civil society that decent work is key to equitable globalization, a path to reduce poverty as well as inclusive and sustainable development. The overall goal of "decent work" is to instigate positive change in/for people at all spatial scales.

Putting the decent work agenda into practice is achieved through the implementation of the ILO's four strategic objectives, with gender equality as a crosscutting objective:

1. Creating jobs to foster an economy that generates opportunities for investment, entrepreneurship, skills development, job creation, and sustainable livelihoods.
2. Guaranteeing rights at work in order to obtain recognition for work achieved as well as respect for the rights of all workers.
3. Extending social protection to promote both inclusion and productivity of all workers. This is to be enacted by ensuring both women and men experience safe working conditions, allowing free time, taking into account family and social values and situations, and providing compensation where necessary in the case of lost or reduced income.
4. Promoting social dialogue by involving both workers and employers in the organizations in order to increase productivity, avoid disputes and conflicts at work, and more broadly build cohesive societies.

ILO Data

The ILO produces research on important labor market trends and issues to inform constituents, policy makers, and the public about the realities of employment in today's modern globalized economy and the issues facing workers and employers in countries at all development stages. In order to do so, it draws on data from a wide variety of sources.

The ILO is a major provider of statistics, as these are seen as important tools to monitor progress toward labor standards. In addition to the maintenance of key databases (ILO 2014b) such as LABORSTA, it also publishes compilations of labor statistics, such as the Key Indicators of the Labour Market (KILM), a comprehensive database of country-level data for key indicators in the labor market, which is used as a research tool for labor market information. Other databases include ILOSTAT, a series of databases with labor-related data; NATLEX, which includes legislation related to labor markets, social security, and human rights; and NORMLEX, which brings together ILO labor standards and national labor and security laws (ILO 2014c). The ILO database provides a range of datasets with annual labor market statistics covering over 100 indicators worldwide, including annual as well as short-term indicators, estimates and projections of total population, and labor force participation rates.

Statistics are vital for the development and evaluation of labor policies, as well as more broadly to assess progress toward key ILO objectives. The ILO supports member states in the collection and dissemination of reliable and recent data on labor markets. While the data produced by the ILO are both wide ranging and widely used, they are not considered by most to be "big data," and this has been recognized.
International Labor Organization 3
ILO, Big Data, and the Gender Data Gap

In October 2014, a joint ILO-Data2X roundtable event held in Switzerland identified the importance of developing innovative approaches to the better use of technology, including big data, in particular where it can be sourced and where innovations can be made in survey technology. This event, which brought together representatives from national statistics offices, key international and regional organizations, and nongovernmental organizations, was organized to discuss where there were gender data gaps, particularly focusing on informal and unpaid work as well as agriculture. These discussions were sparked by wider UN discussions about the data revolution and the importance of development data in the post-2015 development agenda. It is recognized that big data (including administrative data) can be used to strengthen the existing collection of gender statistics, but more effort is needed to find new and innovative ways to work with new data sources to meet a growing demand for more up-to-date (and frequently updated) data on gender and employment (United Nations 2013). The fundamental goal of the discussion was to improve gender data collection, which can then be used to guide policy and inform the post-2015 development agenda, and here big data is acknowledged as a key component. At this meeting, four types of gender data gaps were identified: coverage across countries and/or regular country production, international standards to allow comparability, complexity, and granularity (sizeable and detailed datasets allowing disaggregation by demographic and other characteristics). Furthermore, a series of big data types that have the potential to increase the collection of gender data were identified:

• Mobile phone records: for example, mobile phone use and recharge patterns could be used as indicators of women's socioeconomic welfare or mobility patterns.
• Financial patterns: exploring engagement with financial systems.
• Online activity: for example, Google searches or Twitter activity, which might be used to gain insights into women's maternal health, cultural attitudes, or political engagement.
• Sensing technologies: for example, satellite data, which might be used to examine agricultural productivity, access to healthcare, and education services.
• Crowdsourcing: for example, disseminating apps to gain views about different elements of societies.

A primary objective of this meeting was to highlight that existing gender data gaps are large and often reflect traditional societal norms, and that no data (or poor data) can have significant development consequences. Big data here has the potential to transform the understanding of women's participation in work and communities. Crucially, it was posited that while better data is needed to monitor the status of women in informal employment conditions, it is not necessarily important to focus on trying to extract more data but to make an impact with the data that is available to try and improve wider social, economic, and environmental conditions.

ILO, the UN, and Big Data

The aforementioned meeting represented one example of where the ILO has engaged with other stakeholders to not only acknowledge the importance of big data but begin to consider potential options for its use with respect to their agendas. However, as a UN agency, the ILO partakes in wider discussions with the UN regarding the importance of big data, as was seen in the 45th session of the UN Statistical Commission in March 2014, where the report of the Secretary-General on "big data and the modernization of statistical systems" was discussed (United Nations 2014). This report is significant as it touches upon important issues, opportunities, and challenges that are relevant for the ILO with respect to the use of big data.

The report makes reference to the UN "Global Pulse," which is an initiative on big data established in 2009 which included a vision of a
future where big data was utilized safely and responsibly. Its mission was to accelerate the adoption of big data innovation. Partnering with UN agencies such as the ILO, governments, academics, and the private sector, it sought to achieve a critical mass of implemented innovation and strengthen the adoption of big data as a tool to foster the transformation of societies.

There is a recognition that the national statistical system is essentially now subject to competition from other actors producing data outside of their system, and there is a need for the data collection of national statistics to adjust in order to make use of the mountain of data now being produced almost continuously (and often automatically). To make use of big data, a shift may be required from the traditional survey-oriented collection of data to a more secondary data-focused orientation drawing on data sources that are high in volume, velocity, and variety. Increasing demand from policy makers for real-time evidence, in combination with declining response rates to national household and business surveys, means that organizations like the ILO will have to acknowledge the need to make this shift. There are a number of different sources of big data which may be potentially useful for the ILO: sources from administration, e.g., bank records; commercial and transaction data, e.g., credit card transactions; sensor data, e.g., satellite images or road sensors; tracking devices, e.g., mobile phone data; behavioral data, e.g., online searches; and opinion data, e.g., social media. Official statistics like those presented in ILO databases often rely on administrative data, and these are traditionally produced in a highly structured manner which can in turn limit their use. If administrative data were collected in real time, or on a more frequent basis, then they have the potential to become "big data."

There are, however, a number of challenges related to the use of big data which face the UN, its agencies, and national statistical services alike:

• Legislative: in many countries, there will not be legislation in place to enable the access to, and use of, big data, particularly from the private sector.
• Privacy: a dialogue will be required in order to gain public trust around the use of data.
• Financial: related to the costs of accessing data.
• Management: policies and directives to ensure the management and protection of data.
• Methodological: data quality, representativeness, and volatility are all issues which present potential barriers to the widespread use of big data.
• Technological: the nature of big data, particularly the volume in which it is often created, means that some countries would need enhanced information technology.

An assessment of the use of big data for official statistics carried out by the UN indicates that there are good examples where it has been used, for example, using transactional, tracking, and sensor data. However, in many cases, a key implication is that statistical systems and IT infrastructures need to be enhanced in order to be able to support the storage and processing of big data as it accumulates over time.

Modern society has witnessed an explosion of the quantity and diversity of real-time information, known more commonly as big data, presenting a potential paradigm shift in the way official statistics are collected and analyzed. In the context of increased demand for statistical information, organizations recognize that big data has the potential to generate new statistical products in a timelier manner than traditional official statistical sources. The ILO, alongside a broader UN agenda to acknowledge the data revolution, recognizes the potential for future uses of big data at the global level, although there is a need for further investigation of the data sources, challenges, and areas of use of big data, and its potential contribution to efforts working toward the "better work" agenda.

Cross-References

▶ United Nations
▶ United Nations Educational, Scientific and Cultural Organization (UNESCO)
▶ United Nations Global Pulse
issues on behalf of the members. Beyond their "business crawl" efforts promoting local businesses and their connection to, and success yielding from, the Internet economy, the Association is active in many other areas. These areas include Internet freedom (nationally and worldwide) and patent reform, among others, with their most important concern being net neutrality. As Big Data is associated with the Internet, and the industry is interested in being an active stakeholder in related policy, the Association has taken several opportunities to make its opinions heard on the matter. These opinions can also be traced throughout the policies it seeks to propose in other major connected areas.

Most notably, after the White House Office of Science and Technology Policy's (OSTP) 2014 request for information, as part of its 90-day review on the topic of Big Data, the Internet Association released a set of comments that crystallize its views on the matter. Prior communications have also brought up certain aspects related to Big Data; however, the comments made to the OSTP have been the most comprehensive and detailed public statement to date by the industry on issues of Big Data, privacy, and government surveillance.

In matters of privacy regulation, the Association believes that the current framework is both robust and effective in relation to commercial entities. In their view, reform is mostly necessary in the area of government surveillance: by adopting an update to the Electronic Communications Privacy Act (which would give service providers a legal basis for denying government requests for data that are not accompanied by a warrant), prohibiting bulk governmental collection of metadata from communications, and clearly bounding surveillance efforts by law.

The Internet Association subscribes to the notion that the current regime for private sector privacy regulation is not only sufficient but also perfectly equipped to deal with potential concerns brought about by Big Data issues. The status quo is, in the view of the Internet industry, a flexible and multilayered framework, designed for businesses that embrace privacy protective practices. The existing framework, beyond a sometimes overlapping federal-state duality of levels, also includes laws in place through the Federal Trade Commission that guard against unfair practices and that target and swiftly punish the bad actors that perpetrate the worst harms. This allows companies to harness the potential of Big Data within a privacy-aware context that does not allow or tolerate gross misconduct. In fact, the Association even cites the White House's 2012 laudatory comments on the existing privacy regimes to strengthen its argument for the regulatory status quo, beyond simply an industry's desire to be left to its own devices to innovate without restrictions.

The solutions proposed by the industry would center on private governance mechanisms that include a variety of stakeholders in the decision-making process and are not, in fact, a product of the legislative system. Such actions have been taken before and, according to the views of the Association, are successful in the general sector of privacy, and they allow industry and other actors that are involved in the specific areas to have a seat at the table beyond the traditional lobbying route.

One part that needs further action, according to the views of the Association, is educating the public on the entire spectrum of activities that lead to the collection and analysis of large data sets. With websites as the focus of most privacy-related research, the industry advocates a more consumer-oriented approach that would permeate the whole range of practices from understudied sectors to the Internet, centered around increasing user knowledge of how their data is being handled. This would allow the user to understand the entire processes that go on beyond the visible interfaces, without putting any more pressure on the industries to change their actions.

While the Internet Association considers that commercial privacy regulation should be left virtually intact, substantial government funding for research and development should be funneled into unlocking future and better societal benefits of Big Data. These funds, administered through the National Science Foundation and other instruments, would be directed toward a deeper understanding of the complexities of Big Data, including accountability mechanisms,
Internet Association, The 3
de-identification, and public release. Prioritizing such government-funded research over new regulation, the industry believes that current societal benefits from commercial Big Data usage (ranging from genome research to better spam filters) would multiply in number and effect.

The Association deems that the innovation economy would suffer from any new regulatory approaches that are designed to restrict the free flow of data. In their view, not only would the companies not be able to continue with their commercial activities, which would hurt the sector and the country, but the beneficial aspects of Big Data would suffer as well. Coupled with the revelations about the data collection projects of the National Security Agency, this would significantly impact the standing of the United States internationally, as important international agreements, such as the Transatlantic Trade and Investment Partnership with the EU, are in jeopardy, says the industry.

spread around not just between the companies involved but also with the government, as best practices would necessarily involve governmental institutions as well.

Cross-References

▶ Amazon
▶ De-identification, Re-identification
▶ Genome Data
▶ Google
▶ National Security Agency
▶ Netflix
▶ Office of Science and Technology Policy: White House Report (2014 Report)
▶ Twitter

Further Readings
protect fundamental rights and freedoms of people when personal data are processed. The Italian Data Protection Authority (DPA) is run by a four-member committee elected by the Italian Parliament for a seven-year mandate (DPA 2014a).

The main activities of the DPA consist of monitoring and assuring that organizations comply with the latest regulations on data protection and individual privacy. In order to do so, the DPA carries out inspections of organizations' databases and data storage systems to guarantee that their requirements for preserving individual freedom and privacy are of high standards. It checks that the activities of the police and the Italian Intelligence Service comply with the legislation, reports privacy infringements to judicial authorities, and encourages organizations to adopt codes of conduct promoting fundamental human rights and freedom. The authority also handles citizens' reports and complaints of privacy loss or any misuse or abuse of personal data. It bans or blocks activities that can cause serious harm to individual privacy and freedom. It grants authorizations to organizations and institutions to access and use sensitive and/or judicial data. Sensitive and judicial data concern, for instance, information on a person's criminal records, ethnicity, religion or other beliefs, political opinions, membership of parties, trade unions and/or associations, health, or sex life. Access to sensitive and judicial data is granted only for specific purposes, for example, in situations where it is necessary to know more about a certain individual for national security reasons (DPA 2014b).

The DPA participates in data protection activities involving the European Union and other international supervisory authorities and follows existing international conventions (Schengen, Europol, and the Customs Information System) when regulating Italian data protection and security matters. It carries out an important role in increasing public awareness of privacy legislation and in soliciting the Italian Parliament to develop legislation on new economic and social issues (DPA 2014b). The DPA has also formulated specific guidelines on cloud computing for helping Italian businesses. Yet, according to this authority, these cloud computing guidelines require that Italian laws be updated to be fully effective in regulating this area. Critics indicate that there are limits in existing Italian laws concerning the allocation of liabilities, data security, jurisdiction, and notification of infractions to the supervisory authority (Russo 2012).

Another area of great interest for the DPA is the collection of personal data via video surveillance, both in the public and in the private sector. The DPA has acted on specific cases of video surveillance, sometimes banning and other times allowing it (DPA 2014c). For instance, the DPA reported that it had banned the use of webcams in a nursery school to protect children's privacy and to safeguard freedom of teaching. It banned police headquarters from processing images collected via CCTV cameras installed in streets for public safety purposes because such cameras also captured images of people's homes. The use of customers' pre-recorded, operator-unassisted phone calls for debt collection purposes is among those activities that have been prohibited by this authority. Yet, the DPA permits the use of video surveillance in municipalities for counter-vandalism purposes (DPA 2014b).

Conclusion

Overall, Italy is advancing with the regulation of the big data phenomenon, following also the impetus given by the EU institutions and international debates on data protection, security, and privacy. Nonetheless, Italy is still lagging behind many western and European countries regarding the adoption and development of frameworks for a full digital economy. According to the Networked Readiness Index 2015 published by the World Economic Forum, Italy is ranked 55th. As indicated by the report, Italy's major weakness is still a political and regulatory environment that does not facilitate the development of a digital economy and its innovation system (Bilbao-Osorio et al. 2014).
Italy 3
racial discrimination in mortgage lending practices throughout the Atlanta metropolitan area.

Over the last decade, the ubiquity of large, often free, data sets has created new opportunities for journalists to make sense of the world of big data. Where precision journalism was once the domain of a few investigative reporters, data-driven reporting techniques are now a common, if not necessary, component of contemporary news work. News organizations like The Guardian, The New York Times' Upshot, and The Texas Tribune represent the mainstream embrace of big data. Some websites, like Nate Silver's FiveThirtyEight, are entirely devoted to data journalism.

How Do Journalists Use Big Data?

Big data provide journalists with new and alternative ways to approach the news. In traditional journalism, reporters collect and organize information for the public, often relying on interviews and in-depth research to report their stories. Big data allow journalists to move beyond these standard methods and report the news by gathering and making sense of aggregated data sets. This shift in methods has required some journalists and news organizations to change their information-gathering routines. Rather than identifying potential sources or key resources, journalists using big data must first locate relevant data sets, organize the data in a way that allows them to tell a coherent story, analyze the data for important patterns and relationships, and, finally, report the news in a comprehensible manner. Because of the complexity of the data, news organizations and journalists are increasingly working alongside computer programmers, statisticians, and graphic designers to help tell their stories.

One important aspect of big data is visualization. Instead of writing a traditional story with text, quotations, and the inverted-pyramid format, big data allow journalists to tell their stories using graphs, charts, maps, and interactive features. These visuals enable journalists to present insights from complicated data sets in a format that is easy for the audience to understand. These visuals can also accompany and buttress news articles that rely on traditional reporting methods.

Nate Silver writes that big data analyses provide several advantages over traditional journalism. They allow journalists to further explain a story or phenomenon through statistical tests that explore relationships, to more broadly generalize information by looking at aggregate patterns over time, and to predict future events based on prior occurrences. For example, using an algorithm based on historical polling data, Silver's website, FiveThirtyEight (formerly hosted by the New York Times), correctly predicted the outcome of the 2012 US presidential election in all 50 states. Whereas methods of traditional journalism often lend themselves to more microlevel reporting, more macrolevel and general insights can be gleaned from big data.

An additional advantage of big data is that, in some cases, they reduce the resources needed to report a story. Stories that would otherwise have taken years to produce can be assembled relatively quickly. For example, WikiLeaks provided news organizations nearly 400,000 unreleased US military reports related to the war in Iraq. Sifting through these documents using traditional reporting methods would take a considerable amount of time, but news outlets like The Guardian in the UK applied computational techniques to quickly identify and report the important stories and themes stemming from the leak, including a map noting the location of every death in the war.

Big data also allow journalists to interact with their audience to report the news. In a process called crowdsourcing the news, large groups of people contribute relevant information about a topic, which in the aggregate can be used to make generalizations and identify patterns and relationships. For example, in 2013 the New York Times website released an interactive quiz on American dialects that used responses to questions about accents and phrases to demonstrate regional patterns of speech in the US. The quiz became the most visited content on the website that year.
Journalism 3
Special Issues and Volumes

Digital Journalism – Journalism in an Era of Big Data: Cases, Concepts, and Critiques. v. 3/3 (2015).
Social Science Computer Review – Citizenship, Social Media, and Big Data: Current and Future Research in the Social Sciences (in press).
The ANNALS of the American Academy of Political and Social Science – Toward Computational Social Science: Big Data in Digital Environments. v. 659/1 (2015).
device where it either saves captured data onto the hard drive or sends it through networks/wirelessly to another device/website. KC hardware (e.g., KeyCobra, KeyGrabber, KeyGhost) may be an adaptor device into which a keyboard/mouse USB cord is plugged before it is inserted into the computer or may look like an extension cable. Hardware can also be installed inside the computer/keyboard. KC is placed on devices maliciously by hackers when computer and mobile device users visit websites, open e-mail attachments, or click links to files that are from untrusted sources. Individual technology users are frequently lured by untrusted sources and websites that offer free music files or pornography. KCs infiltrate organizations' computers when an employee is completing company business (i.e., financial transactions) on a device that he/she also uses to surf the Internet in their free time.

When a computer is infected with a malicious KC, it can be turned into what is called a zombie, a computer that is hijacked and used to spread KC malware/spyware to other unsuspecting individuals. A network of zombie computers that is controlled by someone other than the legitimate network administrator is called a botnet. In 2011, the FBI shut down the Coreflood botnet, a global KC operation affecting 2 million computers. This botnet spread KC software via an infected e-mail attachment and seemed to infect only computers using Microsoft Windows operating systems. The FBI seized the operators' computers and charged 13 "John Doe" defendants with wire fraud, bank fraud, and illegally intercepting electronic communication. Then in 2013 security firm SpiderLabs found 2 million passwords in the Netherlands stolen by the Pony botnet. While researching the Pony botnet, SpiderLabs discovered that it contained over a million and a half Twitter and Facebook passwords and over 300,000 Gmail and Yahoo e-mail passwords. Payroll management company ADP, with over 600,000 clients in 125 countries, was also hacked by this botnet.

The Scope of the Problem Internationally

In 2013 the Royal Canadian Mounted Police (RCMP) served White Falcon Communications with a warrant that alleged that the company was controlling an unknown number of computers known as the Citadel botnet (Vancouver Sun 2013). In addition to distributing KC malware/spyware, the Citadel botnet also distributed spam and conducted network attacks that reaped over $500 million in illegal profits, affecting more than 5 million people globally (Vancouver Sun 2013). The Royal Bank of Canada and HSBC in Great Britain were among the banks attacked by the Citadel botnet (Vancouver Sun 2013). The operation is believed to have originated from Russia or Ukraine, as many websites hosted by White Falcon Communications end in the .ru suffix (i.e., the country code for Russia). Microsoft claims that the 1,400 botnets running Citadel malware/spyware were interrupted due to the RCMP action, with the highest infection rates in Germany (Vancouver Sun 2013). Other countries affected were Thailand, Italy, India, Australia, the USA, and Canada. White Falcon owner Dmitry Glazyrin's voicemail claimed he was out of the country on business when the warrant was served (Vancouver Sun 2013).

Trojan horses allow others to access and install KC and other malware. Trojan horses can alter or destroy a computer and its files. One of the most infamous Trojan horses is called Zeus. Don Jackson, a senior security researcher with Dell SecureWorks who has been widely interviewed, claims that Zeus is so successful because those behind it, seemingly in Russia, are well funded and technologically experienced, and this allows them to keep Zeus evolving into different variations (Button 2013). In 2012 Microsoft's Digital Crimes Unit with its partners disrupted a variation of Zeus botnets in Pennsylvania and Illinois responsible for an estimated 13 million infections globally. Another variation of Zeus called GameOver tracks computer users' every login and uses the information to lock them out and drain their bank accounts (Lyons 2014). In some instances GameOver works in concert with
Keystroke Capture 3
CryptoLocker. If GameOver finds that an individual has little in the bank, then CryptoLocker will encrypt the user’s valuable personal and business files, agreeing to release them only once a ransom is paid (Lyons 2014). Often ransoms must be paid in Bitcoin, an Internet-based currency that is currently anonymous and difficult to track. Victims of CryptoLocker will often receive a request for a one-Bitcoin ransom (estimated to be worth 400€/$500 USD) to unlock the files on their personal computer, which could include records for a small business, academic research, and/or family photographs (Lyons 2014).
KC is much more difficult to achieve on a smartphone, as most operating systems operate only one application at a time, but it is not impossible. As an experiment, Dr. Hao Chen, an Associate Professor in the Department of Computer Science at the University of California, Davis, with an interest in security research, created KC software that operates using smartphone motion data. When tested, Chen’s application correctly guessed more than 70% of the keystrokes on a virtual numerical keypad, though he asserts that it would probably be less accurate on an alphanumerical keypad (Aron 2011). Point-of-sale (POS) data, gathered when a credit card purchase is made in a retail store or restaurant, is also vulnerable to KC software (Beierly 2010). In 2009 seven Louisiana restaurant companies (i.e., Crawfish Town USA Inc., Don’s Seafood & Steak House Inc., Mansy Enterprises LLC, Mel’s Diner Part II Inc., Sammy’s LLC, Sammy’s of Zachary LLC, and B.S. & J. Enterprises Inc.) sued Radiant Systems Inc., a POS system maker, and Computer World Inc., a POS equipment distributor, charging that the vendors did not secure the Radiant POS systems. The customers were then defrauded by KC software, and the restaurant owners incurred financial costs related to this data capture. Similarly, Patco Construction Company, Inc. sued People’s United Bank for failing to implement sufficient security measures to detect and address suspicious transactions due to KC. The company finally settled for $345,000, the amount that was stolen plus interest.
Teenage computer hackers, so-called hacktivists (people who protest ideologically by hacking computers), and governments under the auspices of cyber espionage engage in KC activities, but cyber criminals attain the most notoriety. Cyber criminals are as effective as they are evasive due to the organization of their criminal gangs. After taking money from bank accounts via KC, many cyber criminals send the payments to a series of money mules. Money mules are sometimes unwitting participants in fraud who are recruited via the Internet with promises of money for working online. The mules are then instructed to wire the money to accounts in Russia and China (Krebs 2009). Mules have no face-to-face contact with the heads of KC operations, so it can be difficult to secure prosecutions, though several notable cyber criminals have been identified, charged, and/or arrested. In late 2013 the RCMP secured a warrant for Dmitry Glazyrin, the apparent operator of a botnet, who left Canada before the warrant could be served. Then in early 2014, Russian SpyEye creator Aleksandr Panin was arrested for cyber crime (IMD 2014). Also arrested was Estonian Vladimir Tsastsin, the cyber criminal who created DNSChanger and became rich off online advertising fraud and KC by infecting millions of computers. Finnish Internet security expert Mikko Hermanni Hyppönen claimed that Tsastsin owned 159 Estonian properties when he was arrested in 2011 (IMD 2014). Tsastsin was released 10 months after his arrest due to insufficient proof. As of 2014 Tsastsin has been extradited to the US for prosecution (IMD 2014). Also in 2014 the US Department of Justice (DOJ) filed papers accusing a Russian, Evgeniy Mikhailovich Bogachev, of leading the gang behind GameOver Zeus. The DOJ claims GameOver Zeus caused $100 million in losses from individuals and large organizations.
Suspected Eastern European malware/spyware oligarchs have received ample media attention for perpetuating KC via botnets and Trojan horses, while other perpetrators have taken the public by surprise. In 2011 critics accused software company Carrier IQ of placing KC and geographical position spyware in millions of users’ Android devices (International Business Times 2011). The harshest of critics have alleged illegal wiretapping on the part of the company, while Carrier IQ has rebutted that what was identified
as spyware is actually diagnostic software that provides network improvement data (International Business Times 2011). Further, the company stated that the data was both encrypted and secured and not sold to third parties. In January 2014, 11 students were expelled from Corona del Mar High School in California’s affluent Orange County for allegedly using KC to cheat for several years with the help of tutor Timothy Lai. Police report being unable to find Lai, a former resident of Irvine, CA, since the allegations surfaced in December 2013. The students are accused of placing KC hardware onto teachers’ computers to get passwords to improve their grades and steal exams. All 11 students signed expulsion agreements in January 2014 whereby they abandoned their right to appeal their expulsions in exchange for being able to transfer to other schools in the district. Subsequently, five of the students’ families sued the district for denying the students the right to appeal and/or claiming that tutor Lai committed the KC crimes. By the end of March, the school district had spent almost $45,000 in legal fees.
When large organizations are hacked via KC, the news is reported widely. For instance, Visa found KC software able to transmit card data to a fixed e-mail or IP address where hackers could retrieve it. Here the hackers attached KC to a POS system. Similarly, KC was used to capture the keystrokes of pilots flying the US military’s Predator and Reaper drones that have been used in Afghanistan (Shachtman 2011). Military officials were unsure whether the KC software was already built into the drones or was the work of a hacker (Shachtman 2011). Finally, Kaspersky Labs has publicized how it is possible to get control of BMW’s Connected Drive system via KC and other malware, and thus gain control of a luxury car that uses this Internet-based system.
Research by Internet security firm Symantec shows that many small and medium-sized businesses believe that malware/spyware is a problem only for large organizations (e.g., Visa, the US military). However, the company notes that since 2010, 40% of all companies attacked have fewer than 500 employees, while only 28% of attacks target large organizations. A case in point is a 2012–2013 attack on a California escrow firm, Efficient Services Escrow Group of Huntington Beach, CA, which had one location and nine employees. Using KC malware/spyware, the hackers drained the company of $1.5 million in three transactions wired to bank accounts in China and Russia. Subsequently, the $432,215 sent to a Moscow bank was recovered, while the $1.1 million sent to China was never recouped. The loss was enough to shutter the business’s one office and put its nine employees out of work.
Though popular in European computer circles, the relatively low-profile Chaos Computer Club learned that German state police were using KC malware/spyware as well as saving screenshots and activating the cameras/microphones of club members (Kulish and Homola 2014). News of the police’s actions led the German justice minister to call for stricter privacy rules (Kulish and Homola 2014). This call echoes a 2006 commission report to the EU Parliament that calls for strengthening the regulatory framework for electronic communications. KC is a pressing concern in the US as well: as of 2014, 18 states and one territory (i.e., Alaska, Arizona, Arkansas, California, Georgia, Illinois, Indiana, Iowa, Louisiana, Nevada, New Hampshire, Pennsylvania, Rhode Island, Texas, Utah, Virginia, Washington, Wyoming, and Puerto Rico) have anti-spyware laws on the books (NCSL 2015).

Tackling the Problem

The problem of malicious KC can be addressed through software interventions and changes in computer users’ behaviors, especially when online. Business travelers may be at a greater risk for losses if they log onto financial accounts using hotel business centers, as these high-traffic areas provide ample opportunities to hackers (Credit Union Times 2014). Many Internet security experts recommend not using public wireless networks, where KC spyware thrives. Experts at Dell also recommend that banks have separate computers dedicated only to banking transactions, with no emailing or web browsing.
Individuals without the resources to devote one computer to financial transactions can, experts argue, protect themselves from KC by changing several computer behaviors. First, individuals should change their online banking passwords regularly. Second, they should not use the same password for multiple accounts or use common words or phrases. Third, they should check their bank accounts on a regular basis for unauthorized transfers. Finally, it is important to log off of banking websites when finished with them and to never click on third-party advertisements that post to online banking sites and take you to a new website upon clicking.
Configurations of one’s computer features, programs, and software are also urged to thwart KC. This includes removing remote access (i.e., accessing one’s work computer from home) configurations when they are not needed, in addition to using a strong firewall (Beierly 2010). Users need to continually check their devices for unfamiliar hardware attached to mice or keyboards as well as check the listings of installed software (Adhikary et al. 2012; Beierly 2010). Many financial organizations are opting for virtual keypads and virtual mice, especially for online transactions (Kumar 2009). Under this configuration, instead of typing a password and username on the keyboard using number and letter keys, the user scrolls through numbers and letters on a virtual keyboard using the cursor. When available, always use the online virtual keyboard for your banking password to avoid the risk of keystrokes being logged.

Conclusion

Having anti-KC/malware/spyware alone does not guarantee protection, but experts agree that it is an important component of an overall security strategy. Anti-KC programs include SpyShelter Stop-Logger, Zemana AntiLogger, KeyScrambler Premium, Keylogger Detector, and GuardedID Premium. Some computer experts claim that PCs are more susceptible to KC malware/spyware than are Macs, as KC malware/spyware is often reported to exploit holes in PCs’ operating systems, but new wisdom suggests that all devices can be vulnerable, especially when programs and plug-ins are added to them. Don Jackson, a senior security researcher with Dell SecureWorks, argues that one of the most effective methods for preventing online business fraud, the air-gap technique, is not widely utilized despite being around since 2005. The air-gap technique creates a unique verification code that is transmitted as a digital token, text message, or other device not connected to the online account device, so the client can read and then key in the code as a signature for each transaction over a certain amount. Alternately, in 2014 Israeli researchers presented research on a technique to hack an air-gapped network using just a cellphone.

Cross-References

▶ Banking Industry
▶ Canada
▶ China
▶ Cyber Espionage
▶ Cyber Threat/Attack
▶ Department of Homeland Security
▶ Germany
▶ Microsoft
▶ Point-of-Sales Data
▶ Royal Bank of Canada
▶ Spyware
▶ Visa

Further Readings

Adhikary, N., Shrivastava, R., Kumar, A., Verma, S., Bag, M., & Singh, V. (2012). Battering keyloggers and screen recording software by fabricating passwords. International Journal of Computer Network & Information Security, 4(5), 13–21.
Aron, J. (2011). Smartphone jiggles reveal your private data. New Scientist, 211(2825), 21.
Beierly, I. (2010). They’ll be watching you. Retrieved from http://www.hospitalityupgrade.com/_files/File_Articles/HUSum10_Beierly_Keylogging.pdf
Button, K. (2013). Wire and online banking fraud continues to spike for businesses. Retrieved from http://www.americanbanker.com/issues/178_194/wire-and-online-banking-fraud-continues-to-spike-for-businesses-1062666-1.html
Credit Union Times. (2014). Hotel business centers hacked. Credit Union Times, 25(29), 11.
IMD: International Institute for Management Development. (2014). Cybercrime buster speaks at IMD. Retrieved from http://www.imd.org/news/Cybercrime-buster-speaks-at-IMD.cfm
International Business Times. (2011). Carrier IQ spyware: Company’s Android app logging the keystrokes of millions. Retrieved from http://www.ibtimes.com/carrier-iq-spyware-companys-android-app-logs-keystrokes-millions-video-377244
Krebs, B. (2009). Data breach highlights role of ‘money mules’. Retrieved from http://voices.washingtonpost.com/securityfix/2009/09/money_mules_carry_loot_for_org.html
Kulish, N., & Homola, V. (2014). Germans condemn police use of spyware. Retrieved from http://www.nytimes.com/2011/10/15/world/europe/uproar-in-germany-on-police-use-of-surveillance-software.html?_r=0
Kumar, S. (2009). Handling malicious hackers & assessing risk in real time. Siliconindia, 12(4), 32–33.
Lyons, K. (2014). Is your computer already infected with dangerous Gameover Zeus software? Virus could be lying dormant in thousands of Australian computers. Retrieved from http://www.dailymail.co.uk/news/article-2648038/Gameover-Zeus-lying-dormant-thousands-Australian-computers-without-knowing.html#ixzz3AmHLKlZ9
NCSL: National Conference of State Legislatures. (2015). State spyware laws. Retrieved from http://www.ncsl.org/research/telecommunications-and-information-technology/state-spyware-laws.aspx
Shachtman, N. (2011). Exclusive: Computer virus hits US drone fleet. Retrieved from http://www.wired.com/2011/10/virus-hits-drone-fleet/
Vancouver Sun. (2013). Police seize computers linked to large cybercrime operation: Malware responsible for over $500 million in losses has affected more than five million people globally. Retrieved from http://www.vancouversun.com/news/Police+seize+computers+linked+large+cybercrime+operation/8881243/story.html#ixzz3Ale1G13s
L
Mainframe Servers

There are over 100 servers housed in the Springfield center, managing over 100 terabytes of data storage. As for the Miamisburg location, this complex holds 11 huge mainframe servers, running 34 multiple virtual storage (MVS) operating system images. The center also has 300 midrange Unix servers and almost 1,000 multiprocessor NT servers. They provide a wide range of computer services, including patent images to customers, preeminent US case law citation systems, hosting channel data for Reed Elsevier, and computing resources for the LexisNexis enterprise. As the company states, its processors have access to over 500 terabytes (or one trillion characters) of data storage capacity.

Telecommunications

LexisNexis has developed a large telecommunications network, permitting the corporation to support its data collection requirements while also serving its customers. As noted on its website, subscribers to the LexisNexis Group run searches one billion times annually. LexisNexis also provides bridges and routers and maintains firewalls, high-speed lines, modems, and multiplexors, providing an exceptional degree of connectivity.

Physical Dimensions of the Miamisburg Data Center

LexisNexis Group has hardware, software, electrical, and mechanical systems housed in a 73,000 ft² data center hub. Its sister complex, located in Springfield, comprises a total of 80,000 ft². In these facilities, the data center hardware, software, electrical, and mechanical systems have multiple levels of redundancy in the event that a single component fails, ensuring uninterrupted service. The company’s website states that its systems are maintained and tested on a regular basis to ensure they perform correctly in case of an emergency. The LexisNexis Group also holds and stores copies of critical data off-site. Multiple times a year, emergency business resumption plans are tested. Furthermore, the data center has system management services 365 days a year and 24 h a day, provided by skilled operations engineers and staff. If needed, there are additional specialists on site, or on call, to provide the best support to customers. According to its website, LexisNexis invests a great deal in protection architecture to prevent hacking attempts, viruses, and worms. In addition, the company has third-party contractors that conduct security studies.

Security Breach

In 2013, Byron Acohido reported that a hacking group hit three major data brokerage companies. LexisNexis, Dun & Bradstreet, and Kroll Background America are companies that stockpile and sell sensitive data. The group that hacked these data brokerage companies specialized in obtaining and selling social security numbers. The security breach was disclosed by cybersecurity blogger Brian Krebs. He stated that the website ssndob.ms (SSNDOB, an acronym for social security number and date of birth) markets itself on underground cybercrime forums, offering services to customers who want to look up social security numbers, birthdays, and other data on any US resident. LexisNexis found an unauthorized program called nbc.exe on two of its systems listed in the botnet interface network located in Atlanta, Georgia. The program was placed as far back as April 2013, compromising the company’s security for at least 5 months.

LexisNexis Group Expansion

As of July 2014, LexisNexis Risk Solutions expanded its healthcare solutions to the life science marketplace. In an article by Amanda Hall, she notes that an internal analysis revealed that 40% of the customer files in a typical life science company have missing or inaccurate information.
LexisNexis Risk Solutions has leveraged its leading databases, reducing costs, improving effectiveness, and strengthening identity transparency. LexisNexis is able to deliver data to over 6.5 million healthcare providers in the United States. This will benefit life science companies, allowing them to tailor their marketing and sales strategies and to identify the correct providers to pursue. The LexisNexis databases are more efficient, which will help health science organizations gain compliance with federal and state laws.
Following the healthcare solutions announcement, Elisa Rodgers writes that Reed Technology and Information Services, Inc., a LexisNexis company, acquired PatentCore, an innovator of patent data analytics. PatentAdvisor is a user-friendly suite, delivering information to assist with more effective patent prosecution and management. Its web-based patent analytic tools will help IP-driven companies and law firms by making patent prosecution a more strategic and predictable process.
The future of the LexisNexis Group should include more acquisitions, expansion, and increased capabilities. According to its website, the markets for its companies have grown over the last three decades, serving professionals in academic institutes, corporations, and governments, as well as business people. LexisNexis Group provides critical information, in easy-to-use electronic products, to the benefit of subscribed customers. The company has a long history of fulfilling its mission statement “to enable its customers to spend less time searching for critical information and more time using LexisNexis knowledge and management tools to guide critical decisions.” For more than a century, legal professionals have trusted the LexisNexis Group. It appears that the company will continue to maintain this status and remain one of the leading providers in the data brokerage marketplace.

Cross-References

▶ American Bar Association
▶ Big Data Quality
▶ Data Breach
▶ Data Center
▶ Legal Issues
▶ Reed Elsevier

Further Readings

Acohido, B. LexisNexis, Dunn & Bradstreet, Kroll hacked. http://www.usatoday.com/story/cybertruth/2013/09/26/lexisnexis-dunn–bradstreet-altegrity-hacked/2878769/. Accessed July 2014.
Hall, A. LexisNexis verified data on more than 6.5 million providers strengthens identity transparency and reduces costs for life science organizations. http://www.benzinga.com/pressreleases/14/07/b4674537/lexisnexis-verified-data-on-more-than-6-5-million-providers-strengthens. Accessed July 2014.
Krebs, B. Data broker giants hacked by ID theft service. http://krebsonsecurity.com/2013/09/data-broker-giants-hacked-by-id-theft-service/. Accessed July 2014.
LexisNexis. http://www.lexisnexis.com. Accessed July 2014.
Rodgers, E. Adding multimedia reed tech strengthens line of LexisNexis intellectual property solutions by acquiring PatentCore, an innovator in patent data analytics. http://in.reuters.com/article/2014/07/08/supp-pa-reed-technology-idUSnBw015873a+100+BSW20140708. Accessed July 2014.
Common types of analyses, emphasizing those types often used in practice, are explained below.
Path analysis: A path p in a graph is a sequence of vertices p = (v1, v2, …, vm), vi ∈ V, such that each consecutive pair vi, vj of vertices in p is matched by an edge of the form (vj, vi) (if the network is undirected) or (vi, vj) (if the network is directed or undirected). If one were to draw a graph graphically, a path is any sequence of movements along the edges of the network that brings you from one vertex to another. Any path is valid, even ones that have loops or cross the same vertex many times. Paths that do not intersect with themselves (i.e., vi does not equal vj for any vi, vj ∈ p) are self-avoiding. The length of a path is defined by the total number of edges along it. A geodesic path between vertices i and j is a minimum-length path of size k where p1 = i and pk = j. A breadth-first search starting from node d, which iterates over all paths of length 1, and then 2 and 3, and so on up to the largest path that originates at d, is one way to compute geodesic paths.
Network interactions: Whereas path analysis considers the global structure of a graph, the interactions among nodes are a concept related to subgraphs or microstructures. Microstructural measures consider a single node, members of its nth-degree neighborhood (the set of nodes no more than n hops from it), and the collection of interactions that run between them. If macro-measures study an entire system as a whole (the “forest”), micro-measures such as interactions try to get at the heart of the individual conditions that cause nodes to bind together locally (the “trees”). Three popular features for microstructural analysis are reciprocity, transitivity, and balance.
Reciprocity measures the degree to which two nodes are mutually connected to each other in a directed graph. In other words, if one observes that a node A connects to B, what is the chance that B will also connect to A? The term reciprocity comes from the field of social network analysis, which describes a particular set of link/graph mining techniques designed to operate over graphs where nodes represent people and edges represent the social relationships among them. For example, if A does a favor for B, will B also do a favor for A? If A sends a friend request to B on an online social system, will B reply? On the World Wide Web, if website A has a hyperlink to B, will B link to A?
Transitivity refers to the degree to which two nodes in a network have a mutual connection in common. In other words, if there is an edge between nodes A and B and one between B and C, graphs that are highly transitive indicate a tendency for an edge to also exist between A and C. In the context of social network analysis, transitivity carries an intuitive interpretation based on the old adage “a friend of my friend is also my friend.” Transitivity is an important measure in other contexts as well. For example, in a graph where edges correspond to paths of energy, as in a power grid, highly transitive graphs correspond to more efficient systems compared to less transitive ones: rather than having energy take the path A to B to C, a transitive relation would allow a transmission from A to C directly. The transitivity of a graph is measured by counting the total number of closed triangles in the graph (i.e., counting all subgraphs that are complete graphs of three nodes), multiplied by three and divided by the total number of connected triples in the graph (e.g., all sets of three vertices A, B, and C where at least the edges (A,B) and (B,C) exist).
Balance is defined for networks where edges carry a binary variable that, without loss of generality, is either “positive” (i.e., a “+,” “1,” “Yes,” “True,” etc.) or “negative” (i.e., a “−,” “0,” “No,” “False,” etc.). Vertices incident to positive edges are harmonious or non-conflicting entities in a system, whereas vertices incident to negative edges may be competitive or introduce a tension in the system. Subgraphs over three nodes that are complete are balanced or imbalanced depending on the assignment of + and − labels to the edges of the triangle, as follows:

• Three positive: Balanced. All edges are “positive” and in harmony with each other.
• One positive, two negative: Balanced. In this triangle, two nodes exhibit a harmony, and both are in conflict with the same other. The state of this triangle is “balanced” in the sense that every node is either in harmony or in conflict with all others in kind.
4 Link/Graph Mining
• Two positive, one negative: Imbalanced. In this triangle, node A is harmonious with B, and B is harmonious with C, yet A and C are in conflict. This is an imbalanced disagreement since, if A does not conflict with B, and B does not conflict with C, one would expect A to also not conflict with C. For example, in a social context where positive means friend and negative means enemy, B can fall into a conflicting situation when friends A and C disagree.
• Three negative: Imbalanced. In this triangle, all vertices are in conflict with one another. This is a dangerous scenario in systems of almost any context. For example, in a dataset of nations, mutual disagreements among three states have consequences for the world community. In a dataset of computer network components, three routers that are interconnected but in “conflict” (e.g., a down connection or a disagreement among routing tables) may lead to a system outage.

Datasets drawn from social processes always tend toward balanced states because people do not like tension or conflict. It is thus interesting to use link/graph mining to study social systems where balance may actually not hold. If a graph where most triangles are not balanced comes from a social system, one may surmise that there exist latent factors pushing the system toward imbalanced states. A labeled complete graph is balanced if every one of its triangles is balanced.
Quantifying node importance: The importance of a node is related to its ability to reach out or connect to other nodes. A node may also be important if it carries a strong degree of “flow,” that is, if the values of the relationships connected to it are very high (so that it acts as a strong conduit for the passage of information). Nodes may be important if they are vital to maintaining network connectivity, so that if an important node were removed, the graph might suddenly fragment or become disconnected. Importance may be measured recursively: a node is important if it is connected to other nodes that themselves are important. For example, people who work in the United States White House or serve as Senior Aides to the President are powerful people, not necessarily because of their job title but because they have a direct and strong relationship with the Commander in Chief.
Importance is measured by calculating the centrality of a node in a graph. Different centrality measures that encode different interpretations of node importance exist and should thus be selected according to the analysis at hand. Degree centrality defines importance as being proportional to the number of connections a node has. Closeness centrality defines importance as having a small average distance to all other nodes in the graph. Betweenness centrality defines importance as being part of as many shortest paths between other pairs of nodes in the graph as possible. Eigenvector centrality defines importance as being connected not only to many other nodes but to many other nodes that are themselves important.
Graph partitioning: In the same way that clusters in a dataset correspond to groups of points that are similar, interesting, or signify some other demarcation, vertices in graphs may also be divided into groups that correspond to a common affiliation, property, or connectivity structure. Graph partitioning takes as input the number and size of the groups and then searches for the “best” partitioning under these constraints. Community detection algorithms are similar to graph partitioning methods except that they do not require the number and size of groups to be specified a priori. But this is not necessarily a disadvantage of graph partitioning methods; if a graph miner understands well the domain from which the graph came, or if for her application she requires a partitioning into exactly k groups, graph partitioning methods should be used.

Conclusion

As systems that our society relies on become ever more complex, and as technological advances continue to help us capture the structure of this complexity at high definition, link/graph mining methods will continue to rise in prevalence. As the primary means to understand and extract knowledge from complex systems, link/graph mining
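Several of the measures defined in this entry (transitivity, triangle balance, geodesic paths via breadth-first search, and degree centrality) can be illustrated with a short, self-contained sketch. The toy graph, function names, and values below are invented for illustration and are not part of the entry itself:

```python
# Toy illustration of transitivity, balance, geodesic paths, and degree
# centrality. The graph and all names here are invented for the example.
from collections import deque
from itertools import combinations

# A small undirected graph stored as a set of edges.
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
nodes = {v for edge in edges for v in edge}

def neighbors(v):
    """All vertices that share an edge with v."""
    return {b if a == v else a for a, b in edges if v in (a, b)}

# Transitivity: three times the number of closed triangles, divided by
# the number of connected triples (paths of length two).
triangles = sum(
    1
    for a, b, c in combinations(sorted(nodes), 3)
    if b in neighbors(a) and c in neighbors(b) and c in neighbors(a)
)
triples = sum(len(list(combinations(neighbors(v), 2))) for v in nodes)
transitivity = 3 * triangles / triples

# Balance: a signed triangle is balanced exactly when it has an even
# number of negative edges (three positives, or one positive and two
# negatives), matching the four cases enumerated above.
def is_balanced(signs):
    """signs is a list of three edge labels, each +1 or -1."""
    return signs.count(-1) % 2 == 0

# Geodesic (shortest-path) lengths from one vertex, found by a
# breadth-first search over paths of length 1, then 2, and so on.
def geodesic_lengths(start):
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# Degree centrality: importance proportional to the number of connections.
degree = {v: len(neighbors(v)) for v in nodes}
```

On this toy graph, the single triangle (A, B, C) against five connected triples gives a transitivity of 3 × 1/5 = 0.6, vertex D lies two hops from A, and the parity rule reproduces the four balance cases listed above.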
into digital contacts. In 2011, LinkedIn bought the company, retooling it to pull up existing LinkedIn profiles from each card, improving the ability of members to make connections. A significant part of LinkedIn's success comes from its dedication to selling services to people who purchase talent. The chief executive of LinkedIn, Jeff Weiner, has created an intense sales-focused culture. The company celebrates new account wins during its biweekly meetings. According to George Anders, LinkedIn has doubled the number of sales employees in the past year. In addition, the company has made a $27 billion impact on the recruiting industry. Jeff Weiner also states that every time LinkedIn expands its sales team for hiring solutions, the payoff increases "off the charts." He also notes that sales keep rising and that customers are spreading enthusiasm for LinkedIn's products. Weiner further states that once sales are made, LinkedIn customers are loyal, recurring, and low maintenance. This trend is reflected in current stock market prices in the job-hunting sector. George Anders writes that older search firms, such as Heidrick & Struggles, which recruit candidates the old-fashioned way, have slumped 67%. Monster Worldwide has experienced a more dramatic drop, tumbling 81%.

As noted on its website, "LinkedIn operates the world's largest professional network on the Internet." The company has made billions of dollars, hosting a massive amount of data with a membership of 300 million people worldwide. The social network for professionals is growing at a fast pace under the tenure of Chief Executive Jeff Weiner. In a July 2014 article, David Gelles reports that LinkedIn has made its second acquisition in the last several weeks, buying Bizo for $175 million. A week prior, it purchased Newsle, a service that combs the web for articles that are relevant to members. It quickly notifies a person whenever friends, family members, coworkers, and so forth are mentioned online in the news, blogs, and/or articles.

LinkedIn continues to make great strides by leveraging its large data archives to carve out a niche in the social media sector, specifically targeting the needs of online professionals. It is evident that, through the use of big data, LinkedIn is changing and significantly influencing the job-hunting process. The company provides a service that allows its members to connect and network with professionals. LinkedIn is the world's largest professional network, proving to be an innovator in the employment service industry.

Cross-References

▶ Facebook
▶ Information Society
▶ Online Identity
▶ Social Media

Further Readings

Anders, G. How LinkedIn has turned your resume into a cash machine. http://www.forbes.com/sites/georgeanders/2012/06/27/how-linkedin-strategy/. Accessed July 2014.

Boucher Ferguson, R. The relevance of data: Behind the scenes at LinkedIn. http://sloanreview.mit.edu/article/the-relevance-of-data-going-behind-the-scenes-at-linkedin/. Accessed July 2014.

Gelles, D. LinkedIn makes another deal, buying Bizo. http://dealbook.nytimes.com/2014/07/22/linkedin-does-another-deal-buying-bizo/?_php=true&_type=blogs&_php=true&_type=blogs&_php=true&_type=blogs&_r=2. Accessed July 2014.

LinkedIn. https://www.linkedin.com. Accessed July 2014.
handling governance and data protection. Handling big data increases the risk of compromising privacy, because (social) media or internet-based services require a lot of personal information in order to be used. Moreover, analyzing big data entails a higher risk of errors, for instance, in statistical calculations or visualizations of big data.

Big Data in the Media Context

Within media, big data mainly refers to huge amounts of structured (e.g., sales, clicks) or unstructured (e.g., videos, posts, or tweets) data generated, collected, and aggregated by private business activities, governments, public administrations, or online-based organizations such as social media. In addition, the term big data usually includes references to the analysis of huge bulks of data, too. These large-scale data collections are difficult to analyze using traditional software or database techniques and require new methods in order to identify patterns in such a massive and often incomprehensible amount of data. The media ecosystem has therefore developed specialized practices and tools not only to generate big data but also to analyze it in turn. One of these practices is called data or data-driven journalism.

Data Journalism

We live in an age of information abundance. One of the biggest challenges for the media industry, and journalism in particular, is to bring order to this data deluge. It is therefore not surprising that the relationship between big data and journalism is becoming stronger, especially because large amounts of data need new and better tools that are able to provide specific context, to explain the data in a clear way, and to verify the information they contain. Data journalism is thus not entirely different from more classic forms of journalism. What makes it somewhat special, however, are the new opportunities created by combining traditional journalistic skills like research with innovative forms of investigation based on key information sets and key data, and on new processing, analytics, and visualization software that allows journalists to peer through the massive amounts of data available in a digital environment and to present them in a clear and simple way to the public. The importance of data journalism lies in its ability to gather, interrogate, visualize, and mash up data from different sources or services, and it requires an amalgamation of a journalist's "nose for news" and tech-savvy competences.

However, data journalism is not as new as it seems. Ever since organizations and public administrations have collected information or built up archives, journalism has been dealing with large amounts of data. As long as journalism has been practiced, journalists have been keen to collect data and report it accurately. When data display techniques improved in the late eighteenth century, newspapers started to use this know-how to present information in a more sophisticated way. The first example of data journalism can be traced back to 1821 and involved The Guardian, at the time based in Manchester, UK. The newspaper published a leaked table listing the number of students and the costs for each school in the city. For the first time, it was publicly shown that the number of students receiving free education was higher than what was expected in the population. Another example of early data journalism dates back to 1858, when Florence Nightingale, the social reformer and founder of modern nursing, published a report to the British Parliament about the deaths of soldiers. In her report she revealed, with the help of visual graphics, that the main cause of mortality was preventable disease contracted during care rather than battle itself.

By the middle of the twentieth century, newsrooms had started to use computers systematically to collect and analyze data in order to find and enrich news stories. In the 1950s this procedure was called computer-assisted reporting (CAR), and it is perhaps the evolutionary ancestor of what we call data journalism today. Computer-assisted reporting was, for instance, used by the television network CBS in 1952 to predict the outcome of the US presidential election. CBS used a then famous Universal Automatic Computer
(UNIVAC) and programmed it with statistical models based on voting behavior from earlier elections. With just 5% of votes in, the computer correctly predicted the landslide win of former World War II general Dwight D. Eisenhower with a margin of error of less than 1%. After this remarkable success of computer-assisted reporting at CBS, other networks started to use computers in their newsrooms as well, particularly for voting prediction. Not one election has since passed without a computer-assisted prediction. However, computers were introduced into newsrooms slowly, and only in the late 1960s did they start to be used regularly in news production as well.

In 1967, a journalism professor from the University of North Carolina, Philip Meyer, used for the first time a quicker and better-equipped IBM 360 mainframe computer to do statistical analyses on survey data collected during the Detroit riots. Meyer was able to show that not only less-educated Southerners but also people who had attended college were participating in the riots. This story, published in the Detroit Free Press, won him a Pulitzer Prize together with other journalists and marked a paradigm shift in computer-assisted reporting. On the grounds of this success, Meyer not only supported the use of computers in journalistic practices but developed a whole new approach to investigative reporting by introducing social science research methods into journalism for data gathering, sampling, analysis, and presentation. In 1973 he published his thoughts in the seminal book "Precision Journalism." That computer-assisted reporting entered newsrooms, especially in the USA, was also revealed by the increased use of computers in news organizations. In 1986, Time magazine wrote that computers were revolutionizing investigative journalism. By analyzing larger databases, journalists were able to offer a broader perspective and much more information about the context of specific events.

The practice of computer-assisted reporting spread further until, at the beginning of the 1990s, it became a standard routine, particularly in bigger newsrooms. The use of computers, together with the application of social science methods, has helped – according to Philip Meyer – to make journalism scientific. Meyer's approach also tried to tackle some of the common shortcomings of journalism, such as the increasing dependence on press releases, shrinking accuracy and trust, and the critique of political bias. An important factor of precision journalism was therefore the introduction and use of statistical software. These programs enabled journalists for the first time to analyze bigger databases such as surveys or public records. This new approach might also be seen as a reaction to alternative journalistic trends that came up in the 1990s, for instance, the concept of new journalism. While precision journalism stood for scientific rigor in data analysis and reporting, new journalism used techniques from fiction to enhance the reading experience.

There are some similarities between data journalism and computer-assisted reporting: both rely on specific software programs that enable journalists to transform raw data into news stories. However, there are also differences between the two, which are due to the contexts in which the practices were developed. Computer-assisted reporting tried to introduce both informatics and scientific methods into journalism, given that at the time data was scarce and many journalists had to generate their own data. The rise of the Internet and new media contributed to the massive expansion of archives and databases, and to the creation of big data. There is no longer a poverty of information; data is now available in abundance. Therefore, data journalism is less about the creation of new databases and more about data gathering, analysis, and visualization, which means that journalists have to look for specific patterns within the data rather than merely seek information – although recent discussions call for journalists to create their own databases due to an overreliance on public databases. Either way, the success of data journalism has also led to new practices, routines, and mixed teams of journalists working together with programmers, developers, and designers within the same newsrooms, allowing them to tell stories in a different and visually engaging way.
Media Organizations and Big Data

Big data is not only a valuable resource for data journalism. Media organizations are data gatherers as well. Many media products, whether news or entertainment, are financed through advertising. In order to satisfy the advertisers' interest in a site's audience, penetration, and visits, media organizations track user behavior on their webpages. Very often, media organizations share this data with external research bodies, which then try to use the data on their behalf. Gathering information about customers is therefore not only an issue when it comes to the use of social media. Traditional media organizations also collect data about their clients.

However, media organizations track user behavior on news websites not only to provide data to their advertisers. Through user data, they also adapt a website's content to the audience's demand, with dysfunctional consequences for journalism and its democratic function within society. Due to web analytics and the generation of large-scale data collections, the audience exerts an increasing influence over the news selection process. This means that journalists – particularly in the online realm – are at risk of increasingly adapting their news selections to audience feedback through data generated via web analytics. Due to the grim financial situation and their shrinking advertising revenue, some print media organizations, especially in Western societies, try to compensate for these deficits through a dominant market-driven discourse, manufacturing cheaper content that appeals to broader masses – publishing more soft news, sensationalism, and human-interest articles without any connection to public policy issues. This is also due to the different competitive environment: while there are fewer competitors in traditional newspaper or broadcast markets, in the online world the next competitor is just one click away. Legacy media organizations, particularly newspapers and their online webpages, offer more soft news to increase traffic, to attract the attention of more readers, and thus to retain their advertisers. A growing body of literature about the consequences of this behavior shows that journalists, in general, are becoming much more aware of the audiences' preferences. At the same time, however, there is also a growing concern among journalists with regard to their professional ethics and the consequences for the function of journalism in society if they base their editorial decision-making processes on real-time data. The results of web analytics not only influence the placement of news on websites; they also have an impact on journalists' beliefs about what the audience wants. Particularly in online journalism, news selection is carried out by grounding decisions in data generated by web analytics and no longer in intrinsic notions such as news values or personal beliefs. Consequently, online journalism becomes highly responsive to the audiences' preferences – serving less what would be in the public interest. As many news outlets are integrated organizations, meaning that they apply a cross-media strategy by joining previously separated newsrooms such as the online and print staff, it is possible that factors like data-based audience feedback will also affect print newsrooms. As Tandoc Jr. and Thomas state, if journalism continues to view itself as a sort of "conduit through which transient audience preferences are satisfied, then it is no journalism worth bearing the name" (Tandoc and Thomas 2015, p. 253).

While news organizations still struggle with self-gathered data due to the conflicts that can arise in journalism, media organizations active in the entertainment industry rely much more strongly on data about their audiences. Through large amounts of data, entertainment media can collect significant information about the audience's preferences for a TV series or a movie – even before it is broadcast. Particularly for big production companies or film studios, it is essential to observe structured data like ratings, market share, and box office statistics. But unstructured data like comments or videos on social media are equally important for understanding consumer habits, given that they provide information about the potential success or failure of a (new) product.

An example of such use of big data is the launch of the TV show "House of Cards" by the Internet-based on-demand streaming provider
Netflix. Before launching this first original content, the political drama, Netflix had already been collecting huge amounts of data about the streaming habits of its customers. From more than 25 million users, it tracked around 30 million views a day (recording also when people paused, rewound, or fast-forwarded the videos), about four million ratings, and three million searches (Carr 2013). On top of that, Netflix also gathers unstructured data from social media, looking at how customers tag the selected videos with metadata descriptors and whether they recommend the content. Based on these data, Netflix predicted possible preferences and decided to buy "House of Cards." It was a major success for the online-based company.

There are also potential risks associated with the collection of such huge amounts of data: Netflix recommends specific movies or TV shows to its customers based on what they liked or watched before. These recommendation algorithms might well guide users toward more of its original content, without taking into account the consumers' actual preferences. In addition, consumers might not be able to discover new TV shows that transcend their usual taste. Given that services like Netflix know so much about their users' habits, another concern arises with regard to privacy.

Big Data Between Social Media, Ethics, and Surveillance

Social media are a main source of big data. Since the first major social media webpages were launched in the 2000s, they have collected and stored massive amounts of data. These sites started to gather information about the behavior, preferences, and interests of their users in order to know how their users would both think and act. In general, this process of datafication is used to target and tailor services better to the users' interests. At the same time, social media use these large-scale data collections to help advertisers target the users. Big data in social media therefore also have a strong commercial connotation.

Facebook's business model, for instance, is entirely based on hyper-targeted display ads. While display ads are a relatively old-fashioned way of addressing customers, Facebook makes up for this with its incredible precision about the customers' interests and its ability to target advertising more effectively.

Big data are an integrative part of social media's business model: social media possess far more information on their customers given that they have access not only to their surfing behavior but above all to their tastes, interests, and networks. This might bear the potential not only to predict the users' behavior but also to influence it, particularly as social media such as Facebook and Twitter also adapt their noncommercial content to individual users: the news streams we see on our personal pages are balanced by various variables (differing between social media) such as interactions, posting habits, popularity, the number of friends, user engagement, and others, which are, however, constantly recombined. Through such opaque algorithms, social media might well use their own data to model voters: in 2010, for example, 61 million users in the USA were shown a banner message on their pages about how many of their friends had already voted in the US Congressional Elections. The study showed that the banner convinced more than 340,000 additional people to cast their vote (Bond et al. 2012). Such individually tailored and modeled messaging does not only bear the potential to harm civic discourse; it also enhances the negative effects deriving from the "asymmetry and secrecy built into this mode of computational politics" (Tufekci 2014).

The amount of data stored on social media will continue to rise, and already today social media are among the largest data repositories in the world. Since the data-collecting mania of social media will not decrease, which is also due to the explorative focus of big data, it raises issues with regard to the specific purpose of data collection. Particularly if data usage, storage, and transfer remain opaque and are not made transparent, the data collection might be disproportionate. Yet certain social media allow third parties to access their data, particularly as the trade in data increases because of its economic potential. This policy raises ethical issues with regard to transparency about data protection and privacy.
Particularly in the wake of the Snowden revelations, it has been shown that opaque algorithms and big data practices are increasingly important to surveillance: "[...] Big Data practices are skewing surveillance even more towards a reliance on technological "solutions," and that these both privileges organizations, large and small, whether public or private, reinforce the shift in emphasis toward control rather than discipline and rely increasingly on predictive analytics to anticipate and preempt" (Lyon 2014, p. 10). Overall, the Snowden disclosures have demonstrated that surveillance is no longer limited to traditional instruments in the Orwellian sense but has become ubiquitous and overly reliant on big data practices – as governmental agencies such as the NSA and GCHQ are allowed to access not only the data of social media and search giants but also to track and monitor the telecommunications of almost every individual in the world. However, the big issue even with the collect-all approach is that data is subject to limitations and bias, particularly when relying on automated data analysis: "Without those biases and limitations being understood and outlined, misinterpretation is the result" (Boyd and Crawford 2012, p. 668). This might well lead to false accusations or failures of predictive surveillance, as could be seen in the case of the Boston Marathon bombing: first, a picture of the wrong suspect was massively shared on social media, and second, the predictive radar grounded on data gathering was ineffective.

In addition, the use of big data generated by social media also entails ethical issues with reference to scientific research. Normally, when human beings are involved in research, strict ethical rules, such as the informed consent of the people participating in the study, have to be observed. Moreover, in social media there are both "public" and "private" data that can be accessed. An example of such a controversial use of big data is a study carried out by Kramer et al. (2014). The authors deliberately changed the newsfeed of Facebook users: some got more happy news, others more sad ones, because the goal of the study was to investigate whether emotional shifts in those surrounding us – in this case virtually – can change our own moods as well. The issue with the study was that the users in the sample were not aware that their newsfeed had been altered. This study shows that the use of big data generated in social media can entail ethical issues, not least because the constructed reality within Facebook can be distorted. Ethical questions with regard to media and big data are thus highly relevant in our society, given that both the privacy of citizens and the protection of their data are at stake.

Conclusion

Big data plays a crucial role in the context of the media. The instruments of computer-assisted reporting and data journalism allow news organizations to engage in new forms of investigation and storytelling. Big data also allow media organizations to better adapt their services to the preferences of their users. While in the news business this may lead to an increase in soft news, the entertainment industry benefits from such data in order to predict the audience's taste with regard to potential TV shows or movies. One of the biggest issues with regard to media and big data is their ethical implications, particularly with regard to data collection, storage, transfer, and surveillance. As long as the urge to collect large amounts of data and the use of opaque algorithms continue to prevail in many already powerful (social) media organizations, the risks of data manipulation and modeling will increase, particularly as big data are becoming even more important in many different aspects of our lives. Furthermore, as the Snowden revelations showed, collect-it-all surveillance already relies heavily on big data practices. It is therefore necessary to increase both the research into and the awareness of the ethical implications of big data in the media context. Only through a critical discourse about the use of big data in our society will we be able to determine "our agency with respect to big data that is generated by us and about us, but is increasingly being used at us" (Tufekci 2014). Being more transparent, accountable, and less opaque about the use and, in particular, the purpose of data collection might be a good starting point.
Cross-References

▶ Advertising Targeting
▶ Big Data Storytelling
▶ Crowdsourcing
▶ Transparency

References

Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489, 295–298.

Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.

Carr, D. (2013, February 24). Giving readers what they want. New York Times. http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html. Accessed 11 July 2016.

Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America, 111(24), 8788–8790.

Lyon, D. (2014). Surveillance, Snowden, and Big Data: Capacities, consequences, critique. Big Data & Society, 1(2), 1–13.

Tandoc, E. C., Jr., & Thomas, R. J. (2015). The ethics of web analytics: Implications of using audience metrics in news construction. Digital Journalism, 3(2), 243–258.

Tufekci, Z. (2014). Engineering the public: Big data, surveillance and computational politics. First Monday, 19(7). http://journals.uic.edu/ojs/index.php/fm/article/view/4901/4097. Accessed 12 July 2016.
will the data be archived? When will the data be open access? Why is a specific instrument needed for data collection? How will the data be maintained and updated? In journalism, the 5W1H questions are often used to evaluate whether the information covered in a news article is complete. Normally, the first paragraph of a news article gives a brief overview and provides concise information answering the 5W1H questions. By reading the first paragraph, a reader can grasp the key information of an article even before reading the full text. Metadata is data about data: it serves a function for a dataset similar to what the first paragraph does for a news article, and the metadata items used for describing a dataset are analogous to the 5W1H question words.

Metadata Hierarchy

Metadata are used for describing resources. The description can be general or detailed according to actual needs. Accordingly, there is a hierarchy of metadata items corresponding to the actual needs of describing an object. For instance, the abovementioned 5W1H question words can be regarded as a list of general metadata items, and they can also be used to describe datasets. However, the six question words only offer a starting point, and various derived metadata items appear in actual work. In the early days, the metadata provided by different stakeholders was highly heterogeneous. To promote the standardization of metadata items, a number of international standards have been developed.

The most well-known standard is the Dublin Core Metadata Element Set (DCMI Usage Board 2012). The name "Dublin" originates from a 1995 workshop at Dublin, OH, USA. The word "Core" means that the elements are generic and broad. The 15 core elements are contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type. These elements are more specific than the 5W1H question words and can be used for describing a wide range of resources, including datasets. The Dublin Core Metadata Element Set was published as a standard by the International Organization for Standardization (ISO) in 2003 and later revised in 2009. It has also been endorsed by a number of other national or international organizations, such as the American National Standards Institute and the Internet Engineering Task Force.

The 15 core elements are part of an enriched specification of metadata terms maintained by the Dublin Core Metadata Initiative (DCMI). The specification includes properties in the core elements, properties in an enriched list of terms, vocabulary encoding schemes, syntax encoding schemes, and classes (including the DCMI Type Vocabulary). The enriched terms include all 15 core elements and cover a number of more specific properties, such as abstract, access rights, has part, has version, medium, modified, spatial, temporal, valid, etc. In practice, the metadata terms in the DCMI specification can be further extended by combination with other compatible vocabularies to support various application profiles. With the 15 core elements, one is able to provide rich metadata for a certain resource, and by using the enriched DCMI metadata terms and external vocabularies, one can create an even more specific metadata description for the same object. This can be done in a few ways. For example, one way is to use terms that are not included in the core elements, such as spatial and temporal. Another possible way is to use a refined metadata term that is more appropriate for describing an object. For instance, the term "description" in the core elements has a broad meaning and may cover an abstract, a table of contents, a graphical representation, or a free-text account of a resource. In the enriched DCMI terms, there is a more specific term, "abstract," which means a summary of a resource. Compared to "description," the term "abstract" is more specific and appropriate if one wants to collect a literal summary of an academic article.

Domain-Specific Metadata Schemas

High-level metadata terms such as those in the Dublin Core Metadata Element Set have broad meaning and are applicable to various resources.
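To make the Dublin Core core elements concrete, the sketch below serializes a minimal record with Python's standard library. The element names and the namespace URI are the ones published by DCMI; the record's content (title, creator, and so on) is invented for the example.

```python
# Sketch: serializing a minimal Dublin Core record as XML with the
# Python standard library. The element names are DCMI core elements;
# the record content (title, creator, etc.) is invented for the example.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build an XML element with one dc:* child per (element, value) pair."""
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

record = dublin_core_record({
    "title": "School Cost Statistics",        # hypothetical dataset
    "creator": "Example Statistical Office",  # hypothetical agency
    "date": "2014",
    "type": "Dataset",                        # a DCMI Type Vocabulary class
    "description": "Per-school enrolment and cost figures.",
})
print(ET.tostring(record, encoding="unicode"))
```

An application profile that needs the finer-grained enriched terms (e.g., abstract or spatial) would use the DCMI terms namespace, http://purl.org/dc/terms/, in the same way.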
However, those metadata elements are too general in meaning and sometimes implicit. If one wants a more specific and detailed description of resources, a domain-specific metadata schema is needed. Such a metadata schema is a list of organized metadata items for describing a certain type of resource. For example, there could be a metadata schema for each type defined in the DCMI Type Vocabulary, such as dataset, event, image, physical object, service, etc. There have been various national and international community efforts to build domain-specific metadata schemas. In particular, many schemas developed in recent years address data management and exchange on the Web. A few recent works are introduced below.

The data catalog vocabulary (DCAT) (Erickson and Maali 2014) was approved as a World Wide Web Consortium (W3C) recommendation in January 2014. It was designed to facilitate interoperability among data catalogs published on the Web. DCAT defines a metadata schema and provides a number of examples of how to use it. DCAT reuses a number of DCMI metadata terms in combination with terms from other schemas, such as the W3C Simple Knowledge Organization System (SKOS). It also defines a few new terms to make the resulting schema more appropriate for describing datasets in data catalogs.

The Darwin Core is a group of standards for biodiversity applications. By extending the Dublin Core metadata elements, the Darwin Core establishes a vocabulary of terms to facilitate the description and exchange of data about the geographic occurrence of organisms and the physical existence of biotic specimens. The Darwin Core itself is also extensible, providing a mechanism for describing and sharing additional information.

The ecological metadata language (EML) is a metadata standard developed for the none-

The international geo sample number (IGSN), initiated in 2004, is a sample identification code for the geoscience community. Each registered IGSN identifier is accompanied by a group of metadata providing detailed background information about that sample. Top concepts in the current IGSN metadata schema are sample number, registrant, related resource identifiers, and log. A top concept may include a few child concepts. For example, there are two child concepts for "registrant": registrant name and name identifier.

The ISO 19115 and ISO 19115-2 geographic information metadata standards are regarded as a best practice of metadata schemas for geospatial data. Geospatial data are about objects with a position on the surface of the Earth. The ISO 19115 standards provide guidelines on how to describe geographic information and services. Detailed metadata items cover topics such as contents, spatiotemporal extents, data quality, and channels for access and rights to use. Another standard, ISO 19139, provides an XML schema implementation for ISO 19115. The catalog service for the Web (CSW) is an Open Geospatial Consortium (OGC) standard for describing online geospatial data and services. It adopts ISO 19139, the Dublin Core elements, and items from other metadata efforts. Core elements in CSW include title, format, type, bounding box, coordinate reference system, and association.

Annotating a Web of Data

Recent efforts on metadata standards and schemas, such as the abovementioned Dublin Core, DCAT, Darwin Core, EML, IGSN metadata, ISO 19139, and CSW, show a trend of publishing metadata on the Web. More importantly, by using standard encoding formats, such as XML and the W3C resource description framework (RDF), they are making metadata machine dis-
geospatial datasets in the field of ecology. It is a coverable and readable. This mechanism moves
set of schemas encoded in the format of extensible the burden of searching, evaluating, and integrat-
markup language (XML) and thus allows struc- ing massive datasets from humans to computers,
tured expression of metadata. EML can be used to and for computers such burden is not real burden
describe digital resources and also nondigital because they can find ways to access various data
resources such as paper maps. sources through standardized metadata on the
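As a minimal illustration of such machine-readable metadata, the sketch below builds a small Dublin Core-style XML record for a dataset using only Python's standard library. All field values are hypothetical examples, not entries from any real catalog:

```python
# Sketch of a machine-readable metadata record for a dataset, using
# Dublin Core element names (title, creator, date, type, format).
# All values below are hypothetical illustrations.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def build_record(fields: dict) -> ET.Element:
    """Wrap a dict of Dublin Core element names/values in an XML record."""
    record = ET.Element("record")
    for name, value in fields.items():
        elem = ET.SubElement(record, f"{{{DC}}}{name}")
        elem.text = value
    return record

record = build_record({
    "title": "Monthly Coastal Tide Measurements",  # hypothetical dataset
    "creator": "Example Ocean Observatory",
    "date": "2014-01-31",
    "type": "Dataset",       # a type from the DCMI Type Vocabulary
    "format": "text/csv",
})

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because the elements carry a well-known namespace, a harvester that has never seen this particular record can still recognize which field is the title and which is the type, which is what makes such metadata machine discoverable.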
For example, the project OneGeology aims to enable online access to geological maps across the world. By the end of 2014, OneGeology had 119 participating nations, and most of them share national or regional geological maps through OGC geospatial data service standards. Those map services are maintained by their corresponding organizations, and they also enable standardized metadata services, such as CSW. On the one hand, OneGeology provides technical support to organizations that want to set up geologic map services using common standards. On the other hand, it also provides a central data portal for end users to access various distributed metadata and data services. The OneGeology project presents a successful example of how to rescue legacy data, update them with well-organized metadata, and make them discoverable, accessible, and usable on the Web.

Compared with domain-specific structured datasets, such as those in OneGeology, many other datasets in Big Data are not structured, such as webpages and data streams on social media. In 2011, the search engines Bing, Google, Yahoo!, and Yandex launched an initiative called schema.org, which aims at creating and supporting a common set of schemas for structured data markup on web pages. The schemas are presented as lists of tags in hypertext markup language (HTML). Webmasters can use those tags to mark up their web pages, and search engine spiders and other parsers can recognize those tags and record what a web page is about. This makes it easier for search engine users to find the right web pages. Schema.org adopts a hierarchy to organize the schemas and vocabularies of terms. The concept on the top is thing, which is very generic and is divided into schemas of a number of child concepts, including creative work, event, intangible, medical entity, organization, person, place, product, and review. These schemas are further divided into smaller schemas with specific properties. A child concept inherits characteristics from a parent concept. For example, book is a child concept of creative work. The hierarchy of concepts and properties does not intend to be a comprehensive model that covers everything in the world. The current version of schema.org only represents those entities that the search engines can handle in the short term. Schema.org provides a mechanism for extending the scope of concepts, properties, and schemas. Webmasters and developers can define their own specific concepts, properties, and schemas. Once those extensions are commonly used on the Web, they can also be included as a part of the schema.org schemas.

Linking for Tracking

If the recognition of domain-specific topics is a work of identifying resource types, then the definition of metadata items is a work of annotating those types. The work in schema.org is an excellent reflection of those two works. Various structured and unstructured resources can be categorized and annotated by using metadata and are ready to be discovered and accessed. In a scientific or business procedure, various resources are retrieved and used, and outputs are generated, archived, and perhaps reused elsewhere. In recent years, people have taken a further step to make links among those resources, their types, and properties, as well as the people and activities involved in the generation of those outputs. The work of categorization, annotation, and linking as a whole can be used to describe the origin of a resource, which is called provenance. There have been community efforts developing specifications of commonly usable provenance models.

The Open Provenance Model was initiated in 2006. It includes three top classes: artifact, process, and agent, and their subclasses, as well as a group of properties, such as was generated by, was controlled by, was derived from, and used, for describing the classes and the interrelationships among them. Another earlier effort is the proof markup language, which was used to represent knowledge about how information on the Web was asserted or inferred from other information sources by intelligent agents. Information, inference step/inference rule, and inference engine are the three key building blocks in the proof markup language.

Works on the Open Provenance Model and the proof markup language have set up the basis for community actions.
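The core pattern shared by these provenance models, in which artifacts are generated by processes that use other artifacts and are controlled by agents, can be sketched as a small set of triples. The example below is a hypothetical illustration in Python, not an actual Open Provenance Model serialization; all resource names are invented:

```python
# Sketch of provenance relationships in the spirit of the Open
# Provenance Model: artifacts, processes, and agents linked by
# properties such as "wasGeneratedBy", "used", and "wasControlledBy".
# All names are hypothetical.
triples = [
    ("report.pdf",      "wasGeneratedBy",  "analysis-run-42"),
    ("analysis-run-42", "used",            "tides.csv"),
    ("analysis-run-42", "wasControlledBy", "alice"),
    ("tides.csv",       "wasGeneratedBy",  "sensor-ingest"),
]

def derived_from(artifact: str) -> set:
    """Follow wasGeneratedBy -> used chains to find source artifacts."""
    sources = set()
    processes = {o for s, p, o in triples
                 if s == artifact and p == "wasGeneratedBy"}
    for proc in processes:
        for s, p, o in triples:
            if s == proc and p == "used":
                sources.add(o)
                sources |= derived_from(o)  # recurse through the chain
    return sources

print(derived_from("report.pdf"))  # the artifacts report.pdf was derived from
```

Reasoning of this kind, chaining "was generated by" and "used" links to answer a "was derived from" question, is what provenance-aware knowledge systems do over imported provenance information.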
Most recently, the W3C approved the PROV Data Model as a recommendation in 2013. The PROV Data Model is a generic model for provenance, which allows specific representations of provenance in research domains or applications to be translated into the model and be interchangeable among systems (Moreau and Missier 2013). There are intelligent knowledge systems that can import provenance information from multiple sources, process it, and reason over it to generate clues for potential new findings. The PROV Data Model includes three core classes, entity, activity, and agent, which are comparable to those of the Open Provenance Model and the proof markup language. W3C also approved the PROV Ontology as a recommendation for the expression of the PROV Data Model in semantic Web languages. It can be used to represent machine-readable provenance information and can also be specialized to create new classes and properties to represent provenance information of specific applications and domains. The extension and specialization here are similar to the idea of a metadata hierarchy.

A typical application of the PROV Ontology is the Global Change Information System for the US Global Change Research Program (Ma et al. 2014), which captures and presents provenance of global change research and links to the publications, datasets, instruments, models, algorithms, and workflows that support key research findings. The provenance information in the system increases understanding, credibility, and trust in the works of the US Global Change Research Program and aids in fostering reproducibility of results and conclusions.

In traditional data management, especially for a single data center or data repository, the metadata life cycle is less addressed. Now, facing the short-lived and quick Big Data life cycles, attention should also be paid to the metadata life cycle.

In general, a data life cycle covers steps of context recognition, data discovery, data access, data management, data archive, and data distribution. Correspondingly, a metadata life cycle covers similar steps, but they focus on the description of data rather than the data themselves. The context recognition step allows people to study a specific domain or application and reuse any existing metadata standards and schemas. Then, in the metadata discovery step, it is possible to develop applications to automatically harvest machine-readable metadata from multiple sources and harmonize them. Commonly used domain-specific metadata standards and machine-readable formats will significantly facilitate the metadata life cycle in applications using Big Data, because most of such applications will be on the Web, and interchangeable schemas and formats will be an advantage.

Cross-References

▶ Data Model, Data Modeling
▶ Data Profiling
▶ Data Provenance
▶ Data Sharing
▶ Open Data
▶ Semantic Web
browser while simultaneously "degrading" on a mobile device in such a way that no functionality is lost. In other words, the same underlying code provides the user experience regardless of what technological platform one uses to visit a site. There are several advantages to this approach, including singularity of platform (that is, no need to duplicate properties, logos, databases, etc.), ease of update, unified user experience, and relative ease of deployment. However, there are downsides: full implementation of HTML5 and CSS3 is relatively new. As a result, it can be costly to find a developer who is sufficiently knowledgeable to make the solution as seamless as desired, and who can articulate the solution in such a way that non-developers will understand the full vision of the end product. Furthermore, development of a polished finished product can be time-consuming and will likely involve a great deal of compromise from a design perspective.

Mobile analytics tools are relatively easy to deploy when a marketer chooses to take this route, as most modern smartphone web browsers are built on the same technologies that drive computer-based web browsers – in other words, most mobile browsers support both JavaScript and web "cookies," both of which are typically requisites for analytics tools. Web pages can be "tagged" in such a way that mobile analytics can be measured, which will allow for the collection of a variety of information on visitors. This might include device type, browser identification, operating system, GPS location, screen resolution/size, and screen orientation, all of which can provide clues as to the contexts in which users are visiting the website on a mobile device. Some mainstream web analytics tools, such as Google Analytics, already include a certain degree of information pertaining to mobile users (i.e., it is possible to drill down into reports and determine how many mobile users have visited and what types of devices they were using); however, marketing entities that want a greater degree of insight into the success of their mobile sites will likely need to seek out a third-party solution to monitor performance.

There are a number of providers of web-based analytics solutions that cover mobile web use. These include, but are not limited to, ClickTale, which offers mobile website optimization tools; comScore, which is known for its audience measurement metrics; Flurry, which focuses on use and engagement metrics; Google, which offers both free and enterprise-level services; IBM, which offers the ability to record user sessions and perform deep analysis on customer actions; Localytics, which offers real-time user tracking and messaging options; Medio, which touts "predictive" solutions that allow for custom content creation; and Webtrends, which incorporates other third-party (e.g., social media) data.

The other primary mobile option is development of a stand-alone smartphone or tablet app. Stand-alone apps are undeniably popular, given that 50 billion apps were downloaded from the Apple App Store between July 2008 and June 2014. A number of retailers have had great success with their apps, including Amazon, Target, Zappos, Groupon, and Walgreens, which speaks to the potential power of the app as a marketing tool. However, consider that there are more than one million apps in the Apple App Store alone, as of this writing – those odds greatly reduce the chances that an individual will simply "stumble across" a company's app in the absence of some sort of viral advertising, breakout product, or buzzworthy word-of-mouth. Furthermore, developing a successful and enduring app can be quite expensive, particularly considering that a marketer will likely want to make versions of the app available for both Apple iOS and Google Android (the two platforms are incompatible with each other). Estimates for app development vary widely, from a few thousand dollars at the low end all the way up to six figures for a complex app, according to Mark Stetler of AppMuse – and these figures do not include ongoing updates, bug fixes, or recurring content updates, all of which require staff with specialized training and know-how.

If a full-fledged app or redesigned website proves too daunting or beyond the scope of what a marketer needs or desires, there are a number of other techniques that can be used to reach consumers, including text and multimedia messaging, email messaging, mobile advertising, and so forth.
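The page-tagging approach described above amounts to assembling a small event payload from each visit, containing the device and context attributes a tag can observe. The sketch below is a hypothetical illustration; the field names and the simple user-agent check are invented for the example, not any vendor's actual API:

```python
# Sketch of the kind of pageview event a mobile analytics tag might
# collect. Field names and the user-agent heuristic are illustrative,
# not taken from any real analytics product.
from datetime import datetime, timezone

MOBILE_HINTS = ("Android", "iPhone", "iPad", "Mobile")

def build_pageview_event(url: str, user_agent: str, visitor_id: str,
                         screen: tuple) -> dict:
    """Assemble the visit attributes a tagged page could report."""
    return {
        "url": url,
        "visitor_id": visitor_id,   # typically read from a web cookie
        "user_agent": user_agent,   # reveals device type and browser
        "is_mobile": any(h in user_agent for h in MOBILE_HINTS),
        "screen_resolution": f"{screen[0]}x{screen[1]}",
        "orientation": "portrait" if screen[1] >= screen[0] else "landscape",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

event = build_pageview_event(
    url="https://example.com/products",
    user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)",
    visitor_id="c0ffee",            # hypothetical cookie value
    screen=(320, 568),
)
print(event["is_mobile"], event["orientation"])
```

Aggregating many such events is what lets an analytics report break visits down by device type, screen size, and orientation.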
Mobile Analytics 3

Each of these techniques can reveal a wealth of data about consumers, so long as the appropriate analytic tools are deployed in advance of the launch of any particular campaign.

Mobile app analytics are quite different from web analytics in a number of ways, including the vocabulary. For example, there are no page views in the world of app analytics – instead, "screen views" are referenced. Likewise, an app "session" is analogous to a web "visit." App analytics often have the ability to access and gauge the use of various features built into a phone or tablet, including the accelerometer, GPS, and gyroscope, which can provide interesting kinesthetic aspects to user experience considerations. App analytics tools are also typically able to record and retain data related to offline usage for transmission when a device has reconnected to the network, which can provide a breadth of environmentally contextual information to developers and marketers alike. Finally, multiple versions of a mobile app can exist "in the wild" simultaneously because users' proclivities differ when it comes to updating apps. Most app analytic packages have the ability to determine which version of an app is in use so that a development team can track interactional differences between versions and confirm that bugs have been "squashed."

As mentioned previously, marketers who choose to forego app development and develop a mobile version of their web page often choose to stick with their existing web analytics provider, and oftentimes these providers do not provide a level of detail regarding mobile engagement that would prove particularly useful to marketers who want to capture a snapshot of mobile users. In many cases, companies simply have not given adequate consideration to mobile engagement, despite the fact that it is a growing segment of online interaction that is only going to grow, particularly as smartphone saturation continues. However, for those entities that wish to delve further into mobile analytics, there are a growing number of options available, with a few key differences between the major offerings. There are both free and paid mobile analytics platforms available; the key differentiator between these offerings seems to come down to data ownership. A third-party provider that shares the data with you, like Google, is more likely to come at a bargain price, whereas a provider that grants you exclusive ownership of the data is going to come at a premium. Finally, implementation will make a difference in costs: SaaS (software-as-a-service) solutions, which are typically web based, run on the third-party service's own servers, and are relatively easy to install, tend to be less expensive, whereas "on-premises" solutions are both rare and quite expensive.

There are a small but growing number of companies that provide app-specific analytic tools, typically deployed as SDKs (software development kits) that can be "hooked" into apps. These companies include, but are by no means limited to, Adobe Analytics, which has been noted for its scalability and depth of analysis; Artisan Mobile, an iOS-focused analytics firm that allows customers to conduct experiments with live users in real time; Bango, which focuses on ad-based monetization of apps; Capptain, which allows specific user segments to be identified and targeted with marketing campaigns; Crittercism, which is positioned as a transaction-monitoring service; Distimo, which aggregates data from a variety of platforms and app stores to create a fuller picture of an app's position in the larger marketplace; ForeSee, which has the ability to record customer interactions with apps; and Kontagent, which touts itself as a tool for maintaining customer retention and loyalty.

As mobile devices and the mobile web grow increasingly sophisticated, there is no doubt that mobile analytics tools will also grow in sophistication. Nevertheless, it would seem that there is a wide range of promising toolkits already available to the marketer who is interested in better understanding customer behaviors and increasing customer retention, loyalty, and satisfaction.

Cross-References

▶ Data Aggregation
▶ Location Data
▶ Network Data
▶ Telecommunications
highlighted the predominant role played by African-Americans in major battleground states and divulged openings for the Republican Party in building rapport with the African-American community. In addition, the data signaled to Democrats a message not to assume levels of Black support in 2016 on par with that realized in the 2008 and 2012 elections.

By tailoring its outreach to individuals, the NAACP has been successful in achieving relatively high rates of engagement. The organization segments supporters based on their actions, such as whether they support a particular issue based on past involvement. For instance, many NAACP members view gun violence as a serious problem in today's society. If such a member connects with NAACP's online community via a particular webpage or internet advertisement, s/he will be recognized as one espousing stronger gun control laws. Future outreach will entail tailored messages expressing attributes that resonate on a personal level with the supporter, not unlike that from a friend or colleague.

The NAACP also takes advantage of major events that reflect aspects of the organization's mission statement. Preparation for such moments entails much advance work, as evidenced in the George Zimmerman trial involving the fatal shooting of 17-year-old Trayvon Martin. As the trial was concluding in 2013, the NAACP formed contingency plans in advance of the court's decision. Website landing pages and prewritten emails were set in place, adapted for whatever result might come. Once the verdict was read, the NAACP sent out emails within 5 min that detailed specific actions for supporters to take. This resulted in over a million petition signatures demanding action on the part of the US Justice Department, which it eventually took.

Controversy

minorities in addition to the general privacy protections commonly granted. Such controversy surrounding civil rights and big data may not be self-evident; however, big data often involves the targeting and segmenting of one type of individual from another. This serves as a threat to basic civil rights – which are protected by law – in ways that were inconceivable in recent decades. For instance, the NAACP has expressed alarm regarding the collection of information by credit reporting agencies. Such collections can result in the making of demographic profiles and stereotypical categories, leading to the marketing of predatory financial instruments to minority groups.

The US government's collection of massive phone records for purposes of intelligence has also drawn harsh criticism from the NAACP as well as other civil rights organizations. They have vented warnings regarding such big data by highlighting how abuses can uniquely affect disadvantaged minorities. The NAACP supports principles aimed at curtailing the pervasive use of data in areas such as law enforcement and employment. Increasing collections of data are viewed by the NAACP as a threat, since such big data could allow for unjust targeting of, and discrimination against, African-Americans. Thus, the NAACP strongly advocates measures such as a stop to "high-tech profiling," greater pressure on private industry for more open and transparent data, and greater protections for individuals from inaccurate data.

Cross-References

▶ Demographic Data
▶ Discrimination
▶ Facebook
▶ Pattern Recognition
▶ Targeting
National Oceanic and Atmospheric Administration

Steven J. Campbell
University of South Carolina Lancaster, Lancaster, SC, USA

The National Oceanic and Atmospheric Administration (NOAA) is an agency housed within the US Commerce Department that monitors the status and conditions of the oceans and the atmosphere. NOAA oversees a diverse array of satellites, buoys, ships, aircraft, tide gauges, and supercomputers in order to closely track environmental changes and conditions. This network yields valuable and critical data that is crucial for alerting the public to potential harm and protecting the environment nationwide. The vast sums of data collected daily have served as a challenge to NOAA in storing the information as well as making it readily accessible and meaningful to the public and interested organizations. In the future, as demand grows for ever-greater amounts and types of climate data, NOAA must be resourceful in meeting the demands of public officials and other interested parties.

First proposed by President Richard Nixon, who wanted a new department in order to better protect citizens and their property from natural dangers, NOAA was founded in October 1970. Its mission is to comprehend and foresee variations in the environment, from the conditions of the oceans to the state of the sun, and to better safeguard and preserve seashores and marine life. NOAA provides alerts to dangerous weather, maps the oceans and atmosphere, and directs the responsible handling and safeguarding of the seas and coastal assets. One key way NOAA pursues its mission is by conducting research in order to further awareness and better management of environmental resources. With a workforce of over 12,000, NOAA consists of six major line offices, including the National Weather Service (NWS), in addition to over a dozen staff offices.

NOAA's collection and dissemination of vast sums of data on the climate and environment contribute to a multibillion-dollar weather enterprise in the private sector. The agency has sought ways to release extensive new troves of this data, an effort that could be of great service to industry and those engaged in research. NOAA announced a call in early 2014 for ideas from the private sector to assist the agency's efforts in freeing up a large amount of the 20 terabytes of data that it collects on a daily basis pertaining to the environment and climate change. In exchange, researchers stand to gain critical access to important information about the planet, and private companies can receive help and assistance in advancing new climate tools and assessments.

This request by NOAA shows that it is planning to place large amounts of its data into the cloud, benefitting both the private and public sectors in a number of ways. For instance, climate data collected by NOAA is currently employed for forecasting the weather over a week in advance.

# Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_141-1
In addition, marine navigation and offshore oil and gas drilling operations are very interested in related data. NOAA has pursued unleashing ever-greater amounts of its ocean and atmospheric data by partnering with groups outside government. This is seen as paramount to NOAA's data management, where tens of petabytes of information are recorded in various ways, engendering over 15 million results daily – from weather forecasts for US cities to coastal tide monitoring – which totals twice the amount of all the printed collections of the US Library of Congress.

Maneuvering through NOAA's mountain of weather and climate data has proved to be a great challenge over the years. To help address this issue, NOAA made available, in late 2013, an instrument that helped further open up the data to the public. With a few clicks of a mouse, individuals can create interactive maps illustrating natural and manmade changes in the environment worldwide. For the most part, the data is free to the public, but much of the information has not always been organized in a user-friendly format. NOAA's objective was to bypass that issue and allow public exploration of environmental conditions from hurricane occurrences to coastal tides to cloud formations. The new instrument, named NOAA View, allows ready access to many of NOAA's databases, including simulations of future climate models. These datasets grant users the ability to browse various maps and information by subject and time frame. Behind the scenes, numerous computer programs manipulate datasets into maps that can demonstrate environmental attributes and climate change over time. NOAA View's origins were rooted in data visualization instruments present on the web, and it is operational on the tablets and smartphones that account for 44% of all hours spent online by the US public.

Advances to NOAA's National Weather Service supercomputers have allowed for much faster calculations of complex computer models, resulting in more accurate weather forecasts. The ability of these enhanced supercomputers to analyze mounds of scientific data proves vital in helping public officials, communities, and industrial groups to better comprehend and prepare for perils linked with turbulent weather and climatic occurrences. Located in Virginia, the supercomputers operate at 213 teraflops (TF), up from the 90 TF of the computers that came before them. This has helped to produce an advanced Hurricane Weather Research and Forecasting (HWRF) model that the National Weather Service can more effectively employ. By allowing more effective monitoring of violent storms and more accurate predictions regarding the time, place, and intensity of their impact, the HWRF model can result in saved lives.

NOAA's efforts to build a Weather-Ready Nation have evolved from a foundation of supercomputer advancements that have permitted more accurate storm-tracking algorithms for weather prediction. First launched in 2011, this initiative on the part of NOAA has resulted in advanced services, particularly in ways that data and information can be made available to the public, government agencies, and private industry.

Cross-References

▶ Climate Change, Hurricanes/Typhoons
▶ Cloud or Cloud Computing
▶ Data Storage
▶ Environment
▶ Predictive Analytics

Further Readings

Freedman, A. (2014, February 24). U.S. readies big-data dump on climate and weather. http://mashable.com/2014/02/24/NOAA-data-cloud/. Accessed September 2014.
Kahn, B. (2013). NOAA's new cool tool puts climate on view for all. http://www.climatecentral.org/news/noaas-new-cool-tool-puts-climate-on-view-for-all-16703. Accessed September 2014.
National Oceanic and Atmospheric Administration (NOAA). www.noaa.gov. Accessed September 2014.
National Organization for Women

Deborah Elizabeth Cohen
Smithsonian Center for Learning and Digital Access, Washington, DC, USA

The National Organization for Women (NOW) is an American feminist organization that is the grassroots arm of the women's movement and the largest organization of feminist activists in the United States. Since its founding in 1966, NOW has engaged in activity to bring about equality for all women. NOW has been participating in recent dialogues to identify how common big data working methods lead to discriminatory practices against protected classes, including women. This entry discusses NOW's mission, issues related to big data, and the activities NOW has been involved with to end discriminatory practices resulting from the usage of big data.

As written in its original statement of purpose, the purpose of NOW is to take action to bring women into full participation in the mainstream of American society, exercising privileges and responsibilities in completely equal partnership with men. NOW strives to make change through a number of activities, including lobbying, rallies, marches, and conferences. NOW's six core issues are economic justice, promoting diversity and ending racism, lesbian rights, ending violence against women, constitutional equality, and access to abortion and reproductive health.

NOW's current president Terry O'Neill has stated that big data practices can render obsolete the USA's landmark civil rights and anti-discrimination laws, with special challenges for women, the poor, people of color, trans-people, and the LGBT community. While the technologies of automated decision-making are hidden and largely not understood by average people, they are being conducted with an increasing level of pervasiveness and used in contexts that affect individuals' access to health, education, employment, credit, and products. Problems with big data practices include the following:

• Big data technology is increasingly being used to assign people to ideologically or culturally segregated clusters, profiling them and in doing so leaving room for discrimination.
• Through the practice of data fusion, big data tools can reveal intimate personal details, eroding personal privacy.
• As people are often unaware of this "scoring" activity, it can be hard for individuals to break out of being mislabeled.
• Employment decisions made through data mining have the potential to be discriminatory.
• Metadata collection renders legal protection of civil rights and liberties less enforceable, undoing civil rights law.

Comprehensive US civil rights legislation in the 1960s and 1970s resulted from social actions

Eubanks, V. (2014). How big data could undo our civil-rights laws. The American Prospect. www.prospect.org/article/how-big-data-could-undo-our-civil-rights-laws. Accessed 7 Sep 2014.
Gangadharan, S. P. (2014). The dangers of high-tech profiling, using big data. The New York Times. www.nytimes.com/roomfordebate/204/08/06/Is-big-data-spreading-inequality/the-dangers-of-high-tech-profiling-using-big-data. Accessed 5 Sep 2014.
NOW website. (2014). Who we are. National Organization for Women. http://now.org/about/who-we-are/. Accessed 2 Sep 2014.
The Leadership Conference on Civil and Human Rights. (2014). Civil rights principles for the era of big data. www.civilrights.org/press/2014/civil-rights-principles-big-data.html. Accessed 7 Sep 2014.
In 2007 Netflix introduced streaming content as part of its "Watch Instantly" initiative. When Netflix first introduced streaming video to its website, subscribers were allowed 1 h of access for every $1 spent on their monthly subscription. This restriction was later removed due to emerging competition from Hulu, Apple TV, Amazon Prime, and other on-demand services. There are substantially fewer titles available through Netflix's streaming service than in its disc library. Despite this limitation, Netflix has become the most widely supported streaming service in the world by partnering with Sony, Nintendo, and Microsoft to allow access through Blu-ray DVD players, as well as the Wii, Xbox, and PlayStation gaming consoles. In subsequent years, Netflix has increasingly turned attention toward its streaming services. In 2008 the company added 2500 new "Watch Instantly" titles through a partnership with Starz Entertainment. In 2010 Netflix inked deals with Paramount Pictures, Metro-Goldwyn-Mayer, and Lions Gate Entertainment; in 2012 it inked a deal with DreamWorks Animation.

Netflix has also bolstered its online library by developing its own programming. In 2011 Netflix announced plans to acquire and produce original content for its streaming service. That same year it outbid HBO, AMC, and Showtime to acquire the production rights for House of Cards, a political drama based on the BBC miniseries of the same name. House of Cards was released on Netflix in its entirety in early 2013. Additional programming released during 2013 included Lilyhammer, Hemlock Grove, Orange is the New Black, and the fourth season of Arrested Development – a series that originally aired on Fox between 2003 and 2006. Netflix later received the first Emmy Award nomination for an exclusively online television series. House of Cards, Hemlock Grove, and Arrested Development received a total of 14 nominations at the 2013 Primetime Emmy Awards; House of Cards received an additional four nominations at the 2014 Golden Globe Awards. In the end, House of Cards won three Emmy Awards for "Outstanding Casting for a Drama Series," "Outstanding Directing for a Drama Series," and "Outstanding Cinematography for a Single-Camera Series." It won one Golden Globe for "Best Actress in a Television Series Drama."

Through its combination of DVD rentals, streaming services, and original programming, Netflix has grown exponentially since 1997. In 2000, the company had approximately 300,000 subscribers. By 2005 that number grew to nearly 4 million users, and by 2010 it grew to 20 million. During this time, Netflix's initial public offering (IPO) of $15 per share soared to nearly $500, with a reported annual revenue of more than $6.78 billion in 2015. Today, Netflix is the largest source of Internet traffic in all of North America. Its subscribers stream more than 1 billion hours of media content each month, approximating one-third of total downstream web traffic. Such success has resulted in several competitors for online streaming and DVD rentals. Wal-Mart began its own online rental service in 2002 before acquiring the Internet delivery network, Vudu, in 2010. Amazon Prime, Redbox Instant, Blockbuster @ Home, and even "adult video" services like WantedList and SugarDVD have also entered the video streaming market. Competition from Blockbuster sparked a price war in 2004, yet Netflix remains the industry leader in online movie rentals and streaming.

Netflix owes much of its success to the innovative use of Big Data. Because it is an Internet-based company, Netflix has access to an unprecedented amount of viewer behavior. Broadcast networks have traditionally relied on approximated ratings and focus group feedback to make decisions about their content and airtime. In contrast, Netflix can aggregate specified data about customers' actual viewing habits in real time, allowing it to understand subscriber trends and tendencies at a much more sophisticated level. The type of information Netflix gathers is not limited to what viewers watch and the ratings they ascribe. Netflix also tracks the specific dates and times in which viewers watch particular programming, as well as their geographic locations, search histories, and scrolling patterns; when they use pause, rewind, or fast-forward; the types of streaming devices employed; and so on.

The information Netflix collects allows it to deliver unrivaled personalization to each
individual customer. This customization not only results in better recommendations but also helps to inform what content the company should invest in. Once content has been acquired or developed, Netflix's algorithms also help to optimize its marketing and to increase renewal rates on original programming. As an example, Netflix created ten distinct trailers to promote its original series House of Cards. Each trailer was designed for a different audience and seen by various customers based on those customers' previous viewing behaviors. Meanwhile, the renewal rate for original programming on traditional broadcast television is approximately 35%; the current renewal rate for original programming on Netflix is nearly 70%.

As successful as Netflix's use of Big Data has been, the company strives to keep pace with changes in viewer habits, as well as changes in its own product. When the majority of subscribers used Netflix's DVD-by-mail service, for instance, those customers consciously added new titles to their queue. Streaming services demand a more instantaneous and intuitive process of generating future recommendations. In response to developments such as this, Netflix initiated the "Netflix Prize" in 2006: a $1 million payout to the first person or group of persons to formulate a superior algorithm for predicting viewer preferences. Over the next 3 years, more than 40,000 teams from 183 countries were given access to over 100 million user ratings. BellKor's Pragmatic Chaos was able to improve upon Netflix's existing algorithm by approximately 10% and was announced as the award winner in 2009.

Conclusion

In summation, Netflix is presently the world's largest "Internet television network." Key turning points in the company's development have included a flat-rate subscription service, streaming content, and original programming. Much of the company's success has also been due to its innovative implementation of Big Data. An unprecedented level of information about customers' viewing habits has allowed Netflix to make informed decisions about programming development, promotion, and delivery. As a result, Netflix currently streams more than 1 billion hours of content per month to over 80 million subscribers in 190 countries and counting.

Cross-References

▶ Algorithm
▶ Amazon
▶ Apple
▶ Communications
▶ Consumer Action
▶ Entertainment
▶ Facebook
▶ Internet
▶ Internet Tracking
▶ Microsoft
▶ Social Media
▶ Streaming Data
▶ Streaming Data Analytics
▶ Video

Further Readings

Keating, G. (2013). Netflixed: The epic battle for America's eyeballs. London: Portfolio Trade.
McCord, P. (2014). How Netflix reinvented HR. Harvard Business Review. http://static1.squarespace.com/static/5666931569492e8e1cdb5afa/t/56749ea457eb8de4eb2f2a8b/1450483364426/How+Netflix+Reinvented+HR.pdf. Accessed 5 Jan 2016.
McDonald, K., & Smith-Rowsey, D. (2016). The Netflix effect: Technology and entertainment in the 21st century. London: Bloomsbury Academic.
Simon, P. Big data lessons from Netflix. Wired. https://www.wired.com/insights/2014/03/big-data-lessons-netflix/.
Wingfield, N., & Stelter, B. (2011, October 24). How Netflix lost 800,000 members, and good will. The New York Times. http://faculty.ses.wsu.edu/rayb/econ301/Articles/Netflix%20Lost%20800,000%20Members%20.pdf. Accessed 5 Jan 2016.
methods based on eigenvector computation. Phillip Bonacich presented eigenvector centrality, which led to important developments of metrics for web analytics like Google's PageRank algorithm or the HITS algorithm by Jon Kleinberg, which is incorporated into several search engines to rank search results based on a website's structural importance on the Internet.

The second big pile of research questions related to networks is about identifying groups. Groups can refer to a broad array of definitions, e.g., nodes sharing certain socioeconomic attributes, membership affiliations, or geographic proximity. When analyzing networks, we are often interested in structurally identifiable groups, i.e., sets of nodes of a network that are more densely connected among themselves and more sparsely connected to all other nodes. The most obvious group of nodes in a network would be a clique – a set of nodes where each node is connected to all other nodes. Other definitions of groups are more relaxed. A k-core is a set of nodes for which every node is connected to at least k other nodes in the set. It turns out that k-cores are more realistic for real-world data than cliques and much faster to calculate. For any form of group identification in networks, we are often interested in evaluating the "goodness" of the identified groups. The most common approach to assess the quality of grouping algorithms is to calculate the modularity index developed by Michelle Girvan and Mark Newman.

Algorithmic Challenges

The most widely used algorithms in network analytics were developed in the context of small groups of (less than 100) humans. When we study big networks with millions of nodes, several major challenges emerge. To begin with, most network algorithms run in Θ(n²) time or slower. This means that if we double the number of nodes, the calculation time is quadrupled. For instance, let us assume we have a network with 1,000 nodes and a second network with one million nodes (thousandfold). If a certain centrality calculation with quadratic algorithmic complexity takes 1 min on the first network, the same calculation would take 1 million minutes (approximately 2 years) on the second network (millionfold). This property of many network metrics makes it nearly impossible to apply them to big data networks within reasonable time. Consequently, optimization and approximation algorithms of traditional metrics are developed and used to speed up analysis for big data networks.

A straightforward approach for algorithmic optimization of network algorithms for big data is parallelization. The abovementioned algorithms closeness and betweenness centralities are based on all-pairs shortest path calculation. In other words, the algorithm starts at a node, follows its links, and visits all other nodes in concentric circles. The calculation for one node is independent from the calculation for all other nodes; thus, different processors or different computers can jointly calculate a metric with very little coordination overhead.

Approximation algorithms try to estimate a centrality metric based on a small part of the actual calculations. The calculations of the all-pairs shortest path calculation can be restricted in two ways. First, we can limit the centrality calculation to the k-step neighborhood of nodes, i.e., instead of visiting all other nodes in concentric circles, we stop at a distance k. Second, instead of all nodes, we just select a small proportion of nodes as starting points for the shortest path calculations. Both approaches can speed up calculation time tremendously, as just a small proportion of the calculations are needed to create these results. Surprisingly, these approximated results have very high accuracy. This is because real-world networks are far from random and have specific characteristics. For instance, networks created from social interactions among people often have core-periphery structure and are highly clustered. These characteristics facilitate the accuracy of centrality approximation calculations. In the context of optimizing and approximating traditional network metrics, a major future challenge will be to estimate time/fidelity trade-offs (e.g., develop confidence intervals for network metrics) and to build systems that incorporate the constraints of user and infrastructure into the
people. Applying the same metrics to very big networks raises questions whether the algorithmic assumptions or the interpretations of results are still valid. For instance, the abovementioned metrics closeness and betweenness centralities just incorporate the shortest paths between every pair of nodes, ignoring possible flow of information on non-shortest paths. Even more, these metrics do not take path length into account. In other words, whether a node is on a shortest path of length two or of length eight is treated identically. Most likely this does not reflect real-world assumptions of information flow. All these issues can be addressed by applying different metrics that incorporate all possible paths or a random selection of paths with length k. In general, when accomplishing network analytics, we need to ask which of the existing network algorithms are suitable, and under which assumptions, to be used for very large networks. Moreover, what research questions are appropriate for very large networks? Does being a central actor in a group of high school kids have the same interpretation as being a central user of an online social network with millions of users?

Conclusions

Networks are everywhere in big data. Analyzing these networks can be challenging. Due to the very nature of network data and algorithms, many traditional approaches of handling and analyzing these networks are not scalable. Nonetheless, it is worthwhile coping with these challenges. Researchers from different academic areas have been optimizing existing and developing new metrics and methodologies, as network analytics can provide unique insights into big data.

Cross-References

▶ Algorithmic Complexity
▶ Complex Networks
▶ Data Visualization
▶ Streaming Data

Further Readings

Batagelj, V., Mrvar, A., & de Nooy, W. (2011). Exploratory social network analysis with Pajek (Expanded edition). New York: Cambridge University Press.
Brandes, U., & Pich, C. (2007). Eigensolver methods for progressive multidimensional scaling of large data. Proceedings of the 14th International Symposium on Graph Drawing (GD'06), 42–53.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social Networks, 1(3), 215–239.
Hennig, M., Brandes, U., Pfeffer, J., & Mergel, I. (2012). Studying social networks: A guide to empirical research. Frankfurt: Campus Verlag.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge: Cambridge University Press.
on Americans' buying habits. Their model predicted that an increase of 20% tax on sugar would reduce Americans' total caloric intake by 18% and reduce sugar consumption by over 16%. Based on their findings, they proposed a new policy of implementing a broad-based tax on sugar to improve public health. In another big-data study on human nutrition, two researchers at West Virginia University tried to understand and monitor the nutrition status of a population. They designed intelligent data collection strategies and examined the effects of food availability on obesity occurrence. They concluded that modifying environmental factors (e.g., availability of healthy food) could be the key to obesity prevention.

Big data can be applied to self-tracking, that is, monitoring one's nutrition status. An emerging trend in big data studies is the quantified self (QS), which refers to keeping track of one's nutritional, biological, and physical information, such as calories consumed, glycemic index, and specific ingredients of food intake. By pairing the self-tracking device with a web interface, QS solutions can provide users with nutrient-data aggregation, infographic visualization, and personal recommendations for diet.

Big data can also enable researchers to monitor global food consumption. One pioneering project is the Global Food Monitoring Group conducted by the George Institute for Global Health with participation from 26 countries. With the support of these countries, the Group is able to monitor the nutrition composition of various foods consumed around the world, identify the most effective food reformulation strategies, and explore effective approaches to food production and distribution by food companies in different countries.

Thanks to the development of modern data collection and analytic technologies, the amount of nutritional, dietary, and biochemical data continues to increase at a rapid pace, along with a growing accumulation of nutritional epidemiologic studies during this time. The field of nutritional epidemiology has witnessed a substantial increase in systematic reviews and meta-analyses over the past two decades. There were 523 meta-analyses and systematic reviews within the field of nutritional epidemiology in 2013 versus just 1 in 1985. However, in the era of "big data", there is an urgent need to translate big-data nutrition research to practice, so that doctors and policymakers can utilize this knowledge to improve individual and population health.

Controversy

Despite the exciting progress of big-data application in nutrition research, several challenges are equally noteworthy. First, to conduct big-data nutrition research, researchers often need access to a complete inventory of foods purchased in all retail outlets. This type of data, however, is not readily available, and gathering such information site by site is a time-consuming and complicated process. Second, information provided by nutrition big data may be incomplete or incorrect. For example, when doing self-tracking for nutrition status, many people fail to do consistent daily documentation or suffer from poor recall of food intake. Also, big data analyses may be subject to systematic biases and generate misleading research findings. Lastly, since an increasing amount of personal data is being generated through quantified self-tracking devices, it is important to consider privacy rights in personal data. That individuals' personal nutritional data should be well protected and that data shared and posted publicly should be used appropriately are key ethical issues for nutrition researchers and practitioners. In light of these challenges, technical, methodological, and educational interventions are needed to deal with issues related to big-data accessibility, errors, and abuses.

Cross-References

▶ Biomedical Data
▶ Data Mining
▶ Diagnostics
▶ Health Informatics
Further Readings

Harding, M., & Lovenheim, M. (2017). The effect of prices on nutrition: Comparing the impact of product- and nutrient-specific taxes. Journal of Health Economics, 53.
Insel, P., et al. (2013). Nutrition. Boston: Jones and Bartlett Publishers.
Satija, A., & Hu, F. (2014). Big data and systematic reviews in nutritional epidemiology. Nutrition Reviews, 72(12).
Swan, M. (2013). The quantified self: Fundamental disruption in big data science and biological discovery. Big Data, 1(2).
WVU Today. WVU researchers work to track nutritional habits using 'Big Data'. http://wvutoday.wvu.edu/n/2013/01/11/wvu-researchers-workto-track-nutritional-habits-using-big-data. Accessed Dec 2014.
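A quantified-self pipeline of the kind described in this entry can be sketched minimally: raw food-intake logs are aggregated into daily totals and turned into a simple recommendation. The log entries, field layout, and the 2,000-kcal threshold are invented assumptions for illustration, not values from the studies cited above.

```python
from collections import defaultdict

# Hypothetical self-tracked intake log: (date, food, calories).
log = [
    ("2014-03-01", "oatmeal", 300), ("2014-03-01", "salad", 250),
    ("2014-03-01", "pasta", 700),   ("2014-03-02", "oatmeal", 300),
    ("2014-03-02", "burger", 900),  ("2014-03-02", "fries", 500),
    ("2014-03-02", "soda", 600),
]

def daily_totals(entries):
    """Nutrient-data aggregation: total calories per tracked day."""
    totals = defaultdict(int)
    for date, _food, kcal in entries:
        totals[date] += kcal
    return dict(totals)

def flag_days(totals, target=2000):
    """A minimal 'personal recommendation': days exceeding the target."""
    return [d for d, kcal in sorted(totals.items()) if kcal > target]

totals = daily_totals(log)
print(totals["2014-03-01"])   # → 1250
print(flag_days(totals))      # → ['2014-03-02']
```

A real QS service would add the visualization layer on top of such aggregates and, as the Controversy section notes, would also have to cope with gaps and recall errors in exactly this kind of self-reported log.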
high-traffic sites statically for a predefined period of time. While this method may be the least costly and targeted to a niche audience, it does not allow for rich data collection. Banner advertising is a less sophisticated form of online advertising. Banner advertising could also be used as a hybrid of cost per mille (CPM), or cost per thousand, as another advertising option which will deliver an ad to website users. This option is usually priced in a multiple of 1,000 impressions (or the number of times an ad was shown) and an additional cost for clicks. It also allows businesses to assess how many times an ad was shown. However, this method is limited in its ability to measure if the return on an investment in advertising covered the costs. Moreover, the proliferation of banners on sites and the overall volume of information on sites lead to "banner blindness" among Internet users. In addition, with the rapid increase of mobile phones as Internet connection devices, the average effectiveness of banners became even lower. The use of banner and pop-up ads increased in the late 1990s and early 2000s, but Internet users started to block these ads with pop-up blockers, and the clicks on banner ads dropped to about 0.1%.

The next innovation in online advertising is tied to the growth in sophistication of search engines. The search engines started to allow advertisers to place ads relevant to particular keywords. Tying advertising to relevant search keywords gave rise to pay-per-click (PPC) advertising. PPC provides advertisers with the most robust data to assess if expended costs generated sufficient return. PPC advertising means that advertisers are charged per click on an ad. This advertising method ties exposure to advertising to an action from a potential consumer, thus providing advertisers with data on the sites that are more effective. Google AdWords is an example of pay-per-click advertising, which is linked to the keywords and phrases used in search. AdWords ads are correlated with these keywords and shown only to the Internet users with relevant searches. By using PPC in conjunction with a search engine, like Google, Bing, or Yahoo, advertisers can also obtain insights on the environment or search terms that led a consumer to the ad in the first place.

Online advertising may also include direct newsletter advertising delivered to potential customers who have purchased before. However, the decision to use this way of advertising should be coupled with an ethical way of employing it. Email addresses have become a commodity and can be bought. However, a newsletter sent to users who never bought from a company may backfire and lead to unintended negative consequences. Overall, this low-cost advertising method can be effective in keeping past customers informed about new products and other campaigns run by the company.

Social media is another advertising channel, which is rapidly growing in popularity. Social media networks have created repositories of psychographic data, which include user-reported demographic information, hobbies, travel destinations, lifetime events, and topics of interest. Social media can be used as more traditional advertising channels for PPC ad placements. However, they can also serve as a base for customer engagement. Social media, although they require a commitment and time investment from advertisers, may generate brand loyalty. Social media efforts, therefore, require careful evaluation, as they can be costly both in terms of direct advertising costs and the cost of time spent by company employees on developing and executing social media campaigns and keeping the flow of communication active. Data collected from social media channels can be analyzed on the individual level, which was nearly impossible with earlier online advertising methods. Companies can collect information about specific user communication and engagement behavior, track communication activities of individual users, and analyze comments shared by the social media users. At the same time, aggregate data may allow for general sentiment analysis to assess if overall comments about a brand are positive or negative and to seek out product-related signals shared by users. Social media evaluation, however, is challenged by the absence of a deep understanding of audience engagement metrics and a lack of industry-wide benchmarks and evaluation standards. As a fairly new area of
advertising, social media evaluation of likes, comments, and shares may be interpreted in a number of ways. Social media networks provide a framework for a new type of advertising, community exchange, but they also are channels of online advertising through real-time advertising targeting. It is likely that focused targeting will continue to be the focus of advertisers, as it leads to increases in the effectiveness of advertising efforts. At the same time, tracking of user web behavior throughout the Web creates privacy concerns and policy challenges.

Targeting

Innovations in online advertising introduced targeting techniques that based advertising on the past browsing and purchase behaviors of Internet users. Proliferation of data collection enabled advertisers to target potential clients based on a multitude of web activities, like site browsing, keyword searches, past purchasing across different merchants, etc. These targeting techniques led to the development of data collection systems that track user activity in real time and make decisions to advertise or not advertise right as the user is browsing a particular page. Online advertising lacks rigorous standardization, and several recent targeting typologies have been proposed. Reviewing strategies for online advertising, Gabriela Taylor identifies nine distinct targeting methods, which overlap with or complement the discussion of targeting methods proposed by other authors. In general, targeting refers to the situation when the ads shown to an Internet user are relevant to their interests. The latter are determined by the keywords used in searches, pages visited, or online purchases made.

Contextual targeting ads are delivered to web users based on the content of the sites these users visit. In other words, contextually targeted advertising matches ads to the content of the webpage an Internet user is browsing. Systems managing contextual advertising scan websites for keywords and place ads that match these keywords most closely. For example, a user viewing a website about gardening may see ads for several gardening and housekeeping magazines or home improvement stores.

Geo, or local, targeting is focused on the determination of the geographical location of a website visitor. This information, in turn, is used to deliver ads that are specific to a particular location, country, region or state, city, or metro area. In some cases, targeting can go as deep as an organizational level. The Internet protocol (IP) address, assigned to each device participating in a computer network, is used as the primary data point in this targeting method. The use of this method may prevent the delivery of ads to users where a product or service is not available – for example, a content restriction for Internet television or region-specific advertising that complies with regional regulations.

Demographic targeting, as implied by its name, tailors ads based on website users' demographic information, like gender, age, income and education level, marital status, ethnicity, language preferences, and other data points. Users may supply this information in social networking site registration. The sites, additionally, may also encourage their users to "complete" their profiles after the initial registration to get access to the fullest set of data.

Behavioral targeting looks at users' declared or expressed interests to tailor the content of delivered ads. Web-browsing information, data on the pages visited, the amount of time spent on particular pages, metadata for the links that were clicked, the searches conducted recently, and information about recent purchases is collected and analyzed by advertisement delivery systems to select and display the most relevant ads. In a sense, website publishers can create user profiles based on the collected data and use them to predict future browsing behavior and potential products of interest. This approach, using rich past data, allows advertisers to target their ads more effectively to the page visitors who are more likely to have interest in these products or services. Combined with other strategies, including contextual, geographic, and demographic targeting, this approach may lead to finely tuned and interest-tailored ads. The approach proves effective, as studies showed that although Internet users
4 Online Advertising
prefer to have no ads on the web-pages they visit, their past behaviors are identified; they are seg-
they favor relevant ads over random ones. mented into groups to predict their future pur-
DayPart and time-based targeting is run during chase behavior. The goal of this method is to
specific times of the day or the week, for example, identify the most loyal group of customers, who
10 am to 10 pm local time Monday through generate revenue for the company and engage
Friday. Ads targeted based on this method are with this group in a most effective and
displayed only during these days and times and supportive way.
go off during the off-times. Ads run through
DayPart campaigns may focus on time-limited
offers and create a sense of urgency among audi- Privacy Concerns
ence members. At the same time, such ads may
create an increased sense of monitoring and lack Technology is developing at a speed too rapid for
of privacy among the users exposed to these ads. policy-making to catch up. Whichever advertising
Real-time targeting allows for the ad place- targeting method is used, each is based on an
ment systems to place bids for advertisement extended collections and analysis of personal
placement in real time. Additionally, this advertising method allows advertisers to track every unique site user and to collect real-time data to assess the likelihood that each visitor will make a purchase.

Affinity targeting creates a partnership between a product producer and an interest-based organization to promote the use of a third-party product. This method targets customers who share an interest in a particular topic. These customers are assumed to hold a positive attitude toward a website they visit and, therefore, a positive attitude toward advertising that is relevant to it. This method is akin to niche advertising, and its success rests on the close match between the advertising content and the passions and interests of website users.

Look-alike targeting aims to identify prospective customers who are similar to the advertiser's customer base. Original customer profiles are determined based on the website use and previous behaviors of active customers. These profiles are then matched against a pool of independent Internet users who share common attributes and behaviors and are therefore likely targets for an advertised product. Identifying these look-alike audiences is complicated by the large number of possible input data points, which may or may not be defining for a particular behavior or user group.

Act-alike targeting is an outcome of predictive analytics. Advertisers using this method define profiles of customers based on their information consumption, spending habits, and behavioral data for each user.

Ongoing and potentially pervasive data collection raises important privacy questions and concerns. Omer Tene and Jules Polonetsky identify several privacy risks associated with big data. The first is an incremental adverse effect on privacy from the ongoing accumulation of information: more and more data points are collected about individual Internet users, and once information about a real identity has been linked to a user's virtual identity, anonymity is lost. Furthermore, disassociating a user from a particular service may be insufficient to break a previously established link, as other networks and online resources may have already harvested the missing data points. The second area of privacy risk is automated decision-making. Automated algorithms may lead to discrimination and to threats to self-determination. The targeting and profiling used in online advertising pose potential threats to free access to information and to an open, democratic society. The third area of privacy concern is predictive analysis, which may identify and predict stigmatizing behaviors or characteristics, such as susceptibility to disease or undisclosed sexual orientation. In addition, predictive analysis may foster social stratification by placing users in like-behaving clusters while ignoring outliers and minority groups. Finally, the fourth area of concern is the lack of access to information and the exclusion of smaller organizations and individuals from the benefits of big data. Large organizations are able to collect and use big data to price products
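The look-alike matching described above — scoring unknown users by their similarity to an advertiser's existing customers — can be sketched as a simple vector-similarity comparison. The feature set, the user names, and the numbers below are invented for illustration; production systems use far richer profiles and models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def lookalike_scores(seed_profiles, prospects):
    """Score each prospect by its best similarity to any seed customer."""
    return {name: max(cosine(vec, seed) for seed in seed_profiles)
            for name, vec in prospects.items()}

# Hypothetical feature vectors: [visits/week, pages/visit, purchases/month]
seeds = [[5, 12, 2], [7, 9, 3]]                       # known good customers
prospects = {"user_a": [6, 11, 0], "user_b": [0, 1, 0]}
scores = lookalike_scores(seeds, prospects)
best = max(scores, key=scores.get)                    # most customer-like prospect
```

A real system would also have to choose which of the many available data points actually define the target group — exactly the difficulty the entry notes above.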
Online Advertising 5
used in all the studies summarized below, measures personality using the Big Five Model, which specifies five basic personality traits: (1) extraversion, or an individual's tendency to be outgoing, talkative, and socially active; (2) agreeableness, or an individual's tendency to be compassionate, cooperative, trusting, and focused on maintaining positive social relations; (3) openness to experience, or an individual's tendency to be curious, imaginative, and interested in new experiences and ideas; (4) conscientiousness, or an individual's tendency to be organized, reliable, consistent, and focused on long-term goals and achievement; and (5) neuroticism, or an individual's tendency to experience negative emotions, stress, and mood swings.

One study, conducted by Yoram Bachrach and his colleagues, investigated the relationship between Big Five personality traits and Facebook activity for a sample of 180,000 users. Results show that individuals high in extraversion had more friends, posted more status updates, participated in more groups, and "liked" more pages on Facebook; individuals high in agreeableness appeared in more photographs with other Facebook users but "liked" fewer Facebook pages; individuals high in openness to experience posted more status updates, participated in more groups, and "liked" more Facebook pages; individuals high in conscientiousness posted more photographs but participated in fewer groups and "liked" fewer Facebook pages; and individuals high in neuroticism had fewer friends but participated in more groups and "liked" more Facebook pages. A related study, conducted by Michal Kosinski and his colleagues, replicated these findings on a sample of 350,000 American Facebook users, the largest dataset to date on the relationship between personality and Internet behavior.

Another study examined the relationship between personality traits and word usage in the status updates of over 69,000 English-speaking Facebook users. Results show that personality traits were indeed reflected in natural word use. For instance, extroverted users used words reflecting their sociable nature, such as "party," whereas introverted users used words reflecting their more solitary interests, such as "reading" and "Internet." Similarly, highly conscientious users expressed their achievement orientation through words such as "success," "busy," and "work," whereas users high in openness to experience expressed their artistic and intellectual pursuits through words like "dreams," "universe," and "music."

In sum, this body of work shows that people's identity, operationalized as personality traits, is illustrated in the actions they undertake and the words they use on Facebook. Given social media platforms' controllable nature, which allows users time to ponder their claims and the ability to edit them, researchers argue that these digital traces likely illustrate users' intentional efforts to communicate their identity to their audience, rather than being unintentionally produced.

Identity Censorship

While identity expression is frequent in social media and, as discussed above, is illustrated by behavioral traces, users sometimes suppress identity claims despite their initial impulse to divulge them. This process, labeled "last-minute self-censorship," was investigated by Sauvik Das and Adam Kramer using data from 3.9 million Facebook users over a period of 17 days. Censorship was measured as instances when users entered text in the status update or comment boxes on Facebook but did not post it within the next 10 min. The results show that 71% of the participants censored at least one post or comment during the time frame of the study. On average, participants censored 4.52 posts and 3.20 comments. Notably, 33% of all posts and 13% of all comments written by the sample were censored, indicating that self-censorship is a fairly prevalent phenomenon. Men censored more than women, presumably because they are less comfortable with self-disclosure. This study suggests that Facebook users take advantage of controllable media affordances, such as editability and unlimited composition time, in order to manage their identity claims. These self-regulatory efforts are perhaps a response to the challenging nature of addressing large and diverse audiences, whose
Online Identity 3
interpretation of the poster's identity claims may be difficult to predict.

Identity Detection

Given that users leave digital traces of their personal characteristics on social media platforms, research has been concerned with whether it is possible to infer these characteristics from social media activity. For instance, can we deduce users' gender, sexual orientation, or personality from their explicit statements and patterns of activity? Is their identity implicit in their social media activity, even though they might not disclose it explicitly?

One well-publicized study by Michal Kosinski and his colleagues sought to predict Facebook users' personal characteristics from their "likes" – that is, Facebook pages dedicated to products, sports, music, books, restaurants, and interests – that users can endorse and with which they can associate by clicking the "like" button. The study used a sample of 58,000 volunteers recruited through the myPersonality application. Results show that, based on Facebook "likes," it is possible to predict a user's ethnic identity (African-American vs. Caucasian) with 95% accuracy, gender with 93% accuracy, religion (Christian vs. Muslim) with 82% accuracy, political orientation (Democrat vs. Republican) with 85% accuracy, sexual orientation with 88% accuracy among men and 75% accuracy among women, and relationship status with 65% accuracy. Certain "likes" stood out as having particularly high predictive ability for Facebook users' personal characteristics. For instance, the best predictors of high intelligence were "The Colbert Report," "Science," and, unexpectedly, "curly fries." Conversely, low intelligence was indicated by "Sephora," "I Love Being a Mom," "Harley Davidson," and "Lady Antebellum."

In the area of personality, two studies found that users' extraversion can be most accurately inferred from Facebook profile activity (e.g., group membership, number of friends, number of status updates); neuroticism, conscientiousness, and openness to experience can be reasonably inferred; and agreeableness cannot be inferred at all. In other words, Facebook activity renders extraversion highly visible and agreeableness opaque.

Language can also be used to predict online communicators' identity, as shown by Andrew Schwartz and his colleagues in a study of 15.4 million Facebook status updates, totaling over 700 million words. Language choice, including words, phrases, and topics of conversation, was used to predict users' gender, age, and Big Five personality traits with high accuracy.

In sum, this body of research suggests that it is possible to infer many facets of Facebook users' identity through automated analysis of their online activity, regardless of whether they explicitly choose to divulge this identity. While users typically choose to reveal their gender and ethnicity, they can be more reticent in disclosing their relationship status or sexual orientation and might themselves be unaware of their personality traits or intelligence quotient. This line of research raises important questions about users' privacy and the extent to which this information, once automatically extracted from Facebook activity, should be used by corporations for marketing or product optimization purposes.

Real and Imagined Audience for Identity Claims

The purpose of many online identity claims is to communicate a desired image to an audience. Therefore, the process of identity construction involves understanding the audience and targeting messages to them. Social media platforms such as Facebook and Twitter, where identity claims are posted very frequently, pose a conundrum in this regard, because audiences tend to be unprecedentedly large, sometimes reaching hundreds or thousands of members, and diverse. Indeed, "friends" and "followers" are accrued over time and often belong to different social circles (e.g., high school, college, employment). How do users conceptualize their audiences on social media platforms? Are users' mental models of their audiences accurate?
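The studies in the Identity Detection section above report accuracies, not implementations. As a rough illustration of how a binary characteristic might be predicted from page "likes," here is a minimal logistic-regression sketch; the page names echo examples quoted above, but the users, labels, and learned weights are entirely fabricated, and the published work used more sophisticated methods.

```python
import math

def predict(weights, likes):
    """Probability that a user has the trait, from a sparse set of page likes."""
    z = sum(weights.get(page, 0.0) for page in likes)
    return 1.0 / (1.0 + math.exp(-z))

def train(users, labels, rounds=200, lr=0.5):
    """Fit one weight per liked page with plain stochastic logistic updates."""
    w = {}
    for _ in range(rounds):
        for likes, y in zip(users, labels):
            err = y - predict(w, likes)          # gradient of the log-loss
            for page in likes:
                w[page] = w.get(page, 0.0) + lr * err
    return w

# Invented training data: each user's page likes and a 0/1 trait label
users = [{"Science", "Curly Fries"}, {"Science"},
         {"Sephora"}, {"Harley Davidson"}]
labels = [1, 1, 0, 0]
w = train(users, labels)
p = predict(w, {"Science"})    # well above 0.5 for a "Science" liker
```

The sparse user-by-page structure is the key point: each "like" contributes one learned weight, which is why a handful of highly diagnostic pages can dominate a prediction.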
These questions were addressed by Michael Bernstein and his colleagues in a study focusing specifically on Facebook users. The study used a survey methodology, in which Facebook users indicated their beliefs about how many of their "friends" viewed their Facebook postings, coupled with large-scale log data for 220,000 Facebook users, from which researchers captured the actual number of "friends" who viewed users' postings. Results show that, by and large, Facebook users underestimated their audiences. First, they believed that any specific status update they posted was viewed, on average, by 20 "friends," when in fact it was viewed by 78 "friends." The median estimate of the audience size for any specific post was only 27% of the actual audience size, meaning that participants underestimated the size of their audience by a factor of 4. Second, when asked how many total audience members they had for their profile postings during the past month, Facebook users believed it was 50, when in fact it was 180. The median perceived audience for the Facebook profile in general was only 32% of the actual audience, indicating that users underestimated their cumulative audience by a factor of 3. Slightly less than half of Facebook users indicated they wanted a larger audience for their identity claims than they thought they had, ironically failing to understand that they did in fact have this larger audience. About half of Facebook users indicated that they were satisfied with the audience they thought they had, even though their audience was actually much greater than they perceived it to be. Overall, this study highlights a substantial mismatch between users' beliefs about their audiences and their actual audiences, suggesting that social media environments are translucent, rather than transparent, when it comes to audiences. That is, actual audiences are somewhat opaque to users, who as a result may fail to properly target their identity claims to their audiences.

Family Identity

One critical aspect of personal identity is family ties. To what extent do social media users reveal their family connections to their audience, and how do family members publicly talk to one another on these platforms? Moira Burke and her colleagues addressed these questions in the context of parent-child interactions on Facebook. Results show that 37.1% of English-speaking US Facebook users specified either a parent or a child relationship on the site. About 40% of teenagers specified at least one parent on their profile, and almost half of users age 50 or above specified a child on their profile. The most common family ties were between mothers and daughters (41.4% of all parent-child ties), followed by mothers and sons (26.8%), fathers and daughters (18.9%), and, least of all, fathers and sons (13.1%). However, Facebook communication between parents and children was limited, accounting for only 1–4% of users' public Facebook postings. When communication did happen, it illustrated family identities: parents gave advice to children, expressed affection, and referenced extended family members, particularly grandchildren.

Cultural Identity

Another critical aspect of personal identity is cultural identity. Is online communicators' cultural identity revealed by their communication patterns? Jaram Park and colleagues show that Twitter users create emoticons that reflect an individualistic or collectivistic cultural orientation. Specifically, users from individualistic cultures preferred horizontal, mouth-oriented emoticons, such as :), whereas users from collectivistic cultures preferred vertical, eye-oriented emoticons, such as ^_^. Similarly, a study of self-expression using a sample of four million Facebook users from several English-speaking countries (USA, Canada, UK, Australia) shows that members of these cultures can be differentiated through their use of formal or informal speech, the extent to which they discuss positive personal events, and the extent to which they discuss school. In sum, this research shows that cultural identity is evident in linguistic self-expression on social media platforms.
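The horizontal/vertical emoticon distinction described above lends itself to a very simple behavioral measurement. The sketch below counts the two styles in a user's posts; the emoticon lists are assumed categories for illustration, not the coding scheme used in the Park study.

```python
# Assumed (illustrative) emoticon categories:
MOUTH_ORIENTED = {":)", ":-)", ":(", ":D"}   # horizontal style, reported for individualistic cultures
EYE_ORIENTED = {"^_^", "^^", "T_T", "-_-"}   # vertical style, reported for collectivistic cultures

def emoticon_profile(posts):
    """Count each emoticon style across a user's posts."""
    counts = {"mouth": 0, "eye": 0}
    for post in posts:
        for token in post.split():
            if token in MOUTH_ORIENTED:
                counts["mouth"] += 1
            elif token in EYE_ORIENTED:
                counts["eye"] += 1
    return counts

posts = ["great show tonight :)", "so tired -_- but happy ^_^"]
profile = emoticon_profile(posts)
```

Aggregated over many users per country, such counts are the kind of low-level signal from which a cultural-orientation contrast can be drawn.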
development method by creating the Open Source Initiative in 1998. By 1998, open-source software routed 80% of the e-mail on the Internet. It has continued to flourish to the modern day, with the open-source movement responsible for a large number of the software and information-based products produced today.

C-form Organizational Architecture

The C-form organizational architecture is the primary organizational structure for open-source development projects. A typical C-form has four common organizing principles. First, there are informal peripheral boundaries for developers: contributors can participate as much or as little as they like and can join or leave a project on their own. Second, many contributors receive no financial compensation at all for their work, yet some may have employment relationships with more traditional organizations that encourage their participation in the C-form as part of their regular job duties. Third, C-forms focus on information-based products, of which software is a major subset. Since the product of a typical C-form is information based, it can be replicated with minimal effort and cost. Fourth, typical C-forms operate with a norm of open, transparent communication. The primary intellectual property of an open-source C-form is the software code. This, by definition, is made available for any and all to see, use, and edit.

Prominent Examples of Open-Source Big Data Projects

Apache Cassandra is a distributed database management system originally developed by Avinash Lakshman and Prashant Malik at Facebook as a solution for searching an inbox. It is now developed by the Apache Software Foundation, a distributed community of developers. It is designed to handle large amounts of data distributed across multiple datacenters. It has been recognized by University of Toronto researchers as having leading scalability capabilities.

Apache CouchDB is a web-focused database system originally developed by Damien Katz, a former IBM developer. Like Apache Cassandra, it is now developed by the Apache Software Foundation. It is designed to deal with large amounts of data through multi-master replication across multiple locations.

Apache Hadoop is designed to store and process large-scale datasets using multiple clusters of standardized low-level hardware. This technique allows for parallel processing similar to a supercomputer, but using mass-market, off-the-shelf commodity computing systems. It was originally developed by Doug Cutting and Mike Cafarella; Cutting was employed at Yahoo, and Cafarella was a Masters student at the University of Washington at the time. It is now developed by the Apache Software Foundation. It serves a similar purpose as Storm.

Apache HCatalog is a table and storage management layer for Apache Hadoop. It is focused on assisting grid administrators with managing large volumes of data without knowing exactly where the data is stored. It provides relational views of the data, regardless of the source storage location. It is developed by the Apache Software Foundation.

Apache Lucene is an information retrieval software library that integrates tightly with search engine projects such as ElasticSearch. It provides full-text indexing and searching capabilities. It treats all document formats similarly by extracting their textual components and as such is independent of file format. It is developed by the Apache Software Foundation and released under the Apache Software License.

D3.js is a data visualization package originally created by Mike Bostock, Jeff Heer, and Vadim Ogievetsky, who worked together at Stanford University. It is now licensed under the Berkeley Software Distribution (BSD) open-source license. It is designed to graphically represent large amounts of data and is frequently used to generate rich graphs and for map making.

Drill is a framework to support distributed applications for data-intensive analysis of large-scale datasets in a self-serve manner. It is inspired by Google's BigQuery infrastructure service. The
Open-Source Software 3
stated goal for the project is to scale to 10,000 or more servers and to make low-latency queries of petabytes of data in seconds in a self-service manner. It is currently being incubated by Apache. It is similar to Impala.

ElasticSearch is a search server that provides near real-time full-text search engine capabilities for large volumes of documents using a distributed infrastructure. It is based upon Apache Lucene and is released under the Apache Software License. In 2012 it spawned a venture-funded company, created by the people responsible for ElasticSearch and Apache Lucene, to provide support and professional services around the software.

Impala is an SQL query engine that enables massively parallel processing of search queries on Apache Hadoop. It was announced in 2012 and moved out of beta testing to public availability in 2013. It is targeted at data analysts and scientists who need to conduct analysis on large-scale data without reformatting and transferring the data to a specialized system or proprietary format. It is released under the Apache Software License and has professional support available from the venture-funded Cloudera. It is similar to Drill.

Julia is a high-performance dynamic programming language for technical computing, with a focus on distributed parallel execution with high numerical accuracy using an extensive mathematical function library. It is designed to use a simple syntax familiar to many developers of older programming languages while being updated to be more effective with big data. The aim is to speed development time by simplifying coding for parallel processing support. It was first released in 2012 under the MIT open-source license, after being originally developed starting in 2009 by Alan Edelman (MIT), Jeff Bezanson (MIT), Stefan Karpinski (UCSB), and Viral Shah (UCSB).

Kafka is a distributed, partitioned, replicated message broker built around commit logs. It can be used for messaging, website activity tracking, operational data monitoring, and stream processing. It was originally developed by LinkedIn and released open source in 2011. It was subsequently incubated by the Apache Incubator and as of 2012 is developed by the Apache Software Foundation.

Lumify is a big data analysis and visualization platform originally targeted at investigative work in the national security space. It provides real-time graphical visualizations of large volumes of data and automatically searches for connections between entities. It was originally created by Altamira Technologies Corporation and then released under the Apache License in 2014.

MongoDB is a NoSQL document-oriented database focused on handling large volumes of data. The software was first developed in 2007 by 10gen. In 2009, the company made the software open source and focused on providing professional services for its integration and use. It utilizes a distributed file storage, load balancing, and replication system to allow quick ad hoc queries of large volumes of data. It is released under the GNU Affero General Public License and uses drivers released under the Apache License.

R is a high-performance technical computing programming language focused on statistical analysis and graphical representation of large datasets. It is an implementation of the S programming language created by Bell Labs' John Chambers. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland. It is designed to allow multiple processors to work on large datasets. It is released under the GNU License.

Scribe is a log server designed to aggregate large volumes of server data streamed in real time from a high volume of servers. It is commonly described as a scaling tool. It was originally developed by Facebook and then released in 2008 under the open-source Apache License.

Spark is a data analytics cluster computing framework designed to integrate with Apache Hadoop. It can cache large datasets in memory to interactively analyze the data and then extract a working analysis set for further quick analysis. It was originally developed at the University of California at Berkeley AMPLab and released under the BSD License. Later, in 2013, it was incubated at the Apache Incubator and released under the Apache License.
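The Hadoop and Spark entries above both rest on the same split-map-reduce idea: partition the data, apply the same function to each partition, then combine the partial results. A toy, single-machine sketch of that pattern (pure Python, not the Hadoop or Spark APIs) looks like this:

```python
from collections import Counter
from itertools import chain

# Toy text partitions standing in for file blocks spread across a cluster
partitions = [
    "big data needs parallel processing",
    "parallel processing needs many nodes",
]

def map_phase(chunk):
    """Map step: emit (word, 1) pairs for one data chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# In Hadoop the map calls run on separate machines; here they run in a loop.
mapped = chain.from_iterable(map_phase(p) for p in partitions)
counts = reduce_phase(mapped)
```

Spark's main refinement of this model is keeping intermediate datasets like `mapped` cached in memory so that repeated, interactive queries avoid re-reading the partitions from disk.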
Major contributors to the project include Yahoo and Intel.

Storm is a programming library focused on real-time storage and retrieval of dynamic object information. It allows complex querying across multiple database tables. It handles unbounded streams of data in an instantaneous manner, allowing real-time analytics of big data and continuous computation. The software was originally developed by Canonical Ltd., also known for the Ubuntu Linux operating system, and is released under the GNU Lesser General Public License. It is similar to Apache Hadoop but with a more real-time and less batch-focused nature.

The Future

The majority of open-source software focused on big data applications has primarily targeted web-based big data sources and corporate data analytics. Current developments suggest a shift toward more analysis of real-world data as sensors spread more widely into everyday use by mass-market consumers. As consumers provide more and more data passively through pervasive sensors, the open-source software used to manage and understand big data appears to be shifting toward analyzing a wider variety of big data sources. It appears likely that the near future will bring more open-source software tools to analyze real-world big data such as physical movements, biological data, consumer behavior, health metrics, and voice content.

Cross-References

▶ Apache
▶ Crowdsourcing
▶ Distributed Computing
▶ Global Open Data Initiative
▶ Google Flu
▶ Wikipedia

Further Readings

Bretthauer, D. (2002). Open source software: A history. Information Technology and Libraries, 21(1), 3–11.
Lakhani, K. R., & von Hippel, E. (2003). How open source software works: 'Free' user-to-user assistance. Research Policy, 32(6), 923–943.
Marx, V. (2013). Biology: The big challenges of big data. Nature, 498, 255–260.
McHugh, J. (1998, August). For the love of hacking. Forbes.
O'Mahony, S., & Ferraro, F. (2007). The emergence of governance on an open source project. Academy of Management Journal, 50(5), 1079–1106.
Seidel, M.-D. L., & Stewart, K. (2011). An initial description of the C-form. Research in the Sociology of Organizations, 33, 37–72.
Shah, S. K. (2006). Motivation, governance, and the viability of hybrid forms in open source software development. Management Science, 52(7), 1000–1014.
Participatory Health and Big Data

Muhiuddin Haider, Yessenia Gomez, and Salma Sharaf
School of Public Health, Institute for Applied Environmental Health, University of Maryland, College Park, MD, USA

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data, DOI 10.1007/978-3-319-32001-4_159-1

The personal data landscape has changed drastically with the rise of social networking sites and the Internet. The Internet and social media sites have allowed for the collection of large amounts of personal data. Every keystroke typed, website visited, Facebook post liked, Tweet posted, or video shared becomes part of a user's digital history. A large net is cast, collecting all this personal data into big data sets that may subsequently be analyzed. This type of data has been analyzed for years by marketing firms through algorithms that analyze and predict consumer purchasing behavior. The digital history of individuals paints a clear picture of their influence in the community and their mental, emotional, and financial state, and much about an individual can be learned by tracking his or her data. When big data is fine-tuned, it can benefit people and the community at large. Big data can be used to track epidemics, and its analysis can support patient education, treatment of at-risk individuals, and encouragement of participatory community health. However, with the rise of big data comes concern about the security of health information and privacy.

There are advantages and disadvantages to casting large data nets. Collecting data can help organizations learn about individuals and communities at large. Following online search trends and collecting big data can help researchers understand health problems currently facing the studied communities and can similarly be used to track epidemics. For example, increases in Google searches for the term flu have been correlated with an increase in flu patient visits to emergency rooms. In addition, a 2008 Pew study revealed that 80% of Internet users use the Internet to search for health information. Today, many patients visit doctors after having already searched their symptoms online. Furthermore, more patients are now using the Internet to search health information, seek medical advice, and make important medical decisions. The rise of the Internet has led to more patient engagement and participation in health.

Technology has also encouraged participatory health through an increase in interconnectedness. Internet technology has allowed for constant access to medical specialists and support groups for people suffering from diseases or those searching for health information. The use of technology has allowed individuals to take control of their own health through online searches and constant access to online health records and tailored medical information. In the United States, hospitals are connecting
individuals to their doctors through the use of participatory health, better communication
online applications that allow patients to email between individuals and healthcare providers,
their doctors, check prescriptions, and look at and more tailored care.
visit summaries from anywhere where they have Big data collected from these various sources,
an Internet connection. The increase in patient whether Internet searches, social media sites, or
engagement has been seen to play a major role participatory health through applications and
in promotion of health and improvement in qual- technology, strongly influences our modern health
ity of healthcare. system. The analysis of big data has helped med-
Technology has also helped those at risk of ical providers and researchers understand health
disease seek treatment early or be followed care- problems facing their communities and develop
fully before contracting a disease. Collection of tailored programs to address health concerns, pre-
big data has helped providers see health trends in vent disease, and increase community participa-
their communities, and technology has allowed tory health. Through the use of big data
them to reach more people with targeted health technology, providers are now able to study health
information. A United Nations International Chil- trends in their communities and communicate
dren’s Emergency Fund (UNICEF) project in with their patients without scheduling any medi-
Uganda asked community members to sign up cal visits. However, big data also creates concern
for U-report, a text-based system that allows indi- for the security of health information.
viduals to participate in health discussions There are several disadvantages to the collec-
through weekly polls. This system was tion of big data. One being that not all the data
implemented to connect and increase communi- collected is significant and much of the informa-
cation between the community and the govern- tion collected may be meaningless. Additionally,
ment and health officials. The success of the computers lack the ability to interpret information
program helped UNICEF prevent disease out- the way humans do, so something that may have
breaks in the communities and encouraged multiple interpretations may be misinterpreted by
healthy behaviors. U-report is now used in other a computer. Therefore, data may be flawed if
countries to help mobilize communities to play active roles in their personal health.

Advances in technology have also created wearable technology that is revolutionizing participatory health. Wearable technology is a category of devices that are worn by individuals and used to track data about those individuals, such as health information. Examples of wearable technology are wristbands that collect information about the individual's global positioning system (GPS) location, amount of daily exercise, sleep patterns, and heart rate. Wearable technology enables users to track their health information, and some devices even allow the individual to save their health information and share it with their medical providers. Wearable technology encourages participatory health, and the constant tracking of health information and sharing with medical providers allow for more accurate health data collection and tailored care. The increase in health technology and the collection and analysis of big data has led to an increase in

simply interpreted based on algorithms, and any decisions regarding the health of the communities that were made based on this inaccurate data would also be flawed. Of greater concern is the issue of privacy with regard to big data. Much of the data is collected automatically based on people's online searches and Internet activities, so the question arises as to whether people have the right to choose what data is collected about them. Questions that arise regarding big data and health include: How long is personal health data saved? Will the data collected be used against individuals? How will the Health Insurance Portability and Accountability Act (HIPAA) change with the incorporation of big data in medicine? Will the data collected determine insurance premiums? Privacy concerns need to be addressed before big health data, health applications, and wearable technology become a security issue.

Today, big data can help health providers better understand their target populations and can lead to an increase in participatory health. However,
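The kinds of readings described above (GPS location, exercise, sleep, heart rate) can be thought of as a simple daily summary record that a device saves and shares with a provider. A minimal sketch, with field names and values that are purely illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass
class DailySummary:
    """One day of wearable readings, as described above (fields illustrative)."""
    date: str
    gps_km: float        # distance moved, derived from GPS
    steps: int           # daily exercise proxy
    sleep_hours: float   # sleep pattern
    resting_hr: int      # heart rate

day = DailySummary("2014-06-01", gps_km=4.2, steps=8312,
                   sleep_hours=6.5, resting_hr=62)
# A record like this could be saved over time and shared with a medical provider.
print(asdict(day))
```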
Participatory Health and Big Data 3
medical training as well. In 1805, Dr. David Hosack had suggested recording the specifics of particularly interesting cases, especially those holding the greatest educational value for medical students. The New York Board of Governors agreed and mandated compiling summary reports in casebooks. As Siegler noted, there were very few reports written at first: the first casebook spanned 1810–1834. Later, as physicians in training were required to write case reports in order to be admitted to their respective specialties, the number of documented cases grew. Eventually, reports were required for all patients. The reports, however, were usually written retrospectively and in widely varying narrative styles.

Widespread use of templates in American hospitals helped standardize patient records, but the resulting quantitative data superseded narrative content. By the start of the twentieth century, forms guaranteed documentation of specific tasks like physical exams, histories, orders, and test results. Graphs and tables dominated patient records, and physicians' narrative summaries began disappearing. The freestyle narrative form that had previously comprised the bulk of the patient record allowed physicians to write as much or as little as they wished. Templates left little room for lengthy narratives, no more than a few inches, so summary reports gave way to brief descriptions of pertinent findings. As medical technology advanced, according to Siegler, the medical record became more complicated and cumbersome with the addition of yet more forms for reporting each new type of test (e.g., chemistry, hematology, and pathology tests). While most physicians kept working notes on active patients, these scraps of paper notating observations, daily tasks, and physicians' thoughts seldom made their way into the official patient record. The official record emphasized tests and numbers, as Siegler noted, and this changed medical discourse: interactions and care became more data driven. Care became less about the totality of the patient's experience and the physician's perception of it. Nonetheless, patient records had become a mainstay, and they did help ensure continuity of care. Despite early efforts at a unifying style, however, the content of patient records still varied considerably.

Although standardized forms ensured certain events would be documented, there were no methods to ensure consistency across documentations or between providers. Dr. Larry Weed proposed a framework in 1964 to help standardize the recording of medical care: SOAP notes. SOAP notes are organized around four key areas: subjective (what patients say), objective (what providers observe, including vital signs and lab results), assessment (diagnosis), and plan (prescribed treatments). Other standardized approaches have been developed since then. The most common charting formats today, in addition to SOAP notes, include narrative charting, APIE charting, focus charting, and charting by exception. Narrative charting, much as in the early days of patient recordkeeping, involves written accounts of patients' conditions, treatments, and responses, documented in chronological order. Charts include progress notes and flow sheets, which are multi-column forms for recording dates, times, and observations that are updated every few hours for inpatients and upon each subsequent outpatient visit. They provide an easy-to-read record of change over time; however, their limited space cannot take the place of more complete assessments, which should appear elsewhere in the patient record. APIE charting, similar to SOAP notes, involves clustering patient notes around assessment (both subjective and objective findings), planning, implementation, and evaluation. Focus charting is a more concise method of inpatient recording and is organized by keywords listed in columns. Providers note their actions and patients' responses under each keyword heading. Charting by exception involves documenting only significant changes or events using specially formatted flow sheets. Computerized charting, or electronic health records (EHR), combines several of the above approaches, but proprietary systems vary widely. Most hospitals and private practices are migrating to EHRs, but the transition has been expensive, difficult, and slower than expected. The biggest challenges include interoperability issues impeding data sharing, difficult-to-use
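Weed's four-part SOAP structure lends itself naturally to a simple record type. A minimal sketch, where the field contents and rendering format are illustrative rather than taken from any real charting system:

```python
from dataclasses import dataclass

@dataclass
class SOAPNote:
    """One clinical note in Weed's SOAP format (contents are invented)."""
    subjective: str   # what the patient reports
    objective: str    # provider observations: vitals, lab results
    assessment: str   # diagnosis
    plan: str         # prescribed treatments and follow-up

    def render(self) -> str:
        # Emit the four sections in the conventional S/O/A/P order.
        return "\n".join([
            f"S: {self.subjective}",
            f"O: {self.objective}",
            f"A: {self.assessment}",
            f"P: {self.plan}",
        ])

note = SOAPNote(
    subjective="Patient reports persistent headache for 3 days.",
    objective="BP 148/92, temp 98.6F; neurological exam normal.",
    assessment="Tension headache; monitor blood pressure.",
    plan="Ibuprofen 400 mg as needed; recheck blood pressure in 2 weeks.",
)
print(note.render())
```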
Patient Records 3
EHRs, and perceptions that EHRs interfere with provider-patient relationships.

Today, irrespective of the charting format used, patient records are maintained according to strict guidelines. Several agencies publish recommended guidelines, including the American Nurses Association, the American Medical Association (AMA), the Joint Commission on Accreditation of Healthcare Organizations (JCAHO), and the Centers for Medicare and Medicaid Services (CMS). Each regards the medical record as a communication tool for everyone involved in the patient's current and future care. The primary purpose of the medical record is to identify the patient, justify treatment, document the course of treatment and results, and facilitate continuity of care among providers. Data stored in patient records have other functions; aside from ensuring continuity of care, data can be extracted for evaluating the quality of care administered, released to third-party payers for reimbursement, and analyzed for clinical research and/or epidemiological studies. Each agency's charting guidelines require certain fixed elements in the patient record: the patient's name, address, birthdate, attending physician, diagnosis, next of kin, and insurance provider. The patient record also contains physicians' orders and progress notes, as well as medication lists, X-ray records, laboratory tests, and surgical records. Several agencies require that the patient's full name, birthdate, and a unique patient identification number appear on each page of the record, along with the name of the attending physician, the date of visit or admission, and the treating facility's contact information. Every entry must be legibly signed or initialed and date/time stamped by the provider.

The medical record is a protected legal document, and because it could be used in a malpractice case, charting takes on added significance. Incomplete, confusing, or sloppy patient records could signal poor medical care to a jury, even in the absence of medical incompetence. For that reason, many malpractice insurers require additional documentation above and beyond what professional agencies recommend. For example, providers are urged to: write legibly in permanent ink, avoid using abbreviations, write only objective/quantifiable observations and use quotation marks to set apart patients' statements, note communication between all members of the care team while documenting the corresponding dates and times, document informed consent and patient education, record every step of every procedure and medication administration, and chart instances of patients' noncompliance or lack of cooperation. Providers should avoid writing over, whiting out, or attempting to erase entries, even if made in error – mistakes should be crossed through with a single line, dated, and signed. Altering a patient chart after the fact is illegal in many states, so corrections should be made in a timely fashion and dated/signed. Leaving blank spaces on medical forms should be avoided as well; if space is not needed for documenting patient care, providers are instructed to draw a line through the space or write "N/A." The following should also be documented to ensure both good patient care and malpractice defense: the reason for each visit, chief complaint, symptoms, onset and duration of symptoms, medical and social history, family history, both positive and negative test results, justifications for diagnostic tests, current medications and doses, over-the-counter and/or recreational drug use, drug allergies, any discontinued medications and reactions, medication renewals or dosage changes, treatment recommendations and suggested follow-up or specialty care, a list of other treating physicians, a "rule-out" list of considered but rejected diagnoses, final definitive diagnoses, and canceled or missed appointments.

Patient records contain more data than ever before because of professional guidelines, malpractice-avoidance strategies, and the ease of data entry many EHRs make possible. The result is that providers are experiencing data overload. Many have difficulty wading through mounds of data, in either paper or electronic form, to discern important information from insignificant attestations and results. While EHRs are supposed to make searching for data easier, many providers lack the needed skills and time to search for and review patients' medical records. Researchers have found some physicians rely on their own memories or ask patients about previous visits
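The fixed elements the agencies' guidelines require can be checked mechanically. A minimal sketch of such a completeness check, with field names that are illustrative rather than drawn from any real EHR schema:

```python
# Required fixed elements named in the charting guidelines (illustrative keys).
REQUIRED_FIELDS = {
    "name", "address", "birthdate", "attending_physician",
    "diagnosis", "next_of_kin", "insurance_provider",
}

def missing_elements(record: dict) -> set:
    """Return the required chart elements that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

chart = {
    "name": "Jane Doe",
    "birthdate": "1970-04-12",
    "attending_physician": "Dr. Smith",
    "diagnosis": "hypertension",
}
print(sorted(missing_elements(chart)))
# → ['address', 'insurance_provider', 'next_of_kin']
```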
instead of searching for the information themselves. Other researchers have found providers have trouble quickly processing the amount of quantitative data and graphs in most medical records. Donia Scott and colleagues, for example, found that providers given narrative summaries of patient records, culled from both quantitative and qualitative data, performed better on questions about patients' conditions than those providers given complete medical records, and did so in half the time. Their findings highlight the importance of narrative summaries that should be included in patients' records. There is a clear need for balancing numbers with words in ensuring optimal patient care.

Another important issue is ownership of and access to patient records. For each healthcare provider and/or medical facility involved in a patient's care, there is a unique patient record owned by that provider. With patients' permission, those records are frequently shared among providers. The Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data, but patients, guardians or conservators of minor or incompetent patients, and legal representatives of deceased patients may request access to records. Providers in some states can withhold records if, in the providers' judgment, releasing information could be detrimental to patients' well-being or cause emotional or mental distress. In addition to HIPAA mandates, many states have strict confidentiality laws restricting the release of HIV test results, drug and alcohol abuse treatment, and inpatient mental health records. While HIPAA guarantees patient access to their medical records, providers can charge copying fees. Withholding records because a patient cannot afford to pay for them is prohibited in many states because it could disrupt the continuity of care. HIPAA also allows patients the right to amend their medical records if they believe mistakes have been made. While providers are encouraged to maintain records in perpetuity, there are no requirements that they do so. Given the costs associated with data storage, both on paper and electronically, many providers will only maintain charts on active patients. Many inactive patients, those who have not seen a given provider in 8 years, will likely have their records destroyed. Additionally, many retiring physicians typically only maintain records for 10 years. Better data management capabilities will inevitably change these practices in years to come.

While patient records have evolved to ensure continuity of patient care, many claim the current form that records have taken prioritizes billing over communication. Many EHRs, for instance, are modeled after accounting systems: providers' checkbox choices of diagnoses and tests are typically categorized and notated in billing codes. Standardized forms are also designed with billing codes in mind. Diagnosis codes are reported in the International Statistical Classification of Diseases and Related Health Problems terminology, commonly referred to as ICD. The World Health Organization maintains this coding system for epidemiological, health management, and research purposes. Billable procedures and treatments administered in the United States are reported in Current Procedural Terminology (CPT) codes. The AMA owns this coding schema, and users must pay a yearly licensing fee for the CPT codes and codebooks, which are updated annually. Critics claim this amounts to a monopoly, especially given that HIPAA, CMS, and most insurance companies require CPT-coded data to satisfy reporting requirements and for reimbursement. CPT-coded data may impact patients' ability to decipher and comprehend their medical records, but the AMA does have a limited search function on its website for non-commercial use allowing patients to look up certain codes.

Patient records are an important tool for ensuring continuity of care, but data-heavy records are cumbersome and often lack narrative summaries, which have been shown to enhance providers' understanding of patients' histories and inform better medical decision-making. Strict guidelines and malpractice concerns produce thorough records that, while ensuring complete documentation, sometimes impede providers' ability to discern important from less significant past findings. Better search and analytical tools are needed for managing patient records and data.
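The billing orientation described above can be pictured as a record that carries diagnoses as ICD codes and procedures as CPT codes, with each billable procedure linked to the diagnoses that justify it. A schematic sketch only; the record layout is invented, and the code values are placeholders rather than verified ICD-10/CPT assignments:

```python
# A billing-oriented encounter record: diagnoses as ICD codes, procedures as
# CPT codes. Layout and code values are illustrative placeholders.
encounter = {
    "patient_id": "P-0001",
    "diagnoses_icd": ["I10"],       # hypothetical diagnosis code
    "procedures_cpt": ["99213"],    # hypothetical office-visit code
}

def claim_lines(enc: dict) -> list:
    """Pair every CPT procedure with the encounter's ICD diagnoses,
    mimicking how claims link procedures to justifying diagnoses."""
    return [(cpt, enc["diagnoses_icd"]) for cpt in enc["procedures_cpt"]]

print(claim_lines(encounter))  # → [('99213', ['I10'])]
```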
analytics, big data can play an important role in advancing patient-centered health by helping shape tailored wellness programs. The provider-driven, disease-focused approach to health care has, heretofore, shaped the kind of health data that exist: data that are largely focused on patients' symptoms and diseases. However, diseases do not develop in isolation. Most conditions develop through a complicated interplay of hereditary, environmental, and lifestyle factors. Expanding health data to include social and behavioral data, elicited via a biopsychosocial/patient-centered approach, can help medical providers build better predictive models. By examining comprehensive rather than disease-focused data, providers can, for example, leverage health data to predict which patients will participate in wellness programs, their level of commitment, and their potential for success. This can be done using data mining techniques like collaborative filtering. In much the same way Amazon makes purchase recommendations for its users, providers may similarly recommend wellness programs by taking into account patients' past behavior and health outcomes. Comprehensive data could also be useful for tailoring different types of programs based on patients' preferences, thereby facilitating increased participation and retention. For example, programs could be customized for patients in ways that go beyond traditional racial, ethnic, or sociodemographic markers and include characteristics such as social media use and shopping habits. By designing analytics aimed at understanding individual patients and not just their diseases, providers may better grasp how to motivate and support the behavioral changes required for improved health.

The International Olympic Committee (IOC), in a consensus meeting on noncommunicable disease prevention, has called for an expansion of the health data collected and a subsequent conversion of that data into information providers and patients may use to achieve better health outcomes. Noncommunicable/chronic diseases, such as diabetes and high blood pressure, are largely preventable. These conditions are related to lifestyle choices: too little exercise, an unhealthy diet, smoking, and alcohol abuse. The IOC recommends capturing data from pedometers and sensors in smartphones, which provide details about patients' physical activity, and combining that with data from interactive smartphone applications (such as calorie counters and food logs) to customize behavior counseling. This approach individualizes not only patient care but also education, prevention, and treatment interventions, and advances patient-centered care with respect to information sharing, participation, and collaboration. The IOC also identifies several other potential sources of health data: social media profiles, electronic medical records, and purchase histories. Collectively, these data can yield a "mass customization" of prevention programs. Given that chronic diseases are responsible for 60 percent of deaths and 80 percent of healthcare spending is dedicated to chronic disease management, customizable programs have the potential to save lives and money.

Despite the potential, big data's impact is largely unrealized in patient-centered care efforts. Although merging social, behavioral, and medical data to improve health outcomes has not happened on a widespread basis, there is still a lot that can be done analyzing medical data alone. There is, however, a clear need for computational/analytical tools that can aid providers in recognizing disease patterns, predicting individual patients' susceptibility, and developing personalized interventions. Nitesh Chawla and Darcy Davis propose aggregating and integrating big data derived from millions of electronic health records to uncover patients' similarities and connections with respect to numerous diseases. This makes a proactive medical model possible, as opposed to the current treatment-based approach. Chawla and Davis suggest that leveraging clinically reported symptoms from a multitude of patients, along with their health histories, prescribed treatments, and wellness strategies, can provide a summary report of possible risk factors, underlying causes, and anticipated concomitant conditions for individual patients. They developed an analytical framework called the Collaborative Assessment and Recommendation Engine (CARE), which applies collaborative filtering using inverse frequency and vector similarity to
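The collaborative-filtering idea can be illustrated with a toy example in the spirit of CARE: weight diagnoses by inverse frequency (rare conditions carry more signal), score other patients by vector (cosine) similarity to the target, and rank diagnoses the target lacks by similarity-weighted evidence. The data, weighting formula, and scoring details below are invented for illustration; this is not the published CARE algorithm:

```python
import math
from collections import Counter

# Toy patient histories: sets of diagnosis labels (values invented).
histories = {
    "p1": {"diabetes", "hypertension", "neuropathy"},
    "p2": {"diabetes", "hypertension", "retinopathy"},
    "p3": {"asthma", "eczema"},
}

# Inverse frequency: diagnoses seen in fewer patients get larger weights.
n = len(histories)
freq = Counter(d for h in histories.values() for d in h)

def weight(d: str) -> float:
    return math.log(1 + n / freq[d])

def similarity(a: set, b: set) -> float:
    """Inverse-frequency-weighted cosine similarity of two diagnosis sets."""
    shared = sum(weight(d) ** 2 for d in a & b)
    norm_a = math.sqrt(sum(weight(d) ** 2 for d in a))
    norm_b = math.sqrt(sum(weight(d) ** 2 for d in b))
    return shared / (norm_a * norm_b)

def predict(target: str, histories: dict) -> list:
    """Rank diagnoses the target lacks by evidence from similar patients."""
    me = histories[target]
    scores = Counter()
    for other, hist in histories.items():
        if other == target:
            continue
        s = similarity(me, hist)
        for d in hist - me:      # only diagnoses the target does not have yet
            scores[d] += s
    return scores.most_common()

# p1 shares two diagnoses with p2 and none with p3, so p2's extra
# diagnosis ranks first as an early-warning candidate.
print(predict("p1", histories))
```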
4 Patient-Centered (Personalized) Health
generate predictions based on data from similar patients. The model was validated using a Medicare database of 13 million patients with two million hospital visits over a 4-year period by comparing diagnosis codes, patient histories, and health outcomes. CARE generates a short list of high-risk diseases and early warning signs that a patient may develop in the future, enabling a collaborative prevention strategy and better health outcomes. Using this framework, providers can improve the quality of care through prevention and early detection and also advance patient-centered health care.

Data security is a factor that merits discussion. Presently, healthcare systems and individual providers exclusively manage patients' health data. Healthcare systems must comply with security mandates set forth by the Health Insurance Portability and Accountability Act of 1996 (HIPAA). HIPAA demands that data servers be firewall and password protected and use encrypted data transmission. Information sharing is an important component of patient-centered care. Some proponents of the patient-centered care model advocate transferring control of health data to patients, who may then use and share it as they see fit. Regardless of who maintains control of health data, storing and electronically transferring that data pose potential security and privacy risks.

Patient-centered care requires collaborative partnerships and wellness strategies that incorporate patients' thoughts, feelings, and preferences. It also requires individualized care, tailored to meet patients' unique needs. Big data facilitates patient-centered/individualized care in several ways. First, it ensures continuity of care and enhanced information sharing through integrated electronic health records. Second, analyzing patterns embedded in big data can help predict disease. APACHE III, for example, is a prognostic program that predicts hospital inpatient mortality. Similar programs help predict the likelihood of heart disease, Alzheimer's, cancer, and digestive disorders. Lastly, big data accrued not only from patients' health records but also from their social media profiles, purchase histories, and smartphone applications has the potential to predict enrollment in wellness programs and improve behavioral modification strategies, thereby improving health outcomes.

Cross-References

▶ Biomedical Data
▶ Electronic Health Records (EHR)
▶ Epidemiology
▶ Health Care Delivery
▶ Health Informatics
▶ HIPAA
▶ Medical/Health Care
▶ Predictive Analytics

Further Readings

Chawla, N. V., & Davis, D. A. (2013). Bringing big data to personalized healthcare: A patient-centered framework. Journal of General Internal Medicine, 28(3), 660–665.
Duffy, T. P. (2011). The Flexner report: 100 years later. Yale Journal of Biology and Medicine, 84(3), 269–276.
Institute of Medicine. (2001). Crossing the quality chasm. Washington, DC: National Academies Press.
Institute for Patient- and Family-Centered Care. FAQs. http://www.ipfcc.org/faq.html. Accessed Oct 2014.
Matheson, G., et al. (2013). Prevention and management of non-communicable disease: The IOC consensus statement, Lausanne 2013. Sports Medicine, 43, 1075–1088.
Picker Institute. Principles of patient-centered care. http://pickerinstitute.org/about/picker principles/. Accessed Oct 2014.
that are specific to understanding health through social media. Social media organizations try to develop meaningful and actionable information from their databases by making data structures more precise in differentiating between phenomena and reporting about them in data records, and by making the system easier and more flexible to use in order to generate more data. Often these demands work at cross-purposes. The development of social media for producing new knowledge through distributed publics involves the engineering of a social environment where sociality and information production are inextricably intertwined. Users need to be steered towards information-productive behaviors as they engage in social interaction of sorts, for information is the worth upon which social media businesses depend. In this respect, it has been argued that PatientsLikeMe is representative of the construction of sociality that takes place in all social media sites, where social interaction unfolds along paths that the technology continuously and dynamically draws based on the data that users are sharing.

As such, many see PatientsLikeMe as incarnating an important dimension of the much-expected revolution of personalized medicine. Improvements in healthcare will not be limited to a capillary application of genetic sequencing and other micro and molecular biology tests that try to open up the workings of individual human physiology at unprecedented scale; instead, the information produced by these tests will often be related to information about the subjective patient experience and expectations that new information technology capabilities are increasingly making possible.

Other Issues

Much of the public debate about the PatientsLikeMe network involves issues of privacy and confidentiality of the patient users. The network is a "walled garden," with patient profiles remaining inaccessible to unregistered users by default. However, once logged in, every user can browse all patient profiles and forum conversations. On more than one occasion, unauthorized intruders (including journalists and academics) were detected screen-scraping data from the website. Despite the organization employing state-of-the-art techniques to protect patient data from unauthorized exporting, any sensitive data shared on a website remains at risk, given the widespread belief – and the public record on other websites and systems – that skilled intruders can always execute similar exploits unnoticed. Patients can have a lot to be concerned about, especially if they have conditions that carry a social stigma or if they shared explicit political or personal views in the virtual comfort of a forum room. In this respect, even though the commercial projects that the organization has undertaken with industry partners involved the exchange of user data that had been pseudonymized before being handed over, the limits of user profile anonymization are well known. In the case of profiles of patients living with rare diseases, who make up a considerable portion of PatientsLikeMe users, it is arguably not too difficult to reidentify individuals, given determined effort. These issues of privacy and confidentiality remain a highly sensitive topic, as society does not have standard and reliable solutions against the various forms that data misuse can take. As both news reports and scholars have often noted, the malleability of digital data makes it impossible to stop the diffusion of sensitive data once such function creep happens.

Moreover, as is often discussed in the social media and big data public debate, data networks increasingly put pressure on the notion of informed consent as an ethically sufficient device for conducting research with user and patient data. The need for moral frameworks of operation that go beyond strict compliance with the law has often been called for, most recently by the report on data in biomedical research by the Nuffield Council on Bioethics. In the report, PatientsLikeMe was held up as a paramount example of new kinds of research networks that rely on extensive patient involvement and social (medical) data – such networks are often dubbed citizen science or participatory research.
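The reidentification risk described above can be illustrated with a toy example: pseudonymization replaces direct identifiers with opaque tokens, yet a released record still carries quasi-identifiers (a rare condition, a coarse location), and an adversary holding auxiliary data can link on that rare combination. All names, conditions, and values below are invented:

```python
import hashlib

def pseudonym(user_id: str, salt: str = "secret-salt") -> str:
    """Replace a direct identifier with an opaque token (illustrative scheme)."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:12]

# A released "pseudonymized" record: no name, but quasi-identifiers remain.
released = {"pid": pseudonym("alice"), "condition": "ALS", "zip3": "021"}

# Auxiliary data an adversary might hold (e.g., scraped public forum posts).
auxiliary = [
    {"name": "Alice", "condition": "ALS", "zip3": "021"},
    {"name": "Bob", "condition": "asthma", "zip3": "973"},
]

# Linking on the rare attribute combination reidentifies the profile
# without ever touching the pseudonym itself.
matches = [p["name"] for p in auxiliary
           if (p["condition"], p["zip3"]) == (released["condition"], released["zip3"])]
print(matches)
```

A unique match defeats the pseudonymization: the rarer the condition, the more likely the quasi-identifier combination is unique.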
4 PatientsLikeMe
On another note, some have argued that PatientsLikeMe, like many other prominent social media organizations, has been exploiting the rhetoric of sharing (one's life with a network and its members) to encourage data-productive behaviors. The business model of the network is built around a traditional, proprietary model of data ownership. The network facilitates the inbound flow of data and makes it less easy for data to flow outbound, controlling their commercial application. In this respect, we must note that current practice in social media management is often characterized by data-sharing evangelism on the part of the managing organization, which at the same time requires a monopoly over the most important data resources that the network generates. In the general public debate, this kind of social media business model has been cited as a factor contributing to the erosion of user privacy.

On a different level, one can notice how the kind of patient-reported data collection and medical research that the network makes possible is a much cheaper and, in many respects, more efficient model than what professional-laden institutions such as the clinical research hospital, with their specific work loci and customs, could put in place. This way of organizing the collection of valuable data operates by including large numbers of end users who are not remunerated. Despite this, running and organizing such an enterprise is expensive and labor-intensive, and as such, critical analysis of this kind of "crowdsourcing" enterprise needs to look beyond the more superficial issue of the absence of a contract to sanction the exchange of a monetary reward for distributed, small task performances. One connected problem in this respect is that since data express their value only when they are re-situated through use, no data have a distinct, intrinsic value upon generation; not all data generated will ever be equal.

Finally, the affluence of medical data that this network makes available can have important consequences for the therapy or lifestyle decisions that a patient might take. To be sure, patients can make up their minds and take critical decisions without appropriate consultation at any time, as they have always done. Nonetheless, the sheer amount of information that networks such as PatientsLikeMe or search engines such as Google make available at a click's distance is without antecedents, and what this implies for healthcare must still be fully understood. Autonomous decisions by patients do not necessarily happen for the worst. As healthcare often falls short of providing appropriate information and counseling, especially about everything that is not strictly therapeutic, patients can eventually devise improved courses of action through consultation of appropriate information-rich web resources. At the same time, risks and harms are not fully appreciated, and there is a pressing need to understand more about the consequences of these networks for individual health and the future of healthcare and health research.

There are other issues besides these more evident and established topics of discussion. As has been pointed out, questions of knowledge translation (from the patient vocabulary to the clinical-professional) remain open, and unclear too is the capacity of these distributed and participative networks to consistently represent and organize the patient populations that they are deemed to serve, as the involvement of patients is limited and relative to specific tasks, most often of a data-productive character. The aforementioned issues are neither exhaustive nor exhausted in this essay. They require in-depth treatment; with this introduction the aim has been to give a few coordinates on how to think about the subject.

Further Readings

Angwin, J. (2014). Dragnet nation: A quest for privacy, security, and freedom in a world of relentless surveillance. New York: Henry Holt and Company.
Arnott-Smith, C., & Wicks, P. (2008). PatientsLikeMe: Consumer health vocabulary as a folksonomy. American Medical Informatics Association Annual Symposium Proceedings, 2008, 682–686.
Kallinikos, J., & Tempini, N. (2014). Patient data as medical facts: Social media practices as a foundation for medical knowledge creation. Information Systems Research, 25, 817–833. doi:10.1287/isre.2014.0544.
Lunshof, J. E., Church, G. M., & Prainsack, B. (2014). Raw personal data: Providing access. Science, 343, 373–374. doi:10.1126/science.1249382.
Prainsack, B. (2013). Let's get real about virtual: Online health is here to stay. Genetical Research, 95, 111–113. doi:10.1017/S001667231300013X.
Richards, M., Anderson, R., Hinde, S., Kaye, J., Lucassen, A., Matthews, P., Parker, M., Shotter, M., Watts, G., Wallace, S., & Wise, J. (2015). The collection, linking and use of data in biomedical research and health care: Ethical issues. London: Nuffield Council on Bioethics.
Tempini, N. (2014). Governing social media: Organising information production and sociality through open, distributed and data-based systems (Doctoral dissertation). London School of Economics and Political Science, London.
Tempini, N. (2015). Governing PatientsLikeMe: Information production and research through an open, distributed and data-based social media network. The Information Society, 31, 193–211.
Wicks, P., Vaughan, T. E., Massagli, M. P., & Heywood, J. (2011). Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nature Biotechnology, 29, 411–414. doi:10.1038/nbt.1837.
Wyatt, S., Harris, A., Adams, S., & Kelly, S. E. (2013). Illness online: Self-reported data and questions of trust in medical and social research. Theory, Culture & Society, 30, 131–150. doi:10.1177/0263276413485900.
Zuboff, S. (2015). Big other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30, 75–89.
during clinical trial stages; listing warnings and known reactions reported during the post-drug production stage; forecasting new drugs needed in the marketplace; providing inventory control and supply chain management information; and managing inventories. Data mining was first used in the pharmaceutical industry as early as the 1960s, alongside the increase in prescription drug patenting. With over 1,000 drug patents a year being introduced at that time, data collection helped pharmaceutical scientists keep up with the patents being proposed. At this time, information was collected and published in an editorial-style bulletin categorized according to areas of interest, in an effort to make relevant issues easier for scientists to navigate. Early in the 1980s, technologies allowed biological sequences to be identified and stored, as in the Human Genome Project, which led to the increased use and publishing of databanks. Occurring alongside the rising popularity of the personal computer, bioinformatics was born, allowing biological sequence data to be used for discovering and studying new prescription drug targets. Ten years later, in the 1990s, microarray technology developed, posing a problem for data collection, as this technology permitted the simultaneous measurement of large numbers of genes and the collection of experimental data on a large scale. When the ability to sequence a genome arrived in the 2000s, the ability to manage large volumes of raw data was still maturing, creating a continued problem for data mining in the pharmaceutical industry. As the challenges presented for data mining in relation to R&D have continued to increase since the 1990s, the opportunities for data mining to increase prescription drug sales have steadily grown.

Data Mining in the Pharmaceutical Industry as a Form of Controversy

Since the early 1990s, health-care information companies have been purchasing the electronic records of prescriptions from pharmacies and other data collection resources in order to strategically link this information with specific physicians.
Prescription tracking refers to the collection of data from prescriptions as they are filled at pharmacies. When a prescription is filled, data miners are able to collect the name of the drug, the date of the prescription, and the name or licensing number of the prescribing physician. Yet it is simple for the prescription drug industry to identify specific physicians through a protocol put in place by the American Medical Association (AMA). The AMA maintains a "Physician Masterfile" that includes all US physicians, whether or not they belong to the AMA, and this file allows the physician licensing numbers collected by data miners to be connected to a name. Information distribution companies (such as IMS Health, Dendrite, Verispan, and Wolters Kluwer) purchase records from pharmacies. What many consumers do not realize is that most pharmacies have these records for sale and are able to sell them legally by not including patient names and providing only a physician's state licensing number and/or name. While pharmacies cannot release a patient's name, they can provide data miners with a patient's age, sex, geographic location, medical conditions, hospitalizations, laboratory tests, insurance copays, and medication use. This has become a significant area of concern for patients, as it not only may increase instances of prescription detailing but may also compromise patients' interests. Data miners do not have access to patient names when collecting prescription data; however, they assign unique numbers to individuals so that future prescriptions for the same patient can be tracked and analyzed together. This means that data miners can determine how long a patient remains on a drug, whether the drug treatment is continued, and which new drugs become prescribed for the patient.
As information concerning a patient's health is highly sensitive, the data mining techniques used by the pharmaceutical industry have perpetuated the notion that personal information carries a substantial economic value. When data mining companies pay pharmacies to extract prescription drug information, the relationship between patients and their physicians and/or pharmacists is being
exploited. The American Medical Association (AMA) established the Physician Data Restriction Program in 2006, giving any physician the opportunity to opt out of data mining initiatives. To date, no such program exists for patients that would give them the opportunity to have their records removed from data collection procedures and subsequent analyses. Three states have enacted statutes that do not permit data mining of prescription records. With its Prescription Confidentiality Act of 2006, New Hampshire became the first state to decide that prescription information could not be sold or used for any advertising, marketing, or promotional purposes. However, if the information is de-identified, meaning that the physician and patient names cannot be accessed, then the data can be aggregated by geographical region or zip code; data mining companies could thus still provide an overall, more generalized report for small geographic areas but could not target specific physicians. Maine and Vermont have statutes that limit the presence of data mining. Physicians in Maine can register with the state to prevent data mining companies from obtaining their prescribing records. Data miners in Vermont must obtain consent from the physician whose records they are analyzing prior to using "prescriber-identifiable" information for marketing or promotional purposes.
The number one customer for information distribution companies is the pharmaceutical industry, which purchases the prescribing data to identify the highest prescribers and also to track the effects of its promotional efforts. Physicians are given a value, a ranking from one to ten, which identifies how often they prescribe drugs. A sales training guide for Merck even states that this value is used to identify which products are currently in favor with the physician, in order to develop a strategy to change those prescriptions into Merck prescriptions. The empirical evidence provided by information distribution companies offers a glimpse into the personality, behaviors, and beliefs of a physician, which is why these numbers are so valued by the drug industry.
By collecting and analyzing this data, pharmaceutical sales representatives are able to better target their marketing activities toward physicians. For example, as a result of data mining in the pharmaceutical industry, pharmaceutical sales representatives could determine which physicians are already prescribing specific drugs in order to reinforce already-existent preferences, or could learn when a physician switches from a drug to a competing drug, so that the representative can attempt to encourage the physician to switch back to the original prescription.

The Future of Data Mining in the Pharmaceutical Industry

As of 2013, only 18% of pharmaceutical companies work directly with social media to promote their prescription drugs, but this number is expected to increase substantially in the next year. As more individuals tweet about their medical concerns, symptoms, the drugs they take, and the respective side effects, pharmaceutical companies have noticed that social media has become an integrated part of personalized medicine for individuals. Pharmaceutical companies are already in the process of hiring data miners to collect and analyze various forms of public social media in an effort to discover unmet needs, recognize new adverse events, and determine what types of drugs consumers would like to see enter the market.
Based on the history of data mining used by pharmaceutical corporations, it is evident that the lucrative nature of prescription drugs serves as a catalyst for data collection and analysis. By giving the prescription drug industry the ability to generalize what should be very private information about patients, the use of data allows prescription drugs to turn more profit than ever, as individual information can be commoditized to benefit the bottom line of a corporation. Although there are evident problems associated with prescription drug data mining, the US Supreme Court has continued to recognize that the pharmaceutical industry has a First Amendment right to advertise and solicit clients for goods and future services. The Court has argued that legal safeguards, such as the Health Insurance Portability and Accountability Act (HIPAA), are put in place to combat the very concerns posed by practices
such as pharmaceutical industry data mining. Additionally, the Court has found that by stripping pharmaceutical records of patient information that could lead to personal identification (e.g., name, address), patients have their confidentiality adequately protected. The law, therefore, leaves it to the discretion of the physician to decide whether to associate with pharmaceutical sales representatives and various data collection procedures.
An ongoing element to address in analyzing the pharmaceutical industry's use of data mining techniques will be the level of transparency maintained with patients while utilizing the information collected. Research shows that the majority of patients in the United States are not only unfamiliar with data mining use by the pharmaceutical industry but are also against any personal information (e.g., prescription usage information and personal diagnoses) being sold and shared with outside entities, namely corporations. As health care continues to change in the United States, it will be important for patients to understand the ways in which their personal information is being shared and used, in an effort to increase national understanding of how privacy laws are connected to the pharmaceutical industry.

Cross-References

▶ Food and Drug Administration (FDA)
▶ Health Care Industry
▶ Patient Records
▶ Privacy

Further Readings

Altan, S., et al. (2010). Statistical considerations in design space development. Pharmaceutical Technology, 34(7), 66–70.
Fugh-Berman, A. (2008). Prescription tracking and public health. Journal of General Internal Medicine, 23(8), 1277–1280.
Greene, J. A. (2007). Pharmaceutical marketing research and the prescribing physician. Annals of Internal Medicine, 146(10), 742–747.
Klocke, J. L. (2008). Comment: Prescription records for sale: Privacy and free speech issues arising from the sale of de-identified medical data. Idaho Law Review, 44(2), 511–536.
Orentlicher, D. (2010). Prescription data mining and the protection of patients' interests. The Journal of Law, Medicine & Ethics, 38(1), 74–84.
Steinbrook, R. (2006). For sale: Physicians' prescribing data. The New England Journal of Medicine, 354(26), 2745–2747.
Wang, J., et al. (2011). Applications of data mining in pharmaceutical industry. The Journal of Management and Engineering Integration, 4(1), 120–128.
White paper: Big data and the needs of the pharmaceutical industry. (2013). Philadelphia: Thomson Reuters.
World Health Organization. (2013). Pharmaceutical industry. Retrieved from http://www.who.int/trade/glossary/story073/en/.
Pollution, Air

Zerrin Savasan
Department of International Relations, Sub-Department of International Law, Faculty of Economics and Administrative Sciences, Selcuk University, Konya, Turkey

The air contains many different substances: gases, aerosols, particulate matter, trace metals, and a variety of other compounds. If these substances are not present at constant concentrations, but change in space and over time to an extent that air quality deteriorates, contaminants or pollutant substances are said to exist in the air. The release of these air pollutants causes harmful effects to both the environment and humans, indeed to all organisms. This is regarded as air pollution.
The air is a common, shared resource of all human beings. Once released, air pollutants can be carried by natural events like winds and rains. Thus some pollutants, e.g., lead or chloroform, often contaminate more than one environmental medium, and many air pollutants can also be water or land pollutants. They can combine with other pollutants, undergo chemical transformations, and eventually be deposited in different locations, so their effects can emerge in locations far from their main sources. In this way they can detrimentally affect all organisms on local or regional scales, and also the climate on a global scale.
Hence, concern for air pollution and its influences on the earth, and efforts to prevent and mitigate it, have increased greatly on a global scale. Today, however, it still stands as one of the primary challenges that should be addressed globally on the basis of international cooperation. It therefore becomes necessary to promote widespread understanding of air pollution, its pollutants, sources, and impacts.

Sources of Air Pollution

Air pollutants can be produced by natural causes (e.g., fires from burning vegetation, forest fires, volcanic eruptions) or anthropogenic (human-caused) ones. When outdoor air pollution – referring to the pollutants found outdoors – is considered, the smokestacks of industrial plants can be given as an example of human-made sources. However, natural processes also produce outdoor air pollution, e.g., volcanic eruptions. The main causes of indoor air pollution, on the other hand, again arise basically from human-driven sources, e.g., the technologies used for cooking, heating, and lighting. Nonetheless, there are also natural indoor air pollutants, like radon, as well as chemical pollutants from building materials and cleaning products.
Among these, human-based causes, specifically after industrialization, have produced a variety of sources of air pollution and have thus contributed more to global air pollution. They can
© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_167-1
emanate from point and nonpoint sources, or from mobile and stationary sources. A point source describes a specific location from which large quantities of pollutants are discharged, e.g., coal-fired power plants. A nonpoint source, on the other hand, is more diffuse, often involving many small sources spread across a wide area, e.g., automobiles. Automobiles are also known as mobile sources, and the combustion of gasoline is responsible for the emissions they release. Industrial activities are likewise known as stationary sources, and the combustion of fossil fuels (coal) is accountable for their emissions.
The pollutants produced from these distinct sources may cause harm directly or indirectly. If they are emitted from the source directly into the atmosphere, and so cause harm directly, they are called primary pollutants, e.g., carbon oxides, carbon monoxide, hydrocarbons, nitrogen oxides, sulfur dioxide, particulate matter, and so on. If they are produced by chemical reactions in the atmosphere, reactions that also involve primary pollutants, they are known as secondary pollutants, e.g., ozone and sulfuric acid.

The Impacts of Air Pollution

Air pollutants result in a wide range of impacts upon both humans and the environment. Their detrimental effects upon humans can be briefly summarized as follows: health problems resulting particularly from toxicological stress, like respiratory diseases such as emphysema and chronic bronchitis, chronic lung diseases, pneumonia, cardiovascular troubles, and cancer, as well as immune system disorders increasing susceptibility to infection. Their adverse effects upon the environment, on the other hand, include the following: acid deposition, climate change resulting from the emission of greenhouse gases, degradation of air resources, deterioration of air quality, noise, photooxidant formation (smog), reduction in the overall productivity of crop plants, stratospheric ozone (O3) depletion, threats to the survival of biological species, etc.
When determining the extent and degree of harm caused by these pollutants, it becomes necessary to know enough about the features of each pollutant. This is because some pollutants that cause environmental or health problems in the air can be essential in the soil or water; e.g., nitrogen is harmful in the air, as it can form ozone there, yet necessary for the soil, where it can act beneficially as fertilizer. Additionally, if toxic substances exist below a certain threshold, they are not necessarily harmful.

New Technologies for Air Pollution: Big Data

Before the industrialization period, the components of pollution were thought to be primarily smoke and soot; with industrialization, they expanded to include a broad range of emissions, including toxic chemicals and biological or radioactive materials. Even today, six conventional pollutants (or criteria air pollutants) are identified by the US Environmental Protection Agency (EPA): carbon monoxide, lead, nitrogen oxides, ozone, particulate matter, and sulfur oxides. Hence, it is to be expected that there will be new sources of air pollution, and so new threats to the earth, in the near future. Indeed, very recently, through the Kigali (Rwanda) Amendment (14 October 2016) to the Montreal Protocol, adopted at the Meeting of the Parties (MOP 28), it was agreed to address hydrofluorocarbons (HFCs) – greenhouse gases with a very high global warming potential, even if not as harmful to the ozone layer as CFCs and HCFCs – under the Protocol, in addition to chlorofluorocarbons (CFCs) and hydrochlorofluorocarbons (HCFCs).
Air pollution first became an international issue with the Trail Smelter Arbitration (1941) between Canada and the United States. Indeed, prior to the decision made by the Tribunal, disputes over air pollution between two countries had never been settled through arbitration. Since this arbitration case – and specifically with increasing efforts since the early 1990s – attempts to measure, to reduce, and to address the rapidly growing impacts of air pollution have been continuing.
Developing new technologies, like Big Data, arises as one of those attempts.
Big Data has no uniform definition (ELI 2014; Keeso 2014; Simon 2013; Sowe and Zettsu 2014). In fact, it is defined and understood in diverse ways by different researchers (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Gogia 2012; Mayer-Schönberger and Cukier 2013; Manyika et al. 2011) and by interested companies like Experian, Forrester, Forte Wares, Gartner, and IBM. It was initially identified by 3Vs – volume (data amount), velocity (data speed), and variety (data types and sources) (Laney 2001). Over time, a fourth V has been added, such as veracity (data accuracy) (IBM) or variability (data being subject to structural variation) (Gogia 2012), along with a fifth V, value (the capability of data to be turned into value), together with veracity (Marr), and a sixth one, vulnerability (data security and privacy) (Experian 2016). It can also be defined by veracity, value, and visualization (the visual representation of data) as an additional 3Vs (Sowe and Zettsu 2014), or by volume, velocity, and variety requiring specific technology and analytical methods for its transformation into value (De Mauro et al. 2016). In general, however, it refers to data sets and processing applications so large and complex that conventional systems are not able to cope with them.
Because air pollution has various aspects that must be measured, as mentioned above, it requires massive data collected at different spatial and temporal levels. Therefore, it is observed in practice that Big Data sets and analytics are increasingly used in the field of air pollution: for monitoring, for predicting its possible consequences, for responding to them in a timely manner, for controlling and reducing its impacts, and for mitigating the pollution itself.
They can be used by different kinds of organizations, such as governmental agencies, private firms, and nongovernmental organizations (NGOs). To illustrate, under the US Environmental Protection Agency (EPA), samples of Big Data use include:

• Air Quality Monitoring (collaborating with NASA on the DISCOVER-AQ initiative, it involves research on Apps and Sensors for Air Pollution (ASAP), National Ambient Air Quality Standards (NAAQS) compliance, and data fusion methods)
• Village Green Project (on improving air quality monitoring and awareness in communities)
• Environmental Quality Index (EQI) (a dataset consisting of an index of environmental quality based on air, water, land, built environment, and sociodemographic space)

There are also examples generated by local governments, like "E-Enterprise for the Environment"; by environmental organizations, like "Personal Air Quality Monitoring"; by citizen science, like "Danger Maps"; and by private firms, like "Aircraft Emissions Reductions" (ELI 2014) or the Green Horizons Project (IBM 2015).
The Environmental Performance Index (EPI) is another platform – using Big Data compiled from a great number of sensors and models – providing country and issue rankings on how each country manages environmental issues, as well as a Data Explorer allowing users to investigate the global data, comparing environmental performance with GDP, population, land area, or other variables.
Despite all this, as the potential benefits and costs of the use of Big Data are still under discussion (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014), various concerns can be raised about the use of Big Data to monitor, measure, and forecast air pollution as well. Therefore, further research is required to identify gaps, challenges, and solutions for "making the right data (not just higher volume) available to the right people (not just higher variety) at the right time (not just higher velocity)" (Forte Wares).

Cross-References

▶ Climate Change
▶ Environment
▶ Pollution, Land
▶ Pollution, Water
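The high-velocity sensor monitoring discussed in this entry can be made concrete with a small sketch. The code below is a hypothetical illustration, not drawn from any of the EPA programs listed above: it keeps a rolling window over hourly PM2.5 readings and flags the hours whose windowed mean exceeds an illustrative threshold of 35 µg/m³.

```python
from collections import deque

def rolling_alerts(readings, window=24, threshold=35.0):
    """Flag hours whose rolling-mean PM2.5 exceeds `threshold` (µg/m³).

    `readings` is an iterable of (hour, pm25) pairs, assumed ordered in time.
    """
    buf = deque(maxlen=window)  # retains only the last `window` readings
    alerts = []
    for hour, pm25 in readings:
        buf.append(pm25)
        mean = sum(buf) / len(buf)
        if mean > threshold:
            alerts.append((hour, round(mean, 1)))
    return alerts

# Example: a 6-hour pollution spike in an otherwise clean 24-hour series.
series = [(h, 10.0) for h in range(24)] + [(24 + h, 80.0) for h in range(6)]
print(rolling_alerts(series, window=4))  # alerts begin at hour 25
```

A real deployment would differ in many ways (sensor calibration, missing data, many stations at different spatial levels), but this window-and-threshold pattern is the core of stream-based air quality monitoring.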
effect of Rachel Carson's very famous book, Silent Spring (1962), which documents the detrimental effects of pesticides on the environment, particularly on birds. Nonetheless, as it is not ordinarily biodegradable, and so is known as a persistent organic pollutant, it has remained in the environment ever since it was first used.
This process threatens both these particular species and all the other species above and below them in the food chain. All of this, combined with the massive extinctions of certain species – primarily because of the disturbance of their habitat – also induces massive reductions in biodiversity.
• Located under the US Department of the Interior (DOI), the National Integrated Land System (NILS) aims to provide the principal data source for land surveys and status by combining Bureau of Land Management (BLM) and Forest Service data into a joint system.
• The New York City Open Accessible Space Information System (OASIS) is another sample case; as an online open space mapping tool, it involves a huge amount of data concerning public lands, parks, community gardens, coastal storm impact areas, and zoning and land use patterns.
• Providing the state Departments of Natural Resources (DNRs) and other agencies with online access to Geographic Information Systems (GIS) data on environmental concerns, while contributing to the effective management of land, water, forest, and wildlife, essentially requires the use of Big Data.
• Alabama's State Water Program is another example, providing geospatial data related to hydrologic, soil, geological, land use, and land cover issues.
• The National Ecological Observatory Network (NEON) is an environmental organization providing the collection of site-based data, from 160 sites throughout the USA, related to the effects of climate change, invasive species, and also land use.
• The Tropical Ecology Assessment and Monitoring Network (TEAM) is a global network facilitating the collection and integration of publicly shared data related to patterns of biodiversity, climate, ecosystems, and also land use.
• The Danger Maps is another sample case for the use of Big Data, as it provides the mapping of government-collected data on over 13,000 polluting facilities in China, allowing users to search by area or type of pollution (water, air, radiation, soil).

The US Environmental Protection Agency (EPA) and the Environmental Performance Index (EPI) are also other platforms using Big Data, compiled from a great number of sensors regarding environmental issues, on land pollution and on other types of pollution. That is, Big Data technologies can be thought of as a way of addressing the consequences of all types of pollution, not just land pollution. This is particularly because all types of pollution are deeply interconnected with one another, so their consequences cannot be restricted to the place where the pollution is first discharged, as mentioned above. Therefore, for all types of pollution, relying on satellite technology, data, and data visualization is essentially required to monitor them regularly, to forecast and reduce their possible impacts, and to mitigate the pollution itself. Nonetheless, there are serious concerns raised about different aspects of the use of Big Data in general (Boyd 2010; Boyd and Crawford 2012; De Mauro et al. 2016; Forte Wares; Keeso 2014; Mayer-Schönberger and Cukier 2013; Simon 2013; Sowe and Zettsu 2014). So, further investigation and analysis are needed to clarify the relevant gaps and challenges regarding the use of Big Data specifically for land pollution.

Cross-References

▶ Climate Change
▶ Earth Sciences
▶ Environment
▶ Natural Sciences
▶ Pollution, Air
▶ Pollution, Water

Further Readings

Alloway, B. J. (2001). Soil pollution and land contamination. In R. M. Harrison (Ed.), Pollution: Causes, effects and control (pp. 352–377). Cambridge: The Royal Society of Chemistry.
Boyd, D. (2010). Privacy and publicity in the context of big data. WWW Conference, Raleigh, 29 Apr 2010. Retrieved from http://www.danah.org/papers/talks/2010/WWW2010.html. Accessed 3 Feb 2017.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. Retrieved from http://www.tandfonline.com/
suspended in the water or depositing beneath the earth's surface, become involved in water bodies and result in water quality degradation. Indeed, there are many different types of water pollutants spilling into waterways and causing water pollution. They can all be divided into various categories: chemical, physical, and pathogenic pollutants, radioactive substances, organic pollutants, inorganic fertilizers, metals, toxic pollutants, biological pollutants, and so on. Conventional, nonconventional, and toxic pollutants are some of these divisions, and they are regulated by the US Clean Water Act. The conventional pollutants are as follows: dissolved oxygen, biochemical oxygen demand (BOD), temperature, pH (acid deposition), sewage, pathogenic agents, animal wastes, bacteria, nutrients, turbidity, sediment, total suspended solids (TSS), fecal coliform, oil, and grease. Nonconventional (or nontoxic) pollutants are those not identified as either conventional or priority, like aluminum, ammonia, chloride, colored effluents, exotic species, instream flow, iron, radioactive materials, and total phenols. Toxic pollutants, metals, dioxin, and lead can be counted as examples of priority pollutants. Each group of these pollutants has its own specific ways of entering water bodies and its own specific risks.

Water Pollution Control

In order to control all these pollutants, it is beneficial to determine from where they are discharged. The following categories can be identified to find out where they originate: point and nonpoint sources of pollution. If the sources causing pollution come from single identifiable points of discharge, they are point sources of pollution, e.g., domestic discharges, ditches, pipes of industrial facilities, and ships discharging toxic substances directly into a water body. Nonpoint sources of pollution are characterized by dispersed, not easily identifiable discharge points, e.g., the runoff of pollutants into a waterway, like agricultural runoff and stormwater runoff. As nonpoint sources are harder to identify, it is nearly impossible to collect, trace, and control them precisely, whereas point sources can be controlled easily.
Water pollution, like other types of pollution, has serious, widespread effects. In fact, adverse alteration of water quality produces costs both to humans (e.g., large-scale diseases and deaths) and to the environment (e.g., biodiversity reduction, species mortality). Its impact differs depending on the type of water body affected (groundwater, lakes, rivers, streams, and wetlands). However, it can be prevented, lessened, and even eliminated in many different ways. Some of these treatment methods, which aim to keep pollutants from damaging the waterways, can rely on techniques such as reducing water use; reducing the usage and amounts of highly water-soluble pesticide and herbicide compounds; controlling rapid water runoff; physically separating pollutants from the water; or on management practices in the fields of urban design and sanitation.
There are also some other attempts to measure, reduce, and address the rapidly growing impacts of water pollution, such as the use of Big Data. Big Data technologies can provide ways of achieving better solutions for the challenges of water pollution. To illustrate, EPA databases can be accessed and maps can be generated from them, including information on environmental activities affecting water, and also air and land, in the context of EnviroMapper. Under the US Department of the Interior (DOI), the National Water Information System (NWIS) monitors surface and underground water quantity, quality, distribution, and movement. Under the National Oceanic and Atmospheric Administration (NOAA), the California Seafloor Mapping Program (CSMP) works to create a comprehensive base map series of coastal/marine geology and habitat for all waters of the USA. Additionally, the Hudson River Environmental Conditions Observing System comprises 15 monitoring stations – located between Albany and the New York Harbor – automatically collecting samples every 15 min that are used to monitor water quality, assess flood risk, and assist in pollution cleanup and fisheries management. The Contamination Warning System Project, conducted by the Philadelphia Water Department, is a combination of new data technologies with existing management systems. It provides a visual representation
Simon, P. (2013). Too big to ignore: The business case for big data. Hoboken: Wiley.
Sowe, S. K., & Zettsu, K. (2014). Curating big data made simple: Perspectives from scientific communities. Big Data, 2(1), 23–33. Mary Ann Liebert, Inc.
The Open University. (2007). T210 – Environmental control and public health. The Open University.
Vaughn, J. (2007). Environmental politics. Thomson Wadsworth.
Vigil, K. M. (2003). Clean water: An introduction to water quality and water pollution control. Oregon State University Press.
Withgott, J., & Brennan, S. (2011). Environment. Pearson.
the more informative the predictive analysis can be.

Unlike past good or bad omens, the results of predictive analytics are probabilistic. This means that predictive analytics informs the probability of a certain data point or the probability of a hypothesis being true. While true prediction can be achieved only by clearly determining the cause and the effect in a set of data, a task that is usually hard to do, most predictive analytics techniques output probabilistic values and error-term analyses.

Predictive Modeling Methods

Predictive modeling statistically shows the underlying relationships in historical, time-series data in order to explain the data and make predictions, forecasts, or classifications about future events. In general, predictive analytics uses a series of statistical and computational techniques in order to forecast future outcomes from past data. Traditionally, the most used method has been linear regression, but lately, with the emergence of the Big Data phenomenon, many other techniques have been developed to support businesses and forecasters, such as machine learning algorithms and probabilistic methods.

Some classes of techniques include:

1. Applications of both linear and nonlinear mathematical programming algorithms, in which one objective is optimized within a set of constraints.
2. Advanced "neural" systems, which learn complex patterns from large datasets to predict the probability that a new individual will exhibit certain behaviors of business interest. Neural networks (also known as deep learning) are biologically inspired machine learning models that are being used to achieve the recent record-breaking performance on speech recognition and visual object recognition.
3. Statistical techniques for analysis and pattern detection within large datasets.

Some techniques in predictive analytics are borrowed from traditional forecasting, such as moving averages, linear regressions, logistic regressions, probit regressions, multinomial regressions, time-series models, or random forest techniques. Other techniques, such as supervised learning, A|B testing, correlation ranking, and the k-nearest neighbor algorithm, are closer to machine learning and newer computational methods.

One of the most used techniques in predictive analytics today, though, is supervised learning or supervised segmentation (Provost and Fawcett 2013). Supervised segmentation includes the following steps:

– Selection of informative attributes – particularly in large datasets, the selection of the variables that are more likely to be informative to the goal of prediction is crucial; otherwise the prediction can render spurious results.
– Information gain and entropy reduction – these two techniques measure the information in the selected attributes.
– Selection is done based on tree induction, which fundamentally represents subsetting the data and searching for these informative attributes.
– The resulting tree-structured model partitions the space of all data into possible segments with different predicted values.

Supervised learning/segmentation has been popular because it is computationally and algorithmically simple.

Visual Predictive Analytics

Data visualization and predictive analytics complement each other nicely, and together they are an even more powerful methodology for the analysis and forecasting of complex datasets that comprise a variety of data types and data formats. Visual predictive analytics is a specific set of predictive analytics techniques that is applied to visual and image data. Just as in the case of
predictive analytics in general, temporal data is required for the visual (spatial) data (Maciejewski et al. 2011). This technique is particularly useful in determining hotspots and areas of conflict with high dynamics. Some of the techniques used in spatiotemporal analysis are kernel density estimation for event distribution and seasonal-trend decomposition by loess smoothing (Maciejewski et al. 2011).

Predictive Analytics Example

A good example of using predictive analytics is in healthcare: the problem of estimating the probability of an upcoming epidemic or the probability of an increase in the incidence of various diseases, from flu to heart disease and cancer. For example, given a dataset that contains data with respect to the past incidence of heart disease in the USA, demographic data (gender, average income, age, etc.), exercise habits, eating habits, traveling habits, and other variables, a predictive model would follow these steps:

In any predictive model or analytics technique, the model can only be as good as the data it is given. In other words, it is impossible to assess a predictive model of heart disease incidence based on travel habits if no data regarding travel is included. Another important point to remember is that the accuracy of the model also depends on the accuracy measure, and using multiple accuracy measures is desired (e.g., mean squared error, p-value, R-squared). In general, any predictive analytic technique will output a dataset of created variables, called predictive values, and the newly created dataset. Therefore a good technique for verification and validation of the methods used is to partition the real dataset into two sets and use one to "train" the model and the second one to validate the model's results. The success of the model ultimately depends on how real events will unfold, and that is one of the reasons why longer time series are better at informing predictive modeling and giving better accuracy for the same set of techniques.
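The supervised segmentation steps and the train/validate partitioning described in this entry can be sketched in miniature. Everything below is hypothetical and invented for illustration (the toy lifestyle attributes, rows, and heart-disease labels are not real data): the sketch selects the attribute with the highest information gain (entropy reduction), builds a one-level segmentation, and checks it against a held-out set.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by segmenting the data on one attribute."""
    segments = {}
    for row, y in zip(rows, labels):
        segments.setdefault(row[attr], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in segments.values())
    return entropy(labels) - remainder

# Hypothetical training rows: lifestyle attributes and a heart-disease label (1 = disease).
train_rows = [
    {"exercise": "low", "diet": "poor"}, {"exercise": "low", "diet": "good"},
    {"exercise": "high", "diet": "poor"}, {"exercise": "high", "diet": "good"},
    {"exercise": "low", "diet": "good"}, {"exercise": "high", "diet": "poor"},
]
train_labels = [1, 1, 0, 0, 1, 0]

# Select the most informative attribute via information gain.
gains = {a: information_gain(train_rows, train_labels, a) for a in ("exercise", "diet")}
best = max(gains, key=gains.get)

# One-level segmentation: predict the majority label within each segment.
by_value = {}
for row, y in zip(train_rows, train_labels):
    by_value.setdefault(row[best], []).append(y)
model = {value: max(set(ys), key=ys.count) for value, ys in by_value.items()}

# Validate on held-out rows, as the entry recommends.
valid_rows = [{"exercise": "low", "diet": "poor"}, {"exercise": "high", "diet": "good"}]
valid_labels = [1, 0]
preds = [model[r[best]] for r in valid_rows]
accuracy = sum(p == y for p, y in zip(preds, valid_labels)) / len(valid_labels)
print(best, round(gains[best], 3), accuracy)  # → exercise 1.0 1.0
```

A full tree-induction algorithm would apply the same attribute selection recursively within each segment; the one-level version above is the smallest case of the tree-structured partitioning described in the steps.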
Further Readings

Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. Sebastopol: O'Reilly Media.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.
Shmueli, G., & Koppius, O. (2010). Predictive analytics in information systems research. Robert H. Smith School Research Paper No. RHS 06-138.
Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley.
Tukey, J. (1977). Exploratory data analysis. New York: Addison-Wesley.
governmental agents. Initially the right was used to limit the rapidly evolving press industry; with time, as individual awareness and recognition of the right increased, the right to privacy primarily introduced limits on the individual information that state or local authorities may obtain and process. As any new idea, the right to privacy initially provoked much skepticism, yet by the mid-twentieth century it became a necessary element of the rising human rights law. In the twenty-first century, it gained increased attention as a side effect of the growing, global information society. International online communications allowed for easy and cheap mass collection of data, creating the greatest threat to privacy so far. What followed was an eager debate on the limits of allowed privacy intrusions and the actions required from states to safeguard the rights of an individual. A satisfactory compromise is not easy to find, as states and communities view privacy differently, based on their history, culture, and mentality. The existing consensus on human rights seems to be the only starting point of a successful search for an effective privacy compromise, much needed in the era of transnational companies operating on Big Data. With the modern notions of "the right to be forgotten" or "data portability" referring to new facets of the right to protect one's privacy, the Big Data phenomenon is one of the deciding factors of this ongoing evolution.

This inconsistency in privacy perceptions results from the varied cultural and historical backgrounds of individual states as well as their differing political and economic situations. In countries recognizing values reflected in universal human rights treaties, including Europe, large parts of the Americas, and some Asian states, the right to privacy covers numerous elements of individual autonomy and is strongly protected by comprehensive legal safeguards. On the other hand, in rapidly developing countries, as well as in ones with an unstable political or economic situation, primarily located in Asia and Africa, the significance of the right to one's private life subsides to the urgent needs of protecting life and personal or public security. As a consequence, the undisputed right to privacy, subject to numerous international treaties and rich international law jurisprudence, remains highly ambiguous, an object of conflicting interpretations by national authorities and their agents. This is one of the key challenges to finding the appropriate legal norms governing Big Data. In the unique Big Data environment, it is not only the traditional jurisdictional challenges, specific to all online interactions, that must be faced but also the tremendously varying perceptions of privacy, all finding their application to the vast and varied Big Data resource.
History
The eventual 1966 compromise in the form of the two fundamental human rights treaties, the International Covenant on Civil and Political Rights (ICCPR) and the International Covenant on Economic, Social and Cultural Rights (ICESCR), allowed for a conciliatory wording on hard law obligations for different categories of human rights, yet left the crucial details to future state practice and international jurisprudence. Among the rights to be put into detail by future state practice, international courts, and organizations was the right to privacy, established as a human right in Article 12 UDHR and Article 17 ICCPR. They both granted every individual freedom from "arbitrary interference" with their "privacy, family, home, or correspondence" as well as from any attacks upon their honor and reputation. While neither document defines "privacy," the UN Human Rights Committee (HRC) has gone into much detail on delimiting its scope for the international community. All 168 ICCPR state parties are obliged per the Covenant to reflect HRC recommendations on the scope and enforcement of the treaty in general and privacy in particular. Over time the HRC produced detailed instruction on the scope of privacy protected by international law, discussing the thin line with state sovereignty, security, and surveillance. According to Article 12 UDHR and Article 17 ICCPR, privacy must be protected against "arbitrary or unlawful" intrusions or attacks through national laws and their enforcement. Those laws are to detail the limits for any justified privacy invasions. Those limits of the individual privacy right are generally described in Article 29 para. 2, which allows for limitations of all human rights determined by law solely for the purpose of securing due recognition and respect for the rights and freedoms of others and of meeting the just requirements of morality, public order, and the general welfare in a democratic society. Although proposals for including a similar restraint in the text of the ICCPR were rejected by the negotiating parties, the right to privacy is not an absolute one. Following HRC guidelines and state practice surrounding the ICCPR, privacy may be restrained according to national laws which meet the general standards present in human rights law. The HRC confirmed this interpretation in its 1988 General Comment No. 16 as well as in recommendations and observations issued thereafter.

Before Big Data became, among its other functions, an effective tool for mass surveillance, the HRC took a clear stand on the question of legally permissible limits of state inspection. It clearly stated that any surveillance, whether electronic or otherwise; interceptions of telephonic, telegraphic, and other forms of communication; wiretapping; and recording of conversations should be prohibited. It confirmed that any limitation upon individual privacy must be assessed on a case-by-case basis and follow a detailed legal guideline, containing the precise circumstances in which privacy may be restricted by actions of local authorities or third parties. The HRC specified that even interference provided for by law should be in accordance with the provisions, aims, and objectives of the Covenant and reasonable in the particular circumstances, where "reasonable" means justified by those particular circumstances. Moreover, as per the HRC interpretation, states must take effective measures to guarantee that information about an individual's life does not reach ones not authorized by law to obtain, store, or process it. Those general guidelines are to be considered the international standard of protecting the human right to privacy and need to be respected regardless of the ease with which Big Data services connect pieces of information available online with the individuals they relate to. Governments must ensure that Big Data is not used in a way that infringes individual privacy, regardless of the economic benefits and technical accessibility of Big Data services.

The provisions of Article 17 ICCPR resulted in similar stipulations in other international treaties. Those include Article 8 of the European Convention on Human Rights (ECHR), binding upon its 48 member states, and Article 11 of the American Convention on Human Rights (ACHR), agreed upon by 23 parties to the treaty. The African Charter on Human and Peoples' Rights (Banjul Charter) does not contain a specific stipulation regarding privacy, yet its provisions of Article 4 on the inviolability of human rights, Article 5 on human dignity, and Article 16 on the right
to health serve as a basis to grant individuals within the jurisdiction of its 53 state parties the protection recognized by European or American states as inherent to the right of privacy. While no general human rights document exists among Australasian states, the general guidelines provided by the HRC and the work of the OECD are often reflected in national laws on privacy, personal rights, and personal data protection.

Privacy and Personal Data

The notion of personal data is closely related to that of privacy, yet their scopes differ. While personal data is a relatively well-defined term, privacy is a broader and more ambiguous notion. As Kuner rightfully notes, the concept of privacy protection is broader than personal data regulation, where the latter provides a more detailed framework for individual claims. The influential Organization for Economic Co-operation and Development (OECD) Forum identified personal data as a component of the individual right to privacy, yet its 34 members differ on the effective methods of privacy protection and the extent to which such protection should be granted. Nevertheless, the nonbinding yet influential 1980 OECD Guidelines on the Protection of Privacy and Transborder Flow of Personal Data (Guidelines), together with their 2013 update, have so far encouraged data protection laws in over 100 countries, justifying the claim that, thanks to its detailed yet unified character and national enforceability, personal data protection is the most common and effective legal instrument safeguarding individual privacy. The Guidelines identify universal privacy protection through eight personal data processing principles. The definition of "personal data" contained in the Guidelines, usually directly adopted by national legislations, covers any information relating to an identified or identifiable individual, referred to as the "data subject." The eight basic principles of privacy and data protection include (1) the collection limitation principle, (2) the data quality principle, (3) the individual participation principle, (4) the purpose specification principle, (5) the use limitation principle, (6) the security safeguards principle, (7) the openness principle, and (8) the accountability principle. They introduce certain obligations upon "data controllers," that is, parties "who, according to domestic law, are competent to decide about the contents and use of personal data regardless of whether or not such data are collected, stored, processed or disseminated by that party or by an agent on their behalf." They oblige data controllers to respect limits made by national laws pertaining to the collection of personal data. As already noted, this is of particular importance to Big Data operators, who must be aware of and abide by the varying national regimes. Personal data must be obtained by "lawful and fair" means and with the knowledge or consent of the data subject, unless otherwise provided by relevant law. Collecting or processing personal data may only be done when it is relevant to the purposes for which it will be used. Data must be accurate, complete, and up to date. The purposes for data collection ought to be specified no later than at the time of data collection. The use of the data must be limited to the purposes so identified. Data controllers, including those operating on Big Data, are not to disclose personal data at their disposal for purposes other than those initially specified and agreed upon by the data subject, unless such use or disclosure is permitted by law. All data processors are to show due diligence in protecting their collected data, by introducing reasonable security safeguards against loss or unauthorized access and against destruction, use, modification, or disclosure of data. This last obligation may prove particularly challenging for Big Data operators, given the multiple locations of data storage and their continuous changeability. Consequently, each data subject enjoys the right to obtain information on whether the data controller holds data relating to him, to have any such data communicated within a reasonable time, to be given reasons if a request for such information is denied, as well as to be able to challenge such denial and any data relating to him. Each data subject further enjoys the right to have their data erased, rectified, completed, or amended, and the data controller is to be held
accountable to national laws for lack of effective measures ensuring all of those personal data rights.

Therewith the OECD principles form a practical standard for privacy protection represented in the human rights catalogue, applicable also to Big Data operators, given that the data at their disposal relates directly or indirectly to an individual. While their effectiveness may come to depend upon jurisdictional issues, the criteria for identification of data subjects and the obligations of data processors are clear.

Privacy as a Personal Right

Privacy is recognized not only by international law treaties and international organizations but also by national laws, from constitutions to civil and criminal law codes and acts. Those regulations hold great practical significance, as they allow for direct remedies against privacy infractions by private parties, rather than those enacted by state authorities. Usually privacy is considered an element of the larger catalogue of personal rights and granted equal protection. It allows individuals whose privacy is under threat to have the threatening activity seized (e.g., infringing information deleted or a press release stopped). It also allows for pecuniary compensation or damages should a privacy infringement already have taken place.

Originating from German-language civil law doctrine, privacy protection may be well described by the theory of concentric spheres. Those include the public, private, and intimate spheres, with different degrees of protection from interference granted to each of them. The strongest protection is granted to intimate information; activities falling within the public sphere are not protected by law and may be freely collected and used. All individual information may be qualified as falling into one of the three spheres, with the activities performed in the public sphere being those performed by an individual as a part of their public or professional duties and obligations and deprived of privacy protection. This sphere would differ per individual, with "public figures" enjoying the least protection. An assessment of the limits of one's privacy when compared with their public function would always be made on a case-by-case basis. Any information that may not be considered public is to be granted privacy protection and may only be collected or processed with permission granted by the one it concerns. The need to obtain consent from the individual the information concerns is also required for the intimate sphere, where the protection is even stronger. Some authors argue that information on one's health, religious beliefs, sexual orientation, or history should only be distributed in pursuit of a legitimate aim, even when permission for its distribution was granted by the one it concerns. With the civil law scheme for privacy protection being relatively simple, its practical application relies on a case-by-case basis and may therefore prove challenging and unpredictable in practice, especially when international court practice is at issue.

Privacy and Big Data

Big Data is a term that directly refers to information about individuals. It may be defined as gathering, compiling, and using large amounts of information enabling marketing or policy decisions. With large amounts of data being collected by international service providers, in particular ones offering telecommunication services such as Internet access, the scope of data they may collect and the use to which they may put it is of crucial concern to all their clients, but also to their competitors and to state authorities interested in participating in this valuable resource. In the light of the analysis presented above, any information falling within the scope of Big Data that is collected and processed while rendering online services may be considered subject to privacy protection when it refers to an identified or identifiable individual, that is, a physical person who may either be directly identified or whose identification is possible. When determining whether a particular category or piece of information constitutes private data, account must be taken of the means likely reasonably to be used by any person
to identify the individual, in particular the costs, time, and labor needed to identify such a person. When private information has been identified, the procedures required for privacy protection described above ought to be applied by entities dealing with such information. In particular, the guidelines described by the HRC in its comments and observations may serve as a guideline for handling personal data falling within the Big Data resource. Initiatives such as the Global Network Initiative, a bottom-up initiative of the biggest online service providers aimed at identifying and applying universal human rights standards for online services, or the UN Protect, Respect and Remedy Framework for business, defining the human rights obligations of private parties, present a useful tool for introducing enhanced privacy safeguards for all Big Data resources. With users' growing awareness of the value of their privacy, company privacy policies prove to be a significant element of the marketing game, inciting Big Data operators to convince ever more users to choose their privacy-oriented services.

Summary

Privacy, recognized as a human right, requires certain precautions to be taken by state authorities and private business alike. Any information that may allow for the identification of an individual ought to be subjected to particular safeguards, allowing for its collection or processing solely based on the consent of the individual in question or on a particular norm of law applicable in a case where the inherent privacy invasion is reasonable and necessary to achieve a justifiable aim. In no case may private information be collected or processed in bulk, with no judicial supervision or without the consent of the individual it refers to. Big Data offers new possibilities for collecting and processing personal data. When designing Big Data services or using the information they provide, all business entities must address the international standards of privacy protection, as identified by international organizations and good business practice.

Cross-References

▶ Data Processing
▶ Data Profiling
▶ Data Quality Management
▶ Data Security
▶ Data Security Management

Further Readings

Kuner, C. (2009). An international legal framework for data protection: Issues and prospects. Computer Law and Security Review, 25(263), 307.
Kuner, C. (2013). Transborder data flows and data privacy law. Oxford: Oxford University Press.
UN Human Rights Committee. (1988). General Comment No. 16: Article 17 (Right to Privacy), The Right to Respect of Privacy, Family, Home and Correspondence, and Protection of Honour and Reputation. 8 Apr 1988. http://www.refworld.org/docid/453883f922.html.
UN Human Rights Council. Report of the Special Rapporteur on the promotion and protection of human rights and fundamental freedoms while countering terrorism, Martin Scheinin. U.N. Doc. A/HRC/13/37.
Warren, S. D., & Brandeis, L. D. (1890). The right to privacy. Harvard Law Review, 4, 193.
Weber, R. H. (2013). Transborder data transfers: Concepts, regulatory approaches and new legislative initiatives. International Data Privacy Law, v. 1/3–4.
researcher must efficiently process and highlight the most important information, stay attentive enough to do this for a long period of time, and, because of limited working memory capacity and a lot of data to be processed, effectively manage the data, such as by chunking information, so that it is easier to filter and store in memory.

The goal of analysis is to lead to decisions or conclusions about data, the scope of the rational field. If all principles from cognitive psychology have been applied correctly (e.g., only the most relevant data are presented and only the most useful information stored in memory), tenets of rational psychology must next be applied to make good decisions about the data. Decision making may be aided by programming the analysis software to present decision options to the researcher. For example, in examining educational outcomes of children who come from low-income families, the researcher's options may be to include children who are or are not part of a state-sponsored program, or who are of a certain race. Statistical software could be designed to present these options to the researcher, which may reveal results or relationships in the data that the researcher may not have otherwise discovered. Option presentation may not be enough, however, as researchers must also be aware of the consequences of their decisions. One possible solution is the implementation of associate systems for big data software. An associate system is automation that attempts to advise the user, in this case to aid decision making. Because these systems are knowledge based, they have situational awareness and are able to recommend courses of action and the reasoning behind those recommendations. Associate systems do not make decisions themselves, but instead work semiautonomously, with the user imposing supervisory control. If the researcher deems recommended options to be unsuitable, then the associate system can present what it judges to be the next best options.

Social Field

The field of social psychology provides good examples of methods of analysis that can be used with big data, especially with big data sets that include groups of individuals and their relationships with one another, the scope of social psychology. The field of social psychology is able to ask questions and collect large amounts of data that can be examined and understood using these big data-type analyses, including, but not limited to, the following types of analyses.

Linguistic analysis offers the ability to process transcripts of communications between individuals, or to groups as in social media applications, such as tweets from a Twitter data set. A linguistic analysis may be applied in a multitude of ways, including analyzing the qualities of a relationship between individuals or how communications to groups may differ based on the group. These analyses can determine qualities of these communications, which may include trust, attribution of personal characteristics, or dependencies, among other considerations.

Sentiment analysis is a type of linguistic analysis that takes communications and produces ratings of the emotional valence individuals direct to the topic. This is of value for social data researchers who must find those with whom alliances may be formed and whom to avoid. A famous example is the strategy shift taken by United States Armed Forces commanders to ally with Iraqi residents. Sentiment analysis indicated which residential leaders would give their cooperation for short-term goals of mutual interest.

The final social psychological big data analysis technique under consideration here is social-network analysis, or SNA. With SNA, special emphasis is placed not on the words spoken, as in linguistic and sentiment analysis, but on the directionality and frequency of communication between individuals. SNA creates a type of network map that uses nodes and ties to connect members of groups or organizations to one another. This visualization tool allows a researcher to see how individuals are connected to one another, with factors like the thickness of a line indicating frequency of communication, or the number of lines coming from a node indicating the number of nodes to which it is connected.
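The SNA bookkeeping just described can be sketched in a few lines. The names and message counts below are invented purely for illustration: ties are directed sender-to-receiver pairs, tie strength is communication frequency, and a node's degree counts the distinct ties attached to it.

```python
from collections import Counter

# Hypothetical communication log: one (sender, receiver) pair per message.
messages = [
    ("ana", "ben"), ("ana", "ben"), ("ben", "ana"),
    ("ana", "carl"), ("carl", "ben"), ("carl", "ben"), ("carl", "ben"),
]

# Tie strength = frequency of communication along each directed tie
# (what the thickness of a line would encode in the network map).
tie_strength = Counter(messages)

# Degree = number of distinct ties attached to each node, in or out
# (what the number of lines at a node would encode).
degree = Counter()
for sender, receiver in set(messages):
    degree[sender] += 1
    degree[receiver] += 1

print(tie_strength[("carl", "ben")])  # strongest tie in this toy network: 3
print(degree["ben"])                  # ben touches 3 distinct ties
```

Dedicated tools (e.g., the NetworkX library) provide the same quantities plus the actual node-and-tie visualization; this sketch only shows what the map is built from.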
Psychological Data as Big Data

Each field of psychology potentially includes big data sets for analysis by a psychological researcher. Traditionally, psychologists have collected data on a smaller scale using controlled methods and manipulations analyzable with traditional statistical analyses. However, with the advent of big data principles and analysis techniques, psychologists can expand the scope of data collection to examine larger data sets that may lead to new and interesting discoveries. The following section discusses each of the aforementioned fields.

In clinical psychology, big data may be used to diagnose an individual. In understanding an individual or attempting to make a diagnosis, the person's writings and interview transcripts may be analyzed in order to provide insight into his or her state of mind. To thoroughly analyze and treat a person, a clinical psychologist's most valuable tool may be this type of big data set.

Biological psychology includes the subfields of psychophysiology and neuropsychology. Psychophysiological data may include hormone collection (typically salivary), blood flow, heart rate, skin conductance, and other physiological responses. Neuropsychology includes multiple technologies for collecting information about the brain, including electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and functional near-infrared spectroscopy (fNIRS), among other lesser-used technologies. Measures in biological psychology are generally taken near-continuously across a certain time range, so much of the data collected in this field could be considered big data.

Cognitive psychology covers all mental processing. That is, this field includes the initiation of mental processing from internal or external stimuli (e.g., seeing a stoplight turn yellow), the actual processing of this information (e.g., understanding that a yellow light means to slow down), and the initiation of an action (e.g., knowing that you must step on the brake in order to slow your car). For each action that we take, and even actions that may be involuntary (e.g., turning your head toward an approaching police siren as you begin to slow your car), cognitive processing must take place at the levels of perception, information processing, and initiation of action. Therefore, any behavior or thought process that is measured in cognitive psychology will yield a large amount of data for even the simplest of these, such that complex processes or behaviors measured for their cognitive process will yield data sets of the magnitude of big data.

Another clear case of a field with big data sets is rational psychology. In rational psychological paradigms, researchers who limit experimental participants to a predefined set of options often find themselves limiting their studies to the point of not capturing naturalistic rational processing. The rational psychologist instead typically confronts big data as imaginative solutions to problems, and many forms of data, such as verbal protocols (i.e., transcripts of participants explaining their reasoning), require big data analysis techniques.

Finally, with the large time band under consideration, social psychologists must often consider days' worth of data in their studies. One popular technique is to have participants use wearable technology that periodically reminds them to record how they are doing, thinking, and feeling during the day. These types of studies lead to big data sets not just because of the frequency with which the data is collected, but also due to the enormous number of possible activities, thoughts, and feelings that participants may have experienced and recorded at each prompted time point.

The Unique Role of Psychology in Big Data

As described above, big data plays a large role in the field of psychology, and psychology can play an important role in how big data are analyzed and used. One aspect of this relationship is the necessity of the role of the psychology researcher on both ends of big data. That is, psychology is a theory-driven field, where data are collected in light of a set of hypotheses and analyzed as either supporting or rejecting those hypotheses. Big data offers endless opportunities for exploration and
Regression, Figure 1 Linear regression of crime rate and residents' poverty level
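A linear fit like the one pictured in Figure 1 can be sketched in a few lines by ordinary least squares. The poverty and crime numbers below are invented purely for illustration, not taken from the figure:

```python
# A minimal sketch of simple linear regression: fitting crime rate as a
# linear function of residents' poverty level by ordinary least squares.
# The data points are toy values chosen so the fit is exact.

def fit_linear(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return intercept, slope

poverty = [5, 10, 15, 20, 25]   # percent of residents below the poverty line
crime = [12, 21, 30, 39, 48]    # crime rate per 10,000 residents (toy values)

a, b = fit_linear(poverty, crime)
print(a, b)  # 3.0 1.8
```

With these toy values the fitted line is crime = 3.0 + 1.8 × poverty; with real data the residual scatter around the line is what the later discussion of unexplained variance refers to.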
function is from the final to the end of the semester. Similarly, regarding cubic, quartic, and more complicated regressions, they can also be approximated with a sequence of linear functions. However, analyzing nonlinear models in this way can produce much residual and leave considerable variance unexplained. The second way, including nonlinear terms in the regression function, as in ŷ = a + b1x + b2x², is considered better in this respect. As the graph of a quadratic function is a parabola, if b2 < 0, the parabola opens downward, and if b2 > 0, the parabola opens upward. Instead of having x² in the model, the nonlinearity can also be represented in many other ways, such as √x, ln(x), sin(x), cos(x), and so on. However, which nonlinear model to choose should be based on both theory or former research and the R².

Logistic Regression

When the outcome variable is dichotomous (e.g., yes/no, success/failure, survived/died, accept/reject), logistic regression is applied to make prediction of the outcome variable. In logistic regression, we predict the odds or log-odds (logit) that a certain condition will or will not happen. Odds range from 0 to infinity and are a ratio of the chance of an event (p) divided by the chance of the event not happening, that is, p/(1 - p). Log-odds (logits) are transformed odds, ln[p/(1 - p)], and range from negative to positive infinity. The relationship predicting probability using x follows an S-shaped curve, as shown in Figure 3; this shape is called a "logistic curve." It is defined as p(yi) = exp(b0 + b1xi + ei) / (1 + exp(b0 + b1xi + ei)). In logistic regression, the value predicted by the equation is a log-odds or logit. This means that when we run logistic regression and get coefficients, the values the equation produces are logits. Odds are computed as exp(logit), and probability is computed as exp(logit) / (1 + exp(logit)). Another model used to predict a binary outcome is the probit model, with the difference between the logistic and probit models lying in the assumption about the distribution of errors: while the logit model assumes a standard logistic distribution of errors, the probit model assumes a normal distribution
Regression, Figure 2 Nonlinear regression models (axes: Anxiety, Confidence in the Subject)
of errors (Chumney & Simpson 2006). Despite the difference in assumption, the predictive results using these two models are very similar. When the outcome variable has multiple categories, multinomial logistic regression or ordered logistic regression should be implemented, depending on whether the dependent variable is nominal or ordinal.

Regression in Big Data

Due to the advanced technologies that have been increasingly used in data collection and the vast amount of user-generated data, the amount of data will continue to increase at a rapid pace, along with a growing accumulation of scholarly works. The explosion of knowledge makes big data one of the new research frontiers, with an extensive number of application areas affected by big data, such as public health, social science, finance, geography, and so on. The high volume and complex structure of big data bring statisticians both opportunities and challenges. Generally speaking, big data is a collection of large-scale and complex data sets that are difficult to process and analyze using traditional data analytic tools. Inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, including supervised and unsupervised statistical learning (James, Witten, Hastie, & Tibshirani, 2013). Supervised statistical learning refers to a set of approaches for estimating the function f based on the observed data points, to understand the relationship between Y and X = (X1, X2, . . ., XP), which can be represented as Y = f(X) + e. Since the two main purposes for the estimation are to make prediction and inference, which regression modeling is widely used for, many classical statistical learning methods use regression models, such as linear, nonlinear, and logistic regression, with the selection of the specific regression model based on the research question and data structure. In contrast, for unsupervised statistical learning, there is no response variable to predict for every
Regression, Figure 3 Logistic regression models (y-axis: probability of pass, 0.00–1.00; x-axis: X, 0–10)
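The logit, odds, and probability conversions given in the logistic regression discussion above can be sketched directly; the coefficients b0 and b1 in the last line are hypothetical, purely for illustration:

```python
# A sketch of the logit/odds/probability relationships of logistic regression.

import math

def logit_to_probability(logit):
    """p = exp(logit) / (1 + exp(logit)), the S-shaped logistic curve."""
    odds = math.exp(logit)      # odds = exp(logit), ranging over (0, infinity)
    return odds / (1.0 + odds)

def probability_to_logit(p):
    """Log-odds: ln(p / (1 - p)), ranging from negative to positive infinity."""
    return math.log(p / (1.0 - p))

print(logit_to_probability(0.0))   # 0.5 (a logit of 0 means even odds)
print(probability_to_logit(0.5))   # 0.0
# A hypothetical fitted model b0 + b1*x with b0 = -3 and b1 = 0.6, at x = 5:
print(logit_to_probability(-3 + 0.6 * 5))  # 0.5
```

The two functions are inverses of each other, which is why coefficients reported on the logit scale can always be converted back to probabilities for interpretation.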
strengthen existing members or recruit potential new ones. Of course, depending on a religion's stance toward culture, they may (like the Amish) eschew some technology. However, for most mosques, churches, and synagogues, it has become standard for each to have its own website or Facebook page. Email newsletters and Twitter feeds have replaced traditional newsletters and event reminders.

New opportunities are constantly emerging that create novel space for leaders to engage practitioners. Religious leaders can communicate directly with followers through social media, adding a personal touch to digital messages, which can sometimes feel distant or cold. Rabbi Shmuley Boteach, "America's Rabbi," has 29 best-selling books but often communicates daily through his Twitter account, which has over a hundred thousand followers. On the flip side, people can thoroughly vet potential religious leaders or organizations before committing to them. If concerned that a particular group's ideology might not align with one's own, a quick Internet search or trip to the group's website should identify any potential conflicts. In this way, providing data about their identity and beliefs helps religious groups differentiate themselves.

In a sense, big data makes it possible for religious institutions to function more like – and take their cues from – commercial enterprises. Tracking streams of information about its followers can help religious groups be more in tune with the wants and needs of these "customers." Some religious organizations implement the retail practice of "tweets and seats": by ensuring that members always have available places to sit, rest, or hang out, and that wifi (wireless Internet connectivity) is always accessible, they hope to keep people present and engaged. Not all congregations embrace this change, but the clear cultural trend is toward ubiquitous smart phone connectivity. Religious groups that take advantage of this may provide several benefits to their followers: members could immediately identify and download any worship music being played; interested members could look up information about a local religious leader; members could sign up for events and groups as they are announced in the service; those using online scripture software can access texts and take notes. These are just a few possibilities.

There are other ways religious groups can harness big data. Some churches have begun analyzing liturgies to assess and track length and content over time. For example, a dip in attendance during a given month might be linked to the sermons being 40% longer in that same time frame. Many churches make their budgets available to members for the sake of transparency, and in a digital age it is not difficult to create financial records that are clear and accessible to laypeople. Finally, learning from a congregant's social media profiles and personal information, a church might remind a parishioner of her daughter's upcoming birthday, the approaching deadline for an application to a family retreat, or when other congregants are attending a sporting event of which she is a fan. The risk of overstepping boundaries is real and, just like with Facebook or similar entities, privacy settings should be negotiated beforehand.

As with other commercial entities, religious institutions utilizing big data must learn to differentiate information they need from information they don't. The sheer volume of available data makes distinguishing desired signal from irrelevant noise an increasingly important task. Random correlations may lead to false positive causation. A mosque may benefit from learning that members with the highest income are not actually its biggest givers, or from testing for a relationship between how far away its members live and how often they attend. Each religious group must determine how big data may or may not benefit its operation in any given endeavor, and the opportunities are growing.

Individual Religion

The everyday practice of religion is becoming easier to track as it increasingly utilizes digital technology. A religious individual's personal blog, Twitter feed, and Facebook profile keep a record of his or her activity or beliefs, making it relatively easy for any interested entity to track online
behavior over time. Producers and advertisers use this data to promote products, events, or websites to people who might be interested. Currently, companies like Amazon have more incentive than, say, a local synagogue in keeping tabs on what websites one visits, but the potential exists for religious groups to access the same data that Facebook, Amazon, Google, etc. already possess. Culturally progressive religious groups anticipate mutually beneficial scenarios: they provide a data service that benefits personal spiritual growth, and in turn the members generate fields of data that are of great value to the group. A Sikh coalition created the FlyRights app in 2012 to help with quick reporting of discriminatory TSA profiling while travelling. The Muslim's Prayer Times app provides a compass, calendar (with moon phases), and reminders for Muslims about when and in what direction to pray. Apple's app store has also had to ban other apps from fringe religious groups or individuals for being too irreverent or offensive.

The most popular religious app to date simply provides access to scripture. In 2008 LifeChurch.tv launched "the Bible app," also called YouVersion, and it currently has over 151 million installations worldwide on smartphones and tablets. Users can access scripture (in over 90 different translations) while online or download it for access offline. An audio recording of each chapter being read aloud can also be downloaded for some of the translations. A user can search through scripture by keyword, phrase, or book of the Bible, or there are reading plans of varying levels of intensity and access to related videos or movies. A "live" option lets users search out churches and events in surrounding geographic areas, and a sharing option lets users promote the app, post to social media what they have read, or share personal notes directly with friends. The digital highlights or notes made, even when using the app offline, will later upload to one's account and remain in one's digital "bible" permanently.

All this activity has generated copious amounts of data for YouVersion's producers. In addition to using the data to improve their product, they also released it to the public. This kind of insight into the personal religious behavior of so many individuals is unprecedented. With over a billion opens and/or uses, YouVersion statistically proved several phenomena. The data demonstrated that the most frequent activity for users is looking up a favorite verse for encouragement. Despite the stereotype of shirtless men at football games, the most popular verse was not John 3:16, but Philippians 4:13: "I can do all things through him who gives me strength." Religious adherents have always claimed that their faith gives them strength and hope, but big data has now provided a brief insight into one concrete way this actually happens.

The YouVersion data also reveal that people used the Bible to make a point in social media. Verses were sought out and shared in an attempt to support views on marriage equality, gender roles, or other divisive topics. Tracking how individuals claim to have their beliefs supported by scripture may help religious leaders learn more about how these beliefs are formed, how they change over time, and which interpretations of scripture are most influential. Finally, YouVersion data reveal that Christian users like verses with simple messages but chapters with profound ideas. Verses are easier to memorize when they are short and unique, but when engaging in sustained reading, believers prefer chapters with more depth. Whether large data sets confirm suspicions or shatter expectations, they continue to change the way religion is practiced and understood.

Numerous or Numinous

In the past, spiritual individuals had a few religions to choose from, but the globalizing force of technology has dramatically increased the available options. While the three big monotheisms (Christianity, Judaism, and Islam) and pan/polytheisms (Hinduism and Buddhism) are still the most popular, the Internet has made it possible for people of any faith, sect, or belief to find each other and validate their practice. Though pluralism is not embraced in every culture, there is at least increasing awareness of the many ways religion is practiced across the globe.
Additionally, more and more people are identifying themselves as "spiritual but not religious," indicating a desire to seek out spiritual experiences and questions outside the confines of a traditional religion. Thus for discursive activities centered on religion, Daniel Stout advocates the use of another term in addition to "religion": numinous. Because "religious" can have negative or limiting connotations, looking for the "numinous" in cultural texts or trends can broaden the search for and dialogue about a given topic. To be numinous, something must meet several criteria: stir deep feeling (affect), spark belief (cognition), include ritual (behavior), and be done with fellow believers (community). This four-part framework is a helpful tool for identification of numinous activity in a society where it once might have been labeled "religious."

By this definition, the Internet (in general) and entertainment media (in particular) all contain numinous potential. The flexibility of the Internet makes it relevant to the needs of most; while the authority of some of its sources can be dubious, the ease of social networking and multi-mediated experiences provides all the elements of traditional religion (community, ritual, belief, feeling). Entertainment media, which produce at least as much data as – and may be indistinguishable from – religious media, emphasize universal truths through storytelling. The growing opportunities of big data (and its practical analysis) will continue to be on offer for those who engage in numinous and religious behavior.

Cross-References

▶ Data Monetization
▶ Digitization
▶ Entertainment
▶ Internet
▶ Text Analytics

Further Readings

Campbell, H. A. (Ed.). (2012). Digital religion: Understanding religious practice in new media worlds. Abingdon: Routledge.

Hjarvard, S. (2008). The mediatization of religion: A theory of the media as agents of religious change. Northern Lights: Film & Media Studies Yearbook, 6(1), 9–26.

Hoover, S. M., & Lundby, K. (Eds.). (1997). Rethinking media, religion, and culture (Vol. 23). Thousand Oaks: Sage.

Kuruvilla, C. Religious mobile apps changing the faith-based landscape in America. Retrieved from http://www.nydailynews.com/news/national/gutenberg-moment-mobile-apps-changing-america-religious-landscape-article-1.1527004. Accessed Sep 2014.

Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.

Taylor, B. (2008). Entertainment theology (cultural exegesis): New-edge spirituality in a digital democracy. Baker Books.
model risks in interconnected and complex systems: capturing interdependent dynamics and other properties of systems requires vast amounts of heterogeneous data over space and time.

Interdependencies are also critical to risk analysis because even when risks are mitigated, they may still cause amplifying negative effects because of human risk perception. Perceived risk is the public social, political, and economic impacts of unrealized (and realized) risks. An example of the impact of a perceived risk is the nuclear power accident at Three Mile Island. In this accident, minimal radiation was released, so the real risk was mitigated. Nevertheless, the near miss of a nuclear meltdown had immense social and political consequences that continue to negatively impact the nuclear power industry in the United States. The realized consequences of perceived risk mean that "real" risk should not necessarily be separated from "perceived" risk.

Data: Quality and Sources

Many of the analysis challenges for big data are not unique but are pertinent to analysis of all data (Lazer et al. 2014). Regardless of the size of the dataset, it is important for analysts and policymakers to understand how, why, when, and where the data were collected and what the data contain and do not contain. Big data may be "poor data" because rules, causality, and outcomes are far less clear compared to small data.

More specifically, Vose (2008) describes the quality of data characteristics for risk analysis. The highest quality data are obtained using a large sample of direct and independent measurements collected and analyzed using established best practices over a long period of time and continually validated to correct data for errors. The second highest quality data use proxy measures, a widely used method for collection, analysis, and some validation. Other characteristics of decreasing data quality are: a smaller sample of objective data, agreement among multiple experts, a single expert opinion, and, weakest of all, speculation. While there may be some situations in which expert opinions are the only data source, general findings indicate this type of data has poor predictive accuracy. Additional reasons to question experts are situations or systems with a large number of unknown factors and potentially catastrophic impacts for erroneous estimations. Big data can be an improvement over small data and one or several expert opinions. However, volume is not necessarily the same as quality. Multidimensional aspects of data quality, whether the data are big or small, should always be considered.

Risk Analysis Methods

Vose (2008) explains the general techniques for conducting risk analysis. A common, descriptive method for risk analysis is Probability-Impact (P-I). P-I is the probability of a risk occurring multiplied by the impact of the risk if it materializes: Probability × Impact = Weighted Risk. All values may be either qualitative (e.g., low, medium, and high likelihood or severity) or quantitative (e.g., 10% or one million dollars). The Probability may be a single value or multiple values such as a distribution of probabilities. The Impact may also be a single value or multiple values and is usually expressed as money. A similar weighted model to P-I, Threat × Vulnerability × Consequence = Risk, is frequently used in risk analysis. However, a significant weakness with P-I and related models with fixed values is that they tend to systematically underestimate the probability and impact of rare events that are interconnected, such as natural hazards (e.g., floods), protection of infrastructure (e.g., power grid), and terrorist attacks. Nevertheless, the P-I method can be effective for quick risk assessments.

Probabilistic Risk Assessment
P-I is a foundation for Probabilistic Risk Assessment (PRA), an evaluation of the probabilities for multiple potential risks and their respective impacts. The US Army's standardized risk matrix is an example of qualitative PRA; see Fig. 1 (also see Level 5 of risk analysis below). The risk matrix is constructed by:
Risk Analysis, Fig. 1 Risk analysis (Source: Safety Risk Management, Pamphlet 385-30 (Headquarters, Department of the Army, 2014, p. 8): www.apd.army.mil/pdffiles/p385_30.pdf)
Step 1: Identifying possible hazards (i.e., potential risks)
Step 2: Estimating the probabilities and impacts of each risk and using the P-Is to categorize weighted risk

Risk analysis informs risk reduction, but they are not one and the same. After the risk matrix is constructed, appropriate risk tolerance and mitigation strategies are considered. The last step is ongoing supervision and evaluation of risk as conditions and information change, updating the risk matrix as needed, and providing feedback to improve the accuracy of future risk matrices.

Other widely used techniques include inferential statistical tests (e.g., regression) and the more comprehensive approach of what-if data simulations, which are also used in catastrophe modeling. Big data may improve the accuracy of probability and impact estimates, particularly the upper bounds in catastrophe modeling, leading to more accurate risk analysis.
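The quantitative form of the P-I method described above (Probability × Impact = Weighted Risk) can be sketched in a few lines. The hazards, probabilities, and dollar impacts below are invented purely for illustration:

```python
# Quantitative sketch of the P-I method: Probability x Impact = Weighted Risk.
# Probabilities are per-year chances of occurrence; impacts are in dollars.

def weighted_risk(probability, impact):
    """Probability of occurrence (0-1) times impact gives the weighted risk."""
    return probability * impact

hazards = {
    "flood":        (0.10, 2_000_000),
    "data breach":  (0.05, 3_000_000),
    "power outage": (0.30,   400_000),
}

# Rank hazards by weighted risk, the categorization performed in Step 2 above.
ranked = sorted(hazards, key=lambda h: weighted_risk(*hazards[h]), reverse=True)
for name in ranked:
    p, impact = hazards[name]
    print(f"{name}: {weighted_risk(p, impact):,.0f}")
# flood: 200,000
# data breach: 150,000
# power outage: 120,000
```

Note how the ranking differs from sorting by either probability or impact alone; as the text cautions, such fixed-value rankings still understate rare, interconnected events.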
From a statistical perspective, uncertainty and variability tend to be interchangeable. If uncertainty can be attributed to random variability, there is no distinction. However, in risk analysis, uncertainty can arise from incomplete knowledge (Paté-Cornell 1996). Uncertainty in risk may be due to a lack of data (particularly for rare events), not knowing relevant risks and/or impacts, and unknown interdependencies among risks and/or impacts.

Levels of Risk Analysis
There are six levels for understanding uncertainty, ranging from qualitative identification of risk factors (Level 0) to multiple risk curves constructed using different PRAs (Level 5) (Paté-Cornell 1996). Big data are relevant to Level 2 and beyond. The specific levels are as follows (adapted from Paté-Cornell 1996):

Level 0: Identification of a hazard or failure modes. Level 0 is primarily qualitative. For example, does exposure to a chemical increase the risk of cancer?
Level 1: Worst case. Level 1 is also qualitative, with no explicit probability. For example, if individuals are exposed to a cancer-causing chemical, what is the highest number that could develop cancer?
Level 2: Quasi-worst case (probabilistic upper bound). Level 2 introduces subjective estimation of probability based on reasonable expectation(s). Using the example from Level 1, this could be the 95th percentile for the number of individuals developing cancer.
Level 3: Best and central estimates. Rather than a worst case, Level 3 aims to model the most likely impact using central values (e.g., mean or median).
Level 4: Single-curve PRA. Previous levels were point estimates of risk; Level 4 is a type of PRA. For example, what is the number of individuals that will develop cancer across a probability distribution?
Level 5: Multiple-curve PRA. Level 5 has more than one probabilistic risk curve. Using the cancer risk example, different probabilities from distinct data can be represented using multiple curves, which are then combined using the average or another measure. A generic example of Level 5, for qualitative values, was illustrated with the above risk matrix. When implemented quantitatively, Level 5 is similar to what-if simulations in catastrophe modeling.

Catastrophe Modeling
Big data may improve risk analysis at Level 2 and above but may be particularly informative for modeling multiple risks at Level 5. Using catastrophe modeling, big data can allow for a more comprehensive analysis of the combinations of P-Is while taking into account interdependences among systems. Catastrophe modeling involves running a large number of simulations to construct a landscape of risk probabilities and their impacts for events such as terrorist attacks, natural disasters, and economic failures. Insurance, finance, other industries, and governments are increasingly relying on big data to identify and mitigate interconnected risks using catastrophe modeling.

Beiser (2008) describes the high level of data detail in catastrophe modeling. For risk analysis of a terrorist attack in a particular location, interconnected variables taken into account may include the proximity to high-profile targets (e.g., government buildings, airports, and landmarks), the city, and details of the surrounding buildings (e.g., construction materials), as well as the potential size and impact of an attack. Simulations are run under different assumptions, including the likelihood of acquiring materials to carry out a particular type of attack (e.g., a conventional bomb versus a biological weapon) and the probability of detecting the acquisition of such materials. Big data is informative for the wide range of possible outcomes and their impacts in terms of projected loss of life and property damage. However, risk analysis methods are only as good as their assumptions, regardless of the amount of data.

Assumptions: Cascading Failures
Even with big data, risk analysis can be flawed due to inappropriate model assumptions. In the case of Hurricane Katrina, the model assumptions
for a Category 3 hurricane did not specify a large, slow-moving storm system with heavy rainfall, nor did they account for the interdependencies in infrastructure systems. The storm caused early loss of electrical power, so many of the pumping stations for the levees could not operate. Consequently, water overflowed, causing breaches and resulting in widespread flooding. Because of cascading effects in interconnected systems, risk probabilities and impacts are generally far greater than in independent systems and therefore will be substantially underestimated when incorrectly treated as independent.

Right Then Wrong: Google Flu Trends

Google Flu Trends (GFT) is an example of both success and failure for risk analysis using big data. The information provided by an effective disease surveillance tool can help mitigate disease spread by reducing illnesses and fatalities. Initially, GFT was a successful real-time predictor of flu prevalence, but over time it became inaccurate. This is because the model assumptions did not hold over time, validation with small data was not ongoing, and it lacked transparency. GFT used a data-mining approach to estimate real-time flu rates: hundreds of millions of possible models were tested to determine the best fit of millions of Google searches to traditional weekly surveillance data. The traditional weekly surveillance data consisted of the proportion of reported doctor visits for flu-like symptoms. At first, GFT was a timely and accurate predictor of flu prevalence, but it began to produce systematic overestimates, sometimes by a factor of two or greater compared with the gold standard of traditional surveillance data. The erroneous estimates from GFT resulted from a lack of continued validation (thus assuming relevant search terms only changed as a result of flu symptoms) and a lack of transparency in the data and algorithms used.

Lazer et al. (2014) called the inaccuracy of GFT a parable for big data, highlighting several key points. First, a key cause for the misestimates was that the algorithm assumed that influences on search patterns were the same over time and primarily driven by the onset of flu symptoms. In reality, searches were likely influenced by external events such as media reporting of a possible flu pandemic, seasonal increases in searches for cold symptoms that were similar to flu symptoms, and the introduction of suggestions in Google Search. Therefore, GFT wrongly assumed the data were stationary (i.e., no trends or changes in the mean and variance of data over time). Second, Google did not provide sufficient information for understanding the analysis, such as all selected search terms and access to the raw data and algorithms. Third, big data is not necessarily a replacement for small data. Critically, the increased volume of data does not necessarily make it the highest quality source. Despite these issues, GFT was at the second highest level of data quality using criteria from Vose (2008) because GFT initially used:

1. Proxy measures: search terms originally correlated with local flu reports over a finite period of time
2. A common method: search terms used for Internet advertising, though disease surveillance was novel (with limited validation)

In the case of GFT, the combination of big and small data, by continuously recalibrating the algorithms for the big data using the small (surveillance) data, would have been much more accurate than either alone. Moreover, big data can make powerful predictions that are impossible with small data alone. For example, GFT could provide estimates of flu prevalence in local geographic areas using detailed spatial and temporal information from searches; this would be impossible with only the aggregated traditional surveillance data.

Conclusions

Similar to GFT, many popular techniques for analyzing big data use data mining to automatically uncover hidden structures. Data mining techniques are valuable for identifying patterns in big data but should be interpreted with caution. The dimensions of big data do not obviate considerations of data quality, the need for continuous
open government from the technologies of open data in order to clarify the potential impacts of public policies on civic life.

Criminal Justice

Upturn has worked with the Leadership Conference, a coalition of civil rights and media justice organizations, to evaluate police department policies on the use of body-worn cameras. The organizations, noting increased interest in the use of such cameras following police-involved deaths in communities such as Ferguson (Missouri), New York City, and Baltimore, also cautioned that body-worn cameras could be used for surveillance, rather than protection, of vulnerable individuals. The organizations released a scorecard on body-worn camera policies of 25 police departments in November 2015. The scorecard included criteria such as whether body-worn camera policies were publicly available, whether footage was available to people who file misconduct complaints, and whether the policies limited the use of biometric technologies to identify people in recordings.

Lending

Upturn has warned of the use of big data by predatory lenders to target vulnerable consumers. In a 2015 report, “Led Astray,” Upturn explained how businesses used online lead generation to sell risky payday loans to desperate borrowers. In some cases, Upturn found that the companies violated laws against predatory lending. Upturn also found some lenders exposed their customers’ sensitive financial data to identity thieves. The report recommended that Google, Bing, and other online platforms tighten restrictions on payday loan ads. It also called on the lending industry to promote best practices for online lead generation and for greater oversight of the industry by the Federal Trade Commission and Consumer Financial Protection Bureau.

Robinson + Yu researched the effects of the use of big data in credit scoring in a guide for policymakers titled “Knowing the Score.” The guide endorsed the most widely used credit scoring methods, including FICO, while acknowledging concerns about disparities in scoring among racial groups. The guide concluded that the scoring methods themselves were not discriminatory, but that the disparities rather reflected other underlying societal inequalities. Still, the guide advocated some changes to credit scoring methods. One recommendation was to include “mainstream alternative data” such as utility bill payments in order to allow more people to build their credit files. The guide expressed reservations about “nontraditional” data sources, such as social network data and the rate at which users scroll through terms of service agreements. Robinson + Yu also called for more collaboration among financial advocates and the credit industry, since much of the data on credit scoring is proprietary. Finally, Robinson + Yu advocated that government regulators more actively investigate “marketing scores,” which are used by businesses to target services to particular customers based on their financial health. The guide suggested that marketing scores appeared to be “just outside the scope” of the Fair Credit Reporting Act, which requires agencies to notify consumers when their credit files have been used against them.

Voting

Robinson + Yu partnered with Rock the Vote in 2013 in an effort to simplify online voter registration processes. The firm wrote a report, “Connected OVR: A Simple, Durable Approach to Online Voter Registration.” At the time of the report, nearly 20 states had passed online voter registration laws. Robinson + Yu recommended that all states allow voters to check their registration statuses in real time. It also recommended that online registration systems offer alternatives to users who lack state identification, and that the systems be responsive to devices of various sizes
and operating systems. Robinson + Yu also suggested that states streamline and better coordinate their online registration efforts. Robinson + Yu recommended that states develop a simple, standardized platform for accepting voter data and allow third-party vendors (such as Rock the Vote) to design interfaces that would accept voter registrations. Outside vendors, the report suggested, could use experimental approaches to reach new groups of voters while still adhering to government registration requirements.

Big Data and Civil Rights

In 2014, Robinson + Yu advised The Leadership Conference on “Civil Rights Principles for the Era of Big Data.” Signatories of the document included the American Civil Liberties Union, Free Press, and NAACP. The document offered guidelines for developing technologies with social justice in mind. The principles included an end to “high-tech profiling” of people through the use of surveillance and sophisticated data-gathering techniques, which the signatories argued could lead to discrimination. Other principles included fairness in algorithmic decision-making; the preservation of core legal principles such as the right to privacy and freedom of association; individual control of personal data; and protections from data inaccuracies.

The “Civil Rights Principles” were cited by the White House in its report, “Big Data: Seizing Opportunities, Preserving Values.” John Podesta, Counselor to President Barack Obama, cautioned in his introduction to the report that big data had the potential “to eclipse longstanding civil rights protections in how personal information is used.” Following the White House report, Robinson + Yu elaborated upon four areas of concern in the white paper “Civil Rights, Big Data, and Our Algorithmic Future.” The paper included four chapters: Financial Inclusion, Jobs, Criminal Justice, and Government Data Collection and Use.

The Financial Inclusion chapter argued the era of big data could result in new barriers for low-income people. The automobile insurance company Progressive, for example, installed devices in customers’ vehicles that allowed for the tracking of high-risk behaviors. Such behaviors included nighttime driving. Robinson + Yu argued that many lower-income workers commuted during nighttime hours and thus might have to pay higher rates, even if they had clean driving records. The report also argued that marketers used big data to develop extensive profiles of consumers based on their incomes, buying habits, and English-language proficiency, and such profiling could lead to predatory marketing and lending practices. Consumers often are not aware of what data has been collected about them and how that data is being used, since such information is considered to be proprietary. Robinson + Yu also suggested that credit scoring methods can disadvantage low-income people who lack extensive credit histories.

The report found that big data could impair job prospects in several ways. Employers used the federal government’s E-Verify database, for example, to determine whether job applicants were eligible to work in the United States. The system could return errors if names had been entered into the database in different ways. Foreign-born workers and women have been disproportionately affected by such errors. Resolving errors can take weeks, and employers often lack the patience to wait. Other barriers to employment arise from the use of automated questionnaires some applicants must answer. Some employers use the questionnaires to assess which potential employees will likely stay in their jobs the longest. Some studies have suggested that longer commute times correlate to shorter-tenured workers. Robinson + Yu questioned whether asking the commuting question was fair, particularly since it could lead to discrimination against applicants who lived in lower-income areas. Finally, Robinson + Yu raised concerns about “subliminal” effects on employers who conducted web searches for job applicants. A Harvard researcher, they noted, found that Google algorithms were more likely to show advertisements for arrest
Robinson, D., & Yu, H. (2014, October). Knowing the score: New data, underwriting, and marketing in the consumer credit marketplace. https://www.teamupturn.com/static/files/Knowing_the_Score_Oct_2014_v1_1.pdf

Robinson + Yu. (2013). Connected OVR: A simple, durable approach to online voter registration. Rock the Vote. http://www.issuelab.org/resource/connected_ovr_a_simple_durable_approach_to_online_voter_registration

Robinson, D., Yu, H., Zeller, W. P., & Felten, E. W. (2008). Government data and the invisible hand. Yale JL & Tech., 11, 159.

The Leadership Conference on Civil and Human Rights & Upturn. (2015, November). Police body worn cameras: A policy scorecard. https://www.bwcscorecard.org/static/pdfs/LCCHR_Upturn-BWC_Scorecard-v1.04.pdf

Upturn. (2014, September). Civil rights, big data, and our algorithmic future. https://bigdata.fairness.io/

Upturn. (2015, October). Led astray: Online lead generation and payday loans. https://www.teamupturn.com/reports/2015/led-astray

Yu, H., & Robinson, D. G. (2012). The new ambiguity of ‘open government’. UCLA L. Rev. Disc., 59, 178.
sensitivity and the hierarchy of various clients. Extensive reporting is a strength of Salesforce’s offerings, providing management the ability to track problem areas within an organization to a distinct department, area, or tangible product offering.

Salesforce has been a key leader in evolving marketing within this digital era through the use of specific marketing strategies aimed at creating and tracking marketing campaigns as well as measuring the success of online campaigns. These services are part of another growing segment available within Salesforce offerings in addition to the CRM packaging. Marketing departments leveraging Salesforce’s Buddy Media, Radian6, or ExactTarget gain the ability to conduct demographic, regional, or national searches on keywords and themes across all social networks, which creates a more informed and accurate marketing direction. Further, Salesforce’s dashboard, which is the main user interactive page, allows the creation of specific marketing-directed tasks that can be customized and shared for differing organizational roles or personal preferences.

The Salesforce marketing dashboard utilizes widgets, which are custom, reusable page elements that can be housed on individual users’ pages. When a widget is created, it is added to a widgets view where all team members can easily be assigned access. This allows companies and organizations to share appropriate widgets defined and created to serve the target market or industry-specific groups. The shareability of widgets allows the most pertinent and useful tasks to be replicated by many users within a single organization.

Types of Widgets

The Salesforce Marketing Cloud “River of News” is a widget that allows users to scroll through specific search results, within all social media conversations, and utilizes user-defined keywords. Users have the ability to see original posts that were targeted from keyword searches and provided a source link to the social media platform the post or message originated from. The “River of News” displays posts with many different priorities, such as newest post first, number of Twitter followers, social media platform used, physical location, and Klout score. This tool provides strong functionality for marketers or corporations wishing to hone in on, or take part in, industry, customer, or competitor conversations.

“Topic analysis” is a widget that is most often used to show share of voice, or the percentage of conversation happening about your brand or organization in relation to competitor brands. It is displayed as a pie chart and can be segmented multiple ways based on user configuration. Many use this feature as a quick visual assessment to see the conversations and interest revolving around specific initiatives or product launches.

“Topic trends” is a widget that provides the ability to display the volume of conversation over time through graphs and charts. This feature can be used to understand macro day, week, or month data. This widget is useful when tracking crisis management or brand sentiment. With a line graph display, users can see spikes of activity and conversation around critical areas. Further, users can then click and hone in on spikes, which can open a “Conversation Cloud” or “River of News” that allows users to see the catalyst behind the spike of social media activity. This tool is used as a way to better understand the reasons for increased interest or conversation across broad social media platforms.

Salesforce Uses

Salesforce offers wide-ranging data inference from its varied and evolving products. As CRM integration within the web and mobile has increased, the broad interest in better understanding and leveraging social media marketing campaigns has risen as well, allowing Salesforce a leading push within this industry’s market share. The diverse array of businesses, nonprofits, municipalities, and other organizations that utilize Salesforce illustrates the importance of this software within daily business and marketing
strategy. Salesforce clients include the American Red Cross, the City of San Francisco, Philadelphia’s 311 system, Burberry, H&R Block, Volvo, and Wiley Publishing.

Salesforce Service Offerings

Salesforce is a leader among CRM and media marketing-oriented companies such as Oracle, SAP, Microsoft Dynamics CRM, Sage CRM, Goldmine, Zoho, Nimble, Highrise, Insight.ly, and Hootsuite. Salesforce’s offerings can be purchased individually or as a complete bundle. It offers current breakdowns of services and access in its varied options, which are referred to as Sales Cloud, Service Cloud, ExactTarget Marketing Cloud, Salesforce1 Platform, Chatter, and Work.com.

Sales Cloud allows businesses to track customer inquiries, escalate issues requiring specialized support, and monitor employee productivity. This product provides customer service teams with the answers to customers’ questions and the ability to make the answers available on the web so consumers can find answers for themselves.

Service Cloud offers active and real-time information directed toward customer service. This service provides functionality such as Agent Console, which offers relevant information about customers and their media profiles. This service also provides businesses the ability to give customers access to live agent web chats to ensure customers can have access to information without a phone call.

ExactTarget Marketing Cloud focuses on creating closer relationships with customers through directed email campaigns, in-depth social marketing, data analytics, mobile campaigns, and marketing automation.

The Salesforce1 Platform is geared toward mobile app creation, giving users access to create and promote mobile apps, with over four million apps created utilizing this service.

Chatter is a social and collaborative function that relates to the Salesforce platform. Similar to Facebook and Twitter, Chatter allows users to form a community within their business that can be used for secure collaboration and knowledge sharing.

Work.com is a corporate performance management platform for sales representatives. The platform targets employee engagement in three areas: alignment of team and personal goals with business goals, motivation through public recognition, and real-time performance feedback.

Salesforce has more than 5,500 employees, revenues of approximately $1.7 billion, and a market value of approximately $17 billion. The company regularly conducts over 100 million transactions a day and has over 3 million subscribers.

Headquartered in San Francisco, California, Salesforce also maintains regional offices in Dublin, Singapore, and Tokyo, with secondary locations in Toronto, New York, London, Sydney, and San Mateo, California. Salesforce operates with over 170,000 companies and 17,000 nonprofit organizations. In June 2004, Salesforce was offered on the New York Stock Exchange under the symbol CRM.

Cross-References

▶ Customer Service
▶ Data Aggregation
▶ Social Media
▶ Streaming Data

Further Readings

Denning, S. (2011). Successfully implementing radical management at Salesforce.com. Strategy & Leadership, 39(6), 4.
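The share-of-voice figure that the “Topic analysis” widget displays is each brand’s share of all tracked mentions, expressed as a percentage. A minimal sketch of that computation in Python; the brand names and the `share_of_voice` helper are hypothetical illustrations, not part of any Salesforce API:

```python
from collections import Counter

def share_of_voice(mentions):
    """Given a list of brand names extracted from social media posts,
    return each brand's share of voice as a percentage of all mentions."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {brand: 100 * n / total for brand, n in counts.items()}

# Hypothetical stream of brand mentions pulled from keyword searches.
posts = ["Burberry", "Burberry", "CompetitorA", "Burberry", "CompetitorB"]
print(share_of_voice(posts))  # Burberry holds 60% of the conversation
```

The resulting percentages are exactly what a share-of-voice pie chart segments.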
subsequent research relies on them to advance knowledge.

One prominent metric used in scientometrics is the h-index, which was proposed by Jorge Hirsch in 2005. The h-index considers the number of publications produced by an individual or organization and the number of citations these publications receive. An individual can be said to have an h-index of h when she produces h publications, each of which receives at least h citations, and no other publication receives more than h citations.

The advent of large databases and big data analytics has greatly facilitated the calculation of the h-index and similar impact metrics. For example, in a 2013 study, Filippo Radicchi and Claudio Castellano utilized the Google Scholar Citations data set to evaluate the individual scholarly contribution of over 35,000 scholars (Radicchi and Castellano 2013). The researchers found that the number of citations received by a scientist is a strong proxy for that scientist’s h-index, whereas the number of publications is a less precise proxy.

The same principles behind citation analysis can be applied to measure the impact or quality of patents. Large patent databases such as PATSTAT allow researchers to measure the importance of individual patents using forward citations. Forward citations come from the “prior art” section of the patent documents, which describes the technologies that were deemed critical to their innovation by the patent applicants. Scholars use patent counts, weighted by forward citations, to derive measures of national innovative productivity.

Until recently, measurement of research impact has been almost exclusively based on citation-based measures. However, citations are slow to accumulate and ignore the influence of research on the broader public. Recently there has been a push to include novel data sources in the evaluation of research impact. Gunther Eysenbach has found that tweets about a journal article within the first 3 days of publication are a strong predictor of eventual citations for highly cited research articles (Eysenbach 2011). The direction of causality in this relationship – i.e., whether strong papers lead to a high volume of tweets or whether the tweets themselves cause subsequent citations – is unclear. However, the author suggests that the most promising use of social media data lies not in its use as a predictor of traditional impact measures but as a means of creating novel metrics of the social impact of research.

Indeed, the development of an alternative set of measurements – often referred to as “altmetrics” – based on data gleaned from the social web represents a particularly active field of scientometrics research. Toward this end, services such as PLOS Article-Level Metrics use big data techniques to develop metrics of research impact that consider factors other than citations. PLOS Article-Level Metrics pulls in data on article downloads, commenting, and sharing via services such as CiteULike, Connotea, and Facebook, to broaden the way in which a scholar’s contribution is measured.

Certain academic fields, such as the humanities, that rely on under-indexed forms of scholarship such as book chapters and monographs have proven difficult to study using traditional scientometrics techniques. Because they do not depend on online bibliographic databases, altmetrics may prove useful in studying such fields. Björn Hammarfelt uses data from Twitter and Mendeley – a web-based citation manager that has a social networking component – to study scholarship in the humanities (Hammarfelt 2014). While his study suggests that coverage gaps still exist using altmetrics, as these applications become more widely used, they will likely become a useful means of studying neglected scientific fields.

See Also

▶ Bibliometrics
▶ Social Media
▶ Text Analytics
▶ Thomson Reuters

Further Readings

Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222. arXiv:1402.4578 [Physics, Stat].
Eysenbach, G. (2011). Can tweets predict citations? Metrics of social impact based on Twitter and correlation with traditional metrics of scientific impact. Journal of Medical Internet Research, 13, e123.

Hammarfelt, B. (2014). Using altmetrics for assessing research impact in the humanities. Scientometrics, 101, 1419–1430.

Radicchi, F., & Castellano, C. (2013). Analysis of bibliometric indicators for individual scholars in a large data set. Scientometrics, 97(3), 627–637. https://doi.org/10.1007/s11192-013-1027-3.

Xian, H., & Madhavan, K. (2014). Anatomy of scholarly collaboration in engineering education: A big-data bibliometric analysis. Journal of Engineering Education, 103, 486–514.
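The h-index definition given in this entry translates directly into a few lines of code. A minimal sketch in Python, using a hand-entered list of citation counts purely for illustration (real scientometric tools compute this over large citation databases):

```python
def h_index(citations):
    """Compute the h-index from per-publication citation counts:
    the largest h such that h publications have at least h citations each."""
    cited = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(cited, start=1):
        if count >= rank:
            h = rank  # the rank-th most cited paper still has >= rank citations
        else:
            break
    return h

# A scholar whose papers are cited 10, 8, 5, 4, and 3 times has an h-index
# of 4: four papers each have at least 4 citations, but not five with 5.
print(h_index([10, 8, 5, 4, 3]))  # 4
```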
human language using computational methods. Historically, this meant implementing rules and structures inspired by the cognitive structures proposed by Chomskyan generative linguistics. Over time, computational linguistics has broadened to include diverse methods for machine processing of language, irrespective of whether the computational models are plausible cognitive models of human language processing. As practiced today, computational linguistics is closer to a branch of computer science than a branch of linguistics. The branch of linguistics that uses quantitative analysis of large text corpora is known as corpus linguistics.

Research in computational linguistics and natural language processing involves finding solutions for the many subproblems associated with understanding language, and combining advances in these modules to improve performance on general tasks. Some of the most important NLP subproblems include part-of-speech tagging, syntactic parsing, identifying the semantic roles played by verb arguments, recognizing named entities, and resolving references. These feed into performance on more general tasks like machine translation, question answering, and summarization.

In the social sciences, the terms quantitative content analysis, quantitative text analysis, or “text as data” are all used. Content analysis may be performed by human coders, who read and mark up documents. This process can be streamlined with software. Fully automated content analysis, or quantitative text analysis, typically employs statistical word-frequency analysis to discover latent traits from text, or to scale documents of interest on a particular dimension of interest in social science or political science.

Tools and Resources

Text data does not immediately challenge computational resources to the same extent as other big data sources such as video or sensor data. For example, the entire proceedings of the European parliament from 1996 to 2005, in 21 languages, can be stored in 5.4 gigabytes – enough to load into main memory on most modern machines. While techniques such as parallel and distributed processing may be necessary in some cases, for example, global streams of social media text or applying machine learning techniques for classification, typically the challenge of text data is to parse and extract useful information from the idiosyncratic and opaque structures of natural language, rather than overcoming computational difficulties simply to store and manipulate the text. The unpredictable structure of text files means that general-purpose programming languages are commonly used, unlike in other applications where the tabular format of the data allows the use of specialized statistical software.

Original Unix command line tools such as grep, sed, and awk are still extremely useful for batch processing of text documents. Historically, Perl has been the programming language of choice for text processing, but recently Ruby and Python have become more widely used. These are scripting languages, designed for ease of use and flexibility rather than speed. For more computationally intensive tasks, NLP tools are implemented in Java or C/C++.

The Python libraries spaCy and gensim and the Java-based Stanford Core NLP software are widely used in industry and academia. They provide implementations and guides for the most widely used text processing and statistical document analysis methods.

Preprocessing

The first step in approaching a text analysis dataset is to successfully read the document formats and file encodings used. Most programming languages provide libraries for interfacing with Microsoft Word and PDF documents. The ASCII coding system represents unaccented English upper- and lowercase letters, numbers, and punctuation, using one byte per character. This is no longer sufficient for most purposes, and modern documents are encoded in a diverse set of character encodings. The Unicode system defines code points which can represent characters and symbols from all writing systems. The UTF-8 and
UTF-16 encodings implement these code points in 8-bit or 16-bit encoded files.

Words are the most apparent units of written text, and most text processing tasks begin with tokenization – dividing the text into words. In many languages, this is relatively uncomplicated: whitespace delimits words, with a few ambiguous cases such as hyphenation, contraction, and the possessive marker. Within languages written in the Roman alphabet there is some variance; for example, agglutinative languages like Finnish and Hungarian tend to use long compound terms disambiguated by case markers, which can make the connection between space-separated words and dictionary-entry meanings tenuous. For languages with a different orthographic system, such as Chinese, Japanese, and Arabic, it is necessary to use a customized tokenizer to split text into units suitable for quantitative analysis.

Even in English, the correspondence between space-separated word and semantic unit is not exact. The fundamental unit of vocabulary – sometimes called the lexeme – may be modified or inflected by the addition of morphemes indicating tense, gender, or number. For many applications, it is not desirable to distinguish between the inflected forms of words; rather, we want to sum together counts of equivalent words. Therefore, it is common to remove the inflected endings of words and count only the root, or stem. For example, a system to judge the sentiment of a movie review need not distinguish between the words “excite,” “exciting,” “excites,” and “excited.” Typically the word ending is removed and the terms are treated equivalently.

The Porter stemmer (Porter 1980) is one of the most frequently used algorithms for this purpose. A slightly more sophisticated method is lemmatization, which also normalizes inflected words, but uses a dictionary to match irregular forms such as “be”/“is”/“are”. In addition to stemming and tokenizing, it may be useful to remove very common words that are unlikely to have semantic content related to the task. In English, the most common words are function words such as “of,” “in,” and “the.” These “stopwords” largely serve a grammatical rather than semantic function, and some NLP systems simply remove them before proceeding with a statistical analysis.

After the initial text preprocessing, there are several simple metrics that may be used to assess the complexity of language used in the documents. The type-token ratio, a measure of lexical diversity, gives an estimate of the complexity of the document by comparing the total number of words in the document to the number of unique words (i.e., the size of the vocabulary). The Flesch-Kincaid readability metric uses the average sentence length and the average number of syllables per word, combined with coefficients calibrated with data from students, to give an estimate of the grade-level reading difficulty of a text.

Document-Term Matrices

After tokenization and other preprocessing steps, most text analysis methods work with a matrix that stores the frequency with which each word in the vocabulary occurs in each document. This is the simplest case, known as the “bag-of-words” model, and no information about the ordering of the words in the original texts is retained. More sophisticated analysis might involve extracting counts of complex features from the documents. For example, the text may be parsed and tagged with part-of-speech information as part of the preprocessing stage, which would allow for words with identical spellings but different part-of-speech categories or grammatical roles to be counted as separate features.

Often, rather than using only single words, counts of phrases are used. These are known as n-grams, where n is the number of words in the phrase; for example, trigrams are three-word sequences. N-gram models are especially important for language modeling, used to predict the probability of a word or phrase given the preceding sequence of words. Language modeling is particularly important for natural language generation and speech recognition problems.

Once each document has been converted to a row of counts of terms or features, a wide range of automated document analysis methods can be employed. The document-term matrix is usually
sparse and uneven – a small number of words In addition, word frequencies are extremely
occur very frequently in many documents, while unevenly distributed (an observation known as
a large number of words occur rarely, and most Zipf’s law) and are highly correlated with one
words do not occur at all in a given document. another, resulting in parameter vectors that make
Therefore, it is common practice to smooth or less than ideal examples for regression models. It
weight the matrix, either using the log of the may therefore be necessary to use regression
term frequency or with a measure of term impor- methods designed to mitigate this problem, such
tance like tf-idf (term frequency x inverse docu- as lasso and ridge regression, or to prune the
ment frequency) or mutual information. feature space to avoid overtraining, using feature
subset selection or a dimensionality reduction
technique like principal components analysis or
Matrix Analysis singular value decomposition. With recent
advances in neural network research, it has
Supervised classification methods attempt to automatically categorize documents based on the document-term matrix. One of the most familiar such tasks is the email spam detection problem: based on the frequencies of words in a corpus of emails, the system must decide whether an email is spam or not. Such a system is supervised in the sense that it requires as a starting point a set of documents that have been correctly labeled with the appropriate category, in order to build a statistical model of which terms are associated with each category. One simple and effective algorithm for supervised document classification is Naive Bayes, which gives a new document the class that has the maximum a posteriori probability given the document's term counts and the independent associations between terms and categories in the training documents. In political science, a similar algorithm, "wordscores," is widely used; it sums Naive-Bayes-like word parameters to scale new documents based on reference scores assigned to training texts with extreme positions (Laver et al. 2003).

Other widely used supervised classifiers include support vector machines, logistic regression, and nearest-neighbor models. If the task is to predict a continuous variable rather than a class label, then a regression model may be used. Statistical learning and prediction systems that operate on text data very often face the typical big data problem of having more features (word types) than observed or labeled documents. This is a high-dimensional learning problem, where p (the number of parameters) is much larger than n (the number of observed examples).

…become more common to use unprocessed counts of n-grams, tokens, or even characters as input to a neural network with many intermediate layers. With sufficient training data, such a network can learn the feature extraction process better than hand-curated feature extraction systems, and these "deep learning" networks have improved the state of the art in machine translation and image labeling.

Unsupervised methods can cluster documents or reveal the distribution of topics in documents in a data-driven fashion. For unsupervised scaling and clustering of documents, methods include k-means clustering and the Wordfish algorithm, a multinomial Poisson scaling model for political documents (Slapin and Proksch 2008).

Another goal of unsupervised analysis is to measure what topics comprise the text corpus and how these topics are distributed across documents. Topic modeling (Blei 2012) is a widely used generative technique to discover a set of topics that influence the generation of the texts and to explore how they are associated with other variables of interest.

Vector Space Semantics and Machine Learning

In addition to retrieving or labeling documents, it can be useful to represent the meaning of terms found in the documents. Vector space semantics, or distributional semantics, aims to represent the meaning of words using counts of their co-occurrences with other words.
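The co-occurrence counting that underlies distributional semantics can be sketched in a few lines. The toy corpus, the window size of two, and the cosine comparison below are illustrative assumptions, not part of the original entry.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words appearing within `window`
    positions of it; each word's Counter is its context vector."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v[w] for w, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm

corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
]
vecs = cooccurrence_vectors(corpus)
similarity = cosine(vecs["cat"], vecs["dog"])
# "cat" and "dog" share contexts ("the", "chased"), so their
# vectors come out highly similar: the distributional hypothesis
# in miniature.
```

Factorizing the resulting word-by-word count matrix, as the entry goes on to describe, would yield dense, low-dimensional embeddings in place of these raw counts.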
The "distributional hypothesis," as described by J. R. Firth (Firth 1957), is the idea that "you shall know a word by the company it keeps." The co-occurrence vectors of words have been shown to be useful for noun phrase disambiguation, semantic relation extraction, and analogy resolution. Many systems now use the factorization of the co-occurrence matrices as the initial input to statistical learners, allowing a fine-grained representation of lexical semantics. Vector semantics also allows for word sense disambiguation: it is possible to distinguish the different senses of a word by clustering the vector representations of its occurrences.

These vectors may count instances of words co-occurring within the same context (syntagmatic relations) or compare the similarity of the contexts of words as a measure of their substitutability (paradigmatic relations) (Turney and Pantel 2010). The use of neural networks or dimensionality-reduction techniques allows researchers to produce a relatively low-dimensional space in which to compare word vectors, sometimes called word embeddings.

Machine learning has long been used to perform classification of documents or to aid the accuracy of the NLP subtasks described above. However, as in many other fields, the recent application of neural networks with many hidden layers (deep learning) has led to large improvements in accuracy rates on many tasks. These opaque but computationally powerful techniques require only a large volume of training data and a differentiable target function to model complex linguistic behavior.

Conclusion

Natural language processing is a complex and varied problem that lies at the heart of artificial intelligence. The combination of statistical and symbolic methods has led to huge leaps forward over the last few decades, and with the preponderance of online training data and advances in machine learning methods, it is likely that further gains will be made in the coming years. For researchers intending to make use of rather than advance these methods, a fruitful approach is a good working knowledge of a general-purpose programming language, combined with the ability to configure and execute off-the-shelf machine learning packages.

Cross-References

▶ Artificial Intelligence
▶ Biomedical Natural Language Processing
▶ Python Scripting Language
▶ Supervised Machine Learning
▶ Text Analytics
▶ Unstructured Data

References

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Chomsky, N. (2002). Syntactic structures. Berlin: Walter de Gruyter.
Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A., Lally, A., Murdock, J., Nyberg, E., Prager, J., Schlaefer, N., & Welty, C. A. (2010). Building Watson: An overview of the DeepQA project. AI Magazine, 31(3), 59–79.
Firth, J. R. (1957). A synopsis of linguistic theory. In Studies in linguistic analysis. Oxford: Blackwell.
Laver, M., Benoit, K., & Garry, J. (2003). Extracting policy positions from political texts using words as data. American Political Science Review, 97(2), 311–331.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Slapin, J. B., & Proksch, S.-O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705–722.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.
That a large number of correlations can be found does not necessarily mean that the analysis is reliable and complete. One of the preparatory measures before the actual data analysis is data reduction. While a large number of data points may be available for collection, not all of these data points should be included in the analysis of every question. Instead, a careful selection of data points is likely to produce a more reliable and explainable interpretation of observed data. In other words, just because the data is available, it does not mean it needs to be included in the analysis. Some elements may be random and will not add substantively to the answer to a particular question. Other elements may be redundant and add no new information beyond that already provided by other data points.

Jules Berman suggests nine steps for the analysis of semi-structured data. Step 1 includes formulation of a question which can and will subsequently be answered with data. A Big Data approach may not be the best strategy for questions that can be answered with other, traditional research methods. Step 2 evaluates the data resources available for collection. Data repositories may have "blind spots": data points that are systematically excluded or restricted from public access. At step 3, the question is reformulated to adjust for the resources identified in step 2. Available data may be insufficient to answer the original question despite access to large amounts of data. Step 4 involves evaluation of possible query outputs. Data mining may return a large number of data points, but these most frequently need to be filtered to focus the analysis on the question at hand. At step 5, data should be reviewed and evaluated for its structure and characteristics. Returned data may be quantitative or qualitative, or it may have data points that are missing for a substantial number of records, which will impact future data analysis. Step 6 requires a strategic and systematic data reduction. Although it may sound counterintuitive, Big Data analysis can provide the most powerful insights when the data set is condensed to the bare essentials needed to answer a focused question. Some collected data may be irrelevant or redundant to the problem at hand and will not be needed for the analysis. Step 7 calls for the identification of analytic algorithms, should they be deemed necessary. Algorithms are analytic approaches to data, which may be very sophisticated; however, establishing a reliable set of meaningful metrics to answer a question may be a more reliable strategy. Step 8 looks at the results and conclusions of the analysis and calls for conservative assessment of possible explanations and models suggested by the data, assertions of causality, and possible biases. Finally, step 9 calls for validation of the results of step 8 using comparable data sets. Invalidation of predictions may suggest necessary adjustments to any of the steps in the data analysis and make conclusions more robust.

Data Management

Semi-structured data includes database characteristics but also incorporates documents and other file types, which cannot be fully described by a standard database entry. Data entries in structured data sets follow the same order; all entries in a group have the same descriptions, defined format, and predefined length. In contrast, semi-structured data entries are organized in semantic entities, similar to structured data, but they may not have the same attributes in the same order or of the same length. Early digital databases were organized on the relational model of data, where data is recorded in one or more tables with a unique identifier for each entry. The data for such databases needs to be structured uniformly for each record. Semi-structured data, by contrast, relies on tags or other markers to separate data elements. Semi-structured data may miss data elements or have more than one data point in an element. Overall, while semi-structured data has a predefined structure, the data within this structure is not entered with the same rigor as in traditional relational databases. This data management situation arises from the practical necessity of handling the user-generated and widely interactional data brought up by Web 2.0. The data contained in emails, blog posts, PowerPoint presentation files, images, and videos may have very different sets of attributes, but they also offer a possibility to assign metadata systematically.
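The contrast can be made concrete with JSON, discussed below as one of the two main semi-structured formats. The email records here are invented for illustration: both describe the same semantic entity, yet they differ in attributes, order, and cardinality.

```python
import json

# Two records in one collection: unlike rows in a relational table,
# one omits fields the other has, and list-valued fields may hold
# one or several data points.
emails = [
    {"from": "ann@example.com", "to": ["bob@example.com"],
     "subject": "Report", "attachments": ["q3.pdf", "q3.xlsx"]},
    {"from": "bob@example.com",
     "to": ["ann@example.com", "carl@example.com"]},  # no subject, no attachments
]

payload = json.dumps(emails)   # serialize for storage or interchange
restored = json.loads(payload)

# Consumers must tolerate missing elements rather than assume a schema:
subjects = [msg.get("subject", "(none)") for msg in restored]
```

Document stores such as MongoDB persist exactly this kind of attribute-value structure without requiring every record to share a schema.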
Metadata may include information about author and time and may create the structure needed to assign the data to semantic groups. Unstructured data, on the other hand, is data that cannot be readily organized in tables that capture its full extent. Semi-structured data, as the name suggests, carries some elements of structured data. These elements are metadata tags that may list the author or sender, entry creation and modification times, the length of a document, or the number of slides in a presentation. Yet these data also have elements that cannot be described in a traditional relational database. For example, a traditional database structure, which would require an initial infrastructure design, cannot handle information such as a sent email and all the responses that were received, since it is unknown whether an email respondent will use one or all names in a response, whether anyone will be added or omitted, whether the original message will be modified, whether attachments will be added to subsequent messages, and so on.

Semi-structured data allows programmers to nest data or create hierarchies that represent complex data models and relationships among entries. However, the robustness of the traditional relational data model forces more thoughtful implementation of data applications and possible subsequent ease in analysis. Handling semi-structured data is associated with some challenges. The data itself may present a problem by being embedded in natural text, from which it cannot always be extracted automatically with precision. Natural text is based on sentences that may not have the easily identifiable relationships and entities necessary for data collection, and it may lack widely accepted standards for vocabularies. A communication process may involve different models to transfer the same information, or it may require richer data transfer, available through natural text but not through a structured exchange of keywords. For example, an email exchange can capture the data about senders and recipients, but automated filtering and analysis of the body of the email are limited.

Two main types of semi-structured data formats are Extensible Markup Language (XML) and JavaScript Object Notation (JSON). XML, developed in the mid-1990s, is a markup language that sets rules for data interchange. Although an improvement over earlier markup languages, XML has been critiqued for being bulky and cumbersome in implementation. JSON is viewed as a possible successor format for digital architecture and database technologies. JSON is an open standard format that transmits data between an application and a server. Data objects in JSON format consist of attribute-value pairs and are stored in databases like MongoDB and Couchbase. The data stored in such a database can be pulled by software for more efficient and faster processing. Apache Hadoop is an example of an open-source framework that provides both storage and processing support. Other multi-platform query processing applications suitable for enterprise-level use are Apache Spark and Presto.

See Also

▶ Big Data Storytelling, Digital Storytelling
▶ Discovery Analytics
▶ Hadoop
▶ MongoDB
▶ Text Analytics

Further Readings

Abiteboul, S., et al. (2012). Web data management. New York: Cambridge University Press.
Foreman, J. W. (2013). Data smart: Using data science to transform information into insight. Indianapolis: Wiley.
Miner, G., et al. (2012). Practical text mining and statistical analysis for non-structured text data applications. Waltham: Academic.
decision-making in the realms of product research and development, marketing and public relations, crisis management, and customer relations. Although businesses have traditionally relied on surveys and focus groups, sentiment analysis offers several unique advantages over such conventional data collection methods. These advantages include reduced cost and time, increased access to much larger samples and hard-to-reach populations, and real-time intelligence. Thus, sentiment analysis can be a useful market research tool. Indeed, sentiment analysis is now commonly offered by many commercial social data analysis services.

Approaches

Broadly speaking, there exist two approaches to the automatic extraction of sentiment from textual material: the lexicon-based approach and the machine learning-based approach. In the lexicon-based approach, a sentiment orientation score is calculated for a given text unit based on a predetermined set of opinion words with positive (e.g., good, fun, exciting) and negative (e.g., bad, boring, poor) sentiments. In a simple form, a list of words, phrases, and idioms with known sentiment orientations is built into a dictionary, or an opinion lexicon. Each word is assigned a specific sentiment orientation score. Using the lexicon, each opinion word extracted receives a predefined sentiment orientation score, which is then aggregated over a text unit.

The machine learning-based approach, also called the text classification approach, builds a sentiment classifier to determine whether a given text about an object is positive, negative, or neutral. Using the ability of machines to learn, this approach trains a sentiment classifier on a large set of examples, or training corpus, labeled with sentiment categories (e.g., positive, negative, or neutral). The sentiment categories are manually annotated by humans according to predefined rules. The classifier then applies the properties of the training corpus to classify data into sentiment categories.

Levels of Analysis

The classification of an opinion in text as positive, negative, or neutral (or on a more fine-grained classification scheme) is impacted by, and thus requires consideration of, the level at which the analysis is conducted. There are three levels of analysis: document, sentence, and aspect and/or entity. First, document-level sentiment classification addresses a whole document as the unit of analysis. The task at this level is to determine whether an entire document (e.g., a product review, a blog post, an email) is positive, negative, or neutral about an object. This level of analysis assumes that the opinions expressed in the document are targeted toward a single entity (e.g., a single product). As such, this level is not particularly useful for documents that discuss multiple entities.

The second, sentence-level sentiment classification, focuses on the sentiment orientation of individual sentences. This level of analysis is also referred to as subjectivity classification and comprises two tasks: subjectivity classification and sentence-level classification. In the first task, the system determines whether a sentence is subjective or objective. If it is determined that the sentence expresses a subjective opinion, the analysis moves to the second task, sentence-level classification, which involves determining whether the sentence is positive, negative, or neutral.

The third type of classification is referred to as entity- and aspect-level sentiment analysis. Also called feature-based opinion mining, this level of analysis focuses on sentiments directed at entities and/or their aspects. An entity can include a product, service, person, issue, or event. An aspect is a feature of the entity, such as its color or weight. For example, in the sentence "the design of this laptop is bad, but its processing speed is excellent," there are two aspects stated – "design" and "processing speed."
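A minimal sketch of how the lexicon-based approach described above might score these two aspects. The tiny opinion lexicon and the clause-splitting heuristic are illustrative assumptions, far cruder than production systems.

```python
# Toy opinion lexicon: word -> sentiment orientation score (assumed values).
LEXICON = {"good": 1, "excellent": 2, "bad": -1, "poor": -2, "boring": -1}

def aspect_sentiment(sentence, aspects):
    """Assign each aspect the summed score of opinion words in its
    clause; splitting on 'but' crudely separates contrastive clauses."""
    scores = {}
    for clause in sentence.lower().replace(",", "").split(" but "):
        clause_score = sum(LEXICON.get(tok, 0) for tok in clause.split())
        for aspect in aspects:
            if aspect in clause:
                scores[aspect] = clause_score
    return scores

scores = aspect_sentiment(
    "the design of this laptop is bad, but its processing speed is excellent",
    aspects=["design", "processing speed"],
)
# "design" falls in the negative clause and "processing speed" in the
# positive one, reproducing the aspect-level reading discussed here.
```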
This sentence is negative about one aspect, "design," and positive about the other aspect, "processing speed." Entity- and aspect-level sentiment analysis is not limited to analyzing documents or sentences alone. Indeed, although a document or sentence may contain opinions regarding multiple entities and their aspects, entity- and aspect-level sentiment analysis has the ability to identify the specific entities and/or aspects that the opinions in the document or sentence refer to and then determine whether those opinions are positive, negative, or neutral.

Challenges and Limitations

Extracting opinions from texts is a daunting task. It requires a thorough understanding of the semantic, syntactic, explicit, and implicit rules of a language. Also, because sentiment analysis is carried out by a computer system with a typical focus on analyzing documents on a particular topic, off-topic passages containing irrelevant information may also be included in the analyses (e.g., a document may contain information on multiple topics). This could result in inaccurate global sentiment polarities for the main topic being analyzed. Therefore, the computer system must be able to adequately screen out opinions that are not relevant to the topic being analyzed. Relatedly, for the machine learning-based approach, a sentiment classifier trained on a certain domain (e.g., car reviews) may perform well on that particular topic but may not when applied to another domain (e.g., computer reviews). The issue of domain independence is another important challenge.

Also, the complexities of human communication limit the capacity of sentiment analysis to capture the nuanced, contextual meanings that opinion holders actually intend to communicate in their messages. Examples include the use of sarcasm, irony, and humor, in which context plays a key role in conveying the intended message, particularly in cases when an individual says one thing but means the opposite. For example, someone may say "nice shirt," which implies positive sentiment if said sincerely but negative sentiment if said sarcastically. Similarly, words such as "sick," "bad," and "nasty" may have reversed sentiment orientation depending on context and how they are used. For example, "My new car is sick!" implies positive sentiment toward the car. These issues can also contribute to inaccuracies in sentiment analysis.

Altogether, despite these limitations, the computational study of opinions provided by sentiment analysis can be beneficial for practical purposes. So long as individuals continue to share their opinions through online user-generated media, the possibilities for entities seeking to gain meaningful insights into the opinions of key publics will remain. Yet challenges to sentiment analysis, such as those discussed above, pose significant limitations to its accuracy and thus its usefulness in decision-making.

Cross-References

▶ Competitive Monitoring
▶ Consumer Products
▶ Data Mining
▶ Facebook
▶ Internet
▶ LinkedIn
▶ Marketing/Advertising
▶ Online Identity
▶ Real-Time Analytics
▶ SalesForce
▶ Social Media
▶ Twitter

Further Reading

Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28, 15–21.
Liu, B. (2011). Sentiment analysis and opinion mining. San Rafael: Morgan & Claypool.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Meijer, A., & Bolívar, M. P. R. (2015). Governing the smart city: A review of the literature on smart urban governance. International Review of Administrative Sciences. doi:10.1177/0020852314564308.
Sadowski, J., & Pasquale, F. A. (2015). The spectrum of control: A social theory of the smart city. First Monday, 20(7). doi:10.5210/fm.v20i7.5903.
Schrock, A. R. (2016). Civic hacking as data activism and advocacy: A history from publicity to open government data. New Media & Society, 18(4), 581–599.
Townsend, A. (2013). Smart cities: Big data, civic hackers, and the quest for a new utopia. New York: W.W. Norton.
Vanolo, A. (2013). Smartmentality: The smart city as disciplinary strategy. Urban Studies, 51(5), 883–898.
distinction between personal communication and the broadcast model of messages.

Theoretical Foundations of Social Media

Looking into the role of the new interactive and empowering media, it is important to study their development as techno-social systems, focusing on the dialectic relation of structure and agency. As Fuchs (2014) describes, media are techno-social systems in which information and communication technologies enable and constrain human activities that create knowledge that is produced, distributed, and consumed with the help of technologies, in a dynamic and reflexive process that connects technological structures and human agency. The network infrastructure of the Internet allows multiple and multi-way communication and information flow between agents, combining interpersonal (one-to-one), mass (one-to-many), and complex, yet dynamically equal, communication (many-to-many).

The discussion on the role of social media and networks finds its roots in the emergence of the network society and the evolution of the Internet as a result of the convergence of the audiovisual, information technology, and telecommunications sectors. Contemporary society is characterized by what can be defined as convergence culture (Jenkins 2006), in which old and new media collide, where grassroots and corporate media intersect, and where the power of the media producer and the power of the media consumer interact in unpredictable ways.

The work of Manuel Castells (2000) on the network society is central, emphasizing that the dominant functions and processes in the Information Age are increasingly organized around networks. Networks constitute the new social morphology of our societies, and the diffusion of networking logic substantially modifies the operation and outcomes in processes of production, experience, power, and culture. Castells (2000) introduces the concept of "flows of information," underlining the crucial role of information flows in networks for economic and social organization.

In the development of the flows of information, the Internet holds the key role as a catalyst of a novel platform for public discourse and public communication. The Internet consists of both a technological infrastructure and (inter)acting humans: a technological system and a social subsystem that both have a networked character. Together these parts form a techno-social system. The technological structure is a network that produces and reproduces human actions and social networks and is itself produced and reproduced by such practices.

The specification of online platforms such as Web 1.0, Web 2.0, or Web 3.0 marks distinctively the social dynamics that define the evolution of the Internet. Fuchs (2014) provides a comprehensive approach to the three "generations" of the Internet, founding them on the idea of knowledge as a threefold dynamic process of cognition, communication, and cooperation. The (analytical) distinction indicates that all Web 3.0 applications and processes (cooperation) also include aspects of communication and cognition and that all Web 2.0 applications (communication) also include cognition. The distinction is based on the insight that knowledge is a threefold process: all communication processes require cognition, but not all cognition processes result in communication, and all cooperation processes require communication and cognition, but not all cognition and communication processes result in cooperation.

In many definitions, the notions of collaboration and collective action are central, stressing that social media are tools that increase our ability to share, to cooperate with one another, and to take collective action, all outside the framework of traditional institutions and organizations. Social media enable users to create their own content and decide on the range of its dissemination through the various available and easily accessible platforms. Social media can serve as online facilitators or enhancers of human networks – webs of people that promote connectedness as a social value.

Social network sites (SNS) are built on the pattern of online communities of people who are connected and share similar interests and activities.
Boyd and Ellison (2007) provide a robust and articulated definition of SNS, describing them as Web-based services that allow individuals to (1) construct a public or semipublic profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site. As the social media and user-generated content phenomena grew, websites focused on media sharing began implementing and integrating SNS features and becoming SNSs themselves.

The emancipatory power of social media is crucial to understanding the importance of networking, collaboration, and participation. These concepts, directly linked to social media, are key to understanding the real impact and dimensions of contemporary participatory media culture. According to Jenkins (2006), the term participatory culture contrasts with older notions of passive media consumption. Rather than talking about media producers and consumers occupying separate roles, we might now see them as participants who interact with each other and contribute actively, and prospectively equally, to social media production.

Participation is a key concept that addresses the main differences between the traditional (old) media and the social (new) media and focuses mainly on the empowerment of the audience/users of media toward a more active information and communication role. These changes transform the relation between the main actors in political communication, namely, political actors, journalists, and citizens. Social media and networks enable any user to participate in the mediation process by actively searching, sharing, and commenting on available content. The distributed, dynamic, and fluid structure of social media enables them to circumvent professional and political restrictions on news production and has given rise to new forms of journalism defined as citizen, alternative, or participatory journalism, but also to new forms of propaganda and misinformation.

The Emergence of Citizen Journalism

The rise of social media and networks has a direct impact on the types and values of journalism and the structures of the public sphere. The transformation of interactions between political actors, journalists, and citizens through the new technologies has created the conditions for the emergence of a form distinct from professional journalism, often called citizen, participatory, or alternative journalism. The terms used to identify the new journalistic practices on the Web range from interactive or online journalism to alternative journalism, participatory journalism, citizen journalism, or public journalism. The level and the form of the public's participation in the journalistic process determine whether it is a synergy between journalists and the public or an exclusively journalistic activity of the citizens.

However, the phenomenon of alternative journalism is not new. Already in the nineteenth century, the first forms of alternative journalism made their appearance with the development of the radical British press. The radical socialist press in the USA in the early twentieth century followed, as did the marginal and feminist press between 1960 and 1970. Fanzines and zines appeared in the 1970s and were succeeded by pirate radio stations. At the end of the twentieth century, however, attention moved to new media and Web 2.0 technologies.

The evolution of social networks with the new paradigm shift is currently defining to a great extent the type, the impact, and the dynamics of action, reaction, and interaction of the participants involved in a social network. According to Atton (2003), alternative journalism is an ongoing effort to review and challenge the dominant approaches to journalism. The structure of this alternative journalistic practice appears as the counterbalance to traditional and conventional media production and disrupts its dominant forms, namely, the institutional dimension of mainstream media, the phenomena of capitalization and commercialization, and the growing concentration of ownership.

Citizen journalism is based on the assumption that the public space is in crisis (institutions, politics, journalism, political parties).
It appears as an effort to democratize journalism and thereby questions the added value of objectivity, which is upheld by professional journalism.

The debate on a counterweight to professional, conventional, mainstream journalism intensified around 1993, when the signs of fatigue and the loss of the public's trust in journalism became visible and overlapped with the innovative potentials of the new interactive technologies. The term public journalism appeared in the USA in 1993 as part of a movement that expressed concerns about the detachment of journalists and news organizations from citizens and communities, as well as of US citizens from public life. However, the term citizen journalism has been defined on various levels. If both its supporters and critics agree on one core thing, it is that it means different things to different people.

The developments that Web 2.0 introduced and the subsequent explosive growth of social media and networks mark the third phase of public journalism and its transformation into alternative journalism. The field of information and communication is transformed into a more participatory media ecosystem, which evolves news into social experiences. News is transformed into a participatory activity to which people contribute their own stories and experiences and their reactions to events.

Citizen journalism proposes a different model of the selection and use of sources and of news practices, and a redefinition of journalistic values. Atton (2003) traces the conflict with traditional, mainstream journalism to three key points: (a) power does not come exclusively from official institutions and the professional category of journalists, (b) reliability and validity can derive from descriptions of lived experience and not only from objectively detached reporting, and (c) it is not mandatory to separate facts from subjective opinion. Although Atton (2003) does not consider lived experience an absolute value, he believes it can constitute the added value of alternative journalism, combining it with the

The purpose of citizen journalism is to reverse the "hierarchy of access" as identified by the Glasgow University Media Group, giving voice to those marginalized by the mainstream media. While mainstream media rely extensively on elite groups, alternative media can offer a wider range of "voices" waiting to be heard. The practices of alternative journalism provide "first-hand" evidence, as well as collective and anti-hierarchical forms of organization and a participatory, radical approach to citizen journalism. This form of journalism is identified by Atton as native reporting.

To determine the moving boundary between news producers and the public, Bruns (2005) used the term produsers, combining the words and concepts of producers and users. These changes determine the way in which power relations in the media industry and journalism are changing, shifting power from journalists to the public.

Social Movements

In the last few years, we have witnessed a growing heated debate among scholars, politicians, and journalists regarding the role of the Internet in contemporary social movements. Social media tools such as Facebook, Twitter, and YouTube, which facilitate and support user-generated content, have taken up a leading role in the development and coordination of a series of recent social movements, such as the student protests in Britain at the end of 2010 as well as the outbreak of revolution in the Arab world, the so-called Arab Spring.

The open and decentralized character of the Internet has inspired many scholars to envisage a rejuvenation of democracy, focusing on the (latent) democratic potentials of the new media as interactive platforms that can motivate and fulfill the active participation of citizens in the political process. On the other hand, Internet skeptics suggest that the Internet will not itself alter traditional politics. On the contrary, it can generate a very fragmented public sphere based
capability of recording it through documented on isolated private discussions while the abun-
reports. dance of information, in combination with the
vast amounts of offered entertainment and the options for personal socializing, can lead people to refrain from public life. The Internet actually offers a new venue for information provision to the citizen-consumer. At the same time, it allows politicians to establish direct communication with citizens, free from the norms and structural constraints of traditional journalism.

Social media aspire to create new opportunities for social movements. Web 2.0 platforms allow protestors to collaborate so that they can quickly organize and disseminate a message across the globe. By enabling the fast, easy, and low-cost diffusion of protest ideas, tactics, and strategies, social media and networks allow social movements to overcome problems historically associated with collective mobilization.

Over the last years, the center of attention has been not the Western societies, which used to be the technology-literate and information-rich part of the world, but the Middle Eastern ones. Especially after 2009, there is considerable evidence advocating in favor of the empowering, liberating, and yet engaging potentials of online social media and networks, as in the case of the protesters in Iran who actively used Web services like Facebook, Twitter, Flickr, and YouTube to organize, attract support, and share information about street protests after the June 2009 presidential elections. More recently, a revolutionary wave of demonstrations has swept the Arab countries, the so-called Arab Spring, using again social media as means for raising awareness, communication, and organization, while facing strong Internet censorship. Though this neglects the complexity of these transformations, the uprisings were widely quoted as "the Facebook revolution," demonstrating the power of networks.

On the European continent, we have witnessed the recent development of the Indignant Citizens Movement, whose origin is largely attributed to the social movements that started in Spain and then spread to Portugal, the Netherlands, the UK, and Greece. In these cases, digital social networks have proved powerful means to convey demands for a radical renewal of politics based on a stronger and more direct role of citizens and on a critique of the functioning of Western democratic systems.

See Also

▶ Digital Literacy
▶ Open Data
▶ Social Network Analysis
▶ Twitter

Further Reading

Atton, C. (2003). What is 'alternative' journalism? Journalism: Theory, Practice and Criticism, 4(3), 267–272.
Boyd, D. M., & Ellison, N. B. (2007). Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1), 210–230.
Bruns, A. (2005). Gatewatching: Collaborative online news production. New York: Peter Lang.
Castells, M. (2000). The rise of the network society, the information age: Economy, society and culture (Vol. I). Oxford: Blackwell.
Fuchs, C. (2014). Social media: A critical introduction. London: Sage.
Jenkins, H. (2006). Convergence culture: Where old and new media collide. New York: New York University Press.
small groups was used in disciplines that analyze society through qualitative methods, such as Sociology.

Data collection has always been a problem for social research because of its inherent subjectivity, as the Social Sciences have traditionally relied on small samples, using methods and tools that gather information based on people. In fact, one of the critical issues of Social Science is the need to develop research methods that ensure the objectivity of the results. Moreover, the objects of study of the Social Sciences do not fit into the models and methods used by other sciences and do not allow the performance of experiments under controlled laboratory conditions. The quantification of information is possible because there are several techniques of analysis that transform ideas, social capital, relationships, and other variables from social systems into numerical data. However, the object of study always interacts with the culture of the social scientist, making it very difficult to achieve real impartiality.

Big data is not self-explanatory. Consequently, it requires new research paradigms across multiple disciplines, and for social scientists it is a major challenge, as it enables interdisciplinary studies and the intersection between computer science, statistics, data visualization, and the social sciences. Furthermore, big data empowers the use of real-time data on the level of whole populations, to test new hypotheses and study social phenomena on a larger scale. In the context of the modern Social Sciences, large datasets allow scientists to understand and study different social phenomena, from the interactions of individuals and the emergence of self-organized global movements to political decisions and the reactions of economic markets.

Nowadays, social scientists have more information on interaction and communication patterns than ever. Computational tools allow understanding of the meaning of what those patterns reveal. The models built about social systems within the analysis of large volumes of data must be coherent with theories of human actors and their behavior. The advantages of large datasets and of scaling up the size of data are that it becomes possible to make sense of the temporal and spatial dimensions. What makes big data so interesting to the Social Sciences is the possibility to reduce data, apply filters that identify relevant patterns of information, aggregate sets in a way that helps identify temporal scales and spatial resolutions, and segregate streams and variables in order to analyze social systems.

As big data is dynamic, heterogeneous, and interrelated, social scientists are facing new challenges due to the existence of computational and statistical tools that allow the extraction and analysis of large datasets of social information. Big data is being generated in multiple and interconnecting disciplinary fields. Within the social domain, data is being collected from transactions and interactions through multiple devices and digital networks. The analysis of large datasets is not within the field of a single scientific discipline or approach. In this regard, big data can change Social Science because it requires an intersection of sciences from different research traditions and a convergence of methodologies and techniques. The scale of the data and the methods required to analyze it need to be developed by combining expertise with scholars from other scientific disciplines. Within this collaboration with data scientists, social scientists must have an essential role in order to read the data and understand the social reality.

The era of big data implies that the Social Sciences rethink and update theories and theoretical questions such as the small world phenomenon, the complexity of urban life, relational life, social networks, the study of communication and public opinion formation, collective effervescence, and social influence. Although computerized databases are not new, the emergence of an era of big data is critical, as it creates a radical paradigm shift in social research. Big data reframes key issues on the foundation of knowledge, the processes and techniques of research, the nature of information, and the classification of social reality.

The new forms of social data have interesting dimensions: volume, variety, velocity, exhaustive, indexical, relational, flexible, and scalable. Big data consists of relational information at large scale that can be created in or near real time with different structures, extensive in scope, capable of
identifying and indexing information distinctively, flexible, and able to expand in size quickly. The datasets can be created from personal or nonpersonal data. Personal data can be defined as information relating to an identified person. This definition includes online user-generated content, online social data, online behavioral data, location data, sociodemographic data, and information from official sources (e.g., police records). All data collected that do not directly identify individuals are considered nonpersonal data. Personal data can be collected from different sources with three techniques: voluntary data, created and shared online by individuals; observed data, which records the actions of the individual; and data inferred about individuals based on voluntary or observed information.

The disciplinary outlines of the Social Sciences in the age of big data are in constant readjustment because of the speed of change in the data landscape. Some authors have argued that the new data streams could reconfigure and constitute social relations and populations. Academic researchers attempt to handle the methodological challenges presented by the growth of big social data, and new scientific trends arise, despite the diversity of the philosophical foundations of the Social Science disciplines. Objectivity of the data does not translate directly into their interpretation. The scientific method postulated by Durkheim attempts to remove itself from the subjective domain. Nevertheless, the author stated that objectivity is made by subjects and is based on the subjective observations and selections of individuals.

A new empiricist epistemology has emerged in the Social Sciences that goes against the deductive approach hegemonic within modern science. According to this new epistemology, big data can capture an entire social reality and provide its full understanding. Therefore, there is no need for theoretical models or hypotheses. This perspective assumes that patterns and relationships within big data are inherently significant and accurate. Thus, the application of data analytics transcends the context of a single scientific discipline or a specific domain of knowledge and can be interpreted by anyone who can interpret statistics or data visualization.

Several scholars, who believe that the new empiricism operates as a discursive rhetorical device, criticize this approach. Kitchin argues that although data may be interpreted free of context and domain-specific expertise, such an epistemological interpretation is likely to be unconstructive, as it fails to be embedded in broader debates.

As large datasets are highly distributed and present complex data, a new model of data-driven science is emerging within the Social Science disciplines. Data-driven science uses a hybrid combination of abductive, inductive, and deductive methods for the understanding of a phenomenon. This approach assumes theoretical frameworks and seeks to generate scientific hypotheses from the data by incorporating a mode of induction into the research design. Therefore, the epistemological strategy adopted within this model is to detect techniques that identify potential problems and questions worthy of further analysis, testing, and validation.

Although big data enhances the set of data available for analysis and enables new approaches and techniques, it does not replace traditional small data studies. Because big data cannot answer specific social questions, more targeted studies are required. Computational Social Science can be the interface between computer science and the traditional social sciences. This emerging interdisciplinary field within the Social Sciences uses computational methods to model social reality and analyze phenomena, as well as social structures and collective behavior. The main computational approaches from the Social Sciences to study big data are social network analysis, automated information extraction systems, social geographic information systems, complexity modeling, and social simulation models.

Computational Social Science is an intersection of Computer Science, Statistics, and the Social Sciences that uses large-scale demographic, behavioral, and network data to analyze individual activity, collective behaviors, and relationships. Computational Social Science can be the methodological approach through which the Social Sciences study big data because of its use of mathematical
Global Earth Observation System of Systems, which is coordinated by the Group on Earth Observations. It acts as a central portal and clearinghouse providing access to spatial data in support of the whole system. The portal provides a registry for both data services and the standards used in data services. It allows users to discover, browse, edit, create, and save spatial data from members of the Group on Earth Observations across the world.

Another popular spatial data service is the virtual globe, which provides a three-dimensional representation of the Earth or another world. It allows users to navigate in a virtual environment by changing the position, viewing angle, and scale. A virtual globe has the capability to represent various views of the surface of the Earth by adding spatial data as layers on the surface of a three-dimensional globe. Well-known virtual globes include Google Earth, NASA World Wind, ESRI ArcGlobe, etc. Besides spatial data browsing, most virtual globe programs also enable interactions with users. For example, Google Earth can be extended with many add-ons encoded in the Keyhole Markup Language, such as geological map layers exported from OneGeology.

Open-Source Approaches

There are already widely used free and open-source software programs serving different purposes in spatial data handling (Steiniger and Hunter 2013). Those programs can be grouped into a number of categories:

(1) Standalone desktop geographic information systems such as GRASS GIS, QGIS, and ILWIS
(2) Mobile and light geographic information systems such as gvSIG Mobile, QGIS for Android, and tangoGPS
(3) Libraries with capabilities for spatial data processing, such as GeoScript, CGAL, and GDAL
(4) Data analysis and visualization tools such as GeoVISTA Studio, R, and PySAL
(5) Spatial database management systems such as PostgreSQL, Ingres Geospatial, and JASPA
(6) Web-based spatial data publication and processing servers such as GeoServer, MapServer, and 52n WPS
(7) Web-based spatial data service development frameworks such as OpenLayers, GeoTools, and Leaflet

An international organization, the Open Source Geospatial Foundation, was formed in 2006 to support the collaborative development of open-source geospatial software programs and promote their widespread use.

Companies such as Google, Microsoft, and Yahoo! already provide free map services. One can browse maps on the service website, but the spatial data behind the service is not open. In contrast, the free and open-source spatial data approach requires not only freely available datasets but also details about the data, such as format, conceptual structure, vocabularies used, etc. A well-known open-source spatial data project is OpenStreetMap, which aims at creating a free editable map of the world. The project was launched in 2004. It adopts a crowdsourcing approach, that is, it solicits contributions from a large community of people. By the middle of 2014, the OpenStreetMap project had more than 1.6 million contributors. Compared with the maps, the data generated by OpenStreetMap are considered the primary output. Due to the crowdsourcing approach, current data quality varies across regions. Besides OpenStreetMap, there are numerous similar open-source and collaborative spatial data projects addressing the needs of different communities, such as GeoNames for geographical names and features, OpenSeaMap for a worldwide nautical chart, and the eBird project for real-time data about bird distribution and abundance.

Open-source spatial data formats have also received increasing attention in recent years, especially Web-based formats. A typical example is GeoJSON, which enables the encoding of simple geospatial features and their attributes using JavaScript Object Notation (JSON). GeoJSON is now supported by various spatial data software
packages and libraries, such as OpenLayers, GeoServer, and MapServer. The map services of Google, Yahoo!, and Microsoft also support GeoJSON in their application programming interfaces.

Spatial Intelligence

The Semantic Web brings innovative ideas to the geospatial community. The Semantic Web is a web of data, compared to the traditional web of documents. A solid enabler of the Semantic Web is Linked Data, a group of methodologies and technologies to publish structured data on the Web so they can be annotated, interlinked, and queried to generate useful information. The Web-based capabilities of linking and querying are specific features of Linked Data, which help people to find patterns in data and use them in scientific or business activities. To make full use of Linked Data, the geospatial community is developing standards and technologies to (1) transform spatial data into Semantic Web compatible formats such as the Resource Description Framework (RDF), (2) organize and publish the transformed data using triple stores, and (3) explore patterns in the data using new query languages such as GeoSPARQL.

RDF uses a simple triple structure of subject, predicate, and object. The structure is robust enough to support linked spatial data consisting of billions of triples. Building on the basis of RDF, there are a number of specific schemas for representing locations and spatial relationships in triples, such as GeoSPARQL. Triple stores offer functionalities to manage spatial data RDF triples and query them, very similar to what traditional relational databases are capable of. As mentioned above, spatial data have two major sources: conventional data legacy and crowdsourcing data. While technologies for transforming both of them into triples are maturing, the crowdsourcing data provide a more flexible mechanism for the Linked Data approach and data exploration, as they are fully open. For example, work has already been done to transform data of OpenStreetMap and GeoNames into RDF triples. For pattern exploration, there are already initial results, such as those in the GeoKnow project (Athanasiou et al. 2014). The project built a prototype called GeoKnow Generator, which provides functions to link, enrich, query, and visualize RDF triples of spatial data and to build lightweight applications addressing specific requests in the real world.

Linked spatial data is still far from mature. More efforts are needed on the annotation and accreditation of shared spatial RDF data, their integration and fusion, efficient RDF querying in a big data environment, and innovative ways to visualize and present the results.

Cross-References

▶ Geography
▶ Location Data
▶ Spatial Analytics
▶ Spatio-Temporal Analytics

References

Athanasiou, S., Hladky, D., Giannopoulos, G., Rojas, A. G., & Lehmann, J. (2014). GeoKnow: Making the web an exploratory place for geospatial knowledge. ERCIM News, 96. http://ercim-news.ercim.eu/en96/special/geoknow-making-the-web-an-exploratory-place-for-geospatial-knowledge. Accessed 29 Apr 2016.
Huisman, O., & de By, R. A. (Eds.). (2009). Principles of geographic information systems. Enschede: ITC Educational Textbook Series.
Open Geospatial Consortium (2016). About OGC. http://www.opengeospatial.org/ogc. Accessed 29 Apr 2016.
Steiniger, S., & Hunter, A. J. S. (2013). The 2012 free and open source GIS software map: A guide to facilitate research, development, and adoption. Computers, Environment and Urban Systems, 39, 136–150.
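The GeoJSON encoding mentioned in this entry can be illustrated with a minimal sketch. The feature below is invented for illustration (the name, coordinates, and properties are assumptions, not data from any project named above); it shows how a geometry and its attributes travel together as ordinary JSON that any JSON-capable software can parse.

```python
import json

# A minimal GeoJSON Feature: one geometry plus free-form attributes.
# All values here are invented purely for illustration.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-0.1276, 51.5072],  # [longitude, latitude]
    },
    "properties": {"name": "Sample point", "source": "illustration"},
}

# A FeatureCollection groups several features into one document.
collection = {"type": "FeatureCollection", "features": [feature]}

text = json.dumps(collection)  # serialize for exchange on the Web
parsed = json.loads(text)      # any JSON parser can read it back

print(parsed["features"][0]["geometry"]["type"])  # -> Point
```

Because the envelope is plain JSON, this is the same structure that Web mapping libraries and map-service APIs consume directly.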
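The subject-predicate-object triple structure and the GeoSPARQL vocabulary described in the Spatial Intelligence section can be sketched as follows. This is a toy illustration, not a triple store: the geo: predicate URIs come from the GeoSPARQL vocabulary, while the ex:London identifiers and coordinates are invented, and the pattern match is a plain-Python stand-in for what a real GeoSPARQL query would do.

```python
# Spatial data as subject-predicate-object triples, held in plain tuples.
# A real triple store would index billions of these and answer GeoSPARQL
# queries; here we only show the shape of the data and a simple match.
GEO = "http://www.opengis.net/ont/geosparql#"

triples = [
    # A feature linked to its geometry (identifiers invented).
    ("ex:London", GEO + "hasGeometry", "ex:LondonGeom"),
    # The geometry serialized as Well-Known Text.
    ("ex:LondonGeom", GEO + "asWKT", "POINT(-0.1276 51.5072)"),
]

def subjects_with_wkt(store):
    """Return every subject whose geometry has a WKT serialization."""
    geoms = {s for s, p, o in store if p == GEO + "asWKT"}
    return [s for s, p, o in store
            if p == GEO + "hasGeometry" and o in geoms]

print(subjects_with_wkt(triples))  # -> ['ex:London']
```

The two-step link (feature → geometry → WKT literal) mirrors how GeoSPARQL-aware stores keep geometries queryable alongside ordinary RDF data.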
Disclosures allow comparison and review. Detailed activity disclosure of operations answers questions of who, what, when, and where. Conversely, disclosures can also answer questions about influential people or wasteful projects. Disclosure may emphasize predictive trends and retrospective measurement, while other disclosures may emphasize narrative interpretation and explanation.

Implementation

Transparency is implemented by disclosing timely information to meet specific needs. This assumes that stakeholders will discover the disclosed information, comprehend its importance, and subsequently use it to change behavior. Organizations, including corporations and governments, often implement transparency using technology, which creates digital material used in big data.

Corporations release information about how their actions impact communities. The goal of corporate transparency is to improve services, share financial information, reduce harm to the public, or reduce reputation risks. The veracity of corporate disclosures has been debated by management science scholars (Bennis et al. 2008). On the one hand, mandatory corporate reporting fails if the information provided does not solve the target issue (Fung et al. 2007). On the other hand, organizations that are transparent to employees, management, stockholders, regulators, and the public may have a competitive advantage. In any case, there are real limits to what corporations can disclose and still remain both domestically and internationally competitive.

Governments release information as a form of accountability. From the creation of the postal code system to social security numbers, governments have inadvertently provided core categories for big data analytics (Washington 2014). Starting in the mid-twentieth century, legislatures around the world began to write freedom of information laws that supported the release of government materials on request. Subsequently, electronic government projects developed technology capabilities in public sector organizations.

Advances in computing have increased the use of big data techniques to automatically review transparency disclosures. Transparency can be implemented without technology, but often the two are intrinsically linked. One impact technology has on transparency is that information now comes in multiple forms. Disclosure before technology was the static production of documents and regularly scheduled reports that could be released on paper by request. Disclosure with technology is the dynamic streaming of real-time data available through machine-readable search and discovery. Transparency is often implemented by releasing digital material as open data that can be reused with few limitations. Open data transparency initiatives disclose information in formats that can be used with big data methods.

Intellectual History

Transparency has its origins in economic and philosophical ideas about disclosing the activities of those in authority. In Europe, the intellectual history spans from Aristotle in fifth-century Greece to Immanuel Kant in eighteenth-century Prussia. Debates on big data can be positioned within these conversations about the dynamics of information and power. An underlying assumption of transparency is that there are hidden and visible power relationships in the exchange of information. Transparency is often an antidote to situations where information is used as power to control others.

Michel Foucault, the twentieth-century French philosopher, considered how rulers used statistics to control populations in his lecture on Governmentality. Foucault engaged with Jeremy Bentham's eighteenth-century descriptions of the ideal prison and the ideal government, both of which require full visibility. This philosophical position argues that complete surveillance will result in complete cooperation. While some research suggests that people will continue bad behavior under scrutiny, transparency is still seen as a method of enforcing good behavior.
Big data extends concerns about the balance of authority, power, and information. Those who collect, store, and aggregate big data have more control than those generating the data. These conceptual foundations are useful in considering both the positive and negative aspects of big data.

Big Data Transparency

Big data transparency discloses the transfer and transformation of data across networks. Big data transparency brings visibility to the power dynamic embedded in predicting human behavior. Analysis of digital material can be done without explicit acknowledgment or agreement. Furthermore, the industry that exchanges consumer data is easily obscured because transactions are all virtual. While a person may willingly agree to free services from a platform, it is not clear if users know who owns, sees, collects, or uses their data. The transparency of big data is described from three perspectives: sources, organizations, and the industry.

Transparency of sources discloses information about the digital material used in big data. Disclosure of sources explains which data generated on which platforms were used in which analysis. The flip side of this disclosure is that those who create user-generated content would be able to trace their digital footprint. User-generated content creators could detect and report errors and also be aware of their overall data profile. Academic big data research on social media was initially questioned because of opaque sources from private companies. Source disclosure increases confidence in data quality and reliability.

Transparency of platforms considers organizations that provide services that create user-generated content. Transparency within the organization allows for internal monitoring. While part of normal business operations, someone with command and control is able to view personally identifiable information about the activities of others. The car ride service Uber was fined in 2014 because employees used the internal customer tracking system inappropriately. Some view this as a form of corporate surveillance because it includes monitoring customers and employees.

Transparency of the analytics industry discloses how the big data market functions. Industry transparency of operations might establish technical standards or policies for all participating organizations. The World Wide Web Consortium's data provenance standard provides a technical solution by automatically tracing where data originated. Multi-stakeholder groups, such as those for Internet Governance, are a possible tool to establish self-governing policy solutions. The intent is to heighten awareness of the data supply chain from upstream content quality to downstream big data production. Industry transparency of procedure might disclose algorithms and research designs that are used in data-driven decisions.

Big data transparency makes it possible to compare data-driven decisions to other methods. It faces particular challenges because its production process is distributed across a network of individuals and organizations. The process flows from an initial data capture to secondary uses and finally into large-scale analytic projects. Transparency is often associated with fighting potential corruption or attempts to gain unethical power. Given the influence of big data in many aspects of society, the same ideas apply to the transparency of big data.

Criticism

A frequent criticism of transparency is that its unintended consequences may thwart the anticipated goals. In some cases, the trend toward visibility is reversed as those under scrutiny stop creating findable traces and turn to informal mechanisms of communication.

It is important to note that a transparency label may be used to legitimize authority without any substantive information exchange. Large amounts of information released under the name of transparency may not, in practice, provide the intended result. Helen Margetts (1999) questions whether unfiltered data dumps obscure more than they reveal. Real-time transparency may lack
meaningful engagement because it requires intermediary interpretation. This complaint has been lodged against open data transparency initiatives which did not release crucial information.

Implementation of big data transparency is constrained by complex technical and business issues. Algorithms and other technology are layered together, each with its own embedded assumptions. Business agreements about the exchange of data may be private, and their release may impact market competition. Scholars question how to analyze and communicate what drives big data, given these complexities.

Other critics question whether what is learned through disclosure is looped back into the system for reform or learning. Information disclosed for transparency may not be channeled to the right places or people. Without any feedback mechanism, transparency can be a failure because it does not drive change. Ideally, either organizations improve performance or individuals make new consumer choices.

Summary

Transparency is a governance mechanism for disclosing activities and decisions that profoundly enhances confidence in big data. It builds on existing corporate and government transparency efforts to monitor the visibility of operations and procedures. Transparency scholarship builds on earlier research that examines the relationship between power and information. Transparency of big data evaluates the risks and opportunities of aggregating sources for large-scale analytics.

Cross-References

▶ Algorithmic Accountability
▶ Business Process
▶ Data Governance
▶ Economics
▶ Enterprise Data
▶ Privacy
▶ Standardization

Further Readings

Bennis, W. G., Goleman, D., & O'Toole, J. (2008). Transparency: How leaders create a culture of candor. San Francisco: Jossey-Bass.
Fung, A., Graham, M., & Weil, D. (2007). Full disclosure: The perils and promise of transparency. New York: Cambridge University Press.
Hood, C., & Heald, D. (Eds.). (2006). Transparency: The key to better governance? Oxford/New York: Oxford University Press.
Margetts, H. (1999). Information technology in government: Britain and America. London: Routledge.
Washington, A. L. (2014). Government information policy in the era of big data. Review of Policy Research, 31(4). doi:10.1111/ropr.12081.
procedures. Transparency scholarship builds on
societies need to take advantage of new technologies and crowd-sourced data and improve digital connectivity in order to empower citizens with information that can contribute to progress towards wider development goals. While there are many data sets available about the state of global education, it is argued that better data could be generated, even around basic measures such as the number of schools. In fact, rather than focus on "big data," which has captured the attention of many leaders and policy-makers, more efforts should focus on "little data," i.e., data that is both useful and relevant to particular communities. Discussions are now shifting to identify which indicators and data should be prioritized.

The UNESCO Institute for Statistics is the organization's own statistics arm; however, much of the data collection and analysis that takes place there relies on much more conventional management and information systems, which in turn rely on national statistical agencies that in many developing countries are often unreliable or heavily focused on administrative data (UNESCO 2012). This means that the data used by UNESCO is often out of date or not detailed enough. As digital technologies have become widely used in many societies, more potential sources of data are generated (Pentland 2013). For example, mobile phones are now used as banking devices as well as for standard communications. Official statistics organizations in many countries and international organizations are still behind in that they have not developed ways to adapt and make use of this data alongside the standard administrative data already collected.

There are a number of innovative initiatives to make better use of survey data and mobile phone-based applications to collect data more efficiently and provide more timely feedback to schools, communities, and ministries on target areas such as enrolment, attendance, and learning achievement. UNESCO could make a significant contribution to a data revolution in education by investing resources in collecting these innovations and making them more widely available to countries.

Access to big data for development, as with all big data sources, presents a number of ethical considerations based around the ownership of data and privacy. This is an area the UN recognizes that policy-makers will need to address to ensure that data will be used safely to address their objectives while still protecting the rights of the people whom the data is about or generated from. Furthermore, there are a number of critiques of big data which make more widespread use of big data by UNESCO problematic: first, claims that big data are objective and accurate representations are misleading; second, not all data produced can be used comparably; third, there are important ethical considerations about the use of big data; and finally, limited access to big data is exacerbating existing digital divides.

The Scientific Advisory Board of the Secretary-General of the United Nations, which is hosted by UNESCO, provided comments on the report on the data revolution in sustainable development. It highlighted concerns over equity and access to data, noting that the data revolution should lead to equity in access and use of data for all. Furthermore, it suggested that a number of global priorities should be included in any agenda related to the data revolution: countries should seek to avoid contributing to a data divide between rich and poor countries; there should be some form of harmonization and standardization of data platforms to increase accessibility internationally; there should be national and regional capacity-building efforts; and there should be a series of training institutes and training programs in order to develop skills and innovation in areas related to data generation and analysis (Manyika et al. 2011). A key point made here is that the quality and integrity of the data generated needs to be addressed, as it is recognized that big data often plays an important role in political and economic decision-making. Therefore a series of standards and methods for the analysis and evaluation of data quality should be developed.

In the journal Nature, Hubert Gijzen of the UNESCO Regional Science Bureau for Asia and the Pacific calls for more big data to help secure a sustainable future (Gijzen 2013). He argues that more data should be collected which can be used to model different scenarios for sustainable societies concerning a range of issues from energy consumption, to improving water conditions, to poverty eradication. Big data, according to Gijzen, has the potential, if coordinated globally between countries, regions, and relevant institutions, to have a big impact on the way societies address some of these global challenges. The United Nations has begun to take action to do this through the creation of the Global Pulse initiative, bringing together experts from the government, academic, and private sectors to consider new ways to use big data to support development agendas. Global Pulse is a network of innovation labs which conduct research on Big Data for Development via collaborations between the government, academic, and private sectors. The initiative is designed especially to make use of the digital data flood that has developed in order to address the development agendas that are at the heart of UNESCO, and the UN more broadly.

The UN Secretary-General's Independent Expert Advisory Group on the Data Revolution for Sustainable Development produced the report "A World That Counts" in November 2014, which suggested a number of key principles with regard to the use of data: data quality and integrity, to ensure clear standards for the use of data; data disaggregation, to provide a basis for comparison; data timeliness, to encourage a flow of high-quality data for use in evidence-based policy-making; data transparency, to encourage systems which allow data to be made freely available; data usability, to ensure data can be made user-friendly; data protection and privacy, to establish international and national policies and legal frameworks for regulating data generation and use; data governance and independence; data resources and capacity, to ensure all countries have effective national statistical agencies; and finally data rights, to ensure human rights remain a core part of any legal or regulatory mechanisms that are developed with respect to big data (United Nations 2014). These principles are likely to influence UNESCO's engagement with big data in the future.

UNESCO, and the UN more broadly, acknowledge that technology has been, and will continue to be, a driver of the data revolution and a wider variety of data sources. For big data that is derived from this technology to have an impact, these data sources need to be leveraged in order to develop a greater understanding of the issues related to the development agenda.

Cross-References

▶ History
▶ International Development
▶ United Nations Global Pulse
▶ United Nations
▶ World Bank

Further Readings

Gijzen, H. (2013). Development: Big data for a sustainable future. Nature, 502, 38.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation. Accessed 12 Nov 2014.

Pentland, A. (2013). The data driven society. Scientific American, 309, 78–83.

UNESCO (2012). Learning analytics. UNESCO Institute for Information Technologies Policy Brief. Available from http://iite.unesco.org/pics/publications/en/files/3214711.pdf. Accessed 11 Nov 2014.

United Nations (2014). A world that counts. United Nations. Available from http://www.unglobalpulse.org/IEAG-Data-Revolution-Report-A-World-That-Counts. Accessed 28 Nov 2014.
objects by combined use of words, numbers, symbols, points, lines, color, shading, coordinate systems, and more. While there are various choices of visual representations for the same piece of data, there are a few general guidelines that can be applied to establish effective and efficient data visualization. The first is to avoid distorting what the data have to say. That is, the visualization should not give a false or misleading account of the data. The second is to know the audience and serve a clear purpose. For instance, the visualization can be a description of the data, a tabulation of the records, or an exploration of the information that is of interest to the audience. The third is to make large datasets coherent. Careful, even artistic, design is required to present the data and information in an orderly and consistent way. The presidential, Senate, and House elections of the United States have been reported with well-presented data visualizations, such as those on the website of The New York Times. The visualization on that website is underpinned by dynamic datasets and can show the latest records simultaneously.

Visualization in the Data Life Cycle

Visualization is crucial in the process from data to information. However, information retrieval is just one of the many steps in the data life cycle, and visualization is useful throughout the whole data life cycle. In conventional understanding, a data life cycle begins with data collection and continues with cleansing, processing, archiving, and distribution. Those steps are from the perspective of data providers. From the perspective of data users, the data life cycle continues with data discovery, access, analysis, and then repurposing. From repurposing, the life cycle may go back to the collection or processing step, restarting the cycle. Recent studies show that there is another step, called concept, before the step of data collection. The concept step covers work such as conceptual models, logical models, and physical models for relational databases, and ontologies and vocabularies for Linked Data in the Semantic Web.

Visualization, or more specifically data visualization, provides support to different steps in the data life cycle. For example, the Unified Modeling Language (UML) provides a standard way to visualize the design of information systems, including the conceptual and logical models of databases. Typical relationships in UML include association, aggregation, and composition at the instance level, generalization and realization at the class level, and general relationships such as dependency and multiplicity. For ontologies and vocabularies in the Semantic Web, concept maps are widely used for organizing concepts in a subject domain and the interrelationships among those concepts. In this way a concept map is the visual representation of a knowledge base. Concept maps are more flexible than UML because they cover all the relationships defined in UML and allow people to create new relationships that apply to the domain at hand (Ma et al. 2014). For example, there are concept maps for the ontology of the Global Change Information System led by the US Global Change Research Program. The concept maps are able to show that a report is a subclass of a publication, and that there are several components in a report, such as chapter, table, figure, array, and image. Recent work in information technologies also enables online visualized tools to capture and explore concepts underlying collaborative science activities, which greatly facilitates the collaboration between domain experts and computer scientists.

Visualization is also used to facilitate data archiving, distribution, and discovery. For instance, the Tetherless World Constellation at Rensselaer Polytechnic Institute recently developed the International Open Government Dataset Catalog, which is a Web-based faceted browsing and search interface to help users find datasets of interest. A facet represents a part of the properties of a dataset, so faceted classification allows the assignment of a dataset to multiple taxonomies, and then datasets can be classified and ordered in different ways. On the user interface of a data center, the faceted classification can be visualized as a number of small windows and options, which allows the data center to hide the complexity of data classification, archiving, and search on the server side.

Visual Analytics

The pervasive existence of visualization in the data life cycle shows that visualization can be applied broadly in data analytics. Yet, in actual practice visualization is often treated as a method to show the result of data analysis rather than as a way to enable interactions between users and complex datasets. That is, the visualization as a result is separated from the datasets upon which the result is generated. Many of the data analysis and visualization tools scientists use nowadays do not allow dynamic and live linking between visual representations and datasets; when a dataset changes, the visualization is not updated to reflect the changes. In the context of Big Data, many socioeconomic challenges and scientific problems facing the world are increasingly linked to interdependent datasets from multiple fields of research, organizations, instruments, dimensions, and formats. Interactions are becoming an inherent characteristic of data analytics with Big Data, which requires new methodologies and technologies of data visualization to be developed and deployed.

Visual analytics is a field of research that addresses the demand for interactive data analysis. It combines many existing techniques from data visualization with those from computational data analysis, such as those from statistics and data mining. Visual analytics is especially focused on the integration of interactive visual representations with the underlying computational process. For example, the IPython Notebook provides an online collaborative environment for interactive and visual data analysis and report drafting. IPython Notebook uses JavaScript Object Notation (JSON) as its document format: each notebook is a JSON document that contains a sequential list of input/output cells. There are several types of cells for different contents, such as text, mathematics, plots, code, and even rich media such as video and audio. Users can design a workflow of data analysis through the arrangement and update of cells in a notebook. A notebook can be shared with others as a normal file, or it can also be shared with the public using online services such as the IPython Notebook Viewer. A completed notebook can be converted into a number of standard output formats, such as HyperText Markup Language (HTML), HTML presentation slides, LaTeX, Portable Document Format (PDF), and more. The conversion is done through a few simple operations, so that once a notebook is complete, a user only needs to press a few buttons to generate a scientific report. The notebook can be reused to analyze other datasets, and the cells inside it can also be reused in other notebooks.

Standards and Best Practices

Any application of Big Data will face the challenges caused by the four dimensions of Big Data: volume, variety, velocity, and veracity. Commonly accepted standards or community consensus are a proven way to reduce the heterogeneities between the datasets at hand. Various standards have already been used in applications tackling scientific, social, and business issues, such as the aforementioned JSON for transmitting data with human-readable text, Scalable Vector Graphics (SVG) for two-dimensional vector graphics, and GeoJSON for representing collections of georeferenced features. There are also organizations coordinating work on community standards. The World Wide Web Consortium (W3C) coordinates the development of standards for the Web; SVG, for example, is an output of the W3C. Other W3C standards include the Resource Description Framework (RDF), the Web Ontology Language (OWL), and the Simple Knowledge Organization System (SKOS). Many of them are used for data in the Semantic Web. The Open Geospatial Consortium (OGC) coordinates the development of standards relevant to geospatial data. For example, the Keyhole Markup Language (KML) was developed for presenting geospatial features in Web-based maps and virtual globes such as Google Earth. The Network Common Data Form (netCDF) was developed for encoding array-oriented data. Most recently, GeoSPARQL was developed for encoding and querying geospatial data in the Semantic Web.

Standards just enable the initial elements for data visualization; domain expertise and novel ideas are needed to put standards into practice (Fox and Hendler 2011). For example, Google Motion Chart adapts the fresh idea of motion charts to extend traditional static charts, and the aforementioned IPython Notebook allows the use of several programming languages and data formats through the use of cells. There are various programming libraries developed for data visualization, and many of them are made available on the Web. D3.js is a typical example of such open source libraries (Murray 2013). The D3 here represents Data-Driven Documents. It is a JavaScript library using digital data to drive the creation and running of interactive graphics in Web browsers. D3.js-based visualization uses JSON as the format of input data and SVG as the format for the output graphics. The OneGeology data portal provides a platform to browse geological map services across the world, using standards developed by both OGC and W3C, such as SKOS and Web Map Service (WMS). GeoSPARQL is a relatively new standard for geospatial data, but there are already featured applications. The demo system of the Dutch Heritage and Location shows the linked open dataset of the National Cultural Heritage with more than 13 thousand archaeological monuments in the Netherlands. Besides GeoSPARQL, GeoJSON and a few other standards and libraries are also used in that demo system.

Cross-References

▶ Data Visualization
▶ Data-Information-Knowledge-Action Model
▶ Interactive Data Visualization
▶ Pattern Recognition

References

Cohen, L., Lehericy, S., Chochon, F., Lemer, C., Rivaud, S., & Dehaene, S. (2002). Language-specific tuning of visual cortex? Functional properties of the visual word form area. Brain, 125(5), 1054–1069.

Fox, P., & Hendler, J. (2011). Changing the equation on scientific data visualization. Science, 331(6018), 705–708.

Ma, X., Fox, P., Rozell, E., West, P., & Zednik, S. (2014). Ontology dynamics in a data life cycle: Challenges and recommendations from a geoscience perspective. Journal of Earth Science, 25(2), 407–412.

Murray, S. (2013). Interactive data visualization for the web. Sebastopol: O'Reilly.

Tufte, E. (1983). The visual display of quantitative information. Cheshire: Graphics Press.
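The JSON-based notebook structure described in the Visual Analytics section above can be sketched concretely. The snippet below is only an indicative approximation, not the exact schema (field names and required metadata have varied across notebook format versions); it illustrates the key point made in that section, that a whole notebook is a single JSON document holding a sequential list of input/output cells.

```python
import json

# Indicative, simplified notebook document: one JSON object whose "cells"
# entry is the sequential list of cells described in the text.
# (Hypothetical minimal example; real notebook files carry more metadata.)
notebook = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown",
         "source": ["# Analysis report\n", "Narrative text for the reader."]},
        {"cell_type": "code",
         "source": ["print(2 + 2)"],
         "outputs": [{"output_type": "stream", "text": ["4\n"]}]},
    ],
}

# Because the notebook is plain JSON, it can be saved, shared as a normal
# file, and post-processed by converters into HTML, LaTeX, PDF, etc.
serialized = json.dumps(notebook, indent=1)
roundtrip = json.loads(serialized)
print(len(roundtrip["cells"]))             # 2
print(roundtrip["cells"][1]["cell_type"])  # code
```

Storing both the code cells and their recorded outputs in one plain-text document is what makes a notebook shareable as a normal file and reusable against other datasets, as noted above.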
White House Big Data Initiative

Gordon Alley-Young
Department of Communications & Performing Arts, Kingsborough Community College – City University of New York, New York, NY, USA

Synonyms

The Big Data Research and Development Initiative (TBDRDI)

Introduction

On March 29, 2012, the White House introduced The Big Data Research and Development Initiative (TBDRDI) at a cost of $200 million. Big data (BD) refers to the collection and interpretation of enormous datasets, using supercomputers running smart algorithms to rapidly uncover important features (e.g., interconnections, emerging trends, anomalies, etc.). The Obama Administration developed TBDRDI because letting the large amounts of instantaneous data continually produced by research and development (R&D) and emerging technology go unprocessed hurts the US economy and society. President Obama requested an all-hands-on-deck approach for TBDRDI, including the public (i.e., government) and private (i.e., business) sectors, to maximize economic growth, education, health, clean energy, and national security (Raul 2014; Savitz 2012). The administration stated that the private sector would lead by developing BD while the government would promote R&D, facilitate private sector access to government data, and shape public policy. Several government agencies made the initial investment in this initiative to advance the tools and techniques required to analyze and capitalize on BD. TBDRDI has been compared by the Obama Administration to previous administrations' investments in science and technology that led to innovations such as the Internet. Critics of the initiative argue that administration BD efforts need to be directed elsewhere.

History of the White House Big Data Initiative

TBDRDI is the White House's $200 million federal agency funded initiative that seeks to secure the US's position as the world's most powerful and influential economy by channeling the information power of BD into social and economic development (Raul 2014). BD is an all-inclusive name for the nonstop supply of sophisticated electronic data that is being produced by a variety of technologies and by scientific inquiry. In short, BD includes any digital file, tag, or data that is created whenever we interact with technology, no matter how briefly (Carstensen 2012). The dilemma posed by BD to the White House, as
# Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_204-1
well as to other countries, organizations, and businesses worldwide, is that so much of it goes unanalyzed due to its sheer volume and the limits of our current technological tools to effectively store, organize, and analyze it. Processing BD is not simple because it requires supercomputing capabilities, some of which are still emerging. Experts argue that up until 2003, only 5 exabytes (EB) of data had been produced; that number has since exploded to over five quintillion bytes of data (approximately 4.3 EB) every 2 days.

The White House Office of Science and Technology Policy (WHOSTP) announced TBDRDI in March 2012 in conjunction with the National Science Foundation (NSF), National Institutes of Health (NIH), US Geological Survey (USGS), Department of Defense (DoD), and Department of Energy (DoE). Key concerns to be addressed by TBDRDI are to manage BD by significantly increasing the speed of scientific inquiry and discovery, bolstering national security, and overhauling US education. TBDRDI is the result of recommendations made in 2011 by the President's Council of Advisors on Science and Technology and represents the US government's wish to get ahead of the BD wave and prevent a cultural lag by revamping its BD practices (Executive Office of the President 2014). John Holdren, Director of WHOSTP, compared the $200 million being invested in BD to prior federal investments in science and technology that are responsible for our current technological age (Scola 2013). The innovations of the technology age ironically have created the BD that makes initiatives such as these necessary.

In addition to the US government agencies that helped to unveil TBDRDI, several other federal agencies had been requested to develop BD management strategies in the time leading up to and following this initiative. A US government fact sheet listed between 80 and 85 BD projects across a dozen federal agencies including, in addition to the departments previously mentioned, the Department of Homeland Security (DHS), Department of Health and Human Services (DHHS), and the Food and Drug Administration (FDA) (Henschen 2012). The White House referred to TBDRDI as placing its bet on BD, meaning that the financial investment in this initiative is expected to yield a significant return for the country in coming years. To this end, President Obama has sought the involvement of public, private, and other (e.g., academia, nongovernmental organizations) experts and organizations to work in a way that emphasizes collaboration. For spearheading TBDRDI and for choosing to stake the future of the country on BD, President Barack Obama has been dubbed the BD president by the media.

Projects of the White House Big Data Initiative

The projects included under the umbrella of TBDRDI are diverse, but they share common themes of emphasizing collaboration (i.e., to maximize resources and eliminate data overlap) and making data openly accessible for its social and economic benefits. One project, undertaken with the co-participation of NIH and Amazon, the world's largest online retailer, aims to provide public access to the 1,000 Genomes Project using cloud computing (Smith 2012). The 1,000 Genomes Project involved scientists and researchers sequencing the genomes of over 1,000 anonymous and ethnically diverse people between 2008 and 2012 in order to better treat illness and predict medical conditions that are genetically influenced. The NIH will deposit 200 terabytes (TB) of genomic data into Amazon's Web Services. According to the White House, this is currently the world's largest collection of human genetic data. In August 2014, the UK reported that it would undertake a 100,000 genomes project that is slated to finish in 2017. The NIH and NSF will also cooperate to fund 15–20 research projects at a cost of $25 million. Other collaborations include the DoE's and University of California's creation of a new facility as part of their Lawrence Berkeley National Laboratory, called the Scalable Data Management, Analysis, and Visualization Institute ($25 million), and the NSF and University of California, Berkeley's geosciences Earth Cube BD project ($10 million).
cosponsored a bipartisan online federal spending database bill (i.e., for USAspending.gov) and as a presidential candidate actively utilized BD techniques (Scola 2013).

TBDRDI comes at a time when the International Data Corporation (IDC) predicts that by 2020, over a third of digital information will generate value if analyzed. Making BD open and accessible will bring businesses an estimated three trillion dollars in profits. Mark Weber, President of US Public Sector for NetApp and a government IT commentator, argues that the value of BD lies in transforming it into quality knowledge for increasing efficiency and better-informed decision-making (CIO Insight 2012). TBDRDI is also said to bolster national security. Kaigham Gabriel, a Google executive and the next CEO and President of Draper Laboratory, argued that the cluttered nature of the BD field allows America's adversaries to hide, and that field is becoming increasingly cluttered: it is estimated that government agencies generated one petabyte (PB), or one quadrillion bytes, of data from 2012 to 2014 (CIO Insight 2012). One would need almost 14,552 64-gigabyte (GB) iPhones in order to store this amount of data. Experts argue that the full extent of the technology and applications required to successfully manage the amounts of BD that TBDRDI could produce, now and in the future, remains to be seen.

President Obama promised that TBDRDI would stimulate the economy and save taxpayer money, and there is evidence to indicate this. The employment outlook for individuals trained in mathematics, science, and technology is strong as the US government attempts to hire sufficient staff to carry out the work of TBDRDI. Hiring across governmental agencies requires the skilled work of deriving actionable knowledge from BD. This responsibility falls largely on a subset of highly trained professionals known as quantitative analysts, or "quants" for short. Currently these employees are in high demand and thus can be difficult to source, as the US government must compete alongside private sector businesses for talent when the latter may be able to provide larger salaries and higher profile positions (e.g., Wall Street firms). Some have argued for the government to invest more money in the training of quantitative analysts to feed initiatives such as this (Tucker 2012).

In terms of cutting overspending, cloud computing (platform-as-a-service technologies) has been identified under TBDRDI as a means to consolidate roughly 1,200 unneeded federal data centers (Tucker 2012). The Obama Administration has stated that it will eliminate 40% of federal data centers by 2015. This is estimated to generate $5 billion in savings. Some in the media applaud the effort and corresponding savings, while some critics of the plan argue that the data centers should be streamlined and upgraded instead. As of 2014, the US government reports that 750 data centers have been eliminated.

In January 2014, after classified information leaks by former NSA contractor Edward Snowden, President Obama asked the White House for a comprehensive review of BD that some argue dampened the enthusiasm for TBDRDI (Raul 2014). The US does not have a specific BD privacy law, leading critics to claim a policy deficit. Others point to the Federal Trade Commission (FTC) Act, Section 5, which prohibits unfair or deceptive acts or practices in or affecting commerce, as being firm enough to handle any untoward business practices that might emerge from BD while flexible enough not to hinder the economy (Raul 2014). Advocates note that the European Union (EU) has adopted a highly detailed privacy policy that has done little to foster commercial innovation and economic growth (Raul 2014).

Conclusion

Other criticism argues that TBDRDI, and the Obama Administration by default, actually serves big business instead of individual consumers and citizens. In support of this argument, critics argue that the administration pressured communications companies to provide more affordable and higher speeds of mobile broadband. As of the summer of 2014, Hong Kong has the world's fastest mobile broadband speeds, which are also some of the most affordable, with South Korea second and Japan third; the US and its neighbor Canada are not even in the top ten list of fastest mobile broadband
speed countries. Supporters of the administration cite that the Obama Administration has instead chosen to emphasize its unprecedented open data initiatives under TBDRDI. The US Open Data Action Plan emphasizes making high-priority US government data both mobile and publicly accessible, while Japan is reported to have fallen behind in open-sourcing its BD, specifically in providing access to its massive stores of state/local data, costing its economy trillions of yen.

Cross-References

▶ Big Data
▶ Cloud or Cloud Computing
▶ Cyberinfrastructure
▶ Defense Advanced Research Projects Agency (DARPA)
▶ Department of Homeland Security
▶ Food and Drug Administration (FDA)
▶ NASA
▶ National Oceanic and Atmospheric Administration
▶ National Science Foundation
▶ Office of Science and Technology Policy
▶ United Nations Global Pulse (Development)
▶ United States Geological Survey (USGS)

References

Carstensen, J. (2012). Berkeley group digs in to challenge of making sense of all that data. Retrieved from http://www.nytimes.com/2012/04/08/us/berkeley-group-tries-

CIO Insight (2012). Can government IT meet the big data challenge? Retrieved from http://www.cioinsight.com/c/a/Latest-News/Big-Data-Still-a-Big-Challenge-for-Government-IT-651653/.

Eddy, N. (2014). Big data proves alluring to federal IT pros. Retrieved from http://www.eweek.com/enterprise-apps/big-data-proves-alluring-to-federal-it-pros.html.

Executive Office of the President (2014). Big data: Seizing opportunities, preserving values. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf.

Henschen, D. (2012). Big data initiative or big government boondoggle? Retrieved from http://www.informationweek.com/software/information-management/big-data-initiative-or-big-government-boondoggle/d/d-id/1103666?

Raul, A. C. (2014). Don't throw the big data out with the bath water. Retrieved from http://www.politico.com/magazine/story/2014/04/dont-throw-the-big-data-out-with-the-bath-water-106168_full.html?print#.U_PA-lb4bFI.

Savitz, E. (2012). Big data in the enterprise: A lesson or two from big brother. Retrieved from http://www.forbes.com/sites/ciocentral/2012/12/26/big-data-in-the-enterprise-a-lesson-or-two-from-big-brother/.

Scola, N. (2013). Obama, the 'big data' president. Retrieved from http://www.washingtonpost.com/opinions/obama-the-big-data-president/2013/06/14/1d71fe2e-d391-11e2-b05f-3ea3f0e7bb5a_story.html.

Smith, J. (2012). White House aims to tap power of government data. Retrieved from https://www.yahoo.com/news/white-house-aims-tap-power-government-data-093701014.html?ref=gs.

Tucker, S. (2012). Budget pressures will drive government IT change. Retrieved from http://www.washingtonpost.com/business/capitalbusiness/budget-pressures-will-drive-government-it-change/2012/08/24/ab928a1e-e898-11e1-a3d2-2a05679928ef_story.html.

UN Global Pulse. (2012). Big data for development: Challenges & opportunities. Retrieved from UN Global Pulse, Executive Office of the Secretary-General, United Nations, New York, NY at http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-UNGlobalPulseJune2012.pdf.
to-make-sense-of-big-data.html?_r=0.
W
White House BRAIN Initiative

Gordon Alley-Young
Department of Communications & Performing Arts, Kingsborough Community College, City University of New York, New York, NY, USA

Synonyms

Brain Research Through Advancing Innovative Neurotechnologies

Institutes of Health (NIH), the Defense Advanced Research Projects Agency (DARPA), and the National Science Foundation (NSF), with matching support for the initiative reported to come from private research institutions and foundations. TWHBI has drawn comparisons to the Human Genome Project (HGP) for the potential scientific discovery that the project is expected to yield. The HGP and TWHBI are also big data projects for the volume of data that they have already produced and will produce in the future.
neural circuitry. A fifth goal is to increase understanding of the biological basis for mental processes by theory building and developing new data analysis tools. The sixth is to innovate technology to better understand the brain so as to better treat disorders. The seventh is to establish and sustain interconnected networks of brain research. Finally, the last goal is to integrate the outcomes of the other goals to discover how dynamic patterns of neural activity get translated into human thought, emotion, perception, and action in illness and in health.

NIH Director Dr. Francis Collins echoed President Obama in publicly stating that TWHBI will change the way we treat the brain and grow the economy (National Institutes of Health 2014). During his 2013 SOTUA, President Obama drew an analogy to the Human Genome Project (HGP), arguing that for every dollar the USA invested in the project, the US economy gained $140. Estimates suggest that the HGP created $800 billion in economic activity. The HGP was estimated to cost $3 billion and take 15 years (i.e., 1990–2005). The project finished 2 years early and under budget at $2.7 billion in 1991 dollars. The HGP is estimated to have cost $3.39–$5 billion in 2003 dollars. TWHBI has a budget of $100 million allocated in budget year 2014, with comparable funds ($122 million) contributed by private investors. A US federal report calls for $4.5 billion in funding for brain research over the next 12 years.

Projects Undertaken by the Initiative

The first research paper believed to be produced under the TWHBI initiative was published on June 19, 2014, by principal investigator Dr. Karl Deisseroth of Stanford University. The research described Deisseroth and his team's innovation of the CLARITY technique, which can remove fat from the brain without damaging its wiring and enable the imaging of a whole transparent brain. Data from the study is being used by international biomedical research projects.

TWHBI was undertaken because it addresses what science, society, and government consider one of the grand challenges of the twenty-first century (i.e., the HGP was previously deemed a grand challenge). Unlocking the secrets of the brain will tell us how the brain can record, process, utilize, retain, and recall large amounts of information. Dr. Geoffrey Ling, deputy director of the Defense Sciences Office at the Defense Advanced Research Projects Agency (DARPA), states that TWHBI is needed to attract young and intelligent people into the scientific community. Ling cites a lack of available funding as a barrier to persuading students to pursue research careers (Vallone 2013). Current NIH director and former HGP director Dr. Francis Sellers Collins notes the potential of TWHBI to create jobs while potentially curing diseases of the brain and the nervous system, for instance, Alzheimer's disease (AD). In 2012 Health and Human Services Secretary Kathleen Sebelius stated the Obama administration's goal to cure AD by 2025. The Alzheimer's Association (AA) estimates that AD/dementia health and care cost $203 billion in 2013 ($142 billion by Medicare/Medicaid); this will reach $1.2 trillion by 2050 (Alzheimer's Association 2013).

Dr. Ling argues that for scientists to craft and validate the hypotheses that build on their knowledge and potentially lead to medical breakthroughs, they need access to the latest research tools. Ling states that some of today's best clinical brain research tools are nonetheless limited and outdated in light of the TWHBI work that remains to be done. To bolster his case for better research tools, Ling uses an analogy whereby the physical brain is hardware and the dynamic processes across the brain's circuits are software. Ling notes that cutting-edge tools can help identify bugs in the brain's software caused by a physical trauma (i.e., to the hardware) that once found might be repairable. The tools necessary for medical research will need to be high-speed tools with a much greater capacity for recording signals from brain cells. TWHBI, by bringing together scientists and researchers from a variety of fields such as nanoscience, imaging, engineering, and informatics, has the greatest opportunity to develop these tools.
Earlier Efforts and Influences

Brain research was emphasized prior to TWHBI by the previous two administrations. The Clinton administration held a White House conference on early childhood development and learning focused on insights gleaned from the latest brain research in 1997. In 2002 the Bush administration's National Drug Control Policy Director John Walters donated millions of dollars of drug-war money to purchase dozens of MRI machines. Their goal was a decade-long, $100 million brain-imaging initiative to study the brain to better understand addiction.

Publicity surrounding TWHBI brings attention to how much science has learned about the brain in a relatively short period of time. In the nineteenth century, brain study focused mostly on what happens when parts of the brain are damaged/removed. For instance, Phineas Gage partially lost his prefrontal cortex in an 1848 accident, and scientists noted how Mr. Gage changed from easygoing and dependable before to angry and irresponsible afterward. From the late eighteenth to mid-nineteenth centuries, pseudoscientists practiced phrenology, or reading a person's mind by handling a person's skull.

Phillip Low, a director of San Diego-based NeuroVigil Inc. (NVI), states that the White House talked to many scientists and researchers while planning TWHBI but did not reveal to these individuals that they were talking to many others, all of whom potentially believed they were the parent of TWHBI. However, the originators of the idea that led to TWHBI are said to be six scientists whose journal article in the June 2012 issue of Neuron proposed a brain-mapping project. The six are A. Paul Alivisatos (University of California Berkeley), Miyoung Chun (The Kavli Foundation), George M. Church (Harvard University), Ralph J. Greenspan (The Kavli Institute), Michael L. Roukes (Kavli Nanoscience Institute), and Rafael Yuste (Columbia University) (Alivisatos et al. 2012). Journalist Steve Connor says the roots of TWHBI lie 10 years earlier, when Microsoft cofounder and philanthropist Paul G. Allen established a brain science institute in Seattle with a $300 million investment. Similarly, with a $500 million investment, billionaire philanthropist Fred Kavli funded brain institutes at Yale, Columbia, and the University of California (Broad 2014). It was primarily scientists from these two institutes that crafted the TWHBI blueprint. Connor states that there are benefits and downsides to TWHBI's connections to private philanthropy. Connor acknowledges that philanthropists are able to invest in risky initiatives in a way that the government cannot, but that this can lead to a self-serving research focus, the privileging of affluent universities at the expense of poorer ones, and a US government that is following the lead of private interests rather than setting the course itself (Connor 2013).

The $100 million for the first phase of TWHBI in fiscal year 2014 comes from three government agencies' budgets, specifically the NIH, DARPA, and NSF. The NIH Blueprint for Neuroscience Research will lead with contributions specifically geared to projects that would lead to the development of cutting-edge, high-speed tools, training, and other resources. This next generation of tools is viewed as vital to the advancement of the initiative. Contributor DARPA will invest in programs that aim to understand the dynamic functions of the brain, noted in Dr. Ling's analogy as the software of the brain, showing breakthrough applications based on the dynamic function insights gained. DARPA also seeks to develop new tools for capturing and processing dynamic neural and synaptic activities. DARPA develops applications for improving the diagnosis and treatment of post-traumatic stress, brain injury, and memory loss sustained through war and battle. Such applications would include generating new information processing systems related to the information processing system in the brain and mechanisms of functional restoration after brain injury. DARPA is mindful that advances in neurotechnology, such as those outlined above, will entail ethical, legal, and social issues that it will oversee via its own experts. Ethics are also at the forefront of TWHBI. Specifically, President Obama identified adhering to the highest standards of research protections as a prime focus. Oversight of ethical issues related to this as well as any other
neuroscience initiative will fall to the administration's Commission for the Study of Bioethical Issues.

The NSF's strength as a contributor to TWHBI is that it will sponsor interdisciplinary research that spans the fields of biology, physics, engineering, computer science, social science, and behavioral science. The NSF's contribution to TWHBI again emphasizes the development of tools and equipment, specifically molecular-scale probes that can sense and record the activity of neural networks. Additionally, the NSF will seek to address the innovations that will be necessary in the field of big data in order to store, organize, and analyze the enormous amounts of data that will be produced. Finally, NSF projects under TWHBI will seek a better understanding of how thoughts, emotions, actions, and memories get represented in the brain.

In addition to federal government agencies, at least four private institutes and foundations have pledged an estimated $122 million to support TWHBI: The Allen Institute (TAI), the Howard Hughes Medical Institute (HHMI), The Kavli Foundation (TKF), and The Salk Institute for Biological Studies (TSI). TAI's strengths lie in large-scale brain research, tools, and data sharing, which is necessary for a big data project such as TWHBI. Starting in March 2012, TAI undertook a 10-year project to unlock the neural code (i.e., how brain activity leads to perception, decision-making, and action). HHMI by comparison is the largest nongovernmental funder of basic biomedical research and has long supported neuroscience research. TKF anticipates drawing on the endowments of existing Kavli Institutes (KI) to fund its participation in TWHBI. This includes funding new KIs. Finally, the TSI, under its dynamic BRAIN initiative, will support cross-boundary research in neuroscience. For example, TSI researchers will map the brain's neural networks to determine their interconnections. TSI scientists will lay the groundwork for solving neurological puzzles such as Alzheimer's/Parkinson's by studying age-related brain differences (The White House 2013).

The work of TWHBI will be spread across affiliated research institutions and laboratories across the USA. The NIH is said to be establishing a bicoastal cochaired working group under Dr. Cornelia Bargmann, a former UCSF professor now with the Rockefeller University in New York City, and Dr. William Newsome from California's Stanford University to specify goals for the NIH's investment and create a multiyear plan for achieving these goals with timelines and costs (University of California San Francisco 2013). On the east coast of the USA, the NIH Blueprint for Neuroscience Research, which draws on 15 of the 27 NIH Institutes and Centers headquartered in Bethesda, MD, will be a leading NIH contributor to TWHBI. Research will occur in nearby Virginia at HHMI's Janelia Farm Research Campus, which focuses on developing new imaging technologies and finding out how information is stored and processed in neural networks. Imaging technology furthers TWHBI's goals of mapping the brain's structures by allowing researchers to create dynamic brain pictures down to the level of single brain cells as they interact with complex neural circuits at the speed of thought.

Conclusion

Contributions to and extensions of TWHBI are also happening on the US west coast and internationally. San Diego State University (SDSU) is contributing to TWHBI via its expertise in clinical and cognitive neuroscience, specifically its investigations to understand and treat brain-based disorders like autism, aphasia, fetal alcohol spectrum (FAS) disorders, and AD. San Diego's NVI, founded in 2007 and advised by Dr. Stephen Hawking, and its founder, CEO, and director Dr. Phillip Low helped to shape the TWHBI initiative. NVI is notable for its iBrain™ single-channel electroencephalograph (EEG) device that noninvasively monitors the brain (Keshavan 2013). Dr. Low has also taken the message of TWHBI international, as he was asked to go to Israel and help develop its own BRAIN initiative. To this end, Dr. Low delivered one of two keynotes for Israel's first International Brain Technology Conference in Tel Aviv in October 2013. Australia also supports TWHBI through
neuroscience research collaboration and increased hosting of the NSF's US research fellows for collaborating on relevant research projects.

Cross-References

▶ Australia
▶ Big Data
▶ Data Sharing
▶ Defense Advanced Research Projects Agency (DARPA)
▶ Engineering
▶ Human Genome Project
▶ Medicare
▶ Medical/Health Care
▶ Medicaid
▶ National Institutes of Health
▶ National Science Foundation
▶ Neuroscience

References

Alivisatos, A. P., Chun, M., Church, G. M., Greenspan, R. J., Roukes, M. L., & Yuste, R. (2012). The brain activity map project and the challenge of functional connectomics. Neuron, 74(6), 970–974.
Alzheimer's Association. (2013). Alzheimer's Association applauds White House Brain Mapping Initiative. Retrieved from Alzheimer's Association National Office, Chicago, IL at http://www.alz.org/news_and_events_alz_association_applauds_white_house.asp
Broad, W. J. (2014). Billionaires with big ideas are privatizing American science. Retrieved from The New York Times, New York, NY at http://www.nytimes.com/2014/03/16/science/billionaires-with-big-ideas-are-privatizing-american-science.html
Connor, S. (2013). One of the biggest mysteries in the universe is all in the head. Retrieved from Independent Digital News and Media, London, UK at http://www.independent.co.uk/voices/comment/one-of-the-biggest-mysteries-in-the-universe-is-all-in-the-head-8791565.html
Keshavan, M. (2013). BRAIN Initiative will tap our best minds. San Diego Business Journal, 34(15), 1.
National Institutes of Health. (2014). NIH embraces bold, 12-year scientific vision for BRAIN Initiative. Retrieved from National Institutes of Health, Bethesda, MD at http://www.nih.gov/news/health/jun2014/od-05.htm
The White House. (2013). Fact sheet: BRAIN Initiative. Retrieved from The White House Office of the Press Secretary, Washington, DC at http://www.whitehouse.gov/the-press-office/2013/04/02/fact-sheet-brain-initiative
University of California San Francisco. (2013). President Obama unveils brain mapping project. Retrieved from the University of California San Francisco at http://www.ucsf.edu/news/2013/04/104826/president-obama-unveils-brain-mapping-project
Vallone, J. (2013). Federal initiative takes aim at treating brain disorders. Investors Business Daily, Los Angeles, CA, p. A04.
organization. This video, known familiarly as "Collateral Murder," shows a United States Apache helicopter shooting Reuters reporters and individuals helping these reporters and seriously injuring two children. There have been two versions of the video released: a shorter, 17-min video and a more detailed 39-min video. Both videos were leaked by WikiLeaks and remain on its website.

WikiLeaks uses a number of different drop boxes in order to obtain documents and maintain the anonymity of the leakers. Many leakers are well versed in anonymity-protective programs such as Tor, which uses what its developers call "onion routing": several layers of encryption to avoid detection. However, in order to make leaking less complicated, WikiLeaks provides instructions on its website for users to skirt around regular detection through normal identifiers. Users are instructed to submit documents in one of many anonymous drop boxes to avoid detection.

In order to verify the authenticity of a document, WikiLeaks performs several forensic tests, including weighing the price of forgery as well as possible motives for falsifying information. On its website, WikiLeaks explains that it verified the now infamous "Collateral Murder" video by actually sending journalists to interview individuals affiliated with the attack. WikiLeaks states simply that when it publishes a document, the fact that it has been published is verification enough. By making information more freely available, WikiLeaks aims to start a larger conversation within the press about access to authentic documents and democratic information.

Funding for WikiLeaks has been a contentious issue since its founding. Since 2009, Assange has noted several times that WikiLeaks is in danger of running out of funding. One of the major causes of these funding shortages is that many corporations (including Visa, MasterCard, and PayPal) have ceased to allow their customers to donate money to WikiLeaks. On the WikiLeaks website, this action is described as the "banking blockade." To work around this banking blockade, many mirror sites (websites that are hosted separately but contain the same information) have appeared, allowing users to access WikiLeaks documents and also donate with "blocked" payment methods. WikiLeaks also sells paraphernalia on its website, but it is unclear if these products fall under the banking blockade restrictions.

Because of his affiliation with WikiLeaks, Julian Assange was granted political asylum in Ecuador in 2012. Prior to his asylum, he had been accused of molestation and rape in Sweden but evaded arrest. In June 2013, Edward Snowden, a former contractor for the National Security Agency (NSA), leaked evidence of the United States spying on its citizens to the UK's The Guardian. On many occasions, WikiLeaks has supported Snowden, helping him apply for political asylum, providing funding, and also providing him with escorts on flights (most notably Sarah Harrison accompanying Snowden from Hong Kong to Russia).

WikiLeaks has been nominated for multiple awards for reporting. Among the awards it has won are the Economist Index on Censorship Freedom of Expression award (2008) and the Amnesty International human rights reporting award (2009, New Media). In 2011, Norwegian citizen Snorre Valen publicly announced that he had nominated Julian Assange for the Nobel Peace Prize, although Assange did not win.

Cross-References

▶ Anonymization
▶ National Security Agency (NSA)
▶ Transparency

Further Readings

Dwyer, D. (n.d.). WikiLeaks' Assange for Nobel Prize? ABC News. Available at: http://abcnews.go.com/Politics/wikileaks-julian-assange-nominated-nobel-peace-prize/story?id=12825383. Accessed 28 Aug 2014.
Greenberg, A. (2012). This machine kills secrets: How wikileakers, cypherpunks, and hacktivists aim to free the world's information. New York: Dutton.
through visuals like graphs, charts, or histograms. Given the multi-language and international nature of Wikipedia, as well as the disproportionate size and activity of the English version in particular, geography is important in its critical discourse. Maps are thus popular visuals to demonstrate disparities, locate concentrations, and measure coverage or influence. Several programs have been developed to create visualizations using Wikipedia data as well. One of the earliest, the IBM History Flow tool, produces images based on stages of an individual article's development over time, giving a manageable, visual form to an imposingly long edit history and the disagreements, vandalism, and controversies it contains.

The Wikipedia database has been and continues to be a valuable resource, but there are limitations to what can be done with its unstructured data. It is downloaded as a relational database filled with text and markup, but the machines that researchers use to process data are not able to understand text like a human, limiting what tasks they can be given. It is for this reason that there have been a number of attempts to extract structured data as well. DBpedia is a database project started in 2007 to put as much of Wikipedia into the Resource Description Framework (RDF) as possible. Whereas content on the web typically employs HTML to display and format text, multimedia, and links, RDF emphasizes not what a document looks like but how its information is organized, allowing for arbitrary statements and associations which effectively make the items meaningful to machines. The article for the film Moonrise Kingdom may contain the textual statement "it was shot in Rhode Island," but a machine would have a difficult time extracting the desired meaning, instead preferring to see a subject "Moonrise Kingdom" with a standard property "filming location" set to the value "Rhode Island."

In 2012, WMF launched Wikidata, its own structured database. In addition to Wikipedia, WMF operates a number of other sites like Wiktionary, Wikinews, Wikispecies, and Wikibooks. Like Wikipedia, these sites are available in many languages, each more or less independent from the others. To solve redundancy issues and to promote resource sharing, the Wikimedia Commons was introduced in 2004 as a central location for images and other media for all WMF projects. Wikidata works on a similar premise with data. Its initial task was to centralize inter-wiki links, which connect, for example, the English article "Cat" to the Portuguese "Gato" and Swedish "Katt." Inter-language links had previously been handled locally, creating links at the bottom of an article to its counterparts at every other applicable version. Since someone adding links to the Tagalog Wikipedia is not likely to speak Swedish, and because someone who speaks Swedish is not likely to actively edit the Tagalog Wikipedia and vice versa, this process frequently resulted in inaccurate translations, broken links, one-way connections, and other complications. Wikidata helps by acting as a single junction for each topic.

A topic, or an item, on Wikidata is given its own page which includes an identification number. Users can then add a list of alternative terms for the same item and a brief description in every language. Items also receive statements connecting values and properties. For example, The Beatles' 1964 album A Hard Day's Night is item Q182518. The item links to the album's Wikipedia articles in 49 languages and includes 17 statements with properties and values. The very common instance of property has the value "album," a property called record label has the value "Parlophone Records," and four statements connect the property genre with "rock and roll," "beat music," "pop music," and "rock music." Other statements describe its recording location, personnel, language, and chronology, and many applicable properties are not yet filled in. Like Wikipedia, Wikidata is an open community project and anybody can create or modify statements. Some of the other properties items are given include names, stage names, pen names, dates, birth dates, death dates, demographics, genders, professions, geographic coordinates, addresses, manufacturers, alma maters, spouses, running mates, predecessors, affiliations, capitals, awards won, executives, parent companies, taxonomic orders, and architects, among many others. So as to operate according to the core Wikipedia tenet of
neutrality, multiple conflicting values are allowed. Property-value pairs can furthermore be assigned their own property-value pairs, such that the record sales property and its value can have the qualifier as of and another value to reflect when the sales figure was accurate. Each property-value pair along the way can be assigned references akin to cited sources on Wikipedia.

Some Wikipedia metadata is easy to locate and parse as fundamental elements of wiki technology: timestamps, usernames, and article titles, for example. Other data is incidental, like template parameters. Design elements that would otherwise be repeated in many articles are frequently copied into a separate template which can then be invoked when relevant, using parameters to customize it for the particular page on which it is displayed. For example, in the top-right corner of articles about books there is typically a neatly formatted table called an infobox which includes standardized information input as template parameters like author, illustrator, translator, awards received, number of pages, Dewey decimal classification, and ISBN number.

A fundamental part of DBpedia and the second goal for Wikidata is the collection of data based on these relatively few structured fields that exist in Wikipedia.

Standardizing the factual information in Wikipedia holds incredible potential for research. Wikidata and DBpedia, used in conjunction with the Wikipedia database, make it possible to, for example, assess article coverage of female musicians as compared to male musicians in different parts of the world. Since they use machine-readable formats, they can also interface with one another and with many other sources like GeoNames, Library of Congress Subject Headings, Internet Movie Database, MusicBrainz, and Freebase, allowing for richer, more complex queries. Likewise, just as these can be used to support Wikipedia research, Wikipedia can be used to support other forms of research and even enhance commercial products. Google, Facebook, IBM, and many others regularly make use of data from Wikipedia and Wikidata in order to improve search results or provide better answers to questions. By creating points of informational intersection and interpretation for hundreds of languages, Wikidata also has potential for use in translation applications and to enhance cultural education. The introduction of Wikidata in 2012, built on an already impressively large knowledge base, and its ongoing development have opened many new areas for exploration and accelerated the pace of experimentation, incorporating the data into many areas of industry, research, education, and entertainment.

Cross-References

▶ Anonymity
▶ Crowdsourcing
▶ Open Data
▶ Semantic Web

Further Reading

Jemielniak, D. (2014). Common knowledge: An ethnography of Wikipedia. Stanford: Stanford University Press.
Krötzsch, M., et al. (2007). Semantic Wikipedia. Web Semantics: Science, Services and Agents on the World Wide Web, 5(4), 251–261.
Leetaru, K. (2012). A big data approach to the humanities, arts, and social sciences: Wikipedia's view of the world through supercomputing. Research Trends, 30, 17–30.
Stefaner, M., et al. Notabilia – Visualizing deletion discussions on Wikipedia. http://www.notabilia.net/
Viégas, F., et al. (2004). Studying cooperation and conflict between authors with history flow visualizations. Paper presented at CHI 2004, Vienna.
where over 150 experts, data scientists, civil society groups, and development practitioners met to analyze various forms of big data and consider how they could be used to tackle development issues. The event was a public acknowledgement of the importance the World Bank placed on expanding awareness of how big data can help combine various data sets to generate knowledge which can in turn foster development solutions.

A report produced in conjunction with the World Bank, Big Data in Action for Development, highlights some of the potential ways in which big data can be used to work toward development objectives and some of the challenges associated with doing so. The report sets out a conceptual framework for using big data in the development sector, highlighting the potential transformative capacity of big data, particularly in relation to raising awareness, developing understanding, and contributing to forecasting.

Using big data to develop and enhance awareness of different issues has been widely acknowledged. Examples include using demographic data in Afghanistan to detect the impacts of small-scale violence outbreaks, using social media content to indicate rises in unemployment or crisis-related stress, and using tweets to recognize where cholera outbreaks were appearing much faster than official statistics could. This ability to gain awareness of situations, experiences, and sentiments is seen to have the potential to reduce reaction times and improve the processes which deal with such situations.

Big data can also be used to develop understanding of societal behaviors (LaValle et al. 2011). Examples include the investigation of Twitter data to explore the relationship between food and fuel price tweets and changes in official price indexes in Indonesia; after the 2010 earthquake in Haiti, mobile phone data was used to track population displacement, and satellite rainfall data was used in combination with qualitative data sources to understand how rainfall affects migration.

Big data is also seen to have potential for contributing to modelling and forecasting. Examples include the use of GPS-equipped vehicles in Stockholm to provide real-time traffic assessments, which are used in conjunction with other data sets such as weather to make traffic predictions, and the use of mobile phone data to predict mobility patterns.

The World Bank piloted some activities in Central America to explore the potential of big data to impact development agendas. This region has historically experienced low frequencies of data collection in traditional forms, such as household surveys, so other forms of data collection were viewed as particularly important. One of these pilot studies used Google Trends data to explore the potential to forecast commodity price changes. Another study, in conjunction with the UN Global Pulse, explored the use of social media content to analyze public perceptions of policy reforms, in particular a gas subsidy reform in El Salvador, highlighting the potential for this form of data to complement other studies on public perception (United Nations Global Pulse 2012).

The report from the World Bank, Big Data in Action for Development, presents a matrix of different ways in which big data could be used in transformational ways toward the development agenda: using mobile data (e.g., reduced mobile phone top-ups as an indicator of financial stress), financial data (e.g., increased understanding of customer preferences), satellite data (e.g., to crowdsource information on damage after an earthquake), internet data (e.g., to collect daily prices), and social media data (e.g., to track parents’ perceptions of vaccination). The example of examining the relationship between food and fuel prices and corresponding changes in official price index measures by using Twitter data (by the UN Global Pulse Lab) is outlined in detail, explaining how it was used to provide an indication of social and economic conditions in Indonesia. This was done by extracting tweets mentioning food and fuel prices between 2011 and 2013 (around 100,000 relevant tweets after filtering for location and language) and analyzing them against corresponding changes in official data sets. The analysis indicated a clear relationship between official food inflation statistics and the number of tweets about food price increases. This study was cited as an example of how big data could be used to analyze public sentiment in addition to objective economic conditions. The examples mentioned here are just some of the activities undertaken by the World Bank to embrace the world of big data.

As with many other international institutions which recognize the potential uses of big data, the World Bank also recognizes that there are a range of challenges associated with the generation, analysis, and use of big data.

One of the most basic challenges for many organizations (and individuals) is gaining access to data, from both government institutions and the private sector. A new ecosystem needs to be developed in which data is made openly available and sharing incentives are in place. The World Bank acknowledges that international agencies will need to address this challenge not only by promoting the availability of data but also by promoting collaboration and mechanisms for sharing data. In particular, a shift in business models will be required to ensure the private sector is willing to share data, and governments will need to design policy mechanisms to ensure the value of big data is captured and shared across departments. Related to this, consideration needs to be given to how to engage the public with this data.

Thinking particularly about the development agenda at the heart of the World Bank, there is a paradox: countries where poverty is high or where development agendas require the most attention are often countries where data infrastructures or technological systems are insufficient. Because the generation of big data relies largely on technological capabilities, those who use or interact with digital sources may be systematically unrepresentative of the larger population that forms the focus of the research.

The ways in which data are recorded also have implications for the results which are interpreted. Where data is recorded passively, there is less potential for bias in the results generated; where data is recorded actively, there is greater potential for the results to be susceptible to selection bias. Furthermore, processing the often very large and unstructured data sets into a more structured form requires expertise both to clean the data and, where necessary, to aggregate it (e.g., if one set of data is collected every hour and another every day). The medium through which data is collected is an important factor to consider as well: mobile phones, for example, produce highly sensitive data, satellite images produce highly unstructured data, and social media platforms produce a great deal of unstructured text which requires filtering and codifying, which in itself requires specific analytic capabilities.

In order to make effective use of big data, those using it also need to consider elements of the data itself. The generation of big data has been driven by advances in technology, yet these advances alone are not sufficient to understand the results which can be gleaned from big data. Transforming vast data sets into meaningful results requires effective human capabilities. Depending on how the data is generated, and by whom, there is scope for bias and therefore misleading conclusions. With large amounts of data, there is a tendency for patterns to be observed where there may be none; because of its nature, big data can give rise to significant statistical correlations, and it is important to remember that correlation does not imply causation. Moreover, just because a large amount of data is available, this does not necessarily mean it is the right data for the question or issue being investigated.

The World Bank acknowledges that for big data to be made effective for development, there will need to be collaboration between practitioners, social scientists, and data scientists to ensure that understanding of real-world conditions, data generation mechanisms, and methods of interpretation are effectively combined. Beyond this, there will need to be cooperation between public and private sector bodies in order to foster greater data sharing and incentivize the use of big data across different sectors. Even when data has been accessed, in nearly all cases it needs to be filtered and made suitable for analysis. Filters require human input and need to be applied carefully, as their use may exclude information and affect the results. Data also needs to be cleaned: mobile data arrives in unstructured form as millions of files, which require time-intensive processing to obtain data suitable for analysis, and analysis of text from social media requires a decision-making process to filter out suitable search terms.

Finally, there are a series of concerns about how privacy is ensured with big data, given that there are often elements of big data which can be sensitive in nature (either to the individual or commercially). This is made more complicated because each country has different regulations about data privacy, which poses particular challenges for institutions working across national boundaries, like the World Bank.

For the World Bank, the use of big data is seen to have potential for improving and changing the international development sector. Underpinning the World Bank’s approach to big data is the recognition that while the technological capacities for generation, storage, and processing of data continue to develop, they need to be accompanied by institutional capabilities that enable big data analysis to contribute to effective actions for development, whether through the strengthening of warning systems, raising awareness, or developing understanding of social systems or behaviors.

The World Bank has begun to consider an underlying conceptual framework around the use of big data, in particular considering the challenges it presents in terms of using big data for development. In the report Big Data in Action for Development, it is acknowledged that big data has great potential to provide valuable input for designing effective development policy recommendations but also that big data is no panacea (Coppola et al. 2014). The World Bank has made clear efforts to engage with the use of big data and has begun to explore areas of clear potential for big data use. However, questions remain about how it can support countries to take ownership and create, manage, and maintain their own data, contributing to their own development agendas in effective ways.

Cross-References

▶ Bank of America
▶ Citigroup Inc
▶ International Development
▶ United Nations
▶ United Nations Global Pulse
▶ World Health Organization

Further Reading

Coppola, A., Calvo-Gonzalez, O., Sabet, E., Arjomand, N., Siegel, R., Freeman, C., & Massarat, N. (2014). Big data in action for development. Washington, DC: World Bank and Second Muse. Available at: http://live.worldbank.org/sites/default/files/Big%20Data%20for%20Development%20Report_final%20version.pdf.

LaValle, S., Lesser, E., Shockley, R., Hopkins, M., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–31.

Lehdonvirta, V., & Ernkvist, M. (2011). Converting the virtual economy into development potential: Knowledge map of the virtual economy. InfoDev/World Bank White Paper, 1, 5–17.

McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–66.

United Nations Global Pulse. (2012). Big data for development: Challenges & opportunities. New York: United Nations.
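The UN Global Pulse study described in this entry, which paired monthly counts of food-price tweets with official inflation figures to see how strongly the two move together, is at its core a simple correlation analysis. A minimal sketch of that comparison follows; the two series below are illustrative stand-ins, not the actual Indonesian data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative monthly series: counts of tweets mentioning food-price
# rises, and official food inflation (% year-on-year) for the same months.
tweet_counts = [120, 340, 510, 800, 650, 900]
food_inflation = [2.1, 3.0, 3.8, 5.2, 4.6, 5.9]

r = pearson(tweet_counts, food_inflation)
print(f"correlation between tweet volume and food inflation: r = {r:.2f}")
```

With real data, the tweet series would first be filtered for location and language, as the study did, and aggregated to the same monthly frequency as the official statistics; and, as the entry stresses, even a strong correlation here would not by itself imply causation.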
problems, there were mixed responses to how the company had handled the situation. As part of its response to the breach, the company sent out emails informing its customers of the problem and urging them to change their passwords. Zappos also provided an 800-number phone service to help its customers through the process of choosing a new password.

However, some experts familiar with the online industry have criticized the moves by Zappos. In an article by Ellen Messmer, John D’Arcy, an Assistant Professor of Information Technology at the University of Notre Dame, argued that Zappos’ response strategy was not appropriate. D’Arcy posits that the company’s decision to terminate customers’ passwords created a sense of panic among its customers. In contrast, other analysts claim that Zappos’ public response to the situation was the right move, communicating with its customers openly. Nevertheless, according to D’Arcy, Zappos did a good job of getting information about the security breach out to the public as soon as possible, which typically benefits customers and creates favorable reactions. In terms of the cost of security breaches, the Ponemon Institute estimates that, on average, a data breach costs $277 per compromised record.

Lawsuits

After the security breach, dozens of lawsuits were filed. Zappos attempted to send the lawsuits to arbitration, citing its user agreement. In the fall of 2012, a federal court struck down Zappos.com’s user agreement, according to Eric Goldman, a professor of law at Santa Clara University School of Law who writes about Internet law, intellectual property, and advertising law. He states that Zappos made mistakes that are easily avoidable. The courts typically divide user agreements into one of three groups: “clickwraps” or “click-through agreements,” “browsewraps,” and “clearly not a contract.” Goldman argues that click-through agreements are effective in courts, unlike browsewraps. Browsewraps are user agreements that bind users simply for browsing the website, and the courts ruled that Zappos presented its user agreement as a browsewrap. Furthermore, Zappos claimed on its website that the company reserved the right to amend the contract whenever it saw fit. Although other companies use this language online, it is detrimental to a contract: because Zappos could amend the terms of the user agreement at any time, the arbitration clause was susceptible to change as well, and the courts therefore ruled the clause unenforceable. Goldman posits that the court ruling left Zappos in a bad position because all of its risk management provisions became ineffective. In other words, losing the contract left Zappos without its waiver of consequential damages, its disclaimer of warranties, its clause restricting class actions in arbitration, and its reduced statute of limitations. Conversely, according to Goldman, companies that use click-through agreements and remove clauses stating that they can amend the contract unilaterally are in a better legal position.

Holacracy

Zappos CEO Tony Hsieh announced in November 2013 that his company would be implementing the management style known as Holacracy. With Holacracy, there are two key elements that Zappos will follow: distributed authority and self-organization. According to an article by Nicole Leinbach-Reyhle, distributed authority allows employees to evolve the organization’s structure by responding to real-world circumstances. In regard to self-organization, employees have the authority to engage in useful action to express their purpose as long as it does not “violate the domain of another role.” There is a common misunderstanding that Holacracy is nonhierarchical when in fact it is strongly hierarchical, distributing power within the organization. This approach to management creates an atmosphere where employees can speak up, evolving into leaders rather than followers. Hsieh states that he is trying to structure Zappos less like a bureaucratic corporation and more like a city, which he expects to result in increased productivity and innovation. To date, with 1,500 employees, Zappos is the largest company to adopt the Holacracy management model.

Innovation

The work environment at Zappos has become known for its unique corporate culture, which incorporates fun and humor into daily work. As stated on Zappos.com, the company has a total of ten core values: “deliver WOW through service, embrace and drive change, create fun and a little weirdness, be adventurous, creative, and open-minded, pursue growth and learning, build open and honest relationships with communication, build a positive team and family spirit, do more with less, be passionate and determined, and be humble.” Nicole Leinbach-Reyhle writes that Zappos’ values help encourage its employees to think outside of the box.

To date, Zappos is a billion-dollar online retailer that has expanded beyond selling shoes. The company is also making waves with its corporate culture and hierarchy. Additionally, information technology plays a huge role in the corporation, serving its customers and the business. Building on its growing success, Zappos is keeping true to its mission statement “to provide the best customer service possible.” It is evident that Zappos will continue to make positive changes for the corporation and its corporate headquarters in Las Vegas. In 2013, Zappos CEO Tony Hsieh committed $350 million to rebuild and renovate the downtown Las Vegas region. As Sara Corbett notes in her article, he hopes to change the area into a start-up fantasyland.

Cross-References

▶ Bureau of Consumer Protection: Data Breach
▶ Legal Issues
▶ Small Business Enterprises

Further Reading

Corbett, S. (n.d.). How Zappos’ CEO turned Las Vegas into a startup fantasyland. http://www.wired.com/2014/01/zappos-tony-hsieh-las-vegas/

Goldman, E. (n.d.). How Zappos’ user agreement failed in court and left Zappos legally naked. http://www.forbes.com/sites/ericgoldman/2012/10/10/how-zappos-user-agreement-failed-in-court-and-left-zappos-legally-naked/. Accessed Jul 2014.

Leinbach-Reyhle, N. (n.d.). Shedding hierarchy: Could Zappos be setting an innovative trend? http://www.forbes.com/sites/nicoleleinbachreyhle/2014/07/15/shedding-hierarchy-could-zappos-be-setting-an-innvoative-trend/. Accessed Jul 2014.

Messmer, E. (n.d.). Zappos data breach response a good idea or just panic mode? Online shoe and clothing retailer Zappos has taken assertive steps after breach, but is it enough? http://www.networkworld.com/article/2184860/malware-cybercrime/zappos-data-breach-response-a-good-idea-or-just-panic-mode-.html. Accessed Jul 2014.

Ponemon Group. (n.d.). 2013 cost of data breach study: Global analysis. http://www.ponemon.org. Accessed Jul 2014.

Zappos. (n.d.). http://www.zappos.com. Accessed Jul 2014.
They can charge more for ads that appear during a search for homes in Beverly Hills than in Bismarck, North Dakota. Some 57,000 agents spend an average of $4,000 every year for leads on new buyers and sellers. Zillow keeps a record of how many times a listing has been viewed, which may help in price negotiations among agents, buyers, and sellers. Real estate agents can subscribe to silver, gold, or platinum programs to get CRM (customer relationship management) tools, their photo in listings, a web site, and more. Basic plans start at 10 dollars a month.

Zillow’s mortgage marketplace also earns it revenue. Potential homebuyers can find and engage with mortgage brokers and firms. The mortgage marketplace tells potential buyers what their monthly payment would be and how much they can afford, and lets them submit loan requests and get quotes from various lenders. In the third quarter of 2013, Zillow’s mortgage marketplace received 5.9 million loan requests from borrowers (more than in all of 2011), which grew its revenue stream 120% to $5.7 million. A majority of Zillow’s revenue comes from the real estate segment that lets users browse homes for sale and for rent; this earned over $35 million in the third quarter of 2013.

Analysts and shareholders have voiced some concerns over Zillow’s business model. Zillow now spends over 70% of its revenues on sales and marketing, as opposed to 33% for LinkedIn and between 21% and 23% for IBM and Microsoft. Spending on television commercials and online ads for its services seems to have diminishing returns for Zillow, which is spending more and more on marketing for the same net profit. What once seemed like a sure-fire endeavor, making money by connecting customers to agents through relevant and concise management of huge amounts of data, is no longer a sure thing. Zillow will have to continually evolve its business model if it is to stay afloat.

Zillow and the Real Estate Industry

Zillow has transformed the real estate industry by finding new and practical ways to make huge amounts of data accessible to ordinary people. Potential buyers no longer need to contact a real estate agent before searching for homes; they can start a detailed search on just about any house in the country from their own mobile or desktop device. This is empowering for consumers, but it shakes up an industry that has long relied on human agents. These agents made it their business to know specific areas, learn the ins and outs of a given community, and then help connect interested buyers to the right home. Sites that give users a tool to peer into huge amounts of data (like Zillow) are useful up to a point, but some critics feel only a human being who is local and present in a community can really serve potential buyers.

Because it aggregates multiple national and MLS listing sites, Zillow is rarely perfect. Any big data computing service that works with offline or subjective entities, and real estate prices certainly fit this description, will have to make logical (some would say illogical) leaps where information is scarce. When Zillow does not have exact or current data on a house or neighborhood, it “guesses.” When prices come in too high, sellers develop unrealistic expectations of the potential price of their home, and buyers, too, may end up paying more for a home than it is actually worth.

A human expert, the real estate agent, has traditionally been the authority in this area, yet people are still surprised when too much stock is put into an algorithm. Zillow’s zestimates tend to work best for midrange homes in an area where there are plenty of comparable houses. Zestimates are less accurate for low- and high-end homes because there are fewer comps (comparable houses for sale or recently sold). Similarly, zestimates of rural, unique, or fixer-upper homes are difficult to gauge. Local MLS sites may have more detail on a specific area, but Zillow has broader, more general information over a larger area. The company estimates its coverage of American homes to be around 57%.

Real estate data is more difficult to come by in some areas. Texas doesn’t provide public records of housing transaction prices, so Zillow had to access sales data from property databases through real estate brokers. Because of the high number of cooperative buildings, New York City is another difficult area in which to gauge real estate prices. Tax assessments are made on the co-ops, not the individual units, which removes that factor from zestimate calculations. Additional information, like square footage or amenities, is also difficult to come by, forcing Zillow to seek out alternative sources.

Of course, zestimates can be accurate as well. As previously noted, when the house is midrange and in a neighborhood with plenty of comps (and thus plenty of data), zestimates can be very good indicators of the home’s actual worth. As Zillow’s zestimates, and the sources from which factoring information is drawn, continue to evolve, the service may continue growing in popularity. The more popular Zillow becomes, the more incentive real estate agents will have to list all of their housing database information with the service. Agents know that, in a digital society, speed is key: 74% of buyers and 76% of sellers will work with the first agent with whom they talk.

Recently, Zillow has recognized a big shift to mobile: about 70% of Zillow’s usage now occurs on mobile platforms. This trend is concurrent with other platforms’ shift to mobile usage; Facebook, Instagram, Zynga, and others have begun to recognize and monetize users’ access from smartphones and tablets. For real estate, this mobile activity is about more than just convenience: users can find information on homes in real time as they drive around a neighborhood, looking directly at the potential homes, and contact the relevant agent before they get home. This sort of activity bridges the traditional brick-and-mortar house hunting of the past with the instant big data access of the future (and, increasingly, the present). Zillow has emerged as a leader in its field by connecting its customers not just to big data but to the right data at the right time and place.

Cross-References

▶ Data-Driven Marketing
▶ Digitization
▶ E-Commerce
▶ Real Estate/Housing
▶ Utilities Industry

Further Readings

Arribas-Bel, D. (2014). Accidental, open and everywhere: Emerging data sources for the understanding of cities. Applied Geography, 49, 45–53.

Cranshaw, J., Schwartz, R., Hong, J. I., & Sadeh, N. M. (2012). The Livehoods project: Utilizing social media to understand the dynamics of a city. In ICWSM.

Hagerty, J. R. (2007). How good are Zillow’s estimates? Wall Street Journal.

Huang, H., & Tang, Y. (2012). Residential land use regulation and the US housing price cycle between 2000 and 2009. Journal of Urban Economics, 71(1), 93–99.

Wheatley, M. (n.d.). Zillow-Trulia merger will create boundless new big data opportunities. http://siliconangle.com/blog/2014/07/31/zillow-trulia-merger-will-create-boundless-new-big-data-opportunities/. Accessed Sept 2014.
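Zillow's actual zestimate model is proprietary, but the comps-based logic this entry describes (estimates are strongest where many comparable sales exist) can be illustrated with a deliberately naive estimator that scales the median price per square foot of nearby recent sales. All prices, sizes, and the function itself are hypothetical, not Zillow's method.

```python
from statistics import median

def estimate_price(target_sqft, comps):
    """Naive comps-based estimate: take the median price per square
    foot across comparable recent sales, then scale by the target
    home's size. `comps` is a list of (sale_price, sqft) tuples."""
    if not comps:
        raise ValueError("no comparable sales available")
    per_sqft = [price / sqft for price, sqft in comps]
    return median(per_sqft) * target_sqft

# Hypothetical recent sales in the same neighborhood.
comps = [(300_000, 1500), (330_000, 1600), (290_000, 1450), (350_000, 1700)]
print(f"estimated value: ${estimate_price(1550, comps):,.0f}")
```

The fewer comps available, the less stable that median becomes, which mirrors why rural, unique, and high-end homes are the hardest cases for an automated estimate.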
Synonyms

E-agriculture; Precision agriculture; Precision farming

Definition

The term stems from the blending of the two words agriculture and informatics and refers to the application of informatics to the analysis, design, and development of agricultural activities. It overarches expressions such as Precision Agriculture (PA), Precision Livestock Farming (PLF), and agricultural landscape analysis and planning. The adoption of AgInformatics can accelerate agricultural development by providing farmers and decision makers with more accessible, complete, timely, and accurate information. However, it is still hindered by a number of important yet

Generalities

This relatively new expression derives from a combination of the two terms agriculture and informatics, hence alluding to the application of informatics to the analysis, design, and development of agricultural activities. It broadly involves the study and practice of creating, collecting, storing and retrieving, manipulating, classifying, and sharing information concerning both natural and engineered agricultural systems. The domains of application are mainly agri-food and environmental sciences and technologies, while sectors include biosystems engineering, farm management, crop production, and environmental monitoring. In this respect, it encompasses the management of the information coming from applications and advances of information and communication technologies (ICTs) in agriculture (e.g., global navigation satellite systems, GNSS; remote sensing, RS; wireless sensor networks, WSN; and radio-frequency identification, RFID), performed through specific agriculture information systems, models, and methodologies (e.g., farm management information systems, FMIS; GIScience analyses; data mining; and decision support systems, DSS).

AgInformatics is an umbrella concept that includes and overlaps issues covered in precision agriculture (PA), precision livestock farming (PLF), and agricultural landscape analysis and planning, as follows.

Precision Agriculture (PA)
The term PA was coined in 1929 and later defined as “a management strategy that uses information technologies to bring data from multiple sources to bear on decisions associated with crop production” (Li and Chung 2015). The concept has evolved since the late 1980s due to new fertilization equipment, dynamic sensing, crop yield monitoring technologies, and GNSS technology for automated machinery guidance.

PA technology has therefore provided farmers with the tools (e.g., built-in sensors in farming machinery, GIS tools for yield monitoring and mapping, WSNs, satellite and low-altitude RS by means of unmanned aerial systems (UAS), and recently robots) and information (e.g., weather, environment, soil, crop, and production data) needed to optimize and customize the timing, amount, and placement of inputs including seeds, fertilizers, pesticides, and irrigation; these practices were later applied also inside closed environments, buildings, and facilities, such as for protected cultivation.

To accomplish the operational functions of a complex farm, FMISs for PA are designed to manage information about processes, resources (materials, information, and services), procedures and standards, and characteristics of the final products (Sørensen et al. 2010). Nowadays dedicated FMISs operate on networked online frameworks and are able to process huge amounts of data. The execution of their functions implies the adoption of various management systems, databases, software architectures, and decision models. Relevant examples of information management between different actors are supply chain information systems (SCIS), including those specifically designed for traceability and supply chain planning.

Recently, PA has evolved toward predictive and prescriptive agriculture. Predictive agriculture involves combining and using large amounts of data to improve knowledge and predict trends, whereas prescriptive agriculture involves the use of detailed, site-specific recommendations for a farm field. Today PA embraces new terms such as precision citrus farming, precision horticulture, precision viticulture, precision livestock farming, and precision aquaculture (Li and Chung 2015).

Precision Livestock Farming (PLF)
The increase in activities related to livestock farming triggered the definition of the new term precision livestock farming (PLF), namely, the real-time monitoring technologies aimed at managing the temporal variability of the smallest manageable production unit, known as “the per animal approach” (Berckmans 2004). PLF consists in the real-time gathering of data related to livestock animals and their close environment, applying knowledge-based computer models, and extracting useful information for automatic monitoring and control purposes. It implies monitoring animal health, welfare, behavior, and performance, as well as the early detection of illness or a specific physiological status, and it unfolds in several activities including real-time analysis of sounds, images, and accelerometer data, live weight assessment, condition scoring, and online milk analysis. In PLF, continuous measurements and a reliable prediction of variation in animal data or animal response to environmental changes are integrated in the definition of models and algorithms that allow for taking control actions (e.g., climate control, feeding strategies, and therapeutic decisions).

Agricultural Landscape Analysis and Planning
Agricultural landscape analysis and planning is increasingly based on the development of interoperable spatial data infrastructures (SDIs) that

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data, DOI 10.1007/978-3-319-32001-4_218-1
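The PLF control loop described in this entry (continuous per-animal measurements compared against an expected baseline to trigger control actions) can be illustrated with a deliberately simplified monitor. The animals, activity counts, and fixed threshold below are hypothetical; a real PLF system would rely on validated behavioral models rather than a single drop ratio.

```python
# Hypothetical PLF-style monitor: flag animals whose daily activity
# drops well below their own recent baseline (possible early sign
# of illness), so that a control action or inspection can follow.
def flag_low_activity(history, today, drop_ratio=0.6):
    """history: past daily activity counts for one animal.
    Returns True if today's count falls below drop_ratio * baseline."""
    baseline = sum(history) / len(history)
    return today < drop_ratio * baseline

herd = {
    "cow_17": ([520, 540, 500, 530], 180),  # sharp drop in activity
    "cow_23": ([480, 470, 500, 490], 465),  # ordinary day
}
alerts = [tag for tag, (hist, today) in herd.items()
          if flag_low_activity(hist, today)]
print("animals to check:", alerts)
```

In practice the same pattern extends to sound, image, and milk-analysis streams: each measurement is compared with a per-animal prediction, and sustained deviations drive feeding, climate, or therapeutic decisions.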
Big Data Quality

Subash Thota
Synectics for Management Decisions, Inc., Arlington, VA, USA

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data, DOI 10.1007/978-3-319-32001-4_240-1

Introduction

Data is the most valuable asset for any organization. Yet in today's world of big and unstructured data, more information is generated than can be collected and properly analyzed. The onslaught of data presents obstacles to creating data-driven decisions. Data quality is an essential characteristic of data that determines the reliability of data for making decisions in any organization or business. Errors in data can cost a company millions of dollars, alienate customers, and make implementing new strategies difficult or impossible (Redman 1995).

In practically every business instance, project failures and cost overruns are due to fundamental misunderstanding about the data quality that is essential to the initiative. A global data management survey by PricewaterhouseCoopers of 600 companies across the USA, Australia, and Britain showed that 75% of reported significant problems were a result of data quality issues, with 33% of those saying the problems resulted in delays in getting new business intelligence (BI) systems running or in having to scrap them altogether (Capehart and Capehart 2005). The importance and complexity related to data and its quality compounds incrementally and could potentially challenge the very growth of the business that acquired the data. This entry is intended to showcase challenges related to data quality and approaches to mitigating data quality issues.

Data Defined

Data is "... language, mathematical or other symbolic surrogates which are generally agreed upon to represent people, objects, events and concepts" (Liebenau and Backhouse 1990). Vayghan et al. (2007) argued that most enterprises deal with three types of data: master data, transactional data, and historical data. Master data are the core data entities of the enterprise, i.e., customers, products, employees, vendors, suppliers, etc. Transactional data describe an event or transaction in an organization, such as sales orders, invoices, payments, claims, deliveries, and storage records. Transactional data is time bound and changes to historical data once the transaction has ended. Historical data contain facts, as of a certain point in time (e.g., database snapshots), and version information.

Data Quality

Data quality is the capability of data to fulfill and satisfy the stated business, framework, system, and technical requirements of an enterprise. A classic
definition of data quality is "fitness for use," or, more specifically, the extent to which some data successfully serve the purposes of the user (Tayi and Ballou 1998; Cappiello et al. 2003; Lederman et al. 2003; Watts et al. 2009).

To be able to correlate data quality issues to business impacts, we must be able to classify both our data quality expectations and our business impact criteria. In order to do that, it is valuable to understand these common data quality dimensions (Loshin 2006):

– Completeness: Is all the requisite information available? Are data values missing, or in an unusable state? In some cases, missing data is irrelevant, but when the information that is missing is critical to a specific business process, completeness becomes an issue.
– Conformity: Are there expectations that data values conform to specified formats? If so, do all the values conform to those formats? Maintaining conformance to specific formats is important in data representation, presentation, aggregate reporting, search, and establishing key relationships.
– Consistency: Do distinct data instances provide conflicting information about the same underlying data object? Are values consistent across data sets? Do interdependent attributes always appropriately reflect their expected consistency? Inconsistency between data values plagues organizations attempting to reconcile different systems and applications.
– Accuracy: Do data objects accurately represent the "real-world" values they are expected to model? Incorrect spellings of products, personal names, or addresses, and even untimely or not-current data can impact operational and analytical applications.
– Duplication: Are there multiple, unnecessary representations of the same data objects within your data set? The inability to maintain a single representation for each entity across your systems poses numerous vulnerabilities and risks.
– Integrity: What data is missing important relationship linkages? The inability to link related records together may actually introduce duplication across your systems. Not only that: as more value is derived from analyzing connectivity and relationships, the inability to link related data instances together impedes this valuable analysis.

Causes and Consequences

The "Big Data" era comes with new challenges for data quality management. Beyond volume, velocity, and variety lies the importance of the fourth "V" of big data: veracity. Veracity refers to the trustworthiness of the data. Due to the sheer volume and velocity of some data, one needs to embrace the reality that when data is extracted from multiple datasets at a fast and furious clip, determining the semantics of the data – and understanding correlations between attributes – becomes of critical importance.

Companies that manage their data effectively are able to achieve a competitive advantage in the marketplace (Sellar 1999). On the other hand, bad data can put a company at a competitive disadvantage (Greengard 1998). It is therefore important to understand some of the causes of bad data quality:

• Lack of data governance standards or validation checks.
• Data conversion, which usually involves transfer of data from an existing data source to a new database.
• Increasing complexity of data integration and enterprise architecture.
• Unreliable and inaccurate sources of information.
• Mergers and acquisitions between companies.
• Manual data entry errors.
• Upgrades of infrastructure systems.
• Multidivisional or line-of-business usage of data.
• Misuse of data for purposes different from the capture reason.

Different people performing the same tasks have a different understanding of the data being processed, which leads to inconsistent data making its way into the source systems. Poor data
quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits (Friedman and Smith 2011). Marsh (2005) summarizes the consequences in one of his articles:

• Eighty-eight percent of all data integration projects either fail completely or significantly overrun their budgets.
• Seventy-five percent of organizations have identified costs stemming from dirty data.
• Thirty-three percent of organizations have delayed or canceled new IT systems because of poor data.
• $611B per year is lost in the USA to poorly targeted bulk mailings and staff overheads.
• According to Gartner, bad data is the number one cause of customer-relationship management (CRM) system failure.
• Less than 50% of companies claim to be very confident in the quality of their data.
• Business intelligence (BI) projects often fail due to dirty data, so it is imperative that BI-based business decisions are based on clean data.
• Only 15% of companies are very confident in the quality of external data supplied to them.
• Customer data typically degenerates at 2% per month, or 25% annually.

According to Marsh, organizations typically overestimate the quality of their data and underestimate the cost of data errors. Business processes, customer expectations, source systems, and compliance rules are constantly changing – and data quality management systems must reflect this. Vast amounts of time and money are spent on custom coding and "firefighting" to dampen an immediate crisis rather than dealing with the long-term problems that bad data can present to an organization.

Data Quality: Approaches

Due to the large variety of sources from which data is collected and integrated, and its sheer volume and changing nature, it is impossible to manually specify data quality rules. Below are a few approaches to mitigating data quality issues:

1. Enterprise Focus and Discipline

Enterprises should be more focused and engaged toward data quality issues; views toward data cleansing must evolve. Clearly defining roles and outlining the authority, accountability, and responsibility for decisions regarding enterprise data assets provides the necessary framework for resolving conflicts and driving a business forward as the data-driven organization matures. Data quality programs are most efficient and effective when they are implemented in a structured, governed environment.

2. Implementing MDM and SOA

The goal of a master data management (MDM) solution is to provide a single source of truth for data, thus providing a reliable foundation for that data across the organization. This prevents business users across an organization from using different versions of the same data. Another approach to big data and big data governance is the deployment of cloud-based models and service-oriented architecture (SOA). SOA enables the tasks associated with a data quality program to be deployed as a set of services that can be called dynamically by applications. This allows business rules for data quality enforcement to be moved outside of applications and applied universally at a business process level. These services can either be called proactively by applications as data is entered into an application system, or in batch after the data has been created.

3. Implementing Data Standardization and Data Enrichment

Data standardization usually covers reformatting of user-entered data without any loss of information or enrichment of information. Such solutions are most suitable for applications that integrate data. Data enrichment covers the reformatting of data with additional enrichment or addition of useful referential and analytical information.
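As a rough illustration of the completeness, conformity, and duplication dimensions discussed in this entry, the following minimal Python sketch checks a handful of records; the records, field names, and email format rule are illustrative assumptions, not taken from any particular system:

```python
import re

# Hypothetical customer records used only to exercise the checks below.
records = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},
    {"id": 2, "name": "", "email": "grace@example.com"},            # incomplete
    {"id": 3, "name": "Alan Turing", "email": "not-an-email"},      # nonconforming
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},  # duplicate
]

# Illustrative conformity rule: a very loose email format.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def completeness_issues(rows, required=("id", "name", "email")):
    """Rows missing a required value (the Completeness dimension)."""
    return [r for r in rows if any(not r.get(f) for f in required)]

def conformity_issues(rows):
    """Rows whose email does not match the expected format (Conformity)."""
    return [r for r in rows if not EMAIL_RE.match(r["email"])]

def duplication_issues(rows, key="id"):
    """Key values that appear more than once (Duplication)."""
    seen, dupes = set(), set()
    for r in rows:
        if r[key] in seen:
            dupes.add(r[key])
        seen.add(r[key])
    return dupes

print(len(completeness_issues(records)))  # 1 (empty name)
print(len(conformity_issues(records)))    # 1 (malformed email)
print(duplication_issues(records))        # {1}
```

In a governed environment, rules like these would be maintained centrally (for example, exposed as the kind of data quality services the SOA approach above describes) rather than re-implemented inside each application.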
Core Curriculum Issues (Big Data Research/Analysis)

Rochelle E. Tractenberg
Collaborative for Research on Outcomes and Metrics, Washington, DC, USA
Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics; and Rehabilitation Medicine, Georgetown University, Washington, DC, USA

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data, DOI 10.1007/978-3-319-32001-4_285-1

Definition

A curriculum is defined as the material and content that comprises a course of study within a school or college, i.e., a formal teaching program. The construct of "education" is differentiated from "training" based on the existence of a curriculum, through which a learner must progress in an evaluable, or at least verifiable, way. In this sense, a fundamental issue about a "big data curriculum" is what exactly is meant by the expression. "Big data" is actually not a sufficiently concrete construct to support a curriculum, nor even the integration of one or more courses into an existing curriculum. Therefore, the principal "core curriculum issue" for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced through the curriculum. A second core issue is how to appropriately integrate those key knowledge, skills, and abilities (KSAs) into the curricula of those who will not obtain degrees or certificates in disciplines related to big data – but for whom training or education in these KSAs is still desired or intended. A third core issue is how to construct the curriculum – whether the degree is directly related to big data or some key KSAs relating to big data are proposed for integration into another curriculum – in such a way that it is evaluable. Since the technical attributes of big data and its management and analysis are evolving nearly constantly, any curriculum developed to teach about big data must be evaluated periodically (e.g., annually) to ensure that what is being taught is relevant; this suggests that core underpinning constructs must be identified so that learners in every context can be encouraged to adapt to new knowledge rather than requiring retraining or reeducation.

Role of the Curriculum in "Education" Versus "Training"

Education can be differentiated from training by the existence of a curriculum in the former and its absence in the latter. The Oxford English Dictionary defines education as "the process of educating or being educated, the theory and practice of teaching," whereas training is defined as "teaching a particular skill or type of behavior through regular practice and instruction." The United Nations Educational, Scientific and Cultural Organization (UNESCO) highlights the fact that there may be an
articulated curriculum ("intended") but the curriculum that is actually delivered ("implemented") may differ from what was intended. There are also the "actual" curriculum, representing what students learn, and the "hidden" curriculum, which comprises all the bias and unintended learning that any given curriculum achieves (http://www.unesco.org/new/en/education/themes/strengthening-education-systems/quality-framework/technical-notes/different-meaning-of-curriculum/). These types of curricula are also described by the Netherlands Institute for Curriculum Development (SLO, http://international.slo.nl/) and worldwide in multiple books and publications on curriculum development and evaluation.

When a curriculum is being developed or evaluated with respect to its potential to teach about big data, each of these dimensions of that curriculum (intended, implemented, actual, hidden) must be considered. These features, well known to instructors and educators who receive formal training to engage in kindergarten–12th grade (US) or preschool/primary/secondary (UK/Europe) education, are less well known among instructors in tertiary/higher education settings whose training is in other domains – even if their main job will be to teach undergraduate, graduate, postgraduate, and professional students. It may be helpful, in the consideration of curricular elements around big data, for those in the secondary education/college/university setting to consider what attributes characterize the curricula that their incoming students have experienced relating to the same content or topics.

Many modern researchers in the learning domains reserve the term "training" to mean "vocational training." For example, Gibbs et al. (2004) identify training as specifically "skills acquisition" to be differentiated from instruction ("information acquisition"); together with socialization and the development of thinking and problem-solving skills, this information acquisition is the foundation of education overall. Vocational training is defined as a function of skills or behaviors to be learned ("acquired") by practice in situ. When considering big data trainees, defined as individuals who participate in any training around big data that is outside of a formal curriculum, it is important to understand that there is no uniform cognitive schema, nor other contextual support, that the formal curriculum typically provides. Thus, it can be helpful to consider "training in big data" as appropriate for those who have completed a formal curriculum in data-related domains. Otherwise, skills that are acquired in such training, intended for deployment currently and specifically, may actually limit the trainees' abilities to adapt to new knowledge and thereby lead to a requirement for retraining or reeducation.

Determining the Knowledge, Skills, and Abilities Relating to Big Data That Should Be Taught

The principal core curricular issue for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced through the curriculum. As big data has become an increasingly popular construct (since about 2010), different stakeholders in the education enterprise have articulated curricular objectives in computer science, statistics, mathematics, and bioinformatics for undergraduate (e.g., De Veaux et al. 2017) and graduate students (e.g., Greene et al. 2016). These stakeholders include longstanding national or international professional associations and new groups seeking either to establish their own credibility or to define the niche in "big data" where they plan to operate. However, "big data" is not a specific domain that is recognized or recognizable; it has been described as a phenomenon (Boyd and Crawford 2012) and is widely considered not to be a domain for training or education on its own. Instead, knowledge, skills, and abilities relating to big data are conceptualized as belonging to the discipline of data science; this discipline is considered as existing at the intersection of mathematics, computer science, and statistics. This is practically implemented as the articulation of foundational aspects of each of these disciplines together with their formal and purposeful integration into a formal curriculum.
With respect to data science, then, generally, there is agreement that students must develop abilities to reason with data and to adapt to a changing environment, or changing characteristics of data (preferably both). However, there is not agreement on how to achieve these abilities. Moreover, because existing undergraduate course requirements are complex and tend to be comprehensive for "general education" as well as for the content making up a baccalaureate, associate, or other terminal degree in the postsecondary context, in some cases just a single course may be considered for incorporation into either required or elective course lists. This would represent the least coherent integration of big data into a college/university undergraduate curriculum. A program that would award a certificate, minor, or major – if it seeks to successfully prepare students for work in or with big data, statistics and data science, or analytics – and any other program intended to train or prepare people for jobs that either focus on, or simply "know about," big data, must follow the same curricular design principles that every formal educational enterprise should follow. If they do not, they risk underperforming on their advertising and promises.

It is important to consider the role of training in the development, or consideration of development, of curricula that feature big data. In addition to the creation of undergraduate degrees and minors, Master's degrees, post-baccalaureate certificate programs, and doctoral programs, all of which must be characterized by the curricula they are defined and created to deliver, many other "training" opportunities and workforce development initiatives also exist. These are being developed in corporate and other human resource-oriented domains, as well as in more open (open access) contexts. Unlike traditional degree programs, training and education around big data are unlikely to be situated specifically within a single disciplinary context – at least not exclusively. People who have specific skills, or who have created specific tools, often create free or easily accessible representations of the skills or tool – e.g., instructional videos on YouTube, or formal courses of varying lengths that can be read (slides, documentation) or watched as webinars. Examples can be found online at sites including Big Data University (bigdatauniversity.com), created by IBM and freely available, and Coursera (coursera.org), which offers data science, analytics, and statistics courses as well as eight different specializations, comprising curated series of courses – but also many other topics. Coursera has evolved many different educational opportunities and some curated sequences that can be completed to achieve "certification," with different costs depending on the extent of student engagement/commitment. The Open University (www.open.ac.uk) is essentially an online version of regular university courses and curricula (and so is closer to "education" than "training") – degree and certificate programs all have costs associated and can also be considered to follow a formal curriculum to a greater extent than any other option for widely accessible training/learning around big data. These examples represent a continuum that can be characterized by the attention to curricular structure, from minimal (Big Data University) to complete (The Open University). The individual who selects a given training opportunity, as well as those who propose and develop training programs, must articulate exactly what knowledge, skills, and abilities are to be taught and practiced. The challenge for individuals making selections is to determine how correctly an instructor or program developer has described the achievements the training is intended to provide. The challenge for those curating or creating programs of study is to ensure that the learning objectives of the curriculum are met, i.e., that the actual curriculum is as high a match to the intended curriculum as possible. Basic principles of curriculum design can be brought to bear for acceptable results in this matching challenge. The stronger the adherence to these basic principles, the more likely a robust and evaluable curriculum, with demonstrable impact, will result. This is not specific to education around big data, but with all the current interest in data and data science,
these challenges rise to the level of "core curriculum issues" for this domain.

Utility of Training Versus a Curriculum Around Big Data

De Veaux et al. (2017) convened a consensus panel to determine the fundamental requirements for an undergraduate curriculum in "data science." They articulated that the main topical areas that comprise – and must be leveraged for appropriate baccalaureate-level training in – this domain are as follows: data description and curation, mathematical foundations, computational thinking, statistical thinking, data modeling, communication, reproducibility, and ethics. Since computational and statistical thinking, as well as data modeling, all require somewhat different mathematical foundations, this list clearly shows the challenges in selecting specific "training opportunities" to support development of new skills in "big data" for those who are not already trained in quantitative sciences to at least some extent. Moreover, arguments are arising in many quarters (science and society, philosophy/ethics/bioethics, and professional associations like the Royal Statistical Society, American Statistical Association, and Association for Computing Machinery) that "ethics" is not a single entity but, with respect to big data and data science, is a complex – and necessary – type of reasoning that cannot be developed in a single course or training opportunity. The complexity of reasoning that is required for competent work in the domain referred to exchangeably as "data analytics," "data science," and "big data," which includes this ability to reason ethically, underscores the point that piecemeal training will be unsuccessful unless the trainee possesses the ability to organize the new material together with extant (high-level) reasoning abilities, or at least a cognitive/mental schema within which the diverse training experiences can be integrated for a comprehensive understanding of the domain.

However, the proliferation of training opportunities around big data suggests a pervasive sense that a formal curriculum is not actually needed – just training is. This may arise from a sense that the technology is changing too fast to create a whole curriculum around it. Training opportunity creators are typically experts in the domain, but may not necessarily be sufficiently expert in teaching and learning theories, or the domains from which trainees are coming, to successfully translate their expertise into effective "training." This may lead to the development of new training opportunities that appear to be relevant, but which can actually contribute only minimally to an individual trainee's ability to function competently in a new domain like big data, because they do not also include or provide contextualization or schematic links with prior knowledge.

An example of this problem is the creation of "competencies" by subject matter expert consensus committees, which are then used to create "learning plans" or checklists. The subject matter experts undoubtedly can articulate what competencies are required for functional status in their domain. However, (a) a training experience developed to fill in a slot within a competency checklist often fails to support teaching and learning around the integration of the competencies into regular practice; and (b) curricula created in alignment with competencies often do not promote the actual development and refinement of these competencies. Instead, they may tend to favor the checking-off of "achievement of competency X" from the list.

Another potential challenge arises from the opposite side of the problem: learner-driven training development. "What learners want and need from training" should be considered together with what experts who are actually using the target knowledge, skills, and abilities believe learners need from training. However, the typical trainee will not be sufficiently knowledgeable to choose the training that is in fact most appropriate for their current skills and learning objectives. The construct of "deliberate practice" is instructive here. In their 2007 Harvard Business Review article, "The making of an expert," Ericsson, Prietula, and Cokely summarize Ericsson's prior work on expertise and its acquisition, commenting that "(y)ou need a particular kind of practice – deliberate practice – to develop expertise" (emphasis in
original, p. 3). Deliberate practice is practice where weaknesses are specifically identified and targeted – usually by an expert both in the target skillset and perhaps more particularly in identifying and remediating specific weaknesses. If a trainee is not (yet) an expert, determining how best to address a weakness that one has self-identified can be another limitation on the success of a training opportunity, if it focuses on what the learner wants or believes they need without appeal to subject matter experts. This perspective argues for the incorporation of expert opinion into the development, descriptions, and contextualizations of training, i.e., the importance of deliberate practice in the assurance that as much as possible of the intended curriculum becomes the actual curriculum. Training opportunities around big data can be developed to support, or fill gaps in, a formal curriculum; without this context, training in big data may not be as successful as desired.

Conclusions

A curriculum is a formal program of study, and basic curriculum development principles are essential for effective education in big data – as in any other domain. Knowledge, skills, and abilities, and the levels to which these will be both developed and integrated, must be articulated in order to structure a curriculum to optimize the match between the intended and the actual curricula. The principal core curricular issue for teaching and learning around big data is to articulate exactly what knowledge, skills, and abilities are to be taught and practiced. A second core issue is that the "big data" knowledge, skills, and abilities may require more foundational support for training of those who will not obtain, or have not obtained, degrees or certificates in disciplines related to big data. A third core issue is how to construct the curriculum in such a way that the alignment of the intended and the actual objectives is evaluable and modifiable as appropriate. Since the technical attributes of big data and its management and analysis are evolving nearly constantly, any curriculum developed to teach about big data must be evaluated periodically to ensure the relevance of the content; however, the alignment of the intended and actual curricula must also be regularly evaluated to ensure learning objectives are achieved and achievable.

Further Readings

Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662–679.

De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4, 2.1–2.16. doi:10.1146/annurev-statistics-060116-053930. Downloaded from http://www.amstat.org/asa/files/pdfs/EDU-DataScienceGuidelines.pdf. 2 Jan 2017.

Ericsson, K. A., Prietula, M. J., & Cokely, E. T. (2007). The making of an expert. Harvard Business Review, 85(7–8), 114–121, 193. Downloaded from https://hbr.org/2007/07/the-making-of-an-expert. 5 June 2010.

Gibbs, T., Brigden, D., & Hellenberg, D. (2004). The education versus training and the skills versus competency debate. South African Family Practice, 46(10), 5–6. doi:10.1080/20786204.2004.10873146.

Greene, A. C., Giffin, K. A., Greene, C. S., & Moore, J. H. (2016). Adapting bioinformatics curricula for big data. Briefings in Bioinformatics, 17(1), 43–50. doi:10.1093/bib/bbv018.
Data Exhaust

Overview

Data exhaust is a type of big data that is often generated unintentionally by users from normal Internet interaction. It is generated in large quantities and appears in many forms, such as the results from web searches, cookies, and temporary files. Initially, data exhaust has limited, or no, direct value to the original data collector. However, when combined with other data for analysis, data exhaust can sometimes yield valuable insights.

Description

Data exhaust is passively collected and consists of random online searches or location data that is generated, for example, from using smart phones with location-dependent services or applications (Gupta and George 2016). It is considered to be "noncore" data that may be generated when individuals use technologies that passively emit

Additional Terminology

Data exhaust is also known as ambient data, remnant data, left-over data, or even digital exhaust (Mcfedries 2013). A digital footprint or a digital dossier is the data generated from online activities that can be traced back to an individual. The passive traces of data from such activities are considered to be data exhaust. The big data that interests many companies is called "found data." Typically, data is extracted from random Internet searches and location data is generated from smart or mobile phone usage. Data exhaust should not be confused with community data, which is generated by users in online social communities, such as Facebook and Twitter.

In the age of big data, one can thus view data as a messy collage of data points, which includes found data as well as the data exhaust extracted from web searches, credit card payments, and mobile devices. These data points are collected for disparate purposes (Harford 2014).
Middle East

Feras A. Batarseh
College of Science, George Mason University, Fairfax, VA, USA

Synonyms

Mid-East; The Middle East and North Africa (MENA)

Definition

The Middle East is a transcontinental region in Western Asia and North Africa. Countries of the Middle East are ones extending from the shores of the Mediterranean Sea, south towards Africa, east towards Asia, and sometimes beyond depending on the context (political, geographical, etc.). The majority of the countries of the region speak Arabic.

the West Bank and the Gaza Strip (Palestine), Egypt, Sudan, Libya, Saudi Arabia, Kuwait, Yemen, Oman, Bahrain, Qatar, and the United Arab Emirates (UAE). Subsequent political and historical events have tended to include more countries in the mix (such as Tunisia, Algeria, Morocco, Afghanistan, and Pakistan).

The Middle East is often referred to as the cradle of civilization. By studying the history of the region, it is clear why the first human civilizations were established in this part of the world (particularly the Mesopotamia region around the Tigris and Euphrates rivers). The Middle East is where humans made their first transitions from nomadic life to agriculture, invented the wheel, created basic agriculture, and where the beginnings of the written word first existed. It is well known that this region is an active political, economic, historic, and religious part of the world (Encyclopedia Britannica 2017). For the purposes of this encyclopedia, the focus of this entry is on technology, data, and software of the Middle East.
[...] jumping on the wagon of social media, governments still struggle to manage, define, or guide the usage of such technologies.

The McKinsey Middle East Digitization Index is one of the main metrics used to assess the level and impact of digitization across the Middle East. Only 6% of the Middle Eastern public lives under a digitized smart or electronic government (the UAE, Jordan, Israel, and Saudi Arabia are among the few countries that have some form of e-government) (Elmasri et al. 2016). However, many new technology startups are coming from the Middle East with great success. The most famous technology startup companies coming out of the Middle East include: (1) Maktoob (from Jordan) is one that stands out. The company represents a major trophy on the list of Middle Eastern tech achievements; it made global headlines when it was bought by Yahoo, Inc. for $80 million in 2009, symbolizing an important worldwide step by a purely Middle Eastern company. (2) Yamli (from Lebanon): one of the most popular web apps for Arabic speakers today. (3) GetYou (from Israel): a famous social media application. (4) Digikala (from Iran): an online retailer application. (5) ElWafeyat (from Egypt): an Arabic-language social media site for honoring deceased friends and family. (6) Project X (from Jordan): a mobile application that allows for 3D printing of prosthetics, inspired by wars in the region. These examples are assembled from multiple sources; many other exciting projects exist as well (such as Souq, which was acquired by Amazon in 2017, Masdar, Namshi, Sukar, and many others).

Software Arabization: The Next Frontier

The first step towards invoking more technology in a region is to localize the software, content, and its data. Localizing a software system is accomplished by supporting a new spoken language (the Arabic language in this context, hence the name, Arabization). A new term is presented in this entry of the Encyclopedia, Arabization: it is the overall concept that includes the process of making the software available and reliable across the geographical borders of the Arab states. Different spoken languages have different orientations and fall into different groups. Dealing with these groups is accomplished by using different code pages and Unicode fonts. Languages fall into two main families: single-byte (such as French, German, and Polish) and double-byte (such as Japanese, Chinese, and Korean). Another categorization that is more relevant to Middle Eastern languages is based on their orientation. Most Middle Eastern languages are right-to-left (RTL) (such as Arabic and Hebrew), while other world languages are left-to-right (LTR) (such as English and Spanish). For all languages, however, a set of translated strings should be saved in a bundle file that indexes all the strings and assigns them IDs, so the software program can locate them and display the right string in the language of the user. Furthermore, to accomplish software Arabization, character encoding must be enabled. The default encoding for a given system is determined by the runtime locale set on the machine's operating system. The most commonplace character encoding format is UTF (UCS Transformation Format), where UCS is the Universal Character Set. UTF is designed to be compatible with ASCII and has three types: UTF-8, UTF-16, and UTF-32; it corresponds to the international standard ISO/IEC 10646. It is important to note that Arabization is not a trivial process; engineers cannot merely inject translated language strings into the system, or hardcode cultural, date, or numerical settings into the software. Rather, the process is done by obtaining different files based on the settings of the machine and the desires of the user, and applying the right locales. An Arabization package needs to be developed to further the digital, software, and technological evolution in the Middle East.

Bridging the Digital Divide

Information presented in this entry showed how the Middle East is speeding towards catching up with industrialized nations in terms of software technology adoption and utilization (i.e., bridging the digital divide between third world and first world countries). Figure 1 below shows which countries are investing towards leading that transformation; numbers in the figure illustrate venture capital funding as a share of GDP (Elmasri et al. 2016).

[Middle East, Fig. 1: Middle Eastern Investments in Technology (Elmasri et al. 2016)]

However, according to Cisco's 2015 visual networking index (VNI), the world is looking towards a new digital divide, beyond software and mobile apps. By 2019, the number of people connecting to the Internet is going to rise to 3.9 billion users, reaching over 50% of the global population. That will accelerate the new wave of big data, machine learning, and the Internet of Things (IoT), and that will be the main new challenge for technology innovators in the Middle East. Middle Eastern countries need to first lay the "data" infrastructure (such as the principle of software Arabization presented above) that would enable the peoples of the Middle East towards higher adoption rates of future trends (big data and IoT). Such a shift would greatly influence economic growth in countries all across the region; however, the impacts of technology require minimum adoption thresholds before those impacts begin to materialize; the wider the intensity and use of big data, the Internet of Things (IoT), and machine learning, the greater the impacts.

Conclusion

The Middle East is known for many historical and political events, conflicts, and controversies; however, it is not often referred to as a technological and software-startup hub. This entry of the Encyclopedia presents a brief introduction to the Middle East, draws a simple picture of its digitization, and claims that Arabization of software could lead to many advancements across the region and eventually the world. For startups and creativity, the Middle East is an area worth watching (Forbes 2017).
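The bundle-and-locale mechanism described in the Software Arabization section can be sketched as follows. This is a minimal illustration, not an existing package: the bundle contents, key names, and the `localize` helper are assumptions made for the example.

```python
# Minimal sketch of locale-aware string bundles for Arabization.
# The bundle layout, key names, and API below are illustrative
# assumptions, not part of any real localization library.

BUNDLES = {
    "en": {"direction": "ltr", "strings": {"greeting": "Welcome"}},
    "ar": {"direction": "rtl", "strings": {"greeting": "أهلاً وسهلاً"}},
}

def localize(key, locale, fallback="en"):
    """Look up a string by its ID for the user's locale.

    Returns (text, direction) so the UI layer can lay the string out
    right-to-left for Arabic and left-to-right otherwise. Falls back
    to the default locale when a translation is missing.
    """
    bundle = BUNDLES.get(locale, BUNDLES[fallback])
    text = bundle["strings"].get(key) or BUNDLES[fallback]["strings"][key]
    return text, bundle["direction"]

text, direction = localize("greeting", "ar")
print(direction)  # rtl
```

Returning the text direction alongside the string mirrors the point made above: Arabization is more than string substitution, since layout orientation and encoding (UTF-8 throughout here) must travel with the translation.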
References

Elmasri, T., Benni, E., Patel, J., & Moore, J. (2016). Digital Middle East: Transforming the region into a leading digital economy. McKinsey and Company. http://www.mckinsey.com/~/media/mckinsey/global%20themes/middle%20east%20and%20africa/digital%20middle%20east%20transforming%20the%20region%20into%20a%20leading%20digital%20economy/digital-middle-east-finalupdated.ashx

Encyclopedia Britannica. (2017). Available at https://www.britannica.com/place/Middle-East

Forbes reports on the Middle East. (2017). Available at http://www.forbes.com/sites/natalierobehmed/2013/08/22/forget-oil-tech-could-be-the-next-middle-east-goldmine/
[...] automation and efficiency is often referred to as the Internet of things (IoT). Sensors are becoming more prevalent and cheap enough that the public can make use of personal sensors that already exist in their daily lives or can be easily acquired.

Personal Health Monitoring
Health-monitoring applications are becoming increasingly common and produce very large volumes of data. Biophysical processes such as heart rate, breathing rate, sleep patterns, and restlessness can be recorded continuously using devices kept in contact with the body. Health-conscious and athletic communities, such as runners, have particularly taken to personal monitoring by using technology to track their current condition and progress. Pedometers, weight scales, and thermometers are commonplace. Heart rate, blood pressure, and muscle fatigue are now monitored by affordable devices in the form of bracelets, rings, adhesive strips, and even clothing. Brands of smart clothing are offering built-in sensors for heart rate, respiration, skin temperature and moisture, and electrophysiological signals, sometimes even recharged by solar panels. There are even wireless sensors for the insoles of shoes that automatically adjust for the movements of the user in addition to providing health and training analysis.

Wearable health technologies are often used to provide individuals with private personal information; however, certain circumstances call for system-wide monitoring for medical or emergency purposes. Medical patients, such as those with diabetes or hypertension, can use continuously testing glucose meters or blood pressure monitors (Kalantar-zadeh 2013). Bluetooth-enabled devices can transmit data from monitoring sensors and contact the appropriate parties automatically if there are health concerns. Collective health information can be used to gain a better understanding of such health concerns as cardiac issues, extreme temperatures, and even crisis information.

Smart Home
Sensors have long been a part of modern households, from smoke and carbon monoxide detectors to security systems and motion sensors. Increasingly, smart home sensors are being used for everyday monitoring in order to have more efficient energy consumption with smart lighting fixtures and temperature controls. Sensors are often placed to inform on activities in the house, such as a door or window being opened. This integrated network of house monitoring promises efficiency, automation, and safety based on personal preferences. There is significant investment in smart home technologies, and big data analysis can play a major role in determining appropriate settings based on feedback.

Environmental Monitoring
Monitoring of the environment from the surface to the atmosphere is traditionally a function performed by the government through remotely sensed observations and broad surveys. Remote sensing imagery from satellites and airborne flights can create large datasets on global environmental changes for use in such applications as agriculture, pollution, water, climatic conditions, etc. Government agencies also employ static sensors and make on-site visits to check sensors which monitor environmental conditions. These sensors are sometimes integrated into networks which can communicate observations to form real-time monitoring systems.

In addition to traditional government sources of environmental data, there are growing collections of citizen science data that are focused primarily on areas of community concern such as air quality, water quality, and natural hazards. Air quality and water quality have long been monitored by communities concerned about pollution in their environment, but a recent development after the 2011 Fukushima nuclear disaster is radiation sensing. Safecast is a radiation monitoring project that seeks to empower people with information on environmental safety and openly distributes measurements under Creative Commons rights (McGrath and Scanaill 2013). Radiation is not visibly observable, so it is considered a "silent" environmental harm, and the risk needs to be considered in light of validated data (Hultquist and Cervone 2017). Citizen science projects for sensing natural hazards from flooding, landslides, earthquakes, wildfires, etc. have come online with
support from both governments and communities. Open-source environmental data is a growing movement as people get engaged with their environment and become more educated about their health.

Conclusion

The development and availability of sensor technologies is a part of the big data paradigm. Sensors are able to produce an enormous amount of data, very quickly with real-time uploads, and from diverse types of sensors. Many questions still remain about how to use this data and whether connected sensors will lead to smart environments that will be a part of everyday modern life. The Internet of things (IoT) is envisioned to connect communication across domains and applications in order to enable the development of smart cities. Sensor data can provide useful information for individuals and generalized information from collective monitoring. Services often offer personalized analysis in order to keep people engaged using the application. Yet, most analysis and interest from researchers in sensor data is at a generalized level. Despite mostly generalized data analysis, there is public concern related to data privacy from individual and home sensors. The privacy level of the data is highly dependent on the system used and the terms of service agreement, if a service is being provided related to the sensor data.

Analysis of sensor data is often complex, messy, and hard to verify. Nonpersonal data can often be checked or referenced against a comparable dataset to see if it makes sense. However, large datasets produced by personal sensors for such applications as health are difficult to independently verify at an individual level. For example, an environmental condition could cause a natural and medically safe reaction, such as a user awaking with a rapid increase in heart rate due to an earthquake. Individual inspection of data for such noise is fraught with problems, as it is complicated to identify causes in the raw data from an individual; but at a generalized level, such data can be valuable for research and can appropriately take into account variations in the data.

Sensor technologies are integrated into everyday life and are used in numerous applications to monitor conditions. The usefulness of technological sensors should be no surprise, as every living organism has biological sensors which serve similar purposes to indicate the regulation of internal functions and conditions of the external environment. The integration of sensor technologies is a natural step that goes from individual measurements to collective monitoring, which highlights the need for big data analysis and validation.

Cross-References

▶ AgInformatics
▶ Air Pollution
▶ Biometrics
▶ Biosurveillance
▶ Crowdsourcing
▶ Drones
▶ Environment
▶ Health Informatics
▶ Land Pollution
▶ Participatory Health and Big Data
▶ Patient-Centered (Personalized) Health
▶ Remote Sensing
▶ Water Pollution

Further Readings

Hultquist, C., & Cervone, G. (2017). Citizen monitoring during hazards: Validation of Fukushima radiation measurements. GeoJournal. http://doi.org/10.1007/s10708-017-9767-x.

Kalantar-zadeh, K. (2013). Sensors: An introductory course (1st ed.). Boston: Springer US.

McGrath, M. J., & Scanaill, C. N. (2013). Sensor technologies: Healthcare, wellness, and environmental applications. New York: Apress Open.
"Small" Data

Rochelle E. Tractenberg (1, 2) and Kimberly F. Sellers (3)
1. Collaborative for Research on Outcomes and Metrics, Washington, DC, USA
2. Departments of Neurology; Biostatistics, Bioinformatics & Biomathematics; and Rehabilitation Medicine, Georgetown University, Washington, DC, USA
3. Department of Mathematics and Statistics, Georgetown University, Washington, DC, USA

Synonyms

Data; Statistics

Introduction

Big data are often characterized by "the 3 Vs": volume, velocity, and variety. This implies that "small data" lack these qualities, but that is an incorrect conclusion about what defines "small" data. Instead, we define "small data" to be simply "data" – specifically, data that are finite but not necessarily "small" in scope, dimension, or rate of accumulation. The characterization of data as "small" is essentially dependent on the context and use for which the data are intended. In fact, disciplinary perspectives vary on how large "big data" need to be to merit this label, but small data are not characterized effectively by the absence of one or more of these "3 Vs." Most statistical analyses require some amount of vector and matrix manipulation for efficient computation in the modern context. Data sets may be considered "big" if they are so large, multidimensional, and/or quickly accumulating in size that the typical linear algebraic manipulations cannot converge or yield true summaries of the full data set. The fundamental statistical analyses, however, are the same for data that are "big" or "small"; the true distinction arises from the extent to which computational manipulation is required to map and reduce the data (Dean and Ghemawat 2004) such that a coherent result can be derived. All analyses share common features, irrespective of the size, complexity, or completeness of the data – the relationship between statistics and the underlying population; the association between inference, estimation, and prediction; and the dependence of interpretation and decision-making on statistical inference. To expand on the lack of distinguishability between "small" data and "big" data, we explore each of these features in turn. By doing so, we expound on the assertion that the characterization of a dataset as "small" depends on the users' intention and the context in which the data, and the results from its analysis, will be used.

Most multidimensional datasets are presumably an incomplete (albeit massive) representation of the entire universe of values – the "population." Thus, the field of statistics has historically been based on long-run frequencies or computed estimates of the true population parameters. For example, in some current massive data collection and warehousing enterprises, the full population can never be obtained because the data are continuously streaming in and being collected. In other massive data sets, however, the entire population is captured; examples include the medical records for a health insurance company, sales on Amazon.com, or weather data for the detection of an evolving storm or other significant weather pattern. The fundamental statistical analyses would be the same for either of these data types; however, the former would result in estimates for the (essentially) infinite data set, while actual population-descriptive values are possible whenever finite/population data are obtained. Importantly, it is not the size or complexity of the data that results in either estimation or population description – it is whether or not the data are finite. This underscores the reliance of any and all data analysis procedures on statistical methodologies; assumptions about the data are required for the correct use and interpretation of these methodologies for data of any size and complexity. It further blurs qualifications of a given data set as "big" or "small."

Inference, Estimation, and Prediction

Statistical methods are generally used for two purposes: (1) to estimate "true" population parameters when only sample information is available, and (2) to make or test predictions about either future results or relationships among variables. These methods are used to infer "the truth" from incomplete data and are the foundations of nearly all experimental designs and tests of quantitative hypotheses in applied disciplines (e.g., science, engineering, and business). Modern statistical analysis generates results (i.e., parameter estimates and tests of inferences) that can be characterized with respect to how rare they are given the random variability inherent in the data set. In frequentist statistical analysis (based on long-run results), this characterization typically describes how likely the observed result would be if there were, in truth, no relationship between (any) variables, or if the true parameter value was a specific value (e.g., zero). In Bayesian statistical analysis (based on current data and prior knowledge), this characterization describes how likely it is that there is truly no relationship, given the data that were observed and prior knowledge about whether such a relationship exists.

Whenever inferences are made about estimates, and predictions are made about future events, relationships, or other unknown/unobserved events or results, corrections must be made for the multitude of inferences that are made, for both frequentist and Bayesian methods. Confidence and uncertainty about every inference and estimate must accommodate the fact that more than one has been made; these "multiple comparisons corrections" protect against decisions that some outcome or result is rare/statistically significant when, in fact, the variability inherent in the data makes that result far less rare than it appears. Numerous correction methods exist, with modern (since the mid-1990s) approaches focusing not on controlling for "multiple comparisons" (which are closely tied to experimental design and formal hypothesis testing), but on controlling the "false discovery rate" (which is the rate at which relationships or estimates will be declared "rare given the inherent variability of the data" when they are not, in fact, rare). Decisions made about inferences, estimates, and predictions are classified as correct (i.e., the event is rare and is declared rare, or the event is not rare and is declared not rare) or incorrect (i.e., the event is rare but is declared not rare – a false negative/Type II error; or the event is not rare but is declared rare – a false positive/Type I error); controls for multiple comparisons or false discoveries seek to limit Type I errors.

Decisions are made based on the data analysis, which holds for "big" or "small" data. While multiple comparisons corrections and false discovery rate controls have long been accepted as representing competent scientific practice, they are also essential features of the analysis of big data, whether or not these data are analyzed for scientific or research purposes.

Analysis, Interpretation, and Decision Making

Analyses of data are either motivated by theory or prior evidence ("theory-driven"), or they are unplanned and motivated by the data themselves ("data-driven"). Both types of investigations can be executed on data of any size, complexity, or completeness. While the motivations for data analysis vary across disciplines, evidence that supports decisions is always important. Statistical methods have been developed, validated, and utilized to support the most appropriate analysis, given the data and its properties, so that defensible and reproducible interpretations and inferences result. Thus, decisions that are made based on the analysis of data, whether "big" or "small," are inherently dependent on the quality of the analysis and the associated interpretations.

Conclusion

As has been the case for centuries, today's "big" data will eventually be perceived as "small"; however, the statistical methodologies for analyzing and interpreting all data will also continue to evolve, and these will become increasingly interdependent with the methods for collecting, manipulating, and storing the data. Because of the constant evolution and advancement in technology and computation, the notion of "big data" may be best conceptualized as representing the processes of data collection, storage, and manipulation for interpretable analysis, and not the size, utility, or complexity of the data itself. Therefore, the characterization of data as "small" depends critically on the context and use for which the data are intended.

Further Reading

Bickel, P. J. (2000). Statistics as the information science. Opportunities for the Mathematical Sciences, 9, 11.

Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified data processing on large clusters. In OSDI'04: Sixth symposium on operating system design and implementation. San Francisco. Downloaded from https://research.google.com/archive/mapreduce.html on 21 Dec 2016.

Rao, C. R. (2001). Statistics: Reflections on the past and visions for the future. Communications in Statistics – Theory and Methods, 30(11), 2235–2257.
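The false discovery rate control discussed under Inference, Estimation, and Prediction can be sketched with the classic Benjamini-Hochberg step-up procedure. The entry does not name a specific FDR method, so this choice of procedure and the sample p-values are illustrative assumptions:

```python
# Illustrative sketch of false-discovery-rate control via the
# Benjamini-Hochberg step-up procedure. The choice of procedure and
# the sample p-values are assumptions made for this example.

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    cutoff = -1
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * q:
            cutoff = rank
    # Reject the hypotheses with the cutoff smallest p-values.
    return sorted(order[:cutoff]) if cutoff > 0 else []

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
print(benjamini_hochberg(pvals, q=0.05))  # [0, 1]
```

Because rejection is driven by the largest qualifying rank, a p-value above its own per-rank threshold can still be rejected if a later rank qualifies; this is what makes the step-up procedure less conservative than a simple per-comparison correction.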
Time Series Analytics

Erik Goepner
George Mason University, Arlington, VA, USA

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data, DOI 10.1007/978-3-319-32001-4_469-1

Synonyms

Time series analysis; Time series data

Introduction

Time series analytics utilizes data observations recorded over time at certain intervals. Subsequent values of time-ordered data often depend on previous observations. Time series analytics is, therefore, interested in techniques that can analyze this dependence (Box et al. 2015; Zois et al. 2015). Up until the second half of the twentieth century, social scientists largely ignored the possibility of dependence within time series data (Kirchgässner et al. 2012). Statisticians have since demonstrated that adjacent observations in a time series are frequently dependent and that previous observations can often be used to accurately predict future values (Box et al. 2015).

Time series data abound and are of importance to many. Physicists and geologists investigating climate change, for example, use annual temperature readings; economists study quarterly gross domestic product and monthly employment reports; and policy makers might be interested in before-and-after annual traffic accident data to determine the efficacy of safety legislation. Time series analytics can be used to forecast, determine the transfer function, assess the effects of unusual intervention events, analyze the relationships between variables of interest, and design control schemes (Box et al. 2015). Preferably, observations have been recorded at fixed time intervals. If the time intervals vary, interpolation can be used to fill in the gaps (Zois et al. 2015).

Of critical importance is whether the variables are stationary or nonstationary. Stationary variables are not time dependent (i.e., mean, variance, and covariance remain constant over time). However, time series data are quite often nonstationary. The trend of nonstationary variables can be deterministic (e.g., following a time trend), stochastic (i.e., random), or both. Addressing nonstationarity is a key requirement for those working with time series and is discussed further under "Challenges" (Box et al. 2015; Kirchgässner et al. 2012).

Time series are frequently composed of four components. There is the trend over the long term and, often, a cyclical component that is normally understood to be a year or more in length. Within the cycle, there can be a seasonal variation. And finally, there is the residual, which includes all variation not explained by the trend, cycle, and seasonal components. Prior to the 1970s, only the residual was thought to include random impact, with trend, cycle, and seasonal change understood to be deterministic. That has changed, and now it
is assumed that all four components can be stochastically modeled (Kirchgässner et al. 2012).

The Evolution of Time Series Analytics

In the first half of the 1900s, fundamentally different approaches were pursued by different disciplines. Natural scientists, mathematicians, and statisticians generally modeled the past history of the variable of interest to forecast future values of the variable. Economists and other social scientists, however, emphasized theory-driven models with their accompanying explanatory variables. In 1970, Box and Jenkins published an influential textbook, followed in 1974 by a study from Granger and Newbold, that has substantially altered how social scientists interact with time series data (Kirchgässner et al. 2012).

The Box Jenkins approach, as it has been frequently called ever since, relies on extrapolation. Box Jenkins focuses on the past behavior of the variable of interest, rather than a host of explanatory variables, to predict future values. The variable of interest must be transformed so that it becomes stationary and its stochastic properties time invariant. At times, the terms Box Jenkins approach and time series analysis have been used interchangeably (Kennedy 2008).

Time Series Analytics and Big Data

Big Data has stimulated interest in efficient querying of time series data. Both time series and Big Data share similar characteristics relating to volume, velocity, variety, veracity, and volatility (Zois et al. 2015). The unprecedented volume of data can overwhelm computer memory and prevent processing in real time. Additionally, the speed at which new data arrives (e.g., from sensors) has also increased. The variety of data includes the medium from which it comes (e.g., audio and video) as well as differing sampling rates, which can prove problematic for data analysis. Missing data and incompatible sampling rates are discussed further in the "Challenges" section below. Veracity includes issues relating to inaccurate, missing, or incomplete data. Before analysis, these issues should be addressed via duplicate elimination, interpolation, data fusion, or an influence model (Zois et al. 2015).

Contending with Massive Amounts of Data
Tremendous amounts of time series data exist, potentially overwhelming computer memory. In response, solutions are needed to lessen the effects on secondary memory access. Sliding windows and time series indexing may help. Both are commonly used; however, newer users may find the learning curve unhelpfully steep for time series indexing. Similarly, consideration should be given to selecting management schemes and query languages simple enough for common users (Zois et al. 2015).

Analysis and Forecasting

Time series are primarily used for analysis and forecasting (Zois et al. 2015). A variety of potential models exist, including autoregressive (AR), moving average (MA), mixed autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA). ARMA models are used with stationary processes and ARIMA models for nonstationary ones (Box et al. 2015). Forecasting options include regression and nonregression based models. Model development should follow an iterative approach, often executed in three steps: identification, estimation, and diagnostic checking. Diagnostic checks examine whether the model is properly fit, and the checks analyze the residuals to determine model adequacy. Generally, 100 or more observations are preferred. If fewer than 50 observations exist, development of the initial model will require a combination of experience and past data (Box et al. 2015; Kennedy 2008).

Autoregressive, Moving Average, and Mixed Autoregressive Moving Average Models
An autoregressive model predicts the value of the variable of interest based on its values from one or more previous time periods (i.e., its lagged values). If, for instance, the model only relied on the value
of the immediately preceding time period, then it would be a first-order autoregression. Similarly, if the model included the values for the prior two time periods, then it would be referred to as a second-order autoregression, and so on. A moving average model also uses lagged values, but of the error term rather than the variable of interest (Kennedy 2008). If neither an autoregressive nor a moving average process succeeds in breaking off the autocorrelation function, then a mixed autoregressive moving average approach may be preferred (Kirchgässner et al. 2012). AR, MA, and ARMA models are used with stationary time series, including time series made stationary through differencing. However, the potential loss of vital information during differencing operations should be considered (Kirchgässner et al. 2012).

ARMA models produce unconditional forecasts, using only the past and current values of the variable. Because such forecasts frequently perform better than traditional econometric models, they are often preferred. However, blended approaches, which transform linear dynamic simultaneous equation systems into ARMA models or the inverse, are also available. These blended approaches can retain information provided by explanatory variables (Kirchgässner et al. 2012).

Autoregressive Integrated Moving Average (ARIMA) Models

In ARIMA models, also written ARIMA(p, d, q), p indicates the number of lagged values of Y*, which represents the variable of interest after it has been made stationary by differencing. d indicates the number of differencing operations required to transform Y into its stationary version, Y*. The number of lagged values of the error term is represented by q. ARIMA models can forecast for univariate and multivariate time series (Kennedy 2008).

Vector Autoregressive (VAR) Models

VAR models blend the Box Jenkins approach with traditional econometric models. They can be quite helpful in forecasting. VAR models express a single vector (of all the variables) as a linear function of the vector's lagged values combined with an error vector. The single vector is derived from the linear function of each variable's lagged values and the lagged values of each of the other variables. VAR models are used to investigate potential causal relationships between different time series, yet they are controversial because they are atheoretical and include dubious assertions (e.g., the orthogonal innovation of one variable is assumed not to affect the value of any other variable). Despite the controversy, many scholars and practitioners view VAR models as helpful, particularly in analysis and forecasting (Kennedy 2008; Kirchgässner et al. 2012; Box et al. 2015).

Error Correction Models

These models attempt to harness positive features of both ARIMA and VAR models, accounting for the dynamic character of time series data while also taking advantage of the contributions explanatory variables can make. Error correction models add theory-driven exogenous variables to a general form of the VAR model (Kennedy 2008).

Challenges

Nonstationarity

Nonstationarity can be caused by deterministic and stochastic trends (Kirchgässner et al. 2012). To transform nonstationary processes into stationary ones, the deterministic and/or stochastic trends must be eliminated. Measures to accomplish this include differencing operations and regression on a time trend. However, not all nonstationary processes can be transformed (Kirchgässner et al. 2012).

The Box Jenkins approach assumes that differencing operations will make nonstationary variables stationary. A number of unit root tests have been developed to test for nonstationarity, but their lack of power remains an issue. Additionally, differencing (as a means of eliminating unit roots and creating stationarity) comes with the undesirable effect of eliminating any theory-driven information that might otherwise contribute to the model.
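To make the mechanics concrete, the differencing step and a first-order autoregression can be sketched in a few lines of Python. This is a minimal illustration only; the coefficient 0.7 and the simulated series are invented for the example and come from none of the cited sources.

```python
import random

def difference(series, d=1):
    """Apply d rounds of first differencing: y*_t = y_t - y_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def fit_ar1(series):
    """Least-squares estimate of phi in y_t = phi * y_{t-1} + e_t."""
    num = sum(y0 * y1 for y0, y1 in zip(series, series[1:]))
    den = sum(y0 * y0 for y0 in series[:-1])
    return num / den

# A series with a deterministic linear trend is nonstationary;
# one round of differencing eliminates the trend.
trend = [2 * t for t in range(10)]
print(difference(trend))  # a constant series: the trend is gone

# Simulate a stationary AR(1) process with phi = 0.7, then recover phi.
random.seed(42)
y = [0.0]
for _ in range(5000):
    y.append(0.7 * y[-1] + random.gauss(0, 1))
print(fit_ar1(y))  # estimate close to the true value 0.7
```

With a long enough simulated series the least-squares estimate lands near the true coefficient; the same differencing idea is what the d in ARIMA(p, d, q) counts.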
scraping HTML and other XML documents. It provides convenient Pythonic functions for navigating, searching, and modifying a parse tree, and a toolkit for decomposing an HTML file and extracting desired information via lxml or html5lib. Beautiful Soup can automatically detect the encoding of the document under processing and convert it to a client-readable encoding. Similarly, Pyquery provides a set of jQuery-like functions to parse XML documents. But unlike Beautiful Soup, Pyquery only supports lxml for fast XML processing.

Of the various types of web scraping programs, some are created to automatically recognize the data structure of a page, such as Nutch or Scrapy, or to provide a web-based graphic interface that eliminates the need for manually written web scraping code, such as Import.io. Nutch is a robust and scalable web crawler, written in Java. It enables fine-grained configuration, parallel harvesting, robots.txt rule support, and machine learning. Scrapy, written in Python, is a reusable web crawling framework. It speeds up the process of building and scaling large crawling projects. In addition, it provides a web-based shell to simulate the website browsing behaviors of a human user. To enable nonprogrammers to harvest web contents, web-based crawlers with a graphic interface are purposely designed to mitigate the complexity of using a web scraping program. Among them, Import.io is a typical crawler for extracting data from websites without writing any code. It allows users to identify and convert unstructured web pages into a structured format. Import.io's graphic interface for data identification allows users to train the tool on what to extract. The extracted data is then stored in a dedicated cloud server and can be exported in CSV, JSON, and XML formats. A web-based crawler with a graphic interface can easily harvest and visualize a real-time data stream based on an SVG or WebGL engine but falls short in manipulating a large data set.

Web scraping can be used for a wide variety of scenarios, such as contact scraping, price change monitoring/comparison, product review collection, gathering of real estate listings, weather data monitoring, website change detection, and web data integration. For example, at a micro-scale, the price of a stock can be regularly scraped in order to visualize the price change over time (Case et al. 2005), and social media feeds can be collectively scraped to investigate public opinions and identify opinion leaders (Liu and Zhao 2016). At a macro-level, the metadata of nearly every website is constantly scraped to build up Internet search engines, such as Google Search or Bing Search (Snyder 2003).

Although web scraping is a powerful technique for collecting large data sets, it is controversial and may raise legal questions related to copyright (O'Reilly 2006), terms of service (ToS) (Fisher et al. 2010), and "trespass to chattels" (Hirschey 2014). A web scraper is free to copy a piece of data in figure or table form from a web page without any copyright infringement, because it is difficult to prove a copyright over such data: only a specific arrangement or a particular selection of the data is legally protected. Regarding the ToS, although most web applications include some form of ToS agreement, their enforceability usually lies within a gray area. For instance, the owner of a web scraper that violates the ToS may argue that he or she never saw or officially agreed to the ToS. Moreover, if a web scraper sends data requests too frequently, this is functionally equivalent to a denial-of-service attack, in which case the web scraper owner may be refused entry and may be liable for damages under the law of "trespass to chattels," because the owner of the web application has a property interest in the physical web server which hosts the application. An ethical web scraping tool will avoid this issue by maintaining a reasonable requesting frequency.

A web application may adopt one of the following measures to stop or interfere with a web scraping tool that collects data from the given website. Those measures may identify whether an operation was conducted by a human being or a bot. Some of the major measures include the following: HTML "fingerprinting," which investigates the HTML headers to identify whether a visitor is malicious or safe (Acar et al. 2013); IP reputation determination, where IP addresses with a recorded history of use in website assaults
will be treated with suspicion and are more likely to be heavily scrutinized (Sadan and Schwartz 2012); behavior analysis for revealing abnormal behavioral patterns, such as placing a suspiciously high rate of requests and adhering to anomalous browsing patterns; and progressive challenges that filter out bots with a set of tasks, such as cookie support, JavaScript execution, and CAPTCHA (Doran and Gokhale 2011).

Further Readings

Acar, G., Juarez, M., Nikiforakis, N., Diaz, C., Gürses, S., Piessens, F., & Preneel, B. (2013). FPDetective: Dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on computer & communications security. New York: ACM.

Bar-Ilan, J. (2001). Data collection methods on the web for infometric purposes – A review and analysis. Scientometrics, 50(1), 7–32.

Butler, J. (2007). Visual web page analytics. Google Patents.

Case, K. E., Quigley, J. M., & Shiller, R. J. (2005). Comparing wealth effects: The stock market versus the housing market. The B.E. Journal of Macroeconomics, 5(1), 1.

Doran, D., & Gokhale, S. S. (2011). Web robot detection techniques: Overview and limitations. Data Mining and Knowledge Discovery, 22(1), 183–210.

Fisher, D., McDonald, D. W., Brooks, A. L., & Churchill, E. F. (2010). Terms of service, ethics, and bias: Tapping the social web for CSCW research. Computer Supported Cooperative Work (CSCW), panel discussion.

Hirschey, J. K. (2014). Symbiotic relationships: Pragmatic acceptance of data scraping. Berkeley Technology Law Journal, 29, 897.

Liu, J. C.-E., & Zhao, B. (2016). Who speaks for climate change in China? Evidence from Weibo. Climatic Change, 140(3), 413–422.

Mooney, S. J., Westreich, D. J., & El-Sayed, A. M. (2015). Epidemiology in the era of big data. Epidemiology, 26(3), 390.

O'Reilly, S. (2006). Nominative fair use and Internet aggregators: Copyright and trademark challenges posed by bots, web crawlers and screen-scraping technologies. Loyola Consumer Law Review, 19, 273.

Sadan, Z., & Schwartz, D. G. (2012). Social network analysis for cluster-based IP spam reputation. Information Management & Computer Security, 20(4), 281–295.

Snyder, R. (2003). Web search engine with graphic snapshots. Google Patents.

Yi, J., Nasukawa, T., Bunescu, R., & Niblack, W. (2003). Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. In Third IEEE International Conference on Data Mining (ICDM 2003). Melbourne, Florida, USA: IEEE.
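Two practical themes of this entry, pulling structured data out of HTML and keeping the request rate polite, can be sketched with Python's standard library alone. This is a minimal illustration, not the Beautiful Soup, Scrapy, or Import.io implementations discussed above; the HTML snippet and the interval parameter are invented for the example.

```python
import time
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect every href attribute: a bare-bones stand-in for the
    parse-tree navigation that Beautiful Soup or Pyquery provide."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def polite_fetch(urls, fetch, min_interval=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, never more often than one
    request per min_interval seconds (a simple ethical throttle)."""
    results = []
    last = None
    for url in urls:
        now = time.monotonic()
        if last is not None and now - last < min_interval:
            sleep(min_interval - (now - last))
        last = time.monotonic()
        results.append(fetch(url))
    return results

page = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'
scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # ['/a', '/b']
```

The `sleep` parameter is injectable so the throttle can be tested without real delays; a production crawler would add robots.txt checks (e.g., via `urllib.robotparser`) on top of this.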
In one special issue on big data in the journal Annals of GIS (volume 20, issue 4, 2014), researchers further discussed several key technologies (e.g., cloud computing, high-performance geocomputation cyberinfrastructures) for dealing with quantitative and qualitative dynamics of big geo-data. Advanced spatiotemporal big data mining and geoprocessing methods should be developed by optimizing the elastic storage, balanced scheduling, and parallel computing resources in high-performance geocomputation cyberinfrastructures.

Conclusion

With the advancements in location-awareness technology and mobile distributed sensor networks, large-scale high-resolution spatiotemporal datasets about the Earth and society have become available for geographic research. The research on big geo-data involves interdisciplinary collaborative efforts. There are at least three research areas that require further work: (1) the systematic integration of various big geo-data sources in geospatial knowledge discovery and spatial modeling, (2) the development of advanced spatial analysis functions and models, and (3) the advancement of quality assurance issues on big geo-data. Finally, there will still be ongoing comparisons between data-driven and theory-driven research methodologies in geography.

Further Readings

Gao, S., Li, L., Li, W., Janowicz, K., & Zhang, Y. (2017). Constructing gazetteers from volunteered big geo-data based on Hadoop. Computers, Environment and Urban Systems, 61, 172–186.

Janowicz, K., van Harmelen, F., Hendler, J., & Hitzler, P. (2015). Why the data train needs semantic rails. AI Magazine, Association for the Advancement of Artificial Intelligence (AAAI), pp. 5–14.

Miller, H. J., & Goodchild, M. F. (2015). Data-driven geography. GeoJournal, 80(4), 449–461.

Shaw, S. L., Tsou, M. H., & Ye, X. (2016). Editorial: Human dynamics in the mobile and big data era. International Journal of Geographical Information Science, 30(9), 1687–1693.

Yang, C., Huang, Q., Li, Z., Liu, K., & Hu, F. (2017). Big data and cloud computing: Innovation opportunities and challenges. International Journal of Digital Earth, 10(1), 13–53.
Integrated Data System

Ting Zhang
Department of Finance and Economics, Merrick School of Business, University of Baltimore, Baltimore, MD, USA

Definition/Introduction

Integrated Data Systems (IDS) typically link individual-level administrative records collected by multiple agencies such as K–12 schools, community colleges, other colleges and universities, departments of labor, justice, human resources, human and health services, police, housing, and community services. The systems can be used for a quick knowledge-to-practice development cycle (Actionable Intelligence for Social Policy 2017), case management, program or service monitoring, tracking, and evaluation (National Neighborhood Indicators Partnership 2017), research and policy analysis, strategic planning and performance management, and so on. An IDS can also help evaluate how different programs, services, and policies affect individual persons or individual geographic units. The linkages between different agency records are often made through a common individual personal identification number, a shared case number, or a geographic unit.

With the rising attraction of big data and the exploding need to share existing data, the need to link various already collected administrative records rises. The systems allow government agencies to integrate various databases and bridge the gaps that have traditionally formed within individual agency databases. An IDS can be used for a quick knowledge-to-practice development cycle to better address the often interconnected needs of citizens efficiently and effectively (Actionable Intelligence for Social Policy 2017); for case management (National Neighborhood Indicators Partnership 2017); for program or service monitoring, tracking, and evaluation; for developing and testing an intervention and monitoring the outcomes (Davis et al. 2014); for research and policy analysis; for strategic planning and performance management; and so on. It can test social policy innovations through high-speed, low-cost randomized controlled trials and quasi-experimental approaches, can be used for continuous quality improvement efforts and benefit-cost analysis, and can also help provide a complete account of how different programs, services, and policies affect individual persons or individual geographic units to more efficiently and effectively address the often interconnected needs of the citizens (Actionable Intelligence for Social Policy 2017).
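The linkage described above, joining records held by different agencies through a shared identifier, is essentially a keyed merge. A minimal Python sketch follows; the agencies, field names, and person IDs are invented for the illustration.

```python
# Hypothetical extracts from two agency databases, each keyed by a
# shared person identifier (in practice, an SSN, case number, or
# de-identified linkage key).
education = {
    "P001": {"highest_grade": 12},
    "P002": {"highest_grade": 10},
}
labor = {
    "P001": {"employed": True, "quarterly_wage": 9800},
    "P003": {"employed": False, "quarterly_wage": 0},
}

def link_records(*sources):
    """Merge per-agency records that share the same person ID."""
    linked = {}
    for source in sources:
        for pid, fields in source.items():
            linked.setdefault(pid, {}).update(fields)
    return linked

ids = link_records(education, labor)
print(ids["P001"])  # {'highest_grade': 12, 'employed': True, 'quarterly_wage': 9800}
```

Note that "P002" and "P003" survive with partial records: a real IDS must decide how to handle individuals who appear in only some agency databases.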
Key Elements to Build an IDS

According to Davis et al. (2014) and Zhang and Stevens (2012), typical crucial factors related to a successful IDS include:

• A broad and steady institutional commitment to administrate the system
• Individual-level data (whether on individual persons or individual geographic units) to measure outcomes
• The necessary data infrastructure
• Linkable data fields, such as Social Security numbers, business identifiers, shared case numbers, and addresses
• The capacity to match various administrative records
• A favorable state interpretation of the data privacy requirements, consistent with federal regulations
• The funding, know-how, and analytical capacity to work with and maintain the data
• Successfully obtaining participation from multiple data-providing agencies, with clearance to use those data

Maintenance

Administrative data records are typically collected by public and private agencies. An IDS often requires extracting, transforming, cleaning, and linking information from various source administrative databases and loading it into a data warehouse. Many data warehouses offer a tightly coupled architecture in which it usually takes little time to resolve queries and extract information (Widom 1995).

Challenges

Identity Management and Data Quality

One challenge in building an IDS is to have effective and appropriate individual record identity management diagnostics that include consideration of the consequences of gaps in common identifier availability and accuracy. This is the first key step for data quality of IDS information. However, some of the relevant databases, particularly student records, do not include a universally linkable personal identifier, that is, a Social Security number; some databases are unable to ensure that a known-to-be-valid Social Security number is paired with one individual, and only that individual, consistently over time; and some databases are unable to ensure that each individual is associated with only one Social Security number over time (Zhang and Stevens 2012). Zhang and Stevens (2012) included an ongoing collection of case studies documenting how SSNs can be extracted, validated, and securely stored offline. With the established algorithms required for electronic financial transactions, the spreading adoption of electronic medical records, and rising interest in big data, there is an extensive, and rapidly growing, literature illustrating probabilistic matching solutions and various software designs to address the identity management challenge. Often the required accuracy threshold is application specific; assurance of an exact match may not be required for some anticipated longitudinal data system uses (Zhang and Stevens 2012).

Data Privacy

To build and use an IDS, issues related to the privacy of personal information within the system are important. Many government agencies have relevant regulations. For example, a nationally known law is the Family Educational Rights and Privacy Act (FERPA), which defines when student information can be disclosed and sets data privacy practices (U.S. Department of Education 2017). Similarly, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) addresses the use and disclosure of health information (U.S. Department of Health & Human Services 2017).

Ethics

Most IDS tap individual persons' information. When using IDS information, extra caution is needed in order not to misuse personal information. Institutional review boards are often needed when conducting research involving human subjects.
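When no clean shared identifier is available, the probabilistic matching literature cited above scores candidate record pairs by similarity and accepts those above an application-specific threshold. A toy sketch using the standard library's difflib follows; the names and the 0.8 threshold are invented for the example, and real systems combine several fields with calibrated match probabilities.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Normalized similarity in [0, 1] between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(record_name, candidates, threshold=0.8):
    """Return candidate names that clear the similarity threshold."""
    return [c for c in candidates
            if name_similarity(record_name, c) >= threshold]

# A misspelled first name still clears the threshold; an unrelated
# name does not.
print(match("Jonathan Smith", ["Jonathon Smith", "Jane Smythe"]))  # ['Jonathon Smith']
```

Raising the threshold trades missed links for fewer false merges, which is exactly the application-specific accuracy decision described above.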
State Longitudinal Data System

State Longitudinal Data Systems (SLDS) connect databases across two or more state-level agencies of early learning, K–12, postsecondary, and workforce. An SLDS is a state-level Integrated Data System and focuses on tracking individuals longitudinally.

Purpose of the SLDS

SLDS are intended to enhance the ability of states to capture, manage, develop, analyze, and use student education records, to support evidence-based decisions to improve student learning, to facilitate research to increase student achievement and close achievement gaps (National Center for Education Statistics 2010), to address potential recurring impediments to student learning, to measure and document education long-term return on investment, to support education accountability systems, and to simplify the processes used by state educational agencies to make education data transparent through federal and state reporting.

The data system aligns P–12 student education records with secondary and postsecondary education and workforce records, with linkable student and teacher identification numbers and student and teacher information at the student level (National Center for Education Statistics 2010). The student education records include information on enrollment, demographics, program participation, test records, transcript information, college readiness test scores, successful transition to postsecondary programs, enrollment in postsecondary remedial courses, and entries and exits from various levels of the education system (National Center for Education Statistics 2010).

Statewide Longitudinal Data Systems Grant Program

According to the US Department of Education (2015), the Statewide Longitudinal Data Systems Grant Program awards grants to state educational agencies to design, develop, and implement SLDS to
efficiently and accurately manage, analyze, disaggregate, and use individual student data. As authorized by the Educational Technical Assistance Act of 2002, Title II of the statute that created the Institute of Education Sciences (IES), the SLDS Grant Program has awarded competitive, cooperative-agreement grants to almost all states since 2005; in addition to the grants, the program offers many services and resources to assist education agencies with SLDS-related work (US Department of Education 2016).

interpretations of the confidentiality provisions of FERPA and its implementing regulations (Davis et al. 2014). Many states have overcome potential FERPA-related obstacles in their own unique ways, for example: (1) obtaining legal advice recognizing that the promulgation of amended FERPA regulations was intended to facilitate the use of individual-level data for research purposes, (2) maintaining the workforce data within the state's education agency, and (3) creating a special agency that holds both the education and workforce data (Davis et al. 2014).

© Springer International Publishing AG 2017
L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data,
DOI 10.1007/978-3-319-32001-4_495-1