
AI Innovation in Industry
Editor: Daniel E. O’Leary, University of Southern California, oleary@usc.edu

Artificial Intelligence
and Big Data
Daniel E. O’Leary, University of Southern California

IEEE Intelligent Systems. 1541-1672/13/$31.00 © 2013 IEEE. Published by the IEEE Computer Society.

AI Innovation in Industry is a new department for IEEE Intelligent Systems, and this first article will examine some of the basic concerns and uses of AI for big data (AI has been used in several different ways to facilitate capturing and structuring big data, and it has been used to analyze big data for key insights). In future articles, we’ll present some case studies that analyze emerging issues and approaches that integrate AI and big data.

What Is Big Data?

Michael Cox and David Ellsworth [1] were among the first to use the term big data literally, referring to using larger volumes of scientific data for visualization (the term large data also has been used). Currently, there are a number of definitions of big data. Perhaps the most well-known version comes from IBM [2], which suggested that big data could be characterized by any or all of three “V” words to investigate situations, events, and so on: volume, variety, and velocity.

Volume refers to larger amounts of data being generated from a range of sources. For example, big data can include data gathered from the Internet of Things (IoT). As originally conceived [3], IoT referred to the data gathered from a range of devices and sensors networked together, over the Internet. RFID tags appear on inventory items capturing transaction data as goods are shipped through the supply chain. Big data can also refer to the exploding information available on social media such as Facebook and Twitter.

Variety refers to using multiple kinds of data to analyze a situation or event. On the IoT, millions of devices generating a constant flow of data results in not only a large volume of data but different types of data characteristic of different situations. For example, in addition to RFID, heart monitors in patients and location information from phones all generate different types of structured data. However, devices and sensors aren’t the only sources of data. Additionally, people on the Internet generate a highly diverse set of structured and unstructured data. Web browsing data, captured as a sequence of clicks, is structured data. However, there’s also substantial unstructured data. For example, according to Pingdom [4], in 2011 there were 555 million websites and more than 100 million blogs, with many including unstructured text, pictures, audio, and video. As a result, there’s an assemblage of data emerging through the “Internet of People and Things” [5] and the “Internet of Everything.”

Velocity of data also is increasing rapidly over time for both structured and unstructured data, and there’s a need for more frequent decision making about that data. As the world becomes more global and developed, and as the IoT builds, there’s an increasing frequency of data capture and decision making about those “things” as they move through the world. Further, the velocity of social media use is increasing. For example, there are more than 250 million tweets per day [4]. Tweets lead to decisions about other Tweets, escalating the velocity. Further, unlike classic data warehouses that generally “store” data, big data is more dynamic. As decisions are made using big data, those decisions ultimately can influence the next data that’s gathered and analyzed, adding another dimension to velocity.

Big data isn’t just volume, variety, and velocity, though; it’s volume, variety, and velocity at scale. As a result, big data has received substantial attention as distributed and parallel computing has allowed processing of larger volumes of data, most notably through applications of Google’s MapReduce.

MapReduce and Hadoop

MapReduce [6] has been used by Google to generate scalable applications. Inspired by the “map” and
“reduce” functions in Lisp, MapReduce breaks an application into several small portions of the problem, each of which can be executed across any node in a computer cluster. The “map” stage gives subproblems to nodes of computers, and the “reduce” stage combines the results from all of those different subproblems. MapReduce provides an interface that allows distributed computing and parallelization on clusters of computers. MapReduce is used at Google for a large number of activities, including data mining and machine learning.

Hadoop (http://hadoop.apache.org), named after a boy’s toy elephant, is an open source version of MapReduce. Apparently, Yahoo (http://developer.yahoo.com/hadoop) is the largest user (developer and tester) of Hadoop, with more than 500 million users per month and billions of transactions per day using multiple petabytes of data [7]. As an example of the use of the MapReduce approach, consider a Yahoo front page that might be broken into multiple categories, such as advertisements (optimized for the user), must-see videos (subject to content optimization), news (subject to content management), and so on, where each category could be handled by different clusters of computers. Further, within each of those areas, problems might be further decomposed, facilitating even faster response.

MapReduce allows the development of approaches that can handle larger volumes of data using larger numbers of processors. As a result, some of the issues caused by increasing volumes and velocities of data can be addressed using parallel-based approaches.

Contributions of AI

Like big data, AI is about increasing volumes, velocities, and variety of data. Under situations of large volumes of data, AI allows delegation of difficult pattern recognition, learning, and other tasks to computer-based approaches. For example, over one-half of the world’s stock trades are done using AI-based systems. In addition, AI contributes to the velocity of data, by facilitating rapid computer-based decisions that lead to other decisions. For example, since so many stock trades are made by AI-based systems rather than people, the velocity of the trades can increase, and one trade could lead to others. Finally, variety issues aren’t solved simply by parallelizing and distributing the problem. Instead, variety is mitigated by capturing, structuring, and understanding unstructured data using AI and other analytics.

Generating Structured Data

AI researchers have long been interested in building applications that analyze unstructured data, and in somehow categorizing or structuring that data so that the resulting information can be used directly to understand a process or to interface with other applications. As an example, Johan Bollen and Huina Mao [8] found that stock market predictions of the Dow Jones Industrial Average were improved by considering the overall “sentiment” of the stock market; this is an unstructured concept, but it’s based on structured data generated from Google.

In another application, firms have begun to investigate the impact of unstructured data issues such as a firm’s reputation. For example, Scott Spangler and his colleagues [9] reviewed how some firms are analyzing a range of different types of data to provide continuous monitoring of a range of activities, including generating structured measures and assessments of firms’ and products’ reputations, while in other work I investigated issues such as monitoring and auditing financial and other data streams (in fraud detection, for example) [10].

Structuring data has taken multiple approaches. Philip Hayes and Steven Weinstein [11] developed a system for use at Reuters News Service to help categorize individual news articles. The resulting system categorizes unstructured news articles into around 700 categories, recognizing more than 17,000 company names with an accuracy of 85 percent. As another approach, researchers have begun to generate analysis of unstructured sentiment contained in blogs, Twitter messages, and other text [12]. The nature of those different opinions can be used to investigate a range of issues. For example, after an advertisement has been run, there’s structured transaction information, such as when the ad ran, where it ran, and so on. That transaction information could be aligned with previously unstructured data, such as the number of tweets that mention the ad, along with corresponding positive or negative sentiment in those messages. In addition, AI research often examines what other available data can provide structure. For example, Efthymios Kouloumpis and his colleagues [13] investigated Twitter messages and found that hashtags and emoticons were useful for ascertaining sentiment. Once the data has been structured, enterprises want to use data mining to develop insights into these kinds of big data. However, some limitations exist that hinder such analysis.

Some Limitations of Current AI Algorithms

Xindong Wu and his colleagues have identified the top 10 data mining algorithms [14]. Unfortunately, available algorithm sets often are nonstandard
and primarily research-based. Algorithms might lack documentation, support, and clear examples. Further, historically the focus of AI has largely been on single-machine implementations. With big data, we now need AI that’s scalable to clusters of machines or that can be logically set on a MapReduce structure such as Hadoop. As a result, effectively using current AI algorithms in big data enterprise settings might be limited.

However, recently, MapReduce has been used to develop parallel processing approaches to AI algorithms. Cheng-Tao Chu and his colleagues introduced the MapReduce approach into machine learning to facilitate a parallel programming approach to a variety of AI learning algorithms [15]. Their approach was to show that they could write the algorithms in what they referred to as a summation approach, where sufficient statistics from the subproblems can be captured, aggregated, and solved. Using parallel processing, they obtained a linear speedup by increasing the number of processors.

Consistent with that development, along with Hadoop, there’s now a machine learning library with capabilities such as recommendation mining, clustering, and classification, referred to as Mahout (Hindi for a person who rides an elephant; see http://mahout.apache.org). Accordingly, this library can be combined with Hadoop to facilitate the ability of enterprises to use AI and machine learning in a parallel-processing environment for the analysis of large volumes of data.

Parallelizing Other Machine Learning Algorithms

AI researchers are increasingly drawn to the idea of integrating AI capabilities into parallel computing. For example, Tim Kraska and his colleagues [16], as well as others, have initiated research on issues in machine learning in distributed environments. However, AI researchers might not be familiar with issues such as parallelization. As a result, teams of AI and parallel computing researchers are combining efforts.

As part of the MapReduce approach, the “map” portion provides subproblems to nodes for further analysis, to provide the ability to parallelize. Different AI approaches and different units of analysis potentially can influence the extent to which algorithms can be attacked using MapReduce approaches and how the problems can be decomposed. However, in some cases, algorithms developed for single-machine environments can be extended readily to parallel processing environments.

Although the system in Hayes and Weinstein [11] was developed prior to MapReduce developments, we can anticipate implementing it in such an environment. Because the algorithm categorizes individual news stories independently, one approach to decomposing the data into subproblems would be to process each news story separately in a cluster. As another example, Soo-Min Kim and Eduard Hovy [12] analyzed sentiment data at the sentence level, generating structured analysis of unstructured data. If sentences are processed independently, then subproblems can be developed for the sentence level. Similarly, if the unit of analysis is hashtags or emoticons, then subproblems can be generated for those artifacts. If the task is monitoring transactions or other chunks of data [9], then individual transactions and chunks can be analyzed separately in parallel. As a result, we can see that AI algorithms designed for single-machine environments might have emergent subproblem structures useful for parallelization.

Emerging Issues

There are a number of emerging issues associated with AI and big data. First, unfortunately, the nature of some machine-learning algorithms (for example, iterative approaches such as genetic algorithms) can make their use in a MapReduce environment more difficult. As a result, researchers such as Abhishek Verma and his colleagues [17] are investigating the design, implementation, and use of genetic algorithms and other iterative approaches on Hadoop.

Second, it follows that with big data there will also be dirty data, with potential errors, incompleteness, or differential precision. AI can be used to identify and clean dirty data or use dirty data as a means of establishing context knowledge for the data. For example, “consistent” dirty data might indicate a different context than the one assumed, such as data in a different language.

Third, since data visualization was one of the first uses of big data, we would expect AI to further facilitate additional developments. One approach could include capturing expert visualization capabilities in a knowledge base designed to facilitate analysis by other users as big data permeates the enterprise. Another approach is to make intelligent data visualization apps available, possibly for particular types of data.

Fourth, as flash-storage technology evolves, approaches such as in-memory database technology become increasingly feasible [18] to potentially provide users with near-real-time analyses of larger databases, speeding decision-making capabilities. With in-memory approaches, business logic and algorithms can be stored with the data, and AI research can include developing approaches to exploit that technology.
However, it’s likely that as technology increases the ability to handle more data faster, there will also be interest in even larger datasets and even more kinds of data, such as that available from audio and video sources.

Fifth, up to this point, when we’ve talked about big data, we’ve taken a more traditional approach that treats big data as being information typically available in a database, as a signal or text format. However, looking forward we can anticipate that big data will begin to include more audio- and video-based information. Natural language, natural visual interpretation, and visual machine learning will become increasingly important forms of AI for big data, and AI-structured versions of audio and video will be integrated along with other forms of data.

Although recent use of the term “big data” has grown substantially, perhaps the term will someday be found inappropriately descriptive, once what is seen as big data changes with computing technology and capabilities: the scale of big data of today is likely to be little or small data in 10 years. Further, the term is likely to splinter, not unlike the term artificial intelligence, as different approaches or subdomains gain attention.

In any case, right now big data is enabling organizations to move away from intuitive to data-based decision making. Ultimately, enterprises will use big data because it creates value by solving new problems, as well as solving existing problems faster or cheaper, or providing a better and richer understanding of those problems. As a result, a key role of machine learning and AI is to help create value by providing enterprises with intelligent analysis of that data, and capturing structured interpretations of the wide variety of unstructured data increasingly available.

References

1. M. Cox and D. Ellsworth, “Managing Big Data for Scientific Visualization,” Proc. ACM Siggraph, ACM, 1997, pp. 5-1–5-17.
2. P. Zikopoulos et al., Harness the Power of Big Data, McGraw-Hill, 2013.
3. K. Ashton, “That ‘Internet of Things’ Thing,” RFID J., 22 June 2009; www.rfidjournal.com/article/view/4986.
4. Pingdom, “Internet 2011 in Numbers,” tech. blog, 17 Jan. 2012; http://royal.pingdom.com/2012/01/17/internet-2011-in-numbers.
5. UK Future Internet Strategy Group, “Future Internet Report,” tech. report, May 2011; https://connect.innovateuk.org/c/document_library/get_file?folderId=861750&name=DLFE-34705.pdf.
6. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.
7. E. Baldeschwieler, “Hadoop @ Yahoo,” 2009 Cloud Computing Expo, presentation, 2009; www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing.
8. J. Bollen and H. Mao, “Twitter Mood as a Stock Market Predictor,” Computer, vol. 44, no. 10, 2011, pp. 91–94.
9. S. Spangler et al., “COBRA – Mining Web for Corporate Brand and Reputation Analysis,” Web Intelligence and Agent Systems, vol. 7, no. 3, 2009, pp. 243–254.
10. D.E. O’Leary, “Knowledge Discovery for Continuous Financial Assurance Using Multiple Types of Digital Information,” Contemporary Perspectives in Data Mining, Information Age Publishing, 2012, pp. 103–122.
11. P. Hayes and S. Weinstein, “Construe/TIS: A System for Content-based Indexing of a Database of News Stories,” Proc. 2nd Conf. Innovative Applications of Artificial Intelligence, Assoc. for the Advancement of Artificial Intelligence (AAAI), pp. 49–64.
12. S. Kim and E. Hovy, “Determining the Sentiment of Opinions,” Proc. COLING Conf., Assoc. for Computational Linguistics, 2004, article no. 1367; doi:10.3115/1220355.1220555.
13. E. Kouloumpis, T. Wilson, and J. Moore, “Twitter Sentiment Analysis: The Good, the Bad, and the OMG!” Proc. 5th Int’l AAAI Conf. Weblogs and Social Media, AAAI, 2011, pp. 538–541.
14. X. Wu et al., “Top 10 Algorithms in Data Mining,” Knowledge and Information Systems, vol. 14, no. 1, 2008, pp. 1–37.
15. C. Chu et al., “MapReduce for Machine Learning on Multicore,” Proc. Neural Information Processing Systems Conf., Neural Information Processing Systems Foundation, 2006; http://books.nips.cc/nips19.html.
16. T. Kraska et al., “MLbase: A Distributed Machine Learning System,” Proc. 6th Biennial Conf. Innovative Data Systems Research, 2013; www.cs.berkeley.edu/~ameet/mlbase.pdf.
17. A. Verma et al., Scaling Simple and Compact Genetic Algorithms Using MapReduce, Illinois Genetic Algorithms Laboratory (IlliGAL) report no. 2009001, IlliGAL, Univ. of Illinois at Urbana-Champaign, 2009.
18. H. Plattner and A. Zeier, In-Memory Data Management, Springer, 2011.

Daniel E. O’Leary is a professor at the Marshall School of Business at the University of Southern California. Contact him at oleary@usc.edu.

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.
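
Sidebar: a sketch of the summation approach. The section on limitations of current AI algorithms describes writing a learning algorithm so that sufficient statistics from subproblems are captured, aggregated, and solved. The Python sketch below is one possible illustration of that idea, not the code of Chu and his colleagues. It assumes simple least-squares line fitting as the learning task (chosen here only because its sufficient statistics are five running sums), and the names map_stats, combine, solve, and fit_line are hypothetical. The built-in map stands in for handing chunks to cluster nodes; a real deployment would use a framework such as Hadoop.

```python
from functools import reduce


def map_stats(chunk):
    """Map step: summarize one chunk of (x, y) pairs as sufficient statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: statistics from two subproblems add element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Solve the least-squares line y = slope * x + intercept from the totals."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def fit_line(chunks):
    # Assumption: the built-in map stands in for distributing chunks to nodes.
    return solve(reduce(combine, map(map_stats, chunks)))


# Hypothetical data: points on y = 2x + 1, split across three "nodes".
points = [(x, 2 * x + 1) for x in range(9)]
chunks = [points[0:3], points[3:6], points[6:9]]
slope, intercept = fit_line(chunks)
```

Because the per-chunk statistics simply add, the fitted line is the same however the data is split, which is what lets each chunk be summarized on a separate node before a single small solve step.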

March/April 2013, www.computer.org/intelligent
