Editor: Daniel E. O’Leary, University of Southern California, oleary@usc.edu
Artificial Intelligence and Big Data
Daniel E. O’Leary, University of Southern California
and primarily research-based. Algorithms might lack documentation, support, and clear examples. Further, historically the focus of AI has largely been on single-machine implementations. With big data, we now need AI that’s scalable to clusters of machines or that can be logically set on a MapReduce structure such as Hadoop. As a result, effectively using current AI algorithms in big data enterprise settings might be limited.

However, recently, MapReduce has been used to develop parallel processing approaches to AI algorithms. Cheng-Tao Chu and his colleagues introduced the MapReduce approach into machine learning to facilitate a parallel programming approach to a variety of AI learning algorithms.15 Their approach was to show that they could write the algorithms in what they referred to as a summation approach, where sufficient statistics from the subproblems can be captured, aggregated, and solved. Using parallel processing, they obtained a linear speedup by increasing the number of processors.

Consistent with that development, along with Hadoop, there’s now a machine learning library with capabilities such as recommendation mining, clustering, and classification, referred to as Mahout (Hindi for a person who rides an elephant; see http://mahout.apache.org). Accordingly, this library can be combined with Hadoop to facilitate the ability of enterprises to use AI and machine learning in a parallel-processing environment for the analysis of large volumes of data.

Parallelizing Other Machine Learning Algorithms

AI researchers are increasingly drawn to the idea of integrating AI ca

For example, Tim Kraska and his colleagues,16 as well as others, have initiated research on issues in machine learning in distributed environments. However, AI researchers might not be familiar with issues such as parallelization. As a result, teams of AI and parallel computing researchers are combining efforts.

As part of the MapReduce approach, the “map” portion provides subproblems to nodes for further analysis, to provide the ability to parallelize. Different AI approaches and different units of analysis potentially can influence the extent to which algorithms can be attacked using MapReduce approaches and how the problems can be decomposed. However, in some cases, algorithms developed for single-machine environments can be extended readily to parallel processing environments.

Although the system in Hayes and Weinstein11 was developed prior to MapReduce developments, we can anticipate implementing it in such an environment. Because the algorithm categorizes individual news stories independently, one approach to decomposing the data into subproblems would be to process each news story separately in a cluster. As another example, Soo-Min Kim and Eduard Hovy12 analyzed sentiment data at the sentence level, generating structured analysis of unstructured data. If sentences are processed independently, then subproblems can be developed for the sentence level. Similarly, if the unit of analysis is hashtags or emoticons, then subproblems can be generated for those artifacts. If the task is monitoring transactions or other chunks of data,9 then individual transactions and chunks can be analyzed separately in parallel. As a result, we can see that AI algorithms designed for single-machine environments might have emergent subproblem structures useful for parallelization.
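As a rough illustration of how independent subproblems and summed sufficient statistics can fit together, the sketch below fits a straight line by least squares. It is not the code of Chu and his colleagues, nor anything from Mahout or Hadoop. The function names (map_chunk, combine, solve, fit_line), the choice of simple linear regression, and the data are all assumptions made for illustration, and the "nodes" are simulated with ordinary Python calls on one machine.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: boil one subproblem down to its sufficient statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: sufficient statistics simply add element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Solve the least-squares line from the aggregated statistics."""
    n, sx, sy, sxx, sxy = stats
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values are all identical; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def split(records, parts):
    """Partition the records into `parts` interleaved chunks."""
    return [records[i::parts] for i in range(parts)]


def fit_line(records, parts=4):
    """Assumes `records` is a non-empty list of (x, y) pairs."""
    chunks = [c for c in split(records, parts) if c]
    # On a cluster, each map_chunk call could run on a separate node;
    # here the calls run one after another in a single process.
    partials = map(map_chunk, chunks)
    return solve(reduce(combine, partials))


# Hypothetical data lying exactly on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
slope, intercept = fit_line(data, parts=4)
```

Because the per-chunk statistics only need to be added, the number of chunks changes how the work is divided but not the fitted line, up to floating-point rounding. Algorithms whose update step depends on the result of the previous pass do not decompose this simply.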
Emerging Issues

There are a number of emerging issues associated with AI and big data. First, unfortunately, the nature of some machine-learning algorithms (for example, iterative approaches such as genetic algorithms) can make their use in a MapReduce environment more difficult. As a result, researchers such as Abhishek Verma and his colleagues17 are investigating the design, implementation, and use of genetic algorithms and other iterative approaches on Hadoop.

Second, it follows that with big data there will also be dirty data, with potential errors, incompleteness, or differential precision. AI can be used to identify and clean dirty data or use dirty data as a means of establishing context knowledge for the data. For example, “consistent” dirty data might indicate a different context than the one assumed, such as data in a different language.

Third, since data visualization was one of the first uses of big data, we would expect AI to further facilitate additional developments. One approach could include capturing expert visualization capabilities in a knowledge base designed to facilitate analysis by other users as big data permeates the enterprise. Another approach is to make intelligent data visualization apps available, possibly for particular types of data.

Fourth, as flash-storage technology evolves, approaches such as in-memory database technology become increasingly feasible18 to potentially provide users with near-real-time analyses of larger databases, speeding decision-making capabilities. With in-memory approaches, business logic and algorithms can be stored with the data, and AI research can include developing