Science ML 2015
[Fig. 2 panels: input image → convolutional feature extraction (14 × 14 feature map) → LSTM with attention over the image → word-by-word generation of the caption "A bird flying over a body of water."]
Fig. 2. Automatic generation of text captions for images with deep networks. A convolutional neural network is trained to interpret images, and its output is then used by a recurrent neural network trained to generate a text caption (top). The sequence at the bottom shows the word-by-word focus of the network on different parts of the input image as it generates the caption. [Adapted with permission from (30)]
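The per-word attention that the caption describes can be sketched in a few lines. This is a minimal, self-contained illustration, not the architecture of (30): the dot-product scoring, the function names, and the toy 8-dimensional features are assumptions made for the example. At each decoding step, the decoder state scores every spatial location of the feature map, the scores are normalized into attention weights, and the weighted average of the location features forms the context vector that conditions the next word.

```python
import math
import random

def softmax(scores):
    # Turn arbitrary scores into attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, state):
    # Score each spatial location against the decoder state (a dot product
    # is one simple scoring choice), then form the weighted-average
    # "context" vector that conditions the next generated word.
    scores = [sum(f * s for f, s in zip(feat, state)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, context

random.seed(0)
LOCATIONS, DIM = 14 * 14, 8  # a 14 x 14 grid of feature vectors (toy dimension)
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(LOCATIONS)]
state = [random.gauss(0, 1) for _ in range(DIM)]

weights, context = attend(features, state)
print(len(weights), round(sum(weights), 6))  # one weight per location; weights sum to 1
```

Visualizing `weights` reshaped to 14 × 14 at each step yields exactly the kind of shifting focus maps shown at the bottom of the figure.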
[Fig. 3 panels: topics as probability distributions across words (e.g., gene 0.04, dna 0.02, genetic 0.01, …; brain 0.04, neuron 0.02, nerve 0.01, …); an example document with its words assigned to topics; and the document's topic-proportion histogram.]
Fig. 3. Topic models. Topic modeling is a methodology for analyzing documents, where a document is viewed as a collection of words, and the words in
the document are viewed as being generated by an underlying set of topics (denoted by the colors in the figure). Topics are probability distributions
across words (leftmost column), and each document is characterized by a probability distribution across topics (histogram). These distributions are
inferred based on the analysis of a collection of documents and can be used to classify, index, and summarize the content of documents. [From (31).
Copyright 2012, Association for Computing Machinery, Inc. Reprinted with permission]
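The generative story in the caption can be written down directly: for each word, draw a topic from the document's topic proportions, then draw a word from that topic's distribution; fitting a topic model amounts to inverting this process. The sketch below uses toy topics, words, and probabilities invented for illustration, not values from any fitted model.

```python
import random

random.seed(0)

# Toy topics: each topic is a probability distribution across words,
# mirroring the leftmost column of Fig. 3. All values are illustrative.
TOPICS = {
    "genetics":     {"gene": 0.5, "dna": 0.3, "genetic": 0.2},
    "neuroscience": {"brain": 0.5, "neuron": 0.3, "nerve": 0.2},
}

def sample(dist):
    # Draw one key from a {item: probability} distribution.
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point round-off

def generate_document(topic_proportions, n_words):
    # Per word: draw a topic from the document's topic proportions,
    # then draw a word from that topic. Returns (topic, word) pairs.
    return [(t, sample(TOPICS[t]))
            for t in (sample(topic_proportions) for _ in range(n_words))]

# A document that is 70% genetics, 30% neuroscience (cf. the histogram in Fig. 3).
doc = generate_document({"genetics": 0.7, "neuroscience": 0.3}, n_words=10)
print(doc)
```

Inference algorithms for topic models run this story in reverse, recovering the topic distributions and per-document proportions from the observed words alone.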
(see Fig. 2). Deep network methods are being actively pursued in a variety of additional applications from natural language translation to collaborative filtering.

The internal layers of deep networks can be viewed as providing learned representations of the input data. While much of the practical success in deep learning has come from supervised learning methods for discovering such representations, efforts have also been made to develop deep learning algorithms that discover useful representations of the input without the need for labeled training data (13). The general problem is referred to as unsupervised learning, a second paradigm in machine-learning research (2).

Broadly, unsupervised learning generally involves the analysis of unlabeled data under assumptions about structural properties of the data (e.g., algebraic, combinatorial, or probabilistic). For example, one can assume that data lie on a low-dimensional manifold and aim to identify that manifold explicitly from data. Dimension reduction methods—including principal components analysis, manifold learning, factor analysis, random projections, and autoencoders (1, 2)—make different specific assumptions regarding the underlying manifold (e.g., that it is a linear subspace, a smooth nonlinear manifold, or a collection of submanifolds). Another example of dimension reduction is the topic modeling framework depicted in Fig. 3. A criterion function is defined that embodies these assumptions—often making use of general statistical principles such as maximum likelihood, the method of moments, or Bayesian integration—and optimization or sampling algorithms are developed to optimize the criterion. As another example, clustering is the problem of finding a partition of the observed data (and a rule for predicting future data) in the absence of explicit labels indicating a desired partition. A wide range of clustering procedures has been developed, all based on specific assumptions regarding the nature of a "cluster." In both clustering and dimension reduction, the concern with computational complexity is paramount, given that the goal is to exploit the particularly large data sets that are available if one dispenses with supervised labels.

A third major machine-learning paradigm is reinforcement learning (14, 15). Here, the information available in the training data is intermediate between supervised and unsupervised learning. Instead of training examples that indicate the correct output for a given input, the training data in reinforcement learning are assumed to provide only an indication as to whether an action is correct or not; if an action is incorrect, there remains the problem of finding the correct action. More generally, in the setting of sequences of inputs, it is assumed that reward signals refer to the entire sequence; the assignment of credit or blame to individual actions in the sequence is not directly provided. Indeed, although simplified versions of reinforcement learning known as bandit problems are studied, where it is assumed that rewards are provided after each action, reinforcement learning problems typically involve a general control-theoretic setting in which the learning task is to learn a control strategy (a "policy") for an agent acting in an unknown dynamical environment, where that learned strategy is trained to choose actions for any given state, with the objective of maximizing its expected reward over time. The ties to research in control theory and operations research have increased over the years, with formulations such as Markov decision processes and partially observed Markov decision processes providing points of contact (15, 16). Reinforcement-learning algorithms generally make use of ideas that are familiar from the control-theory literature, such as policy iteration, value iteration, rollouts, and variance reduction, with innovations arising to address the specific needs of machine learning (e.g., large-scale problems, few assumptions about the unknown dynamical environment, and the use of supervised learning architectures to represent policies). It is also worth noting the strong ties between reinforcement learning and many decades of work on learning in psychology and neuroscience, one notable example being the use of reinforcement-learning algorithms to predict the response of dopaminergic neurons in monkeys learning to associate a stimulus light with subsequent sugar reward (17).

Although these three learning paradigms help to organize ideas, much current research involves blends across these categories. For example, semisupervised learning makes use of unlabeled data to augment labeled data in a supervised learning context, and discriminative training blends architectures developed for unsupervised learning with optimization formulations that make use of labels. Model selection is the broad activity of using training data not only to fit a model but also to select from a family of models, and the fact that training data do not directly indicate
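The value iteration mentioned in the reinforcement-learning discussion above can be made concrete with a short sketch. This is a minimal routine on a toy, fully known Markov decision process (reinforcement-learning algorithms must additionally cope with dynamics that are unknown and only sampled); the states, dynamics, and rewards below are invented for illustration.

```python
# Value iteration on a toy Markov decision process: five states on a
# 1-D track, actions "left" (-1) and "right" (+1), and a reward of 1
# for entering the rightmost (terminal) state.
N_STATES, GAMMA, THETA = 5, 0.9, 1e-8
ACTIONS = (-1, +1)

def step(state, action):
    # Deterministic dynamics; entering the last state pays reward 1.
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

V = [0.0] * N_STATES  # value estimates; the terminal state stays at 0
while True:
    delta = 0.0
    for s in range(N_STATES - 1):
        # Bellman backup: best one-step reward plus discounted future value.
        best = max(r + GAMMA * V[nxt]
                   for nxt, r in (step(s, a) for a in ACTIONS))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:  # stop once no state's value changes appreciably
        break

def greedy(s):
    # The policy implied by V: pick the action with the best one-step lookahead.
    def q(a):
        nxt, r = step(s, a)
        return r + GAMMA * V[nxt]
    return max(ACTIONS, key=q)

policy = [greedy(s) for s in range(N_STATES - 1)]
print(policy)  # prints [1, 1, 1, 1]: every state prefers moving right
```

Policy iteration and rollouts apply the same Bellman backup in different schedules; when the dynamics must instead be estimated from sampled experience, sample-based variants of this backup (e.g., temporal-difference updates) take its place.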
ments and explicitly allow users to express and control trade-offs among resources. As an example of resource constraints, let us suppose that the data are provided by a set of individuals who wish to retain a degree of privacy

the product in some fashion (perhaps by purchasing that item in the past). The machine-learning problem is to suggest other items to a given user that he or she may also be interested in, based on the data across all users.

machine learning remains a young field with many underexplored research opportunities. Some of these opportunities can be seen by contrasting current machine-learning approaches to the types of learning we observe in naturally
occurring systems such as humans and other animals, organizations, economies, and biological evolution. For example, whereas most machine-learning algorithms are targeted to learn one specific function or data model from one single data source, humans clearly learn many different skills and types of knowledge, from years of diverse training experience, supervised and unsupervised, in a simple-to-more-difficult sequence (e.g., learning to crawl, then walk, then run). This has led some researchers to begin exploring the question of how to construct computer lifelong or never-ending learners that operate nonstop for years, learning thousands of interrelated skills or functions within an overall architecture that allows the system to improve its ability to learn one skill based on having learned another (26–28). Another aspect of the analogy to natural learning systems suggests the idea of team-based, mixed-initiative learning. For example, whereas current machine-learning systems typically operate in isolation to analyze the given data, people often work in teams to collect and analyze data (e.g., biologists have worked as teams to collect and analyze genomic data, bringing together diverse experiments and perspectives to make progress on this difficult problem). New machine-learning methods capable of working collaboratively with humans to jointly analyze complex data sets might bring together the abilities of machines to tease out subtle statistical regularities from massive data sets with the abilities of humans to draw on diverse background knowledge to generate plausible explanations and suggest new hypotheses. Many theoretical results in machine learning apply to all learning systems, whether they are computer algorithms, animals, organizations, or natural evolution. As the field progresses, we may see machine-learning theory and algorithms increasingly providing models for understanding learning in neural systems, organizations, and biological evolution and see machine learning benefit from ongoing studies of these other types of learning systems.

As with any powerful technology, machine learning raises questions about which of its potential uses society should encourage and discourage. The push in recent years to collect new kinds of personal data, motivated by its economic value, leads to obvious privacy issues, as mentioned above. The increasing value of data also raises a second ethical issue: Who will have access to, and ownership of, online data, and who will reap its benefits? Currently, much data are collected by corporations for specific uses leading to improved profits, with little or no motive for data sharing. However, the potential benefits that society could realize, even from existing online data, would be considerable if those data were to be made available for public good.

To illustrate, consider one simple example of how society could benefit from data that are already online today by using these data to decrease the risk of global pandemic spread from infectious diseases. By combining location data from online sources (e.g., location data from cell phones, from credit-card transactions at retail outlets, and from security cameras in public places and private buildings) with online medical data (e.g., emergency room admissions), it would be feasible today to implement a simple system to telephone individuals immediately if a person they were in close contact with yesterday was just admitted to the emergency room with an infectious disease, alerting them to the symptoms they should watch for and precautions they should take. Here, there is clearly a tension and trade-off between personal privacy and public health, and society at large needs to make the decision on how to make this trade-off. The larger point of this example, however, is that, although the data are already online, we do not currently have the laws, customs, culture, or mechanisms to enable […] us. Considerations such as these suggest that machine learning is likely to be one of the most transformative technologies of the 21st century. Although it is impossible to predict the future, it appears essential that society begin now to consider how to maximize its benefits.

[Fig. 5 panels: a processing-engine layer built around Spark Core, with SparkSQL, GraphX, SparkR, Splash, MLlib, ML pipelines, Velox, and Succinct above it; a storage layer (Tachyon; HDFS, S3, Ceph, …); a resource-virtualization layer (Mesos, Hadoop Yarn); components color-coded as AMPLab developed, Spark community, or 3rd party.]

Fig. 5. Data analytics stack. Scalable machine-learning systems are layered architectures that are built on parallel and distributed computing platforms. The architecture depicted here—an open-source data analysis stack developed in the Algorithms, Machines and People (AMP) Laboratory at the University of California, Berkeley—includes layers that interface to underlying operating systems; layers that provide distributed storage, data management, and processing; and layers that provide core machine-learning competencies such as streaming, subsampling, pipelines, graph processing, and model serving.

REFERENCES
1. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York, 2011).
2. K. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, Cambridge, MA, 2012).
3. L. Valiant, Commun. ACM 27, 1134–1142 (1984).
4. V. Chandrasekaran, M. I. Jordan, Proc. Natl. Acad. Sci. U.S.A. 110, E1181–E1190 (2013).
5. S. Decatur, O. Goldreich, D. Ron, SIAM J. Comput. 29, 854–879 (2000).
6. S. Shalev-Shwartz, O. Shamir, E. Tromer, Using more data to speed up training time, Proceedings of the Fifteenth Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 21 to 23 April 2012.
7. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, in Foundations and Trends in Machine Learning 3 (Now Publishers, Boston, 2011), pp. 1–122.
8. S. Sra, S. Nowozin, S. Wright, Optimization for Machine Learning (MIT Press, Cambridge, MA, 2011).
9. J. Schmidhuber, Neural Netw. 61, 85–117 (2015).
10. Y. Bengio, in Foundations and Trends in Machine Learning 2 (Now Publishers, Boston, 2009), pp. 1–127.
11. A. Krizhevsky, I. Sutskever, G. Hinton, Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
12. G. Hinton et al., IEEE Signal Process. Mag. 29, 82–97 (2012).
13. G. E. Hinton, R. R. Salakhutdinov, Science 313, 504–507 (2006).
14. V. Mnih et al., Nature 518, 529–533 (2015).
15. R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998).
16. E. Yaylali, J. S. Ivy, Partially observable MDPs (POMDPs): Introduction and examples, Encyclopedia of Operations Research and Management Science (John Wiley, New York, 2011).
17. W. Schultz, P. Dayan, P. R. Montague, Science 275, 1593–1599 (1997).
18. C. Dwork, F. McSherry, K. Nissim, A. Smith, in Proceedings of the Third Theory of Cryptography Conference, New York, 4 to 7 March 2006, pp. 265–284.
19. A. Blum, K. Ligett, A. Roth, J. ACM 20 (2013).
20. J. Duchi, M. I. Jordan, J. Wainwright, J. ACM 61, 1–57 (2014).
21. M.-F. Balcan, A. Blum, S. Fine, Y. Mansour, Distributed learning, communication complexity and privacy, Proceedings of the 29th Conference on Computational Learning Theory, Edinburgh, UK, 26 June to 1 July 2012.
22. Y. Zhang, J. Duchi, M. Jordan, M. Wainwright, in Advances in Neural Information Processing Systems 26, L. Bottou, C. Burges, Z. Ghahramani, M. Welling, Eds. (Curran Associates, Red Hook, NY, 2014), pp. 1–23.
23. Q. Berthet, P. Rigollet, Ann. Stat. 41, 1780–1815 (2013).
24. A. Kleiner, A. Talwalkar, P. Sarkar, M. I. Jordan, J. R. Stat. Soc. B 76, 795–816 (2014).
25. M. Mahoney, Found. Trends Mach. Learn. 3, 123–224 (2011).
26. T. Mitchell et al., Proceedings of the Twenty-Ninth Conference on Artificial Intelligence (AAAI-15), Austin, TX, 25 to 30 January 2015.
27. M. Taylor, P. Stone, J. Mach. Learn. Res. 10, 1633–1685 (2009).
28. S. Thrun, L. Pratt, Learning To Learn (Kluwer Academic Press, Boston, 1998).
29. L. Wehbe et al., PLOS ONE 9, e112575 (2014).
30. K. Xu et al., Proceedings of the 32nd International Conference on Machine Learning, vol. 37, Lille, France, 6 to 11 July 2015, pp. 2048–2057.
31. D. Blei, Commun. ACM 55, 77–84 (2012).

10.1126/science.aaa8415
Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the
American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright
2015 by the American Association for the Advancement of Science; all rights reserved. The title Science is a
registered trademark of AAAS.