
Machine Teaching

A New Paradigm for Building Machine Learning Systems

arXiv:1707.06742v3 [cs.LG] 11 Aug 2017

Patrice Y. Simard patrice@microsoft.com
Saleema Amershi samershi@microsoft.com
David M. Chickering dmax@microsoft.com
Alicia Edelman Pelton aliciaep@microsoft.com
Soroush Ghorashi sorgh@microsoft.com
Christopher Meek meek@microsoft.com
Gonzalo Ramos goramos@microsoft.com
Jina Suh jinsuh@microsoft.com
Johan Verwey joverwey@microsoft.com
Mo Wang mowan@microsoft.com
John Wernsing johnwer@microsoft.com

Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

Abstract

The current processes for building machine learning systems require practitioners with deep knowledge of machine learning. This significantly limits the number of machine learning systems that can be created and has led to a mismatch between the demand for machine learning systems and the ability of organizations to build them. We believe that in order to meet this growing demand for machine learning systems we must significantly increase the number of individuals that can teach machines. We postulate that we can achieve this goal by making the process of teaching machines easy, fast and, above all, universally accessible.

While machine learning focuses on creating new algorithms and improving the accuracy of "learners", the machine teaching discipline focuses on the efficacy of the "teachers". Machine teaching as a discipline is a paradigm shift that follows and extends principles of software engineering and programming languages. We put a strong emphasis on the teacher and the teacher's interaction with data, as well as crucial components such as techniques and design principles of interaction and visualization.

In this paper, we present our position regarding the discipline of machine teaching and articulate fundamental machine teaching principles. We also describe how, by decoupling knowledge about machine learning algorithms from the process of teaching, we can accelerate innovation and empower millions of new uses for machine learning models.

1. Introduction

The demand for machine learning (ML) models far exceeds the supply of "machine teachers" that can build those models. Categories of common-sense understanding tasks that we would like to automate with computers include interpreting commands, customer support, or agents that perform tasks on our behalf. The combination of categories, domains, and tasks leads to millions of opportunities for building specialized, high-accuracy machine learning models. For example, we might be interested in building a model to understand voice commands for controlling a television or building an agent for making restaurant reservations. The key to opening up this large space of solutions is to increase the number of machine teachers by making the process of teaching machines easy, fast and universally accessible.

A large fraction of the machine learning community is focused on creating new algorithms to improve the accuracy of the "learners" (machine learning algorithms) on given labeled data sets. The machine teaching (MT) discipline is focused on the efficacy of the teachers given the learners. The metrics of machine teaching measure performance relative to human costs, such as
productivity, interpretability, robustness, and scaling with the complexity of the problem or the number of contributors.

Many problems that affect model building productivity are not addressed by traditional machine learning. One such problem is concept evolution, a process in which the teacher's underlying notion of the target class is defined and refined over time (Kulesza et al., 2014). Label noise or inconsistencies can be detrimental to traditional machine learning because it assumes that the target concept is fixed and is defined by the labels. In practice, concept definitions, schemas, and labels can change as new sets of rare positives are discovered or when teachers simply change their minds.

Consider a binary classification task for gardening web pages where the machine learner and feature set are fixed. The teacher may initially label botanical garden web pages as positive examples for the gardening concept, but then later decide that these are negative examples. Relabeling the examples when the target concept evolves is a huge burden on the teacher. From a teacher's perspective, concepts should be decomposable into sub-concepts, and the manipulation of the relationships between sub-concepts should be easy, interpretable, and reversible. At the onset, the teacher could decompose gardening into sub-concepts (that include botanical gardens) and label the web pages according to this concept schema.

In this scenario, labeling for sub-concepts has no benefit to the machine learning algorithm, but it benefits the teacher by enabling concept manipulation. Manipulation of sub-concepts can be done in constant time (i.e., not dependent on the number of labels), and the teacher's semantic decisions can be documented for communication and collaboration. Addressing concept evolution is but one example of where the emphasis on the teacher's perspective can make a large difference in model building productivity.

Machine teaching is a paradigm shift away from machine learning, akin to how other fields in programming languages have shifted from optimizing performance to optimizing productivity with the notions of functional programming, programming interfaces, version control, etc. The discipline of machine teaching follows and extends principles of software engineering and languages that are fundamental to software productivity. Machine teaching places a strong emphasis on the teacher and the teacher's interaction with data, and techniques and design principles of interaction and visualization are crucial components. Machine teaching is also directly connected to machine learning fundamentals, as it defines abstractions and interfaces between the underlying algorithm and the teaching language. Therefore, machine teaching lives at the intersection of the human-computer interaction, machine learning, visualization, systems, and engineering fields. The goal of this paper is to explore machine learning model building from the teacher's perspective.

2. The need for a new discipline

In 2016, at one of Microsoft's internal conferences (TechFest), during a panel titled "How Do We Build and Maintain Machine Learning Systems?", the host started the discussion by asking the audience "What is your worst nightmare?" in the context of building machine learning models for production. A woman raised her hand and gave the first answer:

"[...] Manage versions. Manage data versions. Being able to reproduce the models. What if, you know, the data disappears, the person disappears, the model disappears... And we cannot reproduce this. I have seen this hundreds of times in Bing. I have seen it every day. Like... Oh yeah, we had a good model. Ok, I need to tweak it. I need to understand it. And then... Now we cannot reproduce it. That is my biggest nightmare!"

To put context to this testimony, we review what building a machine learning model may look like in a product group:

1. A problem owner collects data, writes labeling guidelines, and optionally contributes some labels.

2. The problem owner outsources the task of labeling a large portion of the data (e.g., 50,000 examples).

3. The problem owner examines the labels and may discover that the guidelines are incorrect or that the sampled examples are inappropriate or inadequate for the problem. When that happens, GOTO step 1.

4. An ML expert is consulted to select the algorithm (e.g., deep neural network), the architecture (e.g., number of layers, units per layer, etc.), the objective function, the regularizers, the cross-validation sets, etc.

5. Engineers adjust existing features or create new features to improve performance. Models are trained and deployed on a fraction of traffic for testing.
6. If the system does not perform well on test traffic, GOTO step 1.

7. The model is deployed on full traffic. Performance of the model is monitored, and if that performance goes below a critical level, the model is modified by returning to step 1.

An iteration through steps 1 to 6 typically takes weeks. The system can be stable at step 7 for months. When it eventually breaks, it can be for a variety of reasons: the data distribution has changed, the competition has improved and the requirements have increased, new features are available and some old features are no longer available, the definition of the problem has changed, or a security update or other change has broken the code. At various steps, the problem owner, the machine learning expert, or the key engineer may have moved on to another group or another company. The features or the labels were not versioned or documented. No one understands how the data was collected because it was done in an ad hoc and organic fashion. Because multiple players with different expertise are involved, it takes a significant amount of effort and coordination to understand why the model does not perform as well as expected after being retrained. In the worst case, the model is operating but no one can tell if it is performing as expected, and no one wants the responsibility of turning it off. Machine learning "litter" starts accumulating everywhere. These problems are not new to machine learning in practice (Sculley et al., 2014).

The example above illustrates the fact that building a machine learning model involves more than just collecting data and applying learning algorithms, and that the management process of building machine learning solutions can be fraught with inefficiencies. There are other forms of inefficiency that are deeply embedded in the current machine learning paradigm. For instance, machine learning projects typically consist of a single monolithic model trained on a large labeled data set. If the model's summary performance metrics (e.g., accuracy, F1 score) were the only requirements and the performance remained unchanged, adding examples would not be a problem even if the new model errs on examples that were previously predicted correctly. However, for many problems for which predictability and quality control are important, any negative progress on model quality leads to laborious testing of the entire model and incurs high maintenance costs. A single monolithic model lacks the modularity required for most people to isolate and address the root cause of a regression problem.

2.1. Definitions of machine learning and machine teaching

It is difficult to argue that the challenges discussed above are given a high priority in the world's best machine learning conferences. These problems and inefficiencies do not stem from the machine learning algorithm, which is the central topic of the machine learning field; they come from the processes that use machine learning, from the interaction between people and machine learning algorithms, and from people's own limitations.

To give more weight to this assertion, we will define the machine learning research field narrowly as:

Definition 2.1 (Machine learning research) Machine learning research aims at making the learner better by improving ML algorithms.

This field covers, for instance, any new variations or breakthroughs in deep learning, unsupervised learning, recurrent networks, convex optimization, and so on. Conversely, we see version control, concept decomposition, semantic data exploration, expressiveness of the teaching language, interpretability of the model, and productivity as having more in common with programming and human-computer interaction than with machine learning. These "machine teaching" concepts, however, are extraordinarily important to any practitioner of machine learning. Hence, we define a discipline aimed at improving these concepts as:

Definition 2.2 (Machine teaching research) Machine teaching research aims at making the teacher more productive at building machine learning models.

We have chosen these definitions to minimize the intersection between the two fields and thus provide clarity and scoping. The two disciplines are complementary and can evolve independently. Of course, like any generalization, there are limitations. Curriculum learning (Bengio et al., 2009), for instance, could be seen as belonging squarely in the intersection because it involves both a learning algorithm and teacher behavior. Nevertheless, we have found these definitions useful to decide what to work on and what not to work on.

2.2. Decoupling machine teaching from machine learning

Machine teaching solutions require one or more machine learning algorithms to produce models throughout the teaching process (and for the final output). This requirement can make things complex for teachers. Different deployment environments may support
different runtime functions, depending on what resources are available (e.g., DSPs, GPUs, FPGAs, tight memory or CPU constraints) or what has been implemented and green-lighted for deployment. Machine learning algorithms can be understood as "compilers" that convert the teaching information to an instance of the set of functions available at runtime. For example, each such instance might be characterized by the weights in a neural network, the "means" in K-means, the support vectors in SVMs, or the decisions in decision trees. For each set of runtime functions, different machine learning compilers may be available (e.g., LBFGS, stochastic gradient descent), each with its own set of parameters (e.g., history size, regularizers, k-folds, learning rate schedules, batch size, etc.).

Machine teaching aims at shielding the teacher from both the variability of the runtime and the complexity of the optimization. This has a performance cost: optimizing for a target runtime with expert control of the optimization parameters will always outperform generic parameter-less optimization. It is akin to inlining assembly code. But like high-level programming languages, our goal with machine teaching is to reduce the human cost in terms of both maintenance time and required expertise. The teaching language should be "write once, compile anywhere", following the ISO C++ philosophy.

Using well-defined interfaces describing the inputs (feature values) and outputs (label value predictions) of machine learning algorithms, the teaching solution can leverage any machine learning algorithm that supports these interfaces. We impose three additional system requirements:

1. The featuring language available to the teacher should be expressive enough to enable examples to be distinguished in meaningful ways (a hash of a text document has distinguishing power, but it is not considered meaningful). This enables the teacher to remove feature blindness without necessarily increasing concept complexity.

2. The complexity (VC dimension) of the set of functions that the system can return increases with the dimension of the feature space. This enables the teacher to decrease the approximation error by adding features.

3. The available ML algorithms must satisfy the classical definition of learning consistency (Vapnik, 2013). This enables the teacher to decrease the estimation error by adding labeled examples.

The aim of these requirements is to enable teachers to create and debug any concept function to an arbitrary level of accuracy without being required to understand the runtime function space, learning algorithms, or optimization.

3. Analogy to programming

In this section, we argue that teaching machines is a form of programming. We first describe what machine teaching and programming have in common. Next, we highlight several tools developed to support software development that we argue are likely to provide valuable guidance and inspiration to the machine teaching discipline. We conclude this section with a discussion of the history of the discipline of programming and how it might be predictive of the trajectory of the discipline of machine teaching.

3.1. Commonalities and differences between programming and teaching

Assume that a software engineer needs to create a stateless target function (e.g., as in functional programming) that returns value Y given input X. While not strictly sequential, we can describe the programming process as a set of steps as follows:

1. The target function needs to be specified.

2. The target function can be decomposed into sub-functions.

3. Functions (including sub-functions) need to be tested and debugged.

4. Functions can be documented.

5. Functions can be shared.

6. Functions can be deployed.

7. Functions need to be maintained (scheduled and unscheduled debug cycles).

Further assume that a teacher wants to build a target classification function that returns class Y given input X. The process for machine teaching presented in the previous section is similar to the set of programming steps above. While there are strong similarities, there are also significant differences, especially in the debugging step (Table 1).

In order to strengthen the analogy between teaching and programming, we need a machine teaching language that lets us express these steps in the context of a machine learning model building task. For programming, examples of such languages include C++,
Table 1. Comparison of debugging steps in programming and machine teaching

Debugging in programming        Debugging in machine teaching
(3) Repeat:                     (3) Repeat:
    (a) Inspect                     (a) Inspect
    (b) Edit code                   (b) Edit/add knowledge (e.g., labels, features, ...)
    (c) Compile                     (c) Train
    (d) Test                        (d) Test

Python, JavaScript, etc., which can be compiled into machine language for execution. For teaching, the language is a means of expressing teacher knowledge in a form that a machine learning algorithm can leverage for training. Teacher knowledge does not need to be limited to providing labels but can be a combination of schema constraints (e.g., mutually exclusive labels for classification, state transition constraints in entity extraction1), labeled examples, and features. Just as new programming languages are being developed to address current limitations, we expect that new teaching languages will be developed that allow the teacher to communicate different types of knowledge and to communicate knowledge more effectively.

3.2. Programming paving the way forward

As we have illustrated in the previous sections, current machine learning processes require multiple people of different expertise with strong knowledge dependencies among them, there are no standards or tooling for versioning of data and models, and there is a strong co-dependency between problem formulation, training, and the underlying machine learning algorithms. Fortunately, the emerging discipline of machine teaching can leverage lessons learned from the programming, software engineering, and related disciplines. These disciplines have developed over the last half century and addressed many problems analogous to those machine teaching aims to solve. This is not surprising given their strong commonalities. In this section, we highlight several lessons and relate them to machine teaching.

3.2.1. Solving complex problems

The programming discipline has developed and improved a set of tools, techniques, and principles that allow software engineers to solve complex problems in ways that allow for efficient, maintainable, and understandable solutions. These principles include problem decomposition, encapsulation, abstraction, and design patterns. Rather than discussing each of these, we contrast the differing expectations between software engineers solving a complex problem and machine teachers solving a complex problem. One of the most powerful concepts that has allowed software engineers to write systems that solve complex problems is that of decomposition. The next anecdote illustrates its importance and power.

We asked dozens of software engineers the following:

1. Can you write a program that correctly implements the game Tetris?

2. Can you do it in a month?

The answer to the first question is universally "yes". The answer to the second question varies from "I think so" to "why would it take more than 2 days?". The first question is arguably related to the Church-Turing thesis, which states that all computable functions are computable by a Turing machine. If a human can compute the function, there exists a program that can perform the same computation on a Turing machine. In other words, given that there is an algorithm to implement the Tetris game, most respectable software engineers believe they can also implement the game on whatever machine they have access to and in whatever programming language they are familiar with. The answer to the second question is more puzzling. The state space in a Tetris game (informally, the number of configurations of the pieces on the screen) is very large; in fact, far larger than can be examined by the software engineer. Indeed, one might expect that the complexity of the program should grow exponentially with the size of the representation of a state in the state space. Yet, the software engineers seem confident that they can implement the game in under a month. The most likely explanation is that they consider the complexity of the implementation and the debugging to be polynomial in both the representation of the state space and the input.

Let us examine how machine learning experts react to similar questions asked about teaching a complex

1 In an address recognizer, we might want to require that the zip code appears after the state.
problem:

1. Can you teach a machine to recognize kitchen utensils in an image as well as you do?

2. Can you do it in a month?

When these questions were asked to another handful of machine learning experts, the answers were quite varied. While one person answered "yes" to both questions without hesitation, most machine learning experts were less confident about both questions, with answers including "probably", "I think so", "I am not sure", and "probably not". Implementing the Tetris game and recognizing simple non-deformable objects seem like fairly basic functions in either field, thus it is surprising that the answers to both sets of questions are so different.

The goal of both programming and teaching is to create a function. In that respect, the two activities have far more in common than they have differences. In both cases we are writing functions, so there is no reason to think that the Church-Turing thesis is not true for teaching. Despite the similarities, the expectations of success for creating, debugging, and maintaining such functions differ widely between software engineers and teachers. While programming languages and teaching languages are different, the answers to the questions were the same for all software engineers regardless of the programming language. Yet, most machine learning experts did not give upper bounds on how long it would take to solve a teaching problem, even when they thought the problem was solvable.

Software engineers have the confidence of being able to complete the task in a reasonable time because they have learned to decompose problems into smaller problems. Each smaller problem can be further decomposed until coding and debugging can be done in constant or polynomial time. For instance, to code Tetris, one can create a state module, a state transformation module, an input module, a scoring module, a shape display module, an animation module, and so on. Each of these modules can be further decomposed into smaller modules. The smaller modules can then be composed and debugged in polynomial time. Given that each module can be built efficiently, software engineers have confidence that they can code Tetris in less than a month's time.

It is interesting to observe that the ability to decompose a problem is a learned skill and is not easy to learn. A smart student could understand and learn all the functions of a programming language (variables, arrays, conditional statements, for loops, etc.) in a week or two. If the same student was asked to code Tetris after two weeks, they would not know where to start. After 6 to 12 months of learning how to program, most software engineers would be able to accommodate the task of programming the Tetris game in under a month.

Akin to how decomposition brings confidence to software engineers2 and an upper bound to solving complex problems, machine teachers can learn to decompose complex machine learning problems with the right tools and experiences, and the machine teaching discipline can bring the expectations of success for teaching a machine to a level comparable to that of programming.

3.2.2. Scaling to multiple contributors

The complexity of the problems that software engineers can solve has increased significantly over the past half century, but there are limits to the scale of problems that one software engineer can solve. To address this, many tools and techniques have been developed to enable multiple engineers to contribute to the solution of a problem. In this section, we focus on three concepts: programming languages, interfaces (APIs), and version control.

One of the key developments that enables scaling with the number of contributors is the creation of standardized programming languages. The use of a standardized programming language along with design patterns and documentation enables other collaborators to read, understand, and maintain the software. The analog to programming languages for machine teaching is the expression of a teacher's domain knowledge, which includes labels, features, and schemas. Currently, there is no standardization of the programming languages for machine teaching.

Another key development that enables scaling with the number of contributors is the use of componentization and interfaces, which are closely related to the idea of problem decomposition discussed above. Componentization allows for a separation of concerns that reduces development complexity, and clear interfaces allow for independent development and innovation. For instance, a software engineer does not need to consider the details of the hardware upon which the solution will run. For machine teaching, the development of clear interfaces for services required for teaching, such as training, sampling, and featuring, would enable independent teaching. In addition, having clear interfaces

2 For similar reasons, the ability to decompose also brings confidence to professional instructors and animal trainers.
for models, features, labels, and schemas enables composing these constituent parts to solve more complex problems, thus allowing for their use in problem decomposition.

The final development that enables scaling with the number of contributors is the development of version control systems. Modern version control systems support merging contributions by multiple software engineers, speculative development, isolation of bug fixes and independent feature development, and rolling back to previous versions, among many other benefits. The primary role of a version control system is to track and manage changes to the source code rather than keeping track of the compiled binaries. Similarly, in machine teaching, a version control system could support managing changes to the labels, features, schemas, and learners used for building the model, and enable reproducibility and branching for experimentation while providing the documentation and transparency necessary for collaboration.

3.2.3. Supporting the development of problem solutions

In the past few decades, there has been an explosion of tools and processes aimed at increasing programming productivity. These include the development of high-level programming languages, innovations in integrated development environments, and the creation of development processes. Some of these tools and processes have a direct analog in machine teaching, and some are yet to be developed and adapted. Table 2 presents a mapping of many of these tools and concepts to machine teaching.

3.3. The trajectory of the machine teaching discipline

We conclude this section with a brief review of the history of programming and how that might inform the trajectory of the machine teaching discipline. The history of programming is inexorably linked to the development of computers. Programming started with scientific and engineering tasks (1950s) with few programs and programming languages like FORTRAN that focused on compute performance. In the 1960s, the range of problems expanded to include management information systems, and the range of programming languages expanded to target specific application domains (e.g., COBOL). The explosion of the number of software engineers led to the realization that scaling with contributors was difficult (Brooks Jr, 1995). In the 1980s, the scope of problems to which programming was applied exploded with the advent of the personal computer, as did the number of software engineers solving the problems (e.g., with Basic). Finally, in the 1990s, another explosive round of growth began with the advent of web programming and programming languages like JavaScript and Java. As of the writing of this paper, the number of software engineers in the world is approaching 20 million!

Machine teaching is undergoing a similar explosion. Currently, much of the machine teaching effort is undertaken by experts in machine learning and statistics. Like the story of programming, the range of problems to which machine learning has been applied has been expanding. With the deep-learning breakthroughs in perceptual tasks in the 2010s (e.g., speech, vision, self-driving cars), there has been an incredible effort to broaden the range of problems addressed by teaching machines to solve them. Similar to the expanding population of software engineers, the advent of services like LUIS.ai (https://www.luis.ai/) and Wit.ai (https://wit.ai/) has enabled domain experts to build their own machine learning models with no machine learning knowledge. The discipline of machine teaching is young and in its formative stages. One can only expect that this growth will continue at an even quicker pace. In fact, machine teaching might be the path to bringing machine learning to the masses.

4. The role of teachers

The role of the teacher is to transfer knowledge to the learning machine so that it can generate a useful model that can approximate a concept. Let's define what we mean by this.

Definition 4.1 (Concept) A concept is a mapping from any example to a label value.

For example, the concept of a recipe web page can be represented by a function that returns zero or one, based on whether a web page contains a cooking recipe. In another example, an address concept can be represented by a function that, given a document, returns a list of token ranges, each labeled "address", "street", "zip", "state", etc. Label values for a binary concept could be "Is" and "Is Not". We may also allow an "Undecided" label, which allows a teacher to postpone labeling decisions or ignore ambiguous examples. Postponing a decision is important because the concept may be evolving in the teacher's head. An example of this is in (Kulesza et al., 2014).

Definition 4.2 (Feature) A feature is a concept that
Table 2. Mapping between programming and machine teaching

Programming                                 Machine teaching
Compiler                                    ML Algorithms (Neural Networks, SVMs)
Operating System/Services/IDEs              Training, Sampling, Featuring Services, etc.
Frameworks                                  ImageNet, word2vec, etc.
Programming Languages (Fortran,             Labels, Features, Schemas, etc.
  Python, C#)
Programming Expertise                       Teaching Expertise
Version Control                             Version Control
Development Processes (specifications,      Teaching Processes (data collection,
  unit testing, deployment, monitoring,       testing, publishing, etc.)
  etc.)

assigns each example a scalar value. integer number on the horizontal axis could be inter-
preted as an example. We ignore the programs that
We usually use feature to denote a concept when em- do not compile and the examples that are nonsensical.
phasizing its use in a machine learning model. For We now use Figure 1 to refer to the different ways a
example, the concept corresponding to the presence or teacher can pass information to a learning system.
absence of the word “recipe” in text examples might
be a useful feature when teaching the recipe concept. Definition 4.4 (Selection) Selection is the process
by which teachers gain access to an example that ex-
Definition 4.3 (Teacher) A teacher is the person emplifies useful aspects of a concept.
who transfers concept knowledge to a learning ma-
chine. Teachers can select specific examples by filtering the
set of unlabeled examples. By choosing these fil-
To clarify this definition of a teacher, the methods of ters deliberately, they can systematically explore the
knowledge transfer need to be defined. At this point, space and discover information relevant to concepts.
they include a) example selection (biased), b) label- For example, a teacher may discover insect recipes
ing, c) schema definition (relationship between labels), while building a recipe classifier by issuing a query on
d) featuring, and e) concept decomposition (where “source of proteins”. We note that uniform sampling
features are recursively defined as sub-models). The and uncertainty sampling, which have no explicit input
teachers are expected to make mistakes in all the forms from a teacher, are likely of little use for discovering
of knowledge transfer. These teaching “bugs” are com- rare clusters of positive examples. Combinations of se-
mon occurrences. mantic filters involving trained models are even more
Figure 1 illustrates how concepts, labels, features, and powerful (e.g., “nutrition proteins” and low score with
teachers are related. We assume that every concept is current classifier). This ability to find examples con-
a computable function of a representation of examples. taining useful aspects of a concept enables the teacher
The representation is assumed to include all available to find useful features and provide the labels to train
information about each example. The horizontal axis them. Furthermore, the selection choices themselves
represents the (infinite) space of examples. The verti- can be valuable documentation of the teaching pro-
cal axis represents the (infinite) space of programs or cess.
concepts. In computer science theory, programs and
examples can be represented as (long) integers. Us- Definition 4.5 (Label) A label is a (example, con-
ing that convention, each integer number on the ver- cept value) pair created by a teacher in relation to a
tical axis could be interpreted as a program, and each concept.
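Definitions 4.1, 4.2, and 4.4 can be made concrete with a small sketch. The code below is our illustration, not part of the paper: a feature is a concept that returns a scalar, and selection filters an unlabeled pool; all names are hypothetical.

```python
# Illustrative sketch of Definitions 4.1, 4.2, and 4.4 (names are ours).

def has_word_recipe(page_text):
    """A feature: a concept that assigns each example a scalar value (here 0/1)."""
    return 1 if "recipe" in page_text.lower() else 0

def select(pool, feature, threshold=1):
    """Selection: filter the unlabeled pool to surface candidate examples."""
    return [ex for ex in pool if feature(ex) >= threshold]

# A tiny unlabeled pool standing in for the (infinite) example space.
pool = ["Grilled insect recipe: a source of proteins",
        "Quarterly earnings report",
        "Grandma's cookie Recipe"]

candidates = select(pool, has_word_recipe)
# The teacher "divines" the target concept's value and records (example, label) pairs.
labels = [(ex, "Is") for ex in candidates]
```

The target concept itself stays in the teacher's head; the learning system only ever sees the features and the recorded labels.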
Figure 1. Representation of examples and concepts. Each column represents an example and contains all concept values for that example; a teacher looks in that direction to "divine" a label. The teacher has access to feature concepts not available to the training set (this is part of the teaching power), but does not know his or her own program. Each row represents a concept and contains the value of that concept for all examples; a teacher looks in that direction to "divine" the usefulness of a feature concept, and can guess the values over the space of examples (also part of the teaching power). Features selected by the teacher looking horizontally are immune to over-training.
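The geometry of Figure 1 can be sketched as a tiny finite grid (our illustration; the concepts here are toy functions): rows are concepts, columns are examples, labeling reads down a column, and feature assessment reads across a row.

```python
# A tiny finite slice of Figure 1: rows = concepts, columns = examples.
examples = ["chocolate cake recipe", "tax form", "pasta recipe blog"]
concepts = {
    "is_recipe":       lambda t: 1 if "recipe" in t else 0,  # target concept
    "has_word_recipe": lambda t: 1 if "recipe" in t else 0,  # candidate feature
    "has_word_tax":    lambda t: 1 if "tax" in t else 0,     # candidate feature
}
grid = {name: [f(ex) for ex in examples] for name, f in concepts.items()}

# Labeling "looks down a column": all concept values for one example.
column = {name: row[0] for name, row in grid.items()}
# Feature assessment "looks across a row": one concept's values over all examples.
row = grid["has_word_tax"]
```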

Teachers can provide labels by "looking at a column" in Figure 1. It is important to realize that teachers do not know which programs are running in their heads when they evaluate the target concept values. If they knew the programs, they would transfer their knowledge in programmatic form to the machine and would not need machine learning. Teachers instead look at the available data of an example and "divine" its label. They do this by unconsciously evaluating sub-features and combining them to make labeling decisions. The feature spaces and combination functions available to the teachers are beyond what is available through the training sets. This power is what makes teachers valuable for the purpose of creating labels.

Definition 4.6 (Schema) A schema is a relationship graph between concepts.

When multiple concepts are involved, a teacher can express relationships between them. For instance, the teacher could express that the concepts "Tennis" and "Soccer" are mutually exclusive, or that the concept "Tennis" implies the concept "Sport". These concept constraints are relationships between lines on the diagram (true across all examples). Separating the knowledge captured by the schema from the knowledge captured by the labels allows information to be conveyed and edited at a high level. The implied labels can be changed simply by changing the concept relationship. For instance, "Golf" could be moved from being a sub-concept of "Sport" to being mutually exclusive with it, or vice versa. Teachers can understand and change the semantics of a concept by reviewing its schema. Semantic decisions can be reversed without editing individual labels.

Definition 4.7 (Generic feature) A generic feature is a set of related feature functions.

Generic features are created by engineers in parametrizable form, and teachers instantiate individual features by providing useful and semantic parameters. For instance, a generic feature could be "Log(1 + number of instances of words in list X in a document)", and an instantiation would be setting X to a list of car brands (useful for an automotive classifier).

Given a set of generic features, teachers have the ability to evaluate different (instantiated) features by looking along the corresponding horizontal lines in Figure 1. Given two features, the teachers can "divine" that one is better than the other on a large unlabeled set. For instance, a teacher may choose a feature that measures the presence of the word "recipe" over a feature that measures the presence of the word "the",
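The generic feature quoted above, "Log(1 + number of instances of words in list X in a document)", can be sketched as a parametrized factory that a teacher instantiates by supplying a word list X. This is our illustration; the car-brand list is a hypothetical parameter.

```python
import math

def make_log_count_feature(word_list):
    """Generic feature: engineers ship the parametrized form; teachers
    instantiate it by supplying a semantically meaningful word list X."""
    words = {w.lower() for w in word_list}
    def feature(document):
        count = sum(1 for tok in document.lower().split() if tok in words)
        return math.log(1 + count)
    return feature

# Teacher instantiation: X = a list of car brands (hypothetical list).
car_brand_feature = make_log_count_feature(["ford", "toyota", "bmw"])
score = car_brand_feature("New Toyota and BMW models reviewed")  # log(1 + 2)
```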
even though the latter feature might yield better results on the training set. This ability to estimate the value of a feature over estimated distributions of the test set is essential to feature engineering, and is probably the most useful capability of a teacher. Features selected by the teacher in this manner are immune to over-training because they are created independently of the training set. Note the contrast with "automatic feature selection", which only looks at the training set and concept-independent statistics, and is susceptible to over-training.

Definition 4.8 (Decomposition) Decomposition is the act of using simpler concepts to express more complex ones.

Whereas teachers do not have direct access to the program implementing their concept, they sometimes can infer how these programs work. Socrates used to teach by asking the right questions. The "right question" is akin to providing a useful sub-concept whose value makes evaluating the overall concept easier. In other words, Socrates was teaching by decomposition rather than by examples. This ability is not equally available to all teachers; it is learned. It is essential to scaling with complexity and with the number of teachers. It is the same ability that helps software engineers decompose functions into sub-functions, and software engineers likewise acquire it with experience. As in programming, teaching decompositions are not unique (in software engineering, switching from one decomposition to another is called refactoring).

The knowledge provided by the teacher through concept decomposition is high level and modular. Each concept implementation might provide its own example selection, labels, schema, and features. These can be viewed as documentation of interfaces and contracts. Each concept implementation may be a black box, but the concept hierarchy is transparent and interpretable. Concept decomposition is the highest form of knowledge provided by the teacher.

Now that we have defined some of the key roles of the (machine) teacher, we turn to the question of how we meet the demand for them.

Meeting the demand for teachers

We postulate that the right solution to satisfy the increasing demand for machine learning models is to increase the number of people that can teach machines these models. But how do we do that, and who are they?

The current ML-focused workflows put the machine learning or data scientist in the driver's seat. While training more scientists is a way to increase the number of teachers, we believe that it is not the right path to follow. For starters, machine learning and data scientists are a scarce and expensive resource. Secondly, machine learning scientists can serve a better purpose inventing and optimizing learning algorithms. In the same way, data scientists are indispensable applying their expertise to make sense of data and transform it into a usable form.

The machine teaching process that we envision does not require the skills of an ML expert or data scientist. Machine teachers use their domain knowledge to pick the right examples and counterexamples for a concept and explain why they differ. They do this through an interactive information exchange with a learning system. It is within the ranks of the domain experts that we will find the large population of machine teachers that will increase, by orders of magnitude, the number of ML models used to solve problems. We can transform domain experts into machine teachers by making a machine teaching language universally accessible.

A key characteristic of domain experts is that they understand the semantics of a problem. To this point, we argue that if a problem's data does not need to be interpreted by a person to be useful, machine teaching is not needed. Consider problems for which labeled data is abundant or practically limitless, e.g., computer vision, speech understanding, genomics analysis, click prediction, and financial forecasting. For these, powerful learning algorithms or hardware may be the better strategy to arrive at an effective solution. In other problems like the above, feature selection using cross-validation can be used to arrive at a good solution without the need for a machine teacher.

There is, nonetheless, an ever-growing set of problems for which machine teaching is the right approach: problems where unlabeled data is plentiful and domain knowledge to articulate a concept is essential. Examples include controlling Internet-of-Things appliances through spoken dialogs and the environment's context, routing customer feedback for a brand-new product of a start-up to the right department, or building a one-time assistant to help a paralegal sift through hundreds of thousands of briefs.

We aim at reaching the same number of machine teachers as there are software engineers, a set counted in the tens of millions. Table 3 illustrates the differences in numbers between machine learning scientists, data scientists, and domain experts. By enabling domain experts to teach, we will enable them to apply their knowledge to directly solve millions of meaningful, personal, shared, "one-off", and recurrent problems at a scale that we have never seen.

5. Teaching process

A teaching or programming language can be applied in many different ways, some more effective than others. We propose the following principles for the language and process of machine teaching:

Universal teaching language We do not rely on the power of specific machine learning algorithms. The teaching interface is the same for all algorithms. If a machine learning algorithm is swapped for another one, more teaching may be necessary, but the teaching language and the model building experience do not change. Machine learning algorithms should be interchangeable. Conversely, the teaching language should be simple and easy to learn given the domain (e.g., text, signal, images). Ideally, we aim at designing an ANSI or ISO standard per domain. Teachers that speak the same language should be interchangeable.

Feature completeness (or realizability) We assume that all the target concepts that a teacher may want to implement are "realizable" through a recursive composition of models and existing features. This implies a property on the feature set, which we call "feature completeness". Feature completeness is the responsibility of the teaching tool, not the teachers. Teachers achieve realizability through the following actions:

1. Add missing features: If a teacher can distinguish two documents belonging to two different classes in a meaningful way, there must be a (corresponding) feature expressible in the system that can make an equivalent semantically meaningful distinction. By adding such a feature, the teacher can correct feature blindness errors. If no such feature exists, the language is not feature complete for distinguishing the desired classes.

2. Create features through decomposition: If the concept function cannot be learned from the existing set of features due to limitations of the model class, the teacher can circumvent this problem by creating features that are themselves models; we call this process "model decomposition". To illustrate the point, suppose there are two binary features A and B, and the teacher would like to produce a model for A ⊕ B (where ⊕ stands for XOR) using logistic regression. Because of the capacity limitations of logistic regression, it is impossible to represent A ⊕ B without additional features. If the teacher adds a third AND feature A ∧ B, however, logistic regression can work. Note that A ∧ B is itself learnable via logistic regression in the A and B feature space.

3. Explicitly ignore ambiguous patterns: Ambiguous patterns can be marked as "don't care" to avoid wasting features, labels, and the teacher's time on difficult examples. Areas of "don't care" are used as a coping mechanism to keep the realizability assumption despite the Bayes error rate. This action does not constrain the feature set.

Feature completeness of a teaching language does not imply that the language can be used to efficiently teach concepts. If a feature complete language is not very expressive, realizability can require a large number of model compositions. If a feature complete language is too expressive (e.g., features can be specified as programs), the teachers have to become engineers.

Rich and diverse sampling set We call the set of unlabeled documents accessible to the teacher when building models the "sampling distribution". We call the set of documents for which models are built the "deployment distribution". The rich-and-diverse-sampling-set principle is that the sampling distribution captures the richness and diversity of examples in the deployment distribution. The sampling and deployment distributions are preferably similar, but they do not have to be perfectly matched. The most important requirement on the sampling distribution is that all important types of documents be represented (rich and diverse). If important documents are missing from the sampling distribution, performance could be impacted in unpredictable ways. As a rule of thumb, unlabeled data should be collected indiscriminately because the cost of storing data is negligible compared to the cost of teaching; we view selectively collecting only the data that is meant to be labeled as both risky and limiting [5]. A rich and diverse data set allows the teacher to explore it and express knowledge through selection. It also allows the teacher to find examples that can be used to train sub-concepts that are more specific than the original concept. For instance, a teacher could decide to build classifiers for bonsai gardening (sub-concept) and botanical gardening (excluded concept) to be used as features for a gardening classifier. The sampling set needs to be rich enough to contain sufficient examples to successfully learn the sub-concepts. The sampling distribution can

[5] For example, the collected set may not contain important examples that would otherwise be found via the machine teaching process.
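The XOR example from the feature-completeness principle above can be checked numerically. The sketch below is our code, using a minimal hand-rolled logistic regression: A ⊕ B is not learnable from the features (A, B) alone, but becomes learnable once the teacher adds the conjunction A ∧ B as a third feature.

```python
import math

def train_logistic(rows, labels, lr=0.5, steps=5000):
    """Minimal batch-gradient logistic regression (returns weights and bias)."""
    n = len(rows[0])
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * n, 0.0
        for x, y in zip(rows, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for i in range(n):
                gw[i] += (p - y) * x[i]
            gb += p - y
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def accuracy(w, b, rows, labels):
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in rows]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

AB = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor = [a ^ b for a, b in AB]

# (A, B) alone: XOR is not linearly separable, so logistic regression fails.
acc_plain = accuracy(*train_logistic(AB, xor), AB, xor)

# The teacher adds the decomposed feature A AND B; the concept becomes realizable.
ABC = [(a, b, a & b) for a, b in AB]
acc_decomposed = accuracy(*train_logistic(ABC, xor), ABC, xor)
```

As the text notes, A ∧ B is itself realizable by logistic regression over (A, B), so the added feature can be a sub-model rather than hand-written code.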
Table 3. Where to find machine teachers

  Potential teacher           Quantities              Characteristics
  Machine learning experts    Tens of thousands       Have a profound understanding of machine learning. Can modify a machine learning algorithm or architecture to improve performance.
  Data scientists / analysts  Hundreds of thousands   Can analyze big data and detect trends and correlations using machine learning. Can train machine learning models on existing values to extract value for a business.
  Domain experts              Tens of millions        Understand the semantics of a problem. Can provide examples and counterexamples, and explain the difference between them.

be updated or re-collected. Examples that have been labeled by teachers, however, are kept forever because labels always retain some semantic value.

Distribution robustness The assumption that the training distribution matches the sampling or deployment distribution is unrealistic in practice. The role of the teacher is to create a model that is correct for any example, regardless of the deployment distribution. Given our assumption of feature completeness and a rich and diverse sampling set, the result of a successful teaching process should be robust to not knowing the deployment distribution. Imagine programming a "Sort" function: we expect "Sort" to work regardless of the distribution of the data it is sorting. Thanks to realizability, we have the same correctness expectation for teaching. Because the training data is discovered and labeled in an ad hoc way using filtering, distribution robustness is a critical assumption, and we therefore favor machine learning algorithms that are robust to covariate shift. Warning: having a mismatch between the training and sampling (or deployment) distributions complicates evaluation.

Modular development Decomposition is a central principle of both programming and machine teaching. The machine teaching process should support the modular development of concept implementations. This includes the decomposition of concepts into sub-concepts, and the use of models as features for other models. We can achieve this by standardizing model and feature interfaces. Similar to a programming integrated development environment (IDE), within our teaching IDE, concept implementation is done through "projects" that are grouped into "solutions". Projects in a solution are trained together because their retraining can affect each other. Dependencies across different solutions are treated as versioned packages, which means that retraining a project in one solution does not affect a project in a different solution (the teacher must update the package reference to incorporate such changes). The modular development principle encourages the sharing of explicit concept implementations.

Version control All teacher actions (e.g., labels, features, label constraints, schema and dependency graph, and even programming code if necessary) are equivalent to a concept "program". They are saved in the same "commit". Like programming code, the teacher's actions relevant to a concept are saved in a version control system. Different types of actions are kept in different files to facilitate merge operations between contributions from different teachers.

The combination of these principles suggests a teaching process that is different from the standard teaching process. The universal teaching language implies that the machine learning expert can be left out of the teaching loop. The feature completeness principle implies that the engineers can be left out of the teaching loop as well. The teaching tool should provide the teacher with all that is needed to build models effectively. The engineers can update the data pipeline and the programming language, but neither is concept-dependent, so the engineer is out of the teaching loop. These two principles imply that a single person with domain and teaching knowledge can own the whole
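The retraining scope implied by the modular development principle can be sketched in a few lines. This is our illustration, with hypothetical solution, project, and package names: retraining a project can affect every project in its own solution, but never projects in other solutions, which pin a versioned package.

```python
# Hypothetical sketch of the solution/project retraining scope described above.
solutions = {
    "cooking":   {"projects": {"recipe", "ingredient_list"}, "deps": set()},
    "assistant": {"projects": {"intent"}, "deps": {"cooking==2.0"}},  # versioned package
}

def retraining_scope(solution_name, project):
    """Retraining a project can affect every project in its own solution,
    but not projects in other solutions (they depend on a frozen version)."""
    assert project in solutions[solution_name]["projects"]
    return set(solutions[solution_name]["projects"])

affected = retraining_scope("cooking", "recipe")
# "intent" (in the "assistant" solution) is unaffected until the teacher
# explicitly updates its package reference "cooking==2.0".
```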
repeat
    while training set is realizable do
        if quality criterion is met then
            exit
        end
        // Actively and semantically explore the sampling set using concept-based filters.
        Find a test error (i.e., an incorrectly predicted (example, label) pair);
        Add the example to the training set;
    end
    // Fix the training set error
    if the training error is caused by labeling error(s) then
        Correct the labeling error(s);
    else
        // Fix feature blindness. This may entail one or more of the following actions:
        Add or edit basic features;
        Create a new concept/project for a new feature (decomposition);
        Change label constraints or schema (high-level knowledge);
    end
until forever;

Algorithm 1: A machine teaching process
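Algorithm 1 can be rendered as an executable sketch. Everything below is our toy instantiation, not the paper's system: examples are the integers 0-9, the target concept is x ≥ 5, the learner is a perceptron over teacher-supplied features, and the teacher "fixes feature blindness" by adding one feature when the training set stops being realizable.

```python
# Toy executable rendering of Algorithm 1 (our instantiation, not the paper's system).

def perceptron(rows, labels, epochs=100):
    """Tiny learner over the current feature set (last weight is the bias)."""
    w = [0.0] * (len(rows[0]) + 1)
    def predict(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0 else 0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = y - predict(x)
            if err:
                for i, xi in enumerate(list(x) + [1.0]):
                    w[i] += err * xi
    return predict

target = lambda x: 1 if x >= 5 else 0    # the concept in the teacher's head
pool = list(range(10))                   # rich (here: exhaustive) sampling set
features = [lambda x: float(x % 2)]      # initial, inadequate feature set
train = []                               # teacher-labeled (example, label) pairs

while True:
    featurize = lambda x: tuple(f(x) for f in features)
    rows = [featurize(x) for x, _ in train]
    model = perceptron(rows, [y for _, y in train]) if train else (lambda r: 0)
    if all(model(featurize(x)) == y for x, y in train):  # training set realizable
        errors = [x for x in pool if model(featurize(x)) != target(x)]
        if not errors:
            break                        # quality criterion met: no test errors found
        train.append((errors[0], target(errors[0])))     # add the found error
    else:
        # The labels are correct here, so fix feature blindness: add a feature.
        features.append(lambda x: 1.0 if x >= 5 else 0.0)
```

In this run the loop first accumulates examples, hits a contradiction that the parity feature cannot resolve, adds the missing feature, and then converges with zero errors on the pool.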

process. The availability of a rich and diverse sampling set means that the traditional data-collection-for-labeling step is not part of the concept teaching process. The distribution robustness principle allows the teacher to explore and label freely throughout the process without worrying about balancing classes or example types. Concept modularity and version control guarantee that a function created in a project is reproducible, provided that (1) all of its features are deterministic and (2) training is deterministic. The concept modularity principle enables interpretability and scaling with complexity. The interpretability comes from being able to explain what each sub-concept does by looking at its labels, features, or schema. Even if each sub-concept is a black box inside, their interfaces are transparent. The merge functionality in version control enables easy collaboration between multiple teachers.

Based on the above, we propose a skeleton for a teaching process in Algorithm 1. Note that this process is not unique.

Evaluating the quality criterion in a distribution-robust setting is difficult and beyond the scope of this paper. A simple criterion could be to pause when the teacher's cost or time invested reaches a given limit. Finding test errors effectively is also difficult and beyond the scope of this paper. The idea is to query over the large sample set by leveraging query-specific teacher-created concepts and sub-concepts. The art is to maximize the semantic expressiveness of querying and the diversity of results. Uncertainty sampling is a trivial and uninteresting case (ambiguous examples are not useful for coming up with new decomposition concepts).

There are a few striking differences between the teaching process above and the standard model building process. The most important aspect is that it can be done by a single actor operating on the true distribution. Knowledge transfer from teacher to learner has multiple modalities (selection, labels, features, constraints, schema). The process is a never-ending loop reminiscent of Tom Mitchell's NELL (Carlson et al., 2010). Capacity is increased on demand, so there is no need for traditional regularization, because the teacher controls the capacity of the learning system by adding features only when necessary.

6. Conclusion

Over the past two decades, the machine learning field has devoted most of its energy to developing and improving learning algorithms. For problems in which data is plentiful and statistical guarantees are sufficient, this approach has paid off handsomely. The field is now evolving toward addressing a larger set of simpler and more ephemeral problems. While the demand to solve these problems effectively grows, access to teachers that can build the corresponding solutions is limited by their scarcity and cost. To truly meet this demand, we need to advance the discipline of machine teaching. This shift is identical to the shift in the programming field in the 1980s and 1990s. This
parallel yields a wealth of benefits. This paper takes inspiration from three lessons from the history of programming. The first is problem decomposition and modularity, which have allowed programming to scale with complexity; we argue that a similar approach has the same benefits for machine teaching. The second lesson is the standardization of programming languages: write once, run everywhere. This paper is not proposing a standard machine teaching language, but we have enumerated the most important machine-learning-agnostic knowledge channels available to the teacher. The final lesson is process discipline, which includes separation of concerns and the building of standard tools and libraries. This addresses the same limitations to productivity and scaling with the number of contributors that plagued programming (as described in "The Mythical Man-Month" (Brooks Jr, 1995)). We have proposed a set of principles that lead to a better teaching process discipline. Some of the tools of programming, such as version control, can be used as is. Some of these principles have been successfully applied in services such as LUIS.ai and by product groups inside Microsoft such as Bing Local. We are in the early stages of building a teaching interactive development environment.

On a more philosophical note, large monolithic systems, as epitomized by deep learning, are a popular trend in artificial intelligence. We see this as a form of machine learning behaviorism: the idea that complex concepts can be learned from a large set of (input, output) pairs. With the aid of regularizers and/or deep representations computed using unsupervised or semi-supervised learning, the monolithic learning approach has yielded impressive results. This has been the case in several fields where labeled data is abundant (speech, vision, machine translation). The monolithic approach, however, has limitations when labeled data is hard to come by. Deep representations built from unlabeled data optimize where the data is. Rare misspellings that are domain specific are likely to be ignored or misinterpreted if they appear more frequently in a different domain context. Corner cases with little or no labels for autonomous driving may be ignored at great peril. Large (amorphous) models are hard to interpret. These limitations can be overcome by injecting semantic knowledge via active teaching (e.g., labels, features, structure). For this reason, we believe that both large monolithic systems and systems more actively supervised by teaching have important roles to play in machine learning. As a bonus, they can easily be combined to complement each other.

7. Acknowledgements

We thank Jason Williams for his support and contributions to Machine Teaching. Jason is the creator of the LUIS (Language Understanding Internet Services) project, a service for building language understanding models, based on the principles mentioned in this paper. We thank Riham Mansour and the Microsoft Cairo team for co-building, maintaining, and improving the www.LUIS.ai service. Finally, we thank Matthew Hurst and his team for building high-performing web page entity extractors leveraging our machine teaching tools. These entity extractors are deployed in Bing Local.

References

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). ACM, New York, NY, USA, 41-48. https://doi.org/10.1145/1553374.1553380

Frederick P. Brooks Jr. 1995. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition, 2/E. Pearson Education India.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Toward an Architecture for Never-ending Language Learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10). AAAI Press, 1306-1313. http://dl.acm.org/citation.cfm?id=2898607.2898816

Todd Kulesza, Saleema Amershi, Rich Caruana, Danyel Fisher, and Denis Charles. 2014. Structured Labeling for Facilitating Concept Evolution in Machine Learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 3075-3084.

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. 2014. Machine Learning: The High Interest Credit Card of Technical Debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).

Vladimir Vapnik. 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
