IEEE/ASME TRANSACTIONS ON MECHATRONICS, VOL. 24, NO. 1, FEBRUARY 2019 197

A Method for Sensor Reduction in a Supervised Machine Learning Classification System
Niko Murrell, Ryan Bradley, Nikhil Bajaj, Member, IEEE, Julie Gordon Whitney, and George T.-C. Chiu, Senior Member, IEEE

Abstract—Smart devices employing interconnected sensors for feedback and control are being rapidly adopted. Many useful applications for these devices are in markets that demand cost-conscious solutions. Traditional machine-learning-based control systems often rely on multiple measurements from many sensors to achieve performance targets. An alternative method is presented that leverages a time-series output produced by a single sensor. By using domain expert knowledge, the time-series output is discretized into finite intervals that correspond to the physical events occurring in the system. Statistical measures are taken across these intervals to serve as the features to the machine learning system. Additional features that decouple key physical metrics are identified, improving the performance of the system. This novel approach requires a more modest dataset and does not compromise performance. The resulting development effort is significantly more cost-effective than traditional sensor classification systems, not only due to the reduced sensor count, but also due to a significantly simplified and more robust algorithm development and testing step. Results are presented with the case study of a media-type classification system within a printing system, which was deployed to the field as a commercial product.

Index Terms—Embedded software, expert-based systems, integrated design, machine learning, sensor systems and applications, system-level design.

Manuscript received September 11, 2017; revised April 1, 2018 and July 31, 2018; accepted October 29, 2018. Date of publication November 19, 2018; date of current version February 14, 2019. Recommended by Technical Editor Q. Zou. (Corresponding author: Nikhil Bajaj.)
N. Murrell was with Lexmark, Lexington, KY 40511 USA. He is now with Ethicon, Inc., Somerville, NJ 08876 USA (e-mail: niko.murrell@gmail.com).
R. Bradley and J. G. Whitney were with Lexmark, Lexington, KY 40511 USA. They are now with the University of Kentucky, Lexington, KY 40506 USA (e-mail: ryanuspsagm@gmail.com; juliewhitney2@gmail.com).
N. Bajaj and G. T.-C. Chiu are with Purdue University, West Lafayette, IN 47907 USA (e-mail: bajajn@purdue.edu; gchiu@purdue.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMECH.2018.2881889
1083-4435 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

SENSORS are rapidly decreasing in cost, while performance and accuracy increase. Consequently, many electromechanical devices have incorporated sensor-enabled control schemes. Recently, machine learning algorithms have begun to leverage this trend to enable new functionality. Sensed information may be used to generate input features for algorithms that enable proactive diagnostics, system awareness, and other more complex tasks such as classification. Concerns arise when the number of sensors and the capability of individual nodes are constrained due to cost or other associated factors like computation time and memory footprint. Previous efforts to address this concern have focused on a reduction of computational requirements during both the training and classification phases of embedded supervised machine learning algorithm development [1]. Methods attempting to minimize the number of features required for classification also exist; these may be used to reduce the number of sensors necessary for a given task.

This work presents a novel method to reduce the number of sensors required for a supervised machine learning classification system. Expert knowledge of expected sensor output variation as a function of intrinsic properties, extrinsic properties, and uncontrollable external factors is used to establish a unique feature set that sufficiently decouples otherwise inseparable classes. The system design and the control system were concurrently tuned to elicit distinct dynamic responses within predefined temporal regions of a continuous data stream. The analog data were discretized into several distinct zones of interest corresponding to the sensor’s response to different dynamical processes. A unique difference method allowed the learning algorithm to extract additional useful information from the confounded dataset. This methodology is validated by a case study of a print media classifier system developed for a commercial laser printer, which was manufactured and deployed at large volume. With only a single sensor, the resultant classification success exceeded that of embodiments using multiple sensors. Finally, the implications of this design methodology and its advantages over a traditional data-driven classification system are discussed.

II. BACKGROUND

The goal of simplification of multisensor systems by harvesting more independent features from a reduced sensor set relies on modification of the measured object, usually based on time or geometry. There are numerous studied methods for dimensionality reduction and representation of time-series data. General dimension-reduction and re-representation methods include model-based techniques such as those using hidden Markov models [2], [3]. A second class of methods has attempted to reformulate the data with interpolative or regression methods such as piecewise linear [4] or piecewise polynomial [5] approximations. Another group of methods uses a symbolic representation optimized with certain constraints such as
symbolic aggregated approximation (SAX). Still other methods use transforms such as discrete Fourier [6] or discrete cosine transforms or wavelet systems [7], [8]. Although these methods are largely designed for use on general, potentially multidimensional time series, they are frequently tested, presented, and verified on application-specific data, from medical data [8] to faults in mechanical gear systems [9].

Once the transformation has been performed, classification training and evaluation can occur. Possible algorithms include one-nearest neighbors or k-nearest neighbors [10], which demonstrated considerable success when implemented with representations like SAX in combination with dynamic time warping [11]. More sophisticated methods such as neural networks, multilayer perceptrons [12], Bayesian networks [13], support vector machines [14], and decision trees [15] have also been used with success and represent alternative design options. Some methods use information from a transformation, such as warping distance, as an additional feature and integrate this into the classification method [16]. In each case, the features used to train these systems are selected to be as orthogonal as possible, and the quality of the resulting algorithm is, among other things, a function of that orthogonality.

Often, the system cannot be easily simplified, and hardware with embedded supervised machine learning systems is designed using a complex network of various sensors. In theory, these extra data enable the designer to build and test a robust algorithm since a network of sensors can be selected to maximize feature orthogonality. This can lead to a temptation to deploy more sensors and computational resources than is strictly necessary. In industries where customers are highly sensitive to product cost, such as office printing, the strategy is often to deploy a single sensor to partly meet design needs. These attempts have included using a set of electrodes to take electrical measurements of media [17], [18], a camera to measure surface roughness [19], or an ultrasonic sensor to determine media density [20]. The desired combination would result in an inflated infrastructure that is limited in practice by cost and effort. The result is products that fall short of ideal performance in order to maintain competitive cost structures. Although a reduction of algorithm memory and computational requirements is possible [1], modifying the algorithm does not necessarily reduce the cost or footprint of sensing (as opposed to computational) hardware.

III. METHODOLOGY

In concurrently developed physical systems, the designer has access to significantly more information about the situation than is often available when analyzing time-series data in the general case. Time-series data output by a single sensor may contain information about multiple physical quantities due to system dynamic behavior. Therefore, multiple physical quantities do not always need to be measured by the same number of physical sensors. The designer has an opportunity to tune the hardware to produce a time-series output from a single sensor and then discretize the output with domain expert knowledge to produce multiple features while preserving orthogonality. This results in a system with fewer sensor nodes and a lower associated cost.

Consider the case of a least-squares support vector machine (LS-SVM) [21], [22] deployed in an embedded classification system, solving a multiclass problem (e.g., determine to which one of several distinct sets a presented set of features belongs). The goal is to take as input a vector $x \in \mathbb{R}^{n_f}$, where $n_f$ is the number of features used for classification, and produce an output $y(x)$, which represents the classifier output. Given that $x_k \in \mathbb{R}^{n_f}$, $k = 1, 2, \ldots, N$, are the feature vectors corresponding to $N$ training examples and $y_k$ are the corresponding true classes (in this case, $y_k = +1$ if the measurement belongs to a set and $y_k = -1$ if it does not), the classification algorithm is trained by solving the following optimization problem to determine a best separating hypersurface defined through a nonlinear mapping. Some interpretations of the LS-SVM and other SVMs make assumptions about the variables being independent and identically distributed random variables. While we cannot make this claim for this dataset due to temporal correlation, SVM-type algorithms can still work well in practice as long as the combination of features can provide sufficient separation. In the implementation section, we discuss the distribution of the selected input features, and it can be observationally inferred that an SVM might work well given a geometric rather than probabilistic interpretation of SVM methods. The optimization problem is

$$\text{minimize}\quad J_P(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^{N} e_k^2 \tag{1}$$

$$\text{subject to}\quad y_k \left[ w^T \phi(x_k) + b \right] = 1 - e_k, \quad k = 1, \ldots, N \tag{2}$$

where the classifier takes the form $y(x) = \operatorname{sgn}[w^T \phi(x) + b]$, and $\phi(x_k)$ is a mapping to an (often) higher dimensional space. In practice, the classifier is usually solved for in the dual space, the space of Lagrange multipliers of the constraints, $\alpha_k$ (for $k = 1, 2, \ldots, N$). $b$ is a scalar bias offset term. $\gamma$ is a regularization parameter that can be used to control overfitting versus underfitting behavior; it was set to 1. $w \in \mathbb{R}^{n_f}$ is a vector of weights that, along with the mapping $\phi(x_k)$, helps define the decision hypersurface. The dual space classifier takes the form

$$y(x) = \operatorname{sgn}\left[ \sum_{k=1}^{N} \alpha_k y_k K(x, x_k) + b \right]. \tag{3}$$

$K(x, x_k) = \phi^T(x)\phi(x_k)$ is a kernel function (a nonlinear mapping that allows additional flexibility in the classification function). Both the dual-space classifier and the solution of the classifier optimization problem can be addressed by considering the Karush–Kuhn–Tucker (KKT) conditions for optimality

$$w = \sum_{k=1}^{N} \alpha_k y_k \phi(x_k)$$

$$\sum_{k=1}^{N} \alpha_k y_k = 0$$

$$\alpha_k = \gamma e_k \quad \forall k = 1, 2, \ldots, N$$

$$y_k \left[ w^T \phi(x_k) + b \right] - 1 + e_k = 0 \quad \forall k = 1, 2, \ldots, N.$$
This allows assembly of the following matrix equation to solve the KKT system:

$$\begin{bmatrix} 0 & y^T \\ y & \Omega + \dfrac{I}{\gamma} \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \tag{4}$$

where $\Omega_{kl} = y_k y_l \phi(x_k)^T \phi(x_l) = y_k y_l K(x_k, x_l)$, with $k, l = 1, \ldots, N$. At this point, the (nonsparse) matrix equation can be solved for $\alpha$ and $b$ using standard methods (lower–upper factorization, etc.). The kernel function can take a number of different forms, of which $K(x, x_k) = x_k^T x$ (linear), $K(x, x_k) = (x_k^T x + \tau)^d$ (polynomial), and $K(x, x_k) = \exp(-\|x - x_k\|_2^2 / \sigma^2)$ (radial basis function) are common examples. In this work, only polynomial classifiers are considered. This is due to the application constraints on processing power and program memory space, which require the use of the algorithm of [1].

Fig. 1. Traditional method for enabling feature-based decision making capability on an existing device. The final classification algorithm is a function of N features, represented by N nodes.
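As an illustration of how the KKT system (4) can be assembled and solved for a binary LS-SVM with a polynomial kernel, a minimal sketch follows. It is not the embedded algorithm of [1]. The function names `poly_kernel`, `train_lssvm`, and `predict_lssvm` are hypothetical, the kernel offset `tau = 1` is an assumed value, and a dense NumPy solve stands in for the "standard methods" mentioned in the text.

```python
import numpy as np

def poly_kernel(A, B, tau=1.0, d=2):
    """Polynomial kernel K(x, x_k) = (x_k^T x + tau)^d for all row pairs of A and B."""
    return (A @ B.T + tau) ** d

def train_lssvm(X, y, gamma=1.0, tau=1.0, d=2):
    """Assemble and solve the KKT system (4); returns the bias b and multipliers alpha."""
    N = X.shape[0]
    # Omega_kl = y_k * y_l * K(x_k, x_l)
    Omega = np.outer(y, y) * poly_kernel(X, X, tau, d)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y                          # first row:    [0, y^T]
    A[1:, 0] = y                          # first column: [0; y]
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)         # dense solve (LU factorization internally)
    return sol[0], sol[1:]

def predict_lssvm(X_train, y_train, b, alpha, X_new, tau=1.0, d=2):
    """Dual-space classifier (3): sgn(sum_k alpha_k y_k K(x, x_k) + b)."""
    K = poly_kernel(X_new, X_train, tau, d)
    return np.sign(K @ (alpha * y_train) + b)

# Hypothetical toy data: two well-separated groups in a 2-D feature space.
X_toy = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y_toy = np.array([-1.0, -1.0, 1.0, 1.0])
b_toy, alpha_toy = train_lssvm(X_toy, y_toy)
labels = predict_lssvm(X_toy, y_toy, b_toy, alpha_toy, X_toy)
```

For a one-versus-all multiclass system, one such binary classifier would be trained per class, with that class labeled +1 and all others labeled −1.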
Typically, $y(x) = +1$ would yield a prediction that $x$ belongs in one set, and $y(x) = -1$ would correspond to the other complementary set. However, in some cases, including media classification in a printer, there are areas of the feature space that for some comparisons make no difference (there are cases in which mistakes in classification cause less of a problem for downstream processes). Specifically, one can have some errors in classification that are acceptable to downstream processes, and some that should be weighted more heavily. This idea was discussed and formulated into the training of a multiclass SVM problem and described in detail in [23]. The solution method for the system is the same. In this work, the result associated with each classification is accordingly either an incorrect classification, an incorrect but acceptable classification (an acceptable error), or a correct classification. An acceptable error is simply one that is tolerable to the downstream processes.

In order to create a multiclass classification system, the different classes are separated into complementary groups and evaluated in a one-versus-all sense [22] (other options exist, but one versus all is the encoding used in this work); if there are three classes, then there are three classifiers, each of which evaluates whether the data belong in one set or, alternatively, all of the other sets. As mentioned before, selection of the features that comprise the feature vector is critical to classifier performance. The focus of this work is the design of the features and corresponding sensors and mechanical elements needed in order to achieve good performance while minimizing training data and overall cost.

A traditional approach, shown in Fig. 1, places the burden of the system on the sensor nodes themselves. In this example, a feature contributing to the classifier has a one-to-one relationship to the number of required sensor nodes. The proposed approach illustrated in Fig. 2 puts the burden of the system on the domain expert knowledge and the temporal output of a single node. The domain expertise is used to partition the measurement time series $m(t)$ (in implementation, this is most likely a sampled time series) into discrete intervals, such that

$$m(t) = \begin{bmatrix} x(t_1, t_2) : [\Psi_{t_1, t_2}] \\ x(t_2, t_3) : [\Psi_{t_2, t_3}] \\ \vdots \\ x(t_{N-1}, t_N) : [\Psi_{t_{N-1}, t_N}] \end{bmatrix}.$$

Here, the time intervals $[(t_1, t_2), (t_2, t_3), \ldots, (t_{N-1}, t_N)]$ correspond to known physical events in the system, and $[x(t_1, t_2), x(t_2, t_3), \ldots, x(t_{N-1}, t_N)]$ is the set of discrete measurement intervals. $\Psi$ is a set of statistical measures (mean, variance, skewness, range, minimum, maximum, etc.) taken within the corresponding measurement interval to describe the interval under inspection.

Fig. 2. Proposed approach uses the system knowledge of codesigned hardware to pull multiple features out of a single time series of data.

The classifier is trained on data that are of the form $(y_k, x_k)$. Ideally, $x_k = \varphi_k$, where $\varphi_k$ is the set of intrinsic physical properties in the system ($\varphi_k = [\varphi_1, \varphi_2, \ldots, \varphi_{N_{pi}}]_k^T \in \mathbb{R}^{N_{pi}}$). $N_{pi}$ is the size of an ideal set of orthogonal intrinsic properties. $\Psi \subseteq \varphi_k$. Simply put, ideally, the sets to be classified are well separated by a measurement of some directly relevant intrinsic physical property and have good orthogonality. In the practical case, this is not so. Every measurement is a function of both the intrinsic property being measured and the properties of the physical system involved in that measurement. These properties include the structure of the system and its operation, which are controllable by the system designer, and known environmental factors that may not be controllable by the designer. Considering the form of the constructed intervals and corresponding statistical measures, the training data examples $x_k$ are such that

$$x_k = \begin{bmatrix} f_1(\varphi_k, Y_1, Z_k) \\ f_2(\varphi_k, Y_2, Z_k) \\ \vdots \\ f_N(\varphi_k, Y_N, Z_k) \end{bmatrix}.$$

Here, $(f_1, f_2, \ldots, f_N)$ are nonlinear functions of the arguments: $\varphi_k$, the intrinsic physical properties; $Z_k \in \mathbb{R}^{N_{pe}}$, which are known quantifiable extrinsic system properties that influence the measurement ($N_{pe}$ is the number of extrinsic properties affecting measurements); and $(Y_1, Y_2, \ldots, Y_N)$, which are uncontrollable external factors that are a function of the hardware design.

In the case of systems where measurements taken in different intervals are coupled, taking the difference between two functions can help train the classifier with independent information about system interactions and decouple external factors that influence the measurement. This can be justified with a brief expansion analysis. Given two functions $f_i$ and $f_j$, the Taylor series expansions can be taken about a nominal operating point as

$$f_i(\varphi_k, Y_i, Z_k) = \frac{\partial f_i}{\partial \varphi_k} \Delta\varphi_k + \frac{\partial f_i}{\partial Y_i} \Delta Y_i + \frac{\partial f_i}{\partial Z_k} \Delta Z_k + C_i \tag{5}$$

$$f_j(\varphi_k, Y_j, Z_k) = \frac{\partial f_j}{\partial \varphi_k} \Delta\varphi_k + \frac{\partial f_j}{\partial Y_j} \Delta Y_j + \frac{\partial f_j}{\partial Z_k} \Delta Z_k + C_j. \tag{6}$$

Taking the difference yields

$$\begin{aligned} f_i(\varphi_k, Y_i, Z_k) - f_j(\varphi_k, Y_j, Z_k) &= \left( \frac{\partial f_i}{\partial \varphi_k} \Delta\varphi_k + \frac{\partial f_i}{\partial Y_i} \Delta Y_i + \frac{\partial f_i}{\partial Z_k} \Delta Z_k \right) + C_i \\ &\quad - \left( \frac{\partial f_j}{\partial \varphi_k} \Delta\varphi_k + \frac{\partial f_j}{\partial Y_j} \Delta Y_j + \frac{\partial f_j}{\partial Z_k} \Delta Z_k \right) - C_j \\ &= \underbrace{\Delta\varphi_k \left( \frac{\partial f_i}{\partial \varphi_k} - \frac{\partial f_j}{\partial \varphi_k} \right)}_{\Delta\varphi_k = 0 \text{ for same } k} + \frac{\partial f_i}{\partial Y_i} \Delta Y_i - \frac{\partial f_j}{\partial Y_j} \Delta Y_j \\ &\quad + \underbrace{\Delta Z_k \left( \frac{\partial f_i}{\partial Z_k} - \frac{\partial f_j}{\partial Z_k} \right)}_{\Delta Z_k = 0 \text{ for same } k} + \underbrace{C_i - C_j}_{\text{constant}}. \end{aligned}$$

For the same training example, $\Delta\varphi_k = 0$. The same is true for $\Delta Z_k$. Therefore, the only remaining terms are those that include $\Delta Y_i$ and $\Delta Y_j$, the associated partial derivatives, and the difference of the offset constants. This new feature, $f_i - f_j$, is solely a function of $\Delta Y_i$ and $\Delta Y_j$, which are functions of certain fixed extrinsic system properties. This information can be learned by the classifier and improves classification performance.

IV. CASE STUDY AND IMPLEMENTATION

This case study applies the proposed approach to a commercial color laser (electrophotographic) printer intended for shared office use in a managed print services environment. Most laser printer users do not check or adjust the media-type settings. Additionally, only a fraction of the users who do adjust the media settings do so correctly. Incorrect settings on these devices may cause problems for both the customer and the manufacturer. To address this issue, an inexpensive sensor system and an embedded machine learning algorithm were implemented to classify media without user input. The printer control system adjusted device parameters based on this media classification.

A single inexpensive optical sensor consisting of a paired LED and phototransistor was mounted within the printer media path. The sensor output a continuous data stream corresponding to the amount of light transmitted by in-process media. A simple model of the sensor was developed, and based upon this, system hardware and controls were tuned to generate an information-rich data stream by leveraging the dynamic response of media to control system inputs. The printer generated features from this data stream for each sheet of media. A broad population of standard office media with varied intrinsic properties, $\varphi_k$, existing along a continuum was sorted into five distinct classes: light, normal, heavy, card stock, and transparency. This dataset was used to generate an embedded machine learning algorithm that used these features to determine media class in near real time. Printer process parameters and system controls were adjusted based upon this prediction. The final embodiment significantly reduced overall cost, complexity, and system footprint when compared to traditional implementations and is described in greater detail in [24].

A cross section of the printer media path is shown in Fig. 3. The highlighted region contains a section view of the sensor and the surrounding printer hardware including upstream feed rollers, media guides, and downstream feed rollers. The electrical design schematic for the optical sensor system is shown in Fig. 4. Nominal circuit values were tuned to adjust the sensor gain, response, and sensitivity. The resulting full-scale range of the dataset was maximized for the population of expected media, and maximum separation between media classes was achieved. Calibration was performed to compensate for system gain and offset errors.

Fig. 3. Cross section of the printer media path is depicted. The highlighted region contains an optical sensor consisting of an LED (1) and phototransistor (2) that measures the amount of infrared light transmitted by a sheet of media (4) as it is processed by the printer. Media fed by upstream feed rollers (3) passes through the sensor, beyond a media guide (5), and into a set of downstream feed rollers (6). Hardware (physical design of the media path) and firmware (system timings and relative velocities of the feed rollers) were tuned during development to enhance data orthogonality by controlling the position and shape of the media relative to the sensor in the spatial/temporal domain.

Fig. 4. Electrical design schematic for the optical sensor system is shown. The connection labeled “Analog Output” is the voltage signal measured by the analog-to-digital converter and used in the classification system. Nominal circuit values were selected to optimize the sensor gain, response, and sensitivity for a broad range of media types.

The sensor outputs a continuous data stream corresponding to the amount of infrared light transmitted by the in-process media. This output is a highly coupled function of many confounded factors including intrinsic media properties (e.g., media basis weight, media roughness, media thickness, etc.) $\varphi_k$, extrinsic system properties (e.g., LED intensity, media speed, media input source, phototransistor sensitivity, feed roller velocities, media shape and offset, LED and phototransistor directionality, etc.) $Z_k$, and uncontrollable external factors (e.g., relative humidity, temperature, etc.) $Y_i$. Fig. 5 depicts how variability caused by these confounded factors impacts the measurement for a single media. Twenty measurements for a normal weight office media are shown. The signal varies substantially from sheet to sheet and within a given sheet. Sensor output for a given media may vary as much as 20% of the sensor’s full-scale range at a given process point. This is primarily a function of intrinsic media properties, $\varphi_k$. Within a given sheet, the sensor output may vary as much as 60% of the sensor’s full-scale range. This is primarily a function of extrinsic system properties, $Z_k$. Uncontrollable external factors, $Y_i$, alter both the intrinsic media properties, $\varphi_k$, and the extrinsic system properties, $Z_k$.

Fig. 5. Normalized analog sensor output for 20 separate sheets of a standard office paper is plotted. Data were collected for 100 mm of media travel. The population mean and 99.7% confidence bands for this given media are plotted for reference. Larger values on the y-axis correspond to an instant in time, or media leading edge position relative to the sensor, where less light was detected by the phototransistor. It is clear that there is a significant amount of noise in the measurement. This may be attributed to intrinsic media properties, $\varphi_k$, extrinsic system properties, $Z_k$, and uncontrollable external factors, $Y_i$.

Fig. 6 depicts how this variability manifests as boundary confusion. A broad set of standard office media possessing a range of intrinsic properties, $\varphi_k$, existing along a continuum was used to train and test the algorithm; the media are listed in Table II for reference. Corner cases (light versus card stock, for example) are easily distinguished. However, media properties exist along a continuum, and variability from sheet to sheet and within a given sheet made the classification problem particularly challenging. There was a large amount of boundary confusion. This is especially true for the heavy class of media, which significantly overlaps with both the normal and card stock classes.

Fig. 6. Normalized analog sensor output for the mean and 99.7% confidence band of each class is plotted. The population for each class consists of 360 training samples from each media listed in Table II. A standard classification problem utilizing a traditional feature set would be intractable due to the continuous overlapping nature of the data.

For a classifier to be successful, it must decouple the relevant intrinsic media properties, $\varphi_k$, from the other confounding variables and generate a substantially orthogonal feature set.
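As a concrete illustration of how a single sensor trace can yield several features, the interval partition and difference method of Section III can be sketched as follows. This is a minimal sketch under stated assumptions and not the printer firmware. The function names `zone_features` and `difference_features`, the zone boundaries, and the sample values are all hypothetical. Only the idea of taking per-interval statistics (minimum, maximum, mean, range) and differencing them between intervals comes from the text.

```python
import numpy as np

def zone_features(signal, boundaries):
    """Return one row [min, max, mean, range] per interval [b_i, b_{i+1}) of the trace."""
    rows = []
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        zone = signal[start:stop]
        rows.append([zone.min(), zone.max(), zone.mean(), zone.max() - zone.min()])
    return np.array(rows)

def difference_features(feats, pairs, column=2):
    """Difference f_i - f_j of one statistic (default column 2, the mean) for zone pairs (i, j)."""
    return np.array([feats[i, column] - feats[j, column] for i, j in pairs])

# Hypothetical sampled trace and zone boundaries (sample indices), for illustration only.
trace = np.array([0.1, 0.9, 0.8, 0.6, 0.6, 0.5, 0.7, 0.7])
bounds = [0, 2, 5, 8]

stats = zone_features(trace, bounds)                   # shape: (3 zones, 4 statistics)
diffs = difference_features(stats, pairs=[(1, 2)])     # mean of zone 1 minus mean of zone 2
feature_vector = np.concatenate([stats.ravel(), diffs])
```

In a deployed system the boundaries would be tied to known physical events, such as the leading-edge registration described below, rather than fixed sample indices.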
Media-to-media variability must be decoupled from the variability seen from sheet to sheet or within a given sheet. For the case of media classification, this was achieved by tuning system hardware and control parameters to leverage the sensitivity of the measurement to uncontrollable external factors, $Y_i$, and extrinsic system properties, $Z_k$. Since the sensor output was a nonlinear function of $\varphi_k$, $Z_k$, and $Y_i$, it was possible to use the dynamic response of the system to help decouple these convoluted variables using the difference method described previously.

Concurrently developed printer control algorithms and sensor hardware were tuned during the development phase to generate a continuous data stream that could be deconstructed into several distinct zones of interest corresponding to the sensor’s response to different dynamical processes. The resultant time-series data were divided into five distinct zones of interest that corresponded to changes in the printer process that were designed to elicit a varied response from the sensor. In order to make the design less sensitive to printer-to-printer variation, four ideas were considered when designing the zone positions. First, a flag sensor (integrated into the paper feed control system) allowed accurate registration of the leading edge of the sheet, and the traverse distance was known from the paper feed drive encoders. Second, the zones are larger than strictly necessary for a single printer in order to accommodate variation around the population of printers (determined empirically from a number of different printers). It is important to be aware that performance can decrease if the buffer regions are too large, because the quality of the data over which the statistical measure is taken will decrease. Third, the features and zones are designed around bulk properties, as described in Fig. 7, which are less sensitive to printer-to-printer variation. Finally, embedded firmware and system hardware were tuned during product development to generate subtle changes in media offset and shape relative to the emitter for each zone such that additional useful information may be extracted from the dataset.

Fig. 7. Simplified model of the response of the sensing system to the movement of media through the printer was developed. Observed measurement differences due to the dynamic interaction of the media with the sensor were leveraged to decouple confounded variables.

This specific approach is summarized in Fig. 7. For example, media in Zone 1 enter the sensor and obscure the photodetector. Prior to Zone 1, the photodetector is saturated and the signal is low. When the leading edge of the media obscures the direct path between the emitter and the photodetector, a minimal amount of light is transmitted and the signal is high. As the media continue downstream, a larger area of the in-process media is exposed to the emitter, additional diffusely scattered light reaches the photodetector, and the signal decreases. The output in Zone 1 is a strong function of media opacity and feed rate.

Furthermore, media in Zone 3 are fed by two separate feed roller systems simultaneously. The relative velocity of the roller systems is precisely controlled by embedded firmware to elicit a specific media response. The shape of the bubble formed in the media is strongly coupled to a specific intrinsic property (basis weight). Heavier media are stiffer and are less likely to buckle; the upstream feed rollers will slip. Lighter media will buckle, and the position of the sheet relative to the sensor will change.

In this manner, the hardware and firmware within the system may be adjusted using expert domain knowledge to extract distinct information from the measurement based upon the dynamic response of the media to generated system inputs. This novel concurrent design approach allowed the photodetector to collect additional useful information that was strongly influenced by extrinsic system properties, $Z_k$.

Additionally, Zones 2–4 extract similar information from the time-series data. Each zone provides a distinct measure of media opacity that is a strong function of intrinsic media properties, $\varphi_k$. This provides the algorithm with a degree of redundancy and robustness against gross error.

Discretization of the analog data in this manner generated a richer feature set with some measurement redundancy. A small designed experiment was conducted to assess system performance and select the final feature set. Due to the information gained from the difference method previously described, inclusion of features from redundant zones yielded improved performance with minimal additional computing overhead. Features used for the machine learning algorithm are provided in Table I. Features $x_1, x_2, \ldots, x_5$ are extrinsic system properties and uncontrollable external factors that are provided by the printer system’s embedded firmware to help stratify and decouple the training set. Features $x_6, x_7, \ldots, x_{18}$ contain an abundance of useful intrinsic media information, but are nonlinearly coupled to $Z_k$ and $Y_i$. These features are calculated from the raw data and contain minimum, maximum, and mean calculations (a measure of opacity) and range calculations (a measure of uniformity). Features $x_{19}$, $x_{20}$, $x_{21}$, and $x_{22}$ represent the previously described difference calculations that are used to separate $\varphi_k$ information from the influence of $Z_k$ and $Y_i$.

TABLE I
INPUT FEATURES USED BY THE MACHINE LEARNING ALGORITHM TO DETERMINE MEDIA CLASSIFICATION
* Function of media composition, thickness, roughness, etc.

Constructing the feature set in this manner provided more useful information to the learning algorithm. The dataset contains almost 7000 examples divided into training, testing, and validation sets. These examples represented data from four different process speeds and three different environments: hot/wet, lab ambient, and cold/dry. The temperature and relative humidity were chosen as the corners of a Class B environment and represent the limits of the printer’s operating space. A set of representative features gathered at a single process speed and environmental condition are shown in Fig. 8. Feature 7 is pre-

[...] through the sensor. Features 18–20 are features that help decouple media opacity from other dynamic system interactions such as media sheet shape and position. The features in Fig. 8 were selected to demonstrate that the difference method provided unique pertinent information (Features 19 and 20) that allowed the learning algorithm to better classify media. This contention is supported by the distinct trends shown by the plotted features. However, all the features have significant boundary confusion and are not practical for use individually, but they contribute to the overall classification performance.

After gathering a training set and developing an algorithm, the result was embedded as a set of decision polynomials used in a one-versus-all classifier by the system firmware. This technique was effective in dealing with this type of classification problem and dramatically increased accuracy over a simple node mean comparison.

V. IMPLEMENTATION PERFORMANCE

The results of the classification are given in Table II. The single-node mean and the domain expert knowledge solutions are compared. The single-node mean corresponds to the Zone 2 mean, or $x_8$, and was selected as the best single-node classification system. The domain expert knowledge system was compared against this implementation. In the case of the domain expert knowledge, a number of feature sets using different order kernels were evaluated in a designed experiment to select the optimum group. A second-order polynomial kernel with the features shown in Table I was selected. The cost function of the algorithm was modified to ensure that media near decision boundaries were classified in a manner that would have no negative impact on the printer performance, as detailed in [23]. For this reason, “% Acceptable” is the key design metric for this system. This expert-prescribed cost function weighting resulted in one particular paper type (Canon GFR-070) having
dominantly a measure of the media opacity as the sheet passes poorer “% Correct” than in the single-node mean case. This

Fig. 8. Representative input features after scaling for an assortment of media types. While the features contain information that can do corner
case separation, the features individually suffer from boundary confusion (significantly overlapping error bars between categories).
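The features plotted in Fig. 8 are zone statistics and difference features of the kind described in Section IV. The following Python fragment is a minimal sketch of that style of feature construction, not the authors' firmware implementation. The zone boundaries, the function names `zone_features` and `difference_features`, and the choice to take differences between zone means are all assumptions made for illustration; the trace values are synthetic.

```python
from statistics import fmean


def zone_features(signal, zones):
    """Return min, max, mean, and range of the trace within each zone.

    `zones` maps a zone name to a (start, stop) sample-index pair. The
    boundaries used below are hypothetical; in practice they would be
    chosen from domain expert knowledge of the physical events.
    """
    features = {}
    for name, (start, stop) in zones.items():
        window = signal[start:stop]
        lo, hi = min(window), max(window)
        features[name] = {
            "min": lo,
            "max": hi,
            "mean": fmean(window),   # treated here as an opacity measure
            "range": hi - lo,        # treated here as a uniformity measure
        }
    return features


def difference_features(features, pairs):
    """Difference of zone means for each (a, b) pair.

    This is an assumed stand-in for the paper's difference method, whose
    exact definition is given earlier in the text.
    """
    return {f"{a}-{b}": features[a]["mean"] - features[b]["mean"]
            for a, b in pairs}


# Synthetic single-sensor trace (made-up values, not measured data).
trace = [0.0, 0.1, 0.9, 1.0, 0.8, 0.7, 0.75, 0.7, 0.2, 0.1]
zones = {"zone1": (0, 2), "zone2": (2, 5), "zone3": (5, 8)}

stats = zone_features(trace, zones)
deltas = difference_features(stats, [("zone2", "zone3")])
```

Keeping the per-zone statistics and the differences in separate steps mirrors the distinction the text draws between features coupled to extrinsic properties and features intended to decouple them.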

particular error is due to the fact that media are not naturally categorical. The weighting method was designed to integrate into the printer’s existing control scheme with a minimal system impact. The richer feature set provides the machine learning algorithm more flexibility to adjust decision surfaces such that printer performance is not compromised when boundary confusion occurs.

When the system is implemented using a single-node mean method without domain expert knowledge, the resulting accuracy is 69%, which filters to 93% acceptability when the expert-prescribed cost function weighting is applied. By comparison, the full implementation of domain expert knowledge including all features improved system performance to 85.38% correct with 99.95% acceptability. When the domain expert knowledge implementation did not contain the delta features (x19, x20, x21, and x22), and consisted solely of the minimum, maximum, and mean values from zones 1–3 (x6, x7–x9, x11–x13, x15–x17), classifier performance decreased to 81.46% correct with 98.93% acceptability. This increase in performance makes the case for value added by the domain expert knowledge method and the difference method previously described.

VI. DISCUSSION

The methodology described has significant cost advantages over the traditional approach. These advantages stem from several fundamental aspects of single-sensor design. This includes a reduction in hardware, associated nonrecurring engineering expenses, and the savings associated with an inherent robustness of reducing the number of variables in the final system.

The case study utilized a single optical sensor to extract multiple distinct pieces of information from a dynamic signal response. This means that only one sensor was required for the design. The development time was significantly shorter than that of a multisensor system. In addition to hardware cost reductions, the impact on the final product size was minimized. The resulting product had a reduced environmental footprint.

Reducing the part count also has some less obvious benefits. All sensors drift with time and have part-to-part variability. Ideally, products must account for this drift and variability. In a machine learning algorithm, many of the features being monitored and used are differences (deltas) or ratios. Relative calibration of different sensors becomes a performance-limiting design

TABLE II
RESULTS OF THE CLASSIFICATION ALGORITHM USING BOTH THE BEST SINGULAR-FEATURE THRESHOLD (x8 , THE ZONE 2 MEAN) AND THE FULL-DOMAIN
EXPERT-KNOWLEDGE-SUPPORTED MACHINE LEARNING ALGORITHM
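Table II reports two metrics, "% Correct" and "% Acceptable". The Python fragment below is a minimal sketch of how such a pair of metrics could be computed, assuming that a prediction counts as acceptable when it either matches the true class or falls in a set of substitute classes judged harmless to printer performance. The `acceptable` mapping, the class labels, and both function names are hypothetical; the paper's actual acceptability rules come from the expert-prescribed weighting in [23].

```python
def percent_correct(y_true, y_pred):
    """Percentage of examples whose predicted class equals the true class."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * hits / len(y_true)


def percent_acceptable(y_true, y_pred, acceptable):
    """Percentage of examples whose prediction is correct or acceptable.

    `acceptable` maps a true class to the set of other classes that may be
    predicted without harming performance (an assumed representation).
    """
    ok = sum(p == t or p in acceptable.get(t, set())
             for t, p in zip(y_true, y_pred))
    return 100.0 * ok / len(y_true)


# Hypothetical labels and acceptability rule, for illustration only.
y_true = ["light", "plain", "plain", "heavy"]
y_pred = ["plain", "plain", "heavy", "heavy"]
acceptable = {"light": {"plain"}}  # "light" media may be treated as "plain"

correct = percent_correct(y_true, y_pred)                    # 50.0
acceptability = percent_acceptable(y_true, y_pred, acceptable)  # 75.0
```

Under this definition "% Acceptable" can never be lower than "% Correct", which is consistent with the values reported in the text.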

choice for a multisensor system. By utilizing only one sensor, changes in calibration factors are less complex to accommodate, both for the machine learning algorithm and for the system design.

VII. CONCLUSION

A methodology for leveraging domain expert knowledge and temporal data for the design of an Internet of Things (IoT) system resulted in a highly robust and accurate system with reduced cost, size, and complexity over traditional approaches. This methodology was demonstrated in a case study of a mass-produced electrophotographic printer in a system designed to classify media types. The proposed methodology increased classifier accuracy by 16% and classifier acceptability by 6.5% when compared with a more traditional method that did not leverage domain expert knowledge to enrich the dataset. The methodology used can be applied to sensor-integrated IoT devices seeking to benefit from performance enhancements associated with today’s modern sensor technology, while still meeting various market constraints.

REFERENCES

[1] N. Bajaj, G. T. C. Chiu, and J. P. Allebach, “Reduction of memory footprint and computation time for embedded support vector machine (SVM) by kernel expansion and consolidation,” in Proc. IEEE Int. Workshop Mach. Learn. Signal Process., Sep. 2014, pp. 1–6.
[2] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models for complex action recognition,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1997, pp. 994–999.
[3] T. Mori, Y. Nejigane, M. Shimosaka, Y. Segawa, T. Harada, and T. Sato, “Online recognition and segmentation for time-series motion with HMM and conceptual relation of actions,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Aug. 2005, pp. 3864–3870.
[4] E. Keogh, S. Chu, D. Hart, and M. Pazzani, “An online algorithm for segmenting time series,” in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 289–296.
[5] E. Fuchs, T. Gruber, J. Nitschke, and B. Sick, “Online segmentation of time series based on polynomial least-squares approximations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 12, pp. 2232–2245, Sep. 2010.
[6] R. Agrawal, C. Faloutsos, and A. Swami, “Efficient similarity search in sequence databases,” in Proc. 4th Int. Conf. Found. Data Org. Algorithms, Oct. 1993, pp. 69–84.
[7] K.-P. Chan and A. W.-C. Fu, “Efficient time series matching by wavelets,” in Proc. 15th Int. Conf. Data Eng., Mar. 1999, pp. 126–133.
[8] I. Güler and E. D. Übeyli, “Multiclass support vector machines for EEG-signals classification,” IEEE Trans. Inf. Technol. Biomed., vol. 11, no. 2, pp. 117–126, Mar. 2007.
[9] Y. Lei and M. J. Zuo, “Gear crack level identification based on weighted K nearest neighbor classification algorithm,” Mech. Syst. Signal Process., vol. 23, no. 5, pp. 1535–1547, Jul. 2009.
[10] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.
[11] X. Xi, E. Keogh, C. Shelton, L. Wei, and C. A. Ratanamahatana, “Fast time series classification using numerosity reduction,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 1033–1040.
[12] U. Orhan, M. Hekim, and M. Ozer, “EEG signals classification using the K-means clustering and a multilayer perceptron neural network model,” Expert Syst. Appl., vol. 38, no. 10, pp. 13475–13481, Sep. 2011.
[13] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2012.

[14] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer, 1995.
[15] J. J. Rodríguez and C. J. Alonso, “Interval and dynamic time warping-based decision trees,” in Proc. ACM Symp. Appl. Comput., 2004, pp. 548–552.
[16] C. Orsenigo and C. Vercellis, “Combining discrete SVM and fixed cardinality warping distances for multivariate time series classification,” Pattern Recognit., vol. 43, no. 11, pp. 3787–3794, Nov. 2010.
[17] J. S. Weaver, J. G. Bearss, and T. Camis, “Image forming devices and sensors configured to monitor media, and methods of forming an image upon media,” U.S. Patent 6 157 793, Dec. 5, 2000.
[18] J. S. Weaver, “Capacitance and resistance monitor for image producing device,” U.S. Patent 6 493 523, Dec. 10, 2002.
[19] M. Aoki, “Recording medium imaging apparatus for determining a type of a recording medium based on a surface image of a reference plate and a surface image of the recording medium,” U.S. Patent 8 611 772, Sep. 16, 2013.
[20] R. E. Sorace, V. S. Reinhardt, and S. A. Vaughn, “High-speed digital-to-RF converter,” U.S. Patent 5 668 842, Dec. 17, 2013.
[21] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, Jun. 1999.
[22] J. A. K. Suykens, Least Squares Support Vector Machines. River Edge, NJ, USA: World Scientific, 2002.
[23] N. Bajaj, N. J. Murrell, J. G. Whitney, J. P. Allebach, and G. T. C. Chiu, “Expert-prescribed weighting for support vector machine classification,” in Proc. IEEE Int. Conf. Adv. Intell. Mechatronics, Jul. 2016, pp. 913–918.
[24] R. T. Bradley, J. L. Combs, N. J. Murrell, F. J. Palumbo, D. Steinberg, and J. A. G. Whitney, “Method of determining a media class in an imaging device using an optical translucence sensor,” U.S. Patent 9 451 111, Sep. 20, 2016.

Nikhil Bajaj (M’10) received the B.S., M.S., and Ph.D. degrees in mechanical engineering from Purdue University, West Lafayette, IN, USA, in 2008, 2011, and 2017, respectively.
He is currently a Postdoctoral Research Associate with the School of Mechanical Engineering, Purdue University. He has held research assistant positions on several projects in the areas of advanced manufacturing, computational design, and heat transfer, and a summer research position with Alcatel-Lucent Bell Laboratories. His research interests include nonlinear dynamical and control systems and the analysis and design of mechatronic systems, especially in the context of learning sensor systems.

Julie Gordon Whitney received the B.S. degree in mechanical engineering from Purdue University, West Lafayette, IN, USA, in 1982, the M.S. degree from Indiana State University, Terre Haute, IN, USA, in 1986, and the Ph.D. degree from the University of Cincinnati, Cincinnati, OH, USA, in 1992.
She has more than 25 years of experience in Research and Development, and recently retired as a Senior Technical Staff Member with Lexmark International. She has joined the First Year Engineering program with the University of Kentucky, Lexington, KY, USA, as a Lecturer.

Niko Murrell received the B.S.M.E. degree from Texas A&M University, College Station, TX, USA, in 2001, and the M.S.M.E. degree from the Georgia Institute of Technology, Atlanta, GA, USA, in 2002, where he specialized in mechatronics and control systems.
From 2003 to 2016, he was with Lexmark, Lexington, KY, USA, where he worked in a variety of engineering roles including design, product development, supply chain, research, and systems engineering. He is currently with Ethicon, Inc., Somerville, NJ, USA, as a Staff Electromechanical Engineer and is supporting the digital surgery platform being developed by Verb Surgical (a joint venture between Johnson & Johnson’s Ethicon and Alphabet’s Verily Life Sciences). He is an inventor on 26 U.S. patents.
Mr. Murrell is a licensed Professional Mechanical Engineer in the state of Kentucky.

Ryan Bradley received the B.S. and M.S. degrees in mechanical engineering in 2014 and 2015, respectively, from the University of Kentucky, Lexington, KY, USA, where he is currently working toward the Ph.D. degree in mechanical engineering with the Institute of Sustainable Manufacturing.
He started his engineering career with the printer manufacturer, Lexmark International, Lexington, in a variety of roles including research and development, sustainability, product engineering, and data science. He is a Sustainability Researcher. Through his research, he aspires to advance sustainable product design and manufacturing through the application of machine learning, life-cycle assessment, life-cycle costing, and alternative business models.

George T.-C. Chiu (SM’16) received the B.S. degree from National Taiwan University, in 1985, and the M.S. and Ph.D. degrees from the University of California at Berkeley, in 1990 and 1994, respectively, all in mechanical engineering.
He is currently a Professor with the School of Mechanical Engineering with courtesy appointments in the School of Electrical and Computer Engineering, and the Department of Psychological Sciences, Purdue University. His current research interests are mechatronics, and dynamic systems and control with applications to digital printing and imaging systems, digital fabrications and functional printing, human motor control, and motion and vibration perception and control.
Dr. Chiu is a Fellow of American Society of Mechanical Engineers (ASME), and the Society for Imaging Science and Technology.
