
A Comprehensive Survey on Transfer Learning

This survey provides a comprehensive understanding of transfer learning from the perspectives of data and model.

By Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Senior Member IEEE, Hui Xiong, Fellow IEEE, and Qing He

ABSTRACT | Transfer learning aims at improving the performance of target learners on target domains by transferring the knowledge contained in different but related source domains. In this way, the dependence on a large number of target-domain data can be reduced for constructing target learners. Due to the wide application prospects, transfer learning has become a popular and promising area in machine learning. Although there are already some valuable and impressive surveys on transfer learning, these surveys introduce approaches in a relatively isolated way and lack the recent advances in transfer learning. Due to the rapid expansion of the transfer learning area, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize the existing transfer learning research studies, as well as to summarize and interpret the mechanisms and the strategies of transfer learning in a comprehensive way, which may help readers have a better understanding of the current research status and ideas. Unlike previous surveys, this survey article reviews more than 40 representative transfer learning approaches, especially homogeneous transfer learning approaches, from the perspectives of data and model. The applications of transfer learning are also briefly introduced. In order to show the performance of different transfer learning models, over 20 representative transfer learning models are used for experiments. The models are performed on three different data sets, that is, Amazon Reviews, Reuters-21578, and Office-31, and the experimental results demonstrate the importance of selecting appropriate transfer learning models for different applications in practice.

KEYWORDS | Domain adaptation; interpretation; machine learning; transfer learning.

NOMENCLATURE
Symbol  Definition
n       Number of instances.
m       Number of domains.
D       Domain.
T       Task.
X       Feature space.
Y       Label space.
x       Feature vector.
y       Label.
X       Instance set.
Y       Label set corresponding to X.
S       Source domain.
T       Target domain.
L       Labeled instances.
U       Unlabeled instances.
H       Reproducing kernel Hilbert space.
θ       Mapping/coefficient vector.
α       Weighting coefficient.
β       Weighting coefficient.
λ       Tradeoff parameter.
δ       Parameter/error.
b       Bias.
B       Boundary parameter.
N       Iteration/kernel number.
f       Decision function.


L       Loss function.
η       Scale parameter.
G       Graph.
Φ       Nonlinear mapping.
σ       Monotonically increasing function.
Χ       Structural risk.
κ       Kernel function.
K       Kernel matrix.
H       Centering matrix.
C       Covariance matrix.
d       Document.
w       Word.
z       Class variable.
z̃       Noise.
D       Discriminator.
G       Generator.
S       Function.
M       Orthonormal bases.
Θ       Model parameters.
P       Probability.
E       Expectation.
Q       Matrix variable.
R       Matrix variable.
W       Mapping matrix.

Fig. 1. Intuitive examples about transfer learning.

I. INTRODUCTION

Although traditional machine learning technology has achieved great success and has been successfully applied in many practical applications, it still has some limitations for certain real-world scenarios. The ideal scenario of machine learning is that there are abundant labeled training instances, which have the same distribution as the test data. However, in many scenarios, collecting sufficient training data is often expensive, time-consuming, or even unrealistic. Semi-supervised learning can partly solve this problem by relaxing the need of mass labeled data. Typically, a semi-supervised approach only requires a limited number of labeled data, and it utilizes a large amount of unlabeled data to improve the learning accuracy. But in many cases, unlabeled instances are also difficult to collect, which usually makes the resultant traditional models unsatisfactory.

Transfer learning, which focuses on transferring the knowledge across domains, is a promising machine learning methodology for solving the problem mentioned above. The concept about transfer learning may initially come from educational psychology. According to the generalization theory of transfer, as proposed by psychologist C. H. Judd, learning to transfer is the result of the generalization of experience. It is possible to realize the transfer from one situation to another, as long as a person generalizes his experience. According to this theory, the prerequisite of transfer is that there needs to be a connection between two learning activities. In practice, a person who has learned the violin can learn the piano faster than others, since both the violin and the piano are musical instruments and may share some common knowledge. Fig. 1 shows some intuitive examples about transfer learning. Inspired by human beings' capabilities to transfer knowledge across domains, transfer learning aims to leverage knowledge from a related domain (called source domain) to improve the learning performance or minimize the number of labeled examples required in a target domain. It is worth mentioning that the transferred knowledge does not always bring a positive impact on new tasks. If there is little in common between domains, knowledge transfer could be unsuccessful. For example, learning to ride a bicycle cannot help us learn to play the piano faster. Besides, the similarities between domains do not always facilitate learning, because sometimes the similarities may be misleading. For example, although Spanish and French have a close relationship with each other and both belong to the Romance group of languages, people who learn Spanish may experience difficulties in learning French, such as using the wrong vocabulary or conjugation. This occurs because previous successful experience in Spanish can interfere with learning the word formation, usage, pronunciation, conjugation, and so on, in French. In the field of psychology, the phenomenon that previous experience has a negative effect on learning new tasks is called negative transfer [1]. Similarly, in the transfer learning area, if the target learner is negatively affected by the transferred knowledge, the phenomenon is also termed as negative transfer [2], [3]. Whether negative transfer will occur may depend on several factors, such as the relevance between the source and the target domains and the learner's capacity of finding the transferable and beneficial part of the knowledge across domains. In [3], a formal definition and some analyses of negative transfer are given.

Roughly speaking, according to the discrepancy between domains, transfer learning can be further divided into two categories, that is, homogeneous and heterogeneous transfer learning [4]. Homogeneous transfer learning approaches are developed and proposed for handling the situations where the domains are of the same feature space. In homogeneous transfer learning, some studies assume that domains differ only in marginal distributions. Therefore, they adapt the domains by correcting the sample selection bias [5] or covariate shift [6]. However, this assumption does not hold in many cases. For example,


in the sentiment classification problem, a word may have different meaning tendencies in different domains. This phenomenon is also called context feature bias [7]. To solve this problem, some studies further adapt the conditional distributions. Heterogeneous transfer learning refers to the knowledge transfer process in the situations where the domains have different feature spaces. In addition to distribution adaptation, heterogeneous transfer learning requires feature space adaptation [7], which makes it more complicated than homogeneous transfer learning.

The survey aims to give readers a comprehensive understanding about transfer learning from the perspectives of data and model. The mechanisms and the strategies of transfer learning approaches are introduced to allow readers to grasp how the approaches work. And a number of the existing transfer learning studies are connected and systematized. Specifically, over 40 representative transfer learning approaches are introduced. Besides, we conduct experiments to demonstrate on which data set a transfer learning model performs well.

In this survey, we focus more on homogeneous transfer learning. Some interesting transfer learning topics are not covered in this survey, such as reinforcement transfer learning [8], lifelong transfer learning [9], and online transfer learning [10]. The rest of this survey is organized into seven sections. Section II clarifies the difference between transfer learning and other related machine learning techniques. Section III introduces the notation used in this survey and the definitions about transfer learning. Sections IV and V interpret transfer learning approaches from the data and the model perspectives, respectively. Section VI introduces some applications of transfer learning. Experiments are conducted and the results are provided in Section VII. Section VIII concludes this survey. The main contributions of this survey are summarized as follows.
1) Over 40 representative transfer learning approaches are introduced and summarized, which can give readers a comprehensive overview about transfer learning.
2) Experiments are conducted to compare different transfer learning approaches. The performance of over 20 different approaches is displayed intuitively and then analyzed, which may be instructive for the readers to select the appropriate ones in practice.

II. RELATED WORK

Some areas related to transfer learning are introduced. The connections and difference between them and transfer learning are clarified.

A. Semisupervised Learning

Semisupervised learning [11] is a machine learning task and a method that lies between supervised learning (with completely labeled instances) and unsupervised learning (without any labeled instances). Typically, a semi-supervised method utilizes abundant unlabeled instances combined with a limited number of labeled instances to train a learner. Semi-supervised learning relaxes the dependence on labeled instances and thus reduces the expensive labeling costs. Note that, in semi-supervised learning, both the labeled and unlabeled instances are drawn from the same distribution. In contrast, in transfer learning, the data distributions of the source and the target domains are usually different. Many transfer learning approaches absorb the technology of semi-supervised learning. The key assumptions in semi-supervised learning, that is, smoothness, cluster, and manifold assumptions, are also made use of in transfer learning. It is worth mentioning that semi-supervised transfer learning is a controversial term. The reason is that the concept of whether the label information is available in transfer learning is ambiguous because both the source and the target domains can be involved.

B. Multiview Learning

Multiview learning [12] focuses on the machine learning problems with multiview data. A view represents a distinct feature set. An intuitive example about multiple views is that a video object can be described from two different viewpoints, that is, the image signal and the audio signal. Briefly, multiview learning describes an object from multiple views, which results in abundant information. By properly considering the information from all views, the learner's performance can be improved. There are several strategies adopted in multiview learning such as subspace learning, multikernel learning, and co-training [13], [14]. Multiview techniques are also adopted in some transfer learning approaches. For example, Zhang et al. [15] proposed a multiview transfer learning framework, which imposes the consistency among multiple views. Yang and Gao [16] incorporated multiview information across different domains for knowledge transfer. The work by Feuz and Cook [17] introduces a multiview transfer learning approach for activity learning, which transfers activity knowledge between heterogeneous sensor platforms.

C. Multitask Learning

The thought of multitask learning [18] is to jointly learn a group of related tasks. To be more specific, multitask learning reinforces each task by taking advantage of the interconnections between tasks, that is, considering both the intertask relevance and the intertask difference. In this way, the generalization of each task is enhanced. The main difference between transfer learning and multitask learning is that the former transfers the knowledge contained in the related domains, while the latter transfers the knowledge via simultaneously learning some related tasks. In other words, multitask learning pays equal attention to each task, while transfer learning pays more attention to the target task than to the source task. There are some commonalities and associations between transfer learning and


multitask learning. Both of them aim to improve the performance of learners via knowledge transfer. Besides, they adopt some similar strategies for constructing models, such as feature transformation and parameter sharing. Note that some existing studies utilize both the transfer learning and the multitask learning technologies. For example, the work by Zhang et al. [19] employs multitask and transfer learning techniques for biological image analysis. The work by Liu et al. [20] proposes a framework for human action recognition based on multitask learning and multisource transfer learning.

III. OVERVIEW

In this section, the notation used in this survey is listed for convenience. Besides, some definitions and categorizations about transfer learning are introduced, and some related surveys are also provided.

A. Notation

For convenience, a list of symbols and their definitions are shown in Nomenclature. Besides, we use ∥·∥ to represent the norm and superscript T to denote the transpose of a vector/matrix.

B. Definition

In this section, some definitions about transfer learning are given. Before giving the definition of transfer learning, let us review the definitions of a domain and a task.

Definition 1 (Domain): A domain D is composed of two parts, that is, a feature space X and a marginal distribution P(X). In other words, D = {X, P(X)}. And the symbol X denotes an instance set, which is defined as X = {x | x_i ∈ X, i = 1, ..., n}.

Definition 2 (Task): A task T consists of a label space Y and a decision function f, that is, T = {Y, f}. The decision function f is an implicit one, which is expected to be learned from the sample data.

Some machine learning models actually output the predicted conditional distributions of instances. In this case, f(x_j) = {P(y_k | x_j) | y_k ∈ Y, k = 1, ..., |Y|}.

In practice, a domain is often observed by a number of instances with or without the label information. For example, a source domain D^S corresponding to a source task T^S is usually observed via the instance-label pairs, that is, D^S = {(x, y) | x_i ∈ X^S, y_i ∈ Y^S, i = 1, ..., n^S}; an observation of the target domain usually consists of a number of unlabeled instances and/or a limited number of labeled instances.

Definition 3 (Transfer Learning): Given some/an observation(s) corresponding to m^S ∈ N^+ source domain(s) and task(s) (i.e., {(D_{S_i}, T_{S_i}) | i = 1, ..., m^S}), and some/an observation(s) about m^T ∈ N^+ target domain(s) and task(s) (i.e., {(D_{T_j}, T_{T_j}) | j = 1, ..., m^T}), transfer learning utilizes the knowledge implied in the source domain(s) to improve the performance of the learned decision functions f^{T_j} (j = 1, ..., m^T) on the target domain(s).

The above definition, which covers the situation of multisource transfer learning, is an extension of the one presented in the survey [2]. If m^S equals 1, the scenario is called single-source transfer learning. Otherwise, it is called multisource transfer learning. Besides, m^T represents the number of the transfer learning tasks. A few studies focus on the setting that m^T ≥ 2 [21]. The existing transfer learning studies pay more attention to the scenarios where m^T = 1 (especially where m^S = m^T = 1). It is worth mentioning that the observation of a domain or a task is a concept with broad sense, which is often cemented into a labeled/unlabeled instance set or a prelearned model. A common scenario is that we have abundant labeled instances or have a well-trained model on the source domain, and we only have limited labeled target-domain instances. In this case, the resources such as the instances and the model are actually the observations, and the goal of transfer learning is to learn a more accurate decision function on the target domain.
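The following toy sketch illustrates the notation above: a domain is observed through an instance set with or without labels, and a transfer learning problem pairs one or more source observations with a target observation. The class names and fields are illustrative conveniences, not part of the survey's formalism.

```python
# Minimal sketch of the "observations" in Definitions 1-3.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class DomainObservation:
    X: np.ndarray                   # instance set drawn from the feature space
    y: Optional[np.ndarray] = None  # labels if available (None for unlabeled data)

@dataclass
class TransferLearningProblem:
    sources: list[DomainObservation]  # m^S >= 1 source-domain observations
    target: DomainObservation         # target-domain observation (often few/no labels)

# Example: one labeled source domain and an unlabeled target domain.
rng = np.random.default_rng(0)
problem = TransferLearningProblem(
    sources=[DomainObservation(X=rng.normal(size=(100, 5)),
                               y=rng.integers(0, 2, size=100))],
    target=DomainObservation(X=rng.normal(loc=0.5, size=(40, 5))),
)
```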

Another term commonly used in the transfer learning area is domain adaptation. Domain adaptation refers to the process of adapting one or more source domains to transfer knowledge and improve the performance of the target learner [4]. Transfer learning often relies on the domain adaptation process, which attempts to reduce the difference between domains.

C. Categorization of Transfer Learning

There are several categorization criteria of transfer learning. For example, transfer learning problems can be divided into three categories, that is, transductive, inductive, and unsupervised transfer learning [2]. The complete definitions of these three categories are presented in [2]. These three categories can be interpreted from a label-setting aspect. Roughly speaking, transductive transfer learning refers to the situations where the label information only comes from the source domain. If the label information of the target-domain instances is available, the scenario can be categorized into inductive transfer learning. If the label information is unknown for both the source and the target domains, the situation is known as unsupervised transfer learning. Another categorization is based on the consistency between the source and the target feature spaces and label spaces. If X^S = X^T and Y^S = Y^T, the scenario is termed as homogeneous transfer learning. Otherwise, if X^S ≠ X^T and/or Y^S ≠ Y^T, the scenario is referred to as heterogeneous transfer learning.

According to the survey [2], the transfer learning approaches can be categorized into four groups: instance-based, feature-based, parameter-based, and relational-based approaches. Instance-based transfer learning approaches are mainly based on the instance weighting strategy. Feature-based approaches transform the original features to create a new feature representation; they can be further divided into two subcategories, that is, asymmetric and symmetric feature-based transfer learning. Asymmetric approaches transform the source features to match the target ones. In contrast, symmetric approaches attempt to find a common latent feature space and then transform both the source and the target features into a new feature representation. The parameter-based transfer learning approaches transfer the knowledge at the model/parameter level. Relational-based transfer learning approaches mainly focus on the problems in relational domains. Such approaches transfer the logical relationship or rules learned in the source domain to the target domain. For better understanding, Fig. 2 shows the above-mentioned categorizations of transfer learning.

Fig. 2. Categorizations of transfer learning.

Some surveys are provided for the readers who want to have a more complete understanding of this field. The survey by Pan and Yang [2], which is a pioneering work, categorizes transfer learning and reviews the research progress before 2010. The survey by Weiss et al. [4] introduces and summarizes a number of homogeneous and heterogeneous transfer learning approaches. Heterogeneous transfer learning is specially reviewed in the survey by Day and Khoshgoftaar [7]. Some surveys review the literature related to specific themes such as reinforcement learning [8], computational intelligence [22], and deep learning [23], [24]. Besides, a number of surveys focus on specific application scenarios including activity recognition [25], visual categorization [26], collaborative recommendation [27], computer vision [24], and sentiment analysis [28].

Note that the organization of this survey does not strictly follow the above-mentioned categorizations. In Sections IV and V, transfer learning approaches are interpreted from the data and the model perspectives. Roughly speaking, data-based interpretation covers the above-mentioned instance-based and feature-based transfer learning approaches, but from a broader perspective. Model-based interpretation covers the above-mentioned parameter-based approaches. Since there are relatively few studies concerning relational-based transfer learning and the representative approaches are well introduced in [2] and [4], this survey does not focus on relational-based approaches.

IV. DATA-BASED INTERPRETATION

Many transfer learning approaches, especially the data-based approaches, focus on transferring the knowledge via the adjustment and the transformation of data. Fig. 3 shows the strategies and the objectives of the approaches from the data perspective. As shown in Fig. 3, space adaptation is one of the objectives. This objective is required to be satisfied mostly in heterogeneous transfer learning scenarios. In this survey, we focus more on homogeneous transfer learning, and the main objective in this scenario is to reduce the distribution difference between the source-domain and the target-domain instances. Furthermore, some advanced approaches may attempt to preserve the data properties in the adaptation process. There are generally two strategies for realizing the objectives from the data perspective, that is, instance weighting and feature transformation. In this section, some related transfer learning approaches are introduced in proper order according to the strategies shown in Fig. 3.

Fig. 3. Strategies and the objectives of the transfer learning approaches from the data perspective.

A. Instance Weighting Strategy

Let us first consider a simple scenario in which a large number of labeled source-domain and a limited number of target-domain instances are available and the domains differ only in marginal distributions (i.e., P^S(X) ≠ P^T(X) and P^S(Y|X) = P^T(Y|X)). For example, suppose that we need to build a model to diagnose cancer in a specific region where the elderly are the majority. Limited target-domain instances are given, and relevant data are available from another region where young people are the majority. Directly transferring all the data from another region may be unsuccessful, since the marginal distribution difference exists, and the elderly have a higher risk of cancer than younger people. In this scenario, it is natural to consider adapting the marginal distributions. A simple idea is to assign weights to the source-domain instances in

the loss function. The weighting strategy is based on the following equation [5]:

$$\mathbb{E}_{(x,y)\sim P^T}\big[\mathcal{L}(x,y;f)\big]=\mathbb{E}_{(x,y)\sim P^S}\left[\frac{P^T(x,y)}{P^S(x,y)}\,\mathcal{L}(x,y;f)\right]=\mathbb{E}_{(x,y)\sim P^S}\left[\frac{P^T(x)}{P^S(x)}\,\mathcal{L}(x,y;f)\right].$$

Therefore, the general objective function of a learning task can be written as [5]

$$\min_{f}\ \frac{1}{n^S}\sum_{i=1}^{n^S}\beta_i\,\mathcal{L}\big(f(x_i^S),\,y_i^S\big)+Χ(f)$$

where β_i (i = 1, 2, ..., n^S) is the weighting parameter. The theoretical value of β_i is equal to P^T(x_i)/P^S(x_i). However, this ratio is generally unknown and is difficult to obtain by using traditional methods.

Kernel mean matching (KMM), which is proposed by Huang et al. [5], resolves the estimation problem of the above unknown ratios by matching the means between the source-domain and the target-domain instances in a reproducing kernel Hilbert space (RKHS), that is,

$$\arg\min_{\beta_i\in[0,B]}\ \left\|\frac{1}{n^S}\sum_{i=1}^{n^S}\beta_i\,\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\right\|_{\mathcal{H}}^{2}\qquad\text{s.t.}\ \left|\frac{1}{n^S}\sum_{i=1}^{n^S}\beta_i-1\right|\le\delta$$

where δ is a small parameter and B is a parameter for constraint. The optimization problem mentioned above can be converted into a quadratic programming problem by expanding and using the kernel trick. This approach to estimating the ratios of distributions can be easily incorporated into many existing algorithms. Once the weight β_i is obtained, a learner can be trained on the weighted source-domain instances.

There are some other studies attempting to estimate the weights. For example, Sugiyama et al. [6] proposed an approach termed Kullback–Leibler importance estimation procedure (KLIEP). KLIEP depends on the minimization of the Kullback–Leibler (KL) divergence and incorporates a built-in model selection procedure. Based on the studies of weight estimation, some instance-based transfer learning frameworks or algorithms are proposed. For example, Sun et al. [29] proposed a multisource framework termed two-stage weighting framework for multisource-domain adaptation (2SW-MDA) with the following two stages.
1) Instance weighting: The source-domain instances are assigned with weights to reduce the marginal distribution difference, which is similar to KMM.
2) Domain weighting: Weights are assigned to each source domain for reducing the conditional distribution difference based on the smoothness assumption [30].
Then, the source-domain instances are reweighted based on the instance weights and the domain weights. These reweighted instances and the labeled target-domain instances are used to train the target classifier.
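The following is a minimal NumPy/SciPy sketch of the KMM idea above: the weights β are found by minimizing the kernelized objective under the box and mean constraints. The RBF kernel, the helper names, and the hyperparameters (B, δ, γ) are illustrative assumptions, not the authors' reference implementation.

```python
# Sketch of kernel mean matching (KMM) weights beta_i ~ P_T(x)/P_S(x).
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kmm_weights(Xs, Xt, B=10.0, delta=0.01, gamma=1.0):
    ns, nt = len(Xs), len(Xt)
    K_ss = rbf_kernel(Xs, Xs, gamma)              # source-source kernel
    k_st = rbf_kernel(Xs, Xt, gamma).sum(axis=1)  # row sums of source-target kernel

    def objective(beta):
        # || (1/ns) sum_i beta_i Phi(x_i^S) - (1/nt) sum_j Phi(x_j^T) ||^2_H (up to a constant)
        return (beta @ K_ss @ beta) / ns**2 - 2.0 * (beta @ k_st) / (ns * nt)

    cons = [{"type": "ineq", "fun": lambda b: delta - abs(b.mean() - 1.0)}]
    res = minimize(objective, x0=np.ones(ns), bounds=[(0.0, B)] * ns, constraints=cons)
    return res.x

# Usage: beta = kmm_weights(Xs, Xt); then train any learner on the labeled source
# data with per-instance weights beta (e.g., via sample_weight in scikit-learn).
```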


In addition to directly estimating the weighting parameters, adjusting weights iteratively is also effective. The key is to design a mechanism to decrease the weights of the instances that have negative effects on the target learner. A representative work is TrAdaBoost, which is a framework proposed by Dai et al. [31]. This framework is an extension of AdaBoost [32]. AdaBoost is an effective boosting algorithm designed for traditional machine learning tasks. In each iteration of AdaBoost, a learner is trained on the instances with updated weights, which results in a weak classifier. The weighting mechanism of instances ensures that the instances with incorrect classification are given more attention. Finally, the resultant weak classifiers are combined to form a strong classifier.

TrAdaBoost extends AdaBoost to the transfer learning scenario; a new weighting mechanism is designed to reduce the impact of the distribution difference. Specifically, in TrAdaBoost, the labeled source-domain and labeled target-domain instances are combined as a whole, that is, a training set, to train the weak classifier. The weighting operations are different for the source-domain and the target-domain instances. In each iteration, a temporary variable δ̄, which measures the classification error rate on the labeled target-domain instances, is calculated. Then, the weights of the target-domain instances are updated based on δ̄ and the individual classification results, while the weights of the source-domain instances are updated based on a designed constant and the individual classification results. For better understanding, the formulas used in the kth iteration (k = 1, ..., N) for updating the weights are presented as follows [31]:

$$\beta_{k,i}^{S}=\beta_{k-1,i}^{S}\left(1+\sqrt{2\ln n^{S}/N}\right)^{-\left|f_k(x_i^{S})-y_i^{S}\right|}\qquad (i=1,\ldots,n^{S})$$

$$\beta_{k,j}^{T}=\beta_{k-1,j}^{T}\left(\bar{\delta}_k/(1-\bar{\delta}_k)\right)^{-\left|f_k(x_j^{T})-y_j^{T}\right|}\qquad (j=1,\ldots,n^{T}).$$

Note that each iteration forms a new weak classifier. The final classifier is constructed by combining and ensembling half the number of the newly resultant weak classifiers through a voting scheme.
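The sketch below implements the two weight-update formulas above for a single iteration. It assumes labels in {0, 1} and clips δ̄ to keep the update well defined; the function name and these simplifications are ours, not the original implementation.

```python
# One iteration of the TrAdaBoost weight updates described above.
import numpy as np

def tradaboost_update(w_src, w_tgt, pred_src, y_src, pred_tgt, y_tgt, N):
    # Weighted error rate of the current weak classifier on the labeled target data.
    err_tgt = np.abs(pred_tgt - y_tgt)
    delta = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)
    delta = np.clip(delta, 1e-8, 0.499)

    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / N))
    beta_tgt = delta / (1.0 - delta)

    # Misclassified source instances are down-weighted (they likely reflect the
    # distribution mismatch), while misclassified target instances are up-weighted,
    # as in standard AdaBoost.
    w_src_new = w_src * beta_src ** np.abs(pred_src - y_src)
    w_tgt_new = w_tgt * beta_tgt ** (-np.abs(pred_tgt - y_tgt))
    return w_src_new, w_tgt_new, delta
```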
Some studies further extend TrAdaBoost. The work by Yao and Doretto [33] proposes a Multi-Source TrAdaBoost (MsTrAdaBoost) algorithm, which mainly has the following two steps in each iteration.
1) Candidate classifier construction: A group of candidate weak classifiers are, respectively, trained on the weighted instances in the pairs of each source domain and the target domain, that is, D_{S_i} ∪ D_T (i = 1, ..., m^S).
2) Instance weighting: A classifier (denoted by j and trained on D_{S_j} ∪ D_T), which has the minimal classification error rate δ̄ on the target-domain instances, is selected and then used for updating the weights of the instances in D_{S_j} and D_T.
Finally, the selected classifiers from each iteration are combined to form the final classifier. Another parameter-based algorithm, that is, TaskTrAdaBoost, is also proposed in the work [33], which is introduced in Section V-C.

Some approaches realize the instance weighting strategy in a heuristic way. For example, Jiang and Zhai [34] proposed a general weighting framework. There are three terms in the framework's objective function, which are designed to minimize the cross-entropy loss of three types of instances. The following types of instances are used to construct the target classifier.
1) Labeled target-domain instances: The classifier should minimize the cross-entropy loss on them, which is actually a standard supervised learning task.
2) Unlabeled target-domain instances: These instances' true conditional distributions P(y | x_i^{T,U}) are unknown and should be estimated. A possible solution is to train an auxiliary classifier on the labeled source-domain and target-domain instances to help estimate the conditional distributions or assign pseudo labels to these instances.
3) Labeled source-domain instances: The authors define the weight of x_i^{S,L} as the product of two parts, that is, α_i and β_i. The weight β_i is ideally equal to P^T(x_i)/P^S(x_i), which can be estimated by nonparametric methods such as KMM or can be set uniformly in the worst case. The weight α_i is used to filter out the source-domain instances that differ greatly from the target domain.

A heuristic method can be used to produce the value of α_i, which contains the following three steps (a simple sketch follows the list).
1) Auxiliary classifier construction: An auxiliary classifier trained on the labeled target-domain instances is used to classify the unlabeled source-domain instances.
2) Instance ranking: The source-domain instances are ranked based on the probabilistic prediction results.
3) Heuristic weighting (α_i): The weights of the top-k source-domain instances with wrong predictions are set to zero, and the weights of the others are set to 1.
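The rough sketch below follows the three steps above, assuming scikit-learn's logistic regression as the auxiliary classifier, integer class labels 0..C−1, and that the source labels are available for checking which predictions are wrong; the helper name and these choices are illustrative assumptions.

```python
# Heuristic alpha_i weights for source-domain instances (three-step procedure above).
import numpy as np
from sklearn.linear_model import LogisticRegression

def heuristic_alpha(X_tgt_labeled, y_tgt_labeled, X_src, y_src, top_k=50):
    # 1) Auxiliary classifier construction: train on labeled target-domain data.
    aux = LogisticRegression(max_iter=1000).fit(X_tgt_labeled, y_tgt_labeled)

    # 2) Instance ranking: score source instances by the predicted probability of
    #    their own labels; low scores indicate disagreement with the target domain.
    proba = aux.predict_proba(X_src)
    prob_of_true = proba[np.arange(len(y_src)), y_src]
    wrong = aux.predict(X_src) != y_src

    # 3) Heuristic weighting: zero out the top-k most confidently misclassified
    #    source instances; all remaining instances keep weight 1.
    alpha = np.ones(len(X_src))
    ranked = np.argsort(prob_of_true)            # most "foreign" instances first
    dropped = [i for i in ranked if wrong[i]][:top_k]
    alpha[dropped] = 0.0
    return alpha
```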


B. Feature Transformation Strategy

The feature transformation strategy is often adopted in feature-based approaches. For example, consider a cross-domain text classification problem. The task is to construct a target classifier by using the labeled text data from a related domain. In this scenario, a feasible solution is to find the common latent features (e.g., latent topics) through feature transformation and use them as a bridge to transfer knowledge. Feature-based approaches transform each original feature into a new feature representation for knowledge transfer. The objectives of constructing a new feature representation include minimizing the marginal and the conditional distribution difference, preserving the properties or the potential structures of the data, and finding the correspondence between features. The operations of feature transformation can be divided into three types, that is, feature augmentation, feature reduction, and feature alignment. Besides, feature reduction can be further divided into several types such as feature mapping, feature clustering, feature selection, and feature encoding. A complete feature transformation process designed in an algorithm may consist of several operations.

Table 1 Metrics Adopted in Transfer Learning

1) Distribution Difference Metric: One primary objective of feature transformation is to reduce the distribution difference of the source-domain and the target-domain instances. Therefore, how to measure the distribution difference or the similarity between domains effectively is an important issue.

The measurement termed maximum mean discrepancy (MMD) is widely used in the field of transfer learning, which is formulated as follows [35]:

$$\mathrm{MMD}(X^S, X^T)=\left\|\frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\right\|_{\mathcal{H}}^{2}.$$

MMD can be easily computed by using the kernel trick. Briefly, MMD quantifies the distribution difference by calculating the distance of the mean values of the instances in an RKHS. Note that the above-mentioned KMM actually produces the weights of instances by minimizing the MMD distance between domains.

Table 1 lists some commonly used metrics and the related algorithms. In addition to Table 1, there are some other measurement criteria adopted in transfer learning, including Wasserstein distance [59], [60] and central moment discrepancy [61]. Some studies focus on optimizing and improving the existing measurements. Let us take MMD as an example. Gretton et al. [62] proposed a multikernel version of MMD, that is, MK-MMD, which takes advantage of multiple kernels. Besides, Yan et al. [63] proposed a weighted version of MMD, which attempts to address the issue of class weight bias.
other measurement criteria adopted in transfer learning, ΦT xTj = W T xTj , 0S , xTj
including Wasserstein distance [59], [60] and central
moment discrepancy [61]. Some studies focus on opti-
mizing and improving the existing measurements. Let us where W S xSi and W T xTj have the same dimension;
take MMD as an example. Gretton et al. [62] proposed 0S and 0T denote the zero vectors with the dimensions
a multikernel version of MMD, that is, MK-MMD, which of xS and xT , respectively. HFA maps the original features
takes advantage of multiple kernels. Besides, Yan et al. [63] into a common feature space and then performs a feature
proposed a weighted version of MMD, which attempts to stacking operation. The mapped features, original features,
address the issue of class weight bias. and zero elements are stacked in a particular order to
produce a new feature representation.
2) Feature Augmentation: Feature augmentation opera-
tions are widely used in feature transformation, especially 3) Feature Mapping: In the field of traditional machine
in symmetric feature-based approaches. To be more spe- learning, there are many feasible mapping-based methods
cific, there are several ways to realize feature augmen- of extracting features such as principal component analysis
tation such as feature replication and feature stacking. (PCA) [68] and kernelized-PCA (KPCA) [69]. However,
For better understanding, we start with a simple transfer these methods mainly focus on the data variance rather
learning approach which is established based on feature than the distribution difference. In order to solve the
replication. distribution difference, some feature extraction methods
The work by Daumé [64] proposes a simple domain are proposed for transfer learning. Let us first consider
adaptation method, that is, feature augmentation method a simple scenario where there is little difference in the

50 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 1, January 2021

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

conditional distributions of the domains. In this case, the following simple objective function can be used to find a mapping for feature extraction

$$\min_{\Phi}\ \big[\mathrm{DIST}(X^S, X^T;\Phi)+\lambda\,Χ(\Phi)\big]\,/\,\mathrm{VAR}(X^S\cup X^T;\Phi)$$

where Φ is a low-dimensional mapping function, DIST(·) represents a distribution difference metric, Χ(Φ) is a regularizer controlling the complexity of Φ, and VAR(·) represents the variance of instances. This objective function aims to find a mapping function Φ that minimizes the marginal distribution difference between domains and meanwhile makes the variance of the instances as large as possible. The objective corresponding to the denominator can be optimized in several ways. One possible way is to optimize the objective of the numerator with a variance constraint. For example, the scatter matrix of the mapped instances can be enforced as an identity matrix. Another way is to optimize the objective of the numerator in a high-dimensional feature space at first. Then, a dimension reduction algorithm such as PCA or KPCA can be performed to realize the objective of the denominator.

Furthermore, finding the explicit formulation of Φ(·) is nontrivial. To solve this problem, some approaches adopt the linear mapping technique or turn to the kernel trick. In general, there are three main ideas to deal with the above optimization problems.
1) Mapping learning + feature extraction: A possible way is to find a high-dimensional space at first where the objectives are met by solving a kernel matrix learning problem or a transformation matrix finding problem. Then, the high-dimensional features are compacted to form a low-dimensional feature representation. For example, once the kernel matrix is learned, the principal components of the implicit high-dimensional features can be extracted to construct a new feature representation based on PCA.
2) Mapping construction + mapping learning: Another way is to map the original features to a constructed high-dimensional feature space, and then a low-dimensional mapping is learned to satisfy the objective function. For example, a kernel matrix can be constructed based on a selected kernel function at first. Then, the transformation matrix can be learned, which projects the high-dimensional features into a common latent subspace.
3) Direct low-dimensional mapping learning: It is usually difficult to find a desired low-dimensional mapping directly. However, if the mapping is assumed to satisfy certain conditions, it may be solvable. For example, if the low-dimensional mapping is restricted to be a linear one, the optimization problem can be easily solved.

Some approaches also attempt to match the conditional distributions and preserve the structures of the data. To achieve this, the above simple objective function needs to incorporate new terms or/and constraints. For example, the following general objective function is a possible choice:

$$\min_{\Phi}\ \mu\,\mathrm{DIST}(X^S, X^T;\Phi)+\lambda_1\,Χ_{\mathrm{GEO}}(\Phi)+\lambda_2\,Χ(\Phi)+(1-\mu)\,\mathrm{DIST}\big(Y^S|X^S,\ Y^T|X^T;\Phi\big)$$

$$\text{s.t.}\quad \Phi(X)^{\mathsf T} H\,\Phi(X)=I,\qquad\text{with } H=I-(1/n)\mathbf{1}\in\mathbb{R}^{n\times n}$$

where μ is a parameter balancing the marginal and the conditional distribution difference [70], Χ_GEO(Φ) is a regularizer controlling the geometric structure, Φ(X) is the matrix whose rows are the instances from both the source and the target domains with the extracted new feature representation, H is the centering matrix for constructing the scatter matrix, and the constraint is used to maximize the variance. The last term in the objective function denotes the measurement of the conditional distribution difference.

Before the further discussion about the above objective function, it is worth mentioning that the label information of the target-domain instances is often limited or even unknown. The lack of the label information makes it difficult to estimate the distribution difference. In order to solve this problem, some approaches resort to the pseudo-label strategy, that is, assigning pseudo labels to the unlabeled target-domain instances. A simple method of realizing this is to train a base classifier to assign pseudo labels. By the way, there are some other methods of providing pseudo labels such as co-training [71], [72] and tri-training [73], [74]. Once the pseudo-label information is complemented, the conditional distribution difference can be measured. For example, MMD can be modified and extended to measure the conditional distribution difference. Specifically, for each label, the source-domain and the target-domain instances that belong to the same class are collected, and the estimation expression of the conditional distribution difference is given by [38]

$$\sum_{k=1}^{|\mathcal{Y}|}\left\|\frac{1}{n_k^S}\sum_{i=1}^{n_k^S}\Phi\big(x_i^{S,k}\big)-\frac{1}{n_k^T}\sum_{j=1}^{n_k^T}\Phi\big(x_j^{T,k}\big)\right\|_{\mathcal{H}}^{2}$$

where n_k^S and n_k^T denote the numbers of the instances in the source and the target domains with the same label Y_k, respectively. This estimation actually measures the class-conditional distribution [i.e., P(x|y)] difference to approximate the conditional distribution [i.e., P(y|x)] difference. Some studies improve the above estimation. For example, the work by Wang et al. [70] uses a weighted method to additionally solve the class imbalance problem.
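The sketch below combines the pseudo-label strategy with the class-wise MMD estimate above: a base classifier provides pseudo labels for the unlabeled target data, and the discrepancy is summed over classes. The base classifier (logistic regression), the RBF kernel, and the function names are illustrative assumptions.

```python
# Class-conditional MMD with pseudo labels for the target-domain instances.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def conditional_mmd(Xs, ys, Xt, gamma=1.0):
    # Assign pseudo labels to the unlabeled target-domain instances.
    base = LogisticRegression(max_iter=1000).fit(Xs, ys)
    yt_pseudo = base.predict(Xt)

    total = 0.0
    for c in np.unique(ys):
        Xs_c, Xt_c = Xs[ys == c], Xt[yt_pseudo == c]
        if len(Xs_c) == 0 or len(Xt_c) == 0:
            continue  # skip classes missing in either domain
        total += (rbf_kernel(Xs_c, Xs_c, gamma).mean()
                  - 2.0 * rbf_kernel(Xs_c, Xt_c, gamma).mean()
                  + rbf_kernel(Xt_c, Xt_c, gamma).mean())
    return total
```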


For better understanding, the transfer learning approaches that are special cases of the general objective function presented in the previous paragraph are detailed as follows.

1) μ = 1 and λ1 ≠ 0: The objective function of MMD embedding (MMDE) is given by [75]

$$\min_{K}\ \mathrm{MMD}(X^S, X^T;\Phi)-\frac{\lambda_1}{n^S+n^T}\sum_{i\neq j}\|\Phi(x_i)-\Phi(x_j)\|^2$$

$$\text{s.t.}\quad \|\Phi(x_i)-\Phi(x_j)\|^2=\|x_i-x_j\|^2\quad \forall\,(x_i\in k\text{-NN}(x_j))\wedge(x_j\in k\text{-NN}(x_i)),\ \ x_i, x_j\in X^S\cup X^T$$

where k-NN(x) represents the k nearest neighbors of the instance x. Weinberger et al. [76] designed the above objective function motivated by maximum variance unfolding (MVU). Instead of employing a scatter matrix constraint, the constraints and the second term of this objective function aim to maximize the distance between instances as well as preserve the local geometry. The desired kernel matrix K can be learned by solving a semi-definite programming (SDP) [77] problem. After obtaining the kernel matrix, PCA is applied to it, and then the leading eigenvectors are selected to help construct a low-dimensional feature representation.

2) μ = 1 and λ1 = 0: The work by Pan et al. [36], [78] proposes an approach termed transfer component analysis (TCA). TCA adopts MMD to measure the marginal distribution difference and enforces the scatter matrix as the constraint. Different from MMDE, which learns the kernel matrix and then further adopts PCA, TCA is a unified method that just needs to learn a linear mapping from an empirical kernel feature space to a low-dimensional feature space. In this way, it avoids solving the SDP problem, which results in a relatively low computational burden. The final optimization problem can be easily solved via eigen-decomposition. TCA can also be extended to utilize the label information. In the extended version, the scatter matrix constraint is replaced by a new one that balances the label dependence (measured by HSIC) and the data variance. Besides, a graph Laplacian regularizer [30] is also added to preserve the geometry of the manifold. Similarly, the final optimization problem can also be solved by eigen-decomposition.

3) μ = 0.5 and λ1 = 0: Long et al. [38] proposed an approach termed joint distribution adaptation (JDA). JDA attempts to find a transformation matrix that maps the instances to a low-dimensional space where both the marginal and the conditional distribution difference are minimized. To realize it, the MMD metric and the pseudo-label strategy are adopted. The desired transformation matrix can be obtained by solving a trace optimization problem via eigen-decomposition. Furthermore, it is obvious that the accuracy of the estimated pseudo labels affects the performance of JDA. In order to improve the labeling quality, the authors adopt iterative refinement operations. Specifically, in each iteration, JDA is performed, and then a classifier is trained on the instances with the extracted features. Next, the pseudo labels are updated based on the trained classifier. After that, JDA is performed repeatedly with the updated pseudo labels. The iteration ends when convergence occurs. Note that JDA can be extended by utilizing the label and structure information [79], clustering information [80], various statistical and geometrical information [81], and so on.

4) μ ∈ (0, 1) and λ1 = 0: The article by Wang et al. [70] proposes an approach termed balanced distribution adaptation (BDA), which is an extension of JDA. Different from JDA, which assumes that the marginal and the conditional distributions have the same importance in adaptation, BDA attempts to balance the marginal and the conditional distribution adaptation. The operations of BDA are similar to JDA. In addition, the authors also proposed the weighted BDA (WBDA). In WBDA, the conditional distribution difference is measured by a weighted version of MMD to solve the class imbalance problem.

It is worth mentioning that some approaches transform the features into a new feature space (usually of a high dimension) and train an adaptive classifier simultaneously. To realize this, the mapping function of the features and the decision function of the classifier need to be associated. One possible way is to define the following decision function: f(x) = θ · Φ(x) + b, where θ denotes the classifier parameter and b denotes the bias. In light of the representer theorem [82], the parameter θ can be defined as θ = Σ_{i=1}^{n} α_i Φ(x_i), and thus we have

$$f(x)=\sum_{i=1}^{n}\alpha_i\,\Phi(x_i)\cdot\Phi(x)+b=\sum_{i=1}^{n}\alpha_i\,\kappa(x_i, x)+b$$

where κ denotes the kernel function. By using the kernel matrix as the bridge, the regularizers designed for the mapping function can be incorporated into the classifier's objective function. In this way, the final optimization problem is usually about the parameter (e.g., α_i) or the kernel function, as illustrated by the sketch below.
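The following minimal sketch shows how the representer-theorem form of f above is used: α is learned from some objective, and predictions are formed through kernel evaluations. Here a plain kernel ridge solve stands in for whichever adaptation objective is adopted (the structural-risk term only); the distribution-matching and manifold regularizers discussed in this section would be added on top. Function names and the RBF kernel are illustrative assumptions.

```python
# Kernelized decision function f(x) = sum_i alpha_i * kappa(x_i, x) + b.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kernel_classifier(X, y, lam=1.0, gamma=1.0):
    # Kernel ridge regression on labels (encoded as +/-1 or real values) as a
    # simple stand-in for learning alpha: alpha = (K + lam * I)^{-1} y.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def decision_function(alpha, X_train, X_new, b=0.0, gamma=1.0):
    # Evaluate f on each row of X_new via the kernel between new and training points.
    return rbf_kernel(X_new, X_train, gamma) @ alpha + b
```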


For example, the article by Long et al. [39] proposes a general framework termed adaptation regularization-based transfer learning (ARTL). The goals of ARTL are to learn the adaptive classifier, to minimize the structural risk, to jointly reduce the marginal and the conditional distribution difference, and to maximize the manifold consistency between the data structure and the predictive structure. The authors also proposed two specific algorithms under this framework based on different loss functions. In these two algorithms, the coefficient matrix for computing MMD and the graph Laplacian matrix for manifold regularization are constructed at first. Then, a kernel function is selected to construct the kernel matrix. After that, the classifier learning problem is converted into a parameter (i.e., α_i) solving problem, and the solution formula is also given in [39].

In ARTL, the choice of the kernel function affects the performance of the final classifier. In order to construct a robust classifier, some studies turn to kernel learning. For example, the article by Duan et al. [83] proposes a unified framework termed domain transfer multiple kernel learning (DTMKL). In DTMKL, the kernel function is assumed to be a linear combination of a group of base kernels, that is, κ(x_i, x_j) = Σ_{k=1}^{N} β_k κ_k(x_i, x_j). DTMKL aims to minimize the distribution difference, the classification error, and so on, simultaneously. The general objective function of DTMKL can be written as follows:

$$\min_{\beta_k,\,f}\ \sigma\big(\mathrm{MMD}(X^S, X^T;\kappa)\big)+\lambda\,Χ_L(\beta_k, f)$$

where σ is any monotonically increasing function, f is the decision function with the same definition as the one in ARTL, and Χ_L(β_k, f) is a general term representing a group of regularizers defined on the labeled instances such as the ones for minimizing the classification error and controlling the complexity of the resultant model. Rakotomamonjy et al. [84] developed an algorithm to learn the kernel and the decision function simultaneously by using the reduced gradient descent method. In each iteration, the weight coefficients of the base kernels are fixed to update the decision function at first. Then, the decision function is fixed to update the weight coefficients. Note that DTMKL can incorporate many existing kernel methods. The authors proposed two specific algorithms under this framework. The first one implements the framework by using the hinge loss and the support vector machine (SVM). The second one is an extension of the first one with an additional regularizer utilizing pseudo-label information, and the pseudo labels of the unlabeled instances are generated by using base classifiers.

4) Feature Clustering: Feature clustering aims to find a more abstract feature representation of the original features. Although it can be regarded as a way of feature extraction, it is different from the above-mentioned mapping-based extraction.

For example, some transfer learning approaches implicitly reduce the features by using the co-clustering technique, that is, simultaneously clustering both the columns and rows of (or say, co-clustering) a contingency table based on information theory [85]. The article by Dai et al. [41] proposes an algorithm termed Co-Clustering-Based Classification (CoCC), which is used for document classification. In a document classification problem, the transfer learning task is to classify the target-domain documents (represented by a document-to-word matrix) with the help of the labeled source document-to-word data. CoCC regards the co-clustering technique as a bridge to transfer the knowledge. In the CoCC algorithm, both the source and the target document-to-word matrices are co-clustered. The source document-to-word matrix is co-clustered to generate the word clusters based on the known label information, and these word clusters are used as constraints during the co-clustering process of the target-domain data. The co-clustering criterion is to minimize the loss in mutual information, and the clustering results are obtained by iteration. Each iteration contains the following two steps.
1) Document clustering: Each row of the target document-to-word matrix is reordered based on the objective function for updating the document clusters.
2) Word clustering: The word clusters are adjusted to minimize the joint mutual-information loss of the source and the target document-to-word matrices.
After several iterations, the algorithm converges, and the classification results are obtained. Note that, in CoCC, the word clustering process implicitly extracts the word features to form unified word clusters.

Dai et al. [42] also proposed an unsupervised clustering approach, which is termed self-taught clustering (STC). Similar to CoCC, this algorithm is also a co-clustering-based one. However, STC does not need the label information. STC aims to simultaneously co-cluster the source-domain and the target-domain instances with the assumption that these two domains share the same feature clusters in their common feature space. Therefore, two co-clustering tasks are separately performed at the same time to find the shared feature clusters. Each iteration of STC has the following steps.
1) Instance clustering: The clustering results of the source-domain and the target-domain instances are updated to minimize their respective loss in mutual information.
2) Feature clustering: The feature clusters are updated to minimize the joint loss in mutual information.
When the algorithm converges, the clustering results of the target-domain instances are obtained.

Different from the above-mentioned co-clustering-based ones, some approaches extract the original features into concepts (or topics). In the document classification problem, the concepts represent the high-level abstractness of the words (e.g., word clusters). In order to introduce the concept-based transfer learning approaches easily, let us briefly review the latent semantic analysis (LSA) [86], the probabilistic LSA (PLSA) [87], and the dual-PLSA [88].
1) LSA: LSA is an approach to mapping the document-to-word matrix to a low-dimensional space (i.e., a latent semantic space) based on the singular value decomposition (SVD) technique. In short, LSA attempts to find the true meanings of the words. To realize this, the SVD technique is used to reduce the dimensionality, which can remove the irrelevant information and filter out the noise information from the raw data.
2) PLSA: PLSA is developed based on a statistical view of LSA. PLSA assumes that there is a latent class variable z, which reflects the concept, associating the document d and the word w. Besides, d and w are independently conditioned on the concept z. The diagram of this graphical model is presented as follows:

$$d \xleftarrow{\,P(d_i\mid z_k)\,} z \xrightarrow{\,P(w_j\mid z_k)\,} w$$

where the subscripts i, j, and k represent the indexes of the document, the word, and the concept, respectively. PLSA constructs a Bayesian network, and the parameters are estimated by using the expectation-maximization (EM) algorithm [89].
3) Dual-PLSA: The dual-PLSA is an extension of PLSA. This approach assumes that there are two latent variables z^d and z^w associating the documents and the words. Specifically, the variables z^d and z^w reflect the concepts behind the documents and the words, respectively. The diagram of the dual-PLSA is provided as follows:

$$d \xleftarrow{\,P(d_i\mid z^d_{k_1})\,} z^d \overset{P(z^d_{k_1},\,z^w_{k_2})}{\longleftrightarrow} z^w \xrightarrow{\,P(w_j\mid z^w_{k_2})\,} w.$$

The parameters of the dual-PLSA can also be obtained based on the EM algorithm.

Some concept-based transfer learning approaches are established based on PLSA. For example, the article by Xue et al. [90] proposes a cross-domain text classification approach termed topic-bridged PLSA (TPLSA). TPLSA, which is an extension of PLSA, assumes that the source-domain and the target-domain instances share the same mixing concepts of the words. Instead of performing two PLSAs for the source domain and the target domain separately, the authors merge those two PLSAs as an integrated one by using the mixing concept z as a bridge, that is, each concept has some probabilities to produce the source-domain and the target-domain documents.

dual-PLSA. Its diagram is shown as follows:

$$\mathcal{D} \xrightarrow{\,P(\mathcal{D}_{k_0})\,} \{d, w\},\qquad d \xleftarrow{\,P(d_i\mid z^d_{k_1},\,\mathcal{D}_{k_0})\,} z^d \overset{P(z^d_{k_1},\,z^w_{k_2})}{\longleftrightarrow} z^w \xrightarrow{\,P(w_j\mid z^w_{k_2},\,\mathcal{D}_{k_0})\,} w$$

where 1 ≤ k_0 ≤ m^S + m^T denotes the domain index. The domain D connects both the variables d and w but is independent of the variables z^d and z^w. The label information of the source-domain instances is utilized by initializing the value P(d_i | z^d_{k_1}, D_{k_0}) (k_0 = 1, ..., m^S). Due to the lack of the target-domain label information, the value P(d_i | z^d_{k_1}, D_{k_0}) (k_0 = m^S + 1, ..., m^S + m^T) can be initialized based on any supervised classifier. Similarly, the authors adopt the EM algorithm to find the parameters. Through the iterations, all the parameters in the Bayesian network are obtained. Thus, the class label of the ith document in a target domain (denoted by D_k) can be predicted by computing the posterior probabilities, i.e., arg max_{z^d} P(z^d | d_i, D_k).

Zhuang et al. [93] further proposed a general framework that is termed the homogeneous-identical-distinct-concept (HIDC) model. This framework is also an extension of the dual-PLSA. HIDC is composed of three generative models, that is, identical-concept, homogeneous-concept, and distinct-concept models. These three graphical models are presented as follows:

Identical-concept model: D → d ← z^d → z^w_IC → w
Homogeneous-concept model: D → d ← z^d → z^w_HC → w
Distinct-concept model: D → d ← z^d → z^w_DC → w.
diagram of TPLSA is provided as follows:
The original word concept z w is divided into three types,
w w w
dS P (dS |zk ) that is, zIC , zHC , and zDC . In the identical-concept model,
  P (zk |wj )
 T ====T==== z ←−−−−−− w. the word distributions only rely on the word concepts,
i

d P (di |zk ) and the word concepts are independent of the domains.
However, in the homogeneous-concept model, the word
Note that PLSA does not require the label information. distributions also depend on the domains. The differ-
In order to exploit the label information, the authors ence between the identical and the homogeneous con-
w w
add the concept constraints, which include must-link and cepts is that zIC is directly transferable, while zHC is the
cannot-link constraints, as the penalty terms in the objec- domain-specific transferable one that may have different
tive function of TPLSA. Finally, the objective function is effects on the word distributions for different domains.
w
iteratively optimized to obtain the classification results In the distinct-concept model, zDC is actually the nontrans-
[i.e., arg maxz P (z|dTi )] by using the EM algorithm. ferable domain-specific one, which may only appear in a
The work by Zhuang et al. [91], [92] proposes specific domain. The above-mentioned three models are
an approach termed collaborative dual-PLSA (CD-PLSA) combined as an integrated one, that is, HIDC. Similar to
for multidomain text classification (mS source domains other PLSA-related algorithms, HIDC also uses EM algo-
and mT target domains). CD-PLSA is an extension of rithm to get the parameters.

54 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 1, January 2021

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

5) Feature Selection: Feature selection is another kind of it. The newly added autoencoder is then trained by using
operation for feature reduction, which is used to extract the encoded output of the upper-level autoencoder as its
the pivot features. The pivot features are the ones that input. In this way, deep learning architectures can thus be
behave in the same way in different domains. Due to the constructed.
stability of these features, they can be used as the bridge Some transfer learning approaches are developed
to transfer the knowledge. For example, Blitzer et al. [94] based on autoencoders. For example, the article by
proposed an approach termed structural correspondence Glorot et al. [96] proposes an approach termed stacked
learning (SCL). Briefly, SCL consists of the following steps denoising autoencoder (SDA). The denoising autoencoder,
to construct a new feature representation. which can enhance the robustness, is an extension of
1) Feature selection: SCL first performs feature selection the basic one [97]. This kind of autoencoder contains a
operations to obtain the pivot features. randomly corrupting mechanism that adds noise to the
2) Mapping learning: The pivot features are utilized to input before mapping. For example, an input can be cor-
find a low-dimensional common latent feature space rupted or partially destroyed by adding a masking noise or
by using the structural learning technique [95]. Gaussian noise. The denoising autoencoder is then trained
3) Feature stacking: A new feature representation is con- to minimize the denoising reconstruction error between
structed by feature augmentation, that is, stacking the the original clean input and the output. The SDA algorithm
original features with the obtained low-dimensional proposed in this article mainly encompasses the following
features. steps.

Take the part-of-speech tagging problem as an example. 1) Autoencoder training: The source-domain and
The selected pivot features should occur frequently in target-domain instances are used to train a stack of
source and target domains. Therefore, determiners can be denoising autoencoders in a greedy layer-by-layer
included in pivot features. Once all the pivot features are way.
defined and selected, a number of binary linear classifiers 2) Feature encoding and stacking: A new feature rep-
whose function is to predict the occurrence of each pivot resentation is constructed by stacking the encoding
feature are constructed. Without the loss of generality, output of intermediate layers, and the features of
the decision function of the ith classifier, which is used the instances are transformed into the obtained new
to predict the ith pivot feature, can be formulated as representation.
fi (x) = sign(θi · x), where x is assumed to be a binary 3) Learner training: The target classifier is trained on the
feature input. And the ith classifier is trained on all the transformed labeled instances.
instances excluding the features derived from the ith pivot Although the SDA algorithm has excellent performance
feature. The following formula can be used to estimate the for feature extraction, it still has some drawbacks such
ith classifier’s parameters, that is as high computational and parameter-estimation cost.
In order to shorten the training time and to speed up
1
n
traditional SDA algorithms, Chen et al. [98], [99] proposed
θi = arg min L (θ · xj , Rowi (xj )) + λθ2
θ n j=1 a modified version of SDA, that is, marginalized stacked
linear denoising autoencoder (mSLDA). This algorithm
adopts linear autoencoders and marginalizes the randomly
where Rowi (xj ) denotes the true value of the unlabeled
corrupting step in a closed form. It may seem that linear
instance xj in terms of the ith pivot feature. By stacking the
autoencoders are too simple to learn complex features.
obtained parameter vectors as column elements, a matrix
However, the authors observe that linear autoencoders are
W̃ is obtained. Next, based on SVD, the top-k left singular
often sufficient to achieve competent performance when
vectors, which are the principal components of the matrix
encountering high-dimensional data. The basic architec-
W̃ , are taken to construct the transformation matrix W .
ture of mSLDA is a single-layer linear autoencoder. The
At last, the final classifier is trained on the labeled instances
corresponding single-layer mapping matrix W (augmented
in an augmented feature space, that is, ([xL i ; W xi ] , yi ).
T L T L
with a bias column for convenience) should minimize the
6) Feature Encoding: In addition to feature extraction expected squared reconstruction loss function, that is,
and selection, feature encoding is also an effective tool.
1  
n
For example, autoencoders, which are often adopted in
W = arg min EP (x̃i |x) ||xi − W x̃i ||2
deep learning area, can be used for feature encoding. W 2n i=1
An autoencoder consists of an encoder and a decoder. The
encoder tries to produce a more abstract representation
where x̃i denotes the corrupted version of the input xi .
of the input, while the decoder aims to map back that
The solution of W is given by [98], [99]
representation and to minimize the reconstruction error.
Autoencoders can be stacked to build a deep learning  n  −1
 
n  
architecture. Once an autoencoder completes the training W = T
xi E[x̃i ] T
E x̃i x̃i .
process, another autoencoder can be stacked at the top of i=1 i=1

Vol. 109, No. 1, January 2021 | P ROCEEDINGS OF THE IEEE 55

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

When the corruption strategy is determined, the above In light of SA, a number of transfer learning
formulas can be further expanded and simplified into a approaches are established. For example, the article by
specific form. Note that, in order to insert nonlinearity, Sun and Saenko [101] proposes an approach that aligns
a nonlinear function is used to squash the output of each both the subspace bases and the distributions, which is
autoencoder after we obtain the matrix W in a closed termed as subspace distribution alignment between two
form. Then, the next linear autoencoder is stacked to the subspaces (SDA-TS). In SDA-TS, the transformation matrix
current one in a similar way to SDA. In order to deal with W is formulated as W = MST MT Q, where Q is a matrix
high-dimensional data, the authors also put forward an used to align the distribution difference. The transforma-
extension approach to further reduce the computational tion matrix W in SA is a special case of the one in SDA-TS
complexity. by setting Q to an identity matrix. Note that SA is a
symmetrical feature-based approach, while SDA-TS is an
7) Feature Alignment: Note that feature augmentation
asymmetrical one. In SDA-TS, the labeled source-domain
and feature reduction mainly focus on the explicit fea-
instances are projected to the source subspace, then
tures in a feature space. In contrast, in addition to the
mapped to the target subspace, and finally mapped back
explicit features, feature alignment also focuses on some
to the target domain. The transformed source-domain
implicit features such as the statistic features and the
instances are formulated as X S MS W MTT .
spectral features. Therefore, feature alignment can play
Another representative subspace feature alignment
various roles in the feature transformation process. For
approach is geodesic flow kernel (GFK), which is proposed
example, the explicit features can be aligned to generate
by Gong et al. [102]. GFK is closely related to a previous
a new feature representation, or the implicit features can
approach termed geodesic flow subspaces (GFSs) [103].
be aligned to construct a satisfied feature transformation.
Before introducing GFK, let us review the steps of GFS at
There are several kinds of features that can be aligned,
first. GFS is inspired by incremental learning. Intuitively,
which includes subspace features, spectral features, and
utilizing the information conveyed by the potential path
statistic features. Take the subspace feature alignment as
between two domains may be beneficial to the domain
an example. A typical approach mainly has the following
adaptation. GFS generally takes the following steps to align
steps.
features.
1) Subspace generation: In this step, the instances are
used to generate the respective subspaces for the 1) Subspace generation: GFS first generates two
source and the target domains. The orthonormal subspaces of the source and the target domains by
bases of the source- and the target-domain subspaces performing PCA, respectively.
are then obtained, which are denoted by MS and MT , 2) Subspace interpolation: The two obtained subspaces
respectively. These bases are used to learn the shift can be viewed as two points on the Grassmann
between the subspaces. manifold [104]. A finite number of the interpo-
2) Subspace alignment (SA): In the second step, a map- lated subspaces are generated between these two
ping, which aligns the bases MS and MT of the sub- subspaces based on the geometric properties of the
spaces, is learned. And the features of the instances manifold.
are projected to the aligned subspaces to generate 3) Feature projection and stacking: The original features
new feature representation. are transformed by stacking the corresponding projec-
3) Learner training: Finally, the target learner is trained tions from all the obtained subspaces.
on the transformed instances. Despite the usefulness and superiority of GFS, there is
For example, the work by Fernando et al. [100] proposes a problem about how to determine the number of the
an approach termed SA. In SA, the subspaces are generated interpolated subspaces. GFK resolves this problem by inte-
by performing PCA; the bases MS and MT are obtained by grating infinite number of the subspaces located on the
selecting the leading eigenvectors. Then, a transformation geodesic curve from the source subspace to the target one.
matrix W is learned to align the subspaces, which is given The key of GFK is to construct an infinite-dimensional fea-
by [100] ture space that incorporating the information of all the sub-
spaces lying on the geodesic flow. In order to compute the
inner product in the resultant infinite-dimensional space,
W = arg min ||MS W − MT ||2F = MST MT
W the GFK is defined and derived. In addition, a subspace-
disagreement measure is proposed to select the optimal
where || · ||F denotes the Frobenius norm. Note that the dimensionality of the subspaces; a rank-of-domain metric
matrix W aligns MS with MT , or say, transforms the is also proposed to select the optimal source domain when
source subspace coordinate system into the target subspace multisource domains are available.
coordinate system. The transformed low-dimensional Statistic feature alignment is another kind of feature
source-domain and target-domain instances are given by alignment. For example, Sun et al. [105] proposed an
X S MS W and X T MT , respectively. Finally, a learner can approach termed co-relation alignment (CORAL). CORAL
be trained on the resultant transformed instances. constructs the transformation matrix of the source features

56 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 1, January 2021

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

by aligning the second-order statistic features, that is, There are some other spectral transfer learning
the covariance matrices. The transformation matrix W is approaches. For example, the work by Ling et al. [110]
given by [105] proposes an approach termed cross-domain spectral clas-
sifier (CDSC). The general ideas and steps of this approach
W = arg min ||W T CS W − CT ||2F are presented as follows.
W
1) Similarity matrix construction: In the first step, two
similarity matrices are constructed corresponding to
where C denotes the covariance matrix. Note that, com- the whole instances and the target-domain instances,
pared to the above subspace-based approaches, CORAL respectively.
avoids subspace generation as well as projection and is 2) SFA: An objective function is designed with respect to
very easy to implement. a graph-partition indicator vector; a constraint matrix
Some transfer learning approaches are established is constructed, which contains pair-wise must-link
based on spectral feature alignment (SFA). In traditional information. Instead of seeking the discrete solution
machine learning area, spectral clustering is a clustering of the indicator vector, the solution is relaxed to
technique based on graph theory. The key of this technique be continuous, and the eigen-system problem corre-
is to utilize the spectrum, that is, eigenvalues, of the sponding to the objective function is solved to con-
similarity matrix to reduce the dimension of the features struct the aligned spectral features [111].
before clustering. The similarity matrix is constructed to 3) Learner training: A traditional classifier is trained on
quantitatively assess the relative similarity of each pair the transformed instances.
of data/vertices. On the basis of spectral clustering and
To be more specific, the objective function has a form of
feature alignment, SFA is proposed by Pan et al. [106].
the generalized Rayleigh quotient, which aims to find the
SFA is an algorithm for sentiment classification. This
optimal graph partition that respects the label information
algorithm tries to identify the domain-specific words and
with small cut-size [112], to maximize the separation of
domain-independent words in different domains, and then
the target-domain instances, and to fit the constraints of
aligns these domain-specific word features to construct
the pair-wise property. After eigen-decomposition, the last
a low-dimensional feature representation. SFA generally
eigenvectors are selected and combined as a matrix, and
contains the following five steps.
then the matrix is normalized. Each row of the normalized
matrix represents a transformed instance.
1) Feature selection: In this step, feature selection
operations are performed to select the domain-
independent/pivot features. This article presents V. M O D E L - B A S E D I N T E R P R E T A T I O N
Transfer learning approaches can also be interpreted from
three strategies to select domain-independent fea-
tures. These strategies are based on the occurrence the model perspective. Fig. 4 shows the corresponding
frequency of words, the mutual information between strategies and the objectives. The main objective of a
features and labels [107], and the mutual information transfer learning model is to make accurate prediction
between features and domains, respectively. results on the target domain, for example, classification
or clustering results. Note that a transfer learning model
2) Similarity matrix construction: Once the domain-
specific and the domain-independent features are may consist of a few sub-modules such as classifiers,
identified, a bipartite graph is constructed. Each edge extractors, or encoders. These submodules may play differ-
ent roles, for example, feature adaptation or pseudo-label
of this bipartite graph is assigned with a weight
that measures the co-occurrence relationship between generation. In this section, some related transfer learning
a domain-specific word and a domain-independent approaches are introduced in proper order according to the
word. Based on the bipartite graph, a similarity matrix strategies shown in Fig. 4.
is then constructed.
3) SFA: In this step, a spectral clustering algorithm is
A. Model Control Strategy
adapted and performed to align domain-specific fea-
tures [108], [109]. Specifically, based on the eigen- From the perspective of model, a natural thought is
vectors of the graph Laplacian, a feature alignment to directly add the model-level that regularizers to the
mapping is constructed, and the domain-specific fea- learner’s objective function. In this way, the knowledge
tures are mapped into a low-dimensional feature contained in the preobtained source models can be trans-
space. ferred into the target model during the training process.
4) Feature stacking: The original features and the low- For example, the article by Duan et al. [113], [114]
dimensional features are stacked to produce the final proposes a general framework termed domain adaptation
feature representation. machine (DAM), which is designed for multisource transfer
5) Learner training: The target learner is trained learning. The goal of DAM is to construct a robust classifier
on the labeled instances with the final feature for the target domain with the help of some preobtained
representation. base classifiers that are, respectively, trained on multiple

Vol. 109, No. 1, January 2021 | P ROCEEDINGS OF THE IEEE 57

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

Fig. 4. Strategies and objectives of the transfer learning approaches from the model perspective.

source domains. The objective function is given by and the last term is the consensus regularizer in the
form of cross-entropy. The consensus regularizer can
min LT,L (f T ) + λ1 ΧD (f T ) + λ2 Χ(f T ) not only enhance the agreement of all the classifiers
fT but also reduce the uncertainty of the predictions
on the target domain. The authors implement this
where the first term represents the loss function used framework based on the logistic regression. A differ-
to minimize the classification error of the labeled ence between DAM and CRF is that DAM explicitly
target-domain instances, the second term denotes different constructs the target classifier, while CRF makes the
regularizers, and the third term is used to control the target predictions based on the reached consensus
complexity of the final decision function f T . Different from the source classifiers.
types of the loss functions can be adopted in LT,L (f T ) such 2) Domain-dependent regularizer: Fast-DAM is a specific
as the square error or the cross-entropy loss. Some transfer algorithm of DAM [113]. In light of the manifold
learning approaches can be regarded as the special cases assumption [30] and the graph-based regularizer
of this framework to some extent. [117], [118], fast-DAM designs a domain-dependent
regularizer. The objective function is given by
1) Consensus regularizer: The work by Luo et al. [115]
proposes a framework termed consensus regular-

nT ,L   2
ization framework (CRF) [116]. CRF is designed min f T xT,L
j − yjT,L + λ2 Χ(f T )
fT
for multisource transfer learning with no labeled j=1

target-domain instances. The framework constructs      2


S
m nT ,U

mS classifiers corresponding to each source domain, + λ1 βk f T xT,U


i − fkS xT,U
i
and these classifiers are required to reach mutual k=1 i=1

consensuses on the target domain. The objective func-


tion of each source classifier, denoted by fkS (with where fkS (k = 1, 2, . . . , mS ) denotes the preob-
k = 1, . . . , mS ), is similar to that of DAM, which is tained source decision function for the kth source
presented as follows: domain and βk represents the weighting parameter
that is determined by the relevance between the
    target domain and the kth source domain and can be

nSk
min − log P yiSk |xS
i ; fk
k S
+ λ2 Χ fkS measured based on the MMD metric. The third term
S
fk
i=1
  is the domain-dependent regularizer, which trans-
  1   
nT ,U mS fers the knowledge contained in the source classifier
+ λ1 S  P y j |x T,U
i ; fk
S 
0 motivated by domain dependence. Duan et al. [113]
i=1 yj ∈Y
mS k =1
0 also introduced and added a new term to the above
objective function based on ε-insensitive loss function
where fkS denotes the decision function correspond- [119], which makes the resultant model have high
ing to the kth source domain, and S(x) = −x log x. computational efficiency.
The first term is used to quantify the classification 3) Domain-dependent regularizer + universum
error of the kth classifier on the kth source domain, regularizer: Univer-DAM is an extension of the

58 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 1, January 2021

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

fast-DAM [114]. Its objective function contains an the document classes and the concepts conveyed by
additional regularizer, that is, Universum regularizer. the word clusters through matrix tri-factorization. These
This regularizer usually utilizes an additional data set connections are considered to be the stable knowledge
termed Universum where the instances do not belong that is supposed to be transferred. The main idea is to
to either the positive or the negative class [120]. decompose a document-to-word matrix into three matri-
The authors treat the source-domain instances as the ces, that is, document-to-cluster, connection, and cluster-
Universum for the target domain, and the objective to-word matrices. Specifically, by performing the matrix
function of Univer-DAM is presented as follows: tri-factorization operations on the source and the target
document-to-word matrices, respectively, a joint optimiza-

nT ,L   2 n 
  2 tion problem is constructed, which is given by
S

min f T
xT,L
j − yjT,L + λ2 f T xS
j
fT
min X S − QS RW S 2 + λ1 X T − QT RW T 2
j=1 j=1


mS 
nT ,U    2 Q,R,W

+ λ1 βk f T xT,U
i − fkS xT,U
i + λ2 QS − Q̆S 2
k=1 i=1
s.t. normalization constraints
+ λ3 Χ(f T ).

where X denotes the document-to-word matrix, Q denotes


Similar to fast-DAM, the ε-insensitive loss function
the document-to-cluster matrix, R represents the transfor-
can also be utilized [114].
mation matrix from document clusters to word clusters,
W denotes the cluster-to-word matrix, nd denotes the
B. Parameter Control Strategy number of the documents, and Q̆S represents the label
matrix. The matrix Q̆S is constructed based on the class
The parameter control strategy focuses on the
information of the source-domain documents. If the ith
parameters of models. For example, in the application of
document belongs to the kth class, Q̆S[i,k] = 1. In the above
object categorization, the knowledge from known source
objective function, the matrix R is actually the shared
categories can be transferred into target categories via
parameter. The first term aims to tri-factorize the source
object attributes such as shape and color [121]. The
document-to-word matrix, and the second term decom-
attribute priors, that is, probabilistic distribution parame-
poses the target document-to-word matrix. The last term
ters of the image features corresponding to each attribute,
incorporates the source-domain label information. The
can be learned from the source domain and then used to
optimization problem is solved based on the alternating
facilitate learning the target classifier. The parameters of a
iteration method. Once the solution of QT is obtained,
model actually reflect the knowledge learned by the model.
the class index of the kth target-domain instance is the one
Therefore, it is possible to transfer the knowledge at the
with the maximum value in the kth row of QT .
parametric level.
Furthermore, Zhuang et al. [123] extended MTrick and
1) Parameter Sharing: An intuitive way of controlling proposed an approach termed Triplex Transfer Learn-
the parameters is to directly share the parameters of ing (TriTL). MTrick assumes that the domains share the
the source learner to the target learner. Parameter shar- similar concepts behind their word clusters. In contrast,
ing is widely employed especially in the network-based TriTL assumes that the concepts of these domains can
approaches. For example, if we have a neural network for be further divided into three types, that is, domain-
the source task, we can freeze (or say, share) most of its independent, transferable domain-specific, and nontrans-
layers and only finetune the last few layers to produce a ferable domain-specific concepts, which is similar to
target network. The network-based approaches are intro- HIDC. This idea is motivated by dual transfer learning
duced in Section V-D. (DTL), where the concepts are assumed to be composed
In addition to network-based parameter sharing, matrix- of the domain-independent ones and the transferable
factorization-based parameter sharing is also workable. domain-specific ones [124]. The objective function of TriTL
For example, Zhuang et al. [122] proposed an approach is provided as follows:
for text classification, which is referred to as matrix tri-
  
factorization-based classification framework (MTrick). The  DI 2
authors observe that, in different domains, different words  
mS +mT

  W



min Xk − Qk RDI RTD RkND  WkTD 
or phrases sometimes express the same or similar connota- Q,R,W  
k=1  WkND 
tive meaning. Thus, it is more effective to use the concepts
behind the words rather than the words themselves as s.t. normalization constraints
a bridge to transfer the knowledge in source domains.
Different from PLSA-based transfer learning approaches where the definitions of the symbols are similar to those
that utilize the concepts by constructing Bayesian net- of MTrick and the subscript k denotes the index of the
works, MTrick attempts to find the connections between domains with the assumption that the first mS domains

Vol. 109, No. 1, January 2021 | P ROCEEDINGS OF THE IEEE 59

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

are the source domains and the last mT domains are cross-validation experiment. Motivated by Cawley’s work,
the target domains. The authors proposed an iterative the generalization error can be easily estimated to guide
algorithm to solve the optimization problem. And in the the parameter setting in SMKL.
initialization phase, W DI and WkTD are initialized based on Tommasi et al. [130] further extended SMKL by uti-
the clustering results of the PLSA algorithm, while WkUT is lizing all the preobtained decision functions. In [130],
randomly initialized; the PLSA algorithm is performed on an approach that is referred to as multimodel knowledge
the combination of the instances from all the domains. transfer (MMKL) is proposed. Its objective function is
There are some other approaches developed based presented as follows:
on matrix factorization. Wang et al. [125] proposed
 2
a transfer learning framework for image classification. 
1  k  λ    T,L 
n T ,L
2
min θ −

Wang et al. [126] proposed a softly associative approach βi θi  + ηj f xj − yjT,L
f 2   2 j=1
that integrates two matrix tri-factorizations into a joint i=1

framework. Do et al. [127] utilized matrix tri-factorization


to discover both the implicit and the explicit similarities for where θi and βi are the model parameter and the
cross-domain recommendation. weighting parameter of the ith preobtained decision func-
tion, respectively. The leave-one-out error can also be
2) Parameter Restriction: Another parameter-control- obtained in a closed form, and the optimal value of βi
type strategy is to restrict the parameters. Different from (i = 1, 2, . . . , k) is the one that maximizes the generaliza-
the parameter sharing strategy that enforces the models tion performance.
share some parameters, parameter restriction strategy only
requires the parameters of the source and the target mod-
C. Model Ensemble Strategy
els to be similar.
Take the approaches to category learning as examples. In sentiment analysis applications related to product
The category-learning problem is to learn a new decision reviews, data or models from multiple product domains
function for predicting a new category (denoted by the are available and can be used as the source domains [131].
(k + 1)th category) with only limited target-domain Combining data or models directly into a single domain
instances and k preobtained binary decision functions. may not be successful because the distributions of these
The function of these preobtained decision functions domains are different from each other. Model ensem-
is to predict which of the k categories an instance ble is another commonly used strategy. This strategy
belongs to. In order to solve the category-learning prob- aims to combine a number of weak classifiers to make
lem, Tommasi and Caputo [128] proposed an approach the final predictions. Some previously mentioned transfer
termed single-model knowledge transfer (SMKL). SMKL is learning approaches already adopted this strategy. For
based on least-squares SVM (LSS-SVM). The advantage of example, TrAdaBoost and MsTrAdaBoost ensemble the
LS-SVM is that LS-SVM transforms inequality constraints to weak classifiers via voting and weighting, respectively.
equality constraints and has high computational efficiency; In this section, several typical ensemble-based transfer
its optimization is equivalent to solving a linear equation learning approaches are introduced to help readers bet-
system problem instead of a quadratic programming prob- ter understand the function and the appliance of this
lem. SMKL selects one of the preobtained binary decision strategy.
functions and transfers the knowledge contained in its As mentioned in Section IV-A, TaskTrAdaBoost, which
parameters. The objective function is given by is an extension of TrAdaBoost for handling multisource
scenarios, is proposed in the article [33]. TaskTrAdaBoost
 2 λ n   mainly has the following two stages.
1  
T ,L
 2
min θ − β θ̃  + ηj f xT,L − yjT,L 1) Candidate classifier construction: In the first stage,
f 2 2 j=1 j
a group of candidate classifiers are constructed by
performing AdaBoost on each source domain. Note
where f (x) = θ · Φ(x) + b, β is the weighting parameter that, for each source domain, each iteration of
controlling the transfer degree, θ̃ is the parameter of AdaBoost results in a new weak classifier. In order to
a selected preobtained model, and ηj is the coefficient avoid the over-fitting problem, the authors introduced
for resolving the label imbalance problem. The kernel a threshold to pick the suitable classifiers into the
parameter and the tradeoff parameter are chosen based candidate group.
on cross-validation. In order to find the optimal weight- 2) Classifier selection and ensemble: In the second stage,
ing parameter, the authors refer to an earlier work a revised version of AdaBoost is performed on the
[129]. Cawley [129] proposed a model selection mech- target-domain instances to construct the final clas-
anism for LS-SVM, which is based on the leave-one-out sifier. In each iteration, an optimal candidate classi-
cross-validation method. The superiority of this method fier which has the lowest classification error on the
is that the leave-one-out error for each instance can be labeled target-domain instances is picked out and
obtained in a closed form without performing the real assigned with a weight based on the classification

60 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 1, January 2021

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

error. Then, the weight of each target-domain the reliable neighbors are the ones whose label predictions
instance is updated based on the performance of the are made by the combined classifier.
selected classifier on the target domain. After the iter- The above-mentioned TaskTrAdaBoost and LWE
ation process, the selected classifiers are ensembled approaches mainly focus on the ensemble process.
to produce the final predictions. In contrast, some studies focus more on the construction
The difference between the original AdaBoost and the sec- of weak learners. For example, ensemble framework of
ond stage of TaskTrAdaBoost is that, in each iteration, anchor adapters (ENCHOR) is a weighting ensemble
the former constructs a new candidate classifier on the framework proposed by Zhuang et al. [133]. An anchor
weighted target-domain instances, while the latter selects is a specific instance. Different from TrAdaBoost which
one preobtained candidate classifier which has the min- adjusts weights of instances to train and produce a
imal classification error on the weighted target-domain new learner iteratively, ENCHOR constructs a group
instances. of weak learners via using different representations of
The article by Gao et al. [132] proposes another the instances produced by anchors. The thought is that
ensemble-based framework that is referred to as locally the higher similarity between a certain instance and an
weighted ensemble (LWE). LWE focuses on the ensem- anchor, the more likely the feature of that instance remains
ble process of various learners; these learners could be unchanged relative to the anchor, where the similarity
constructed on different source domains, or be built by can be measured by using the cosine or Gaussian distance
performing different learning algorithms on a single source function. ENCHOR contains the following steps.
domain. Different from TaskTrAdaBoost that learns the 1) Anchor selection: In this step, a group of anchors are
global weight of each learner, the authors adopted the selected. These anchors can be selected based on
local-weight strategy, that is, assigning adaptive weights some rules or even randomly. In order to improve the
to the learners based on the local manifold structure of final performance of ENCHOR, Zhuang et al. [133]
the target-domain test set. In LWE, a learner is usually proposed a method of selecting high-quality anchors.
assigned with different weights when classifying different 2) Anchor-based representation generation: For each
target-domain instances. Specifically, the authors adopt a anchor and each instance, the feature vector of an
graph-based approach to estimate the weights. The steps instance is directly multiplied by a coefficient that
for weighting are outlined as follows. measures the distance from the instance to the anchor.
In this way, each anchor produces a new pair of
1) Graph construction: For the ith source learner, a graph
anchor-adapted source and target instance sets.
GTSi is constructed by using the learner to classify
3) Learner training and ensemble: The obtained pairs
the target-domain instances in the test set; if two
of instance sets can be, respectively, used to train
instances are classified into the same class, they are
learners. Then, the resultant learners are weighted
connected in the graph. Another graph GT is con-
and combined to make the final predictions.
structed for the target-domain instances as well by
The framework ENCHOR is easy to be realized in a parallel
performing a clustering algorithm.
manner in that the operations performed on each anchor
2) Learner weighting: The weight of the ith learner for
are independent.
the j th target-domain instance xTj is proportional to
the similarity between the instance’s local structures
in GTSi and GT . And the similarity can be measured
D. Deep Learning Technique
by the percentage of the common neighbors of xTj in Deep learning methods are particularly popular in the
these two graphs. field of machine learning. Many researchers utilize the
deep learning techniques to construct transfer learning
Note that this weighting scheme is based on the
models. For example, the SDA and the mSLDA approaches
clustering-manifold assumption, that is, if two instances
mentioned in Section IV-B6 utilize the deep learning
are close to each other in a high-density region, they often
techniques. In this section, we specifically discuss the deep-
have similar labels. In order to check the validity of this
learning-related transfer learning models. The deep learn-
assumption for the task, the target task is tested on the
ing approaches introduced are divided into two types, that
source-domain training set(s). Specifically, the clustering
is, nonadversarial (or say, traditional) ones and adversarial
quality of the training set(s) is quantified and checked by
ones.
using a metric such as purity or entropy. If the clustering
quality is not satisfactory, uniform weights are assigned 1) Traditional Deep Learning: As said earlier,
to the learners instead. Besides, it is intuitive that if the autoencoders are often used in deep learning area.
measured structure similarity is particularly low for every In addition to SDA and mSLDA, there are some other
learner, weighting and combining these learners seems reconstruction-based transfer learning approaches.
unwise. Therefore, the authors introduce a threshold and For example, the article by Zhuang et al. [44], [134]
compare it to the average similarity. If the similarity is proposes an approach termed transfer learning with deep
lower than the threshold, the label of xTj is determined autoencoders (TLDAs). TLDA adopts two autoencoders
by the voting scheme among its reliable neighbors, where for the source and the target domains, respectively.

Vol. 109, No. 1, January 2021 | P ROCEEDINGS OF THE IEEE 61

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

These two autoencoders share the same parameters. For better understanding, let us review DAN in detail.
The encoder and the decoder both have two layers with DAN is based on AlexNet [138] and its architecture is
activation functions. The diagram of the two autoencoders presented as follows [137]:
is presented as follows:
full full full
 
−−→ R6S −−→ R7S −−→ R8S f (X S )
(W1 ,b1 ) (W2 ,b2 ) (Ŵ2 ,b̂2 ) (Ŵ1 ,b̂1 ) 6th 7th 8th
X S −−−−−→QS −−−−−−−−−−−→RS −−−−−→Q̃S −−−−−→X̃ S
Softmax Regression X S conv QS S
1 conv Q5 ⇑ ⇑ ⇑
−−→ −−→ MK-MMD MK-MMD MK-MMD
⇑ X T 1st QT1 ··· QT5
⇓ ⇓ ⇓
KL Divergence   !
⇓ full full full
 
Five Convolutional Layers −−→ R6T −−→ R7T −−→ R8T f (X T )
T (W1 ,b1 ) (W2 ,b2 ) T (Ŵ2 ,b̂2 ) T (Ŵ1 ,b̂1 ) 6th 7th 8th
X −−−−−→QT −
−−−−−−−−−−→R −−−−−→Q̃ −−−−−→X̃ T .   !
Softmax Regression
Three Fully Connected Layers.

There are several objectives of TLDA, which are listed as


follows. In the above network, the features are first extracted
1) Reconstruction error minimization: The output of the by five convolutional layers in a general-to-specific man-
decoder should be extremely close to the input of ner. Next, the extracted features are fed into one of the
encoder. In other words, the distance between X S two fully connected networks switched by their original
and X̃ S as well as the distance between X T and X̃ T domains. These two networks consist of three fully con-
should be minimized. nected layers that are specialized for the source and the
2) Distribution adaptation: The distribution difference target domains. DAN has the following objectives.
between QS and QT should be minimized. 1) Classification error minimization: The classification
3) Regression error minimization: The output of the error of the labeled instances should be minimized.
encoder on the labeled source-domain instances, that The cross-entropy loss function is adopted to measure
is, RS , should be consistent with the corresponding the prediction error of the labeled instances.
label information Y S . 2) Distribution adaptation: Multiple layers, which
Therefore, the objective function of TLDA is given by include the representation layers and the output
layer, can be jointly adapted in a layer-wise manner.
Instead of using the single-kernel MMD to measure
min LREC (X, X̃) + λ1 KL(QS QT ) + λ2 Χ(W, b, Ŵ , b̂)
Θ the distribution difference, the authors turn to
+ λ3 LREG (RS , Y S ) MK-MMD. Gretton et al. [62] adopt the linear-time
unbiased estimation of MK-MMD to avoid numerous
where the first term represents the reconstruction error, inner product operations.
the second term KL(·) represents the KL divergence, 3) Kernel parameter optimization: The weighting para-
the third term controls the complexity, and the last term meters of the multiple kernels in MK-MMD should be
represents the regression error. TLDA is trained by using optimized to maximize the test power [62].
a gradient descent method. The final predictions can be The objective function of the DAN network is given by
made in two different ways. The first way is to directly
use the output of the encoder to make predictions. And      

nL 
8
the second way is to treat the autoencoder as a feature min max L f xL
i , yiL +λ MK-MMD RlS , RlT ; κ
Θ κ
extractor, and then train the target classifier on the labeled i=1 l=6
instances with the feature representation produced by the
encoder’s first-layer output. where l denotes the index of the layer. The above opti-
In addition to the reconstruction-based domain mization is actually a minimax optimization problem. The
adaptation, discrepancy-based domain adaptation is also a maximization of the objective function with respect to the
popular direction. In earlier research, the shallow neural kernel function κ aims to maximize the test power. After
networks are tried to learn the domain-independent this step, the subtle difference between the source and
feature representation [135]. It is found that the shallow the target domains are magnified. This train of thought is
architectures often make it difficult for the resultant similar to the generative adversarial network (GAN) [139].
models to achieve excellent performance. Therefore, In the training process, the DAN network is initialized
many studies turn to utilize deep neural networks. by a pretrained AlexNet [138]. There are two categories
Tzeng et al. [136] added a single adaptation layer and of parameters that should be learned, that is, the net-
a discrepancy loss to the deep neural network, which work parameters and the weighting parameters of the
improves the performance. Furthermore, Long et al. [137] multiple kernels. Given that the first three convolutional
performed multilayer adaptation and utilized multikernel layers output the general features and are transferable,
technique, and they proposed an architecture termed deep Yosinski et al. [140] freeze them and finetune the last two
adaptation networks (DANs). convolutional layers and the two fully connected layers.

62 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 1, January 2021

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

The last fully connected layer (or say, the classifier layer) as multiple feature spaces adaptation network (MFSAN).
is trained from scratch. The architecture of MFSAN consists of a common-feature
Long et al. [141] further extended the above DAN extractor, mS domain-specific feature extractors, and mS
approach and proposed the DAN framework. The new domain-specific classifiers. The corresponding schematic is
characteristics are summarized as follows. shown as follows:

1) Regularizer adding: The framework introduces an


X1S · · · XkS · · · XmS
S Common Q1 · · · Qk · · · QmS Domain-Specific
S S S
additional regularizer to minimize the uncertainty of T
−−−−−→ T
−−−−−−−−−→
X Extractor Q Extractors
the predicted labels of the unlabeled target-domain
S Domain-Specific Ŷ1 · · · Ŷk · · · ŶmS
S S S
R1S · · · RkS · · · Rm
S
instances, which is motivated by entropy minimiza- −−−−−−−−−→ T .
tion criterion [142]. R1T · · · RkT · · · Rm
T
S Classifiers Ŷ1 · · · ŶkT · · · ŶmT
S

2) Architecture generalizing: The DAN framework can


be applied to many other architectures such as In each iteration, MFSAN has the following steps.
GoogLeNet [143] and ResNet [144]. 1) Common feature extraction: For each source domain
3) Measurement generalizing: The distribution difference (denoted by DSk with k = 1, . . . , mS ), the source-
can be estimated by other metrics. For example, domain instances (denoted by XkS ) are separately
in addition to MK-MMD, Chwialkowski et al. [145] input to the common-feature extractor to produce
also present the mean embedding test for distribution instances in a common latent feature space (denoted
adaptation. by QSk ). Similar operations are also performed on
the target-domain instances (denoted by X T ), which
The objective function of the DAN framework is given by
produces QT .
2) Specific feature extraction: For each source domain,

nL     
lend   the extracted common features QSk is fed to the kth
min max L f xL
i , yiL + λ1 DIST RlS , RlT
Θ κ
i=1 l=lstrt
domain-specific feature extractor. Meanwhile, QT is
fed to all the domain-specific feature extractors,
 
nT ,U    
+ λ2 S P yj |f xT,U which results in RkT with k = 1, . . . , mS .
i
i=1 yj ∈Y 3) Data classification: The output of the kth
domain-specific feature extractor is input to the kth
where lstrt and lend denote the boundary indexes of the fully classifier. In this way, mS pairs of the classification
connected layers for adapting the distributions. results are predicted in the form of probability.
There are some other impressive works. For example, 4) Parameter updating: The parameters of the network
Long et al. [146] constructed residual transfer networks are updated to optimize the objective function.
for domain adaptation, which is motivated by deep resid- There are three objectives in MFSAN, that is, classifi-
ual learning. Besides, another work by Long et al. [147] cation error minimization, distribution adaptation, and
proposes the joint adaptation network (JAN), which consensus regularization. The objective function is given
adapts the joint distribution difference of multiple lay- by
ers. Sun and Saenko [148] extended CORAL for deep
domain adaptation and proposed an approach termed 
mS   
mS
 
Deep CORAL (DCORAL), in which the CORAL loss is added min L ŶiS , YiS + λ1 MMD RiS , RiT
Θ
i=1 i=1
to minimize the feature covariance. Chen et al. [149]
realized that the instances with the same label should be   T
mS 

close to each other in the feature space, and they not
+ λ2 Ŷi − ŶjT 
i=j
only add the CORAL loss but also add an instance-based
class-level discrepancy loss. Pan et al. [150] constructed
where the first term represents the classification error
three prototypical networks (corresponding to DS , DT , and
of the labeled source-domain instances, the second term
DS ∪ DT ) and incorporated the thought of multimodel
measures the distribution difference, and the third term
consensus. They also adopt pseudo-label strategy and
measures the discrepancy of the predictions on the
adapt both the instance-level and class-level discrepancy.
target-domain instances.
Kang et al. [151] proposed the contrastive adaptation net-
work (CAN), which is based on the discrepancy metric 2) Adversarial Deep Learning: The thought of adversarial
termed contrastive domain discrepancy. Zhu et al. [152] learning can be integrated into deep-learning-based trans-
aimed to adapt the extracted multiple feature represen- fer learning approaches. As mentioned above, in the DAN
tations and proposed the multirepresentation adaptation framework, the network Θ and the kernel κ play a minimax
network (MRAN). game, which reflects the thought of adversarial learning.
Deep learning technique can also be used for mul- However, the DAN framework is a little different from the
tisource transfer learning. For example, the work by traditional GAN-based methods in terms of the adversarial
Zhu et al. [153] proposes a framework that is referred to matching. In the DAN framework, there is only a few

Vol. 109, No. 1, January 2021 | P ROCEEDINGS OF THE IEEE 63

Authorized licensed use limited to: Pusan National University Library. Downloaded on February 26,2023 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
Zhuang et al.: Comprehensive Survey on Transfer Learning

parameters to be optimized in the max game, which makes There are some other related impressive works. The
the optimization easier to achieve equilibrium. Before work by Tzeng et al. [156] proposes a unified adver-
introducing the adversarial transfer learning approaches, sarial domain adaptation framework. The work by
let us briefly review the original GAN framework and the Shen et al. [59] adopts Wasserstein distance for domain
related work. adaptation. Hoffman et al. [157] adopted cycle-consistency
The original GAN [139], which is inspired by the loss to ensure the structural and semantic consistency.
two-player game, is composed of two models, a generator Long et al. [158] proposed the conditional domain
G and a discriminator D. The generator produces the adversarial network (CDAN), which utilizes a conditional
counterfeits of the true data for the purpose of confusing domain discriminator to assist adversarial adaptation.
the discriminator and making the discriminator produce Zhang et al. [159] adopted a symmetric design for the
wrong detection. The discriminator is fed with the mixture source and the target classifiers. Zhao et al. [160] uti-
of the true data and the counterfeits, and it aims to detect lized domain adversarial networks to solve the multisource
whether a data is the true one or the fake one. These two transfer learning problem. Yu et al. [161] proposed a
models actually play a two-player minimax game, and the dynamic adversarial adaptation network.
objective function is as follows: Some approaches are designed for some special
scenarios. Take the partial transfer learning as an exam-
min max Ex∼Ptrue [log D (x)] + Ez̃∼Pz̃ [log (1 − D (G (z̃)))] ple. The partial transfer learning approaches are designed
G D for the scenario that the target-domain classes are less
than the source-domain classes, that is, Y S ⊆ Y T .
where z̃ represents the noise instances (sampled from a In this case, the source-domain instances with different
certain noise distribution) used as the input of the gener- labels may have different importance for domain adap-
ator for producing the counterfeits. The entire GAN can tation. To be more specific, the source-domain and the
be trained by using the back-propagation algorithm. When target-domain instances with the same label are more
the two-player game achieves equilibrium, the generator likely to be potentially associated. However, since the
can produce almost true-looking instances. target-domain instances are unlabeled, how to identify
Motivated by GAN, many transfer learning approaches are established based on the assumption that a good feature representation contains almost no discriminative information about the instances' original domains. For example, the work by Ganin et al. [155] proposes a deep architecture termed domain-adversarial neural network (DANN) for domain adaptation [154]. DANN assumes that there is no labeled target-domain instance to work with. Its architecture consists of a feature extractor, a label predictor, and a domain classifier. The corresponding diagram is as follows:

    {X^{S,L}, X^{T,U}} --Feature Extractor--> {Q^{S,L}, Q^{T,U}} --+--Label Predictor----> {Ŷ^{S,L}, Ŷ^{T,U}}
                                                                   +--Domain Classifier--> {Ŝ, T̂} (Domain Label).
The feature extractor acts like the generator, which aims to produce the domain-independent feature representation for confusing the domain classifier. The domain classifier plays a role like the discriminator, which attempts to detect whether the extracted features come from the source domain or the target domain. Besides, the label predictor produces the label predictions of the instances; it is trained on the extracted features of the labeled source-domain instances, that is, Q^{S,L}. DANN can be trained by inserting a special gradient reversal layer (GRL). After the training of the whole system, the feature extractor learns the deep features of the instances, and the output Ŷ^{T,U} contains the predicted labels of the unlabeled target-domain instances.
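The GRL trick can be sketched as follows. This is a simplified, illustrative rendering of the DANN idea, assuming PyTorch; the layer sizes and the random source/target batches are placeholders, not the authors' implementation.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lam * grad_out, None   # reverse the gradient flowing into the extractor

    feature_extractor = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
    label_predictor   = nn.Linear(64, 2)       # trained on labeled source features Q^{S,L}
    domain_classifier = nn.Linear(64, 2)       # source vs. target
    params = (list(feature_extractor.parameters()) + list(label_predictor.parameters())
              + list(domain_classifier.parameters()))
    opt, ce = torch.optim.Adam(params, lr=1e-3), nn.CrossEntropyLoss()
    xs, ys = torch.randn(32, 100), torch.randint(0, 2, (32,))   # labeled source batch
    xt = torch.randn(32, 100)                                   # unlabeled target batch
    for step in range(100):
        qs, qt = feature_extractor(xs), feature_extractor(xt)
        cls_loss = ce(label_predictor(qs), ys)
        dom_in = GradReverse.apply(torch.cat([qs, qt]), 1.0)    # GRL before the domain classifier
        dom_lbl = torch.cat([torch.zeros(32), torch.ones(32)]).long()
        dom_loss = ce(domain_classifier(dom_in), dom_lbl)
        loss = cls_loss + dom_loss
        opt.zero_grad(); loss.backward(); opt.step()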

There are some other related impressive works. The work by Tzeng et al. [156] proposes a unified adversarial domain adaptation framework. The work by Shen et al. [59] adopts the Wasserstein distance for domain adaptation. Hoffman et al. [157] adopted a cycle-consistency loss to ensure the structural and semantic consistency. Long et al. [158] proposed the conditional domain adversarial network (CDAN), which utilizes a conditional domain discriminator to assist adversarial adaptation. Zhang et al. [159] adopted a symmetric design for the source and the target classifiers. Zhao et al. [160] utilized domain adversarial networks to solve the multisource transfer learning problem. Yu et al. [161] proposed a dynamic adversarial adaptation network.

Some approaches are designed for special scenarios. Take partial transfer learning as an example. The partial transfer learning approaches are designed for the scenario where the target-domain classes are fewer than the source-domain classes, that is, Y^T ⊆ Y^S. In this case, the source-domain instances with different labels may have different importance for domain adaptation. To be more specific, the source-domain and the target-domain instances with the same label are more likely to be potentially associated. However, since the target-domain instances are unlabeled, how to identify and partially transfer the important information from the labeled source-domain instances is a critical issue.

The article by Zhang et al. [162] proposes an approach for partial domain adaptation, which is called importance weighted adversarial nets-based domain adaptation (IWANDA). The architecture of IWANDA is different from that of DANN. DANN adopts one common feature extractor based on the assumption that there exists a common feature space where Q^{S,L} and Q^{T,U} have similar distributions. However, IWANDA uses two domain-specific feature extractors for the source and the target domains, respectively. Specifically, IWANDA consists of two feature extractors, two domain classifiers, and one label predictor. The diagram of IWANDA is presented as follows:

    X^{S,L} --Source Feature Extractor--> Q^{S,L} --+--> Label Predictor --------> {Ŷ^{S,L}, Ŷ^{T,U}}
    X^{T,U} --Target Feature Extractor--> Q^{T,U} --+--> 1st Domain Classifier --> {Ŝ_1, T̂_1} --Weight Function--> β^S
                                                    +--> [β^S Q^{S,L}; Q^{T,U} + Ŷ^{T,U}] --2nd Domain Classifier--> {Ŝ_2, T̂_2}.

Before training, the source feature extractor and the label predictor are pretrained on the labeled source-domain instances. These two components are frozen in the training process, which means that only the target feature extractor and the domain classifiers should be optimized. In each iteration, the above network is optimized by taking the following steps.
1) Instance weighting: In order to solve the partial transfer issue, the source-domain instances are assigned weights based on the output of the first domain classifier. The first domain classifier is fed with Q^{S,L} and Q^{T,U}, and then outputs the probabilistic predictions of their domains. If a source-domain instance is predicted with a high probability of belonging to the target domain, this instance is highly likely to be associated with the target domain. Thus, this instance is assigned a larger weight, and vice versa.
2) Prediction making: The label predictor outputs the label predictions of the instances. The second classifier predicts which domain an instance belongs to.
3) Parameter updating: The first classifier is optimized to minimize the domain classification error. The second classifier plays a minimax game with the target feature extractor. This classifier aims to detect whether an instance is an instance from the target domain or a weighted instance from the source domain, and to reduce the uncertainty of the label prediction Ŷ^{T,U}. The target feature extractor aims to confuse the second classifier. These components can be optimized in a similar way to GAN or by inserting a GRL.
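A rough sketch of such an instance-weighting step is given below. It assumes the first domain classifier outputs two-way logits on the source features, and the normalization scheme is illustrative rather than the exact IWANDA formulation.

    import torch

    def source_weights(domain_logits_src):
        # domain_logits_src: [n_s, 2] logits of the 1st domain classifier on Q^{S,L};
        # column 1 is assumed to correspond to the target domain.
        p_target = torch.softmax(domain_logits_src, dim=1)[:, 1]
        w = p_target / (p_target.mean() + 1e-8)   # more target-like sources get larger weights
        return w.detach()                         # weights are treated as constants, not back-propagated

    logits = torch.randn(16, 2)                   # placeholder classifier outputs
    print(source_weights(logits))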
In addition to IWANDA, the work by Cao et al. [163] constructs the selective adversarial network for partial transfer learning. There are some other related studies. For example, the work by Wang et al. [164] proposes a minimax-based approach to select high-quality source-domain data. Chen et al. [165] investigated the transferability and the discriminability in adversarial domain adaptation and proposed a spectral penalization approach to boost the existing adversarial transfer learning methods.

VI. APPLICATION
In previous sections, a number of representative transfer learning approaches are introduced, which have been applied to solving a variety of text-related/image-related problems in their original articles. For example, MTrick [122] and TriTL [123] utilize the matrix factorization technique to solve cross-domain text classification problems; the deep-learning-based approaches such as DAN [137], DCORAL [148], and DANN [154], [155] are applied to solving image classification problems. Instead of focusing on the general text-related or image-related applications, in this section, we mainly focus on the transfer learning applications in specific areas such as medicine, bioinformatics, transportation, and recommender systems.

A. Medical Application
Medical imaging plays an important role in the medical area and is a powerful tool for diagnosis. With the development of computer technology such as machine learning, computer-aided diagnosis has become a popular and promising direction. Note that medical images are generated by special medical equipment, and their labeling
often relies on experienced doctors. Therefore, in many cases, it is expensive and hard to collect sufficient training data. Transfer learning technology can be utilized for medical imaging analysis. A commonly used transfer learning approach is to pretrain a neural network on the source domain (e.g., ImageNet, which is an image database containing more than fourteen million annotated images with more than 20 000 categories [166]) and then finetune it based on the instances from the target domain.

For example, Maqsood et al. [167] finetuned the AlexNet [138] for the detection of Alzheimer's disease. Their approach has the following four steps. First, the MRI images from the target domain are preprocessed by performing contrast stretching operations. Second, the AlexNet architecture [138] is pretrained over ImageNet [166] (i.e., the source domain) as a starting point to learn the new task. Third, the convolutional layers of AlexNet are fixed, and the last three fully connected layers are replaced by the new ones including one softmax layer, one fully connected layer, and one output layer. Finally, the modified AlexNet is finetuned by training on the Alzheimer's data set [168] (i.e., the target domain). The experimental results show that the proposed approach achieves the highest accuracy for the multiclass classification problem (i.e., Alzheimer's stage detection).
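A hedged sketch of this pretrain-and-finetune recipe is given below, assuming a recent torchvision. For brevity only the final fully connected layer is swapped, the number of classes is an assumption, and the target data loader is a commented placeholder; this is not the authors' exact pipeline.

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)  # ImageNet-pretrained source model
    for p in model.features.parameters():        # freeze the convolutional layers
        p.requires_grad = False
    num_classes = 4                              # e.g., number of Alzheimer's stages (assumption)
    model.classifier[6] = nn.Linear(4096, num_classes)   # replace the output layer for the target task
    optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    # for images, labels in target_loader:       # target-domain (e.g., MRI) batches
    #     loss = criterion(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()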

Similarly, Shin et al. [169] finetuned the pretrained deep neural network for solving the computer-aided detection problems. Byra et al. [170] utilized the transfer learning technology to help assess knee osteoarthritis. In addition to imaging analysis, transfer learning has some other applications in the medical area. For example, the work by Tang et al. [171] combines the active learning and the domain adaptation technologies for the classification of various medical data. Zeng et al. [172] utilized transfer learning for automatically encoding ICD-9 codes that are used to describe a patient's diagnosis.

B. Bioinformatics Application
Biological sequence analysis is an important task in the bioinformatics area. Since the understanding of some organisms can be transferred to other organisms, transfer learning can be applied to facilitate the biological sequence analysis. The distribution difference problem exists significantly in this application. For example, the function of some biological substances may remain unchanged but with the composition changed between two organisms, which may result in the marginal distribution difference. Besides, if two organisms have a common ancestor but with a long evolutionary distance, the conditional distribution difference would be significant. The work by Schweikert et al. [173] uses the mRNA splice site prediction problem as the example to analyze the effectiveness of transfer learning approaches. In their experiments, the source domain contains the sequence instances from a well-studied model organism, that is, Caenorhabditis elegans, and the target organisms include
two additional nematodes (i.e., Caenorhabditis remanei and Pristionchus pacificus), Drosophila melanogaster, and the plant Arabidopsis thaliana. A number of transfer learning approaches, for example, FAM [64] and the variant of KMM [5], are compared with each other. The experimental results show that transfer learning can help improve the classification performance.
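As a reference point, kernel mean matching (KMM) reweights source instances so that their kernel mean matches that of the target sample. A compact sketch is given below; it assumes scipy, an RBF kernel, and a simplified equality constraint in place of the usual ε-tolerance, and is written for clarity rather than efficiency.

    import numpy as np
    from scipy.optimize import minimize

    def rbf(a, b, gamma=1.0):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)

    def kmm_weights(Xs, Xt, B=10.0, gamma=1.0):
        ns, nt = len(Xs), len(Xt)
        K = rbf(Xs, Xs, gamma)                                 # n_s x n_s source kernel matrix
        kappa = (ns / nt) * rbf(Xs, Xt, gamma).sum(axis=1)     # cross term toward the target mean
        obj = lambda b: 0.5 * b @ K @ b - kappa @ b            # match source/target kernel means
        cons = [{"type": "eq", "fun": lambda b: b.sum() - ns}] # simplified normalization constraint
        res = minimize(obj, x0=np.ones(ns), bounds=[(0.0, B)] * ns, constraints=cons)
        return res.x

    Xs = np.random.randn(30, 5) + 0.5   # shifted source sample
    Xt = np.random.randn(40, 5)         # target sample
    print(kmm_weights(Xs, Xt)[:5])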
Another widely encountered task in the bioinformatics area is gene expression analysis, for example, predicting associations between genes and phenotypes. In this application, one of the main challenges is the data sparsity problem, since there is usually very little data about the known associations. Transfer learning can be used to alleviate this problem by providing additional information and knowledge. For example, Petegrosso et al. [174] proposed a transfer learning approach to analyze and predict the gene–phenotype associations based on the label propagation algorithm (LPA) [175]. LPA utilizes the protein–protein interaction (PPI) network and the initial labeling to predict the target associations, based on the assumption that the genes that are connected in the PPI network should have similar labels. The authors extended LPA by incorporating multitask and transfer-learning technologies. First, the human phenotype ontology (HPO), which provides a standardized vocabulary of the phenotypic features of human diseases, is utilized to form the auxiliary task. In this way, the associations can be predicted by utilizing phenotype paths and the linkage knowledge in both HPO and the PPI network; the interacted genes in PPI are more likely to be associated with the same phenotype, and the connected phenotypes in HPO are more likely to be associated with the same gene. Second, the gene ontology (GO), which contains the association information between gene functions and genes, is used as the source domain. Additional regularizers are designed, and the PPI network and the common genes are used as the bridge for knowledge transfer. The gene–GO term and gene–HPO phenotype associations are constructed simultaneously for all the genes in the PPI network. By transferring additional knowledge, the predicted gene–phenotype associations can be more reliable.
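The propagation idea can be illustrated with a toy sketch: association scores spread over a small adjacency matrix so that connected nodes receive similar labels. The graph, damping factor, and initial labels below are placeholders, not real biological data or the exact method of [174].

    import numpy as np

    A = np.array([[0, 1, 1, 0],          # adjacency of a tiny "PPI network"
                  [1, 0, 0, 1],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    y0 = np.array([1.0, 0.0, 0.0, 0.0])  # initial labeling: gene 0 is known to be associated
    S = A / A.sum(axis=1, keepdims=True) # row-normalized transition matrix
    f, alpha = y0.copy(), 0.8
    for _ in range(50):                  # iterate to (approximate) convergence
        f = alpha * S @ f + (1 - alpha) * y0
    print(np.round(f, 3))                # higher score: more likely associated with the phenotype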
Transfer learning can also be applied to solving the PPI prediction problems. Xu et al. [176] proposed an approach to transfer the linkage knowledge from the source PPI network to the target one. The proposed approach is based on the collective matrix factorization technique [177], where a factor matrix is shared across domains.
C. Transportation Application
One application of transfer learning in the transportation area is to understand the traffic scene images. In this application, a challenging problem is that the images taken from a certain location often suffer from variations because of different weather and light conditions. In order to solve this problem, Di et al. [178] proposed an approach that attempts to transfer the information of the images that

were taken from the same location in different conditions. In the first step, a pretrained network is finetuned to extract the feature representations of images. In the second step, the feature transformation strategy is adopted to construct a new feature representation. Specifically, the dimension reduction algorithm (i.e., partial LSs regression [179]) is performed on the extracted features to generate low-dimension features. Then, a transformation matrix is learned to minimize the domain discrepancy of the dimension-reduced data. Next, the SA operations are adopted to further reduce the domain discrepancy. Note that, although images under different conditions often have different appearances, they often have a similar layout structure. Therefore, in the final step, the cross-domain dense correspondences are established between the test image and the retrieved best matching image at first, and then the annotations of the best matching image are transferred to the test image via the Markov random field model [180], [181].
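The SA operation itself can be sketched compactly: PCA subspaces are computed for each domain and the source subspace is aligned to the target one. The code below is an illustrative sketch assuming scikit-learn and synthetic data, not the exact pipeline of [178].

    import numpy as np
    from sklearn.decomposition import PCA

    def subspace_alignment(Xs, Xt, d=10):
        Ps = PCA(d).fit(Xs).components_.T   # source PCA basis, D x d
        Pt = PCA(d).fit(Xt).components_.T   # target PCA basis, D x d
        M = Ps.T @ Pt                       # alignment matrix between the two subspaces
        Zs = Xs @ Ps @ M                    # source data projected into the aligned subspace
        Zt = Xt @ Pt                        # target data projected onto its own subspace
        return Zs, Zt

    Xs = np.random.randn(100, 50) + 1.0     # shifted source features
    Xt = np.random.randn(80, 50)            # target features
    Zs, Zt = subspace_alignment(Xs, Xt, d=10)
    print(Zs.shape, Zt.shape)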

Transfer learning can also be applied to the task of driver behavior modeling. In this task, sufficient personalized data of each individual driver are usually unavailable. In such situations, transferring the knowledge contained in the historical data to the newly involved driver is a promising alternative. For example, Lu et al. [182] proposed an approach to driver model adaptation in lane-changing scenarios. The source domain contains sufficient data describing the behavior of the source drivers, while the target domain has only a small amount of data about the target driver. In the first step, the data from both domains are preprocessed by performing PCA to generate low-dimension features. The authors assume that the source and the target data are from two manifolds. Therefore, in the second step, a manifold alignment approach is adopted for domain adaptation. Specifically, the dynamic time warping algorithm [183] is applied to measuring similarity and finding the corresponding source-domain data point of each target-domain data point. Then, local Procrustes analysis [184] is adopted to align the two manifolds based on the obtained correspondences between data points. In this way, the data from the source domain can be transferred to the target domain. In the final step, a stochastic modeling method (e.g., Gaussian mixture regression [185]) is used to model the behavior of the target driver. The experimental results demonstrate that the transfer learning approach can help the target driver even when few target-domain data are available. Besides, the results also show that when the number of target instances is very small or very large, the superiority of their approach is not obvious. This may be because the relationship across domains cannot be found exactly with few target-domain instances, and in the case of sufficient target-domain instances, the necessity of transfer learning is reduced.
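A small illustrative implementation of DTW-based correspondence finding is given below, assuming NumPy and synthetic trajectories; the subsequent local Procrustes alignment and Gaussian mixture regression steps are omitted.

    import numpy as np

    def dtw_path(A, B):
        n, m = len(A), len(B)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(A[i - 1] - B[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        i, j, path = n, m, []                       # backtrack to recover the aligned index pairs
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
            i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
        return cost[n, m], path[::-1]

    src = np.cumsum(np.random.randn(40, 2), axis=0)   # placeholder source-driver trajectory
    tgt = np.cumsum(np.random.randn(30, 2), axis=0)   # placeholder target-driver trajectory
    dist, pairs = dtw_path(src, tgt)
    print(dist, pairs[:3])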
Besides, there are some other applications of transfer learning in the transportation area. For example, Liu et al. [186] applied transfer learning to driver poses
recognition. Wang et al. [187] adopted the regularization technique in transfer learning for vehicle-type recognition. Transfer learning can also be utilized for anomalous activity detection [188], [189], traffic sign recognition [190], and so on.

D. Recommender-System Application
Due to the rapid increase of the amount of information, how to effectively recommend personalized content for individual users is an important issue. In the field of recommender systems, some traditional recommendation methods, for example, factorization-based collaborative filtering, often rely on the factorization of the user-item interaction matrix to obtain the predictive function. These methods often require a large amount of training data to make accurate recommendations. However, the necessary training data, for example, the historical interaction data, are often sparse in real-world scenarios. Besides, for newly registered users or new items, it is often hard for traditional methods to make effective recommendations, which is also known as the cold-start problem.

Recognizing the above-mentioned problems in recommender systems, various kinds of transfer learning approaches, for example, instance-based and feature-based approaches, have been proposed. These approaches attempt to make use of the data from other recommender systems (i.e., the source domains) to help construct the recommender system in the target domain. Instance-based approaches mainly focus on transferring different types of instances, for example, ratings, feedbacks, and examinations, from the source domain to the target domain. The work by Pan et al. [191] leverages the uncertain ratings (represented as rating distributions) of the source domain for knowledge transfer. Specifically, the source-domain uncertain ratings are used as constraints to help complete the rating matrix factorization task on the target domain. Hu et al. [192] proposed an approach termed transfer meeting hybrid, which extracts the knowledge from unstructured text by using an attentive memory network and selectively transfers the useful information.

Feature-based approaches often leverage and transfer the information from a latent feature space. For example, Pan et al. [193] proposed an approach termed coordinate system transfer (CST) to leverage both the user-side and the item-side latent features. The source-domain instances come from another recommender system, sharing common users and items with the target domain. CST is developed based on the assumption that the principal coordinates, which reflect the tastes of users or the factors of items, characterize the domain-independent structure and are transferable across domains. CST first constructs two principal coordinate systems, which are actually the latent features of users and items, by applying sparse matrix tri-factorization on the source-domain data, and then transfers the coordinate systems to the target domain by setting them as constraints. The experimental results show that CST significantly outperforms the nontransfer baselines (i.e., the average filling model and the latent factorization model) at all data sparsity levels [193].

There are some other studies about cross-domain recommendation [194]–[197]. For example, He et al. [198] proposed a transfer learning framework based on the Bayesian neural network. Zhu et al. [199] proposed a deep framework, which first generates the user and item feature representations based on the matrix factorization technique, and then employs a deep neural network to learn the mapping of features across domains. Yuan et al. [200] proposed a deep domain adaptation approach based on autoencoders and a modified DANN [154], [155] to extract and transfer the instances from rating matrices.

E. Other Applications
1) Communication Application: In addition to WiFi localization tasks [2], [36], transfer learning has also been employed in wireless-network applications. For example, Bastug et al. [201] proposed a caching mechanism; the knowledge contained in contextual information, which is extracted from the interactions between devices, is transferred to the target domain. Besides, some studies focus on energy saving problems. The work by Li et al. [202] proposes an energy saving scheme for cellular radio access networks, which utilizes the transfer-learning expertise. The work by Zhao and Grace [203] applies transfer learning to topology management for reducing energy consumption.

2) Urban-Computing Application: With a large amount of data related to our cities, urban computing is a promising research track in directions such as traffic monitoring, health care, and social security. Transfer learning has been applied to alleviate the data scarcity problem in many urban computing applications. For example, Guo et al. [204] proposed an approach for chain store site recommendation, which leverages the knowledge from semantically relevant domains (e.g., other cities with the same store and other chain stores in the target city) to the target city. Wei et al. [205] proposed a flexible multimodal transfer learning approach that transfers knowledge from a city that has sufficient multimodal data and labels to the target city to alleviate the data sparsity problem.

Transfer learning has been applied to some recognition tasks such as hand gesture recognition [206], face recognition [207], activity recognition [208], and speech emotion recognition [209]. Besides, transfer-learning expertise has also been incorporated into some other areas such as sentiment analysis [28], [96], [210], fraud detection [211], social network [212], and hyperspectral image analysis [54], [213].

VII. EXPERIMENT
Transfer learning techniques have been successfully applied in many real-world applications. In this section, we perform experiments to evaluate the performance of

Table 2 Statistical Information of the Preprocessed Data Sets

some representative transfer learning models¹ [214] of different categories on two mainstream research areas, that is, object recognition and text classification. The data sets are introduced at first. Then, the experimental results and further analyses are provided.

A. Data Set and Preprocessing
Three data sets are studied in the experiments, that is, Office-31, Reuters-21578, and Amazon Reviews. For simplicity, we focus on the classification tasks. The statistical information of the preprocessed data sets is listed in Table 2.
1) Amazon Reviews² [107]: This is a multidomain sentiment data set which contains product reviews taken from Amazon.com of four domains (books, kitchen, electronics, and DVDs). Each review in the four domains has a text and a rating from zero to five. In the experiments, the ratings that are less than three are defined as the negative ones, while the others are defined as the positive ones. The frequency of each word in all reviews is calculated. Then, the 5000 words with the highest frequency are selected as the attributes of each review. In this way, we finally have 1000 positive instances, 1000 negative instances, and about 5000 unlabeled instances in each domain. In the experiments, every two of the four domains are selected to generate 12 tasks.
2) Reuters-21578³: This is a data set for text categorization, which has a hierarchical structure. The data set contains five top categories (Exchanges, Orgs, People, Places, and Topics). In our experiment, we use the top three big categories, Orgs, People, and Places, to generate three classification tasks (Orgs versus People, Orgs versus Places, and People versus Places). In each task, the subcategories in the corresponding two categories are separately divided into two parts. Then, the resultant four parts are used as the components to form two domains. Each domain has about 1000 instances, and each instance has about 4500 features. Specifically, taking the task Orgs versus People as an example, one part from Orgs and one part from People are combined to form the source domain; similarly, the remaining two parts form the target domain. Note that the instances in the three categories are all labeled. In order to generate the unlabeled instances, the labeled instances are selected from the data set, and their labels are ignored.
3) Office-31 [215]: This is an object recognition data set which contains 31 categories and three domains, that is, Amazon, Webcam, and DSLR. These three domains have 2817, 498, and 795 instances, respectively. The images in Amazon are the online e-commerce pictures taken from Amazon.com. The images in Webcam are the low-resolution pictures taken by web cameras. And the images in DSLR are the high-resolution pictures taken by DSLR cameras. In the experiments, every two of the three domains (with the order considered) are selected as the source and the target domains, which results in six tasks.

B. Experiment Setting
Experiments are conducted to compare some representative transfer learning models. Specifically, eight algorithms are performed on the data set Office-31 for solving the object recognition problem. Besides, 14 algorithms are performed and evaluated on the data set Reuters-21578 for solving the text classification problem. In the sentiment classification problem, 11 algorithms are performed on Amazon Reviews. The classification results are evaluated by accuracy, which is defined as follows:

    accuracy = |{x_i | x_i ∈ D_test ∧ f(x_i) = y_i}| / |D_test|

where D_test denotes the test data and y denotes the true classification label; f(x) represents the predicted classification result. Note that some algorithms need a base classifier. In these cases, an SVM with a linear kernel is adopted as the base classifier in the experiments. Besides, the source-domain instances are all labeled. And for the performed algorithms (except TrAdaBoost), the target-domain instances are unlabeled. Each algorithm was executed three times, and the average results are adopted as our experimental results.

The evaluated transfer learning models include: HIDC [93], TriTL [123], CD-PLSA [91], [92], MTrick [122], SFA [106], mSLDA [98], [99], SDA [96], GFK [102], SCL [94], TCA [36], [78], CoCC [41], JDA [38], TrAdaBoost [31], DAN [137], DCORAL [148], MRAN [152], CDAN [158], DANN [154], [155], JAN [147], and CAN [151].

¹https://github.com/FuzhenZhuang/Transfer-Learning-Toolkit
²http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
³https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

Table 3 Accuracy Performance on the Amazon Reviews of Four Domains: Kitchen (K), Electronics (E), DVDs (D), and Books (B)

Table 4 Accuracy Performance on the Reuters-21578 of Three Domains: Orgs, People, and Places
Table 5 Accuracy Performance on Office-31 of Three Domains: Amazon (A), Webcam (W), and DSLR (D)

C. Experimental Result
In this section, we compare over 20 algorithms on three data sets in total. The parameters of all algorithms are set to the default values or the recommended values mentioned in the original articles. The experimental results are presented in Tables 3–5 corresponding to Amazon Reviews, Reuters-21578, and Office-31, respectively. In order to allow readers to understand the experimental results more intuitively, three radar maps, that is, Figs. 5–7, are provided, which visualize the experimental results. In the radar maps, each direction represents a task. The general performance of an algorithm is demonstrated by a polygon whose vertices represent the accuracy of the algorithm for dealing with different tasks.
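A short sketch of how such a radar map can be drawn is given below, assuming matplotlib; the task names and accuracy values are placeholders rather than results from the tables.

    import numpy as np
    import matplotlib.pyplot as plt

    tasks = ["B->D", "B->E", "B->K", "D->B", "D->E", "D->K"]
    acc = [0.79, 0.75, 0.78, 0.80, 0.77, 0.82]            # one accuracy per task (placeholder)
    angles = np.linspace(0, 2 * np.pi, len(tasks), endpoint=False).tolist()
    values, angles = acc + acc[:1], angles + angles[:1]   # close the polygon
    ax = plt.subplot(polar=True)
    ax.plot(angles, values, label="some model")
    ax.fill(angles, values, alpha=0.2)
    ax.set_xticks(angles[:-1]); ax.set_xticklabels(tasks)
    ax.legend()
    plt.savefig("radar.png")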
Table 3 shows the experimental results on Amazon Reviews. The baseline is a linear classifier trained only on the source domain (here, we directly use the results from the article [107]). Fig. 5 depicts the results. As shown in Fig. 5, most algorithms are relatively well-performed when the source domain is electronics or kitchen, which indicates that these two domains may contain more transferable information than the other two domains. In addition, it can be observed that HIDC, SCL, SFA, MTrick, and SDA perform well and relatively stably in all the 12 tasks. Meanwhile, other algorithms, especially mSLDA, CD-PLSA,

Fig. 5. Comparison results on Amazon reviews.


Fig. 6. Comparison results on Reuters-21578.

and TriTL, are relatively unstable; their performance fluctuates in a range of about 20%. TriTL has a relatively high accuracy on the tasks where the source domain is kitchen, but has a relatively low accuracy on other tasks. The algorithms TCA, mSLDA, and CD-PLSA have similar performance on all the tasks with an accuracy of about 70% on average. Among the well-performed algorithms, HIDC and MTrick are based on feature reduction (feature clustering), while the others are based on feature encoding (SDA), feature alignment (SFA), and feature selection (SCL). Those strategies are currently the mainstreams of feature-based transfer learning.

Table 4 presents the comparison results on Reuters-21578 (here, we directly use the results of the baseline and CoCC from articles [78] and [41]). The baseline is a regularized LS regression model trained only on the labeled target-domain instances [78]. Fig. 6, which has the same structure as Fig. 5, visualizes the performance. For clarity, the 13 algorithms are divided into two parts that correspond to the two subfigures in Fig. 6. It can be observed that most algorithms are relatively well-performed for Orgs versus Places and Orgs versus People, but poor for People versus Places. This phenomenon indicates that the discrepancy between People and Places may be relatively large. TrAdaBoost has a relatively good performance in this experiment because it uses the labels of the instances in the target domain to reduce the impact of the distribution difference. Besides, the algorithms HIDC, SFA, and MTrick have relatively consistent performance in the three tasks. These algorithms are also well-performed in the previous experiment on Amazon Reviews. In addition, the top two well-performed algorithms in terms of People versus Places are CoCC and TrAdaBoost.

In the third experiment, seven deep-learning-based transfer learning models (i.e., DAN, DCORAL, MRAN, CDAN, DANN, JAN, and CAN) and the baseline (i.e., the AlexNet [138], [140] pretrained on ImageNet [166] and then directly trained on the target domain) are performed on the data set Office-31 (here, we directly use the results of CDAN, JAN, CAN, and the baseline from the original articles [137], [147], [151], [158]). The ResNet-50 [144] is used as the backbone network for all these three models. The experimental results are provided in Table 5, and the average performance is visualized in Fig. 7. As shown in Fig. 7, all of these seven algorithms have excellent performance, especially on the tasks D → W and W → D, whose accuracy is very close to 100%. This phenomenon reflects the superiority of the deep-learning-based approaches and is consistent with the fact that the difference between Webcam and DSLR is smaller than that between Webcam/DSLR and Amazon. Clearly, CAN outperforms the other six algorithms. In all the six tasks, the performance of DANN is similar to that of DAN, and is better than that of DCORAL, which indicates the effectiveness and the practicability of incorporating adversarial learning.

It is worth mentioning that, in the above experiments, the performance of some algorithms is not ideal. One reason is that we use the default parameter settings provided in the algorithms' original articles, which may not be suitable for the data set we selected. For example, GFK was originally designed for object recognition, and we directly adopt it into text classification in the first experiment,

Fig. 7. Comparison results on Office-31.


which turns out to produce an unsatisfactory result (having about 62% accuracy on average). The above experimental results are just for reference. These results demonstrate that some algorithms may not be suitable for the data sets of certain domains. Therefore, it is important to choose the appropriate algorithms as the baselines in the process of research. Besides, in practical applications, it is also necessary to find a suitable algorithm.

VIII. CONCLUSION AND FUTURE DIRECTION
In this survey article, we have summarized the mechanisms and the strategies of transfer learning from the perspectives of data and model. The survey gives clear definitions of transfer learning and manages to use a unified symbol system to describe a large number of representative transfer learning approaches and related works. We have introduced the objectives and strategies in transfer learning based on data-based interpretation and model-based interpretation. Data-based interpretation introduces the objectives, the strategies, and some transfer learning approaches from the data perspective. Similarly, model-based interpretation introduces the mechanisms and the strategies of transfer learning but from the model level. The applications of transfer learning have also been introduced. At last, experiments have been conducted to evaluate the performance of representative transfer learning models on two mainstream areas, that is, object recognition and text categorization. The comparisons of the models have also been given, which reflects that the selection of the transfer learning model is an important research topic as well as a complex issue in practical applications.

Several directions are available for future research in the transfer learning area. First, transfer learning techniques can be further explored and applied to a wider range of applications. New approaches are needed to solve the knowledge transfer problems in more complex scenarios. For example, in real-world scenarios, sometimes the user-relevant source-domain data comes from another company. In this case, how to transfer the knowledge contained in the source domain while protecting user privacy is an important issue. Second, how to measure the transferability across domains and avoid negative transfer is also an important issue. Although there have been some studies on negative transfer, negative transfer still needs further systematic analyses [3]. Third, the interpretability of transfer learning also needs to be investigated further [216]. Finally, theoretical studies can be further conducted to provide theoretical support for the effectiveness and applicability of transfer learning. As a popular and promising area in machine learning, transfer learning shows some advantages over traditional machine learning such as less data dependence and less label dependence. We hope our work can help readers have a better understanding of the research status and the research ideas.

REFERENCES
[1] D. N. Perkins and G. Salomon, Transfer of [11] O. Chapelle, B. Schlkopf, and A. Zien, representations,” in Proc. 36th Int. Conf. Mach.
Learning. Oxford, U.K.: Pergamon, 1992. Semi-Supervised Learning. Cambridge, MA, USA: Learn., Long Beach, CA, USA, Jun. 2019,
[2] S. Jialin Pan and Q. Yang, “A survey on transfer MIT Press, 2010. pp. 5102–5112.
learning,” IEEE Trans. Knowl. Data Eng., vol. 22, [12] S. Sun, “A survey of multi-view machine learning,” [22] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and
no. 10, pp. 1345–1359, Oct. 2010. Neural Comput. Appl., vol. 23, nos. 7–8, G. Zhang, “Transfer learning using computational
[3] Z. Wang, Z. Dai, B. Poczos, and J. Carbonell, pp. 2031–2038, Dec. 2013. intelligence: A survey,” Knowl.-Based Syst., vol. 80,
“Characterizing and avoiding negative transfer,” in [13] C. Xu, D. Tao, and C. Xu, “A survey on multi-view pp. 14–23, May 2015.
Proc. IEEE/CVF Conf. Comput. Vis. Pattern learning,” 2013, arXiv:1304.5634. [Online]. [23] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and
Recognit., Long Beach, CA, USA, Jun. 2019, Available: http://arxiv.org/abs/1304.5634 C. Liu, “A survey on deep transfer learning,” in
pp. 11293–11302. [14] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view Proc. 27th Int. Conf. Artif. Neural Netw., Rhodes,
[4] K. Weiss, T. M. Khoshgoftaar, and D. Wang, learning overview: Recent progress and new Greece, Oct. 2018, pp. 270–279.
“A survey of transfer learning,” J. Big Data, vol. 3, challenges,” Inf. Fusion, vol. 38, pp. 43–54, [24] M. Wang and W. Deng, “Deep visual domain
no. 1, Dec. 2016. Nov. 2017. adaptation: A survey,” Neurocomputing, vol. 312,
[5] J. Huang, A. J. Smola, A. Gretton, [15] D. Zhang, J. He, Y. Liu, L. Si, and R. Lawrence, pp. 135–153, Oct. 2018.
K. M. Borgwardt, and B. Schölkopf, “Correcting “Multi-view transfer learning with a large margin [25] D. Cook, K. D. Feuz, and N. C. Krishnan, “Transfer
sample selection bias by unlabeled data,” in Proc. approach,” in Proc. 17th ACM SIGKDD Int. Conf. learning for activity recognition: A survey,” Knowl.
20th Annu. Conf. Neural Inf. Process. Syst., Knowl. Discovery Data Mining, San Diego, CA, Inf. Syst., vol. 36, no. 3, pp. 537–556, Sep. 2013.
Vancouver, BC, Canada, Dec. 2006, USA, 2011, pp. 1208–1216. [26] L. Shao, F. Zhu, and X. Li, “Transfer learning for
pp. 601–608. [16] P. Yang and W. Gao, “Multi-view discriminant visual categorization: A survey,” IEEE Trans.
[6] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, transfer learning,” in Proc. 23rd Int. Joint Conf. Neural Netw. Learn. Syst., vol. 26, no. 5,
P. von Bünau, and M. Kawanabe, “Direct Artif. Intell., Beijing, China, Aug. 2013, pp. 1019–1034, May 2015.
importance estimation for covariate shift pp. 1848–1854. [27] W. Pan, “A survey of transfer learning for
adaptation,” Ann. Inst. Stat. Math., vol. 60, no. 4, [17] K. D. Feuz and D. J. Cook, “Collegial activity collaborative recommendation with auxiliary
pp. 699–746, Dec. 2008. learning between heterogeneous sensors,” Knowl. data,” Neurocomputing, vol. 177, pp. 447–453,
[7] O. Day and T. M. Khoshgoftaar, “A survey on Inf. Syst., vol. 53, no. 2, pp. 337–364, Mar. 2017. Feb. 2016.
heterogeneous transfer learning,” J. Big Data, [18] Y. Zhang and Q. Yang, “An overview of multi-task [28] R. Liu, Y. Shi, C. Ji, and M. Jia, “A survey of
vol. 4, no. 1, Dec. 2017. learning,” Nat. Sci. Rev., vol. 5, no. 1, pp. 30–43, sentiment analysis based on transfer learning,”
[8] M. E. Taylor and P. Stone, “Transfer learning for Jan. 2018. IEEE Access, vol. 7, pp. 85401–85412, Jun. 2019.
reinforcement learning domains: A survey,” [19] W. Zhang et al., “Deep model based transfer and [29] Q. Sun, R. Chattopadhyay, S. Panchanathan, and
J. Mach. Learn. Res., vol. 10, pp. 1633–1685, multi-task learning for biological image analysis,” J. Ye, “A two-stage weighting framework for
Jul. 2009. in Proc. 21th ACM SIGKDD Int. Conf. Knowl. multi-source domain adaptation,” in Proc. 25th
[9] H. B. Ammar, E. Eaton, J. M. Luna, and P. Ruvolo, Discovery Data Mining, Sydney, NSW, Australia, Annu. Conf. Neural Inf. Process. Syst., Granada,
“Autonomous cross-domain knowledge transfer in 2015, pp. 1475–1484. Spain, Dec. 2011, pp. 505–513.
lifelong policy gradient reinforcement learning,” [20] A.-A. Liu, N. Xu, W.-Z. Nie, Y.-T. Su, and [30] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold
in Proc. 24th Int. Joint Conf. Artif. Intell., Buenos Y.-D. Zhang, “Multi-domain and multi-task regularization: A geometric framework for
Aires, Argentina, Jul. 2015, pp. 3345–3351. learning for human action recognition,” IEEE learning from labeled and unlabeled examples,”
[10] P. Zhao and S. C. H. Hoi, “OTL: A framework of Trans. Image Process., vol. 28, no. 2, pp. 853–867, J. Mach. Learn. Res., vol. 7, pp. 2399–2434,
online transfer learning,” in Proc. 27th Int. Conf. Feb. 2019. Nov. 2006.
Mach. Learn., Haifa, Israel, Jun. 2010, [21] X. Peng, Z. Huang, X. Sun, and K. Saenko, [31] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting
pp. 1231–1238. “Domain agnostic learning with disentangled for transfer learning,” in Proc. 24th Int. Conf.


Mach. Learn., Corvallis, OR, USA, Jun. 2007, vol. 7, no. 3, pp. 200–217, 1967. no. 5, pp. 1299–1319, Jul. 1998.
pp. 193–200. [51] S. Si, D. Tao, and B. Geng, “Bregman [70] J. Wang, Y. Chen, S. Hao, W. Feng, and Z. Shen,
[32] Y. Freund and R. E. Schapire, “A decision-theoretic divergence-based regularization for transfer “Balanced distribution adaptation for transfer
generalization of on-line learning and an subspace learning,” IEEE Trans. Knowl. Data Eng., learning,” in Proc. IEEE Int. Conf. Data Mining
application to boosting,” J. Comput. Syst. Sci., vol. 22, no. 7, pp. 929–942, Jul. 2010. (ICDM), New Orleans, FL, USA, Nov. 2017,
vol. 55, no. 1, pp. 119–139, Aug. 1997. [52] H. Sun, S. Liu, S. Zhou, and H. Zou, pp. 1129–1134.
[33] Y. Yao and G. Doretto, “Boosting for transfer “Unsupervised cross-view semantic transfer for [71] A. Blum and T. Mitchell, “Combining labeled and
learning with multiple sources,” in Proc. IEEE remote sensing image classification,” IEEE Geosci. unlabeled data with co-training,” in Proc. 11th
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Remote Sens. Lett., vol. 13, no. 1, pp. 13–17, Annu. Conf. Comput. Learn. theory, Madison, WI,
San Francisco, CA, USA, Jun. 2010, Jan. 2016. USA, Jul. 1998, pp. 92–100.
pp. 1855–1862. [53] H. Sun, S. Liu, and S. Zhou, “Discriminative [72] M. Chen, K. Q. Weinberger, and J. C. Blitzer,
[34] J. Jiang and C. Zhai, “Instance weighting for subspace alignment for unsupervised visual “Co-training for domain adaptation,” in Proc. 25th
domain adaptation in NLP,” in Proc. 45th Annu. domain adaptation,” Neural Process. Lett., vol. 44, Annu. Conf. Neural Inf. Process. Syst., Granada,
Meeting Assoc. Comput. Linguistics, Prague, Czech no. 3, pp. 779–793, Dec. 2016. Spain, Dec. 2011, pp. 2456–2464.
Republic, Jun. 2007, pp. 264–271. [54] Q. Shi, Y. Zhang, X. Liu, and K. Zhao, “Regularised [73] Z.-H. Zhou and M. Li, “Tri-training: Exploiting
[35] K. M. Borgwardt, A. Gretton, M. J. Rasch, transfer learning for hyperspectral image unlabeled data using three classifiers,” IEEE Trans.
H.-P. Kriegel, B. Schölkopf, and A. J. Smola, classification,” IET Comput. Vis., vol. 13, no. 2, Knowl. Data Eng., vol. 17, no. 11, pp. 1529–1541,
“Integrating structured biological data by kernel pp. 188–193, Mar. 2019. Nov. 2005.
maximum mean discrepancy,” Bioinformatics, [55] A. Gretton, O. Bousquet, A. J. Smola, and [74] K. Saito, Y. Ushiku, and T. Harada, “Asymmetric
vol. 22, no. 14, pp. 49–57, Jul. 2006. B. Schölkopf, “Measuring statistical dependence tri-training for unsupervised domain adaptation,”
[36] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, with Hilbert-Schmidt norms,” in Proc. 18th Int. in Proc. 34th Int. Conf. Mach. Learn., Sydney, NSW,
“Domain adaptation via transfer component Conf. Algorithmic Learn. Theory, Singapore, Australia, Aug. 2017, pp. 2988–2997.
analysis,” IEEE Trans. Neural Netw., vol. 22, no. 2, Oct. 2005, pp. 63–77. [75] S. J. Pan, J. T. Kwok, and Q. Yang, “Transfer
pp. 199–210, Feb. 2011. [56] H. Wang and Q. Yang, “Transfer learning by learning via dimensionality reduction,” in Proc.
[37] M. Ghifary, W. B. Kleijn, and M. Zhang, “Domain structural analogy,” in Proc. 25th AAAI Conf. Artif. 23rd AAAI Conf. Artif. Intell., Chicago, IL, USA,
adaptive neural networks for object recognition,” Intell., San Francisco, CA, USA, Aug. 2011, Jul. 2008, pp. 677–682.
in Proc. Pacific Rim Int. Conf. Artif. Intell., Gold pp. 513–518. [76] K. Q. Weinberger, F. Sha, and L. K. Saul, “Learning
Coast, QLD, Australia, Dec. 2014, pp. 898–904. [57] M. Xiao and Y. Guo, “Feature space independent a kernel matrix for nonlinear dimensionality
[38] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, semi-supervised domain adaptation via kernel reduction,” in Proc. 21st Int. Conf. Mach. Learn.,
“Transfer feature learning with joint distribution matching,” IEEE Trans. Pattern Anal. Mach. Intell., Banff, AB, Canada, Jul. 2004, pp. 106–113.
adaptation,” in Proc. IEEE Int. Conf. Comput. Vis., vol. 37, no. 1, pp. 54–66, Jan. 2015. [77] L. Vandenberghe and S. Boyd, “Semidefinite
Sydney, NSW, Australia, Dec. 2013, [58] K. Yan, L. Kou, and D. Zhang, “Learning programming,” SIAM Rev., vol. 38, no. 1,
pp. 2200–2207. domain-invariant subspace using domain features pp. 49–95, Mar. 1996.
[39] M. Long, J. Wang, G. Ding, S. J. Pan, and P. S. Yu, and independence maximization,” IEEE Trans. [78] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang,
“Adaptation regularization: A general framework Cybern., vol. 48, no. 1, pp. 288–299, Jan. 2018. “Domain adaptation via transfer component
for transfer learning,” IEEE Trans. Knowl. Data [59] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein analysis,” in Proc. 21st Int. Joint Conf. Artif. Intell.,
Eng., vol. 26, no. 5, pp. 1076–1089, May 2014. distance guided representation learning for Pasadena, CA, USA, Jul. 2009, pp. 1187–1192.
[40] S. Kullback and R. A. Leibler, “On information and domain adaptation,” in Proc. 32nd AAAI Conf. [79] C.-A. Hou, Y.-H.-H. Tsai, Y.-R. Yeh, and
sufficiency,” Ann. Math. Statist., vol. 22, no. 1, Artif. Intell., New Orleans, FL, USA, Feb. 2018, Y.-C.-F. Wang, “Unsupervised domain adaptation
pp. 79–86, 1951. pp. 4058–4065. with label and structural consistency,” IEEE Trans.
[41] W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, [60] C.-Y. Lee, T. Batra, M. H. Baig, and D. Ulbricht, Image Process., vol. 25, no. 12, pp. 5552–5562,
“Co-clustering based classification for “Sliced Wasserstein discrepancy for unsupervised Dec. 2016.
out-of-domain documents,” in Proc. 13th ACM domain adaptation,” in Proc. IEEE/CVF Conf. [80] J. Tahmoresnezhad and S. Hashemi, “Visual
SIGKDD Int. Conf. Knowl. Discovery Data Mining, Comput. Vis. Pattern Recognit., Long Beach, CA, domain adaptation via transfer feature learning,”
San Jose, CA, USA, 2007, pp. 210–219. USA, Jun. 2019, pp. 10285–10295. Knowl. Inf. Syst., vol. 50, no. 2, pp. 585–605,
[42] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Self-taught [61] W. Zellinger, T. Grubinger, E. Lughofer, Feb. 2017.
clustering,” in Proc. 25th Int. Conf. Mach. Learn., T. Natschläger, and S. Saminger-Platz, “Central [81] J. Zhang, W. Li, and P. Ogunbona, “Joint
Helsinki, Finland, 2008, pp. 200–207. moment discrepancy (CMD) for domain-invariant geometrical and statistical alignment for visual
[43] J. Davis and P. Domingos, “Deep transfer via representation learning,” in Proc. 5th Int. Conf. domain adaptation,” in Proc. IEEE Conf. Comput.
second-order Markov logic,” in Proc. 26th Annu. Learn. Represent., Toulon, France, Apr. 2017, Vis. Pattern Recognit., Honolulu, HI, USA,
Int. Conf. Mach. Learn., Montreal, QC, Canada, pp. 1–13. Jul. 2017, pp. 5150–5158.
Jun. 2009, pp. 217–224. [62] A. Gretton et al., “Optimal kernel choice for [82] B. Schölkopf, R. Herbrich, and A. J. Smola,
[44] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He, large-scale two-sample tests,” in Proc. 26th Annu. “A generalized representer theorem,” in Proc. Int.
“Supervised representation learning: Transfer Conf. Neural Inf. Process. Syst., Lake Tahoe, Conf. Comput. Learn. Theory, Amsterdam, The
learning with deep autoencoders,” in Proc. 24th Dec. 2012, pp. 1205–1213. Netherlands, Jul. 2001, pp. 416–426.
Int. Joint Conf. Artif. Intell., Buenos Aires, [63] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, [83] L. Duan, I. W. Tsang, and D. Xu, “Domain transfer
Argentina, Jul. 2015, pp. 4119–4125. “Mind the class weight bias: Weighted maximum multiple kernel learning,” IEEE Trans. Pattern
[45] I. Dagan, L. Lee, and F. Pereira, “Similarity-based mean discrepancy for unsupervised domain Anal. Mach. Intell., vol. 34, no. 3, pp. 465–479,
methods for word sense disambiguation,” in Proc. adaptation,” in Proc. IEEE Conf. Comput. Vis. Mar. 2012.
35th Annu. Meeting Assoc. Comput. Linguistics 8th Pattern Recognit., Honolulu, HI, USA, Jul. 2017, [84] A. Rakotomamonjy, F. Bach, S. Canu, and
Conf. Eur. Chapter Assoc. Comput. Linguistics pp. 2272–2281. Y. Grandvalet, “SimpleMKL,” J. Mach. Learn. Res.,
(ACL/EACL), Madrid, Spain, Jul. 1997, pp. 56–63. [64] H. Daumé, III, “Frustratingly easy domain vol. 9, pp. 2491–2521, Nov. 2008.
[46] B. Chen, W. Lam, I. Tsang, and T.-L. Wong, adaptation,” in Proc. 45th Annu. Meeting Assoc. [85] I. S. Dhillon, S. Mallela, and D. S. Modha,
“Location and scatter matching for dataset shift in Comput. Linguistics, Prague, Czech Republic, “Information-theoretic co-clustering,” in Proc. 9th
text mining,” in Proc. IEEE Int. Conf. Data Mining, Jun. 2007, pp. 256–263. ACM SIGKDD Int. Conf. Knowl. Discovery Data
Sydney, NSW, Australia, Dec. 2010, pp. 773–778. [65] H. Daumé, III, A. Kumar, and A. Saha, Mining, Washington, DC, USA, Aug. 2003,
[47] S. Dey, S. Madikeri, and P. Motlicek, “Information “Co-regularization based semi-supervised domain pp. 89–98.
theoretic clustering for unsupervised adaptation,” in Proc. 24th Annu. Conf. Neural Inf. [86] S. Deerwester, S. T. Dumais, G. W. Furnas,
domain-adaptation,” in Proc. IEEE Int. Conf. Process. Syst., Vancouver, BC, Canada, Dec. 2010, T. K. Landauer, and R. Harshman, “Indexing by
Acoust., Speech Signal Process. (ICASSP), pp. 478–486. latent semantic analysis,” J. Amer. Soc. Inf. Sci.,
Shanghai, China, Mar. 2016, pp. 5580–5584. [66] L. Duan, D. Xu, and I. W. Tsang, “Learning with vol. 41, no. 6, pp. 391–407, Sep. 1990.
[48] W.-H. Chen, P.-C. Cho, and Y.-L. Jiang, “Activity augmented features for heterogeneous domain [87] T. Hofmann, “Probabilistic latent semantic
recognition using transfer learning,” Sens. Mater., adaptation,” in Proc. 29th Int. Conf. Mach. Learn., analysis,” in Proc. 15th Conf. Uncertainty Artif.
vol. 29, no. 7, pp. 897–904, Jul. 2017. Edinburgh, Scotland, Jun. 2012, pp. 1–8. Intell., Stockholm, Sweden, Jul. 1999,
[49] J. Giles, K. K. Ang, L. S. Mihaylova, and [67] W. Li, L. Duan, D. Xu, and I. W. Tsang, “Learning pp. 289–296.
M. Arvaneh, “A subject-to-subject transfer learning with augmented features for supervised and [88] J. Yoo and S. Choi, “Probabilistic matrix
framework based on Jensen-Shannon divergence semi-supervised heterogeneous domain tri-factorization,” in Proc. IEEE Int. Conf. Acoust.,
for improving brain-computer interface,” in Proc. adaptation,” IEEE Trans. Pattern Anal. Mach. Speech Signal Process., Taipei, Taiwan, Apr. 2009,
IEEE Int. Conf. Acoust., Speech Signal Process. Intell., vol. 36, no. 6, pp. 1134–1148, Jun. 2014. pp. 1553–1556.
(ICASSP), Brighton, U.K., May 2019, [68] K. I. Diamantaras and S. Y. Kung, Principal [89] A. P. Dempster, N. M. Laird, and D. B. Rubin,
pp. 3087–3091. Component Neural Networks. New York, NY, USA: “Maximum likelihood from incomplete data via
[50] L. M. Bregman, “The relaxation method of finding Wiley, 1996. the EM algorithm,” J. Roy. Statist. Soc. B,
the common point of convex sets and its [69] B. Schölkopf, A. Smola, and K.-R. Müller, Methodol., vol. 39, no. 1, pp. 1–38, 1977.
application to the solution of problems in convex “Nonlinear component analysis as a kernel [90] G.-R. Xue, W. Dai, Q. Yang, and Y. Yu,
programming,” USSR Comput. Math. Math. Phys., eigenvalue problem,” Neural Comput., vol. 10, “Topic-bridged PLSA for cross-domain text


classification,” in Proc. 31st Annu. Int. ACM SIGIR [110] X. Ling, W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, Netw., Vancouver, BC, Canada, Jul. 2006,
Conf. Res. Develop. Inf. Retr., Singapore, Jul. 2008, “Spectral domain-transfer learning,” in Proc. 14th pp. 1661–1668.
pp. 627–634. ACM SIGKDD Int. Conf. Knowl. Discovery Data [130] T. Tommasi, F. Orabona, and B. Caputo, “Safety in
[91] F. Zhuang et al., “Collaborative dual-PLSA: Mining Mining, Las Vegas, NV, USA, Aug. 2008, numbers: Learning categories from few examples
distinction and commonality across multiple pp. 488–496. with multi model knowledge transfer,” in Proc.
domains for text classification,” in Proc. 19th ACM [111] S. D. Kamvar, D. Klein, and C. D. Manning, IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Int. Conf. Inf. Knowl. Manage., Toronto, ON, “Spectral learning,” in Proc. 18th Int. Joint Conf. Recognit., San Francisco, CA, USA, Jun. 2010,
Canada, Oct. 2010, pp. 359–368. Artif. Intell., Acapulco, Aug. 2003, pp. 561–566. pp. 3081–3088.
[92] F. Zhuang et al., “Mining distinction and [112] J. Shi and J. Malik, “Normalized cuts and image [131] C.-K. Lin, Y.-Y. Lee, C.-H. Yu, and H.-H. Chen,
commonality across multiple domains using segmentation,” IEEE Trans. Pattern Anal. Mach. “Exploring ensemble of models in taxonomy-based
generative model for text classification,” IEEE Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000. cross-domain sentiment classification,” in Proc.
ABOUT THE AUTHORS


Fuzhen Zhuang received the B.E. and Ph.D. degrees in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2006 and 2011, respectively.
He is currently an Associate Professor with the Institute of Computing Technology, Chinese Academy of Sciences. He has published more than 100 papers in prestigious refereed journals and conference proceedings, such as the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, the IEEE TRANSACTIONS ON CYBERNETICS, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the ACM Transactions on Intelligent Systems and Technology, Information Sciences, Neural Networks, SIGKDD, IJCAI, AAAI, TheWebConf, SIGIR, ICDE, ACM CIKM, ACM WSDM, SIAM SDM, and IEEE ICDM. His research interests include transfer learning, machine learning, data mining, multitask learning, and recommendation systems.
Dr. Zhuang is a Senior Member of CCF. He was a recipient of the Distinguished Dissertation Award of CAAI in 2013.

Zhiyuan Qi received the B.E. degree in software engineering from Sun Yat-sen University, Guangzhou, China, in 2019.
He is currently a Visiting Student with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He has published several journal articles in the IEEE Computational Intelligence Magazine, IEEE ACCESS, Neurocomputing, and Numerical Algorithms. He serves as a reviewer for several journals, including IEEE ACCESS and the IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS. His current research interests include transfer learning, machine learning, recommender systems, and time-variant problem solving.

Keyu Duan is currently pursuing the B.E. degree in software engineering with Beihang University, Beijing, China.
He has been with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, as a Visiting Student since May 2019. His main research interests include transfer learning and data mining.

Dongbo Xi received the B.S. degree in computer science and technology from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2017. He is currently pursuing the M.S. degree with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
He has published several papers in conference proceedings, including AAAI, WWW, and SIGIR. His research interests include recommender systems and transfer learning.

Yongchun Zhu received the B.S. degree from Beijing Normal University, Beijing, China, in 2018. He is currently pursuing the M.S. degree with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
He has published papers in journals and conference proceedings, including the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, Neural Networks, AAAI, WWW, and PAKDD. His main research interests include transfer learning and meta learning.

Hengshu Zhu (Senior Member, IEEE) received the B.E. and Ph.D. degrees in computer science from the University of Science and Technology of China (USTC), Hefei, China, in 2009 and 2014, respectively.
He is currently a Principal Data Scientist and an Architect at Baidu Inc., Beijing, China. He has published prolifically in refereed journals and conference proceedings, including the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE), the IEEE TRANSACTIONS ON MOBILE COMPUTING (TMC), the ACM Transactions on Information Systems (ACM TOIS), the ACM Transactions on Knowledge Discovery from Data (TKDD), ACM SIGKDD, ACM SIGIR, WWW, IJCAI, and AAAI. His research interests include data mining and machine learning, with a focus on developing advanced data analysis techniques for innovative business applications.

Dr. Zhu is a Senior Member of ACM and CCF. He was a recipient of the Distinguished Dissertation Award of CAS in 2016, the Distinguished Dissertation Award of CAAI in 2016, the Special Prize of the President Scholarship for Postgraduate Students of CAS in 2014, the Best Student Paper Award of KSEM-2011, WAIM-2013, and CCDM-2014, and the Best Paper Nomination of ICDM-2014. He has served regularly on the organization and program committees of numerous conferences, including as a Program Co-Chair of the KDD Cup-2019 Regular ML Track and a Founding Co-Chair of the International Workshop on Organizational Behavior and Talent Analytics (OBTA) and the International Workshop on Talent and Management Computing (TMC), in conjunction with ACM SIGKDD.

Hui Xiong (Fellow, IEEE) received the Ph.D. degree from the University of Minnesota (UMN), Minneapolis, MN, USA.
He is currently a Full Professor with Rutgers, The State University of New Jersey, Newark, NJ, USA. He is an ACM Distinguished Scientist.
Dr. Xiong received the 2018 Ram Charan Management Practice Award as the Grand Prix Winner from the Harvard Business Review; the RBS Dean's Research Professorship in 2016 and the Rutgers University Board of Trustees Research Fellowship for Scholarly Excellence in 2009, both from Rutgers, The State University of New Jersey; the ICDM Best Research Paper Award in 2011; and the IEEE ICDM Outstanding Service Award in 2017. He is a Co-Editor-in-Chief of Encyclopedia of GIS and an Associate Editor of the IEEE TRANSACTIONS ON BIG DATA (TBD), the ACM Transactions on Knowledge Discovery from Data (TKDD), and the ACM Transactions on Management Information Systems (TMIS). He has served regularly on the organization and program committees of numerous conferences, including as a Program Co-Chair of the Industrial and Government Track for the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE 2013 International Conference on Data Mining (ICDM), and the Research Track for the 2018 ACM SIGKDD International Conference on KDD, and as a General Co-Chair of the IEEE 2015 ICDM.

Qing He received the B.S. degree in mathematics from Hebei Normal University, Shijiazhuang, China, in 1985, the M.S. degree in mathematics from Zhengzhou University, Zhengzhou, China, in 1987, and the Ph.D. degree in fuzzy mathematics and artificial intelligence from Beijing Normal University, Beijing, China, in 2000.
From 1987 to 1997, he was with the Hebei University of Science and Technology, Shijiazhuang. He is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, where he is also a Doctoral Tutor. He is also a Professor with the Graduate University of Chinese Academy of Sciences, Beijing. His research interests include data mining, machine learning, classification, and fuzzy clustering.
