
by JIALIN PAN

A Thesis Submitted to The Hong Kong University of Science and Technology in Partial Fulﬁllment of the Requirements for the Degree of Doctor of Philosophy in Computer Science and Engineering

September 2010, Hong Kong

Copyright © by Jialin Pan 2010

Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

JIALIN PAN


FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by JIALIN PAN

This is to certify that I have examined the above Ph.D. thesis and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the thesis examination committee have been made.

PROF. QIANG YANG, THESIS SUPERVISOR

PROF. SIU-WING CHENG, ACTING HEAD OF DEPARTMENT

Department of Computer Science and Engineering

17 September 2010

ACKNOWLEDGMENTS

First of all, I would like to express my thanks to my supervisor Prof. Qiang Yang sincerely and gratefully for his valuable advice and in-depth guidance throughout my Ph.D. study. In the past four years, I learned a lot from him. From him, I learned how to survey a field of interest, how to discover interesting research topics, how to write research papers and how to do presentations clearly. Meanwhile, I also learned that as a researcher, one needs to be always positive and active, and work harder and harder. His research passion made me become more and more interested in doing research on machine learning. What I learned from him would be great wealth in my future research career. I also thank him for his useful advice and information on my job hunting.

I am also very thankful to Prof. James Kwok, who I worked with closely at HKUST during the past four years. James deeply impressed me, from his passion for doing research to his sense for research ideas. I would also like to thank Prof. Dit-Yan Yeung, Prof. Brian Mak, Prof. Fugee Tsung and Prof. Raymond Wong for serving as committee members of my proposal and thesis defenses. Their valuable comments are of great help in my research work.

Special thanks goes to Dr. Rong Pan, who brought me to the field of machine learning with his generous help when I was still a master student in Sun Yat-Sen University. During my internship at Microsoft Research Asia, I got generous help from my mentor Dr. Jian-Tao Sun. I also got great help from Xiaochuan Ni, Dr. Zheng Chen, Dr. Weizhu Chen, Dr. Gang Wang, Dr. Jingdong Wang and Dr. Jian Xia. I thank them very much. In addition, I would like to express my special thanks to Prof. Carlos Guestrin and Prof. Andreas Krause, who hosted and supervised me during my visits to Carnegie Mellon University and California Institute of Technology, respectively. I thank them for their advice and discussion on my research work during my visits.

I am grateful to current and previous members in our research group, and to a number of colleagues at HKUST. They are Dr. Hong Chang, Dr. Ivor Tsang, Dr. Wu-Jun Li, Dr. Zhihua Zhang, Dr. Bingsheng He, Dr. Jieping Ye, Dr. Dou Shen, Dr. Jie Yin, Dr. Bin Cao, Dr. Nan Liu, Guang Dai, Junfeng Pan, Weike Pan, Qian Xu, Yu Zhang, Yin Zhu, Erheng Zhong, Yi Zhen, Hao Hu, Qi Wang, Yi Wang, Wei Xiang, Wenchen Zheng, Chak-Keung Chan, Pingzhong Tang, Si Shen and so on.

Last but not least, I would like to give my deepest gratitude to my parents, Yunyu Pan and Jinping Feng, who always encourage and support me when I feel depressed. More especially, I would like to give my great thanks to my wife, Zhiyun Zhao. Her kind help and support make everything I have possible. I dedicate this dissertation to my parents, my wife and my little son, Zhuoxuan Pan.

TABLE OF CONTENTS

Title Page
Authorization Page
Signature Page
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract

Chapter 1  Introduction
  1.1  The Contribution of This Thesis
  1.2  The Organization of This Thesis

Chapter 2  A Survey on Transfer Learning
  2.1  Overview
    2.1.1  A Brief History of Transfer Learning
    2.1.2  Notations and Definitions
    2.1.3  A Categorization of Transfer Learning Techniques
  2.2  Inductive Transfer Learning
    2.2.1  Transferring Knowledge of Instances
    2.2.2  Transferring Knowledge of Feature Representations
    2.2.3  Transferring Knowledge of Parameters
    2.2.4  Transferring Relational Knowledge
  2.3  Transductive Transfer Learning
    2.3.1  Transferring the Knowledge of Instances
    2.3.2  Transferring Knowledge of Feature Representations
  2.4  Unsupervised Transfer Learning
    2.4.1  Transferring Knowledge of Feature Representations
  2.5  Transfer Bounds and Negative Transfer
  2.6  Other Research Issues of Transfer Learning
  2.7  Real-world Applications of Transfer Learning

Chapter 3  Transfer Learning via Dimensionality Reduction
  3.1  Motivation
  3.2  Preliminaries
    3.2.1  Dimensionality Reduction
    3.2.2  Hilbert Space Embedding of Distributions
    3.2.3  Maximum Mean Discrepancy
    3.2.4  Dependence Measure
  3.3  A Novel Dimensionality Reduction Framework
    3.3.1  Minimizing Distance between P(ϕ(XS)) and P(ϕ(XT))
    3.3.2  Preserving Properties of XS and XT
    3.3.3  Make Predictions in Latent Space
  3.4  Maximum Mean Discrepancy Embedding (MMDE)
    3.4.1  Kernel Learning for Transfer Latent Space
    3.4.2  Summary
  3.5  Transfer Component Analysis (TCA)
    3.5.1  Parametric Kernel Map for Unseen Data
    3.5.2  Unsupervised Transfer Component Extraction
    3.5.3  Experiments on Synthetic Data
    3.5.4  Summary
  3.6  Semi-Supervised Transfer Component Analysis (SSTCA)
    3.6.1  Optimization Objectives
    3.6.2  Formulation and Optimization Procedure
    3.6.3  Experiments on Synthetic Data
    3.6.4  Summary
  3.7  Further Discussion

Chapter 4  Applications to WiFi Localization
  4.1  WiFi Localization and Cross-Domain WiFi Localization
  4.2  Experimental Setup
  4.3  Results
    4.3.1  Comparison with Dimensionality Reduction Methods
    4.3.2  Comparison with Non-Adaptive Methods
    4.3.3  Comparison with Domain Adaptation Methods
    4.3.4  Comparison with MMDE
    4.3.5  Sensitivity to Model Parameters
  4.4  Summary

Chapter 5  Applications to Text Classification
  5.1  Text Classification and Cross-domain Text Classification
  5.2  Experimental Setup
  5.3  Results
    5.3.1  Comparison to Other Methods
    5.3.2  Sensitivity to Model Parameters
  5.4  Summary

Chapter 6  Domain-Driven Feature Space Transfer for Sentiment Classification
  6.1  Sentiment Classification
  6.2  Existing Works in Cross-Domain Sentiment Classification
  6.3  Problem Statement and A Motivating Example
  6.4  Spectral Domain-Specific Feature Alignment
    6.4.1  Domain-Independent Feature Selection
    6.4.2  Bipartite Feature Graph Construction
    6.4.3  Spectral Feature Clustering
    6.4.4  Feature Augmentation
  6.5  Computational Issues
  6.6  Connection to Other Methods
  6.7  Experiments
    6.7.1  Experimental Setup
    6.7.2  Results
  6.8  Summary

Chapter 7  Conclusion and Future Work
  7.1  Conclusion
  7.2  Future Work

References

LIST OF FIGURES

1.1  Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods and received by different mobile devices. Different colors denote different signal strength values. Note that the original signal strength values are non-positive (the larger the stronger); here, we shift them to positive values for visualization.
1.2  The organization of thesis.
2.1  Different learning processes between traditional machine learning and transfer learning.
2.2  An overview of different settings of transfer.
3.1  Motivating examples for the dimensionality reduction framework: the direction with the largest variance is orthogonal to the discriminative direction; there exists an intrinsic manifold structure underlying the observed data.
3.2  Illustrations of the proposed TCA and SSTCA on synthetic dataset 1. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.3  Illustrations of the proposed TCA and SSTCA on synthetic dataset 2. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.4  Illustrations of the proposed TCA and SSTCA on synthetic dataset 3. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.5  Illustrations of the proposed TCA and SSTCA on synthetic dataset 4. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.6  Illustrations of the proposed TCA and SSTCA on synthetic dataset 5. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
4.1  An indoor wireless environment example.
4.2  Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods. Different colors denote different signal strength values (unit: dBm).
4.3  Comparison with dimensionality reduction methods.
4.4  Comparison with localization methods that do not perform domain adaptation.
4.5  Comparison with MMDE in the transductive setting on the WiFi data.
4.6  Comparison of TCA, SSTCA and the various baseline methods in the inductive setting on the WiFi data.
4.7  Sensitivity analysis of the TCA / SSTCA parameters on the WiFi data.
4.8  Training time with varying amount of unlabeled data for training.
5.1  Sensitivity analysis of the TCA / SSTCA parameters on the text data.
6.1  A bipartite graph example of domain-specific and domain-independent features.
6.2  Comparison results (unit: %) on two datasets.
6.3  Study on varying numbers of domain-independent features of SFA.
6.4  Model parameter sensitivity study of K on two datasets.
6.5  Model parameter sensitivity study of γ on two datasets.

LIST OF TABLES

1.1  Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. "+" denotes positive sentiment, and "-" denotes negative sentiment.
2.1  Relationship between traditional machine learning and transfer learning settings.
2.2  Different settings of transfer learning.
2.3  Different approaches to transfer learning.
2.4  Different approaches in different settings.
3.1  Summary of dimensionality reduction methods based on Hilbert space embedding of distributions.
4.1  An example of signal vectors (unit: dBm).
5.1  Summary of the six data sets constructed from the 20-Newsgroups data.
5.2  Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation).
5.3  Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation).
6.1  Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. Italic words are some domain-independent words, which occur frequently in both domains. "+" denotes positive sentiment, and "-" denotes negative sentiment.
6.2  Bag-of-words representations of electronics (E) and video games (V) reviews.
6.3  Ideal representations of domain-specific words. Only domain-specific features are considered.
6.4  A co-occurrence matrix of domain-specific and domain-independent words.
6.5  Summary of datasets for evaluation.
6.6  Experiments with different domain-independent feature selection methods. Numbers in the table are accuracies in percentage.

FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by JIALIN PAN

Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

ABSTRACT

Transfer learning is a new machine learning and data mining framework that allows the training and test data to come from different distributions and/or feature spaces. We can find many novel applications of machine learning and data mining where transfer learning is helpful, especially when we have limited labeled data in our domain of interest. In this thesis, we first survey the different settings and approaches of transfer learning and give a big picture of the field. We then focus on latent space learning for transfer learning, which aims at discovering a "good" common feature space across domains, such that knowledge transfer becomes possible.

More specifically, we propose a novel dimensionality reduction framework for transfer learning, which tries to reduce the distance between different domains while preserving data properties as much as possible. This framework is general for many transfer learning problems when domain knowledge is unavailable. Based on this framework, we propose three effective solutions to learn the latent space for transfer learning. We apply these methods to two diverse applications, cross-domain WiFi localization and cross-domain text classification, and achieve promising results.

Furthermore, for a specific application area, such as sentiment classification, where domain knowledge is available for encoding into transfer learning methods, we propose a spectral feature alignment algorithm for cross-domain learning. In this algorithm, we try to align domain-specific features from different domains by using some domain-independent features as a bridge. Experimental results show that this method outperforms a state-of-the-art algorithm on two real-world datasets for cross-domain sentiment classification.

CHAPTER 1

INTRODUCTION

Supervised data mining and machine learning technologies have already been widely studied and applied to many knowledge engineering areas. However, the performance of these algorithms relies heavily on collecting high-quality and sufficient labeled training data to train a statistical or computational model to make predictions on the future data [127, 131, 27, 34, 189]. In many real-world scenarios, labeled training data are in short supply or can only be obtained at expensive cost. This problem has become a major bottleneck of making machine learning and data mining methods more applicable in practice.

In the last decade, semi-supervised learning [233, 77, 90] techniques have been proposed to address the problem that the labeled training data may be too few to build a good classifier, by making use of a large amount of unlabeled data to discover a powerful structure together with a small amount of labeled data to train models. However, most semi-supervised methods require that the training data, including labeled and unlabeled data, and the test data are both from the same domain of interest, which implicitly assumes that the training and test data are still represented in the same feature space and drawn from the same data distribution.

Active learning, which is another branch in machine learning for reducing the annotation effort of supervised learning, instead of exploring unlabeled data to train a precise model, tries to design an active learner to pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator). The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns [101, 168]. Nevertheless, most active learning methods assume that there is a budget for the active learner to pose queries in the domain of interest. In some real-world applications, the budget may be quite limited, in which case active learning methods may not work well in learning accurate classifiers in the domain of interest.

Furthermore, most traditional supervised algorithms work well only under a common assumption: the training and test data are drawn from the same feature space and the same distribution. Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and testing to be different. The main idea behind transfer learning is to borrow labeled data or knowledge extracted from some related domains to help a machine learning algorithm achieve greater performance in the domain of interest [183]. Thus, compared to semi-supervised and active learning, transfer learning can be referred to as a different strategy for learning models with minimal human supervision. In the real world, we can observe many examples of transfer learning.

For example, learning to play the electronic organ may help facilitate learning the piano. Similarly, we may find that learning to recognize apples might help to recognize pears. Many examples in knowledge engineering can be found where transfer learning can truly be beneficial.

One example is Web document classification, where our goal is to classify a given Web document into several predefined categories. As an example in the area of Web-document classification (see, e.g., [49]), the labeled examples may be the university Web pages that are associated with category information obtained through previous manual-labeling efforts. For a classification task on a newly created Web site where the data features or data distributions may be different, there may be a lack of labeled training data. As a result, we may not be able to directly apply the Web-page classifiers learned on the university Web site to the new Web site. In such cases, it would be helpful if we could transfer the classification knowledge into the new domain.

The need for transfer learning may also arise when the data can be easily outdated. In such cases, the labeled data obtained in one time period may not follow the same distribution in a later time period. For example, in indoor WiFi localization problems, which aim to detect a user's current location based on previously collected WiFi data, it is very expensive to calibrate WiFi data for building localization models in a large-scale environment, because a user needs to label a large collection of WiFi signal data at each location. However, the WiFi signal-strength values may be a function of time, device or other dynamic factors. As shown in Figure 1.1, values of received signal strength (RSS) may differ across time periods and mobile devices. As a result, a model trained in one time period or on one device may cause the performance for location estimation in another time period or on another device to be reduced. To reduce the re-calibration effort, we might wish to adapt the localization model trained in one time period (the source domain) for a new time period (the target domain), or to adapt the localization model trained on a mobile device (the source domain) for a new mobile device (the target domain), as introduced in [142].

As a third example, transfer learning is also desirable when the features between domains change. Consider the problem of sentiment classification, where our task is to automatically classify the reviews on a product, such as a brand of camera, into polarity categories (e.g., positive or negative). For this classification task, it is expensive or impossible to collect sufficient training data to train a model for use in each domain of interest. In this case, knowledge transfer or transfer learning between tasks or domains becomes more desirable and crucial. It would be nice if one could reuse the training data which have been collected in some related domains/tasks, or the knowledge that is already extracted from some related domains/tasks, to learn a precise model for use in the domain of interest. In the literature, supervised learning algorithms [146] have proven to be promising and widely used in sentiment classification. However, these methods are domain dependent. The reason is that users may use domain-specific words to express sentiment in different domains. Table 1.1 shows several user review sentences from two domains: electronics and video games. In the electronics domain, we may use words like "compact" and "sharp" to express positive sentiment and use "blurry" to express negative sentiment, while in the video game domain, words like "hooked" and "realistic" indicate positive opinion and the word "boring" indicates negative opinion. Due to the mismatch among domain-specific words, a sentiment classifier trained in one domain may not work well when directly applied to other domains. Thus, cross-domain sentiment classification algorithms are highly desirable to reduce domain dependency and manual labeling cost by transferring knowledge from related domains to the domain of interest [25].

Figure 1.1: Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods and received by different mobile devices. Different colors denote different signal strength values. (a) WiFi RSS received by device A in T1; (b) WiFi RSS received by device A in T2; (c) WiFi RSS received by device B in T1; (d) WiFi RSS received by device B in T2.

Table 1.1: Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. "+" denotes positive sentiment, and "-" denotes negative sentiment.

electronics:
  (+) Compact; easy to operate; very good picture quality; looks sharp!
  (+) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
  (-) It is also quite blurry in very dark settings. I will never buy HP again.

video games:
  (+) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
  (+) Very realistic shooting action and good plots. We played this and were hooked.
  (-) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

1.1 The Contribution of This Thesis

Generally speaking, in transfer learning, we have the following three main research issues: (1) What to transfer; (2) How to transfer; (3) When to transfer [141]. "What to transfer" asks which part of knowledge can be transferred across domains or tasks. Some knowledge is specific for individual domains or tasks, and some knowledge may be common between different domains such that it may help improve performance for the target domain or task. After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge, which corresponds to the "how to transfer" issue. "When to transfer" asks in which situations transferring skills should be done. Likewise, we are interested in knowing in which situations knowledge should not be transferred. In some situations, when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer.

In this thesis, transfer learning can be categorized into three settings: inductive transfer, transductive transfer and unsupervised transfer, a categorization that is first described in our survey article [141] and will be introduced in detail in Chapter 2. We focus on the transductive transfer learning setting, which will be introduced in detail in Chapter 2 as well, where we are given a lot of labeled data in a source domain and some unlabeled data in a target domain. Note that in this setting, no labeled data in the target domain are available for training. Our goal is to learn an accurate model for use in the target domain. Furthermore, we focus on "What to transfer" and "How to transfer" by implicitly assuming that the source and target domains are related to each other. We leave the issue of how to avoid negative transfer to our future work.

For "What to transfer", we propose to discover a latent feature space for transfer learning. In most application areas, such as text classification or WiFi localization, the domain knowledge is hidden. For example, text data may be controlled by some latent topics, and WiFi data may be controlled by some hidden factors, such as the structure of a building. In this case, the latent space can be treated as a bridge across domains to make knowledge transfer possible and successful.

For "How to transfer", we propose two embedding learning frameworks to learn the latent space based on two different situations: (1) domain knowledge is hidden or hard to capture, and (2) domain knowledge can be observed or easy to encode in embedding learning.

For the first situation, we propose a novel and general dimensionality reduction framework for transfer learning. Our framework aims to learn a latent space underlying different domains, such that the distance between data distributions can be dramatically reduced and the original data properties, such as variance and local geometric structure, can be preserved as much as possible. When data from different domains are projected onto the latent space, the distance between domains can then be reduced and the important information of the original data can be preserved simultaneously, so that standard machine learning and data mining methods can be applied directly in the latent space to train models for making predictions on the target domain data. Based on this framework, we propose three different algorithms to learn the latent space: Maximum Mean Discrepancy Embedding (MMDE) [135], Transfer Component Analysis (TCA) [139] and Semi-supervised Transfer Component Analysis (SSTCA) [140]. More specifically, in MMDE we translate the latent space learning for transfer learning to a non-parametric kernel matrix learning problem. The resultant kernel in a free form may be more precise for transfer learning, but suffers from expensive computational cost. Thus, in TCA and SSTCA, we propose to learn parametric kernel based embeddings for transfer learning instead. The main difference between TCA and SSTCA is that TCA is an unsupervised feature extraction method while SSTCA is a semi-supervised feature extraction method. We apply these three algorithms to two diverse application areas: wireless sensor networks and Web mining.

In contrast to the general framework for transfer learning without domain knowledge, in some application areas, such as sentiment classification, some domain knowledge can be observed and used for learning the latent space across domains. For example, in sentiment classification, though users may use some domain-specific words as shown in Table 1.1, they may also use some domain-independent sentiment words, such as "good", "never buy", etc. Some domain-specific and domain-independent words may co-occur in reviews frequently, which means there may be a correlation between these words. This observation motivates us to propose a spectral feature clustering framework [137] to align domain-specific words from different domains in a latent space, by modeling the correlation between the domain-independent and domain-specific words in a bipartite graph and using the domain-independent features as a bridge for cross-domain sentiment classification.
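The notion of "distance between data distributions" that this framework minimizes can be made concrete with the Maximum Mean Discrepancy (MMD), the criterion behind MMDE and TCA. The sketch below is a generic, hedged illustration rather than the thesis's actual implementation: it computes a biased empirical estimate of the squared MMD between a source and a target sample under an RBF kernel, with the bandwidth `gamma` and the sample sizes chosen arbitrarily for demonstration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel values k(a, b) = exp(-gamma * ||a - b||^2).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased empirical estimate of squared MMD between the source
    # sample Xs and the target sample Xt:
    #   mean k(s, s') + mean k(t, t') - 2 * mean k(s, t).
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. a mean-shifted one.
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
assert shifted > same  # the shifted pair is farther apart under MMD
```

A latent mapping that drives this quantity toward zero on projected source and target data, while preserving properties such as variance, is exactly what the TCA-style methods in this thesis search for.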

In summary, in this thesis we study the problem of feature-based transfer learning and its real-world applications, such as WiFi localization, text classification and sentiment classification. Note that there has been a large amount of work on transfer learning for reinforcement learning in the machine learning literature (e.g., a recent survey article [182]). However, in this thesis, we only focus on transfer learning for classification and regression tasks that are related more closely to machine learning and data mining tasks. The main contributions of this thesis can be summarized as follows.

• We give a comprehensive survey on transfer learning, where we give some definitions of transfer learning, summarize transfer learning into three settings, categorize transfer learning approaches into four contexts, analyze the relationship between transfer learning and other related areas, discuss some interesting research issues in transfer learning and introduce some applications of transfer learning. Other researchers may get a big picture of transfer learning by reading the survey.

• We propose a general dimensionality reduction framework for transfer learning without any domain knowledge. Based on the framework, we propose three solutions to learn the latent space for transfer learning: Maximum Mean Discrepancy Embedding (MMDE) (Chapter 3.4), Transfer Component Analysis (TCA) (Chapter 3.5) and Semi-supervised Transfer Component Analysis (SSTCA) (Chapter 3.6). Furthermore, we apply them to solve the WiFi localization and text classification problems and achieve promising results.

• We propose a specific latent space learning method for sentiment classification, which encodes the domain knowledge in a spectral feature alignment framework. The proposed method outperforms a state-of-the-art cross-domain method in the field of sentiment classification.

1.2 The Organization of This Thesis

The organization of this thesis is shown in Figure 1.2. In Chapter 2, we survey the field of transfer learning, where we summarize different transfer learning settings and approaches and discuss the relationship between transfer learning and other related areas. In Chapter 3, based on the different situations of whether domain knowledge is available, we propose two feature space learning frameworks for transfer learning. The first framework is focused on the situation that domain knowledge is hidden and hard to encode in embedding learning: we present our proposed general dimensionality reduction framework and the three proposed algorithms, and then apply these three methods to two diverse applications, WiFi localization (Chapter 4) and text classification (Chapter 5). The other framework is focused on the situation that domain knowledge is explicit and easy to be encoded in feature space learning: in Chapter 6, we present a spectral feature alignment (SFA) algorithm for sentiment classification across domains. Finally, we conclude this thesis and discuss some thoughts on future work in Chapter 7.

Figure 1.2: The organization of this thesis.

CHAPTER 2

A SURVEY ON TRANSFER LEARNING

In this chapter, we give a comprehensive survey of transfer learning for classification, regression and clustering developed in machine learning and data mining areas, including both methodologies and their real-world applications. In fact, this chapter originated as our survey article [141], which is the first survey in the field of transfer learning. Compared to the previous survey article, in this chapter, we add some up-to-date materials on transfer learning.

2.1 Overview

2.1.1 A Brief History of Transfer Learning

The study of transfer learning is motivated by the fact that people can intelligently apply knowledge learned previously to solve new problems faster or with better solutions [61]. The fundamental motivation for transfer learning in the field of machine learning was discussed in a NIPS-95 workshop on "Learning to Learn"¹, which focused on the need for lifelong machine-learning methods that retain and reuse previously learned knowledge.

Research on transfer learning has attracted more and more attention since 1995 under different names: learning to learn, life-long learning, knowledge transfer, inductive transfer, multi-task learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, meta learning, and incremental/cumulative learning [183]. Among these, a closely related learning technique to transfer learning is the multi-task learning framework [31], which tries to learn multiple tasks simultaneously even when they are different. A typical approach for multi-task learning is to uncover the common (latent) features that can benefit each individual task.

In 2005, the Broad Agency Announcement (BAA) 05-29 of the Defense Advanced Research Projects Agency (DARPA)'s Information Processing Technology Office (IPTO)² gave a new mission of transfer learning: the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks. In this definition, transfer learning aims to extract the knowledge from one or more source tasks and apply it to a target task. In the years after the NIPS-05 workshop, research on transfer learning has continued to attract increasing attention. In contrast to multi-task learning, rather than learning all of the source and target tasks simultaneously, transfer learning cares most about the target task.

¹ http://socrates.acadiau.ca/courses/comp/dsilver/NIPS95_LTL/transfer.workshop.1995.html
² http://www.darpa.mil/ipto/programs/tl/tl.asp

The roles of the source and target tasks are no longer symmetric in transfer learning.

Figure 2.1 shows the difference between the learning processes of traditional machine learning and transfer learning techniques. As we can see, traditional machine learning techniques try to learn each task from scratch, while transfer learning techniques try to transfer the knowledge from some previous tasks to a target task when the latter has fewer high-quality training data.

(a) Traditional Machine Learning  (b) Transfer Learning

Figure 2.1: Different learning processes between traditional machine learning and transfer learning

Today, transfer learning methods appear in several top venues, most notably in data mining (ACM International Conference on Knowledge Discovery and Data Mining (KDD), IEEE International Conference on Data Mining (ICDM) and European Conference on Knowledge Discovery in Databases (PKDD), for example), machine learning (International Conference on Machine Learning (ICML), Annual Conference on Neural Information Processing Systems (NIPS) and European Conference on Machine Learning (ECML), for example), artificial intelligence (AAAI Conference on Artificial Intelligence (AAAI) and International Joint Conference on Artificial Intelligence (IJCAI), for example) and applications (Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), International World Wide Web Conference (WWW), ACM International Conference on Multimedia (MM), ACM International Conference on Ubiquitous Computing (UBICOMP), IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision (ICCV), European Conference on Computer Vision (ECCV), Conference on Empirical Methods in Natural Language Processing (EMNLP), Joint Conference of the 47th Annual Meeting of the Association of Computational Linguistics (ACL) and the International Conference on Computational Linguistics (COLING), Annual International Conference on Research in Computational Molecular Biology (RECOMB) and Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), for example)³.

2.1.2 Notations and Definitions

First of all, we describe the notations and definitions used in this chapter. Before we give different categorizations of transfer learning, we first give the definitions of a "domain" and a "task".

A domain D consists of two components: a feature space X and a marginal probability distribution P(X), where X = {x1, ..., xn} ∈ X. For example, if our learning task is document classification, and each term is taken as a binary feature, then X is the space of all term vectors, xi is the i-th term vector corresponding to some document, and X is a particular learning sample. In general, if two domains are different, then they may have different feature spaces or different marginal probability distributions.

Given a specific domain, D = {X, P(X)}, a task T consists of two components: a label space Y and an objective predictive function f(·) (denoted by T = {Y, f(·)}), which is not observed but can be learned from the training data, which consist of pairs {xi, yi}, where xi ∈ X and yi ∈ Y. The function f(·) can be used to predict the corresponding label, f(x), of a new instance x. From a probabilistic viewpoint, f(x) can be written as P(y|x). In our document classification example, Y is the set of all labels, which is {True, False} for a binary classification task, and yi is "True" or "False".

For simplicity, in this chapter, we only consider the case where there is one source domain DS and one target domain DT, as this is by far the most popular setting among the research works in the literature. The issue of transfer learning from multiple source domains will be discussed in Section 2.6. More specifically, we denote DS = {(xS1, yS1), ..., (xSnS, ySnS)} the source domain data, where xSi ∈ XS is the data instance and ySi ∈ YS is the corresponding class label. In our document classification example, DS can be a set of term vectors together with their associated True or False class labels. Similarly, we denote the target domain data as DT = {(xT1, yT1), ..., (xTnT, yTnT)}, where the input xTi is in XT and yTi ∈ YT is the corresponding output. In most cases, 0 ≤ nT ≪ nS.

We now give a unified definition of transfer learning.

Definition 1. (Transfer Learning) Given a source domain DS and learning task TS, a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT, or TS ≠ TT.

In the above definition, a domain is a pair D = {X, P(X)}. Thus the condition DS ≠ DT implies that either XS ≠ XT or PS(X) ≠ PT(X).

³ We summarize a list of transfer learning papers published in these few years and a list of workshops related to transfer learning at the following URL for reference: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm
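As a toy illustration of these definitions, a domain can be represented in code as a feature space plus an empirical marginal distribution over observed samples, so that the condition DS ≠ DT becomes a concrete check. The class and function names below are ours, not notation from the text; the marginal is approximated crudely by sample counts:

```python
from dataclasses import dataclass
from typing import FrozenSet
from collections import Counter

@dataclass
class Domain:
    """D = {X, P(X)}: a feature space plus an (empirical) marginal distribution.
    Here the feature space is a term vocabulary, and P(X) is approximated by
    counting the observed term sets (documents)."""
    feature_space: FrozenSet[str]
    samples: list  # observed documents, each a set of terms

def domains_differ(ds: Domain, dt: Domain) -> bool:
    """DS != DT iff the feature spaces differ or the empirical marginals differ."""
    if ds.feature_space != dt.feature_space:
        return True  # e.g., documents written in different languages
    # same feature space: compare the empirical marginal distributions P(X)
    return Counter(map(frozenset, ds.samples)) != Counter(map(frozenset, dt.samples))
```

Under this toy encoding, two document sets in different languages differ in feature space (case XS ≠ XT), while two sets over the same vocabulary but with different term-vector frequencies differ in the marginal (case PS(X) ≠ PT(X)).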

Similarly, a task is defined as a pair T = {Y, P(Y|X)}. Thus the condition TS ≠ TT implies that either YS ≠ YT or P(YS|XS) ≠ P(YT|XT), where YSi ∈ YS and YTi ∈ YT. When the target and source domains are the same, i.e., DS = DT, and their learning tasks are the same, i.e., TS = TT, the learning problem becomes a traditional machine learning problem.

When the domains are different, then either (1) the feature spaces between the domains are different, i.e., XS ≠ XT, or (2) the feature spaces between the domains are the same but the marginal probability distributions between the domain data are different, i.e., P(XS) ≠ P(XT), where XSi ∈ XS and XTi ∈ XT. In our document classification example, case (1) corresponds to when the two sets of documents are described in different languages, and case (2) may correspond to when the source domain documents and the target domain documents focus on different topics.

Similarly, when the learning tasks TS and TT are different, then either (1) the label spaces between the domains are different, i.e., YS ≠ YT, or (2) the conditional probability distributions between the domains are different, i.e., P(YS|XS) ≠ P(YT|XT). In our document classification example, case (1) corresponds to the situation where the source domain has binary document classes, whereas the target domain has ten classes to classify the documents to. Case (2) corresponds to the situation where the document classes are defined subjectively, such as by tagging: different users may define different tags for the same document, resulting in P(Y|X) changing across different users.

In addition, most transfer learning methods introduced in the following sections assume that the source and target domains or tasks are related, that is, there exists some relationship, explicit or implicit, between the two domains or tasks. Note that it is hard to define the term "relationship" mathematically. In our document classification example, the task of classifying documents into the categories {book, laptop} may be related to the task of classifying documents into the categories {book, desktop}. This is because, from a semantic point of view, the terms "laptop" and "desktop" are close to each other. Similarly, if two domains are related, this means that between a source document set and a target document set, either the term features are different between the two sets (e.g., they use different languages), or their marginal distributions are different. How to measure the relatedness between domains or tasks is an important research issue in transfer learning, which will be discussed in Chapter 2.5.

2.1.3 A Categorization of Transfer Learning Techniques

In transfer learning, we have the following three main research issues: (1) What to transfer; (2) How to transfer; (3) When to transfer.

"What to transfer" asks which part of knowledge can be transferred across domains or tasks. Some knowledge is specific for individual domains or tasks, and some knowledge may be common between different domains such that it may help improve performance for the target domain or task. After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge, which corresponds to the "how to transfer" issue.

"When to transfer" asks in which situations transferring skills should be done. Likewise, we are interested in knowing in which situations knowledge should not be transferred. In some situations, when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer. Most current work on transfer learning focuses on "What to transfer" and "How to transfer", by implicitly assuming that the source and target domains are related to each other. However, how to avoid negative transfer is an important open issue that is attracting more and more attention.

Based on the definition of transfer learning, we summarize the relationship between traditional machine learning and various transfer learning settings in Table 2.1, where we categorize transfer learning under three settings: inductive transfer learning, transductive transfer learning and unsupervised transfer learning, based on different situations between the source and target domains and tasks.

Table 2.1: Relationship between traditional machine learning and transfer learning settings

Learning Settings                                 | Source & Target Domains | Source & Target Tasks
Traditional Machine Learning                      | the same                | the same
Transfer Learning / Inductive Transfer Learning   | the same                | different but related
Transfer Learning / Unsupervised Transfer Learning| different but related   | different but related
Transfer Learning / Transductive Transfer Learning| different but related   | the same

1. In the inductive transfer learning setting, the target task is different from the source task, no matter whether the source and target domains are the same or not. In this case, some labeled data in the target domain are required to induce an objective predictive model fT(·) for use in the target domain. In addition, according to different situations of labeled and unlabeled data in the source domain, we can further categorize the inductive transfer learning setting into two cases:

(1.1) A lot of labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the multi-task learning setting [31]. However, the inductive transfer learning setting only aims at achieving high performance in the target task by transferring knowledge from the source task, while multi-task learning tries to learn the target and source tasks simultaneously.

(1.2) No labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the self-taught learning setting, which was first proposed by Raina et al. [150]. In the self-taught learning setting, the label spaces between the source and target domains may be different, which implies that the side information of the source domain cannot be used directly. Thus, it is similar to the inductive transfer learning setting where the labeled data in the source domain are unavailable.

2. In the transductive transfer learning setting, the source and target tasks are the same, while the source and target domains are different. In this situation, no labeled data in the target domain are available while a lot of labeled data in the source domain are available. In addition, according to different situations between the source and target domains, we can further categorize the transductive transfer learning setting into two cases:

(2.1) The feature spaces between the source and target domains are different, XS ≠ XT.

(2.2) The feature spaces between the domains are the same, XS = XT, but the marginal probability distributions of the input data are different, P(XS) ≠ P(XT).

The latter case of the transductive transfer learning setting is related to domain adaptation for knowledge transfer in Natural Language Processing (NLP) [23, 84, 85] and to sample selection bias [219] or co-variate shift [170], whose assumptions are similar.

3. Finally, in the unsupervised transfer learning setting, similar to the inductive transfer learning setting, the target task is different from but related to the source task. However, the unsupervised transfer learning setting focuses on solving unsupervised learning tasks in the target domain, such as clustering, dimensionality reduction and density estimation [50]. In this case, there are no labeled data available in either the source or the target domain in training.

The relationship between the different settings of transfer learning and the related areas is summarized in Table 2.2 and Figure 2.2.

Approaches to transfer learning in the above three different settings can be summarized into four contexts based on "What to transfer". The first context can be referred to as the instance-based transfer learning (or instance-transfer) approach (see, e.g., [219, 49, 87, 82, 148, 197] for example), which assumes that certain parts of the data in the source domain can be reused for learning in the target domain by re-weighting. Instance re-weighting and importance sampling are two major techniques in this context.

A second case can be referred to as the feature-representation-transfer approach (see, e.g., [31, 4, 6, 8, 46, 47, 231] for example). The intuitive idea behind this case is to learn a "good" feature representation for the target domain. In this case, the knowledge used to transfer across domains is encoded into the learned feature representation.
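For the covariate-shift case (2.2), instance re-weighting is often built on estimating the density ratio PT(x)/PS(x) and using it as an importance weight when training on source data. The snippet below uses one standard generic trick, a logistic "domain classifier" trained to separate target from source samples, whose odds (corrected for sample-size imbalance) estimate the ratio; it is an illustrative sketch, not one of the specific estimators cited in the text:

```python
import numpy as np

def importance_weights(Xs, Xt, lr=0.1, steps=2000):
    """Estimate w(x) = P_T(x) / P_S(x) by training a logistic domain classifier
    (target = 1, source = 0) with plain gradient descent; the class-prior-corrected
    odds at the source points are the importance weights."""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias feature
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        theta -= lr * Xb.T @ (p - d) / len(d)  # mean logistic-loss gradient
    # odds of "target" at the source points, corrected for sample-size imbalance
    Xsb = np.hstack([Xs, np.ones((len(Xs), 1))])
    p = 1.0 / (1.0 + np.exp(-Xsb @ theta))
    return (p / (1.0 - p)) * (len(Xs) / len(Xt))
```

A source instance that looks "target-like" gets a weight above 1 and counts more in the weighted source-domain training loss; source instances in regions the target never visits are weighted down toward zero.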

With the new feature representation, the performance of the target task is expected to improve significantly.

A third case can be referred to as the parameter-transfer approach (see, e.g., [97, 165, 63, 62, 28, 30] for example), which assumes that the source tasks and the target tasks share some parameters or prior distributions of the hyper-parameters of the models. The transferred knowledge is encoded into the shared parameters or priors. Thus, by discovering the shared parameters or priors, knowledge can be transferred across tasks.

Finally, the last case can be referred to as the relational-knowledge-transfer problem [124], which deals with transfer learning for relational domains. The basic assumption behind this context is that some relationships among the data in the source and target domains are similar. Thus, the knowledge to be transferred is the relationship among the data.

Table 2.2: Different settings of transfer learning

Transfer Learning Settings     | Related Areas                                              | Source Domain Labels | Target Domain Labels | Tasks
Inductive Transfer Learning    | Multi-task Learning                                        | Available            | Available            | Regression, Classification
Inductive Transfer Learning    | Self-taught Learning                                       | Unavailable          | Available            | Regression, Classification
Transductive Transfer Learning | Domain Adaptation, Sample Selection Bias, Co-variate Shift | Available            | Unavailable          | Regression, Classification
Unsupervised Transfer Learning |                                                            | Unavailable          | Unavailable          | Clustering, Dimensionality Reduction

Figure 2.2: An overview of different settings of transfer learning

Recently, statistical relational learning techniques dominate this context [125, 52].

Table 2.3 gives a brief description of the four transfer learning approaches.

Table 2.3: Different approaches to transfer learning

Approaches                      | Brief Description
Instance-transfer               | To re-weight some labeled data in the source domain for use in the target domain (see, e.g., [219, 49, 87, 147] for example).
Feature-representation-transfer | Find a "good" feature representation that reduces the difference between the source and the target domains and the error of classification and regression models (see, e.g., [31, 4, 6, 8, 37, 150, 231] for example).
Parameter-transfer              | Discover shared parameters or priors between the source domain and target domain models, which can benefit transfer learning (see, e.g., [97, 165, 63, 62, 28, 30] for example).
Relational-knowledge-transfer   | Build a mapping of relational knowledge between the source domain and the target domain. Both domains are relational domains and the i.i.d. assumption is relaxed in each domain (see, e.g., [124, 125, 52] for example).

Table 2.4 shows the cases where the different approaches are used in each transfer learning setting. We can see that the inductive transfer learning setting has been studied in many research works, while the unsupervised transfer learning setting is a relatively new research topic and has only been studied in the context of the feature-representation-transfer case. In addition, the feature-representation-transfer problem has been proposed for all three settings of transfer learning, whereas the parameter-transfer and the relational-knowledge-transfer approaches have only been studied in the inductive transfer learning setting, which we discuss in detail in the following chapters.

Table 2.4: Different approaches in different settings

Approaches                      | Inductive Transfer Learning | Transductive Transfer Learning | Unsupervised Transfer Learning
Instance-transfer               | √                           | √                              |
Feature-representation-transfer | √                           | √                              | √
Parameter-transfer              | √                           |                                |
Relational-knowledge-transfer   | √                           |                                |

2.2 Inductive Transfer Learning

Definition 2. (Inductive Transfer Learning) Given a source domain DS and a learning task TS, a target domain DT and a learning task TT, inductive transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where TS ≠ TT.

As mentioned in Chapter 2.1, this setting has two cases: (1) labeled data in the source domain are available; (2) labeled data in the source domain are unavailable while unlabeled data in the source domain are available. Most transfer learning approaches in this setting focus on the former case. Based on the above definition of the inductive transfer learning setting, a few labeled data in the target domain are required as the training data to induce the target predictive function.

2.2.1 Transferring Knowledge of Instances

The instance-transfer approach to the inductive transfer learning setting is intuitively appealing: although the source domain data cannot be reused directly, there are certain parts of the data that can still be reused together with a few labeled data in the target domain. Dai et al. [49] proposed a boosting algorithm, TrAdaBoost, which is an extension of the AdaBoost [66] algorithm, to address classification problems in the inductive transfer learning setting. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels, but that the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that, due to the difference in distributions between the source and the target domains, some of the source domain data may be useful in learning for the target domain, but some of them may not be, and could even be harmful. TrAdaBoost attempts to iteratively re-weight the source domain data to reduce the effect of the "bad" source data while encouraging the "good" source data to contribute more to the target domain. In detail, for each round of iteration, TrAdaBoost trains the base classifier on the weighted source and target data. The error is calculated only on the target data. Furthermore, TrAdaBoost uses the same strategy as AdaBoost to update the incorrectly classified examples in the target domain, while using a different strategy from AdaBoost to update the incorrectly classified source examples in the source domain. Theoretical analysis of TrAdaBoost is also presented in [49].

More recently, Pardoe and Stone [147] presented a two-stage algorithm, TrAdaBoost.R2, to extend the TrAdaBoost algorithm to regression problems. The idea is to apply the techniques which have been proposed for modifying AdaBoost for regression [56] to TrAdaBoost. Furthermore, to avoid the overfitting problem in TrAdaBoost, Pardoe and Stone proposed to adjust the weights of the data in two stages. In the first stage, only the weights of the source domain data are adjusted downwards gradually until reaching a certain point. Then, in the second stage, the weights of the source domain data are fixed while the weights of the target domain data are updated as normal in the TrAdaBoost algorithm.

Besides modifying boosting algorithms for transfer learning, Jiang and Zhai [87] proposed a heuristic method to remove "misleading" training examples from the source domain based on the difference between the conditional probabilities P(yT|xT) and P(yS|xS).
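The TrAdaBoost re-weighting loop described above can be sketched as follows. This is a minimal illustration: the 1-D decision-stump base learner, the function names, and the data conventions are our own choices, not details from [49]; only the weighting scheme (fixed shrinkage factor for misclassified source points, AdaBoost-style growth for misclassified target points, error measured on the target portion, final vote over the second half of the rounds) follows the algorithm described in the text:

```python
import numpy as np

def stump_train(X, y, w):
    # Weighted decision stump on 1-D data: pick the threshold/polarity pair
    # minimizing the weighted 0/1 error.
    best = None
    for thr in np.unique(X):
        for pol in (1, -1):
            pred = (pol * (X - thr) > 0).astype(int)
            err = np.sum(w * (pred != y))
            if best is None or err < best[0]:
                best = (err, (thr, pol))
    return best[1]

def stump_predict(model, X):
    thr, pol = model
    return (pol * (X - thr) > 0).astype(int)

def tradaboost(Xs, ys, Xt, yt, n_rounds=10):
    """Sketch of TrAdaBoost: source weights shrink by a fixed factor on mistakes,
    target weights grow as in AdaBoost; per-round error uses target data only."""
    n = len(Xs)
    X, y = np.concatenate([Xs, Xt]), np.concatenate([ys, yt])
    w = np.ones(len(X))
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / n_rounds))
    models, betas = [], []
    for _ in range(n_rounds):
        p = w / w.sum()
        model = stump_train(X, y, p)
        miss = (stump_predict(model, X) != y).astype(float)
        eps = np.sum(p[n:] * miss[n:]) / np.sum(p[n:])   # target-only error
        eps = min(max(eps, 1e-10), 0.49)
        beta_t = eps / (1.0 - eps)
        w[:n] *= beta_src ** miss[:n]    # down-weight misclassified source points
        w[n:] *= beta_t ** (-miss[n:])   # up-weight misclassified target points
        models.append(model)
        betas.append(beta_t)
    def predict(Xq):
        # Weighted majority vote over the second half of the rounds.
        half = len(models) // 2
        score = sum(np.log(1.0 / b) * stump_predict(m, Xq)
                    for m, b in zip(models[half:], betas[half:]))
        thresh = 0.5 * sum(np.log(1.0 / b) for b in betas[half:])
        return (score > thresh).astype(int)
    return predict, w
```

On toy data where the source concept is slightly shifted from the target concept, the source instances that contradict the target labels end up with visibly smaller weights than the consistent ones.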

Gao et al. [67] proposed a graph-based locally weighted ensemble framework to combine multiple models for transfer learning. The idea is to assign weights to various models dynamically based on local structures in a graph; the weighted models are then used to make predictions on test data. In addition, Wu and Dietterich [202] integrated the source domain (auxiliary) data in a Support Vector Machine (SVM) [189] framework for improving the classification performance in the target domain.

2.2.2 Transferring Knowledge of Feature Representations

The feature-representation-transfer approach to the inductive transfer learning problem aims at finding "good" feature representations to minimize domain divergence and classification or regression model error. Strategies to find "good" feature representations are different for different types of source domain data. If a lot of labeled data in the source domain are available, supervised learning methods can be used to construct a feature representation. This is similar to common feature learning in the field of multi-task learning [31]. If no labeled data in the source domain are available, unsupervised learning methods are proposed to construct the feature representation.

Supervised Feature Construction

Supervised feature construction methods for the inductive transfer learning setting are similar to those used in multi-task learning. The basic idea is to learn a low-dimensional representation that is shared across related tasks. In addition, the learned new representation can reduce the classification or regression model error of each task as well. Argyriou et al. [6] proposed a sparse feature learning method for multi-task learning. In the inductive transfer learning setting, the common features can be learned by solving an optimization problem, given as follows:

    arg min_{A,U}  Σ_{t∈{T,S}} Σ_{i=1}^{nt} L(yti, ⟨at, Uᵀxti⟩) + γ‖A‖²_{2,1}        (2.1)
    s.t.  U ∈ O^d

In this equation, S and T denote the tasks in the source domain and target domain, respectively, A = [aS, aT] ∈ R^{d×2} is a matrix of parameters, and U is a d × d orthogonal matrix (mapping function) for mapping the original high-dimensional data to low-dimensional representations. The (r, p)-norm of A is defined as ‖A‖_{r,p} := (Σ_{i=1}^{d} ‖aⁱ‖_r^p)^{1/p}. The optimization problem (2.1) estimates the low-dimensional representations UᵀXT, UᵀXS and the parameters, A, of the model at the same time. The optimization problem (2.1) can be further transformed into an equivalent convex optimization formulation and solved efficiently.
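A drastically simplified sketch of the shared-representation idea behind (2.1): fit per-task weights, extract a shared low-dimensional map U from the stacked task weights via SVD, then re-fit the task parameters at in the shared space. This one-shot heuristic (with square loss and a ridge penalty standing in for the (2,1)-norm) only mirrors the structure of the problem; it is not the convex algorithm of [6], and all names are ours:

```python
import numpy as np

def shared_feature_learning(Xs, ys, Xt, yt, k=1, lam=0.1):
    """Heuristic shared-representation learning for two tasks:
    1) per-task ridge weights in the original d-dimensional space,
    2) shared d x k map U = top-k left singular vectors of the stacked weights,
    3) re-fit low-dimensional task parameters a_t on the mapped features X @ U."""
    def ridge(X, y, lam):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    W = np.column_stack([ridge(Xs, ys, lam), ridge(Xt, yt, lam)])  # d x 2
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    U = U[:, :k]                       # shared mapping (analogue of U in (2.1))
    a_s = ridge(Xs @ U, ys, lam)       # task parameters in the shared space
    a_t = ridge(Xt @ U, yt, lam)       # (analogues of a_S, a_T)
    return U, a_s, a_t
```

When the two tasks really do depend on a common latent direction, U recovers it from the source task's plentiful data, and the target task only has to estimate the k low-dimensional coefficients a_t.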

In a follow-up work [8], Argyriou et al. proposed a spectral regularization framework on matrices for multi-task structure learning.

Another famous common feature learning method for multi-task learning is the alternating structure optimization (ASO) algorithm, proposed by Ando and Zhang [4]. In ASO, a linear classifier is first trained for each of the multiple tasks. Then the weight vectors of the multiple classifiers are used to construct a predictor space, and Singular Value Decomposition (SVD) is applied on this space to recover a low-dimensional predictive space as a common feature space underlying the multiple tasks. The ASO algorithm has been applied successfully to several applications [25, 5]. However, it is non-convex and is not guaranteed to find a global optimum. More recently, Chen et al. [37] presented an improved formulation (iASO) based on ASO by proposing a novel regularizer. Furthermore, in order to convert the new formulation into a convex formulation, Chen et al. proposed a convex alternating structure optimization (cASO) algorithm to solve the optimization problem.

In addition, Jebara [84] proposed to select features for multi-task learning with SVMs. Ruckert et al. [160] designed a kernel-based approach to inductive transfer, which aims at finding a suitable kernel for the target data. Lee et al. [99] proposed a convex optimization algorithm for simultaneously learning meta-priors and feature weights from an ensemble of related prediction tasks. The meta-priors can be transferred among different tasks.

Unsupervised Feature Construction

In [150], Raina et al. proposed to apply sparse coding [98], which is an unsupervised feature construction method, for learning higher-level features for transfer learning. The basic idea of this approach consists of two steps. In the first step, higher-level basis vectors b = {b1, b2, ..., bs} are learned on the source domain data by solving the optimization problem (2.2), shown as follows:

    min_{a,b}  Σ_i ‖xSi − Σ_j aSi^j bj‖²₂ + β‖aSi‖₁        (2.2)
    s.t.  ‖bj‖₂ ≤ 1,  ∀ j ∈ 1, ..., s

In this equation, aSi^j is a new representation of basis bj for input xSi, and β is a coefficient to balance the feature construction term and the regularization term. After learning the basis vectors b, in the second step, the optimization problem (2.3) is applied on the target domain data to learn higher-level features based on the basis vectors b:

    a*Ti = arg min_{aTi}  ‖xTi − Σ_j aTi^j bj‖²₂ + β‖aTi‖₁        (2.3)

Finally, discriminative algorithms can be applied to the {a*Ti}'s with their corresponding labels to train classification or regression models for use in the target domain. One drawback of this method is that the so-called higher-level basis vectors learned on the source domain in the optimization problem (2.2) may not be suitable for use in the target domain.
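The two steps (2.2) and (2.3) can be sketched with a basic alternating scheme: ISTA soft-thresholding for the sparse codes and a projected least-squares update for the bases. This is a toy illustration under our own conventions (columns of X are samples; function names are ours), not the solver used in [150]:

```python
import numpy as np

def soft(z, t):
    # soft-thresholding operator, the proximal map of the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_codes(X, B, beta, steps=200):
    """ISTA for (2.3): min_a ||x - B a||_2^2 + beta * ||a||_1, per column of X."""
    L = np.linalg.norm(B, 2) ** 2 + 1e-12   # step-size scale from the spectral norm
    A = np.zeros((B.shape[1], X.shape[1]))
    for _ in range(steps):
        grad = B.T @ (B @ A - X)
        A = soft(A - grad / L, beta / (2.0 * L))
    return A

def learn_bases(Xs, s, beta, iters=20, seed=0):
    """Alternating minimization for (2.2): sparse codes via ISTA, then a
    least-squares basis update projected onto the constraint ||b_j||_2 <= 1."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((Xs.shape[0], s))
    B /= np.linalg.norm(B, axis=0)
    for _ in range(iters):
        A = sparse_codes(Xs, B, beta)
        B = Xs @ np.linalg.pinv(A)                     # least-squares bases
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)  # project onto unit ball
    return B
```

For transfer, `learn_bases` runs on the (unlabeled) source data, and `sparse_codes` then encodes the target instances against the learned bases; the resulting codes are the higher-level features fed to a standard classifier.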

Recently, manifold learning methods have also been adapted for transfer learning. In [193], Wang and Mahadevan proposed a Procrustes analysis based approach to manifold alignment without correspondences, which can be used to transfer the knowledge across domains via the aligned manifolds.

2.2.3 Transferring Knowledge of Parameters

Most parameter-transfer approaches to the inductive transfer learning setting assume that individual models for related tasks should share some parameters or prior distributions of hyper-parameters. Most approaches described in this section, including a regularization framework and a hierarchical Bayesian framework, are designed to work under multi-task learning. However, they can be easily modified for transfer learning. As mentioned above, multi-task learning tries to learn both the source and target tasks simultaneously and perfectly, while transfer learning only aims at boosting the performance of the target domain by utilizing the source domain data. Thus, in multi-task learning, the weights of the loss functions for the source and target data are the same, whereas in transfer learning, the weights in the loss functions for different domains can be different. Intuitively, we may assign a larger weight to the loss function of the target domain to make sure that we can achieve better performance in the target domain.

Lawrence and Platt [97] proposed an efficient algorithm known as MT-IVM, which is based on Gaussian Processes (GP), to handle the multi-task learning case. MT-IVM tries to learn the parameters of a Gaussian Process over multiple tasks by sharing the same GP prior. Bonilla et al. [28] also investigated multi-task learning in the context of GP. The authors proposed to use a free-form covariance matrix over tasks to model inter-task dependencies, where a GP prior is used to induce correlations between tasks. Besides, Schwaighofer et al. [165] proposed to use a hierarchical Bayesian framework (HB) together with GP for multi-task learning.

Besides transferring the priors of the GP models, some researchers also proposed to transfer the parameters of SVMs under a regularization framework. Evgeniou and Pontil [63] borrowed the idea of HB to SVMs for multi-task learning. The proposed method assumes that the parameter, w, in the SVM for each task can be separated into two terms: one is a common term over tasks and the other is a task-specific term. In inductive transfer learning,

    wS = w0 + vS,   wT = w0 + vT,

where wS and wT are the parameters of the SVMs for the source task and the target learning task, respectively, and w0 is a common parameter while vS and vT are specific parameters for the source task and the target task, respectively.

By assuming ft = wt · x to be a hyper-plane for task t, an extension of SVMs to the multi-task learning case can be written as follows:

    min_{w0, vt, ξti}  J(w0, vt, ξti) = Σ_{t∈{S,T}} Σ_{i=1}^{nt} ξti + λ1 Σ_{t∈{S,T}} ‖vt‖² + λ2‖w0‖²
    s.t.  yti (w0 + vt) · xti ≥ 1 − ξti,
          ξti ≥ 0,  i ∈ {1, ..., nt} and t ∈ {S, T},

where the xti's are input feature vectors in the source and target domains and the yti's are the corresponding labels. The ξti's are slack variables that measure the degree of misclassification of the data xti, as used in standard SVMs, and λ1 and λ2 are positive regularization parameters that trade off the importance between the misclassification and regularization terms. By solving the optimization problem above, we can learn the parameters w0, vS and vT simultaneously.

2.2.4 Transferring Relational Knowledge

Different from the other three contexts, the relational-knowledge-transfer approach deals with transfer learning problems in relational domains, where the data are non-i.i.d. and can be represented by multiple relations, such as networked data and social network data. This approach does not assume that the data drawn from each domain be independent and identically distributed (i.i.d.), as traditionally assumed. It tries to transfer the relationship among data from a source domain to a target domain. In this context, statistical relational learning techniques are proposed to solve these problems.

Mihalkova et al. [124] proposed an algorithm, TAMAR, that transfers relational knowledge with Markov Logic Networks (MLNs) [155] across relational domains. MLNs is a powerful formalism for statistical relational learning, which combines the compact expressiveness of first order logic with the flexibility of probability. In an MLN, entities in a relational domain are represented by predicates, and their relationships are represented in first-order logic. TAMAR is motivated by the fact that if two domains are related to each other, there may exist mappings to connect entities and their relationships from a source domain to a target domain. For example, a professor can be considered as playing a similar role in an academic domain as a manager in an industrial management domain. In addition, the relationship between a professor and his or her students is similar to the relationship between a manager and his or her workers. Thus, there may exist a mapping from professor to manager, and a mapping from the professor-student relationship to the manager-worker relationship. In this vein, TAMAR tries to use an MLN learned for a source domain to aid in the learning of an MLN for a target domain.

Basically, TAMAR is a two-stage algorithm. In the first stage, a mapping is constructed from a source MLN to the target domain based on the weighted pseudo log-likelihood measure (WPLL).
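The wt = w0 + vt decomposition above can be sketched with plain subgradient descent on the hinge-loss objective: the shared parameter w0 absorbs what the tasks have in common, while the task-specific vt absorb the deviations. This is a toy illustration of the parameter-sharing idea, not the actual solver of [63]; labels are assumed in {−1, +1} and all names are ours:

```python
import numpy as np

def multitask_svm(Xs, ys, Xt, yt, lam1=1.0, lam2=0.01, lr=0.01, epochs=500):
    """Subgradient descent on the regularized multi-task SVM objective:
    hinge losses for both tasks + lam1 * (||vS||^2 + ||vT||^2) + lam2 * ||w0||^2.
    Large lam1 / small lam2 pushes the tasks to share w0 and keep vt small."""
    d = Xs.shape[1]
    w0, vS, vT = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(epochs):
        for X, y, v in ((Xs, ys, vS), (Xt, yt, vT)):
            margins = y * (X @ (w0 + v))
            active = margins < 1                          # points inside the margin
            g = -(y[active, None] * X[active]).sum(axis=0)  # hinge subgradient
            w0 -= lr * (g + 2.0 * lam2 * w0)              # shared update
            v -= lr * (g + 2.0 * lam1 * v)                # task-specific update
    return w0, vS, vT
```

Because λ1 ≫ λ2 here, the task-specific parts stay small and the common direction is carried by w0, which is exactly the regime in which the source task's data help the target classifier w0 + vT.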

At the second stage, a revision is done for the mapped structure in the target domain through the FORTE algorithm [152], which is an inductive logic programming (ILP) algorithm for revising first-order theories. The revised MLN can be used as a relational model for inference or reasoning in the target domain. In a follow-up work [125], Mihalkova et al. extended TAMAR to the single-entity-centered setting of transfer learning, where only one entity in a target domain is available. Davis et al. [52] proposed an approach to transferring relational knowledge based on a form of second-order Markov logic. The basic idea of the algorithm is to discover structural regularities in the source domain in the form of Markov logic formulas with predicate variables, and then to instantiate these formulas with predicates from the target domain.

2.3 Transductive Transfer Learning

The term transductive transfer learning was first proposed by Arnold et al. [9], where they required that the source and target tasks be the same, although the domains may be different. They further required that all unlabeled data in the target domain be available at training time, but we believe that this condition can be relaxed; instead, in our definition of the transductive transfer learning setting, we only require that part of the unlabeled target domain data be seen at training time in order to obtain the marginal probability for the target domain data.

Note that the word "transductive" is used with several meanings. In the traditional machine learning setting, transductive learning [90] refers to the situation where all test data are required to be seen at training time, and where the learned model cannot be reused for future data. Thus, when some new test data arrive, they must be classified together with all existing data. In our categorization of transfer learning, in contrast, we use the term "transductive" to emphasize the concept that in this type of transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain.

Definition 3. (Transductive Transfer Learning) Given a source domain DS and a corresponding learning task TS, a target domain DT and a corresponding learning task TT, transductive transfer learning aims to improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT and TS = TT. On top of these conditions, some unlabeled target domain data must be available at training time.

This definition covers the work of Arnold et al. [9], since the latter considered domain adaptation, where the difference lies between the marginal probability distributions of the source and target data. For example, assume our task is to classify whether a document describes information about a laptop or not.

The source domain documents are downloaded from news websites and are annotated by humans, while the target domain documents are downloaded from shopping websites; the data distributions may thus be different across domains. The goal of transductive transfer learning here is to make use of some unlabeled (without human annotation) target documents together with a lot of labeled source domain documents to train a good classifier to make predictions on the documents in the target domain (including documents unseen in training).

In the above definition of transductive transfer learning, the source and target tasks are the same, which implies that one can adapt the predictive function learned in the source domain for use in the target domain through some unlabeled target-domain data. As mentioned in Chapter 2.2, this setting can be split into two cases: (a) the feature spaces between the source and target domains are different, XS ≠ XT; and (b) the feature spaces between domains are the same, XS = XT, but the marginal probability distributions of the input data are different, P(XS) ≠ P(XT). Most approaches described in the following sections are related to case (b) above. This is similar to the requirements in domain adaptation and sample selection bias. Similar to the traditional transductive learning setting, which aims to make the best use of the unlabeled test data for learning, in our classification scheme under transductive transfer learning, we also assume that some target-domain unlabeled data be given.

2.3.1 Transferring the Knowledge of Instances

Most instance-transfer approaches to the transductive transfer learning setting are motivated by importance sampling. To see how importance-sampling-based methods may help in this setting, we first review the problem of empirical risk minimization (ERM) [189]. In general, we might want to learn the optimal parameters θ* of the model by minimizing the expected risk,

θ* = arg min_{θ∈Θ} E_{(x,y)∈P} [l(x, y, θ)],

where l(x, y, θ) is a loss function that depends on the parameter θ. However, since it is hard to estimate the probability distribution P, we choose to minimize the ERM instead,

θ* = arg min_{θ∈Θ} (1/n) Σ_{i=1}^{n} [l(xi, yi, θ)],

where n is the size of the training data, the xi's are input feature vectors and the yi's are the corresponding labels. In the transductive transfer learning setting, we want to learn an optimal model for the target domain by minimizing the expected risk,

θ* = arg min_{θ∈Θ} Σ_{(x,y)∈DT} P(DT) l(x, y, θ).
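As a concrete instance of ERM, the sketch below minimizes a logistic loss with optional per-instance weights; the weights anticipate the importance-weighting derivation of this section. The toy Gaussian source/target data and the closed-form density ratio are assumptions made purely for illustration:

```python
import numpy as np

def fit_logreg(X, y, weights=None, lr=0.1, epochs=500):
    """Minimize the (optionally instance-weighted) empirical risk with the
    logistic loss sum_i w_i * log(1 + exp(-y_i * theta.x_i)), y_i in {-1,+1}."""
    n, d = X.shape
    w = np.ones(n) if weights is None else weights
    theta = np.zeros(d)
    for _ in range(epochs):
        m = y * (X @ theta)
        theta += lr * ((w * y / (1 + np.exp(m))) @ X) / n   # gradient step
    return theta

# Covariate shift toy: same labeling rule, shifted input densities.
rng = np.random.default_rng(0)
xs = rng.normal(-0.5, 1.0, 300)                 # source inputs
xt = rng.normal(0.5, 1.0, 300)                  # target inputs
ys, yt = np.sign(xs), np.sign(xt)               # shared labeling rule
Xs, Xt = np.c_[xs, np.ones(300)], np.c_[xt, np.ones(300)]
# For these two known Gaussians, the density ratio p_T(x)/p_S(x) is exp(x).
theta = fit_logreg(Xs, ys, weights=np.exp(xs))
acc = np.mean(np.sign(Xt @ theta) == yt)
print(round(acc, 2))
```

Re-weighting the source instances by the density ratio makes the weighted empirical risk an estimate of the target-domain risk, which is the point of the derivation that follows.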

However, since no labeled data in the target domain are observed in the training data, we have to learn a model from the source domain data instead. If P(DS) = P(DT), then we may simply learn the model by solving the following optimization problem for use in the target domain,

θ* = arg min_{θ∈Θ} Σ_{(x,y)∈DS} P(DS) l(x, y, θ).

Otherwise, when P(DS) ≠ P(DT), we need to modify the above optimization problem to learn a model with high generalization ability for the target domain, as follows:

θ* = arg min_{θ∈Θ} Σ_{(x,y)∈DS} [P(DT)/P(DS)] P(DS) l(x, y, θ)
   ≈ arg min_{θ∈Θ} Σ_{i=1}^{nS} [PT(xTi, yTi)/PS(xSi, ySi)] l(xSi, ySi, θ).   (2.4)

Therefore, by adding different penalty values to each instance (xSi, ySi) with the corresponding weight PT(xTi, yTi)/PS(xSi, ySi), we can learn a precise model for the target domain. Furthermore, since P(YT|XT) = P(YS|XS), the difference between P(DS) and P(DT) is caused by P(XS) and P(XT), and

PT(xTi, yTi)/PS(xSi, ySi) = P(xTi)/P(xSi).

If we can estimate P(xTi)/P(xSi) for each instance, we can solve the transductive transfer learning problems.

There exist various ways to estimate P(xTi)/P(xSi). Zadrozny [219] proposed to estimate the terms P(xSi) and P(xTi) independently by constructing simple classification problems. Huang et al. [82] proposed a kernel-mean matching (KMM) algorithm to learn the weights directly by matching the means between the source domain data and the target domain data in a reproducing kernel Hilbert space (RKHS). KMM can be rewritten as the following quadratic programming (QP) optimization problem,

min_β  (1/2) βᵀKβ − κᵀβ
s.t.  βi ∈ [0, B] and |Σ_{i=1}^{nS} βi − nS| ≤ nS ϵ,   (2.5)

where K = [KS,S KS,T; KT,S KT,T] and Kij = k(xi, xj), with xi, xj ∈ XS ∪ XT. Here KS,S and KT,T are kernel matrices for the source domain data and the target domain data, respectively, i.e., KS,S ij = k(xi, xj) with xi, xj ∈ XS and KT,T ij = k(xi, xj) with xi, xj ∈ XT, while KS,T ij = k(xi, xj) with xi ∈ XS and xj ∈ XT is a kernel matrix across the source and target domain data. In addition, κi = (nS/nT) Σ_{j=1}^{nT} k(xi, xTj), where xi ∈ XS ∪ XT, while xTj ∈ XT. It can be proved that βi = P(xTi)/P(xSi) [82]. An advantage of using KMM is that it can avoid performing density estimation of either P(xSi) or P(xTi), which is difficult when the size of the data set is small.
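The KMM quadratic program above can be approximated with a simple projected-gradient loop. The sketch below is a simplification of the full QP (it uses only the source-block kernel and handles the sum constraint by a final rescaling rather than as a hard constraint); the data and hyperparameters are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kmm_weights(Xs, Xt, B=10.0, gamma=0.5, lr=0.002, steps=3000):
    """Projected gradient descent on 1/2 b'Kb - kappa'b with b_i in [0, B]."""
    ns, nt = len(Xs), len(Xt)
    K = rbf(Xs, Xs, gamma)                         # kernel among source points
    kappa = (ns / nt) * rbf(Xs, Xt, gamma).sum(axis=1)
    b = np.ones(ns)
    for _ in range(steps):
        b = np.clip(b - lr * (K @ b - kappa), 0.0, B)   # step, then project
    return b * (ns / b.sum())                      # soft version of the sum constraint

# Source N(0,1) vs target N(1, 0.5^2): source points lying where the target
# density is high should receive larger weights.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (200, 1))
Xt = rng.normal(1.0, 0.5, (200, 1))
b = kmm_weights(Xs, Xt)
print(b[Xs[:, 0] > 0.5].mean() > b[Xs[:, 0] < -0.5].mean())
```

A production implementation would hand the constrained QP to a proper solver rather than rely on this projection heuristic.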

Sugiyama et al. [180] proposed an algorithm known as the Kullback-Leibler Importance Estimation Procedure (KLIEP) to estimate the importance P(xTi)/P(xSi) directly, based on the minimization of the Kullback-Leibler divergence. KLIEP can be integrated with cross-validation to perform model selection automatically in two steps: (1) estimating the weights of the source domain data; (2) training models on the re-weighted data. Bickel et al. [21] combined these two steps in a unified framework by deriving a kernel-logistic regression classifier. Kanamori et al. [91] proposed a method called unconstrained least-squares importance fitting (uLSIF) to estimate the importance efficiently by formulating the direct importance estimation problem as a least-squares function fitting problem. More recently, Sugiyama et al. [179] further extended the uLSIF algorithm to estimate the importance in a non-stationary subspace, which performs well even when the dimensionality of the data is high. However, this method is focused on estimating the importance in a latent space instead of learning a latent space for adaptation. It should be emphasized that besides covariate shift adaptation, these importance estimation techniques have also been applied to various other applications, such as independent component analysis (ICA) [181], outlier detection [78] and change-point detection [92]. For more information on importance sampling and re-weighting methods for covariate shift or sample selection bias, readers can refer to a recently published book [148] by Quiñonero-Candela et al. Besides sample re-weighting techniques, Dai et al. [48] extended a traditional Naive Bayesian classifier for the transductive transfer learning problems.

2.3.2 Transferring Knowledge of Feature Representations

Most feature-representation-transfer approaches to the transductive transfer learning setting are under unsupervised learning frameworks. Blitzer et al. [26] proposed a structural correspondence learning (SCL) algorithm, which modifies the ASO [5] algorithm for transductive transfer learning. As described in Chapter 2.2, ASO was proposed for multi-task learning; SCL adapts it to make use of the unlabeled data from the target domain to extract some common features that reduce the difference between the source and target domains. In SCL, a first step is to construct some pseudo related tasks. More specifically, SCL first defines a set of pivot features (the number of pivot features is denoted by m) using the labeled source domain and unlabeled target domain data. Pivot features are common features that occur frequently and similarly across domains. For example, in sentiment classification across reviews on different products, the words "good", "nice" and "bad", which are sentiment words used commonly across different domains, are examples of pivot features. SCL then treats each pivot feature as a new output vector to construct a task, with the non-pivot features as inputs. m linear classifiers are trained to model the relationship between the non-pivot features and the pivot features, as follows:

fl(x) = sgn(wlᵀ · x),  l = 1, . . . , m.
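Putting the pivot-prediction step together with the SVD step described in this subsection, a minimal SCL-style sketch might look as follows. The least-squares pivot predictors, the toy bag-of-words data and all names here are illustrative simplifications, not the authors' implementation:

```python
import numpy as np

def scl_features(X_nonpivot, X_pivot, h=5):
    """SCL sketch: fit one linear predictor per pivot feature from the
    non-pivot features (plain least squares here, a simplification), stack
    the weight vectors into W, take the top-h left singular vectors of W
    as the shared mapping theta, and append theta.x to the features."""
    W = np.linalg.lstsq(X_nonpivot, X_pivot, rcond=None)[0]   # q x m weights
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :h].T                                        # h x q mapping
    return np.hstack([X_nonpivot, X_nonpivot @ theta.T])      # augmented data

# Hypothetical unlabeled "word count" data pooled from both domains.
rng = np.random.default_rng(0)
X = (rng.random((300, 40)) < 0.2).astype(float)
pivots, nonpivots = X[:, :10], X[:, 10:]   # first 10 columns play the pivots
Z = scl_features(nonpivots, pivots, h=5)
print(Z.shape)
```

A downstream classifier is then trained on the augmented representation, so that correspondences learned from unlabeled data in both domains inform the supervised task.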

Then, as in ASO, singular value decomposition (SVD) is applied on the weight matrix W = [w1 w2 . . . wm] ∈ R^{q×m}, where q is the number of features of the original data, such that W = U D Vᵀ. Here U_{q×r} and V_{r×m} are the matrices of the left and right singular vectors, respectively, and D_{r×r} is a diagonal matrix consisting of non-negative singular values ranked in non-increasing order. Let θ = Uᵀ_{[1:h,:]} (h is the number of shared features); θ is a transformation that maps non-pivot features to a latent space, where the difference between domains can be reduced. Finally, standard classification algorithms can be applied on the new representations to train classifiers. In [26], Blitzer et al. used a heuristic method to select pivot features for natural language processing (NLP) problems, such as tagging of sentences. In their follow-up work [25], Mutual Information (MI) was proposed for choosing the pivot features.

Dai et al. [47] proposed a co-clustering based algorithm to discover common feature clusters, such that label information can be propagated across different domains by using the common clusters as a bridge. Xing et al. [205] proposed a novel algorithm known as bridged refinement to correct the labels predicted by a shift-unaware classifier towards a target distribution, taking the mixture distribution of the training and test data as a bridge to better transfer from the training data to the test data. Ling et al. [108] proposed a spectral classification framework for the cross-domain transfer learning problem, where an objective function is introduced to seek consistency between the in-domain supervision and the out-of-domain intrinsic structure. Xue et al. [207] proposed a cross-domain text classification algorithm that extended the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data from different but related domains into a unified probabilistic model.

Most feature-based transductive transfer learning methods do not minimize the distance in distributions between domains directly. Recently, von Bünau et al. [191] proposed a method known as stationary subspace analysis (SSA) to match distributions in a latent space for time series data analysis. SSA theoretically studied the conditions under which a stationary subspace can be identified from multivariate time series. They also proposed the SSA procedure to find stationary components by matching the first two moments of the data distributions in different epochs. However, SSA is focused on the identification of a stationary subspace, without considering the preservation of properties such as data variance in the subspace.

Daumé III [51] proposed a simple feature augmentation method for transfer learning problems in the Natural Language Processing (NLP) area. It aims to augment each of the feature vectors of the different domains to a higher-dimensional feature vector as follows,

x̃S = [xS, xS, 0]  and  x̃T = [xT, 0, xT],

where xS and xT are the original feature vectors of the source and target domains, respectively, and 0 is a vector of zeros whose length is equal to that of the original feature vector. The idea is to reduce the difference between domains while ensuring that the similarity between data within a domain is larger than that across different domains.

As a result, SSA may map the data to some noisy factors which are stationary across domains but completely irrelevant to the target supervised task, and classifiers trained on the new representations learned by SSA may then not obtain good performance for transductive transfer learning.

2.4 Unsupervised Transfer Learning

Definition 4. (Unsupervised Transfer Learning) Given a source domain DS with a learning task TS, a target domain DT and a corresponding learning task TT, unsupervised transfer learning aims to help improve the learning of the target predictive function fT(·)⁴ in DT using the knowledge in DS and TS, where TS ≠ TT and YS and YT are not observable.

Based on the definition of the unsupervised transfer learning setting, no labeled data are observed in the source and target domains in training. So far, there is little research work on this setting. As an example, assume we have a lot of documents downloaded from news websites, which are referred to as source domain documents, and a few documents downloaded from shopping websites, which are referred to as target domain documents. The task is to cluster the target domain documents into some hidden categories. Note that it is usually impossible to find precise clusters if the data are sparse. Thus, the goal of unsupervised transfer learning here is to make use of the documents in the source domain, where precise clusters can be obtained because of the sufficiency of training data, to guide clustering on the target domain documents.

2.4.1 Transferring Knowledge of Feature Representations

Dai et al. [50] studied a new case of clustering problems, known as self-taught clustering, which aims at clustering a small collection of unlabeled data in the target domain with the help of a large amount of unlabeled data in the source domain. Self-taught clustering (STC) is an instance of unsupervised transfer learning. STC tries to learn a common feature space across domains, which helps in clustering in the target domain. Suppose that there exist three clustering functions CXT : XT → X̃T, CXS : XS → X̃S and CZ : Z → Z̃, where X̃T, X̃S and Z̃ are the corresponding clusters of XT, XS and Z, respectively. Here XS and XT are the source and target domain data, Z is a feature space shared by XS and XT, and I(·, ·) denotes the mutual information between two random variables. The objective function of STC is shown as follows,

J(X̃T, X̃S, Z̃) = I(XT, Z) − I(X̃T, Z̃) + λ [ I(XS, Z) − I(X̃S, Z̃) ].

⁴ In unsupervised transfer learning, the predicted labels are latent variables, such as clusters or reduced dimensions.

The goal of STC is to learn X̃T by solving the optimization problem

arg min_{X̃T, X̃S, Z̃} J(X̃T, X̃S, Z̃).   (2.6)

An iterative algorithm for solving the optimization function (2.6) was presented in [50].

Similarly, Wang et al. [197] proposed a transferred discriminative analysis (TDA) algorithm to solve the transfer dimensionality reduction problem. TDA first applies clustering methods to generate pseudo-class labels for the target domain unlabeled data. It then applies dimensionality reduction methods to the target domain data and the labeled source domain data to reduce the dimensionalities. These two steps run iteratively to find the best subspace for the target domain data. More recently, Bhattacharya et al. [19] proposed a new clustering algorithm to transfer supervision to a clustering task in a target domain from a different source domain, where a relevant supervised data partition is provided. The main idea is to define a cross-domain similarity measure to align clusters in the target domain with the partitions in the source domain.

2.5 Transfer Bounds and Negative Transfer

An important issue is to recognize the limit of the power of transfer learning. To date, there are few works proposed to study the issue of transferability. In inductive transfer learning, Hassan Mahmud and Ray [120] analyzed the case of transfer learning using Kolmogorov complexity, where some theoretical bounds are proved. In particular, the authors used conditional Kolmogorov complexity to measure relatedness between tasks and transfer the "right" amount of information in a sequential transfer learning task under a Bayesian framework. Recently, Eaton et al. [60] proposed a novel graph-based method for knowledge transfer, where the relationships between source tasks are modeled by embedding the set of learned source models in a graph using transferability as the metric. Transferring to a new task then proceeds by mapping the problem into the graph and learning a function on this graph that automatically determines the parameters to transfer to the new learning task.

In transductive transfer learning, there are some research works focusing on the generalization bound when the training and test data distributions are different [16, 24, 15, 17]. Though the generalization bounds proved in the different works differ slightly, there is a common conclusion: the generalization bound of a learning model in transductive transfer learning consists of two terms, one being the error bound of the learning model on the labeled source domain data, and the other the distance between the source and target domains, more specifically the distance between the marginal probability distributions of input features across domains.
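The second of these two terms, the distance between the marginal input distributions, can be estimated empirically. One common surrogate (the "proxy A-distance" idea of Ben-David et al., not part of this chapter's notation) trains a classifier to distinguish source inputs from target inputs: the harder the domains are to tell apart, the smaller the distance. A toy sketch under these assumptions:

```python
import numpy as np

def proxy_a_distance(Xs, Xt, lr=0.1, epochs=300, seed=0):
    """2 * (1 - 2 * err), where err is the held-out error of a linear
    logistic classifier trained to tell source inputs from target inputs."""
    rng = np.random.default_rng(seed)
    X = np.c_[np.vstack([Xs, Xt]), np.ones(len(Xs) + len(Xt))]  # add bias
    y = np.r_[-np.ones(len(Xs)), np.ones(len(Xt))]              # domain labels
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2:]              # 50/50 split
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        m = y[tr] * (X[tr] @ theta)
        theta += lr * ((y[tr] / (1 + np.exp(m))) @ X[tr]) / len(tr)
    err = np.mean(np.sign(X[te] @ theta) != y[te])
    return 2.0 * (1.0 - 2.0 * min(err, 1.0 - err))

rng = np.random.default_rng(1)
near = proxy_a_distance(rng.normal(0, 1, (300, 2)), rng.normal(0, 1, (300, 2)))
far = proxy_a_distance(rng.normal(0, 1, (300, 2)), rng.normal(3, 1, (300, 2)))
print(round(near, 2), round(far, 2))
```

Identically distributed domains give a value near 0, while well-separated domains approach the maximum of 2, mirroring the role the distance term plays in the bounds above.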

How to avoid negative transfer and thus ensure a "safe transfer" of knowledge is another crucial issue in transfer learning. Negative transfer happens when the source domain data and task contribute to reduced performance of learning in the target domain. Despite the fact that how to avoid negative transfer is a very important issue, little research work was published on this topic in the past. Rosenstein et al. [159] empirically showed that if two tasks are too dissimilar, then brute-force transfer may hurt the performance of the target task. Thus, we need to answer the question: given a target task and a source task, should transfer learning techniques be applied or not?

Some works have been exploited to analyze relatedness among tasks and task clustering techniques, such as [18, 11], which may help provide guidance on how to avoid negative transfer automatically. Bakker and Heskes [11] adopted a Bayesian approach in which some of the model parameters are shared for all tasks and others are more loosely connected through a joint prior distribution that can be learned from the data. Thus, the data are clustered based on the task parameters, where tasks in the same cluster are supposed to be related to each other. Argyriou et al. [7] considered situations in which the learning tasks can be divided into groups. Tasks within each group are related by sharing a low-dimensional representation, which differs among different groups. As a result, tasks within a group can find it easier to transfer useful knowledge. Jacob et al. [83] presented a convex approach to clustered multi-task learning by designing a new spectral norm to penalize over a set of weights, each of which is associated with a task.

However, most research works on modeling task correlation are from the context of multi-task learning [18, 7, 83, 11, 28, 224]. Bonilla et al. [28] proposed a multi-task learning method based on Gaussian Processes (GP), which provides a global approach to model and learn task relatedness in the form of a task covariance matrix. However, the optimization procedure introduced in [28] is non-convex, and its results may be sensitive to parameter initialization. Zhang and Yeung [224] proposed an improved regularization framework to model the negative and positive correlation between tasks, where the resultant optimization procedure is convex. As described above, in inductive transfer learning one may be particularly interested in transferring knowledge from one or more source tasks to a target task rather than learning these tasks simultaneously; the main concern is the learning performance in the target task only. Motivated by [28], Cao et al. [30] proposed an Adaptive Transfer learning algorithm based on GP (AT-GP), which aims to adapt the transfer learning schemes by automatically estimating the similarity between the source and target tasks. In AT-GP, a new semi-parametric kernel is designed to model correlation between tasks, and the learning procedure targets improving the performance of the target task only. More recently, Seah et al. [167] empirically studied the negative transfer problem by proposing a predictive distribution matching classifier based on Support Vector Machines (SVMs) to identify the regions of relevant source domain data where the predictive distributions maximally align with those of the target domain data, and thus avoid negative transfer.

2.6 Other Research Issues of Transfer Learning

Besides the negative transfer issue, there are several research issues of transfer learning that have, in recent years, attracted more and more attention from the machine learning and data mining communities. We summarize them as follows.

• Transfer learning from multiple source domains. Most transfer learning methods introduced in the previous sections focus on one-to-one transfer, which means there is only one source domain and one target domain. However, in some real-world scenarios, we may have multiple sources at hand, and developing algorithms to make use of multiple sources to help learn models in the target domain is useful in practice. Mansour et al. [122] proposed a distribution-weighted linear combining framework for learning from multiple sources. The main idea is to estimate the data distribution of each source to re-weight the data from the different source domains. Luo et al. [119] proposed to train a classifier for use in the target domain by maximizing the consensus of predictions from multiple sources. Yang et al. [209] and Duan et al. [57] proposed algorithms to learn a new SVM for the target domain by adapting SVMs learned from multiple source domains. Theoretical studies on transfer learning from multiple source domains have also been presented in [122, 123, 15].

• Transfer learning across different feature spaces. As described in the definition of transfer learning, in some scenarios the source and target domain data may come from different feature spaces, such as image vs. text. Thus, how to transfer knowledge successfully across different feature spaces is another interesting issue. This setting is related to, but unlike, multi-view learning [27], which assumes the features for each instance can be divided into several views. Though multi-view learning techniques can be applied to model multi-modality data, they require each instance in one view to have its correspondence in the other views, whereas in transfer learning across different feature spaces there are no correspondences across feature spaces. Transfer learning across two feature spaces can instead be achieved by designing a suitable mapping function across feature spaces as a bridge. Ling et al. [109] proposed an information-theoretic approach for transfer learning to address the cross-language classification problem, for the case when there are plenty of labeled English text data whereas there are only a small number of labeled Chinese text documents. Dai et al. [45] proposed a new risk minimization framework based on a language model for machine translation [44] for dealing with the problem of learning from heterogeneous data that belong to quite different feature spaces.
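The distribution-weighted combination idea for multiple sources can be sketched in a few lines: each source hypothesis is trusted in proportion to its input density at the query point. The two toy sources, their densities and predictors below are all hypothetical assumptions for illustration:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def combine(x, hyps, dens):
    """Distribution-weighted combination of source hypotheses:
    h(x) = sum_k [p_k(x) / sum_j p_j(x)] * h_k(x)."""
    p = np.array([d(x) for d in dens])       # shape (n_sources, n_points)
    preds = np.array([h(x) for h in hyps])
    return (p / p.sum(axis=0) * preds).sum(axis=0)

# Two hypothetical sources with disjoint regions of competence.
h1 = lambda x: -np.ones_like(x)      # source 1 predicts -1 everywhere
h2 = lambda x: np.ones_like(x)       # source 2 predicts +1 everywhere
d1 = lambda x: gauss(x, -2.0, 1.0)   # source 1's input density
d2 = lambda x: gauss(x, 2.0, 1.0)    # source 2's input density
out = combine(np.array([-2.0, 2.0]), [h1, h2], [d1, d2])
print(out.round(3))
```

Near x = -2 the combination follows source 1, near x = +2 it follows source 2, which is the behaviour the distribution-weighted framework is designed to produce.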

• Transfer learning with active learning. As mentioned at the beginning of this chapter, active learning and transfer learning both aim at learning a precise model with minimal human supervision for a target task. Several researchers have proposed to combine the two techniques in order to learn a more precise model with even less supervision. Liao et al. [107] proposed a new active learning method to select the unlabeled data in a target domain to be labeled with the help of the source domain data. Shi et al. [169] applied an active learning algorithm to select important instances for transfer learning with TrAdaBoost [49] and standard SVMs. In [33], Chan and Ng proposed to adapt existing Word Sense Disambiguation (WSD) systems to a target domain by using domain adaptation techniques and employing an active learning strategy [101] to actively select examples to be annotated from the target domain. Harpale and Yang [75] proposed an active learning framework for the multi-task adaptive filtering [157] problem. They first proposed to apply multi-task learning techniques to adaptive filtering, and then explored various active learning approaches for the multi-task adaptive filter to improve its performance.

• Transfer learning for other tasks. Besides classification, regression, clustering and dimensionality reduction, transfer learning has also been proposed for other tasks, such as metric learning [221, 226], structure learning [79] and online learning [227]. Zhang and Yeung [221] proposed a regularization framework to learn a new distance metric in a target domain by leveraging pre-learned distance metrics from auxiliary domains. Zhang and Yeung [226] further first proposed a convex formulation for multi-task metric learning by modeling the task relationships in the form of a task covariance matrix. In [79], Honorio and Samaras proposed to apply multi-task learning techniques to learn structures across multiple Gaussian graphical models simultaneously. In [227], Zhao and Hoi investigated a framework to transfer knowledge from a source domain to an online learning task in a target domain. Zha et al. [211] and Chen et al. [40] proposed probabilistic models that leverage textual information to help image clustering and image advertising, respectively.

2.7 Real-world Applications of Transfer Learning

Recently, transfer learning has been applied successfully to many application areas, such as Natural Language Processing (NLP), Information Retrieval (IR), recommendation systems, computer vision, multimedia data mining, bioinformatics, image analysis, activity recognition and wireless sensor networks.

In the field of Natural Language Processing (NLP), transfer learning, which is known there as domain adaptation, has been widely studied for solving various tasks [42], such as named entity recognition [71], part-of-speech tagging [4], word sense disambiguation [33] and information extraction [200, 201].

Information Retrieval (IR) is another application area where transfer learning techniques have been widely studied and applied. Web applications of transfer learning include text classification across domains [151], sentiment classification [25], learning to rank across domains [10], advertising [40] and multi-domain collaborative filtering [103].

Besides NLP and IR, in the past two years transfer learning techniques have attracted more and more attention in the fields of computer vision and image and multimedia analysis. Applications of transfer learning in these fields include image classification [202], image retrieval [73], image semantic segmentation [190], face verification from images [196], age estimation from facial images [225], video retrieval [208], video concept detection [209] and event recognition from videos [59].

In Bioinformatics, some research works have been proposed to apply transfer learning techniques to solve various computational biology problems, motivated by the fact that different biological entities, such as organisms and genes, may be related to each other from a biological point of view. Examples include identifying molecular associations of phenotypic responses [222], mRNA splicing [166], splice site recognition of eukaryotic genomes [199], protein subcellular location prediction [206] and genetic association analysis [214].

Transfer learning techniques have also been explored to solve WiFi localization and sensor-based activity recognition problems [210]. For example, transfer learning techniques were proposed to extract knowledge from WiFi localization models across time periods [217, 218], space [138, 194] and mobile devices [229], to benefit WiFi localization tasks in other settings. Rashidi and Cook [153] and Zheng et al. [228] proposed to apply transfer learning techniques for solving indoor sensor-based activity recognition problems.

Transfer learning has been applied in several other areas as well. In [154], Raykar et al. proposed a novel Bayesian multiple-instance learning algorithm for computer aided design (CAD), which can automatically identify the relevant feature subset and use inductive transfer for learning multiple, but conceptually related, classifiers. Zhuo et al. [235] studied how to transfer domain knowledge to learn relational action models across domains in automated planning. Alamgir et al. [3] applied multi-task learning techniques to solve brain-computer interface problems. Chai et al. [32] studied how to apply a GP-based multi-task learning method to solve the inverse dynamics problem for a robotic manipulator.

CHAPTER 3

TRANSFER LEARNING VIA DIMENSIONALITY REDUCTION

In this chapter, we aim at proposing a novel and general dimensionality reduction framework for transductive transfer learning. As reviewed in Chapter 2, most previous feature-based transductive transfer learning methods do not explicitly minimize the distance between different domains in the new feature space, which limits the transferability across domains. In contrast, in the proposed dimensionality reduction framework, one objective is to minimize the distance between domains explicitly, such that the difference between different domain data can be reduced in the latent space. Another objective of the proposed framework is to preserve the properties of the original data as much as possible. Note that this objective is important for the final target learning tasks in the latent space. Our proposed framework can be referred to as a feature-representation-transfer approach.

This chapter is organized as follows. We first introduce our motivation and some preliminaries used in the proposed methods in Chapter 3.1 and Chapter 3.2, respectively. Then we present the proposed dimensionality reduction framework in Chapter 3.3, and three algorithms based on the framework, known as Maximum Mean Discrepancy Embedding (MMDE), Transfer Component Analysis (TCA) and Semi-supervised Transfer Component Analysis (SSTCA), in Chapters 3.4-3.6, respectively. The relationship between the proposed methods and other methods is discussed in Chapter 3.7. Finally, we apply the proposed methods to two diverse applications, cross-domain WiFi localization and cross-domain text classification, in Chapter 4 and Chapter 5, respectively.

3.1 Motivation

In many real-world applications, observed data are controlled by a few latent factors. If two domains are related to each other, they may share some latent factors (or components). Some of these common latent factors may cause the data distributions between domains to be different, while others may not. Meanwhile, some of these factors may capture the intrinsic structure or discriminative information underlying the original data, while others may not. Our goal is to recover those common latent factors that do not cause much difference between data distributions and do preserve properties of the original data. Then the subspace spanned by these latent factors can be treated as a bridge to make knowledge transfer possible. We illustrate our idea using a learning-based indoor localization problem as an example, where a client moving in a WiFi environment wishes to use the received signal strength (RSS) values to locate itself.

resulting in changes in RSS values over time. Then. they cannot directly be used to solve domain adaptation problems. human movement.1 Dimensionality Reduction Dimensionality reduction has been studied widely in the machine learning community. Motivated by the observations mentioned above. If two text-classiﬁcation domains have different distributions. In an indoor building. this is the latent space where we can ensure a transferring of a learned localization model from one time period to another. we ﬁrst introduce some preliminaries that are used in our proposed solutions based on the framework. such as temperature. the temperature and the human movement may vary in time. we propose a novel and general dimensionality reduction framework for transductive transfer learning. but are related to each other (e. Some of them may be relatively stable while others may not. Thus. Among these hidden factors. news articles and blogs). 3..values to locate itself. RSS values are affected by many latent factors. A powerful dimensionality reduction technique is principal component analysis (PCA) and its kernelized version kernel PCA (KPCA). If we use the stable latent topics to represent documents. However. KPCA is an unsupervised method. 33 . We then project the data in both domains to this latent feature space. the distance between the distributions of documents in related domains may be small. Thus we need to develop a new dimensionality reduction algorithm for domain adaptation. properties of access points (APs). in this chapter. the building structure and properties of APs are relatively stable. [188] gave a recent survey on various dimensionality reduction methods. The key idea is to learn a low-dimensional latent feature space where the distributions of the source domain data and target domain data are close to each other and the data properties can be preserved as much as possible. 
Since they cannot guarantee that the distributions between different domain data are similar in the reduced latent space. Another example is learn to do text classiﬁcation across domains. van der Maaten et al. Before presenting our proposed dimensionality reduction framework in detail. Thus. building structure. if we use the latter two factors to represent the RSS data. there may be some latent topics shared by these domains. which aims at extracting a low-dimensional representation of the data by maximizing the variance of the embedding in a kernel space [163]. on which we subsequently apply standard learning algorithms to train classiﬁcation or regression models from labeled source domain data to make predictions on the unlabeled target domain data. we can transfer the text-classiﬁcation knowledge. etc. the distributions of the data collected in different time periods may be close to each other.2.2 Preliminaries 3.g. Traditional dimensionality reduction methods try to project the original data to a low-dimensional latent space while preserving some properties of the original data. in the latent space spanned by latent topics.
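To make the variance-maximizing view of PCA above concrete, the following sketch (illustrative only; the synthetic data and plain-numpy implementation are assumptions, not part of this thesis) projects two-dimensional samples onto their leading principal component and checks that the retained variance equals the largest eigenvalue of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data whose variance is concentrated along the first axis.
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

Xc = X - X.mean(axis=0)                    # center the data
C = Xc.T @ Xc / (len(Xc) - 1)              # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
w = eigvecs[:, -1]                         # leading principal direction
z = Xc @ w                                 # 1-D PCA embedding

# The variance retained by the projection equals the top eigenvalue.
assert np.isclose(z.var(ddof=1), eigvals[-1])
```

Note that nothing in this computation looks at which domain a sample comes from, which is precisely why such a projection cannot guarantee that the source and target distributions become similar in the reduced space.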

3.2.2 Hilbert Space Embedding of Distributions

Given samples X = {x_i} and Z = {z_i} drawn from two distributions P and Q, respectively, there exist many criteria (such as the Kullback-Leibler (KL) divergence) that can be used to estimate their distance. However, many of these estimators are parametric and require an intermediate density estimate. To avoid such a non-trivial task, a non-parametric distance estimate between distributions without density estimation is more desirable. Recently, such a non-parametric distance estimate was designed by embedding distributions in a reproducing kernel Hilbert space (RKHS) [172].

3.2.3 Maximum Mean Discrepancy

Gretton et al. [69] introduced the Maximum Mean Discrepancy (MMD) for comparing distributions based on the corresponding RKHS distance. The estimate of MMD between {x_1, ..., x_{n_1}} and {z_1, ..., z_{n_2}} is defined as follows:

    Dist(X, Z) = MMD(X, Z) = sup_{∥f∥_H ≤ 1} ( (1/n_1) ∑_{i=1}^{n_1} f(x_i) − (1/n_2) ∑_{i=1}^{n_2} f(z_i) ),    (3.1)

where ∥·∥_H is the RKHS norm. By the fact that in a RKHS, function evaluation can be written as f(x) = ⟨ϕ(x), f⟩, where ϕ(x) : X → H, the empirical estimate of MMD can be rewritten as follows:

    Dist(X, Z) = MMD(X, Z) = ∥ (1/n_1) ∑_{i=1}^{n_1} ϕ(x_i) − (1/n_2) ∑_{i=1}^{n_2} ϕ(z_i) ∥_H^2.    (3.2)

Therefore, the distance between two distributions is simply the (squared) distance between the two mean elements in the RKHS. It can be shown that when the RKHS is universal [178], MMD will asymptotically approach zero if and only if the two distributions are the same.

3.2.4 Dependence Measure: Hilbert-Schmidt Independence Criterion

Related to the MMD, the Hilbert-Schmidt Independence Criterion (HSIC) [70] is a simple yet powerful non-parametric criterion for measuring the dependence between two sets X and Y. As its name implies, it computes the Hilbert-Schmidt norm of a cross-covariance operator in the RKHS, which is defined as follows:

    HSIC(X, Y) = ∥C_{xy}∥_H^2,  with  C_{xy} = (1/n) ∑_{i=1}^{n} [ (ϕ(x_i) − (1/n) ∑_{k=1}^{n} ϕ(x_k)) ⊗ (ψ(y_i) − (1/n) ∑_{k=1}^{n} ψ(y_k)) ],    (3.3)

where ⊗ denotes a tensor product, ϕ : X → H, ψ : Y → H and H is a RKHS. A (biased) empirical estimate can be easily obtained from the corresponding kernel matrices, as

    Dep(X, Y) = HSIC(X, Y) = (1/(n−1)^2) tr(HKHK_yy),    (3.4)

where K and K_yy are kernel matrices defined on X and Y, respectively, H = I − (1/n)11^⊤ is the centering matrix, and n is the number of samples in X and Y. Similar to MMD, it can be shown that if the RKHS is universal, HSIC asymptotically approaches zero if and only if X and Y are independent [178]. Conversely, a large HSIC value suggests strong dependence.

Embedding via HSIC. In embedding or dimensionality reduction, it is often desirable to preserve the local data geometry while at the same time maximally aligning the embedding with the available side information (such as labels). For example, in Colored Maximum Variance Unfolding (colored MVU) [174], the local geometry is captured in the form of local distance constraints on the target embedding K: denoting by d_ij the distance between x_i and x_j in the original feature space, and writing (i, j) ∈ N if x_i and x_j are k-nearest neighbors, one requires K_ii + K_jj − 2K_ij = d_ij^2 for all (i, j) ∈ N. The alignment with the side information (represented as a kernel matrix K_yy) is measured by the HSIC criterion in (3.4). Mathematically, this leads to the following semi-definite program (SDP) [95]:

    max_{K ≽ 0}  tr(HKHK_yy)
    s.t.  K_ii + K_jj − 2K_ij = d_ij^2, ∀(i, j) ∈ N,
          K1 = 0.    (3.5)

Problem (3.5) reduces to Maximum Variance Unfolding (MVU) [198] when no side information is given (i.e., K_yy = I). In that case, MVU, which is an unsupervised dimensionality reduction method, aims at maximizing the variance in the high-dimensional feature space instead of the label dependence.

3.3 A Novel Dimensionality Reduction Framework

Recall that we are given a lot of labeled source domain data D_S = {(x_{S_1}, y_{S_1}), ..., (x_{S_{n_S}}, y_{S_{n_S}})},
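The two empirical estimates above, (3.2) and (3.4), can be computed directly from kernel matrices. The following sketch (illustrative only; the RBF kernel and the synthetic samples are assumptions, not part of the thesis) evaluates both quantities with plain numpy.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Z, gamma=0.5):
    # Squared MMD of (3.2): mean k(X,X) + mean k(Z,Z) - 2 mean k(X,Z).
    return (rbf(X, X, gamma).mean() + rbf(Z, Z, gamma).mean()
            - 2.0 * rbf(X, Z, gamma).mean())

def hsic(X, Y, gamma=0.5):
    # Biased HSIC estimate of (3.4): tr(HKHK_yy) / (n-1)^2.
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ rbf(X, X, gamma) @ H @ rbf(Y, Y, gamma)) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                # samples from P
Z = rng.normal(loc=2.0, size=(50, 2))       # samples from a shifted Q
# Identical samples give zero MMD; shifted samples give a positive value.
assert abs(mmd2(X, X)) < 1e-9 and mmd2(X, Z) > mmd2(X, X)
```

Dependent inputs (for example, Y computed from X) drive the HSIC estimate up, while independent draws keep it near zero, mirroring the asymptotic property quoted above.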

where x_{S_i} ∈ X is the input and y_{S_i} ∈ Y is the corresponding output,¹ and some unlabeled target domain data D_T = {x_{T_1}, ..., x_{T_{n_T}}}, where the input x_{T_i} is also in X. Our task is then to predict the labels y_{T_i}'s corresponding to the inputs x_{T_i}'s in the target domain. Let P(X_S) and Q(X_T) (or P and Q in short) be the marginal distributions of the input sets X_S = {x_{S_i}} and X_T = {x_{T_i}} from the source and target domains, respectively. In general, P and Q can be different. As mentioned in Chapter 2, a major assumption in most transductive transfer learning methods is that P ≠ Q, but P(Y_S|X_S) = P(Y_T|X_T). However, in many real-world applications, the conditional probability P(Y|X) may also change across domains due to noisy or dynamic factors underlying the observed data. In this chapter, we use a new, weaker assumption: P ≠ Q, but there exists a transformation ϕ such that P(ϕ(X_S)) ≈ P(ϕ(X_T)) and P(Y_S|ϕ(X_S)) ≈ P(Y_T|ϕ(X_T)). We believe that domain adaptation under this assumption is more realistic, though also much more challenging. In this thesis, we study domain adaptation under this assumption empirically.

Since we have no labeled data in the target domain, ϕ cannot be learned by directly minimizing the distance between P(Y_S|ϕ(X_S)) and P(Y_T|ϕ(X_T)). Hence, a key issue is how to find this transformation ϕ. We propose a new dimensionality reduction framework to learn ϕ such that (1) the distance between the marginal distributions P(ϕ(X_S)) and P(ϕ(X_T)) is small (Chapter 3.3.1), and (2) ϕ(X_S) and ϕ(X_T) preserve some important properties of X_S and X_T (Chapter 3.3.2), and we apply it to solve two real-world applications successfully. We then assume that such a ϕ satisfies P(Y_S|ϕ(X_S)) ≈ P(Y_T|ϕ(X_T)). Standard supervised learning methods can then be applied on the mapped source domain data ϕ(X_S), together with the corresponding labels Y_S, to train models for use on the mapped target domain data ϕ(X_T).

3.3.1 Minimizing the Distance between P(ϕ(X_S)) and P(ϕ(X_T))

To reduce the distance between domains, a first objective in our dimensionality reduction framework is to discover a desired mapping ϕ by minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)), denoted by Dist(P(ϕ(X_S)), P(ϕ(X_T))).

¹ In general, D_S may contain unlabeled data. Here, for simplicity, we assume that D_S is fully labeled.

3.3.2 Preserving Properties of X_S and X_T

However, learning the transformation ϕ by only minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)) may not be enough. Figure 3.1(a) shows a simple two-dimensional example, in which the source domain data are shown in red and the target domain data in blue. Here, x_1 is the discriminative direction that can separate the positive and negative samples, while x_2 is a noisy dimension with small variance. By focusing only on minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)), one would select the noisy component x_2, which is completely irrelevant to the target supervised learning task. Hence, besides reducing the distance between the two marginal distributions, ϕ should also preserve data properties that are useful for the target supervised learning task. An obvious choice is to maximally preserve the data variance, as is performed by the well-known PCA and KPCA (Chapter 3.2.1). However, in transductive transfer learning, focusing only on the data variance is again not desirable. An example is shown in Figure 3.1(b), where the direction with the largest variance (x_1) cannot be used to reduce the distance between the distributions across domains and is not useful in boosting the performance for domain adaptation. Thus, an effective dimensionality reduction framework for transfer learning should ensure that, in the reduced latent space, the distance between data distributions across domains is reduced and, at the same time, data properties are preserved as much as possible.

[Figure 3.1: Motivating examples for the dimensionality reduction framework. (a) Only minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)). (b) Only maximizing the data variance. Positive and negative samples of the source and target domains are plotted in the (x_1, x_2) plane.]

Then, standard supervised techniques can be applied in the reduced latent space to train classification or regression models from the labeled source domain data for use in the target domain. The proposed framework is summarized in Algorithm 3.1.

Algorithm 3.1 A dimensionality reduction framework for transfer learning
Require: A labeled source domain data set D_S = {(x_{S_i}, y_{S_i})}, an unlabeled target domain data set D_T = {x_{T_i}}.
Ensure: Predicted labels Y_T of the unlabeled data X_T in the target domain.
1: Learn a transformation mapping ϕ, such that Dist(ϕ(X_S), ϕ(X_T)) is small, and ϕ(X_S) and ϕ(X_T) preserve properties of X_S and X_T, respectively.
2: Train a classification or regression model f on ϕ(X_S) with the corresponding labels Y_S.
3: For the unlabeled data x_{T_i}'s in D_T, map them to the latent space to get new representations ϕ(x_{T_i})'s. Then, use the model f to make predictions f(ϕ(x_{T_i}))'s.
4: return ϕ and the f(ϕ(x_{T_i}))'s.

As can be seen, the first step is the key step: once the transformation mapping ϕ is learned, one just needs to apply existing machine learning and data mining methods on the mapped data ϕ(X_S) with the corresponding labels Y_S to train models for use in the target domain. Hence, one advantage of this framework is that most existing machine learning methods can be easily integrated into it. Another advantage is that it works for diverse machine learning tasks, such as classification and regression problems.

3.4 Maximum Mean Discrepancy Embedding (MMDE)

As introduced in Chapter 3.2.2, the distance between the two distributions P(ϕ(X_S)) and P(ϕ(X_T)) can be empirically measured by the (squared) distance between the empirical means of the two domains:

    Dist(ϕ(X_S), ϕ(X_T)) = ∥ (1/n_S) ∑_{i=1}^{n_S} φ∘ϕ(x_{S_i}) − (1/n_T) ∑_{i=1}^{n_T} φ∘ϕ(x_{T_i}) ∥_H^2,    (3.6)

for some φ ∈ H, where φ is the feature map induced by a universal kernel; for notational simplicity, we denote φ(ϕ(x)) by φ∘ϕ(x). Thus, a desired nonlinear mapping ϕ can be found by minimizing this quantity. However, φ is usually highly nonlinear, so a direct optimization of (3.6) with respect to ϕ may be intractable and can get stuck in poor local minima. Therefore, in this section, we propose a kernel-based dimensionality reduction method called Maximum Mean Discrepancy Embedding (MMDE) [135] to construct the ϕ implicitly, which makes the optimization problem tractable. Note that in practice, on using the empirical measure of MMD (3.2), the corresponding kernel may not need to be universal [173].

3.4.1 Kernel Learning for a Transfer Latent Space

Instead of finding the transformation ϕ explicitly, MMDE constructs it implicitly through kernel matrix learning, as described next.

Here, we translate the problem of learning ϕ to the problem of learning the composite feature map φ∘ϕ. Before describing the details of MMDE, we first introduce the following lemma.

Lemma 1. Let φ be the feature map induced by a kernel, and let k(x_i, x_j) = ⟨φ∘ϕ(x_i), φ∘ϕ(x_j)⟩. Then φ∘ϕ is also the feature map induced by a kernel, for any arbitrary map ϕ.

Proof. Denote x' = ϕ(x). For any finite sample X = {x_1, ..., x_n}, let X' = {ϕ(x_1), ..., ϕ(x_n)} = {x'_1, ..., x'_n}. We can write ⟨φ∘ϕ(x_i), φ∘ϕ(x_j)⟩ = ⟨φ(x'_i), φ(x'_j)⟩. Since φ is the feature map induced by a kernel, k(x_i, x_j) = ⟨φ(x'_i), φ(x'_j)⟩ is positive semi-definite for the sample X'. Thus k(·, ·) is a valid kernel function and the corresponding matrix K is positive semi-definite. Therefore, φ∘ϕ is also the feature map induced by a kernel for any arbitrary map ϕ.

Therefore, our goal becomes finding the feature map φ∘ϕ of some kernel such that (3.6) is minimized. In the transductive setting,² we can learn the kernel matrix K instead of the universal kernel k by minimizing the distance (measured w.r.t. the MMD) between the projected source and target domain data. By using the kernel trick, (3.6) can be written in terms of the kernel matrices defined by k, as:

    Dist(X'_S, X'_T) = (1/n_S^2) ∑_{i,j=1}^{n_S} k(x_{S_i}, x_{S_j}) + (1/n_T^2) ∑_{i,j=1}^{n_T} k(x_{T_i}, x_{T_j}) − (2/(n_S n_T)) ∑_{i,j=1}^{n_S,n_T} k(x_{S_i}, x_{T_j}) = tr(KL),    (3.7)

where

    K = [ K_{S,S}  K_{S,T} ; K_{T,S}  K_{T,T} ] ∈ R^{(n_S+n_T)×(n_S+n_T)}    (3.8)

is a composite kernel matrix, with K_{S,S} and K_{T,T} being the kernel matrices defined by k on the data in the source and target domains, respectively, and L = [L_ij] ≽ 0 with L_ij = 1/n_S^2 if x_i, x_j ∈ X_S; L_ij = 1/n_T^2 if x_i, x_j ∈ X_T; and L_ij = −1/(n_S n_T) otherwise.

² Recall that in transductive transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain. Here, the word "transductive" is from the transductive learning [90] setting in traditional machine learning, which refers to the situation where all test data are required to be seen at training time: when some new test data arrive, they must be classified together with all the existing data, so that one can find their corresponding samples in the latent space; the learned model cannot be reused for future data. In our setting, the term "transductive" is used to emphasize this concept.
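The identity (3.7) can be verified numerically. The sketch below (synthetic data and a linear kernel, both chosen purely for illustration) builds the composite kernel matrix K of (3.8) and the coefficient matrix L, and checks that tr(KL) coincides with the explicit MMD expression, which for a linear kernel is just the squared distance between the domain means.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nT = 8, 5
XS = rng.normal(size=(nS, 3))                 # source domain samples
XT = rng.normal(loc=1.0, size=(nT, 3))        # shifted target domain samples

X = np.vstack([XS, XT])
K = X @ X.T                                   # composite kernel matrix (3.8), linear kernel

L = np.empty((nS + nT, nS + nT))              # coefficient matrix L of (3.7)
L[:nS, :nS] = 1.0 / nS**2
L[nS:, nS:] = 1.0 / nT**2
L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)

# For a linear kernel, the MMD estimate is the squared distance between means.
mmd_direct = ((XS.mean(0) - XT.mean(0)) ** 2).sum()
assert np.isclose(np.trace(K @ L), mmd_direct)
```

The same L reappears unchanged in the TCA objective later in this chapter, so this construction is worth keeping in mind.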

However, as mentioned in Chapter 3.3.2, it may not be sufficient to learn the transformation by only minimizing the distance between the projected source and target domain data. Thus, besides minimizing the trace of KL in (3.7), we impose the following objectives/constraints to preserve the properties of the original data:

1. The distance is preserved, i.e., K_ii + K_jj − 2K_ij = d_ij^2 for all i, j such that (i, j) ∈ N, as proposed in MVU and colored MVU. This distance preservation constraint is motivated by MVU.
2. The trace of K is maximized. Similar to MVU, as introduced in Chapter 3.2.4, MMDE aims to unfold the high-dimensional data by maximizing the trace of K; maximizing the trace of K is equivalent to maximizing the variance of the embedded data.
3. The embedded data are centered, i.e., K1 = 0. This centering constraint is a standard condition for the post-processing PCA applied on the resultant kernel matrix K.

The optimization problem of MMDE can then be written as:

    min_{K ≽ 0}  tr(KL) − λ tr(K)
    s.t.  K_ii + K_jj − 2K_ij = d_ij^2, ∀(i, j) ∈ N,
          K1 = 0,    (3.9)

where the first term in the objective minimizes the distance between distributions while the second term maximizes the variance in the feature space, λ ≥ 0 is a tradeoff parameter, and 0 and 1 are the vectors of zeros and ones, respectively. Computationally, this leads to a SDP involving K [95], which makes the kernel matrix learning tractable. After learning the kernel matrix K in (3.9), PCA is applied on the resultant kernel matrix, and the leading eigenvectors are selected to reconstruct the desired mapping φ∘ϕ implicitly and to map the data into a low-dimensional latent space across domains; that is, we can apply PCA on the resultant kernel matrix to reconstruct the low-dimensional representations X'_S and X'_T of X_S and X_T.

Note that the optimization problem (3.9) is similar to colored MVU as introduced in Chapter 3.2.4. However, there are two major differences between MMDE and colored MVU. First, the L matrix in colored MVU is a kernel matrix that encodes label information of the data, while the L in MMDE can be treated as a kernel matrix that encodes the distribution information of the different data sets X_S and X_T. Second, the objectives are different: colored MVU maximizes the dependence between the embedding and the side information, while MMDE minimizes the distance between distributions for transductive transfer learning. In Chapter 3.7, we will discuss the relationship between these methods in detail.

3.4.2 Making Predictions in the Latent Space

After obtaining the new representations X'_S and X'_T, we can train a classification or regression model f from X'_S with the corresponding labels Y_S. This model can then be used to make predictions on X'_T, as: y_{T_i} = f(x'_{T_i}). However, since we do not learn an explicit mapping to project the original data X_S and X_T to the embeddings X'_S and X'_T, for out-of-sample data in the target domain we need to apply other techniques to make predictions. Here, to estimate the labels of new test data in the target domain, we use the method of harmonic functions [234]:

    f_i = ( ∑_{j∈N} w_ij f_j ) / ( ∑_{j∈N} w_ij ),    (3.10)

where N is the set of k nearest neighbors of x_{T_i} in X_T, w_ij is the similarity between x_{T_i} and x_{T_j}, which can be obtained based on the Euclidean distance or another similarity measure, and the f_j's are the predicted labels of the x_{T_j}'s.

3.4.3 Summary

In MMDE, we translate the problem of learning the transformation mapping ϕ in the dimensionality reduction framework into a kernel matrix learning problem. This translation makes the problem of learning ϕ tractable. The MMDE algorithm for transfer learning is summarized in Algorithm 3.2.

Algorithm 3.2 Transfer learning via Maximum Mean Discrepancy Embedding (MMDE)
Require: A labeled source domain data set D_S = {(x_{S_i}, y_{S_i})}, an unlabeled target domain data set D_T = {x_{T_i}}, and λ > 0.
Ensure: Predicted labels Y_T of the unlabeled data X_T in the target domain.
1: Solve the SDP problem in (3.9) to obtain a kernel matrix K.
2: Apply PCA to the learned K to get new representations {x'_{S_i}} and {x'_{T_i}} of the original data {x_{S_i}} and {x_{T_i}}, respectively.
3: Learn a classifier or regressor f : x'_{S_i} → y_{S_i}.
4: For the unlabeled data x_{T_i}'s in D_T, use the learned classifier or regressor to predict their labels.
5: For new test data x_T's in the target domain, use harmonic functions with {x_{T_i}, f(x'_{T_i})}, as defined in (3.10), to make predictions.
6: return Predicted labels Y_T.

As can be seen in the algorithm, the key step of MMDE is the first one, which applies a SDP solver to the optimization problem in (3.9). As there are O((n_S + n_T)^2) variables in K, the overall time complexity is O((n_S + n_T)^{6.5}) [129]. In the second step, the standard PCA method is applied on the learned kernel matrix to construct the latent representations X'_S and X'_T of X_S and X_T. A model is then trained on X'_S with the corresponding labels Y_S. Finally, the method of harmonic functions introduced in (3.10) is used to make predictions on unseen target domain data.
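A minimal sketch of the prediction rule (3.10) is given below; the RBF similarity for w_ij and the one-dimensional data are illustrative assumptions, not the thesis implementation. The label of a new target point is the similarity-weighted average of the predicted labels of its k nearest neighbors among the embedded target data.

```python
import numpy as np

def harmonic_predict(x_new, XT, fT, k=3, gamma=1.0):
    """Predict the label of x_new from its k nearest target points, cf. (3.10)."""
    d2 = ((XT - x_new) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                  # indices of the k nearest neighbors
    w = np.exp(-gamma * d2[nn])              # RBF similarities w_ij (an assumption)
    return (w * fT[nn]).sum() / w.sum()      # similarity-weighted label average

XT = np.array([[0.0], [1.0], [2.0], [10.0]])     # embedded target data
fT = np.array([0.0, 0.0, 1.0, 1.0])              # labels predicted by f
y = harmonic_predict(np.array([0.5]), XT, fT, k=2)
assert np.isclose(y, 0.0)    # both nearest neighbors carry label 0
```

Any similarity that decreases with distance would serve here; the RBF form is simply a common default.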

Since MMDE learns the kernel matrix from the data automatically, it can fit the data well. However, there are several limitations of MMDE. Firstly, it is transductive and cannot generalize directly to unseen data: in order to make predictions on unseen target domain unlabeled data, we need to re-apply the MMDE method on the source domain and the new target domain data to reconstruct their representations, or resort to other techniques such as the method of harmonic functions. Secondly, the criterion (3.9) in MMDE requires K to be positive semi-definite, and the resultant kernel learning problem has to be solved by expensive SDP solvers. The overall time complexity is O((n_S + n_T)^{6.5}) in general, which becomes computationally prohibitive even for small-sized problems. Finally, in order to construct the low-dimensional representations X'_S and X'_T, the obtained K has to be further post-processed by PCA, which may discard potentially useful information in K.

3.5 Transfer Component Analysis (TCA)

In order to overcome the limitations of MMDE described in the previous section, in this section we propose an efficient framework to find the nonlinear mapping φ∘ϕ based on empirical kernel feature extraction. Instead of using a two-step approach as in MMDE, we propose a unified kernel learning method which utilizes an explicit low-rank representation. It avoids the use of SDP and thus its high computational burden. Moreover, the learned kernel can be generalized to out-of-sample data directly.

3.5.1 Parametric Kernel Map for Unseen Data

First, note that the kernel matrix K in (3.8) can be decomposed as K = (K K^{−1/2})(K^{−1/2} K), which is often known as the empirical kernel map [163].³ Consider the use of a (n_S + n_T) × m matrix W̃ that transforms the empirical kernel map features to an m-dimensional space (where m ≪ n_S + n_T). The resultant kernel matrix is then

    K̃ = (K K^{−1/2} W̃)(W̃^⊤ K^{−1/2} K) = K W W^⊤ K,    (3.11)

where W = K^{−1/2} W̃ ∈ R^{(n_S+n_T)×m}. In particular, the corresponding kernel evaluation between any two patterns x_i and x_j is given by

    k̃(x_i, x_j) = k_{x_i}^⊤ W W^⊤ k_{x_j},    (3.12)

where k_x = [k(x_1, x), ..., k(x_{n_S+n_T}, x)]^⊤ ∈ R^{n_S+n_T}. Hence, the kernel k̃ facilitates a readily parametric form for out-of-sample kernel evaluations.

³ As is common practice, one can ensure that the kernel matrix K is positive definite by adding a small ϵ > 0 to its diagonal [135].
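The low-rank parametrization (3.11)-(3.12) is easy to sanity-check numerically. In the sketch below (synthetic data, a linear kernel and a random W, all illustrative assumptions), the in-sample block of the parametric kernel K W W^⊤ K agrees entry-wise with the evaluation k̃(x_i, x_j) = k_{x_i}^⊤ W W^⊤ k_{x_j}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10, 3
X = rng.normal(size=(n, 4))
K = X @ X.T                       # base kernel matrix (linear kernel)
W = rng.normal(size=(n, m))       # low-rank transformation, m << n

K_tilde = K @ W @ W.T @ K         # parametric kernel matrix, cf. (3.11)

# Out-of-sample style evaluation for two training points, cf. (3.12):
# k_x collects the kernel evaluations of x against all training points.
i, j = 1, 7
k_i, k_j = K[:, i], K[:, j]
assert np.isclose(k_i @ W @ W.T @ k_j, K_tilde[i, j])
```

For a genuinely unseen point, k_x would be recomputed against the training data, which is exactly what makes the parametric form generalize where MMDE cannot.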

Then, on using the definition of K̃ in (3.11), the MMD distance between the empirical means of the two domains X'_S and X'_T can be rewritten as:

    Dist(X'_S, X'_T) = tr((K W W^⊤ K) L) = tr(W^⊤ K L K W).    (3.13)

In minimizing the objective (3.13), a regularization term tr(W^⊤ W) is usually needed to control the complexity of W. As will be shown later in this section, this regularization term can also avoid the rank deficiency of the denominator in the generalized eigenvalue decomposition.

3.5.2 Unsupervised Transfer Component Extraction

Besides reducing the distance between the two marginal distributions in (3.13), we also need to preserve data properties such as the data variance using the parametric kernel map, as is performed by the well-known PCA and KPCA (Chapter 3.2.1). Note from (3.11) that the embedding of the data in the latent space is W^⊤ K, where the ith column [W^⊤ K]_i provides the embedding coordinates of x_i. Hence, the variance of the projected samples is W^⊤ K H K W, where H = I_{n_S+n_T} − (1/(n_S+n_T)) 11^⊤ is the centering matrix, 1 ∈ R^{n_S+n_T} is the column vector with all ones, and I_{n_S+n_T} ∈ R^{(n_S+n_T)×(n_S+n_T)} is the identity matrix.

Combining the parametric kernel representations of the distance between distributions and of the data variance, we develop a new dimensionality reduction method such that, in the latent space spanned by the learned components, the variance of the data is preserved as much as possible and the distance between the different distributions across domains is reduced. The kernel learning problem then becomes:

    min_W  tr(W^⊤ K L K W) + µ tr(W^⊤ W)
    s.t.  W^⊤ K H K W = I_m,    (3.14)

where µ > 0 is a trade-off parameter and I_m ∈ R^{m×m} is the identity matrix (for notational simplicity, we drop the subscript m from I_m in the sequel). Though this optimization problem involves a non-convex norm constraint W^⊤ K H K W = I, it can still be solved efficiently via the following trace optimization problem.

Proposition 1. The optimization problem (3.14) can be re-formulated as

    min_W  tr((W^⊤ K H K W)^† W^⊤ (K L K + µI) W),    (3.15)

or

    max_W  tr((W^⊤ (K L K + µI) W)^{−1} W^⊤ K H K W),    (3.16)

where m ≤ n_S + n_T − 1.

Proof. The Lagrangian of (3.14) is

    tr(W^⊤ (K L K + µI) W) − tr((W^⊤ K H K W − I) Z),    (3.17)

where Z is a diagonal matrix containing the Lagrange multipliers. Setting the derivative of (3.17) w.r.t. W to zero, we have

    (K L K + µI) W = K H K W Z.    (3.18)

Multiplying both sides on the left by W^⊤ and then substituting into (3.17), we obtain (3.15). Since the matrix K L K + µI is non-singular, we obtain the equivalent trace maximization problem (3.16).

Similar to kernel Fisher discriminant analysis (KFD) [126, 216], the W solution in (3.16) consists of the m leading eigenvectors of (K L K + µI)^{−1} K H K, where m ≤ n_S + n_T − 1. In the sequel, this method will be referred to as Transfer Component Analysis (TCA), and the extracted components are called the transfer components. The TCA algorithm for transfer learning is summarized in Algorithm 3.3.

Algorithm 3.3 Transfer learning via Transfer Component Analysis (TCA)
Require: A source domain data set D_S = {(x_{S_i}, y_{S_i})}_{i=1}^{n_S} and a target domain data set D_T = {x_{T_j}}_{j=1}^{n_T}.
Ensure: Transformation matrix W and predicted labels Y_T of the unlabeled data X_T in the target domain.
1: Construct the kernel matrix K from {x_{S_i}}_{i=1}^{n_S} and {x_{T_j}}_{j=1}^{n_T} based on (3.8), the matrix L from (3.7), and the centering matrix H.
2: Compute the matrix (K L K + µI)^{−1} K H K.
3: Do eigen-decomposition and select the m leading eigenvectors to construct the transformation matrix W.
4: Map the data x_{S_i}'s and x_{T_j}'s to x'_{S_i}'s and x'_{T_j}'s via X'_S = [K_{S,S} K_{S,T}] W and X'_T = [K_{T,S} K_{T,T}] W, respectively.
5: Train a model f on the x'_{S_i}'s with the y_{S_i}'s.
6: For new test data x_T from the target domain, map them onto the latent space via x' = κW, where κ is a row vector with κ_t = k(x_T, x_t), t = 1, ..., n_S + n_T.
7: return the transformation matrix W and the predictions f(x')'s.

As can be seen in the algorithm, the key steps of TCA are the second and third: performing an eigen-decomposition of the matrix (K L K + µI)^{−1} K H K and taking the m leading eigenvectors to construct the transformation W. In general, this takes only O(m(n_S + n_T)^2) time when m nonzero eigenvectors are to be extracted [175], which is much more efficient than MMDE. Furthermore, once W is learned, TCA can be generalized to out-of-sample data: for new test data x_T from the target domain, we can use W to map them to the latent space directly.
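The core of Algorithm 3.3 fits in a few lines. The sketch below (plain numpy, a linear kernel, µ = 1 and synthetic domains; all of these choices are illustrative assumptions rather than the thesis implementation) builds K, L and H, takes the m leading eigenvectors of (K L K + µI)^{−1} K H K, and embeds both domains.

```python
import numpy as np

def tca(XS, XT, m=1, mu=1.0):
    """Sketch of unsupervised TCA (Algorithm 3.3) with a linear kernel."""
    nS, nT = len(XS), len(XT)
    n = nS + nT
    X = np.vstack([XS, XT])
    K = X @ X.T                                  # composite kernel matrix, cf. (3.8)

    L = np.empty((n, n))                         # MMD coefficient matrix, cf. (3.7)
    L[:nS, :nS] = 1.0 / nS**2
    L[nS:, nS:] = 1.0 / nT**2
    L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)

    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix

    # m leading eigenvectors of (KLK + mu*I)^{-1} KHK, cf. (3.16).
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(M)
    W = eigvecs[:, np.argsort(-eigvals.real)[:m]].real

    Z = K @ W                                    # embeddings of all n points
    return Z[:nS], Z[nS:], W

rng = np.random.default_rng(3)
XS = rng.normal(size=(30, 2)) + np.array([0.0, 2.0])    # source domain
XT = rng.normal(size=(30, 2)) + np.array([0.0, -2.0])   # shifted target domain
ZS, ZT, W = tca(XS, XT, m=1)

# The normalized gap between the domain means shrinks after the mapping.
gap_before = np.linalg.norm(XS.mean(0) - XT.mean(0)) / XS.std()
gap_after = abs(ZS.mean() - ZT.mean()) / ZS.std()
assert gap_after < gap_before
```

The learned W also maps out-of-sample target points via their kernel vector κ, as in step 6 of the algorithm; only the small m-column matrix needs to be stored.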

3.5.3 Experiments on Synthetic Data

As described in the previous sections, TCA has two main objectives: minimizing the distance between domains and maximizing the data variance in the latent space. In this section, we perform experiments on synthetic data to demonstrate the effectiveness of these two objectives of TCA in learning a 1D latent space from 2D data. Here, we use the linear kernel on the inputs and fix µ = 1.

Only Minimizing the Distance between Distributions. As discussed in Chapter 3.3.2, it is not desirable to learn the transformation ϕ by only minimizing the distance between the marginal distributions P(ϕ(X_S)) and P(ϕ(X_T)). Here, we illustrate this by using the synthetic data from the example in Figure 3.1(a) (which is also reproduced in Figure 3.2(a)). We compare TCA with the method of stationary subspace analysis (SSA) as introduced in Chapter 2, which is an empirical method to find an identical stationary latent space of the source and target domain data. We further apply the one-nearest-neighbor (1-NN) classifier to make predictions on the target domain data in the original 2D space, and in the latent spaces learned by SSA and TCA, respectively. As can be seen from Figure 3.2(b), the distance between the distributions of the different domain data in the one-dimensional space learned by SSA is small. However, the positive and negative samples overlap in the latent space, which is not useful for making predictions on the mapped target domain data. On the other hand, as can be seen from Figure 3.2(c), though the distance between the distributions of the different domain data in the latent space learned by TCA is larger than that learned by SSA, the distance between the different domain data in the latent space is still reduced and the positive and negative samples are more separated. As a result, TCA leads to significantly better accuracy than SSA.

Only Maximizing the Data Variance. As discussed in Chapter 3.3.2, learning the transformation ϕ by only maximizing the data variance may not be useful in domain adaptation. Here, we reproduce Figure 3.1(b) in Figure 3.3(a). As can be seen from Figures 3.3(b) and 3.3(c), the variance of the mapped data in the 1D space learned by PCA is very large; however, the distance between the mapped data across the different domains is still large and the positive and negative samples overlap in the latent space, which is not useful for domain adaptation. On the other hand, though the variance of the mapped data in the 1D space learned by TCA is smaller than that learned by PCA, the two classes are now more separated. To fully test the effectiveness of TCA, we will conduct more detailed experiments comparing with other existing methods on two real-world datasets in Chapter 4 and Chapter 5.

Figure 3.2: Illustrations of the proposed TCA and SSTCA on synthetic dataset 1: (a) data set 1 (acc: 82%); (b) 1D projection by SSA (acc: 60%); (c) 1D projection by TCA (acc: 86%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

3.5.4 Summary

In TCA, we propose to learn a low-rank parametric kernel for transfer learning instead of the entire kernel matrix. Indeed, the parametric kernel is a composite kernel that consists of an empirical kernel and a linear transformation. Parameter values of the empirical kernel need to be tuned by human experience or by using cross-validation, which may be sensitive to different application areas. However, once the transformation W is learned, data from the source and target domains, including unseen data, can be mapped to the latent space directly. Compared to MMDE, TCA has two advantages. (1) It is much more efficient: TCA only requires a simple and efficient eigenvalue decomposition, which takes only O(m(nS + nT)²) time when m nonzero eigenvectors are to be extracted. (2) It can be generalized to out-of-sample target domain data naturally.
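The eigendecomposition at the heart of unsupervised TCA can be sketched as follows. This is a simplified sketch under the assumption of a linear kernel; `tca_components` is a hypothetical helper, not the thesis implementation:

```python
import numpy as np

def tca_components(K, n_s, n_t, mu=1.0, m=2):
    """Sketch of unsupervised TCA: eigendecompose (KLK + mu*I)^{-1} KHK and
    keep the m leading eigenvectors as the transformation W.
    K is the (n_s+n_t) x (n_s+n_t) kernel matrix over the pooled data."""
    n = n_s + n_t
    # MMD coefficient matrix L: 1/n_s^2 (source pairs), 1/n_t^2 (target
    # pairs), -1/(n_s*n_t) (cross pairs), written as an outer product.
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    L = np.outer(e, e)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(A)
    idx = np.argsort(-vals.real)[:m]
    return vecs[:, idx].real                    # embedding = K @ W

# Toy usage with a linear kernel over two shifted 2D domains.
rng = np.random.default_rng(1)
Xs = rng.normal(0, 1, (20, 2)); Xt = rng.normal(2, 1, (30, 2))
X = np.vstack([Xs, Xt]); K = X @ X.T
W = tca_components(K, 20, 30, mu=1.0, m=1)
Z = K @ W
print(Z.shape)  # (50, 1)
```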

However, the unsupervised TCA proposed in Chapter 3.5 does not consider the label information in learning the components. As a result, the transfer components learned by TCA may not be useful for the target classification task. As shown in Figure 3.4, the direction with the largest variance (x1) can be orthogonal to the discriminative direction (x2). A solution to overcome this is to encode the source domain label information into the embedding learning, such that the learned components are discriminative with respect to the labels in the source domain and can be used to reduce the distance between the source and target domains as well.

Figure 3.3: Illustrations of the proposed TCA and SSTCA on synthetic dataset 2: (a) data set 2 (accuracy: 50%); (b) 1D projection by PCA (acc: 48%); (c) 1D projection by TCA (acc: 82%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets, respectively.

3.6 Semi-Supervised Transfer Component Analysis (SSTCA)

As Ben-David et al. [24], Blitzer et al. [16] and Mansour et al. [121] mentioned in their works, a good representation should (1) reduce the distance between the distributions of the source and target domain data, and (2) minimize the empirical error on the labeled data in the source domain⁴.

4 Recall that in our setting, there is no labeled data in the target domain.

Figure 3.4: The direction with the largest variance is orthogonal to the discriminative direction.

Moreover, in many real-world applications (such as WiFi localization), there exists an intrinsic low-dimensional manifold underlying the high-dimensional observations. The effective use of manifold information is an important component in many semi-supervised learning algorithms [34, 233]. Note that in traditional semi-supervised learning settings [233, 34], the labeled and unlabeled data are from the same domain; however, in the context of domain adaptation here, the labeled and unlabeled data are from different domains.

In this section, we extend the unsupervised TCA of Chapter 3.5 to the semi-supervised learning setting. Motivated by the kernel target alignment [43], a representation that maximizes its dependence on the data labels may lead to better generalization performance. Hence, we maximize the label dependence instead of minimizing the empirical error (Chapter 3.6.1). Moreover, we encode the manifold structure into the embedding learning so as to propagate label information from the labeled (source domain) data to the unlabeled (target domain) data (Chapter 3.6.1).

Figure 3.5: There exists an intrinsic manifold structure underlying the observed data.

3.6.1 Optimization Objectives

Recall that while the source domain data are fully labeled, the target domain data are unlabeled. In this section, we delineate three desirable properties for this semi-supervised embedding, namely, (1) maximal alignment of the distributions between the source and target domain data in the embedded space, (2) high dependence on the label information, and (3) preservation of the local geometry.

Objective 1: Distribution Matching. As in the unsupervised TCA, our first objective is to minimize the MMD (3.13) between the source and target domain data in the embedded space.

Objective 2: Label Dependence. Our second objective is to maximize the dependence (measured w.r.t. HSIC) between the embedding and the labels. Intuitively, even when there are only a few labeled data in the source domain and a lot of unlabeled data in the target domain, the dependence between features and labels can be estimated precisely via HSIC. We propose to maximally align the embedding (which is represented by K in (3.11)) with

K̃yy = γKl + (1 − γ)Kv, (3.19)

where γ ≥ 0,

[Kl]ij = kyy(yi, yj) if i, j ≤ nS, and 0 otherwise, (3.20)

while

Kv = I. (3.21)

Here, (3.20) serves to maximize the label dependence on the labeled data, while (3.21) serves to maximize the variance on both the source and target domain data, which is in line with MVU [198]. Note that γ is a tradeoff parameter that balances the label dependence and data variance terms. Intuitively, if there are sufficient labeled data in the source domain, a large γ may be used; otherwise, we may use a small γ. Empirically, simply setting γ = 0.5 works well on all the data sets. The sensitivity of the performance to γ will be studied in more detail in Chapters 4 and 5. By substituting K (3.11) and K̃yy (3.19) into HSIC (3.4), our objective is thus to maximize

tr(H(KWW⊤K)H K̃yy) = tr(W⊤KH K̃yy HKW). (3.22)
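The label kernel K̃yy and the (biased, unnormalized) HSIC-style alignment it enters can be sketched as follows, assuming a linear kernel on scalar labels as in the experiments of this chapter:

```python
import numpy as np

def kyy_tilde(y_source, n_total, gamma=0.5):
    """Sketch of (3.19)-(3.21): a linear kernel K_l on the n_s labeled
    source points (zero elsewhere), blended with K_v = I."""
    n_s = len(y_source)
    Kl = np.zeros((n_total, n_total))
    Kl[:n_s, :n_s] = np.outer(y_source, y_source)   # linear kernel on labels
    Kv = np.eye(n_total)
    return gamma * Kl + (1.0 - gamma) * Kv

def hsic(K, Kyy):
    """Biased HSIC-style estimate tr(HKH Kyy), up to a constant factor."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ K @ H @ Kyy)

# Toy usage: 3 labeled source points, 2 unlabeled target points.
y_s = np.array([1.0, -1.0, 1.0])
Ky = kyy_tilde(y_s, n_total=5, gamma=0.5)
print(Ky.shape)  # (5, 5)
```

With γ = 0, K̃yy reduces to the identity and the HSIC term simply maximizes variance, recovering the unsupervised TCA objective.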

Objective 3: Locality Preserving. As reviewed in previous sections, Colored MVU and MMDE preserve the local geometry of the manifold by enforcing distance constraints on the desired kernel matrix K. More specifically, let N = {(xi, xj)} be the set of sample pairs that are k-nearest neighbors of each other, and dij = ∥xi − xj∥ be the distance between xi and xj in the original input space. For each (xi, xj) in N, a constraint Kii + Kjj − 2Kij = d²ij is added to the optimization problem. Hence, the resultant SDP will typically have a very large number of constraints. To avoid this problem, we make use of the locality preserving property of the Laplacian Eigenmap [13]. First, we construct a graph with affinities mij = exp(−d²ij / 2σ²) if xi is one of the k nearest neighbors of xj, or vice versa. Let M = [mij]. The graph Laplacian matrix is ℒ = D − M, where D is the diagonal matrix with entries dii = Σⱼ mij. Note from (3.11) that the embedding of the data in Rᵐ is W⊤K, where the ith column [W⊤K]i provides the embedding coordinates of xi. Intuitively, if xi and xj are neighbors in the input space, the distance between the embedding coordinates of xi and xj should be small. Hence, our third objective is to minimize

Σ_{(i,j)∈N} mij ∥[W⊤K]i − [W⊤K]j∥² = tr(W⊤KℒKW). (3.23)

3.6.2 Formulation and Optimization Procedure

In this section, we present how to combine the three objectives to find a W that maximizes (3.22), while simultaneously minimizing (3.13) and (3.23). The final optimization problem can be written as

min_W  tr(W⊤KLKW) + µ tr(W⊤W) + (λ/n²) tr(W⊤KℒKW)
s.t.   W⊤KH K̃yy HKW = I, (3.24)

where λ ≥ 0 is another tradeoff parameter, n² = (nS + nT)² is a normalization term, L is the MMD matrix of (3.13), and ℒ is the graph Laplacian defined above. For simplicity, we use λ to denote λ/n² in the rest of this paper. Similar to the unsupervised TCA, (3.24) can be formulated as the following quotient trace problem:

max_W  tr{(W⊤K(L + λℒ)KW + µI)⁻¹ (W⊤KH K̃yy HKW)}. (3.25)

It is well known that (3.25) can be solved by eigendecomposing (K(L + λℒ)K + µI)⁻¹ KH K̃yy HK. In the sequel, this will be referred to as Semi-Supervised Transfer Component Analysis (SSTCA). The procedure for both the unsupervised and semi-supervised TCA is summarized in Algorithm 3.4.
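The graph Laplacian construction and the eigendecomposition that solves (3.25) can be sketched as follows. This is a simplified sketch, not the thesis implementation; the helper names are hypothetical:

```python
import numpy as np

def knn_laplacian(X, k=3, sigma=1.0):
    """Graph Laplacian L = D - M with affinities m_ij = exp(-d_ij^2 / 2 sigma^2)
    between points that are k-nearest neighbors (or vice versa)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    M = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]          # skip the point itself
        M[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    M = np.maximum(M, M.T)                        # symmetrize ("or vice versa")
    return np.diag(M.sum(1)) - M

def sstca_components(K, L_mmd, Lap, Kyy, mu=1.0, lam=1.0, m=2):
    """Sketch of (3.25): eigendecompose
    (K (L + lam*Lap) K + mu I)^{-1} K H Kyy H K and keep m eigenvectors."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = np.linalg.solve(K @ (L_mmd + lam * Lap) @ K + mu * np.eye(n),
                        K @ H @ Kyy @ H @ K)
    vals, vecs = np.linalg.eig(A)
    return vecs[:, np.argsort(-vals.real)[:m]].real

# Toy usage: 5 source + 5 target points, identity label kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2)); K = X @ X.T
e = np.concatenate([np.full(5, 0.2), np.full(5, -0.2)])
W = sstca_components(K, np.outer(e, e), knn_laplacian(X), np.eye(10), m=1)
print(W.shape)  # (10, 1)
```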

Algorithm 3.4 Transfer learning via Semi-Supervised Transfer Component Analysis (SSTCA).
Require: Source domain data set DS = {(xSi, ySi)}, i = 1, ..., nS, and target domain data set DT = {xTj}, j = 1, ..., nT.
Ensure: Transformation matrix W and predicted labels YT of the unlabeled data XT in the target domain.
1: Construct the kernel matrix K from the xSi's and xTj's, the MMD matrix L, the graph Laplacian ℒ, the label kernel K̃yy, and the centering matrix H, as described above.
2: Compute the matrix (K(L + λℒ)K + µI)⁻¹ KH K̃yy HK.
3: Do eigen-decomposition and select the m leading eigenvectors to construct the transformation matrix W.
4: Map the data xSi's and xTj's to x′Si's and x′Tj's via X′S = [KS,S KS,T]W and X′T = [KT,S KT,T]W, respectively.
5: Train a model f on the x′Si's with the ySi's.
6: For new test data xt from the target domain, x′ = κW, where κ is a row vector with κi = k(xt, xi), i = 1, ..., nS + nT.
7: return the transformation matrix W and the predictions f(x′)'s.

As can be seen, the algorithm is very similar to that of TCA. The only difference between the two algorithms is that in SSTCA, the matrix that needs to be eigen-decomposed is (K(L + λℒ)K + µI)⁻¹ KH K̃yy HK instead of (KLK + µI)⁻¹ KHK. The overall time complexity is still O(m(nS + nT)²). In addition, because the transformation W is learned explicitly, SSTCA is also easy to generalize to out-of-sample data.
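Step 6 of Algorithm 3.4 — mapping an unseen target-domain point without re-solving the eigenproblem — can be sketched as follows; the random W and the linear kernel here are placeholders standing in for a learned transformation:

```python
import numpy as np

def embed_new_point(x_new, X_train, W, kernel):
    """Out-of-sample mapping x' = kappa @ W, where kappa_i = k(x_new, x_i)
    over all n_s + n_t training points (a sketch of step 6)."""
    kappa = np.array([kernel(x_new, xi) for xi in X_train])[None, :]  # row vector
    return kappa @ W   # 1 x m embedding coordinates

linear = lambda a, b: float(a @ b)

# Toy usage: 8 pooled training points, a learned 8 x 2 transformation W.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2)); W = rng.normal(size=(8, 2))
z = embed_new_point(np.array([1.0, -1.0]), X, W, linear)
print(z.shape)  # (1, 2)
```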
3.6.3 Experiments on Synthetic Data

As described in previous sections, SSTCA aims to optimize three objectives simultaneously: minimizing the distance between domains, maximizing the source domain label dependence in the latent space, and preserving the local geometric structure in the latent space. In this section, we perform experiments on synthetic data to demonstrate the effectiveness of these three objectives of SSTCA in learning a 1D latent space from the 2D data. Here, we use the linear kernel on both inputs and outputs, set γ = 0.5, and fix µ = 1. Since the focus here is not on locality preserving, we set the λ in SSTCA to zero.

Label Information. In this experiment, we demonstrate the advantage of using label information in the source domain data to improve classification performance (Figure 3.6(a)); here, the difference between TCA and SSTCA is in the use of label information. As can be seen from Figure 3.6(b), the positive and negative samples overlap significantly in the latent space learned by TCA. On the other hand, with the use of label information, the positive and negative samples are more separated in the latent space learned by SSTCA (Figure 3.6(c)).

Figure 3.6: Illustrations of the proposed TCA and SSTCA on synthetic dataset 3: (a) data set 3 (acc: 69%); (b) 1D projection by TCA (acc: 56%); (c) 1D projection by SSTCA (acc: 79%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

However, in some applications, it may be possible that the discriminative direction of the source domain data is quite different from that of the target domain data. In this case, encoding label information from the source domain (as SSTCA does) may not help, or may even hurt the classification performance as compared to the unsupervised TCA. An example is shown in Figure 3.7(a). As can be seen from Figures 3.7(b) and 3.7(c), the positive and negative samples in the target domain are more separated in the latent space learned by TCA than in that learned by SSTCA. In summary, when the discriminative directions across the different domains are similar, SSTCA can outperform TCA by encoding label information into the embedding learning; when the discriminative directions across the different domains are different, SSTCA may not improve the performance, or may even perform worse than TCA. Nevertheless, in both cases classification in the latent space becomes easier, and both SSTCA and TCA obtain better performance compared to non-adaptive methods.

Figure 3.7: Illustrations of the proposed TCA and SSTCA on synthetic dataset 4: (a) data set 4 (accuracy: 60%); (b) 1D projection by TCA (acc: 90%); (c) 1D projection by SSTCA (acc: 68%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

Manifold Information. In this experiment, we demonstrate the advantage of using manifold information to improve classification performance. Both the source and target domain data have the well-known two-moon manifold structure [232] (Figure 3.8(a)). SSTCA is used with and without Laplacian smoothing (by setting λ in (3.24) to 1000 and 0, respectively). As can be seen from Figures 3.8(b) and 3.8(c), Laplacian smoothing can indeed help improve classification performance when a manifold structure underlies the observed data.

3.6.4 Summary

In SSTCA, we extend the unsupervised TCA in a semi-supervised manner by maximizing the source domain label dependence in the embedding learning. Similar to TCA, the time complexity

of SSTCA is O(m(nS + nT)²).

Figure 3.8: Illustrations of the proposed TCA and SSTCA on synthetic dataset 5: (a) data set 5 (acc: 70%); (b) 1D projection by SSTCA without Laplacian smoothing (acc: 83%); (c) 1D projection by SSTCA with Laplacian smoothing (acc: 91%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

Experiments on synthetic data show that when the discriminative directions of the source and target domains are the same or close to each other, SSTCA can boost the classification accuracy compared to TCA. However, when the discriminative directions of the source and target domains are different, SSTCA does not work well. Furthermore, when the source and target domain data have an intrinsic manifold structure, SSTCA is more effective. A more complete comparison between MMDE, TCA and SSTCA will be conducted in Chapters 4-5.

3.7 Further Discussion

As mentioned in Chapter 3.2, the embedding approaches — MVU, Colored MVU, MMDE and the proposed TCA and SSTCA — are all based on Hilbert space embedding of distributions

via MMD and HSIC. Essentially, MVU and Colored MVU learn a nonparametric kernel matrix by maximizing the data variance or the label dependence for data analysis in a single domain, while MMDE learns a nonparametric kernel matrix by minimizing the domain distance for cross-domain learning. In contrast, the proposed TCA and SSTCA learn parametric kernel matrices by simultaneously minimizing the distribution distance and maximizing the data variance and label dependence. MVU, MMDE and TCA are unsupervised and do not utilize side information, while Colored MVU is supervised and SSTCA is semi-supervised, exploiting side information in the embedding learning. We summarize the relationships among these approaches in Table 3.1.

Table 3.1: Summary of dimensionality reduction methods based on Hilbert space embedding of distributions.

method      | setting         | kernel        | out-of-sample | label | distribution matching | geometry | variance
MVU         | unsupervised    | nonparametric |               |       |                       | yes      | yes
Colored MVU | supervised      | nonparametric |               | yes   |                       | yes      |
MMDE        | unsupervised    | nonparametric |               |       | yes                   | yes      | yes
TCA         | unsupervised    | parametric    | yes           |       | yes                   |          | yes
SSTCA       | semi-supervised | parametric    | yes           | yes   | yes                   | yes      | yes

Note that SSTCA reduces to TCA when γ = 0 and λ = 0. If we further drop Objective 1 described in Chapter 3.6.1, TCA reduces to MVU⁵. Similarly, SSTCA can also be reduced to the semi-supervised Colored MVU when we drop the distribution matching term and set γ = 1 and λ > 0. Note also that criterion (3.7) in the kernel learning problem of MMDE is similar to the recently proposed supervised dimensionality reduction method Colored MVU [174]. However, MVU, Colored MVU and MMDE are all transductive and computationally expensive, as they learn a nonparametric kernel matrix via an SDP; furthermore, gradient descent is required to refine the embedding space, and thus the solution can still get stuck in a local minimum. To balance prediction performance with computational complexity in cross-domain learning, TCA is designed as a generalized version of MMDE, in which a low-rank approximation is used to reduce the number of constraints and variables in the SDP, which then reduces the time complexity from O((nS + nT)^6.5) to O(m(nS + nT)²). In summary, MVU and Colored MVU are dimensionality reduction approaches for single-domain data visualization, while MMDE, TCA and SSTCA are dimensionality reduction approaches for domain adaptation.

5 Note that MVU is a transductive dimensionality reduction method, while TCA can be generalized to out-of-sample patterns even in this restricted setting.

Last but not least, besides the embedding approaches, instance re-weighting methods, such as KMM, KLIEP, uLSIF, etc., have also been proposed to solve transductive transfer learning problems by matching data distributions, as mentioned in Chapter 2. The main difference between these methods and our proposed methods is that we aim to match the data distributions between domains in a latent space, where data properties can be preserved, instead of matching them in the original feature space. This is beneficial, as real-world data are often noisy, while the latent space has

been de-noised. As a result, matching the data distributions in the de-noised latent space may be more useful for the target learning tasks than matching them in the noisy original feature space, in the transductive transfer learning setting.

CHAPTER 4

APPLICATIONS TO WIFI LOCALIZATION

In this section, we apply the proposed dimensionality reduction framework to an application in wireless sensor networks: indoor WiFi localization.

4.1 WiFi Localization and Cross-Domain WiFi Localization

Location estimation is a major task of pervasive computing and AI applications that range from context-aware computing [80, 106] to robotics [12]. While GPS is widely used, it cannot work well in many outdoor places where high buildings block signals, nor in most indoor places. In recent years, with the increasing availability of 802.11 WiFi networks in cities and buildings, locating and tracking a user or cargo with wireless signal strength or received signal strength (RSS) is becoming a reality [74, 65, 100]. The objective of indoor WiFi localization is to estimate the location yi of a mobile device based on the RSS values xi = (xi1, xi2, ..., xik) received from k Access Points (APs), which periodically send out wireless signals. In general, we consider the two-dimensional coordinates of a location¹, and indoor WiFi localization is intrinsically a regression problem.

1 Extension to three-dimensional localization is straightforward.

In recent years, machine-learning-based methods have been applied to indoor WiFi localization [134, 132, 133, 65, 100]. In general, these methods work in two phases. In an offline phase, a mobile device (or multiple ones) moving around the wireless environment is used to collect wireless signals from the APs. The RSS values (e.g., the signal vector xi = (xi1, xi2, ..., xik)), together with the physical location information (i.e., the location coordinates in the 2D floor where the mobile device is), are used as labeled training data to learn a computational or statistical model. Then, in an online phase, the learned model is used to estimate locations from newly received RSS values.

As an example, Figure 4.1 shows an indoor 802.11 wireless environment of size about 60m × 50m. Five APs (AP1, ..., AP5) are set up in the environment. A user with an IBM T42 laptop that is equipped with an Intel Pro/2200BG internal wireless card walks through the environment from location A to F at times tA, ..., tF. Six signal vectors, each of which is 5-dimensional, are collected, as shown in Table 4.1. The corresponding labels of the signal vectors xtA, ..., xtF are the 2D coordinates of the locations A, ..., F in the building. Note that the blank cells in the table denote missing values, which we can fill in with a small default value, e.g., −100dBm.
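Forming a fixed-length RSS feature vector with such a default fill can be sketched as follows; the AP names and readings below are made up for illustration:

```python
import numpy as np

# APs that are not heard at a time step are filled with a small default
# value (-100 dBm), as suggested in the text.
DEFAULT_RSS = -100.0
AP_IDS = ["AP1", "AP2", "AP3", "AP4", "AP5"]

def rss_vector(readings):
    """readings: dict mapping AP id -> RSS in dBm (missing APs omitted)."""
    return np.array([readings.get(ap, DEFAULT_RSS) for ap in AP_IDS])

x = rss_vector({"AP1": -40.0, "AP3": -60.0})
print(x.tolist())  # [-40.0, -100.0, -60.0, -100.0, -100.0]
```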

However, the RSS values are noisy and can vary with time [217, 136]. Even in the same environment, the RSS data collected in one time period may differ from those collected in another. An example is shown in Figure 4.2: as can be seen, the contours of the RSS values received from the same AP at different time periods are very different. Moreover, most machine-learning methods rely on collecting a lot of labeled data to train an accurate localization model offline for use online, assuming that the distributions of the RSS data over different time periods are static, and it is expensive to calibrate a localization model in a large environment. As a result, domain adaptation is necessary for indoor WiFi localization. Here, WiFi data collected from different time periods are considered as different domains, and "label" refers to the location information at which the WiFi data are received.

Figure 4.1: An indoor wireless environment example.

Table 4.1: An example of signal vectors (unit: dBm) received from the five APs; blank cells denote missing values. (All values are rounded for illustration.)
xtA: −40, −50
xtB: −60
xtC: −40, −70
xtD: −80, −40, −70
xtE: −40, −70, −40, −80
xtF: −80, −80, −50

4.2 Experimental Setup

We use a public data set from the 2007 IEEE ICDM Contest (the 2nd task). The task is to predict the labels of the WiFi

data collected in time period T2. The contest data contain a few labeled WiFi data collected in time period T1 (the source domain) and a large amount of unlabeled WiFi data collected in time period T2 (the target domain). For more details on the data set, readers may refer to the contest report article [212].

Figure 4.2: Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods: (a) WiFi RSS received in T1 from two APs (unit: dBm); (b) WiFi RSS received in T2 from two APs (unit: dBm). Different colors denote different signal strength values. Note that the original signal strength values are non-positive (the larger the stronger); we shift them to positive values for visualization.

Denote the data collected in time period T1 and time period T2 by DS and DT, respectively. In the experiments, we have |DS| = 621 and |DT| = 3,128. All the source domain data (621 instances in total) are used for training. As for the target domain data, we randomly split DT into DT^u (whose label information is removed in training) and DT^o. In the transductive evaluation setting, our goal is to learn a model from DS and DT^u, and then evaluate the model on DT^u. In the out-of-sample evaluation setting, 2,328 patterns are sampled to form DT^o, and a variable number of patterns are sampled from the remaining 800 patterns to form DT^u; the goal is to learn a model from DS and DT^u, and then evaluate the model on DT^o (out-of-sample patterns). For parameter tuning of TCA, SSTCA and all the other baseline methods, 50 labeled data are sampled from the source domain as a validation set. For each experiment, we repeat 10 times and then report the average performance using the Average Error Distance (AED):

AED = ( Σ_{(xi,yi)∈D} ∥f(xi) − yi∥ ) / N,

where xi is a vector of RSS values, f(xi) is the predicted location, yi is the corresponding ground truth location, N = |D|, and D = DT^u in the transductive setting, while D = DT^o in the out-of-sample evaluation setting.
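The AED metric above can be computed directly; a minimal sketch for 2D locations:

```python
import numpy as np

def average_error_distance(pred, truth):
    """Average Error Distance (AED): mean Euclidean distance between
    predicted and ground-truth 2D locations."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.linalg.norm(pred - truth, axis=1).mean())

# Toy usage with two predicted locations (made-up coordinates).
aed = average_error_distance([[0.0, 0.0], [3.0, 4.0]],
                             [[0.0, 0.0], [0.0, 0.0]])
print(aed)  # 2.5
```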
The following methods will be compared.

1. Traditional regression models that do not perform domain adaptation. These include the (supervised) regularized least squares regression (RLSR), which is a standard regression model that we train on DS only, and the (semi-supervised) Laplacian RLSR (LapRLSR) [14]. Note that the Laplacian RLSR has been applied to WiFi localization and is one of the state-of-the-art localization models [134].

2. Sample selection bias (or covariate shift) methods: KMM and KLIEP² as introduced in Chapter 2. They use both DS and DT^u to learn weights of the patterns in DS, and then train a RLSR model on the weighted data. Following [82], we set the ϵ parameter in KMM as B/√n1, where n1 is the number of training data in the source domain. For KLIEP, we use the likelihood cross-validation method in [180] to automatically select the kernel width. Preliminary results suggest that the final performance of KLIEP can be sensitive to the initialization of the kernel width, so its initial value is also tuned on the validation set.

3. A traditional dimensionality reduction method: Kernel PCA (KPCA) as introduced in Chapter 3. It first learns a projection from both DS and DT^u via KPCA, and maps the data in DS to the latent space. A RLSR model is then trained on the projected source domain data.

4. A state-of-the-art domain adaptation method: SCL³ as introduced in Chapter 2. It learns a set of new cross-domain features from both DS and DT^u, and then augments the features of the source domain data in DS with the new features. A RLSR model is then trained.

5. Methods that only perform distribution matching in a latent space: SSA⁵ as introduced in Chapter 2, and TCAReduced, which replaces the constraint W⊤KHKW = I in TCA by W⊤W = I.

6. The proposed TCA and SSTCA. First, we apply TCA / SSTCA on both DS and DT^u to learn the transfer components. Then, RLSR is applied on the projected DS to learn a localization model. There are two tunable parameters in TCA, the kernel width⁴ σ and the parameter µ. We first set µ = 1 and search for the best σ value (based on the validation set) in the range [10⁻⁵, 10³]. Afterwards, we fix σ and search for the best µ value in [10⁻³, 10⁵]. For SSTCA, we use the linear kernel for kyy in (3.19) on the labels, and there are four tunable parameters (σ, λ, µ, and γ). We set σ and µ in the same manner as for TCA. Then, we set γ = 0.5 and search for the best λ value in [10⁻⁶, 10⁶]. Finally, we fix λ and search for γ in [0, 1]. Note that this parameter tuning strategy may not find a globally optimal combination of parameter values; however, we find that the resulting performance is satisfactory in practice.

2 The code of KLIEP is downloaded from http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/KLIEP/index.html
3 Following [25], the pivot features are selected by mutual information, while the number of pivots and other SCL parameters are determined by the validation data.
4 We use the Laplace kernel k(xi, xj) = exp(−∥xi − xj∥/σ), which has been shown to be a suitable kernel for the WiFi data [139].
5 The code of SSA is provided by Paul von Bünau, the first author of [191].
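The coordinate-wise parameter search described in item 6 can be sketched as follows. Here `train_and_score` is a hypothetical stand-in for training TCA + RLSR and returning the validation AED (lower is better), and the grids mirror the ranges above:

```python
import numpy as np

def tune(train_and_score, sigmas, mus):
    """Fix mu = 1, pick the best kernel width sigma on the validation set,
    then fix sigma and pick mu (the coordinate-wise strategy in the text)."""
    best_sigma = min(sigmas, key=lambda s: train_and_score(sigma=s, mu=1.0))
    best_mu = min(mus, key=lambda m: train_and_score(sigma=best_sigma, mu=m))
    return best_sigma, best_mu

# Toy validation objective with its optimum at sigma = 10, mu = 0.1.
score = lambda sigma, mu: (np.log10(sigma) - 1) ** 2 + (np.log10(mu) + 1) ** 2
s, m = tune(score,
            sigmas=10.0 ** np.arange(-5, 4),   # [1e-5, 1e3]
            mus=10.0 ** np.arange(-3, 6))      # [1e-3, 1e5]
print(s, m)  # 10.0 0.1
```

As the text notes, this greedy, one-parameter-at-a-time search need not find the globally optimal combination, but it keeps the number of trainings linear rather than quadratic in the grid sizes.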

7. A closely related dimensionality reduction method: MMDE. This is a state-of-the-art method on the ICDM-07 contest data set [135, 139].

Since MMDE (the last method) is transductive, the first six methods will be compared in the out-of-sample setting in Chapters 4.3.1-4.3.3, and we will compare the performance of TCA, SSTCA and MMDE in the transductive setting in Chapter 4.3.4.

4.3 Results

4.3.1 Comparison with Dimensionality Reduction Methods

We first compare TCA and SSTCA with the dimensionality reduction methods, including KPCA, SSA and TCAReduced, in the out-of-sample setting. The number of unlabeled patterns in DT^u is fixed at 400, while the dimensionality of the latent space varies from 5 to 50. Figure 4.3 shows the results. As can be seen, TCA and SSTCA outperform all the other methods. First, TCA performs better than KPCA. This is because the WiFi data are highly noisy, and thus localization models learned in the de-noised latent space can be more accurate than those learned in the original input space; KPCA, however, can only de-noise the data but cannot ensure that the distance between the data distributions of the two domains is reduced. Moreover, though TCAReduced and SSA aim to reduce the distance between domains, they may lose important information of the original data in the latent space, which in turn may hurt the performance of the target learning tasks — in particular, TCAReduced finds a transformation W that minimizes the distance between the different distributions without maximizing the variance in the latent space. Thus, they do not obtain good performance. Finally, we observe that SSTCA obtains better performance than TCA. As demonstrated in previous research [134], the manifold assumption holds on the WiFi data. Thus, the graph Laplacian term in SSTCA can effectively propagate label information from the labeled data to the unlabeled data across domains, which, though simple, can lead to significantly improved performance.

4.3.2 Comparison with Non-Adaptive Methods

In this experiment, we compare TCA and SSTCA with learning-based localization models that do not perform domain adaptation, including RLSR, LapRLSR and KPCA. The dimensionalities of the latent spaces for KPCA, TCA and SSTCA are fixed at 15; these values are determined based on the first experiment in Chapter 4.3.1.

Figure 4.3: Comparison with dimensionality reduction methods.

For training, all the source domain data are used, and varying amounts of the target domain data are sampled to form DT^u. Figure 4.4 shows the performance when the number of unlabeled patterns in DT^u varies. As can be seen, even with only a few unlabeled data in the target domain, TCA and SSTCA can perform well for domain adaptation.

Figure 4.4: Comparison with localization methods that do not perform domain adaptation.

4.3.3 Comparison with Domain Adaptation Methods

In this subchapter, we compare TCA and SSTCA with some state-of-the-art domain adaptation methods, including KMM, KLIEP, SCL and SSA. We fix the dimensionalities of the latent space in TCA and SSTCA at 15, while we fix the dimensionalities of the latent space in SSA and TCAReduced at 50. Results are shown in Figure 4.5. As can be seen, the domain adaptation methods that are based on feature extraction (including SCL, TCA and SSTCA) perform much better than the instance re-weighting methods (including KMM and KLIEP). This is again because the WiFi data are

highly noisy, and so matching distributions directly based on the noisy observations may not be useful. Indeed, SCL may also suffer from a bad choice of pivot features due to the noisy observations. On the other hand, TCA and SSTCA match the distributions in the latent space, where the WiFi data have been implicitly de-noised.

Figure 4.5: Comparison of TCA, SSTCA and the various baseline methods in the inductive setting on the WiFi data.

4.3.4 Comparison with MMDE

In this subchapter, we compare TCA and SSTCA with MMDE in the transductive setting. The latent space is learned from DS and a subset of the unlabeled target domain data sampled from DT^u; the performance is then measured on the same unlabeled data subset. Figure 4.6(a) shows the results for different dimensionalities of the latent space with |DT^u| = 400, while Figure 4.6(b) shows the results for different amounts of unlabeled target domain data, with the dimensionalities of MMDE, TCA and SSTCA fixed at 15.

Figure 4.6: Comparison with MMDE in the transductive setting on the WiFi data: (a) varying the dimensionality of the latent space; (b) varying the number of unlabeled data.

As can be seen, MMDE

we may observe that in the transductive experimental setting. the two additional parameters γ and λ. we applied MMDE. 4. Experimental results veriﬁed the effectiveness of our proposed methods in WiFi localization compared several cross-domain methods. However. TCA and SSTCA to the WiFi localization problem. and for SSTCA. tradeoff parameter µ.7. This is conﬁrmed in the training time comparison in Figure 4. MMDE may be not applicable in practice because of its expensive computational cost. based on the results in WiFi localization. Furthermore. for real-world applications. As a result. This may be due to the limitation that the kernel matrix used in TCA / SSTCA is parametric. All the source domain data are used.5 Sensitivity to Model Parameters In this subchapter. MMDE is computationally expensive because it involves a SDP. we investigate the effects of the parameters on the regression performance. As can be seen from Figure 4. The reason is that in TCA and SSTCA.4 Summary In this section. Thus.4. both TCA and SSTCA are insensitive to the settings of the various parameters.7: Training time with varying amount of unlabeled data for training. and we sample 2. The dimensionalities of the latent spaces in TCA and SSTCA are ﬁxed at 15.328 samples from the target domain data to form o u DT . TCA and SSTCA are more desirable. In practice. SSTCA performs better than 64 . as mentioned in Chapter 3. while in MMDE. MMDE SSTCA TCA 10 Running Time (unit: sec) 5 10 4 10 3 10 2 10 1 10 0 0 100 200 300 400 500 600 700 800 # Unlabeled Data in the Target Domain for Training 900 Figure 4. which is useful for learning tasks. and another 400 samples to form DT . The out-of-sample evaluation setting is used.8. the kernel matrix in MMDE can more ﬁt the data. However. the kernel matrix is learned from the data automatically. 4. TCA or SSTCA may be a better choice than MMDE for cross-domain adaptation.3. From the results.outperforms TCA and SSTCA. 
These include the kernel width σ in the Laplacian kernel. MMDE performs best. we need tune the kernel parameters using cross-validation.

(d) varying λ in SSTCA.25 2.5 0.35 2. when the manifold assumption does hold on the source and target domain data.3 0.9 2.8 2.8 0.4 2.7 2.45 2.2 −4 TCA SSTCA −3 −2 −1 0 log10µ 1 2 3 4 (a) varying σ of the Laplacian kernel. The reason may be that the manifold assumption hold strongly on the WiFi data.2 −0.5 2. we suggest to use SSTCA for learning the latent space for transfer learning.55 50 Average Error Distrance (unit: m) 45 40 35 30 25 20 15 10 5 0 −5 −4 −3 −2 −1 0 1 log10σ 2 3 4 5 TCA SSTCA Average Error Distance (unit: m) 2.6 2.2 0. Figure 4.6 2.65 Average Error Distance (unit: m) 2. Therefore.8: Sensitivity analysis of the TCA / SSTCA parameters on the WiFi data.55 2. SSTCA −7 −6 −5 −4 −3 −2 −1 0 log10λ 1 2 3 4 5 6 7 (c) varying γ in SSTCA.6 0.4 0.4 2.7 SSTCA 2. 2. As a result the manifold regularization term in SSTCA can be able to propagate label information from the source domain to the target domain effectively. 65 .9 1 1.1 Average Error Distance (unit: m) 40 35 30 25 20 15 10 5 0 45 (b) varying µ.1 0 0.1 0.3 2.3 2.5 2.7 Values of γ1 (γ2=1−γ1) 0. TCA.
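The distribution distance that TCA and SSTCA reduce in the latent space is the Maximum Mean Discrepancy (MMD). As a concrete reference, the following is a minimal plain-Python sketch of the (biased) empirical MMD² estimate between two samples; the RBF kernel and its width σ = 1 are illustrative assumptions, not the exact settings used in the experiments above.

```python
import math

def rbf(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / sigma)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma)

def mmd2(Xs, Xt, k):
    """Biased empirical estimate of the squared Maximum Mean Discrepancy:
    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]."""
    kss = sum(k(a, b) for a in Xs for b in Xs) / len(Xs) ** 2
    ktt = sum(k(a, b) for a in Xt for b in Xt) / len(Xt) ** 2
    kst = sum(k(a, b) for a in Xs for b in Xt) / (len(Xs) * len(Xt))
    return kss + ktt - 2.0 * kst

# Identical samples give zero MMD; a shifted sample gives a larger value.
Xs      = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
Xt_near = [(0.0, 0.1), (0.1, 0.1), (0.2, 0.2)]
Xt_far  = [(3.0, 3.0), (3.1, 3.2), (3.2, 3.1)]
```

TCA can then be read as searching for a feature mapping under which this quantity, evaluated on the mapped source and target samples, becomes small.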

CHAPTER 5

APPLICATIONS TO TEXT CLASSIFICATION

In the previous section, we have shown the effectiveness of the proposed dimensionality reduction framework for the cross-domain WiFi localization problem. In this section, we apply the framework to another application, text classification, and test its effectiveness on a real-world cross-domain text classification dataset.

5.1 Text Classification and Cross-domain Text Classification

Text classification, also known as document classification, aims to assign a document to one or more categories based on its content. It is a fundamental task for Web mining and has been widely studied in the fields of machine learning, data mining and information retrieval [89, 47, 49, 207, 215]. Traditional supervised learning approaches to text classification require sufficient labeled instances in a domain of interest in order to train a high-quality classifier; test data are assumed to come from the same domain, following the same data distribution as the training domain data. However, in many real-world applications, labeled data are in short supply in the domain of interest, and most of these methods are unable to solve cross-domain text classification problems. As a result, cross-domain learning algorithms are desirable in text classification. In recent years, transfer learning techniques have been proposed to address cross-domain text classification problems [48, 204], which have been reviewed in Chapter 2. For example, the authors in [48] proposed a modified naive Bayes algorithm for cross-domain text classification. However, most of these methods are designed for the text classification area only; the method proposed in [48], for instance, is based on naive Bayes, and thus it is hard to apply to other applications. In contrast, our proposed dimensionality reduction framework can be used for general cross-domain regression or classification problems. Thus, it can be applied to various diverse applications, such as WiFi localization.

5.2 Experimental Setup

In this section, we perform cross-domain text classification experiments on a preprocessed data set of the 20-Newsgroups¹ [96]. This is a text collection of approximately 20,000 newsgroup documents hierarchically partitioned into 20 newsgroups. In this experiment, we follow the preprocessing strategy in [47, 48] to create six data sets from this collection. For each data set, two top categories are chosen, one as positive and the other as negative, and the binary classification task is defined as top category classification. We then split the data based on sub-categories: different sub-categories are considered as different domains, and documents from different sub-categories but under the same parent category are considered to be related domains. This splitting strategy ensures that the domains of labeled and unlabeled data are related, since they are under the same top categories; besides, the domains are also ensured to be different, since they are drawn from different sub-categories. The six data sets created are "comp vs. sci", "rec vs. talk", "rec vs. sci", "sci vs. talk", "comp vs. rec" and "comp vs. talk" (Table 5.1). All the alphabets are converted to lower case and stemmed using the Porter stemmer. Stop words are also removed, and the document frequency feature is used to cut down the number of features.

[Table 5.1: Summary of the six data sets constructed from the 20-Newsgroups data (Comp vs. Sci, Rec vs. Talk, Rec vs. Sci, Sci vs. Talk, Comp vs. Rec, Comp vs. Talk). Each data set contains 6,000 documents, with 1,500 positive and 1,500 negative documents in each of the source and target domains, together with the number of features per data set.]

From each of these six data sets, we randomly sample 40% of the documents from the source domain as D_S, sample 40% of the documents from the target domain to form the unlabeled subset D_T^u, and use the remaining 60% in the target domain to form the out-of-sample subset D_T^o. Hence, in each cross-domain classification task, |D_S| = 1200, |D_T^u| = 1200 and |D_T^o| = 1800. All experiments are performed in the out-of-sample setting. We run 10 repetitions and report the average results. The evaluation criterion is the classification accuracy.

Similar to Chapter 4, we perform a series of experiments to compare TCA and SSTCA with the following methods:

1. Linear support vector machine (SVM) in the original input space.
2. KPCA. A linear SVM is then trained in the latent space.
3. Three domain adaptation methods: KMM, KLIEP and SCL. Again, a linear SVM is used as the classifier in the latent space.

Note that we do not show the performance of MMDE in this experiment, because it results in "out of memory" on learning the kernel matrix using SDP solvers.

¹ http://people.csail.mit.edu/jrennie/20Newsgroups/
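The preprocessing steps described above (lowercasing, stop-word removal, and a document-frequency cut-off) can be sketched in a few lines of plain Python. This is a simplified illustration only: the tiny stop-word list is an assumption, and the Porter stemming step used in the actual experiments is omitted for brevity.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "and", "of"}  # toy list; the real one is larger

def preprocess(corpus, min_df=2):
    """Lowercase, drop stop words, and keep only terms whose document
    frequency is at least min_df (a crude feature cut-off)."""
    tokenized = [[w.lower() for w in doc.split() if w.lower() not in STOPWORDS]
                 for doc in corpus]
    df = Counter(w for doc in tokenized for w in set(doc))  # document frequency
    vocab = sorted(w for w, c in df.items() if c >= min_df)
    keep = set(vocab)
    return [[w for w in doc if w in keep] for doc in tokenized], vocab

corpus = ["The picture is sharp and good",
          "A good game",
          "The game is boring"]
docs, vocab = preprocess(corpus)  # only "game" and "good" survive the cut-off
```

In the experiments, the same idea is applied at a much larger scale, which is how the vocabularies of the six data sets are reduced to tens of thousands of features.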

5.3 Results

5.3.1 Comparison to Other Methods

Results are shown in Tables 5.2 and 5.3. We experiment with the Laplacian, RBF and linear kernels² for feature extraction or re-weighting in KPCA, KMM, TCA and SSTCA. For SSTCA, we use linear kernels for both the inputs and outputs; here, the kernel k_yy in (3.19) is the linear kernel. The remaining free parameters are µ for TCA, and µ, γ and λ for SSTCA. The µ parameters in TCA and SSTCA are set to 1, and the λ parameter in SSTCA is set to 0.0001. Note that we do not compare with SSA because it results in "out of memory" on computing the covariance matrices.

Overall, we can obtain a similar conclusion as in Chapter 4: feature extraction methods outperform instance-reweighting methods. Moreover, on some tasks, such as "R vs. S", linear TCA performs much better than PCA, while on others the performance of PCA is comparable to that of linear TCA. This agrees with our motivation and the previous conclusion on the WiFi experiments, namely that mapping data from different domains to a latent space spanned by the principal components may not work well, as PCA cannot guarantee a reduction in the distance between the two domain distributions.

However, one may notice two main differences between the results on the WiFi data and those on the text data. First, the linear kernel performs better than the RBF and Laplacian kernels here. This agrees with the well-known observation that the linear kernel is often adequate for high-dimensional text data. Second, TCA performs better than SSTCA on the text data. This may be because the manifold assumption is weaker in the text domain than in the WiFi domain.

[Table 5.2: Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation), for latent dimensionalities 5, 10, 20 and 30.]

[Table 5.3: Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation), for latent dimensionalities 5, 10, 20 and 30.]

5.3.2 Sensitivity to Model Parameters

Figure 5.1 shows how the various parameters in TCA and SSTCA affect the classification performance; the dimensionalities of the latent spaces are fixed at 10. Figure 5.1(a) shows the sensitivity w.r.t. µ for TCA, and Figure 5.1(b) shows that for SSTCA (with γ = 0.5 and λ = 10⁻⁴). As can be seen, both TCA and SSTCA are insensitive to the setting of µ. Next, Figure 5.1(c) shows the sensitivity of SSTCA w.r.t. γ (with µ = 1 and λ = 10⁻⁴). As can be seen, there is a wide range of γ for which the performance of SSTCA is quite stable. Finally, Figure 5.1(d) shows the sensitivity w.r.t. λ (with γ = 0.5 and µ = 1). Different from the WiFi results in Figure 4.8(d), where SSTCA performs well when λ ≤ 10², here it performs well only when λ is very small (λ ≤ 10⁻⁴). This indicates that manifold regularization may not be useful on the text data.

[Figure 5.1: Sensitivity analysis of the TCA / SSTCA parameters on the text data: (a) varying µ in TCA; (b) varying µ in SSTCA; (c) varying γ in SSTCA; (d) varying λ in SSTCA.]

5.4 Summary

In this section, we applied TCA and SSTCA to the text classification problem. Experimental results verified that our proposed methods based on the dimensionality reduction framework can achieve promising results in cross-domain text classification. However, different from the conclusion in WiFi localization, TCA outperforms SSTCA slightly in the text classification problem. The reason may be that the manifold assumption does not hold on the text data. In this case, the manifold regularization term in SSTCA may not be able to propagate label information from the source domain to the target domain, and SSTCA is not supposed to outperform TCA. In a worse case, SSTCA may even perform worse than TCA if the manifold regularization hurts the learning performance. Therefore, when the manifold assumption does not hold on the source and target domain data, using unsupervised TCA for learning the latent space for transfer learning may be a better choice than using SSTCA.

² The RBF and Laplacian kernels are defined as k(x_i, x_j) = exp(−∥x_i − x_j∥²/σ) and k(x_i, x_j) = exp(−∥x_i − x_j∥/σ), respectively.
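For reference, the three kernels compared in this chapter can be written out directly. Below is a minimal plain-Python sketch following the usual definitions; the bandwidth σ = 1 is an arbitrary illustration value, and the exact settings used in the experiments may differ.

```python
import math

def linear_k(x, y):
    """Linear kernel: a plain dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_k(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / sigma)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma)

def laplacian_k(x, y, sigma=1.0):
    """Laplacian kernel k(x, y) = exp(-||x - y|| / sigma)."""
    return math.exp(-math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))) / sigma)

# On sparse bag-of-words count vectors, the linear kernel simply counts
# shared terms, which is one intuition for why it is often adequate for
# high-dimensional text data.
x = [1, 0, 1, 0]
y = [1, 1, 0, 0]
```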

CHAPTER 6

DOMAIN-DRIVEN FEATURE SPACE TRANSFER FOR SENTIMENT CLASSIFICATION

In the previous chapters, we have considered transfer learning when the source and target domains have implicit overlap, which means the domain knowledge shared across these domains is hidden. Our strategy is to exploit a general dimensionality reduction framework to discover common latent factors that are useful for knowledge transfer across domains. We have proposed three methods based on the framework to learn the transfer latent space and applied them successfully to two diverse real-world applications: indoor WiFi localization and text classification. However, in some domain areas, we have some explicit domain knowledge that can help to design a more effective latent space for specific applications. In this chapter, we highlight one such application, sentiment classification, to show that domain knowledge can be encoded in feature space learning for transfer learning when it is available.

6.1 Sentiment Classification

With the explosion of Web 2.0 services, more and more user-generated resources have become available on the World Wide Web. They exist in the form of online user reviews on shopping or opinion sites, in posts of personal blogs or user feedback on news, etc. This rich opinion information on the Web can help an individual make a decision to buy a product or plan a traveling route, help a company do a survey on its products or services, or even help the government get feedback on a policy. However, in many cases, opinions are hidden in long forum posts and blogs, and the sentiment data are large-scale. As a result, how to find, extract or summarize useful information from opinion sources on the Web is an important and challenging task, and sentiment analysis, also known as opinion mining, has attracted much attention in the natural language processing and information retrieval areas [110, 111]. In sentiment analysis, there are some interesting tasks, for example, opinion extraction and summarization [81, 93], opinion integration [117], review quality prediction [116, 114], review spam identification [88], and opinion analysis in multiple languages [1]. Among these tasks, sentiment classification, also known as subjective classification [145, 184, 111], which aims at classifying text segments, e.g., text sentences and review articles, into polarity categories (e.g., positive or negative), is an important task and has been widely studied, because many users

do not explicitly indicate their sentiment polarity, and thus we need to predict it from the text data generated by users.

Sentiment data are the text segments containing user opinions about objects, events and their properties in the world. User sentiment may exist in the form of a sentence, paragraph or article, which is denoted by x_j. In either case, it corresponds to a sequence of words w_1 w_2 ... w_{x_j}, where w_i is a word from a vocabulary W. Without loss of generality, we use a unified vocabulary W for all domains, with |W| = m. For each sentiment data x_j, there is a corresponding label y_j: y_j = +1 if the overall sentiment expressed in x_j is positive, and y_j = −1 if the overall sentiment expressed in x_j is negative. A pair of sentiment text and its corresponding sentiment polarity {x_j, y_j} is called labeled sentiment data; if x_j has no polarity assigned, it is unlabeled sentiment data. Besides positive and negative sentiment, there are also neutral and mixed sentiment data in practical applications. Neutral polarity means that there is no sentiment expressed by users; mixed polarity means user sentiment is positive in some aspects but negative in other ones. In this chapter, we only focus on positive and negative sentiment data, but it is not hard to extend the proposed solution to address multi-category sentiment classification problems.

Here, a domain D denotes a class of objects, events and their properties in the world; different types of products, such as books, dvds and furniture, can be regarded as different domains. We represent user sentiment data with a bag-of-words method, with c(w_i, x_j) denoting the frequency of word w_i in x_j. In the literature, either single words or N-grams can be used as features to represent sentiment data; thus, in the rest of this chapter, we will use word and feature interchangeably.

6.2 Existing Works in Cross-Domain Sentiment Classification

In recent years, various machine learning techniques have been proposed for sentiment classification [145, 128]. Among them, supervised learning algorithms [146] have been proven promising and widely used in sentiment classification, compared to unsupervised learning methods [187]. To date, there exists a lot of research work proposed to improve the classification performance in the supervised setting [144, 111]. However, these methods rely heavily on manually labeled training data to train an accurate sentiment classifier. In order to reduce the human-annotating cost, Goldberg and Zhu adapted a graph-based semi-supervised learning method to make use of unlabeled data for sentiment classification. Furthermore, Sindhwani et al. [171] and Li et al. [105] proposed to incorporate lexical knowledge into graph-based semi-supervised learning and non-negative matrix tri-factorization approaches, respectively, to perform sentiment classification with a few labeled data.

However, these semi-supervised learning methods still require a few labeled data in the target domain to train an accurate sentiment classifier. In practice, we may have hundreds or thousands of domains at hand; it would be nice to annotate data in only a few domains to train sentiment classifiers, which could then be used to make accurate predictions in all other domains. Furthermore, similar to supervised learning methods, these approaches are domain dependent because of changes in vocabulary: users may use domain-specific words to express their sentiment in different domains. Table 6.1 shows several user review sentences from two domains, electronics and video games. In the electronics domain, we may use words like "compact" and "sharp" to express positive sentiment and use "blurry" to express negative sentiment, while in the video game domain, words like "hooked" and "realistic" indicate positive opinion and the word "boring" indicates negative opinion. Due to the mismatch between domain-specific words, a sentiment classifier trained in one domain may not work well when directly applied to other domains. Thus, cross-domain sentiment classification algorithms are highly desirable to reduce domain dependency and manual labeling cost.

Table 6.1: Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. Italic words are some domain-independent words, which occur frequently in both domains. "+" denotes positive sentiment, and "-" denotes negative sentiment.

      electronics                                  video games
  +   Compact; easy to operate; very good          A very good game! It is action packed
      picture quality; looks sharp!                and full of excitement. I am very much
                                                   hooked on this game.
  +   I purchased this unit from Circuit City      Very realistic shooting action and good
      and I was very excited about the quality     plots. We played this and were hooked.
      of the picture. It is really nice and
      sharp.
  -   It is also quite blurry in very dark         The game is so boring. I am extremely
      settings. I will never buy HP again.         unhappy and will probably never buy
                                                   UbiSoft again.

As a result, a sentiment classifier trained in one domain cannot be applied to another domain directly. To address this problem, Blitzer et al. [25] proposed the structural correspondence learning (SCL) algorithm, which has been introduced in Chapter 2.3, to exploit domain adaptation techniques for sentiment classification; it is a state-of-the-art method in cross-domain sentiment classification. SCL is motivated by a multi-task learning algorithm, alternating structural optimization (ASO), proposed by Ando and Zhang [4]. SCL tries to construct a set of related tasks to model the relationship between "pivot features" and "non-pivot features". Then "non-pivot features" with similar weights among tasks tend to be close to each other in a low-dimensional latent space. However, in practice, it is hard to construct a reasonable number of related tasks from data, which may limit the transfer ability of SCL for cross-domain sentiment classification. More recently, Li et al. [104] proposed to transfer common lexical knowledge across domains via matrix factorization techniques.

In this chapter, we aim to find an effective approach to the cross-domain sentiment classification problem. In particular, we propose a spectral feature alignment (SFA) algorithm to find a new representation for cross-domain sentiment data, such that the gap between domains can be reduced. SFA uses some domain-independent words as a bridge to construct a bipartite graph that models the co-occurrence relationship between domain-specific words and domain-independent words. The idea is that if two domain-specific words have connections to more common domain-independent words in the graph, they tend to be aligned together with higher probability. Similarly, if two domain-independent words have connections to more common domain-specific words in the graph, they tend to be aligned together with higher probability. We adapt a spectral clustering algorithm, which is based on graph spectral theory [41], to the bipartite graph to co-align domain-specific and domain-independent words into a set of feature clusters. In this way, the clusters can be used to reduce the mismatch between the domain-specific words of the two domains. Finally, we represent all data examples with these clusters and train sentiment classifiers based on the new representation.

**6.3 Problem Statement and A Motivating Example**

Problem Definition. Assume we are given two specific domains D_S and D_T, where D_S and D_T are referred to as a source domain and a target domain, respectively. Suppose we have a set of labeled sentiment data D_S = {(x_Si, y_Si)}_{i=1}^{n_S} in D_S, and some unlabeled sentiment data D_T = {x_Tj}_{j=1}^{n_T} in D_T. The task of cross-domain sentiment classification is to learn an accurate classifier to predict the polarity of unseen sentiment data from D_T.

In this section, we use an example to introduce the motivation of our solution to the cross-domain sentiment classification problem. First of all, we assume the sentiment classifier f is a linear function, which can be written as y* = f(x) = sgn(x w^T), where x ∈ R^{1×m}, and sgn(x w^T) = +1 if x w^T ≥ 0; otherwise, sgn(x w^T) = −1. Here, w is the weight vector of the classifier, which can be learned from a set of training data (pairs of sentiment data and their corresponding polarity labels).

Consider the example shown in Table 6.1 to illustrate our idea. We use a standard bag-of-words method to represent sentiment data of the electronics (E) and video games (V) domains. From Table 6.2, we can see that the difference between domains is caused by the frequency of the domain-specific words. Domain-specific words in the E domain, such as compact, sharp and blurry, do not occur in the V domain. On the other hand, domain-specific words in the V domain, such as hooked, realistic and boring, do not occur in the E domain. Suppose the E domain is the source domain and the V domain is the target domain; our goal is to train a vector of weights w* with labeled data from the E domain, and use it to predict sentiment polarity for the V domain data.¹ Based on the three training sentences in the E domain, the weights of features such as compact and sharp should be positive, the weight of features such as blurry should be negative, and the weights of features such as hooked, realistic and boring can be arbitrary, or zeros if an L1 regularizer is applied on w for model training. However, an ideal weight vector in the V domain should have positive weights for features such as hooked and realistic and a negative weight for the feature boring, while the weights of features such as compact, sharp and blurry may take arbitrary values. That is why the classifier learned from the E domain may not work well in the V domain.

Table 6.2: Bag-of-words representations of electronics (E) and video games (V) reviews. Only domain-specific features are considered. "..." denotes all other words.

           ...  compact  sharp  blurry  hooked  realistic  boring
  E   +    ...     1       1      0       0        0         0
      +    ...     0       1      0       0        0         0
      -    ...     0       0      1       0        0         0
  V   +    ...     0       0      0       1        0         0
      +    ...     0       0      0       1        1         0
      -    ...     0       0      0       0        0         1
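The failure mode described above can be checked mechanically. The sketch below applies the linear classifier y* = sgn(x w^T) to the Table 6.2 encoding; the weight values in w_E are illustrative assumptions of what training on the three E-domain reviews alone might produce (the V-specific words keep zero weight because they never appear in the source data).

```python
# Feature order follows Table 6.2:
# [compact, sharp, blurry, hooked, realistic, boring]
def sgn(v):
    return 1 if v >= 0 else -1

def predict(x, w):
    return sgn(sum(a * b for a, b in zip(x, w)))

# Hypothetical weights after training on the electronics (E) domain only.
w_E = [1.0, 1.0, -1.0, 0.0, 0.0, 0.0]

# Video-game (V) reviews from Table 6.2 with their true labels.
v_reviews = [([0, 0, 0, 1, 0, 0], +1),   # "hooked"            -> positive
             ([0, 0, 0, 1, 1, 0], +1),   # "hooked, realistic" -> positive
             ([0, 0, 0, 0, 0, 1], -1)]   # "boring"            -> negative
# Every V review scores 0, so sgn() degenerates to a constant +1 guess:
# the negative "boring" review is misclassified.
```

The classifier is perfect on the E domain yet carries no usable signal for V, which is precisely the domain-dependence problem the new representation is meant to remove.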

In order to reduce the mismatch between features of the source and target domains, a straightforward solution is to make them more similar by adopting a new representation. Table 6.3 shows an ideal representation of domain-specific features. Here, sharp hooked denotes a cluster consisting of sharp and hooked, compact realistic denotes a cluster consisting of compact and realistic, and blurry boring denotes a cluster consisting of blurry and boring. We can use these clusters as high-level features to represent domain-specific words. Based on the new representation, the weight vector w* trained in the E domain should also be an ideal weight vector in the V domain. In this way, the classifier learned from one domain can be easily adapted to another one.

Table 6.3: Ideal representations of domain-specific words.

           ...  sharp hooked  compact realistic  blurry boring
  E   +    ...       1               1                 0
      +    ...       1               0                 0
      -    ...       0               0                 1
  V   +    ...       1               0                 0
      +    ...       1               1                 0
      -    ...       0               0                 1

¹ For simplicity, we only discuss domain-specific words here and ignore all other words.

The problem is how to construct such an ideal representation as shown in Table 6.3. Clearly, if we directly apply traditional clustering algorithms such as k-means [76] on Table 6.2, we are not able to align sharp and hooked into one cluster, since the distance between them is large. In order to reduce the gap and align domain-specific words from different domains, we can utilize domain-independent words as a bridge. As shown in Table 6.1, words such as sharp, hooked, compact and realistic often co-occur with words such as good and exciting, while words such as blurry and boring often co-occur with the words never buy. Since the words good, exciting and never buy occur frequently in both the E and V domains, they can be treated as domain-independent features. Table 6.4 shows co-occurrences between domain-independent and domain-specific words. It is easy to find that, by applying clustering algorithms such as k-means on Table 6.4, we can get the feature clusters shown in Table 6.3: sharp hooked, blurry boring and compact realistic.

Table 6.4: A co-occurrence matrix of domain-specific and domain-independent words.

              good   exciting   never buy
  compact      1        0          0
  realistic    1        0          0
  sharp        1        1          0
  hooked       1        1          0
  blurry       0        0          1
  boring       0        0          1

The example described above shows that the co-occurrence relationship between domain-specific and domain-independent features is useful for feature alignment across different domains. We propose to use a bipartite graph to represent this relationship and then adapt spectral clustering techniques to find a new representation for domain-specific features. In the following section, we present the spectral domain-specific feature alignment algorithm in detail.
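The intuition can be checked directly on co-occurrence counts like those of Table 6.4. The sketch below is a deliberately simplified stand-in for the spectral step presented in the next section: it greedily groups domain-specific words whose co-occurrence profiles with the domain-independent words are nearly identical, using cosine similarity. The counts and the 0.9 threshold are illustrative assumptions; the actual algorithm performs spectral clustering on the bipartite graph.

```python
import math

# Rows: domain-specific words; columns: counts of co-occurrence with the
# domain-independent words [good, exciting, never_buy] (cf. Table 6.4).
cooc = {
    "compact":   [1, 0, 0], "realistic": [1, 0, 0],
    "sharp":     [1, 1, 0], "hooked":    [1, 1, 0],
    "blurry":    [0, 0, 1], "boring":    [0, 0, 1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def align(cooc, threshold=0.9):
    """Greedily cluster words whose co-occurrence profiles nearly match."""
    clusters = []
    for word, profile in cooc.items():
        for cluster in clusters:
            if cosine(profile, cooc[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Words from different domains that share co-occurrence patterns end up in
# the same cluster, reproducing the grouping of Table 6.3.
clusters = align(cooc)
```

Note how sharp (an E-domain word) and hooked (a V-domain word) are aligned even though they never co-occur in any single document: the domain-independent words act as the bridge.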

**6.4 Spectral Domain-Specific Feature Alignment**

In this section, we describe our algorithm for adapting spectral clustering techniques to align domain-specific features from different domains for cross-domain sentiment classification. As mentioned above, our proposed method consists of two steps: (1) identifying domain-independent features, and (2) aligning domain-specific features. In the first step, we aim to learn a feature selection function ϕDI(·) to select l domain-independent features, which occur frequently and act similarly across the domains D_S and D_T. These domain-independent features are used as a bridge to make knowledge transfer across domains possible. After identifying the domain-independent features, we can use ϕDS(·) to denote a feature selection function for selecting domain-specific features, which can be defined as the complement of the domain-independent ones. In the second step, we aim to learn an alignment function φ: R^(m−l) → R^k that aligns the domain-specific features from both domains into k predefined feature clusters z_1, z_2, ..., z_k, such that the difference between domain-specific features from different domains under the new representation constructed by the learned clusters can be dramatically reduced.

**6.4.1 Domain-Independent Feature Selection**

First of all, we need to identify which features are domain-independent. As mentioned above, domain-independent features should occur frequently and act similarly in both the source and target domains. For simplicity, we use W_DI and W_DS to denote the vocabularies of domain-independent and domain-specific features, respectively. The sentiment data x_i can then be divided into two disjoint views: one view consists of the features in W_DI, and the other is composed of the features in W_DS. We use ϕDI(x_i) and ϕDS(x_i) to denote the two views, respectively. In this section, we present several strategies for selecting domain-independent features.

A first strategy is to select domain-independent features based on their frequency in both domains. More specifically, we choose features that occur more than k times in both the source and target domains; given the number l of domain-independent features to be selected, k is set to be the largest number such that we can get at least l such features. However, there is no guarantee that the features selected in this way act similarly in both domains.

A second strategy is based on the mutual dependence between features and labels on the source domain data. In information theory, mutual information is used to measure the mutual dependence between two random variables. In [25], mutual information is applied on source domain labeled data to select features as "pivots", as shown in (6.1):

  I(X^i, y) = ∑_{y ∈ {+1,−1}} ∑_{x ∈ X^i, x ≠ 0} p(x, y) log2( p(x, y) / (p(x) p(y)) ),   (6.1)

where X^i and y denote a feature and the class label, respectively. Feature selection using mutual information in this way can help identify features relevant to the source domain labels, but again there is no guarantee that the selected features are domain-independent.

Here we propose a third strategy for selecting the features which are referred to as domain-independent features in this thesis. Motivated by the supervised feature selection criterion in (6.1), we modify the mutual information criterion to measure the dependence between features and domains as follows:

  I(X^i, D) = ∑_{d ∈ D} ∑_{x ∈ X^i, x ≠ 0} p(x, d) log2( p(x, d) / (p(x) p(d)) ),   (6.2)

where D is a domain variable and we only sum over the non-zero values of a specific feature X^i. The smaller I(X^i, D) is, the more likely it is that X^i can be treated as a domain-independent feature. If a feature has a high mutual information value with the domain variable, then it is domain-specific; otherwise, it is domain-independent.
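As a concrete illustration, the criterion in (6.2) can be estimated from raw occurrence counts. The following sketch treats each feature as a binary word-occurrence variable and scores one word per call; the corpora and function name are hypothetical, made up for this example. Words with low scores would be kept as domain-independent.

```python
import math

def domain_mutual_information(docs_src, docs_tgt, word):
    """Estimate I(X^i, D) of Eqn. (6.2) for one binary feature (a word),
    summing only over the non-zero value of the feature, as in the text."""
    n_s, n_t = len(docs_src), len(docs_tgt)
    n = n_s + n_t
    p_d = {"S": n_s / n, "T": n_t / n}          # p(d): domain prior
    joint = {                                   # p(x=1, d): word occurs in domain d
        "S": sum(word in doc for doc in docs_src) / n,
        "T": sum(word in doc for doc in docs_tgt) / n,
    }
    p_x = joint["S"] + joint["T"]               # marginal p(x=1)
    mi = 0.0
    for d in ("S", "T"):
        if joint[d] > 0:
            mi += joint[d] * math.log2(joint[d] / (p_x * p_d[d]))
    return mi

# Toy corpora (hypothetical): "good" occurs evenly in both domains, so it
# should look domain-independent; "sharp" occurs only in the source domain.
src = [{"good", "sharp"}, {"good", "sharp"}, {"sharp"}, {"good"}]
tgt = [{"good", "hooked"}, {"good"}, {"hooked"}, {"good", "hooked"}]

mi_good = domain_mutual_information(src, tgt, "good")    # low: domain-independent
mi_sharp = domain_mutual_information(src, tgt, "sharp")  # high: domain-specific
```

In this toy setting the ranking comes out as the text predicts: the evenly distributed word scores near zero, while the source-only word receives a clearly higher score.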

We still use the example shown in Table 6.1 to explain the three proposed strategies for domain-independent feature selection. Using the first and third strategies, we may select words such as "good", "never buy" and "very" as domain-independent features. Here "good" and "never buy" may be good domain-independent features for serving as a bridge across domains. However, the word "very" should not be used as a bridge for aligning words from different domains: in the electronics domain, we may say "this unit is very sharp", while in the video games domain, we may say "this game is very boring". If the word "very" is used as a bridge, then the words "sharp" and "boring" may be aligned into the same cluster, which is not what we expect. In contrast, using the second strategy, we may be guaranteed to select sentiment words, such as "good", "nice" and "sharp" (assuming the electronics domain is the source domain), because the word "sharp" has a high mutual information value with the class labels in the electronics domain. Note, however, that "sharp" should not be selected, because it is a domain-specific word. In a worse case, some words may even have opposing sentiment polarity in different domains; for example, the polarity of the word "thin" may be positive in the electronics domain but negative in the furniture domain, which makes the task more challenging. Hence, domain-independent feature selection is a challenging task. We leave the issue of how to develop a criterion to select domain-independent words precisely to future work; in this chapter, we focus more on addressing the problem of how to model the correlation between domain-independent and domain-specific words for transfer learning.

**6.4.2 Bipartite Feature Graph Construction**

Based on the above strategies for selecting domain-independent features, we can identify which features are domain-independent and which ones are domain-specific. Given the domain-independent and domain-specific features, we can construct a bipartite graph G = (V_DS ∪ V_DI, E), where we denote V_DS ∪ V_DI and E the vertices and edges of the graph G. In G, each vertex in V_DS corresponds to a domain-specific word in W_DS, and each vertex in V_DI corresponds to a domain-independent word in W_DI. An edge in E connects two vertices in V_DS and V_DI, respectively; note that there are no intra-set edges linking two vertices within V_DS or within V_DI. Each edge e_ij ∈ E is associated with a non-negative weight m_ij. The score m_ij measures the relationship between the words w_i ∈ W_DS and w_j ∈ W_DI in D_S and D_T (e.g., the total number of co-occurrences of w_i and w_j in D_S and D_T). In this way, the constructed bipartite graph models the intrinsic relationship between domain-specific and domain-independent features. A bipartite graph example, constructed from the example shown in Table 6.1, is shown in Figure 6.1.

Besides using the co-occurrence frequency of words within documents, we can also adopt more meaningful methods to estimate m_ij. For example, we can define a reasonable "window size", such that if a domain-specific word and a domain-independent word co-occur within the "window size", then there is an edge connecting them; the underlying assumption is that if the distance between two words exceeds the "window size", then the correlation between them is very weak. Furthermore, we can also use the distance between w_i and w_j to adjust the score m_ij: the smaller their distance, the larger the weight we can assign to the corresponding edge. In this chapter, for simplicity, we set the "window size" to be the maximum length of all documents, and we do not consider word position when determining the weights of edges.

Figure 6.1: A bipartite graph example of domain-specific and domain-independent features. [The figure links the domain-specific words compact, realistic, hooked, sharp, blurry and boring to the domain-independent words good, exciting and never_buy, with unit edge weights.]

**6.4.3 Spectral Feature Clustering**

In the previous section, we have presented how to construct a bipartite graph between domain-specific and domain-independent features. In this section, we show how to adapt a spectral clustering algorithm on the feature bipartite graph to align domain-specific features. In spectral graph theory [41], there are two main assumptions: (1) if two nodes in a graph are connected to many common nodes, then these two nodes should be very similar (or quite related); (2) there is a low-dimensional latent space underlying a complex graph, in which two nodes are similar to each other if they are similar in the original graph. Based on these assumptions, spectral graph theory has been widely applied to many problems, e.g., dimensionality reduction and clustering [130, 13, 54].

In our case, we assume that (1) if two domain-specific features are connected to many common domain-independent features, then they tend to be very related and will be aligned to the same cluster with high probability; and (2) if two domain-independent features are connected to many common domain-specific features, then they tend to be very related and will be aligned to the same cluster with high probability. We want to show that by constructing a simple bipartite graph and adapting spectral clustering techniques on it, we can align domain-specific features effectively.
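As a small illustration of the bipartite-graph construction above, the following sketch builds the weight matrix of co-occurrence counts. The vocabularies and documents are hypothetical, and `window=None` corresponds to the simple choice described above, in which a whole document counts as one window.

```python
import numpy as np

def cooccurrence_matrix(docs, ds_words, di_words, window=None):
    """Build the (m-l) x l weight matrix M of the bipartite graph:
    M[i, j] counts how often domain-specific word ds_words[i] co-occurs
    with domain-independent word di_words[j] within `window` tokens.
    window=None treats the whole document as one window."""
    M = np.zeros((len(ds_words), len(di_words)))
    ds_index = {w: i for i, w in enumerate(ds_words)}
    di_index = {w: j for j, w in enumerate(di_words)}
    for tokens in docs:
        for p, w in enumerate(tokens):
            if w not in ds_index:
                continue
            for q, v in enumerate(tokens):
                if v in di_index and (window is None or abs(p - q) <= window):
                    M[ds_index[w], di_index[v]] += 1
    return M

# Toy documents pooled from both domains (hypothetical example).
docs = [
    ["compact", "and", "good"],
    ["sharp", "good", "exciting"],
    ["hooked", "exciting", "good"],
    ["blurry", "never_buy"],
    ["boring", "never_buy"],
]
ds = ["compact", "sharp", "hooked", "blurry", "boring"]
di = ["good", "exciting", "never_buy"]
M = cooccurrence_matrix(docs, ds, di)
```

Passing a small integer as `window` would implement the distance-restricted variant; weighting each increment by the inverse token distance would implement the distance-adjusted scores mentioned above.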

Therefore, given the feature bipartite graph G, our goal is to learn a feature alignment mapping function φ(·): R^(m−l) → R^k, where m is the number of all features, l is the number of domain-independent features, and m − l is the number of domain-specific features. With the above assumptions, we expect that the mismatch problem between domain-specific features can be alleviated by applying graph spectral techniques on the feature bipartite graph, which can reduce the gap between domains, and that we can find a more compact and meaningful representation for domain-specific features.

Before we present how to adapt a spectral clustering algorithm to align domain-specific features, we first briefly introduce a standard spectral clustering algorithm [130]. Given a set of points V = {v_1, v_2, ..., v_n} and their corresponding weighted graph G, the goal is to cluster the points into k clusters, where k is an input parameter:

1. Form an affinity matrix for V: A ∈ R^(n×n), where A_ij = m_ij if i ≠ j, and A_ii = 0.
2. Form a diagonal matrix D, where D_ii = ∑_j A_ij, and construct the matrix² L = D^(−1/2) A D^(−1/2).
3. Find the k largest eigenvectors of L, u_1, u_2, ..., u_k, and form the matrix U = [u_1 u_2 ... u_k] ∈ R^(n×k).
4. Normalize U, such that U_ij = U_ij / (∑_j U_ij²)^(1/2).
5. Apply the k-means algorithm on the rows of U to cluster the n points into k clusters.

The standard spectral clustering algorithm clusters the n points into k discrete indicators, which can be referred to as "discrete clustering". Zha et al. [220] and Ding and He [55] have proven that the k principal components of a term-document co-occurrence matrix are actually the continuous solution of the cluster membership indicators of documents in the k-means clustering method. Thus, the k principal components can automatically perform data clustering in the subspace spanned by the k principal components. This implies that a mapping function constructed from the k principal components can cluster the original data and map them to the new space spanned by the clusters simultaneously. Motivated by this discovery, in the following, we show how to adapt the spectral clustering algorithm for cross-domain feature alignment.

² In spectral graph theory [41] and Laplacian Eigenmaps [13], the Laplacian matrix is defined as L̄ = I − L, where I is an identity matrix. This change only changes the eigenvalues (from λ_i to 1 − λ_i) but has no impact on the eigenvectors. Thus, selecting the k smallest eigenvectors of L̄ in [41, 13] is equivalent to selecting the k largest eigenvectors of L in this thesis.
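The five steps above can be sketched in a few lines of NumPy. This is an illustrative implementation only, not the exact code behind the experiments; the toy affinity matrix and the deterministic (farthest-point) k-means initialization are choices made for the example.

```python
import numpy as np

def spectral_clustering(A, k, n_iter=100):
    """The five steps of the standard algorithm [130]: normalise the
    affinity matrix, embed each point with the k largest eigenvectors,
    row-normalise, then run a small k-means in the embedding."""
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                   # step 2: D^{-1/2} A D^{-1/2}
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, np.argsort(vals)[::-1][:k]]           # step 3: k largest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # step 4: row-normalise
    # step 5: k-means on the rows of U, with deterministic farthest-point init
    centers = [U[0]]
    for _ in range(k - 1):
        dists = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(dists))])
    centers = np.stack(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# A toy affinity matrix with two obvious groups joined by one weak edge.
A = np.array([[0, 5, 5, 0, 0],
              [5, 0, 5, 0, 0],
              [5, 5, 0, 0, 1],
              [0, 0, 0, 0, 5],
              [0, 0, 1, 5, 0]], dtype=float)
labels = spectral_clustering(A, k=2)
```

On this toy graph, the embedding separates the two groups cleanly, so the final k-means recovers the intended partition despite the weak cross-group edge.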

Based on the above description, given the feature bipartite graph G, we adapt the spectral clustering algorithm to align domain-specific features as follows:

1. Form a weight matrix M ∈ R^((m−l)×l), where M_ij corresponds to the co-occurrence relationship between a domain-specific word w_i ∈ W_DS and a domain-independent word w_j ∈ W_DI.
2. Form an affinity matrix A = [0 M; M^T 0] ∈ R^(m×m) of the bipartite graph, where the first m − l rows and columns correspond to the m − l domain-specific features, and the last l rows and columns correspond to the l domain-independent features.
3. Form a diagonal matrix D, where D_ii = ∑_j A_ij, and construct the matrix L = D^(−1/2) A D^(−1/2).
4. Find the k largest eigenvectors of L, u_1, u_2, ..., u_k, and form the matrix U = [u_1 u_2 ... u_k] ∈ R^(m×k).
5. Define the feature alignment mapping function as φ(x) = x U_[1:m−l,:], where U_[1:m−l,:] denotes the first m − l rows of U and x ∈ R^(1×(m−l)).

Note that the affinity matrix A constructed in Step 2 is similar to the affinity matrix of the term-document bipartite graph proposed in [54], which is used for spectral co-clustering of terms and documents simultaneously. Though our goal is only to cluster domain-specific features, it has been shown that clustering two related sets of points simultaneously can often produce better results than clustering a single set of points only [54].

**6.4.4 Feature Augmentation**

If we had selected domain-independent features and aligned domain-specific features perfectly, then we could simply augment the domain-independent features with the features obtained via feature alignment to generate a perfect representation for cross-domain sentiment classification. However, in practice, we may not be able to identify domain-independent features correctly and may thus fail to perform feature alignment perfectly. Similar to the strategy used in [4, 25], we therefore augment all original features with the features learned by feature alignment to construct a new representation. More specifically, given the feature alignment mapping function φ(·), for a data example x_i in either the source or target domain, we can first apply ϕDS(·) to identify the view associated with the domain-specific features of x_i, and then apply φ(·) to find a new representation φ(ϕDS(x_i)) of this view. A tradeoff parameter γ is used in this feature augmentation to balance the effect of the original features and the new features.
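Steps 1–5 of the adapted algorithm, together with the γ-weighted feature augmentation used by SFA, can be sketched densely as follows. The toy matrix M is hypothetical, mirroring the running example; note that for repeated eigenvalues the eigenvector basis returned by a solver is not unique, so only the induced geometry of the aligned representation is meaningful.

```python
import numpy as np

def sfa_mapping(M, k, eps=1e-12):
    """Build the bipartite affinity matrix A from the (m-l) x l weight
    matrix M, normalise it, take the k largest eigenvectors of
    L = D^{-1/2} A D^{-1/2}, and return the alignment map phi restricted
    to the first m-l (domain-specific) rows of U."""
    m_l, l = M.shape
    A = np.zeros((m_l + l, m_l + l))
    A[:m_l, m_l:] = M
    A[m_l:, :m_l] = M.T
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, eps)))
    L = D_inv_sqrt @ A @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L)               # ascending eigenvalues
    U = vecs[:, np.argsort(vals)[::-1][:k]]      # k largest eigenvectors
    U_ds = U[:m_l, :]
    return lambda x_ds: x_ds @ U_ds

def augment(x, phi_x, gamma):
    """Feature augmentation: original features plus the gamma-weighted
    aligned representation of the domain-specific view."""
    return np.concatenate([x, gamma * phi_x])

# Hypothetical toy M over six domain-specific words (rows) and three
# domain-independent words (columns), mirroring the running example.
M = np.array([[1., 1., 0.],   # sharp
              [1., 1., 0.],   # hooked
              [1., 0., 0.],   # compact
              [1., 0., 0.],   # realistic
              [0., 0., 1.],   # blurry
              [0., 0., 1.]])  # boring
phi = sfa_mapping(M, k=3)
Z = phi(np.eye(6))                      # aligned representation of each word
x = np.array([1., 0., 1., 0., 0., 0.])  # a document's domain-specific view
x_aug = augment(x, phi(x), gamma=0.6)
```

In the aligned space, words with identical connectivity (sharp and hooked here) receive identical coordinates, while words linked through different domain-independent bridges stay apart, which is exactly the alignment behaviour the assumptions above call for.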

Formally, for each data example x_i, the new feature representation is defined as x̄_i = [x_i, γφ(ϕDS(x_i))], where x_i ∈ R^(1×m), x̄_i ∈ R^(1×(m+k)) and 0 ≤ γ ≤ 1. In practice, the value of γ can be determined by evaluation on some held-out data. Finally, training and testing are both performed on the augmented representations. The whole process of our proposed framework for cross-domain sentiment classification is presented in Algorithm 6.1.

Algorithm 6.1 Spectral Feature Alignment (SFA) for cross-domain sentiment classification
Require: A labeled source domain data set D_S = {(x_Si, y_Si)}_{i=1}^{n_S}; an unlabeled target domain data set D_T = {x_Tj}_{j=1}^{n_T}; the number of clusters K; and the number of domain-independent features l.
Ensure: Predicted labels Y_T of the unlabeled data X_T in the target domain.
1: Apply the criteria mentioned in Chapter 6.4.1 on D_S and D_T to select l domain-independent features; the remaining m − l features are treated as domain-specific features. This gives the two views Φ_DI = [ϕDI(x_S); ϕDI(x_T)] and Φ_DS = [ϕDS(x_S); ϕDS(x_T)].
2: By using Φ_DI and Φ_DS, calculate the (DS-word)-(DI-word) co-occurrence matrix M ∈ R^((m−l)×l).
3: Construct the matrix L = D^(−1/2) A D^(−1/2), where A = [0 M; M^T 0].
4: Find the K largest eigenvectors of L, u_1, u_2, ..., u_K, and form the matrix U = [u_1 u_2 ... u_K] ∈ R^(m×K). Let the mapping be φ(x_i) = x_i U_[1:m−l,:], where x_i ∈ R^(m−l).
5: Train a classifier f on {([x_Si, γφ(ϕDS(x_Si))], y_Si)}_{i=1}^{n_S}.
6: return f(x̄_Ti)'s, where x̄_Ti = [x_Ti, γφ(ϕDS(x_Ti))].

As can be seen in the algorithm, in the first step, we select a set of domain-independent features using one of the three strategies introduced in Chapter 6.4.1. In the second step, we calculate the corresponding co-occurrence matrix M and then construct the normalized matrix L corresponding to the bipartite graph of the domain-independent and domain-specific features. Eigen-decomposition is then performed on L to find the K leading eigenvectors, which are used to construct the mapping φ. Finally, training and testing are both performed on the augmented representations.

**6.4.5 Computational Issues**

Note that the computational cost of the SFA algorithm is dominated by the eigen-decomposition of the matrix L ∈ R^(m×m), where m is the total number of features and K is the dimensionality of the latent space.
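The eigen-decomposition step need not be dense: the bipartite affinity matrix is mostly zeros, and ARPACK-style implicitly restarted solvers, wrapped for instance by SciPy's `eigsh`, exploit this sparsity. The sketch below uses a hypothetical small weight matrix; in practice M would be large and very sparse.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def top_k_eigvecs_sparse(M, K):
    """Sparse version of the eigen-step: assemble L = D^{-1/2} A D^{-1/2}
    for the bipartite affinity matrix in sparse form and compute its K
    largest eigenvectors with ARPACK (an implicitly restarted method)."""
    M = sp.csr_matrix(M)
    A = sp.bmat([[None, M], [M.T, None]], format="csr")
    d = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = d_inv_sqrt @ A @ d_inv_sqrt
    vals, vecs = eigsh(L, k=K, which="LA")   # K largest eigenvalues
    return vals, vecs

# Hypothetical small co-occurrence weight matrix (4 DS words x 3 DI words).
M = np.array([[2., 1., 0.],
              [0., 3., 1.],
              [1., 0., 2.],
              [0., 1., 1.]])
vals, vecs = top_k_eigvecs_sparse(M, K=2)
```

For a connected graph, the largest eigenvalue of this normalized matrix is exactly 1, which also provides a quick sanity check on the construction.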

In general, the eigen-decomposition takes O(Km²) time [175]. If m is large, then it would be computationally expensive. However, in practice, the bipartite feature graph defined in Chapter 6.4.2 is extremely sparse. As a result, the matrix L is sparse as well. If the matrix L is sparse, then we can apply the Implicitly Restarted Arnoldi method to solve the eigen-decomposition of L iteratively and efficiently [175]. The computational time is approximately O(Kpm), where p is the number of iterations of the Implicitly Restarted Arnoldi method, defined by users. Hence, the eigen-decomposition of L can be solved efficiently.

**6.4.6 Connection to Other Methods**

Different from state-of-the-art cross-domain sentiment classification algorithms such as the SCL algorithm, which only learns common structures underlying domain-specific words without fully exploiting the relationship between domain-independent and domain-specific words, SFA can better capture the intrinsic relationship between domain-independent and domain-specific words via the bipartite graph representation, and learn a more compact and meaningful representation underlying the graph by co-aligning domain-independent and domain-specific words. Experiments in two real-world domains indicate that SFA is indeed promising in obtaining better performance than several baselines, including SCL, in terms of accuracy for cross-domain sentiment classification.

**6.7 Experiments**

In this section, we describe our experiments on two real-world datasets and show the effectiveness of our proposed SFA for cross-domain sentiment classification.

**6.7.1 Experimental Setup**

Datasets. We first describe the datasets used in our experiments. The first dataset is from Blitzer et al. [25]. It contains a collection of product reviews from Amazon.com. The reviews are about four product domains: books (B), dvds (D), electronics (E) and kitchen appliances (K). In each domain, there are 1,000 positive reviews and 1,000 negative ones. Each review is assigned a sentiment label, −1 (negative review) or +1 (positive review), based on the rating score given by the review author. As a result, we can construct 12 cross-domain sentiment classification tasks: D → B, E → B, K → B, B → D, E → D, K → D, B → E, D → E, K → E, B → K, D → K and E → K, where the word before an arrow corresponds to the source domain and the word after an arrow corresponds to the target domain. The sentiment classification task on this dataset is document-level sentiment classification. We use RevDat to denote this dataset.

The other dataset was collected by us for experimental purposes. We crawled a set of reviews from the Amazon³ and TripAdvisor⁴ websites. The reviews from Amazon are about three product domains: video games (V), electronics (E) and software (S). The TripAdvisor reviews are about the hotel (H) domain. Instead of assigning each review a label, we split these reviews into sentences and manually assigned a polarity label to each sentence. The sentiment classification task on this dataset is therefore sentence-level sentiment classification. In each domain, we randomly select 1,500 positive sentences and 1,500 negative ones for the experiments. Similarly, we also construct 12 cross-domain sentiment classification tasks: V → H, E → H, S → H, E → V, S → V, H → V, E → S, V → S, H → S, V → E, S → E and H → E. We use SentDat to denote this dataset.

For both datasets, we use unigram and bigram features to represent each data example (a review in RevDat and a sentence in SentDat). A summary of the datasets is given in Table 6.5.

Table 6.5: Summary of datasets for evaluation.

  Dataset  Domain       # Reviews  # Pos   # Neg   # Features
  RevDat   books        2,000      1,000   1,000   473,856
           dvds         2,000      1,000   1,000
           electronics  2,000      1,000   1,000
           kitchen      2,000      1,000   1,000
  SentDat  hotel        3,000      1,500   1,500   287,504
           electronics  3,000      1,500   1,500
           video game   3,000      1,500   1,500
           software     3,000      1,500   1,500

³ http://www.amazon.com/
⁴ http://www.tripadvisor.com/

Baselines. In order to investigate the effectiveness of our method, we compare it with several baseline algorithms. The gold standard (denoted by upperBound) is an in-domain classifier trained with labeled data from the target domain. For example, for the D → B task, upperBound corresponds to a classifier trained with labeled data from the B domain; the performance of upperBound in the D → B task can thus also be regarded as an upper bound for the E → B and K → B tasks. One baseline method, denoted by NoTransf, is a classifier trained directly on the source domain training data; for the D → B task, NoTransf means that we train a classifier with labeled data of the D domain. Another baseline method, denoted by PCA, is a classifier trained on a new representation which augments the original features with new features learned by applying latent semantic analysis (which can also be referred to as principal component analysis) [53] on the original view of the domain-specific features (as shown in Table 6.2).

A third baseline method, denoted by FALSA, is a classifier trained on a new representation which augments the original features with new features learned by applying latent semantic analysis on the co-occurrence matrix of domain-independent and domain-specific features. We compare our method with PCA and FALSA in order to investigate whether spectral feature clustering is effective in aligning domain-specific features. We have also compared our algorithm with structural correspondence learning (SCL), proposed in [25]. We follow the details described in Blitzer's thesis [23] to implement SCL with logistic regression to construct the auxiliary tasks. Note that SCL, PCA, FALSA and our proposed SFA all use unlabeled data from the source and target domains to learn a new representation, and train classifiers using the labeled source domain data under the new representations.

Parameter Settings & Evaluation Criteria. For NoTransf, upperBound, PCA, FALSA and SFA, we use logistic regression as the basic sentiment classifier. The tradeoff parameter C in logistic regression [64] is set to 10000, which is equivalent to setting λ = 0.0001 in [23]. The library implemented in [64] is used in all our experiments. The parameters of each model are tuned on some held-out data of the E → B task of RevDat and the H → S task of SentDat, and are then fixed for all experiments. For all experiments on RevDat, we randomly split each domain's data into a training set of 1,600 instances and a test set of 400 instances. For all experiments on SentDat, we randomly split each domain's data into a training set of 2,000 instances and a test set of 1,000 instances. The evaluation of cross-domain sentiment classification methods is conducted on the test set in the target domain, without using any labeled training data from that domain. We report the average results of 5 random trials.

We use accuracy to evaluate the sentiment classification results: the percentage of correctly classified examples over all testing examples. The definition of accuracy is given as follows:

  Accuracy = |{x | x ∈ D_tst ∧ f(x) = y}| / |{x | x ∈ D_tst}|,

where D_tst denotes the test data, y is the ground-truth sentiment polarity and f(x) is the predicted sentiment polarity.
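The training and evaluation protocol above can be sketched end-to-end. The snippet substitutes a minimal gradient-descent logistic regression for the LIBLINEAR-style solver [64] used in the experiments, and uses tiny made-up arrays in place of the augmented representations; it only illustrates that the classifier is fit on source-domain data and that target-domain labels are touched solely by the accuracy computation.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, iters=2000):
    """Minimal logistic regression with labels in {-1, +1}, trained by
    batch gradient descent (a stand-in for the LIBLINEAR-style solver)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))    # sigma(-y * w.x)
        w += lr * (X * (y * s)[:, None]).mean(axis=0)
    return w

def accuracy(y_true, y_pred):
    """Accuracy as defined above: fraction of test examples with f(x) = y."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Source-domain training data (the augmented representations would go here).
X_src = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.9]])
y_src = np.array([1, 1, -1, -1])
# Target-domain test data: its labels are used only for evaluation.
X_tgt = np.array([[0.8, 0.0], [0.0, 0.8]])
y_tgt = np.array([1, -1])

w = train_logreg(X_src, y_src)
acc = accuracy(y_tgt, np.sign(X_tgt @ w))
```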

5 77. Each bar in speciﬁc color represents a speciﬁc method. as are K and E.28 75 72. Studies of the SFA parameters are presented in Chapters 6.5 65 70 V−>E S−>E H−>E E−>V S−>V H−>V E−>S V−>S H−>S E−>H V−>H S−>H (b) Comparison Results on SentDat. we can observe that the four domains of RevDat can be roughly classiﬁed into two groups: B and D domains are similar to each other.5 81.5 67.2(a) shows the comparison results of different methods on RevDat.5 70 78. our proposed SFA performs better than other methods 88 . In the ﬁgure. The horizontal lines are accuracies of upperBound.to identity domain-independent and domain-speciﬁc features.4 80 85 84.6.1 Accuracy (%) 75 Accuracy (%) B−>D E−>D K−>D D−>B E−>B K−>B 80 70 75 65 70 D−>E B−>E K−>E D−>K B−>K E−>K (a) Comparison Results on RevDat.72 77.5 82. Clearly.55 81. 85 85 82. and the number of dimensionality h in [25] is set to be 50. Figure 6. Adapting a classiﬁer from K domain to E domain is much easier than adapting it from B domain. we use mutual information to select “pivots”. the number of domain-speciﬁc features clusters k = 100 and the parameter in feature augmentation γ = 0.7 82.2. each group of bars represents a cross-domain sentiment classiﬁcation task. For SCL.5 75 72. Figure 6. All these parameters and domain-independent feature (or “pivot”) selection methods are determined based on results on the heldout data mentioned in the previous section. From the ﬁgure. The number of “pivots” is set to be 500. 85 82.2: Comparison results (unit: %) on two datasets.7.34 80 Accuracy (%) Accuracy (%) 80 78.2 and 6.6 90 87. We adopt the following settings: the number of domain-independent features l = 500.7. but the two groups are different from each other.

The reason is that representing domain-specific features via domain-independent features can reduce the gap between domains and thus helps find a reasonable representation for cross-domain sentiment classification. In contrast, clustering domain-specific features under the bag-of-words representation may fail to find a meaningful new representation for cross-domain sentiment classification; thus, PCA only outperforms NoTransf slightly in some tasks. It is not surprising to find that FALSA achieves significant improvements compared to NoTransf and PCA. Though our goal is only to cluster domain-specific features, it has been shown that clustering two related sets of points simultaneously can often produce better results than clustering a single set of points only [54]. Our proposed SFA not only utilizes the co-occurrence relationship between domain-independent and domain-specific features to reduce the gap between domains, but also uses graph spectral clustering techniques to co-align both kinds of features to discover meaningful clusters for the domain-specific features.

From the comparison results on SentDat shown in Figure 6.2(b), we can draw a similar conclusion: SFA outperforms the other methods in most tasks. One interesting observation from the results is that SCL does not work as well as it does on RevDat; SCL even performs worse than FALSA in some tasks. The performance of SCL relies highly on the auxiliary tasks. One reason may be that in sentence-level sentiment classification, the data are quite sparse. Thus, on this dataset, it is hard to construct a reasonable number of auxiliary tasks that are useful for modeling the relationship between "pivots" and "non-pivots". We performed t-tests on the comparison results of the two datasets and found that SFA outperforms the other methods at the 0.95 confidence level.

Effect of Domain-Independent Features on SFA. In this section, we conduct two experiments to study the effect of domain-independent features on the performance of SFA. The first experiment tests the effect of domain-independent features identified by different methods on the overall performance of SFA. The second tests the effect of different numbers of domain-independent features on SFA's performance.

As mentioned in Chapter 6.4.1, besides using Eqn. (6.2) to identify domain-independent features, we can also use the other two strategies to identify them. In Table 6.6, we summarize the comparison results of SFA using the different methods to identify domain-independent features. We use SFA_DI, SFA_FQ and SFA_MI to denote SFA using Eqn. (6.2), the frequency of features in both domains, and the mutual information between features and labels in the source domain, respectively. From the table, we can observe that SFA_DI and SFA_FQ achieve comparable results and are stable in most tasks. While SFA_MI may work very well in some tasks, such as K → D and E → B of RevDat, it works very badly in some other tasks, such as E → D and D → E of RevDat. The reason is that applying mutual information on source domain data
can find features that are relevant to the source domain labels, but it cannot guarantee that the selected features are domain-independent. In the worst case, the selected features may even be irrelevant to the labels of the target domain. In addition, we can observe that SFA performs well and stably in most tasks; thus, SFA is robust to the quality and number of domain-independent features.

Table 6.6: Experiments with different domain-independent feature selection methods. Numbers in the table are accuracies in percentage. [The table reports, for each of the 24 tasks of RevDat and SentDat, the accuracies of SFA_DI, SFA_FQ, SFA_MI and the combined variants SFA_DI+MI and SFA_FQ+MI; the individual entries are not recoverable from this copy.]

To test the effect of the number of domain-independent features on the performance of SFA, we apply SFA on all 24 tasks of the two datasets, fixing k = 100 and γ = 0.6. The value of l is changed from 300 to 700 with a step length of 100. The results are shown in Figures 6.3(a) and 6.3(b). From the figures, we can find that when l is in the range [400, 700], SFA performs well and stably in most tasks.

Model Parameter Sensitivity of SFA. Besides the number of domain-independent features l, there are two other parameters in SFA: one is the number of domain-specific feature clusters k, and the other is the tradeoff parameter γ in feature augmentation. In this section, we further test the sensitivity of these two parameters on the overall performance of SFA.

91 . In this experiment. we ﬁx l = 500.6 and change the value of k from 50 to 200 with step length 25.5 Accuracy (%) Accuracy (%) 400 500 600 Numbers of Domain−Independent Features 700 77. which implies that the features learned by the spectral feature alignment algorithm do boost the performance in cross-domain sentiment classiﬁcation.5(b).5 75 77.4(b) show the results of SFA under varying values of k.5 70 300 400 500 600 Numbers of Domain−Independent Features 700 70 300 400 500 600 Numbers of Domain−Independent Features 700 (a) Results on RevDat.5 75 75 72.5 S−>E H−>E E−>V S−>V H−>V 80 E−>S V−>S H−>S E−>H V−>H S−>H 80 77. when γ ≥ 0. k = 100 and change the values of γ from 0.4(a) and 6.5 75 72. in most task. Figure 6.1.B−>D 85 E−>D K−>D D−>B E−>B K−>B 90 87.3: Study on varying numbers of domain-independent features of SFA We ﬁrst test the sensitivity of the parameter k.2.5 72. However.5 72. Figure 6. when the cluster number k falls in the range from 75 to 175. Apparently. Finally.1 to 1 with step length 0. SFA works badly in most tasks.5 80 77.3. Results are shown in Figure 6. we ﬁx l = 500.5 80 Accuracy (%) Accuracy (%) 82. when γ < 0. In this experiment. SFA gets worse performance when the value of k increases. γ = 0. we test the sensitivity of the parameter γ. H → S and E → H. V−>E 82.5(a) and 6. which means the original features dominate the effect of the new feature representation. SFA performs well and stably. SFA works stably in most tasks.5 85 D−>E B−>E K−>E D−>K B−>K E−K 82.5 70 300 70 300 400 500 600 Numbers of Domain−Independent Features 700 (b) Results on SentDat. Note that though for some tasks such as V → H.

Figure 6.4: Model parameter sensitivity study of k on two datasets. (a) Results on RevDat under varying values of k; (b) Results on SentDat under varying values of k.

6.8 Summary

In this section, we have studied the problem of how to develop a latent space for transfer learning when explicit domain knowledge is available in some specific applications. We proposed a spectral feature alignment (SFA) algorithm for sentiment classification. Experimental results showed that our proposed method outperforms a state-of-the-art method in cross-domain sentiment classification. One limitation of SFA is that it is domain-driven, which means it is hard to apply to other applications. However, the success of SFA in sentiment classification suggests that in some domain areas, where domain knowledge is available or easy to obtain, developing a domain-driven method may be more effective than general solutions for learning the latent space for transfer learning.

Figure 6.5: Model parameter sensitivity study of γ on two datasets. (a) Results on RevDat under varying values of γ; (b) Results on SentDat under varying values of γ.

**CHAPTER 7 CONCLUSION AND FUTURE WORK**

7.1 Conclusion

Transfer learning is still at an early but promising stage, and many research issues remain to be addressed. In this thesis, we first surveyed the field of transfer learning: we gave definitions of transfer learning, summarized transfer learning into different settings, categorized transfer learning approaches into four contexts, analyzed the relationship between transfer learning and other related areas, discussed some interesting research issues in transfer learning, and introduced some applications of transfer learning. We then focused on the transductive transfer learning setting, and proposed two different feature-based transfer learning frameworks based on different situations: (1) domain knowledge is hidden or hard to capture, and (2) domain knowledge can be observed or is easy to encode into learning. For the first situation, we proposed a novel and general dimensionality reduction framework for transfer learning, which aims to learn a latent space that simultaneously reduces the distance between domains and preserves properties of the original data. Based on this framework, we proposed three different algorithms to learn the latent space: Maximum Mean Discrepancy Embedding (MMDE) [135], Transfer Component Analysis (TCA) [139] and Semi-supervised Transfer Component Analysis (SSTCA) [140]. MMDE learns the latent space by learning a non-parametric kernel matrix, which may be more precise. However, the non-parametric kernel matrix learning requires solving a semi-definite program (SDP), which is extremely expensive (O((nS + nT)^6.5)). In contrast, TCA and SSTCA learn a parametric kernel matrix instead, which may make the latent space less precise, but avoids the use of an SDP solver and is much more efficient (O(m(nS + nT)^2), where m ≪ nS + nT). Hence, in practice, TCA and SSTCA are more applicable. We applied these three algorithms to two diverse applications, WiFi localization and text classification, and achieved promising results.
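The parametric route taken by TCA can be sketched concretely: it reduces to a dense eigenproblem over the kernel matrix of the pooled source and target samples. Below is a minimal, illustrative NumPy sketch, not the thesis implementation: it assumes a linear kernel, a fixed regularizer mu, and a plain dense eigensolver, and all names are ours.

```python
import numpy as np

def tca_embed(Xs, Xt, m, mu=1.0):
    """Illustrative Transfer Component Analysis sketch (linear kernel).

    Learns m transfer components that reduce the MMD between a source
    sample Xs (ns x d) and a target sample Xt (nt x d) while preserving
    variance through the centered kernel.
    """
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    K = X @ X.T                                    # linear kernel matrix
    # L encodes the MMD: trace(K L) is the squared distance between the
    # empirical kernel means of the two domains
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)
    H = np.eye(n) - np.full((n, n), 1.0 / n)       # centering matrix
    # Leading eigenvectors of (K L K + mu I)^{-1} K H K give the projection
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    W = vecs[:, np.argsort(-vals.real)[:m]].real   # n x m transformation
    return K @ W                                   # embedded source + target

rng = np.random.default_rng(0)
Z = tca_embed(rng.normal(size=(10, 5)), rng.normal(1.0, 1.0, (8, 5)), m=2)
```

The cost is dominated by dense matrix operations on the (ns + nt) x (ns + nt) kernel, which is why this parametric route scales far better than solving an SDP of the same size.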
In order to capture and apply the domain knowledge inherent in many application areas, we also proposed a Spectral Feature Alignment (SFA) [137] algorithm for cross-domain sentiment classification, which aims at aligning domain-specific words from different domains in a latent space by modeling the correlation between the domain-independent and domain-specific words in a bipartite graph, using the domain-independent features as a bridge. Experimental results show that SFA achieves strong classification performance in cross-domain sentiment classification.
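The spectral step behind this alignment can be illustrated with a small sketch. This is a simplified reading of the idea, not the thesis code: the function name, the SVD shortcut for the bipartite spectral problem, and the toy co-occurrence matrix are all ours.

```python
import numpy as np

def spectral_alignment(M, k):
    """Sketch of the spectral step of feature alignment.

    M : (l, d) co-occurrence matrix between l domain-independent words
        and d domain-specific words (the bipartite graph).
    Returns a (d, k) mapping of domain-specific words onto k aligned
    feature clusters, from the top-k spectral vectors of the normalized
    bipartite affinity matrix.
    """
    # Degree-normalize the bipartite adjacency: D1^{-1/2} M D2^{-1/2}
    d1 = np.maximum(M.sum(axis=1), 1e-12)
    d2 = np.maximum(M.sum(axis=0), 1e-12)
    Mn = M / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    # Top-k right singular vectors of Mn correspond to the leading
    # eigenvectors of the normalized affinity of the full bipartite graph
    _, _, Vt = np.linalg.svd(Mn, full_matrices=False)
    return Vt[:k].T                                # d x k mapping

# toy co-occurrence matrix: 2 domain-independent x 3 domain-specific words
M = np.array([[4.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])
U = spectral_alignment(M, k=1)
```

Domain-specific words that co-occur with similar domain-independent words receive similar coordinates in the returned mapping, which is what lets a classifier trained on one domain generalize to the other.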


7.2 Future Work

In the future, we plan to continue to explore our work on transfer learning along the following directions.

• We plan to study the proposed dimensionality reduction framework for transfer learning theoretically. Research issues include: (1) how to determine the number of reduced dimensions adaptively; (2) how to derive a generalization error bound for the proposed framework; and (3) how to develop an efficient algorithm for kernel choice in TCA and SSTCA based on the recent theoretical study in [176].

• We plan to extend the dimensionality reduction framework in a relational learning manner, which means that data in the source and target domains can be relational instead of independently distributed. Most previous transfer learning methods assumed data from different domains to be independently distributed. However, in many real-world applications, such as link prediction in social networks or classifying Web pages within a Web site, data are often intrinsically relational, which presents a major challenge to transfer learning [52]. We plan to develop a relational dimensionality reduction framework that can encode the relational information among data into the latent space learning, while also reducing the difference between domains and preserving properties of the original data.

• We plan to study the negative transfer issue. It has been shown that when the source and target tasks are unrelated, the knowledge extracted from a source task may not help, and may even hurt, the performance on the target task [159]. Thus, how to avoid negative transfer and ensure safe transfer of knowledge is crucial in transfer learning. Recently, in [30], we proposed an Adaptive Transfer learning algorithm based on Gaussian Processes (AT-GP), which adapts the transfer learning scheme by automatically estimating the similarity between a source and a target task.
Preliminary experimental results show that the proposed algorithm is promising for avoiding negative transfer. In the future, we plan to develop a criterion to measure whether there exists a latent space onto which knowledge can be safely transferred across domains/tasks. Based on this criterion, we can develop a general framework to automatically identify whether the source and target tasks are related, and determine whether we should project the data from the source and target domains onto the latent space.


REFERENCES

[1] Ahmed Abbasi, Hsinchun Chen, and Arab Salem. Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Transactions on Information Systems, 26(3):1–34, June 2008.
[2] Eneko Agirre and Oier Lopez de Lacalle. On robustness and domain adaptation using svd for word sense disambiguation. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 17–24. ACL, June 2008.
[3] Morteza Alamgir, Moritz Grosse-Wentrup, and Yasemin Altun. Multitask learning for brain-computer interfaces. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 17–24. JMLR W&CP, May 2010.
[4] Rie K. Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[5] Rie Kubota Ando and Tong Zhang. A high-performance semi-supervised learning method for text chunking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 1–9. ACL, June 2005.
[6] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48. MIT Press, 2007.
[7] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 71–85. Springer, September 2008.
[8] Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. A spectral regularization framework for multi-task structure learning. In Annual Conference on Neural Information Processing Systems 20, pages 25–32. MIT Press, 2008.
[9] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Proceedings of the 7th IEEE International Conference on Data Mining Workshops, pages 77–82. IEEE Computer Society, 2007.


[10] Jing Bai, Ke Zhou, Guirong Xue, Hongyuan Zha, Gordon Sun, Belle Tseng, Zhaohui Zheng, and Yi Chang. Multi-task learning for learning to rank in web search. In Proceeding of the 18th ACM Conference on Information and Knowledge Management, pages 1549–1552. ACM, November 2009.
[11] Bart Bakker and Tom Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.
[12] Maxim Batalin, Gaurav Sukhatme, and Myron Hattig. Mobile robot navigation using a sensor network. In Proceedings of IEEE International Conference on Robotics and Automation, pages 636–642. IEEE, April 2004.
[13] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[14] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
[15] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Annual Conference on Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
[16] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[17] Shai Ben-David, Tyler Lu, Teresa Luu, and David Pal. Impossibility theorems for domain adaptation. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 129–136. JMLR W&CP, May 2010.
[18] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Proceedings of the Sixteenth Annual Conference on Learning Theory, pages 825–830. August 2003.
[19] Indrajit Bhattacharya, Shantanu Godbole, Sachindra Joshi, and Ashish Verma. Cross-guided clustering: Transfer of relevant supervision across domains for improved clustering. In Proceedings of the 9th IEEE International Conference on Data Mining, pages 41–50. IEEE Computer Society, December 2009.
[20] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for HIV therapy screening. In Proceedings of the 25th International Conference on Machine Learning, pages 56–63. ACM, July 2008.

[21] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pages 81–88. ACM, June 2007.
[22] Steffen Bickel, Christoph Sawade, and Tobias Scheffer. Transfer learning by distribution matching for targeted advertising. In Advances in Neural Information Processing Systems 21, pages 145–152. MIT Press, 2009.
[23] John Blitzer. Domain Adaptation of Natural Language Processing Systems. PhD thesis, The University of Pennsylvania, 2008.
[24] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. Learning bounds for domain adaptation. In Annual Conference on Neural Information Processing Systems 20, pages 129–136. MIT Press, 2008.
[25] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 432–439. ACL, June 2007.
[26] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 120–128. ACL, July 2006.
[27] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Learning Theory, pages 92–100. ACM, July 1998.
[28] Edwin Bonilla, Kian Ming Chai, and Chris Williams. Multi-task gaussian process prediction. In Annual Conference on Neural Information Processing Systems 20, pages 153–160. MIT Press, 2008.
[29] Bin Cao, Nathan N. Liu, and Qiang Yang. Transfer learning for collective link prediction in multiple heterogenous domains. In Proceedings of the 27th International Conference on Machine Learning. ACM, June 2010.
[30] Bin Cao, Sinno Jialin Pan, Yu Zhang, Dit-Yan Yeung, and Qiang Yang. Adaptive transfer learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. AAAI Press, July 2010.
[31] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[32] Kian Ming A. Chai, Christopher K. I. Williams, Stefan Klanke, and Sethu Vijayakumar. Multi-task gaussian process learning of robot inverse dynamics. In Advances in Neural Information Processing Systems 21, pages 265–272. MIT Press, 2009.
[33] Yee Seng Chan and Hwee Tou Ng. Domain adaptation with active learning for word sense disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 49–56. ACL, June 2007.
[34] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
[35] Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang, and Belle Tseng. Multi-task learning for boosting with application to web search ranking. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1189–1198. ACM, July 2010.
[36] Bo Chen, Wai Lam, Ivor W. Tsang, and Tak-Lam Wong. Extracting discriminative concepts for domain adaptation in text mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 179–188. ACM, June 2009.
[37] Jianhui Chen, Lei Tang, Jun Liu, and Jieping Ye. A convex formulation for learning shared structures from multiple tasks. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 137–144. ACM, June 2009.
[38] Lin Chen, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Tag-based web photo retrieval improved by batch mode re-tagging. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, June 2010.
[39] Tianqi Chen, Jun Yan, Gui-Rong Xue, and Zheng Chen. Transfer learning for behavioral targeting. In Proceedings of the 19th International Conference on World Wide Web, pages 1077–1078. ACM, April 2010.
[40] Yuqiang Chen, Ou Jin, Gui-Rong Xue, Jia Chen, and Qiang Yang. Visual contextual advertising: Bringing textual advertisements to images. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. AAAI Press, July 2010.
[41] Fan R. K. Chung. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[42] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, July 2008.

[43] Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. On kernel-target alignment. In Advances in Neural Information Processing Systems 14, pages 367–373. MIT Press, 2002.
[44] W. Bruce Croft and John Lafferty. Language Modeling for Information Retrieval. Kluwer Academic Publishers, Norwell, MA, USA, 2003.
[45] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, pages 193–200. ACM, June 2007.
[46] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Co-clustering based classification for out-of-domain documents. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining. ACM, August 2007.
[47] Wenyuan Dai, Ou Jin, Qiang Yang, and Yong Yu. Eigentransfer: a unified framework for transfer learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 193–200. ACM, June 2009.
[48] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. Self-taught clustering. In Proceedings of the 25th International Conference of Machine Learning, pages 200–207. ACM, July 2008.
[49] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Transferring naive bayes classifiers for text classification. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pages 540–545. AAAI Press, July 2007.
[50] Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated learning: Transfer learning across different feature spaces. In Annual Conference on Neural Information Processing Systems 21, pages 353–360. MIT Press, 2009.
[51] Hal Daumé III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263. ACL, June 2007.
[52] Jesse Davis and Pedro Domingos. Deep transfer via second-order markov logic. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 217–224. ACM, June 2009.
[53] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.

[54] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–274. ACM, August 2001.
[55] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Proceedings of the Twenty-first International Conference on Machine Learning, pages 225–232. ACM, July 2004.
[56] Harris Drucker. Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 107–115. Morgan Kaufmann Publishers Inc., July 1997.
[57] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In Proceedings of the 26th International Conference on Machine Learning, pages 289–296. ACM, June 2009.
[58] Lixin Duan, Ivor W. Tsang, Dong Xu, and Stephen J. Maybank. Domain transfer svm for video concept detection. In Proceedings of the 22nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1375–1381. IEEE, June 2009.
[59] Lixin Duan, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Visual event recognition in videos by learning from web data. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, June 2010.
[60] Eric Eaton, Marie desJardins, and Terran Lane. Modeling transfer relationships between learning tasks for improved inductive transfer. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 317–332. Springer, September 2008.
[61] Henry C. Ellis. The Transfer of Learning. The Macmillan Company, New York, 1965.
[62] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
[63] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, August 2004.
[64] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[65] Brian Ferris, Dieter Fox, and Neil Lawrence. WiFi-SLAM using gaussian process latent variable models. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2480–2485. January 2007.
[66] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages 23–37. Springer-Verlag, 1995.
[67] Jing Gao, Wei Fan, Jing Jiang, and Jiawei Han. Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 283–291. ACM, August 2008.
[68] Wei Gao, Peng Cai, Kam-Fai Wong, and Aoying Zhou. Learning to rank only using training data from related domain. In Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 162–169. ACM, July 2010.
[69] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, volume 3734 of Lecture Notes in Computer Science, pages 63–77. Springer, October 2005.
[70] Arthur Gretton, Karsten M. Borgwardt, M. Rasch, Bernhard Schölkopf, and Alex J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.
[71] Hong Lei Guo, Li Zhang, and Zhong Su. Empirical study on the performance stability of named entity recognition model across domains. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 509–516. ACL, July 2006.
[72] Rakesh Gupta and Lev Ratinov. Text categorization with knowledge transfer from heterogeneous data sources. In Proceedings of the 23rd National Conference on Artificial Intelligence, pages 842–847. AAAI Press, July 2008.
[73] Sunil Kumar Gupta, Dinh Phung, Brett Adams, Truyen Tran, and Svetha Venkatesh. Nonnegative shared subspace learning and its application to social media retrieval. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1169–1178. ACM, July 2010.
[74] Andreas Haeberlen, Eliot Flannery, Andrew M. Ladd, Algis Rudys, Dan S. Wallach, and Lydia E. Kavraki. Practical robust localization over large-scale 802.11 wireless networks. In Proceedings of the 10th Annual International Conference on Mobile Computing and Networking, pages 70–84. ACM, September 2004.

[75] Abhay Harpale and Yiming Yang. Active learning for multi-task adaptive filtering. In Proceedings of the 27th International Conference on Machine Learning. ACM, June 2010.
[76] John A. Hartigan and M. Anthony Wong. A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
[77] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition, 2009.
[78] Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Inlier-based outlier detection via direct density ratio estimation. In Proceedings of the 8th IEEE International Conference on Data Mining, pages 223–232. IEEE, December 2008.
[79] Jean Honorio and Dimitris Samaras. Multi-task learning of gaussian graphical models. In Proceedings of the 27th International Conference on Machine Learning. ACM, June 2010.
[80] Derek H. Hu, Vincent W. Zheng, and Qiang Yang. Real world activity recognition with multiple goals. In Proceedings of 10th ACM International Conference on Ubiquitous Computing, pages 30–39. ACM, September 2008.
[81] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM, August 2004.
[82] Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19, pages 601–608. MIT Press, 2007.
[83] Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems 21, pages 745–752. MIT Press, 2009.
[84] Tony Jebara. Multi-task feature and kernel selection for SVMs. In Proceedings of the 21st International Conference on Machine Learning. ACM, July 2004.
[85] Jing Jiang. Domain Adaptation in Natural Language Processing. PhD thesis, The University of Illinois at Urbana-Champaign, 2008.
[86] Jing Jiang. Multi-task transfer learning for weakly-supervised relation extraction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 1012–1020. ACL, August 2009.

[87] Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264–271. ACL, June 2007.
[88] Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 219–230. ACM, April 2008.
[89] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142. Springer, April 1998.
[90] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 200–209. Morgan Kaufmann Publishers Inc., June 1999.
[91] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.
[92] Yoshinobu Kawahara and Masashi Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In Proceedings of 2009 SIAM International Conference on Data Mining, pages 389–400. April 2009.
[93] Lun-Wei Ku, Yu-Ting Liang, and Hsin-Hsi Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
[94] Christoph H. Lampert and Oliver Krömer. Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning. In Proceedings of the 11th European Conference on Computer Vision, Lecture Notes in Computer Science. Springer, September 2010.
[95] Gert R. G. Lanckriet, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[96] Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339. 1995.

[97] Neil D. Lawrence and John C. Platt. Learning to learn with the informative vector machine. In Proceedings of the 21st International Conference on Machine Learning. ACM, July 2004.
[98] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Annual Conference on Neural Information Processing Systems 19, pages 801–808. MIT Press, 2007.
[99] Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning, pages 489–496. ACM, July 2007.
[100] Julie Letchner, Dieter Fox, and Anthony LaMarca. Large-scale localization from wireless signal strength. In Proceedings of the 20th National Conference on Artificial Intelligence, pages 15–20. AAAI Press / The MIT Press, July 2005.
[101] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. ACM / Springer, July 1994.
[102] Bin Li, Qiang Yang, and Xiangyang Xue. Can movies and books collaborate?: cross-domain collaborative filtering for sparsity reduction. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 2052–2057. Morgan Kaufmann Publishers Inc., July 2009.
[103] Bin Li, Qiang Yang, and Xiangyang Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 617–624. ACM, June 2009.
[104] Tao Li, Vikas Sindhwani, Chris Ding, and Yi Zhang. Knowledge transformation for cross-domain sentiment classification. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 716–717. ACM, July 2009.
[105] Tao Li, Yi Zhang, and Vikas Sindhwani. A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association of Computational Linguistics, pages 244–252. ACL, August 2009.
[106] Lin Liao, Donald J. Patterson, Dieter Fox, and Henry A. Kautz. Learning and inferring transportation routines. Artificial Intelligence, 171(5-6):311–331, 2007.

[107] Xuejun Liao, Ya Xue, and Lawrence Carin. Logistic regression with an auxiliary data source. In Proceedings of the 22nd International Conference on Machine Learning, pages 505–512. ACM, August 2005.
[108] Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Spectral domain-transfer learning. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 488–496. ACM, August 2008.
[109] Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, and Yong Yu. Can chinese web pages be classified with english data source? In Proceedings of the 17th International Conference on World Wide Web, pages 969–978. ACM, April 2008.
[110] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, January 2007.
[111] Bing Liu. Sentiment analysis and subjectivity. Handbook of Natural Language Processing, Second Edition, 2010.
[112] Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, June 2009.
[113] Kang Liu and Jun Zhao. Cross-domain sentiment classification using a two-stage method. In Proceeding of the 18th ACM Conference on Information and Knowledge Management, pages 1717–1720. ACM, November 2009.
[114] Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. Modeling and predicting the helpfulness of online reviews. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 443–452. IEEE Computer Society, December 2008.
[115] Yiming Liu, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Using large-scale web data to facilitate textual query based retrieval of consumer photos. In Proceedings of the Seventeenth ACM International Conference on Multimedia, pages 55–64. ACM, October 2009.
[116] Yue Lu, Panayiotis Tsaparas, Alexandros Ntoulas, and Livia Polanyi. Exploiting social context for review quality prediction. In Proceedings of the 19th International Conference on World Wide Web, pages 691–700. ACM, April 2010.
[117] Yue Lu and Chengxiang Zhai. Opinion integration through semi-supervised topic modeling. In Proceedings of the 17th International Conference on World Wide Web, pages 121–130. ACM, April 2008.

Fuzhen Zhuang. 2009. Mooney. pages 103–112. Bernhard Sch¨ lkopf. Domain adaptation with multiple sources. [128] Tony Mullen and Nigel Collier. July 2007. October 2008. [125] Lilyana Mihalkova and Raymond J. In Proceedings of the 21st international jont conference on Artiﬁcal intelligence. New York. [124] Lilyana Mihalkova. pages 985–992. and Neel Sundaresan. MIT Press. Mapping and revising markov logic networks for transfer learning. Hui Xiong. McGraw-Hill. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Ray. Domain adaptation: Learning bounds and algorithms. ChengXiang Zhai. In Proceeding of the 17th ACM conference on Information and knowledge management. Machine Learning. Sentiment analysis using support vector machines with diverse information sources. July 2004. and Afshin Rostamizadeh. Gunnar R¨ tsch. pages 412–418. [123] Yishay Mansour.. pages 608–614. AAAI Press. 107 . Multiple source adaptation and the renyi divergence. [127] Tom M. July 2009. M. June 2009. Tuyen Huynh. Mehryar Mohri. In Proceedings of the 22nd Annual Conference on Learning Theory. [121] Yishay Mansour. pages 1163–1168. In Annual Conference on Neural Information Processing Systems 20. 2009. In Advances in Neural Information Processing Systems 21. Mooney. [126] Sebastian Mika. pages 41–48. and Qing He. 1997. and Afshin Rostamizadeh. 2009. In Proceedings of the 25th Conference on Uncertainty in Artiﬁcial Intelligence.[118] Yue Lu. 1999. [122] Yishay Mansour. Mehryar Mohri. Jason Weston. Rated aspect summarization of short comments. Transfer learning using kolmogorov complexity: Basic theory and empirical evaluations. In Neural Networks for Signal Prou cessing IX. [119] Ping Luo. Yuhong Xiong. and Klaus-Robert a o M¨ ller. 2008. In Proceedings of the 22nd AAAI Conference on Artiﬁcial Intelligence. In Proceedings of the 18th international conference on World wide web. Mitchell. Transfer learning from minimal target data by mapping across relational domains. 
ACM. and Afshin Rostamizadeh. Mehryar Mohri. [120] M. Fisher discriminant analysis with kernels. Hassan Mahmud and Sylvian R. pages 1041–1048. Transfer learning from multiple source domains via consensus regularization. and Raymond J. Morgan Kaufmann Publishers Inc.

Multidimensional vector regression for accurate and low-cost location estimation in pervasive computing. Adaptive localization in a dynamic WiFi environment through multi-view learning. Kwok. [135] Sinno Jialin Pan. pages 1108–1113. Interior-point Polynomial Algorithms in Convex Programming. [130] Andrew Y. In Proceedings of the 22nd AAAI Conference on Artiﬁcial Intelligence. AAAI Press. In Proceedings of the 21st National Conference on Artiﬁcial Intelligence. July 2008. and Jeffrey J. pages 751–760. Qiang Yang. In Advances in Neural Information Processing Systems 14. Qiang Yang. MIT Press. July 2006. Jian-Tao Sun. In Proceedings of the 20th International Joint Conference on Artiﬁcial Intelligence. Co-localization from labeled and unlabeled data using graph laplacian. ACM. and James T. [132] Jeffrey J. Ng. James T. Kwok. Ivor W. IEEE Transactions on Knowledge and Data Engineering. Crossdomain sentiment classiﬁcation via spectral feauture alignment. [133] Jeffrey J. James T. April 2010. Qiang Yang. and Qiang Yang. Domain adaptation via transfer component analysis. pages 849–856. and Yair Weiss. 39(2-3):103–134. [137] Sinno Jialin Pan. July 2008. Jordan. Society for Industrial and Applied Mathematics. AAAI Press. September 2006. James T. Sebastian Thrun. pages 988–993. Qiang Yang. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. Machine Learning. Pan. 1994. In Proceedings of the 21st International Joint Conference on Artiﬁcial Intelligence. Transfer learning via dimensionality reduction. On spectral clustering: Analysis and an algorithm. Philadelphia. Pan.[129] Yurii Nesterov and Arkadii Nemirovskii. Xiaochuan Ni. In Proceedings of the 19th International Conference on World Wide Web. 108 . and Yiqiang Chen. [131] Kamal Nigam. [139] Sinno Jialin Pan. Kwok. Tsang. January 2007. Kwok. 2000. A manifold regularization approach to calibration reduction for sensor-network based tracking. and Dit-Yan Yeung. and Tom Mitchell. [134] Jeffrey J. 
and Qiang Yang. pages 1187–1192. Michael I. Hong Chang. PA. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. pages 677–682. Andrew Kachites McCallum. James T. Qiang Yang. Pan. [136] Sinno Jialin Pan. July 2009. pages 1383–1388. Pan and Qiang Yang. AAAI Press. July 2007. 2001. Dou Shen. Kwok. pages 2166–2171. and Chen Zheng. Transferring localization models across space. 18(9):1181–1193. [138] Sinno Jialin Pan. Text classiﬁcation from labeled and unlabeled documents using EM. AAAI Press.

and Shivakumar Vaithyanathan. A survey on transfer learning. Selftaught learning: Transfer learning from unlabeled data. Dataset Shift in Machine Learning. IEEE Transactions on Neural Networks. Hu. ACL. Kwok. ACL. Benjamin Packer. June 2006. [142] Sinno Jialin Pan. and Derek H. and Andrew Y. [150] Rajat Raina. [144] Bo Pang and Lillian Lee. Zheng. Inﬁnite predictor subspace models for multitask learning. [145] Bo Pang and Lillian Lee. Evan W. [149] Piyush Rai and Hal Daum´ III. July 2008. ACM. pages 271–278. 22(10):1345–1359. Xiang. Tsang. e In Proceedings of the 26th Conference on Uncertainty in Artiﬁcial Intelligence. In Proceedings of the 24th AAAI Conference on Artiﬁcial Intelligence. Lawrence. Domain adaptation via transfer component analysis. [147] David Pardoe and Peter Stone. Alexis Battle. October 2010. Thumbs up? Sentiment classiﬁcation using machine learning techniques. July 2002. Qiang Yang. James T. Liu. In Proceedings of the 23rd International Conference on Machine Learning. 2008. [151] Rajat Raina. and Neil D. Vincent W. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. In Proceedings of the 24th International Conference on Machine Learning. Transfer learning in collaborative ﬁltering for sparsity reduction. [141] Sinno Jialin Pan and Qiang Yang. Foundations and Trends in Information Retrieval. 2(1-2):1–135. 2010. [143] Weike Pan. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. 2009. Ng. and Daphne Koller. Lillian Lee. pages 713–720. ACM. and Qiang Yang. Masashi Sugiyama.[140] Sinno Jialin Pan. Opinion mining and sentiment analysis. 109 . In Proceedings of the 42nd Annual Meeting of the Association of Computational Linguistics. MIT Press. In Proceedings of the 27th International Conference on Machine learning. Boosting for regression transfer. Transfer learning for WiFi-based indoor localization. [148] Joaquin Quionero-Candela. June 2010. pages 79–86. July 2010. 
Ivor W. Constructing informative priors using transfer learning. Ng. June 2007. and Qiang Yang. [146] Bo Pang. pages 759–766. In Proceedings of the Workshop on Transfer Learning for Complex Task of the 23rd AAAI Conference on Artiﬁcial Intelligence. AAAI Press. July 2010. Submitted. July 2004. Honglak Lee. ACM. Nathan N. Anton Schwaighofer. Andrew Y. IEEE Transactions on Knowledge and Data Engineering.

Bayesian multiple instance learning: automatic feature selection and inductive transfer. [155] Matthew Richardson and Pedro Domingos. Kernel-based inductive transfer. Adapting visual category models to new domains. July 1998. and Bernt Schiele. September 2010. September 1999. Brian Kulis. Richman and Patrick Schone.[152] Sowmya Ramachandran and Raymond J.J. In Text REtrieval Conference 2001. To transfer or not to transfer. Smola. R¨ tsch. July 2008. ACM. Mining Wiki resources for multilingual named entity recognition. July 2010. Activity recognition based on home to home transfer learning. In a NIPS-05 Workshop on Inductive Transfer: 10 Years Later. Gy¨ rgy Szarvas. Mika. pages 1–9. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. and Intent Recognition of the 24th AAAI Conference on Artiﬁcial Intelligence. Bharat Rao. June 2010. S. Theory reﬁnement of bayesian networks with hidden variables. [160] Ulrich R¨ ckert and Stefan Kramer. [159] Michael T. Rosenstein. In Proceedings of the 14th International Conference on Machine Learning. Cook. and R. Balaji Krishnapuram. Burges. P. In Proceedings of the 11th European Conference on Computer Vision. December 2005. [157] Stephen Robertson. Mooney. pages 454–462. IEEE. 2006. G. and Ian Soboroff. Jinbo Bi. Lecture Notes in Computer Science. ACL. In Proceedings of the 25th International Conference on Machine learning. Knirsch. Activity. In Proceedings of the Workshop on Plan. Input space versus feature space in kernel-based methods. [153] Parisa Rashidi and Diane J. pages 220–233. pages 808– 815. Springer. AAAI Press. and o u a A. o What helps where . 2001. Markov logic networks. IEEE Transactions on Neural Networks.J. 110 . and Trevor Darrell. Stephen Robertson. September 2008. Klaus-Robert M¨ ller. Iryna Gurevych. [161] Kate Saenko. In Proceedings u of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases. Zvika Marx. C. 
[158] Marcus Rohrbach. [154] Vikas C. Murat Dundar. The trec 2002 ﬁltering track report. Mario Fritz. In Proceedings of 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technology. Michael Stark. June 2008. 10(5):1000–1017. [156] Alexander E.and why? semantic relatedness for knowledge transfer. Raykar. Sch¨ lkopf. and Leslie Pack Kaelbling. [162] B. 62(1-2):107–136. Machine Learning Journal.C.

Predictive distribution matching svm for multi-domain learning. 90:227–244. Arthur Gretton. and Jiangtao Ren. and Kai Yu. Computer Sciences Technical Report 1648. IEEE Computer Society. [168] Burr Settles. Tsang. and Gunnar R¨ tsch. and Beyond. Smola. pages 342–357. Alexander Smola. Yew-Soon Ong Ivor W. 2005. Journal of Statistical Planning and Inference. and Kee-Khoon Lee. 2001. Active learning literature survey. pages 1025–1030. Neural Computation. Nonlinear component o u analysis as a kernel eigenvalue problem. [166] Gabriele Schweikert. [173] Le Song. Volker Tresp. pages 1433–1440. Lecture Notes in Computer Science. Wei Fan. In Proceedings of the 18th International Conference on Algorithmic Learning Theory. [164] Bernhard Scholkopf and Alexander J. 111 . Springer. and Bernhard Sch¨ lkopf.[163] Bernhard Sch¨ lkopf. September 2008. Smola. MIT Press. [172] Alex J. Le Song. In Annual Conference on Neural Information Processing Systems 17. 2000. 2009. MA. MIT Press. December 2007. 2008. Regularization. USA. Bernhard Sch¨ lkopf. 1998. [169] Xiaoxiao Shi. Actively transfer domain knowledge. pages 1209–1216. The University of Sydney. University of Wisconsin–Madison. 10(5):1299–1319. Christian Widmer. In Proceedings of the 8th IEEE International Conference on Data Mining. Optimization. [167] Chun-Wei Seah. A Hilbert space o embedding for distributions. Learning via Hilbert Space Embedding of Distributions. pages 13–31. [170] Hidetoshi Shimodaira. 2009. PhD thesis. October 2007. An o a empirical analysis of domain adaptation algorithms for genomic sequence analysis. Learning with Kernels: Support Vector Machines. Learning gaussian process kernels via hierarchical bayes. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases. [171] Vikas Sindhwani and Prem Melville. [165] Anton Schwaighofer. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases 2010. 
In Advances in Neural Information Processing Systems 21. and Klaus-Robert M¨ ller. Improving predictive inference under covariate shift by weighting the log-likelihood function. Document-word co-regularization for semi- supervised sentiment analysis. September 2010. Cambridge.

Michael Goesele. September 2009. Sorensen. [180] Masashi Sugiyama. 2001. Motoaki Kawanabe. pages 1750–1758. [176] Bharath Sriperumbudur. Colored maximum variance unfolding. Learning to learn. Hisashi Kashima. [177] Michael Stark. pages 130–137. Neural Network. editors. and Pui Ling Chui. Dimensionality reduction for density ratio estimation in high-dimensional spaces. USA. Rice University. and Arthur Gretton. Kernel choice and classiﬁability for RKHS embeddings of probability distrio butions. In Advances in Neural Information Processing Systems 20. Journal of Machine Learning Research. pages 308–316. ACL. A shape-based object class model for knowledge transfer. Kenji Fukumizu. 2009. and Bernt Schiele. Gert Lanckriet. In Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation. pages 1433–1440. Norwell. [178] Ingo Steinwart. 1998. 23(1):44–59. MIT Press. 1996. MIT Press. and Bernhard Sch¨ lkopf. 10(1):1633– 1685. Department of Computational and Applied Mathematics. On the inﬂuence of the kernel on the consistency of support vector machines. Kluwer Academic Publishers. MA. In Advances in Neural Information Processing Systems 22. In Annual Conference on Neural Information Processing Systems 20. Taylor and Peter Stone. 2009.[174] Le Song. June 2008. [181] Taiji Suzuki and Masashi Sugiyama. Technical Report TR-96-40. and Motoaki Kawanabe. [179] Masashi Sugiyama. Shinichi Nakajima. [183] Sebastian Thrun and Lorien Pratt. Estimating squared-loss mutual information for independent component analysis. pages 1385–1392. Houston. March 2009. Arthur Gretton. 112 . Karsten Borgwardt. 2008. Texas. In Proceedings of 12th IEEE International Conference on Computer Vision. [182] Matthew E. Direct importance estimation with model selection and its application to covariate shift adaptation. [175] Danny C. [184] Ivan Titov and Ryan McDonald. Paul Von Buenau. 2008. 
Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. 2010. A joint model of text and aspect ratings for sentiment summarization. Journal of Machine Learning Research. In Proceedings of 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technology. Alex Smola. 2:67–93.

Wiley-Interscience. Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. [187] Peter Turney. ACM. 113 Statistical Learning Theory. March 2010. Carlotta Domeniconi. Journal of Cognitive Neuroscience. [193] Chang Wang and Sridhar Mahadevan. and Jian Hu. Vapnik. pages 417–424. and Jaap van den Herik. 2002. Finding stau u u tionary subspaces in multivariate time series. pages 244–252. IEEE. Technical Report TiCC-TR 2009-005. [192] Bo Wang. In Proceedings of the 25th International Conference on Machine learning. Using Wikipedia for co-clustering based cross-domain text classiﬁcation. [188] Laurens van der Maaten. and Qiang Yang. pages 987–996. 3(1):71–86. November 2009. M¨ ller. and Klaus R. Physical Review Letters. September 2009. [194] Hua-Yan Wang. Jie Tang. ACM. September 1998. Vincent W. pages 1085–1090. Kir´ ly. IEEE Computer Society. July 2008. Junhui Zhao. In Proceedings of the 8th Annual IEEE International Conference on Pervasive Computing and Communications. Eric Postma. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2009. [195] Pu Wang. pages 1120– 1127. Thumbs up or thumbs down? semantic orientation applied to unsupervised classiﬁcation of reviews. New York. In Proceeding of the 18th ACM conference on Information and knowledge management. June 2010. Manifold alignment using procrustes analysis. 2008. and Yanzhu Liu. IEEE. Wei Fan. 103(21). Zi Yang. [186] Matthew Turk and Alex Pentland. [189] Vladimir N. [191] Paul von B¨ nau. and Barbara Caputo. Frank C. Indoor localization in multi-ﬂoor environments with reduced effort. Eigenfaces for recognition. Dimensionality reduction: A comparative review. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. Meinecke.[185] Tatiana Tommasi. 1991. IEEE Computer Society. 
Songcan Chen. Franz C. . In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. Francesco Orabona. In Proceedings of the 40th Annual Meeting fo the Association for Computational Linguistics. Tilburg University. [190] Alexander Vezhnevets and Joachim Buhmann. Zheng. June 2010. Heterogeneous cross domain ranking in latent space.

IEEE Transactions on Knowledge and Data Engineering. Nan Ye. and Jiangtao Ren. ACM. Leveraging sequence a classiﬁcation by taxonomy-based multitask learning. and Qiang Yang. ACM. Canada. Boosted multi-task learning for face veriﬁcation with applications to web image and video search. volume 6044 of Lecture Notes in Computer Science. 2009. [205] Dikan Xing. In Proceedings of the 21st International Conference on Machine Learning. April 2010. Latent space domain transfer between high dimensional overlapping distributions. pages 142–149. pages 522–534. and Gunnar R¨ tsch. ACM. In Proceedings of the 2008 European Conference on European Conference on Machine Learning and Knowledge Discovery in Databases. In 18th International World Wide Web Conference. Saul. pages 91–100. [198] Kilian Q. IEEE. Yasemin Altun. [203] Evan W. and Lawrence K. July 2004. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. In Proceedings of the 21st International Conference on Machine Learning. Improving svm accuracy by training on auxiliary data sources. Wai Lam. and Hai Leong Chieu. September 2008. Bridging domains using world wide knowledge for transfer learning. July 2004. [202] Pengcheng Wu and Thomas G. Xiang. Wei Fan. Olivier Verscheure. Springer. and Yong Yu. Bin Cao. Weinberger. In 11th European Conference on Principles and Practice of Knowledge 114 . June 2010. Domain adaptive bootstrapping for named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Jose Leiva. pages 1523–1532. Yangqiu Song. 22:770–783. Fei Sha. Wenyuan Dai. ACL. June 2009.[196] Xiaogang Wang. Cha Zhang. Hu. In Proceedings of the 22nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. and Bo Chen. Mining employment market via text block detection and adaptive cross-domain information extraction. pages 283–290. [200] Tak-Lam Wong. Gui-Rong Xue. Wee Sun Lee. 
Bridged reﬁnement for transfer learning. Springer. Banff. [204] Sihong Xie. April 2009. and Zhengyou Zhang. Transferred dimensionality reduction. Learning a kernel matrix for nonlinear dimensionality reduction. [197] Zheng Wang. [199] Christian Widmer. August 2009. Jing Peng. Alberta. Derek H. and Changshui Zhang. [201] Dan Wu. In Proceedings of 14th Annual International Conference on Research in Computational Molecular Biology. Dietterich. pages 550–565.

2007. July 2009. Morgan Kaufmann Publishers Inc. and Eric Xing. Wenyuan Dai. Transfer learning using task-level features with application to information retrieval. and Qiang Yang. Springer. In Advances in Neural Information Processing Systems 22. ACM. Yang Zhou. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. [212] Qiang Yang. Anil K. [209] Jun Yang. pages 2151–2159. In Proceedings of the 21st international jont conference on Artiﬁcal intelligence. pages 1–9. Unsupervised transfer classiﬁcation: application to text categorization. [214] Xiaolin Yang. pages 324–335. and Alexander G. Activity recognition: Linking low-level sensors to high-level intelligence. [213] Tianbao Yang. Qiang Yang. Topic-bridged plsa for crossdomain text classiﬁcation. July 2010. 115 . Morgan Kaufmann Publishers Inc. In Proceedings of the 15th international conference on Multimedia. [211] Qiang Yang. pages 20–25. pages 1315–1320. 2010. Seyoung Kim. [210] Qiang Yang. 99(PrePrints). 2009. Hauptmann. ACM. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Estimating location using Wi-Fi. Heterogeneous transfer learning for image clustering via the social web. and Yong Yu. Sinno Jialin Pan.. and Wei Tong. Yuqiang Chen. [207] Gui-Rong Xue. 2008. Gui-Rong Xue. ACL. [206] Qian Xu. IEEE Intelligent Systems. Hannah Hong Xue. volume 1. and Vincent W. IEEE/ACM Transactions on Computational Biology and Bioinformatics. pages 1159–1168. Rong Jin.. 2009.Discovery in Databases. 2009. Multitask learning for protein subcellular location prediction. Zheng. [208] Rong Yan and Jian Zhang. Sinno Jialin Pan. pages 627–634. and Yong Yu. Wenyuan Dai. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. Jain. 23(1):8–13. July 2008. 
September 2007. pages 188–197. In Proceedings of the 21st International Joint Conference on Artiﬁcial Intelligence. Lecture Notes in Computer Science. Cross-domain video concept detection using adaptive svms. Heterogeneous multitask learning with joint sparsity constraints. ACM. Rong Yan.

In Proceedings of the 14th International Conference on Machine Learning. In Proceedings of the 26th Conference on Uncertainty in Artiﬁcial Intelligence. July 2010. In Advances in Neural Information Processing Systems 14. [221] Zheng-Jun Zha. [224] Yu Zhang and Dit-Yan Yeung. Chris Ding. Visual classiﬁcation with multi-task joint sparse representation. Bin Cao. Sparse multitask regression for identifying common mechanism of response to therapeutic targets. 2001. San Francisco. Multi-domain collaborative ﬁltering. IEEE. and Ming Gu. pages 412–420. 26(12):i97–i105. and Bahram Parvin.M. Multi-task warped gaussian process for personalized age estimation. In Proceedings of the 21st international jont conference on Artiﬁcal intelligence. 7(7):869– 883. Morgan Kaufmann Publishers Inc. June 2010. 116 . In Proceedings of the 21st International Conference on Machine Learning. Qiang Yang. July 2004. 2005. and Xian-Sheng Hua. Meng Wang. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. [222] Kai Zhang. [217] Jie Yin. Bioinformatics. 2010. Learning adaptive temporal radio maps for signalstrength-based location estimation. and Dit-Yan Yeung. pages 1057–1064. Gray. IEEE. Tao Mei. CA. Robust distance metric learning with auxiliary knowledge. Xiaofeng He. Ni. [225] Yu Zhang and Dit-Yan Yeung. and L. A comparative study on feature selection in text categorization. Learning and evaluating classiﬁers under sample selection bias.[215] Yiming Yang and Jan O. July 2010. July 1997. July 2008. pages 1327–1332. [219] Bianca Zadrozny. Pedersen. Journal of Machine Learning Research. [223] Yu Zhang. 2009. Morgan Kaufmann Publishers Inc. In Proceedings of the 26th Conference on Uncertainty in Artiﬁcial Intelligence. Joe W. IEEE Transactions on Mobile Computing. [220] Hongyuan Zha. 6:483–502. USA. Spectral relaxation for k-means clustering. 
In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition.. [216] Jieping Ye. Horst Simon. [218] Xiao-Tong Yuan and Shuicheng Yan. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. June 2010. MIT Press. A convex formulation for learning task relationships in multi-task learning. Zengfu Wang.

Derek H. and John D. Jason Weston. Derek H. Evan W. Computer Sciences. Transferring localization models over time. Deepak Turaga. Zheng. pages 1421–1426. Sinno Jialin Pan. ACM. 2004. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. Zoubin Ghahramani. 117 . [227] Peilin Zhao and Steven C. and Jeffrey J. June 2009. OTL: A framework of online transfer learning. December 2008. July 2008. and Olivier Verscheure. AAAI Press. In Advances in Neural Informao tion Processing Systems 16. July 2008. Transferring multidevice localization models using latent multi-task learning. Pan. and Dou Shen. Springer-Verlag. Semi-supervised learning literature survey. Thomas Navin Lal. [231] Erheng Zhong. [233] Xiaojin Zhu. Technical Report 1530. University of Wisconsin-Madison.[226] Yu Zhang and Dit-Yan Yeung. pages 1199–1208. In Proceedings of the 11th International Conference on Ubiquitous Computing. pages 1027–1036. Qiang Yang. and Bernhard Sch¨ lkopf. Lafferty. Qiang Yang. Xiang. In Proceedings of the 27th International Conference on Machine learning. [230] Vincent W. Transfer metric learning by learning task relationships.H. [235] Hankui Zhuo. September 2009. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. Cross-domain activity recognition. Kun Zhang. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. pages 1110–1115. Semi-supervised learning using gaussian ﬁelds and harmonic functions. Zheng. Hu. Cross domain distribution adaptation via kernel mapping. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. Qiang Yang. In Proceedings of 10th Paciﬁc Rim International Conference on Artiﬁcial Intelligence. In Proceeding of The 22nd International Conference on Machine Learning. Zheng. MIT Press. AAAI Press. August 2003. [232] Dengyong Zhou. Olivier Bousquet. Jiangtao Ren. Jing Peng. Learning with local and global consistency. ACM. June 2010. 
and Qiang Yang. and Lei Li. [234] Xiaojin Zhu. [228] Vincent W. Hoi. pages 321–328. 2005. July 2010. pages 1427–1432. Wei Fan. [229] Vincent W. ACM. Transferring knowledge from another domain for learning action models. pages 912–919. Hu.
