
FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by

JIALIN PAN

A Thesis Submitted to The Hong Kong University of Science and Technology in Partial Fulﬁllment of the Requirements for the Degree of Doctor of Philosophy in Computer Science and Engineering

September 2010, Hong Kong

© Copyright by Jialin Pan 2010

Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

JIALIN PAN


FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by JIALIN PAN

This is to certify that I have examined the above Ph.D. thesis and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the thesis examination committee have been made.

PROF. QIANG YANG, THESIS SUPERVISOR

PROF. SIU-WING CHENG, ACTING HEAD OF DEPARTMENT
Department of Computer Science and Engineering

17 September 2010

ACKNOWLEDGMENTS

First of all, I would like to express my thanks to my supervisor Prof. Qiang Yang sincerely and gratefully for his valuable advice and in-depth guidance throughout my Ph.D. study. From him, I learned how to survey a field of interest, how to discover interesting research topics, how to write research papers and how to give presentations clearly. I also learned that, as a researcher, one needs to be always positive and active, and to work harder and harder. I also thank him for his useful advice and information on my job hunting. What I learned from him will be great wealth in my future research career.

I would like to express my special thanks to Prof. James Kwok, whom I worked with closely at HKUST during the past four years. James deeply impressed me, from his passion for doing research to his sense for research ideas. I learned a lot from him. When I was still a master student at Sun Yat-Sen University, Dr. Rong Pan brought me to the field of machine learning. His research passion made me become more and more interested in doing research on machine learning.

I am also very thankful to Prof. Fugee Tsung, Prof. Dit-Yan Yeung, Prof. Brian Mak, Prof. Raymond Wong and Prof. Jieping Ye for serving as committee members of my proposal and thesis defenses. Their valuable comments are of great help in my research work. Special thanks also goes to Prof. Carlos Guestrin and Prof. Andreas Krause, who hosted and supervised me during my visits to Carnegie Mellon University and the California Institute of Technology, respectively. I thank them for their advice and discussion on my research work during my visits.

Furthermore, during my internship at Microsoft Research Asia, I got generous help from my mentor Dr. Jian-Tao Sun. Meanwhile, I also got great help from Dr. Zheng Chen, Dr. Weizhu Chen, Dr. Gang Wang, Dr. Jingdong Wang, Dr. Yi Wang, Xiaochuan Ni and Jian Xia.

In addition, I am grateful to current and previous members in our research group and to a number of colleagues at HKUST during the past four years. They are Dr. Chak-Keung Chan, Dr. Ivor Tsang, Dr. Dou Shen, Dr. Jie Yin, Dr. Hong Chang, Dr. Bingsheng He, Prof. Zhihua Zhang, Bin Cao, Guang Dai, Hao Hu, Wu-Jun Li, Nan Liu, Junfeng Pan, Weike Pan, Si Shen, Pingzhong Tang, Qi Wang, Wei Xiang, Qian Xu, Yu Zhang, Yi Zhen, Wenchen Zheng, Erheng Zhong, Yin Zhu and so on. I thank them very much.

Last but not least, I would like to give my deepest gratitude to my parents, Yunyu Pan and Jinping Feng, who always encourage and support me when I feel depressed. I would also like to give my great thanks to my wife, Zhiyun Zhao. Her kind help and support make everything I have possible. I dedicate this dissertation to my parents, my wife and my little son, Zhuoxuan Pan.

TABLE OF CONTENTS

Title Page
Authorization Page
Signature Page
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract

Chapter 1  Introduction
  1.1  The Contribution of This Thesis
  1.2  The Organization of This Thesis

Chapter 2  A Survey on Transfer Learning
  2.1  Overview
    2.1.1  A Brief History of Transfer Learning
    2.1.2  Notations and Definitions
    2.1.3  A Categorization of Transfer Learning Techniques
  2.2  Inductive Transfer Learning
    2.2.1  Transferring Knowledge of Instances
    2.2.2  Transferring Knowledge of Feature Representations
    2.2.3  Transferring Knowledge of Parameters
    2.2.4  Transferring Relational Knowledge
  2.3  Transductive Transfer Learning
    2.3.1  Transferring the Knowledge of Instances
    2.3.2  Transferring Knowledge of Feature Representations
  2.4  Unsupervised Transfer Learning
    2.4.1  Transferring Knowledge of Feature Representations
  2.5  Transfer Bounds and Negative Transfer
  2.6  Other Research Issues of Transfer Learning
  2.7  Real-world Applications of Transfer Learning

Chapter 3  Transfer Learning via Dimensionality Reduction
  3.1  Motivation
  3.2  Preliminaries
    3.2.1  Dimensionality Reduction
    3.2.2  Hilbert Space Embedding of Distributions
    3.2.3  Maximum Mean Discrepancy
    3.2.4  Dependence Measure
  3.3  A Novel Dimensionality Reduction Framework
    3.3.1  Minimizing Distance between P(φ(X_S)) and P(φ(X_T))
    3.3.2  Preserving Properties of X_S and X_T
    3.3.3  Kernel Learning for Transfer Latent Space
    3.3.4  Make Predictions in Latent Space
    3.3.5  Summary
  3.4  Maximum Mean Discrepancy Embedding (MMDE)
    3.4.1  Optimization Objectives
    3.4.2  Formulation and Optimization Procedure
    3.4.3  Experiments on Synthetic Data
    3.4.4  Summary
  3.5  Transfer Component Analysis (TCA)
    3.5.1  Parametric Kernel Map for Unseen Data
    3.5.2  Unsupervised Transfer Component Extraction
    3.5.3  Experiments on Synthetic Data
    3.5.4  Summary
  3.6  Semi-Supervised Transfer Component Analysis (SSTCA)
  3.7  Further Discussion

Chapter 4  Applications to WiFi Localization
  4.1  WiFi Localization and Cross-Domain WiFi Localization
  4.2  Experimental Setup
  4.3  Results
    4.3.1  Comparison with Dimensionality Reduction Methods
    4.3.2  Comparison with Non-Adaptive Methods
    4.3.3  Comparison with Domain Adaptation Methods
    4.3.4  Comparison with MMDE
    4.3.5  Sensitivity to Model Parameters
  4.4  Summary

Chapter 5  Applications to Text Classification
  5.1  Text Classification and Cross-domain Text Classification
  5.2  Experimental Setup
  5.3  Results
    5.3.1  Comparison to Other Methods
    5.3.2  Sensitivity to Model Parameters
  5.4  Summary

Chapter 6  Domain-Driven Feature Space Transfer for Sentiment Classification
  6.1  Sentiment Classification
  6.2  Existing Works in Cross-Domain Sentiment Classification
  6.3  Problem Statement and A Motivating Example
  6.4  Spectral Domain-Specific Feature Alignment
    6.4.1  Domain-Independent Feature Selection
    6.4.2  Bipartite Feature Graph Construction
    6.4.3  Spectral Feature Clustering
    6.4.4  Feature Augmentation
  6.5  Computational Issues
  6.6  Connection to Other Methods
  6.7  Experiments
    6.7.1  Experimental Setup
    6.7.2  Results
  6.8  Summary

Chapter 7  Conclusion and Future Work
  7.1  Conclusion
  7.2  Future Work

References

LIST OF FIGURES

1.1  Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods and received by different mobile devices. Different colors denote different signal strength values (unit: dBm).
1.2  The organization of thesis.
2.1  Different learning processes between traditional machine learning and transfer learning.
2.2  An overview of different settings of transfer learning.
3.1  Motivating examples for the dimensionality reduction framework. (a) The direction with the largest variance is orthogonal to the discriminative direction. (b) There exists an intrinsic manifold structure underlying the observed data.
3.2  Illustrations of the proposed TCA and SSTCA on synthetic dataset 1. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.3  Illustrations of the proposed TCA and SSTCA on synthetic dataset 2. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.4  Illustrations of the proposed TCA and SSTCA on synthetic dataset 3. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.5  Illustrations of the proposed TCA and SSTCA on synthetic dataset 4. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.6  Illustrations of the proposed TCA and SSTCA on synthetic dataset 5. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
4.1  An indoor wireless environment example.
4.2  Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods. Different colors denote different signal strength values. Note that the original signal strength values are non-positive (the larger the stronger); here, we shift them to positive values for visualization.
4.3  Comparison with dimensionality reduction methods.
4.4  Comparison with localization methods that do not perform domain adaption.
4.5  Comparison with MMDE in the transductive setting on the WiFi data.
4.6  Comparison of TCA, SSTCA and the various baseline methods in the inductive setting on the WiFi data.
4.7  Training time with varying amount of unlabeled data for training.
4.8  Sensitivity analysis of the TCA / SSTCA parameters on the WiFi data.
5.1  Sensitivity analysis of the TCA / SSTCA parameters on the text data.
6.1  Comparison results (unit: %) on two datasets.
6.2  A bipartite graph example of domain-specific and domain-independent features.
6.3  Study on varying numbers of domain-independent features of SFA.
6.4  Model parameter sensitivity study of γ on two datasets.
6.5  Model parameter sensitivity study of k on two datasets.

LIST OF TABLES

1.1  Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. "+" denotes positive sentiment, and "-" denotes negative sentiment.
2.1  Relationship between traditional machine learning and transfer learning settings.
2.2  Different settings of transfer learning.
2.3  Different approaches to transfer learning.
2.4  Different approaches in different settings.
3.1  Summary of dimensionality reduction methods based on Hilbert space embedding of distributions.
4.1  An example of signal vectors (unit: dBm).
5.1  Summary of the six data sets constructed from the 20-Newsgroups data.
5.2  Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation).
5.3  Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation).
6.1  Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. "+" denotes positive sentiment, and "-" denotes negative sentiment.
6.2  Bag-of-words representations of electronics (E) and video games (V) reviews. Only domain-specific features are considered.
6.3  A co-occurrence matrix of domain-specific and domain-independent words. Italic words are some domain-independent words, which occur frequently in both domains. Boldfaces are domain-specific words. "..." denotes all other words.
6.4  Ideal representations of domain-specific words.
6.5  Summary of datasets for evaluation.
6.6  Experiments with different domain-independent feature selection methods. Numbers in the table are accuracies in percentage.

FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by JIALIN PAN

Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

ABSTRACT

Transfer learning is a new machine learning and data mining framework that allows the training and test data to come from different distributions and/or feature spaces. We can find many novel applications of machine learning and data mining where transfer learning is helpful, especially when we have limited labeled data in our domain of interest. In this thesis, we first survey different settings and approaches of transfer learning and give a big picture of the field. We then focus on latent space learning for transfer learning, which aims at discovering a "good" common feature space across domains, such that knowledge transfer becomes possible. We propose a novel dimensionality reduction framework for transfer learning, which tries to reduce the distance between different domains while preserving data properties as much as possible. This framework is general for many transfer learning problems when domain knowledge is unavailable. Based on this framework, we propose three effective solutions to learn the latent space for transfer learning. We apply these methods to two diverse applications, cross-domain WiFi localization and cross-domain text classification, and achieve promising results. Furthermore, for a specific application area, sentiment classification, where domain knowledge is available for encoding into transfer learning methods, we propose a spectral feature alignment algorithm for cross-domain learning. In this algorithm, we try to align domain-specific features from different domains by using some domain-independent features as a bridge. Experimental results show that this method outperforms a state-of-the-art algorithm on two real-world datasets for cross-domain sentiment classification.

CHAPTER 1

INTRODUCTION

Supervised data mining and machine learning technologies have already been widely studied and applied to many knowledge engineering areas. However, most traditional supervised algorithms work well only under a common assumption: the training and test data are drawn from the same feature space and the same distribution. Furthermore, the performance of these algorithms relies heavily on collecting high-quality and sufficient labeled training data to train a statistical or computational model to make predictions on the future data [127, 34, 131, 189]. However, in many real-world scenarios, labeled training data are in short supply or can only be obtained at expensive cost. This problem has become a major bottleneck in making machine learning and data mining methods more applicable in practice.

In the last decade, semi-supervised learning [233, 27, 77, 90] techniques have been proposed to address the problem that the labeled training data may be too few to build a good classifier, by making use of a large amount of unlabeled data to discover a powerful structure together with a small amount of labeled data to train models. Nevertheless, most semi-supervised methods require that the training data, including labeled and unlabeled data, and the test data are both from the same domain of interest. Instead of exploring unlabeled data to train a precise model, active learning, which is another branch of machine learning for reducing the annotation effort of supervised learning, tries to design an active learner to pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator). The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns [101, 168]. However, most active learning methods assume that there is a budget for the active learner to pose queries in the domain of interest, which implicitly assumes that the training and test data are still represented in the same feature space and drawn from the same data distribution. In some real-world applications, the budget may be quite limited, and active learning methods may then fail to learn accurate classifiers in the domain of interest.

Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and testing to be different. The main idea behind transfer learning is to borrow labeled data or knowledge extracted from some related domains to help a machine learning algorithm achieve greater performance in the domain of interest [183]. Thus, compared to semi-supervised and active learning, transfer learning can be referred to as a different strategy for learning models with minimal human supervision. In the real world, we can observe many examples of transfer learning.
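The cost of violating the identical-distribution assumption can be made concrete with a toy sketch (not from the thesis; the synthetic Gaussian data, the nearest-centroid classifier and the shift amount are all illustrative assumptions): a classifier trained on one distribution loses accuracy when the test data drift to a different distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-class "source domain" training data:
# class 0 centered at (0, 0), class 1 centered at (3, 3).
X_train = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y_train = np.array([0] * 200 + [1] * 200)

# Nearest-centroid classifier: predict the class whose training mean is closest.
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def accuracy(X, y):
    return float((predict(X) == y).mean())

# Test set 1: drawn from the same distribution -- the classic i.i.d. assumption holds.
X_same = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
# Test set 2: a drifted "target domain" -- every point shifted by +2 in each feature.
X_shift = X_same + 2.0
y_test = np.array([0] * 200 + [1] * 200)

acc_same = accuracy(X_same, y_test)
acc_shift = accuracy(X_shift, y_test)
print(acc_same, acc_shift)  # accuracy degrades under the distribution shift
```

The fixed decision boundary learned from the source data is simply wrong for the shifted test sample, which is exactly the situation transfer learning is designed to handle.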

One example is Web document classification (see, e.g., [49]), where our goal is to classify a given Web document into several predefined categories. In the area of Web-document classification, the labeled examples may be the university Web pages that are associated with category information obtained through previous manual-labeling efforts. For a classification task on a newly created Web site where the data features or data distributions may be different, there may be a lack of labeled training data, and we may not be able to directly apply the Web-page classifiers learned on the university Web site to the new Web site. In such cases, it would be helpful if we could transfer the classification knowledge into the new domain. Many examples in knowledge engineering can be found where transfer learning can truly be beneficial: we may find that learning to recognize apples might help to recognize pears, and similarly, learning to play the electronic organ may help facilitate learning the piano. Furthermore, in many engineering applications, it is expensive or impossible to collect sufficient training data to train a model for use in each domain of interest. It would be nice if one could reuse the training data which have been collected in some related domains/tasks, or the knowledge that is already extracted from some related domains/tasks, to learn a precise model for use in the domain of interest. In such cases, knowledge transfer or transfer learning between tasks or domains becomes more desirable and crucial.

The need for transfer learning may also arise when the data can be easily outdated, so that the labeled data obtained in one time period may not follow the same distribution in a later time period. For example, in indoor WiFi localization problems, which aim to detect a user's current location based on previously collected WiFi data, it is very expensive to calibrate WiFi data for building localization models in a large-scale environment, because a user needs to label a large collection of WiFi signal data at each location. However, the WiFi signal-strength values may be a function of time, device or other dynamic factors. As shown in Figure 1.1, values of received signal strength (RSS) may differ across time periods and mobile devices. As a result, a model trained in one time period or on one device may cause the performance for location estimation in another time period or on another device to be reduced. To reduce the re-calibration effort, we might wish to adapt the localization model trained in one time period (the source domain) for a new time period (the target domain), or to adapt the localization model trained on a mobile device (the source domain) for a new mobile device (the target domain).

[Figure 1.1: Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods and received by different mobile devices. Panels: (a) WiFi RSS received by device A in T1; (b) WiFi RSS received by device A in T2; (c) WiFi RSS received by device B in T1; (d) WiFi RSS received by device B in T2. Different colors denote different signal strength values.]

As a third example, transfer learning is also desirable when the features between domains change. Consider the problem of sentiment classification, where our task is to automatically classify the reviews on a product, such as a brand of camera, into polarity categories (e.g., positive or negative), as introduced in [142]. In the literature, supervised learning algorithms [146] have proven to be promising and are widely used in sentiment classification. However, these methods are domain dependent. The reason is that users may use domain-specific words to express sentiment in different domains. Table 1.1 shows several user review sentences from two domains: electronics and video games. In the electronics domain, we may use words like "compact" and "sharp" to express our positive sentiment and use "blurry" to express our negative sentiment, while in the video game domain, words like "hooked" and "realistic" indicate positive opinion and the word "boring" indicates negative opinion. Due to the mismatch among domain-specific words, a sentiment classifier trained in one domain may not work well when directly applied to other domains. Thus, cross-domain sentiment classification algorithms are highly desirable to reduce domain dependency and manual labeling cost by transferring knowledge from related domains to the domain of interest [25].
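The domain-dependency problem above can be sketched with a hedged toy example (a hypothetical mini-corpus in the spirit of Table 1.1, not the thesis's actual data or method): a linear bag-of-words scorer trained only on electronics reviews simply has no weight for video-game sentiment words it has never seen.

```python
from collections import Counter

# Hypothetical electronics training reviews with sentiment labels (+1 / -1).
electronics_train = [
    ("compact and sharp picture", +1),
    ("very sharp and compact", +1),
    ("blurry picture in dark settings", -1),
    ("quite blurry and noisy", -1),
]
vocab = sorted({w for text, _ in electronics_train for w in text.split()})

def bow(text):
    """Bag-of-words count vector over the electronics vocabulary."""
    c = Counter(text.split())
    return [c.get(w, 0) for w in vocab]

# A crude linear scorer: weight = mean positive BoW minus mean negative BoW.
pos = [bow(t) for t, y in electronics_train if y > 0]
neg = [bow(t) for t, y in electronics_train if y < 0]
w = [sum(col) / len(pos) for col in zip(*pos)]
w = [wi - sum(col) / len(neg) for wi, col in zip(w, zip(*neg))]

def score(text):
    return sum(wi * xi for wi, xi in zip(w, bow(text)))

print(score("sharp compact screen"))    # positive: electronics sentiment words overlap
print(score("hooked realistic action")) # 0.0: the game review shares no vocabulary
```

Every informative word in the video-game review falls outside the electronics vocabulary, so the classifier is uninformative there; aligning such domain-specific words is precisely what the spectral feature alignment method of Chapter 6 is designed for.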

Table 1.1: Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. "+" denotes positive sentiment, and "-" denotes negative sentiment.

  electronics:
  +  Compact; easy to operate; very good picture quality; looks sharp!
  +  I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
  -  It is also quite blurry in very dark settings. I will never buy HP again.
  video games:
  +  A very good game! It is action packed and full of excitement. I am very much hooked on this game.
  +  Very realistic shooting action and good plots. We played this and were hooked.
  -  The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

1.1 The Contribution of This Thesis

Generally speaking, in transfer learning we have the following three main research issues: (1) What to transfer; (2) How to transfer; (3) When to transfer [141]. "What to transfer" asks which part of knowledge can be transferred across domains or tasks. Some knowledge is specific to individual domains or tasks, while some knowledge may be common between different domains, such that it may help improve performance for the target domain or task. After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge, which corresponds to the "how to transfer" issue. "When to transfer" asks in which situations transferring skills should be done. Likewise, we are interested in knowing in which situations knowledge should not be transferred. In some situations, when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer.

Transfer learning can be categorized into three settings: inductive transfer, transductive transfer and unsupervised transfer, which were first described in our survey article [141] and will be introduced in detail in Chapter 2. In this thesis, we focus on the transductive transfer learning setting, where we are given a lot of labeled data in a source domain and some unlabeled data in a target domain. Note that in this setting, no labeled data in the target domain are available for training; our goal is to learn an accurate model for use in the target domain. We focus on "What to transfer" and "How to transfer" by implicitly assuming that the source and target domains are related to each other. We leave the issue of how to avoid

negative transfer to our future work.

For "What to transfer", we propose to discover a latent feature space for transfer learning: when data from different domains are projected onto this latent space, the distance between the domains can be reduced while the important information of the original data is preserved. Thus, the latent space can be treated as a bridge across domains to make knowledge transfer possible and successful. For "How to transfer", we propose two embedding learning frameworks to learn the latent space, based on two different situations: (1) the domain knowledge is hidden or hard to capture, and (2) the domain knowledge can be observed or is easy to encode in embedding learning.

In most application areas, such as text classification or WiFi localization, the domain knowledge is hidden. For example, text data may be controlled by some latent topics, and WiFi data may be controlled by some hidden factors, such as the structure of a building. In this case, we propose a novel and general dimensionality reduction framework for transfer learning. Our framework aims to learn a latent space underlying different domains, such that the distance between data distributions can be dramatically reduced and the original data properties, such as variance and local geometric structure, can be preserved as much as possible. Standard machine learning and data mining methods can then be applied directly in the latent space to train models for making predictions on the target domain data. Based on this framework, we propose three different algorithms to learn the latent space: Maximum Mean Discrepancy Embedding (MMDE) [135], Transfer Component Analysis (TCA) [139] and Semi-supervised Transfer Component Analysis (SSTCA) [140]. In MMDE, we translate latent space learning for transfer learning into a non-parametric kernel matrix learning problem; the resultant kernel in a free form may be more precise for transfer learning, but suffers from expensive computational cost. In TCA and SSTCA, we instead propose to learn parametric kernel based embeddings for transfer learning. The main difference between TCA and SSTCA is that TCA is an unsupervised feature extraction method while SSTCA is a semi-supervised feature extraction method. We apply these three algorithms to two diverse application areas: wireless sensor networks and Web mining.

In contrast to the general framework for transfer learning without domain knowledge, in some application areas, such as sentiment classification, some domain knowledge can be observed and used for learning the latent space across domains. For example, in sentiment classification, though users may use domain-specific words, as shown in Table 1.1, they may also use some domain-independent sentiment words, such as "good", "never buy", etc. Meanwhile, some domain-specific and domain-independent words may co-occur in reviews frequently, which means there may be a correlation between these words. This observation motivates us to propose a spectral feature clustering framework [137] to align domain-specific words from different domains in a latent space, by modeling the correlation between the domain-independent and domain-specific words in a bipartite graph and using the domain-independent features as a
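The distribution distance that MMDE, TCA and SSTCA reduce is the Maximum Mean Discrepancy (MMD). The following is a generic empirical-MMD sketch, not the thesis's exact formulation: the RBF kernel, the bandwidth `gamma` and the synthetic samples are all assumptions of this illustration.

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=0.5):
    """Squared empirical Maximum Mean Discrepancy with an RBF kernel.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], estimated from samples.
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())

rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, (300, 2))       # "source domain" sample
tgt_near = rng.normal(0.0, 1.0, (300, 2))  # same distribution as the source
tgt_far = rng.normal(2.0, 1.0, (300, 2))   # shifted distribution

print(mmd2_rbf(src, tgt_near))  # close to 0: distributions match
print(mmd2_rbf(src, tgt_far))   # clearly larger: distributions differ
```

Intuitively, the latent-space methods of Chapter 3 search for a projection under which this quantity, computed between the projected source and target data, becomes small while properties such as data variance are preserved.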

bridge for cross-domain sentiment classification.

In summary, in this thesis we study the problem of feature-based transfer learning and its real-world applications, such as WiFi localization, text classification and sentiment classification. Based on the different situations of whether domain knowledge is available, we propose two feature space learning frameworks for transfer learning. The first framework is focused on the situation that the domain knowledge is hidden and hard to encode in embedding learning. The other framework is focused on the situation that the domain knowledge is explicit and easy to encode in feature space learning. Note that there has been a large amount of work on transfer learning for reinforcement learning in the machine learning literature (e.g., a current survey article [182]). However, in this thesis we only focus on transfer learning for classification and regression tasks, which are related more closely to machine learning and data mining tasks.

The main contributions of this thesis can be summarized as follows.

• We give a comprehensive survey on transfer learning, where we give some definitions of transfer learning, summarize transfer learning into three settings, categorize transfer learning approaches into four contexts, analyze the relationship between transfer learning and other related areas, discuss some interesting research issues in transfer learning and introduce some applications of transfer learning. Other researchers may get a big picture of transfer learning by reading the survey.

• We propose a general dimensionality reduction framework for transfer learning without any domain knowledge. Based on the framework, we propose three solutions to learn the latent space for transfer learning: Maximum Mean Discrepancy Embedding (MMDE) (Chapter 3.4), Transfer Component Analysis (TCA) (Chapter 3.5) and Semi-supervised Transfer Component Analysis (SSTCA) (Chapter 3.6). Furthermore, we apply them to solve the WiFi localization and text classification problems and achieve promising results.

• We propose a specific latent space learning method for sentiment classification, which encodes the domain knowledge in a spectral feature alignment framework. The proposed method outperforms a state-of-the-art cross-domain method in the field of sentiment classification.

1.2 The Organization of This Thesis

The organization of this thesis is shown in Figure 1.2. In Chapter 2, we survey the field of transfer learning, where we summarize different transfer learning settings and approaches and discuss the relationship between transfer learning and other related areas. In Chapter 3, we present our proposed general dimensionality reduction framework and the three proposed algorithms. Then we apply these three methods to two diverse applications: WiFi localization (Chapter 4) and text classification (Chapter 5). In

Chapter 6, we present a spectral feature alignment (SFA) algorithm for sentiment classification across domains. Finally, we conclude this thesis and discuss some thoughts on future work in Chapter 7.

[Figure 1.2: The organization of this thesis.]

CHAPTER 2

A SURVEY ON TRANSFER LEARNING

In this chapter, we give a comprehensive survey of transfer learning for classification, regression and clustering developed in the machine learning and data mining areas. In fact, this chapter originated as our survey article [141], which is the first survey in the field of transfer learning. Compared to the previous survey article, in this chapter we add some up-to-date materials on transfer learning, including both methodologies and applications.

2.1 Overview

2.1.1 A Brief History of Transfer Learning

The study of transfer learning is motivated by the fact that people can intelligently apply knowledge learned previously to solve new problems faster or with better solutions [61]. The fundamental motivation for transfer learning in the field of machine learning was discussed in a NIPS-95 workshop on "Learning to Learn" (http://socrates.acadiau.ca/courses/comp/dsilver/NIPS95 LTL/transfer.workshop.1995.html), which focused on the need for lifelong machine-learning methods that retain and reuse previously learned knowledge. Research on transfer learning has attracted more and more attention since 1995 under different names: learning to learn, life-long learning, knowledge transfer, inductive transfer, multi-task learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, meta learning, and incremental/cumulative learning [183]. In 2005, the Broad Agency Announcement (BAA) 05-29 of the Defense Advanced Research Projects Agency (DARPA)'s Information Processing Technology Office (IPTO) (http://www.darpa.mil/ipto/programs/tl/tl.asp) gave a new mission of transfer learning: the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks. In this definition, transfer learning aims to extract the knowledge from one or more source tasks and apply the knowledge to a target task. Among these, a closely related learning technique to transfer learning is the multi-task learning framework [31], which tries to learn multiple tasks simultaneously even when they are different. A typical approach for multi-task learning is to uncover the common (latent) features that can benefit each individual task. In contrast to multi-task learning, rather than learning all
life-long learning.1 A Brief History of Transfer Learning The study of transfer learning is motivated by the fact that people can intelligently apply knowledge learned previously to solve new problems faster or with better solutions [61].asp 8 .mil/ipto/programs/tl/tl. 2. The fundamental motivation for transfer learning in the ﬁeld of machine learning was discussed in a NIPS-95 workshop on “Learning to Learn”1 . A typical approach for multi-task learning is to uncover the common (latent) features that can beneﬁt each individual task. knowledge consolidation.

Figure 2. artiﬁcial intelligence (AAAI Conference on Artiﬁcial Intelligence (AAAI) and International Joint Conference on Artiﬁcial Intelligence (IJCAI). transfer learning cares most about the target task. International World Wide Web Conference (WWW). The roles of the source and target tasks are no longer symmetric in transfer learning. for example). IEEE International Conference on Computer Vision (ICCV).1 shows the difference between the learning processes of traditional and transfer learning techniques. ACM international conference on Multimedia (MM). IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).of the source and target tasks simultaneously. (a) Traditional Machine Learning (b) Transfer Learning Figure 2. As we can see. most notably in data mining (ACM International Conference on Knowledge Discovery and Data Mining (KDD). International Conference on Computational Linguistics (COLING). Annual Conference on Neural Information Processing Systems (NIPS) and European Conference on Machine Learning (ECML) for example). while transfer learning techniques try to transfer the knowledge from some previous tasks to a target task when the latter has fewer high-quality training data. IEEE International Conference on Data Mining (ICDM) and European Conference on Knowledge Discovery in Databases (PKDD). machine learning (International Conference on Machine Learning (ICML). traditional machine learning techniques try to learn each task from scratch. for example) and applications (Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). Conference on Empirical Methods in Natural Language Processing (EMNLP). transfer learning methods appear in several top venues. European Conference on Computer Vision (ECCV). Joint Conference of the 47th Annual Meeting of the Association of Computational Linguistics (ACL). 
ACM International Conference on Ubiquitous Computing (UBICOMP).1: Different learning processes between traditional machine learning and transfer learning Today. Annual International Conference on Research in Computational Molecular Biology (RECOMB) and Annual International 9 .

3 We summarize a list of transfer learning papers published in these few years and a list of workshops that are related to transfer learning at the following URL for reference: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm

2.1.2 Notations and Definitions

First of all, we describe the notations and definitions used in this chapter. Before we give different categorizations of transfer learning, we first give the definitions of a "domain" and a "task".

A domain D consists of two components: a feature space X and a marginal probability distribution P(X), where X = {x_1, . . . , x_n} ∈ X. For example, if our learning task is document classification, and each term is taken as a binary feature, then X is the space of all term vectors, x_i is the i-th term vector corresponding to some documents, and X is a particular learning sample. In general, if two domains are different, then they may have different feature spaces or different marginal probability distributions.

Given a specific domain, D = {X, P(X)}, a task T consists of two components: a label space Y and an objective predictive function f(·) (denoted by T = {Y, f(·)}), which is not observed but can be learned from the training data, which consist of pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y. The function f(·) can be used to predict the corresponding label, f(x), of a new instance x. From a probabilistic viewpoint, f(x) can be written as P(y|x). In our document classification example, Y is the set of all labels, which is {True, False} for a binary classification task, and y_i is "True" or "False".

For simplicity, we denote by D_S = {(x_S1, y_S1), . . . , (x_SnS, y_SnS)} the source domain data, where x_Si ∈ X_S is the data instance and y_Si ∈ Y_S is the corresponding class label. In our document classification example, D_S can be a set of term vectors together with their associated true or false class labels. Similarly, we denote the target domain data as D_T = {(x_T1, y_T1), . . . , (x_TnT, y_TnT)}, where the input x_Ti is in X_T and y_Ti ∈ Y_T is the corresponding output. In most cases, 0 ≤ n_T ≪ n_S. In this chapter, we only consider the case where there is one source domain D_S and one target domain D_T, as this is by far the most popular of the research works in the literature. The issue of transfer learning from multiple source domains will be discussed in Section 2.6.

We now give a unified definition of transfer learning.

Definition 1. (Transfer Learning) Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T.

In the above definition, a domain is a pair D = {X, P(X)}. Thus the condition D_S ≠ D_T implies that either X_S ≠ X_T or P_S(X) ≠ P_T(X). In our document classification example, this means that between a source document set and a target document set, either the term features are different between the two sets (e.g., they use different languages), or their marginal distributions are different.

Similarly, a task is defined as a pair T = {Y, P(Y|X)}. Thus the condition T_S ≠ T_T implies that either Y_S ≠ Y_T or P(Y_S|X_S) ≠ P(Y_T|X_T). When the target and source domains are the same, D_S = D_T, and their learning tasks are the same, T_S = T_T, the learning problem becomes a traditional machine learning problem.

When the domains are different, then either (1) the feature spaces between the domains are different, X_S ≠ X_T, or (2) the feature spaces between the domains are the same but the marginal probability distributions between the domain data are different, P(X_S) ≠ P(X_T), where X_Si ∈ X_S and X_Ti ∈ X_T. As an example, in our document classification example, case (1) corresponds to when the two sets of documents are described in different languages, and case (2) may correspond to when the source domain documents and the target domain documents focus on different topics.

Given specific domains D_S and D_T, when the learning tasks T_S and T_T are different, then either (1) the label spaces between the domains are different, Y_S ≠ Y_T, or (2) the conditional probability distributions between the domains are different, P(Y_S|X_S) ≠ P(Y_T|X_T), where Y_Si ∈ Y_S and Y_Ti ∈ Y_T. In our document classification example, case (1) corresponds to the situation where the source domain has binary document classes, whereas the target domain has ten classes to classify the documents into. Case (2) corresponds to the situation where the document classes are defined subjectively, as in tagging: different users may define different tags for the same document, resulting in P(Y|X) changing across different users.

In addition, when there exists some relationship, explicit or implicit, between the two domains or tasks, we say that the source and target domains or tasks are related. For example, the task of classifying documents into the categories {book, laptop} may be related to the task of classifying documents into the categories {book, desktop}. This is because, from a semantic point of view, the terms "laptop" and "desktop" are close to each other. Note that it is hard to define the term "relationship" mathematically. How to measure the relatedness between domains or tasks is an important research issue in transfer learning, which will be discussed in Section 2.5. Most transfer learning methods introduced in the following sections assume that the source and target domains or tasks are related.

2.1.3 A Categorization of Transfer Learning Techniques

In transfer learning, we have the following three main research issues: (1) What to transfer; (2) How to transfer; (3) When to transfer.

"What to transfer" asks which part of knowledge can be transferred across domains or tasks. Some knowledge is specific to individual domains or tasks, and some knowledge may be common between different domains such that it may help improve performance for the target domain or task. After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge, which corresponds to the "how to transfer" issue.

"When to transfer" asks in which situations transferring skills should be done. Likewise, we are interested in knowing in which situations knowledge should not be transferred. In some situations, when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer. Most current work on transfer learning focuses on "What to transfer" and "How to transfer", by implicitly assuming that the source and target domains are related to each other. However, how to avoid negative transfer is an important open issue that is likely to attract more and more attention in the future.

Based on the definition of transfer learning, we summarize the relationship between traditional machine learning and various transfer learning settings in Table 2.1, where we categorize transfer learning into three settings: inductive transfer learning, transductive transfer learning and unsupervised transfer learning, based on different situations between the source and target domains and tasks.

Table 2.1: Relationship between traditional machine learning and transfer learning settings

Learning Settings                                    Source & Target Domains    Source & Target Tasks
Traditional Machine Learning                         the same                   the same
Transfer     Inductive Transfer Learning             the same                   different but related
Learning     Unsupervised Transfer Learning          different but related      different but related
             Transductive Transfer Learning          different but related      the same

1. In the inductive transfer learning setting, the target task is different from the source task, no matter whether the source and target domains are the same or not. In this case, some labeled data in the target domain are required to induce an objective predictive model f_T(·) for use in the target domain. In addition, according to different situations of labeled and unlabeled data in the source domain, we can further categorize the inductive transfer learning setting into two cases:

(1.1) A lot of labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the multi-task learning setting [31]. However, the inductive transfer learning setting only aims at achieving high performance in the target task by transferring knowledge from the source task, while multi-task learning tries to learn the target and source tasks simultaneously.

(1.2) No labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the self-taught learning setting, which was first proposed by Raina et al. [150]. In the self-taught learning setting, the label spaces between the source and target domains may be different, which implies that the side information of the source domain cannot be used directly. Thus, it is similar to the inductive transfer learning setting where the labeled data in the source domain are unavailable.

2. In the transductive transfer learning setting, the source and target tasks are the same, while the source and target domains are different. In this situation, no labeled data in the target domain are available while a lot of labeled data in the source domain are available. In addition, according to different situations between the source and target domains, we can further categorize the transductive transfer learning setting into two cases:

(2.1) The feature spaces between the source and target domains are different, X_S ≠ X_T.

(2.2) The feature spaces between the domains are the same, X_S = X_T, but the marginal probability distributions of the input data are different, P(X_S) ≠ P(X_T).

The latter case of the transductive transfer learning setting is related to domain adaptation for knowledge transfer in Natural Language Processing (NLP) [23, 26, 48, 85] and to sample selection bias [219] or co-variate shift [170], whose assumptions are similar.

3. Finally, in the unsupervised transfer learning setting, similar to the inductive transfer learning setting, the target task is different from but related to the source task. However, unsupervised transfer learning focuses on solving unsupervised learning tasks in the target domain, such as clustering, dimensionality reduction and density estimation [50, 197]. In this case, there are no labeled data available in either the source or the target domain in training.

The relationships between the different settings of transfer learning and the related areas are summarized in Table 2.2 and Figure 2.2.

Table 2.2: Different settings of transfer learning

Transfer Learning Settings        Related Areas                           Source Domain Labels    Target Domain Labels    Tasks
Inductive Transfer Learning       Multi-task Learning                     Available               Available               Regression, Classification
                                  Self-taught Learning                    Unavailable             Available               Regression, Classification
Transductive Transfer Learning    Domain Adaptation, Sample Selection     Available               Unavailable             Regression, Classification
                                  Bias, Co-variate Shift
Unsupervised Transfer Learning                                            Unavailable             Unavailable             Dimensionality Reduction, Clustering

Figure 2.2: An overview of different settings of transfer learning

Approaches to transfer learning in the above three different settings can be summarized into four contexts based on "What to transfer". Table 2.3 shows these four cases with brief descriptions.

The first context can be referred to as the instance-based transfer learning (or instance-transfer) approach (see, e.g., [49, 87, 147]), which assumes that certain parts of the data in the source domain can be reused for learning in the target domain by re-weighting. Instance re-weighting and importance sampling are two major techniques in this context.

A second case can be referred to as the feature-representation-transfer approach (see, e.g., [6, 8, 50, 99, 150]). The intuitive idea behind this case is to learn a "good" feature representation for the target domain. In this case, the knowledge used to transfer across domains is encoded into the learned feature representation. With the new feature representation, the performance of the target task is expected to improve significantly.

A third case can be referred to as the parameter-transfer approach (see, e.g., [97, 165, 62, 28, 30]), which assumes that the source tasks and the target tasks share some parameters or prior distributions of the hyper-parameters of the models. The transferred knowledge is encoded into the shared parameters or priors. Thus, by discovering the shared parameters or priors, knowledge can be transferred across tasks.

Finally, the last case can be referred to as the relational-knowledge-transfer problem [124], which deals with transfer learning for relational domains. The basic assumption behind this context is that some relationships among the data in the source and target domains are similar.
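To make the instance-transfer context concrete, the following sketch illustrates instance re-weighting under co-variate shift. It is a toy construction of ours, not taken from any cited work: the density ratio w(x) = P_T(x)/P_S(x) is estimated with simple histograms (real methods are more sophisticated) and used to re-weight source instances so that a target-domain quantity can be estimated from source data alone.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(0.0, 1.0, 5000)                 # source inputs ~ P_S
xt = rng.normal(1.0, 1.0, 5000)                 # target inputs ~ P_T (shifted marginal)
ys = 2.0 * xs + rng.normal(0.0, 0.1, xs.size)   # same conditional P(y|x) in both domains

# Histogram estimate of the density ratio w(x) = P_T(x) / P_S(x) at the source points.
edges = np.linspace(-4.5, 5.5, 41)
ps, _ = np.histogram(xs, bins=edges, density=True)
pt, _ = np.histogram(xt, bins=edges, density=True)
idx = np.clip(np.digitize(xs, edges) - 1, 0, ps.size - 1)
w = pt[idx] / np.maximum(ps[idx], 1e-12)

# Estimate the target-domain mean of y (true value 2.0) using source data only.
naive = ys.mean()                        # ignores the domain shift
reweighted = np.average(ys, weights=w)   # importance-weighted (re-weighted) estimate
```

The naive source average ignores the shift, while the re-weighted average moves toward the correct target-domain value; this is exactly the importance-sampling idea behind the instance-transfer approaches cited above.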

e. 21.2 Inductive Transfer Learning Deﬁnition 2. e. the parameter-transfer and the relational-knowledge-transfer approach are only studied in the inductive transfer learning setting.4 shows the cases where the different approaches are used for each transfer learning setting. 84.g. [219. 147] for example). the knowledge to be transferred is the relationship among the data. while the unsupervised transfer learning setting is a relatively new research topic and only studied in the context of the feature-representation-transfer case. 231] for example). 30] for example). In addition. [97. 149. Build mapping of relational knowledge between the source domain and the target domains.g.g. inductive transfer learning aims to help improve the learning of the target predictive function fT (·) in DT using the knowledge in DS and TS . 47. Both domains are relational domains and i.. 52]. where 15 .4: Different approaches in different settings Inductive Transfer Learning √ √ √ √ Transductive Transfer Learning √ √ Unsupervised Transfer Learning √ Instance-transfer Feature-representationtransfer Parameter-transfer Relational-knowledgetransfer 2. 37. which we discuss in detail in the following chapters. 165. 28. 148. statistical relational learning techniques dominate this context [125. Thus.d assumption is relaxed in each domain (see. the feature-representation-transfer problem has been proposed to all three settings of transfer learning. [124. a target domain DT and a learning task TT . 51. 22. 107. (Inductive Transfer Learning) Given a source domain DS and a learning task TS .3: Different approaches to transfer learning Approaches Instance-transfer Feature-representationtransfer Parameter-transfer Relational-knowledgetransfer Brief Description To re-weight some labeled data in the source domain for use in the target domain (see.Table 2. Discover shared parameters or priors between the source domain and target domain models. 
Find a “good” feature representation that reduces difference between the source and the target domains and the error of classiﬁcation and regression models (see. 20.. e. However. 62. 150.g. 83.. We can see that the inductive transfer learning setting has been studied in many research works. 6. 49. 52] for example). Table 2. 87. e. 4.i. 125. [31. 26. 180. 112. 99. 63. 48. which can beneﬁt for transfer learning (see.. 46. Recently. 82. 8. 50. Table 2.

this setting has two cases: (1) Labeled data in the source domain are available.TS ̸= TT . In addition. As mentioned in Chapter 2. Dai et al.1 Transferring Knowledge of Instances The instance-transfer approach to the inductive transfer learning setting is intuitively appealing: although the source domain data cannot be reused directly. TrAdaBoost trains the base classiﬁer on the weighted source and target data.R2 to extend the TrAdaBoost algorithm for regression problems.3. (2) Labeled data in the source domain are unavailable while unlabeled data in the source domain are available. Pardoe and Stone [147] presented a two-stage algorithm TrAdaBoost. a few labeled data in the target domain are required as the training data to induce the target predictive function. TrAdaBoost attempts to iteratively re-weight the source domain data to reduce the effect of the “bad” source data while encourage the “good” source data to contribute more for the target domain. The error is only calculated on the target data. Most transfer learning approaches in this setting focus on the former case. to avoid the overﬁtting problem in TrAdaBoost. TrAdaBoost also assumes that.2. Then in the second stage. [49] proposed a boosting algorithm. Based on the above deﬁnition of the inductive transfer learning setting. which is an extension of the AdaBoost [66] algorithm. Besides modifying boosting algorithms for transfer learning. More recently. Theoretical analysis of TrAdaBoost is also presented in [49]. For each round of iteration. to address classiﬁcation problems in the inductive transfer learning setting. TrAdaBoost. there are certain parts of the data that can still be reused together with a few labeled data in the target domain. Pardoe and Stone proposed to adjust the weights of the data in two stages. due to the difference in distributions between the source and the target domains. 
TrAdaBoost uses the same strategy as AdaBoost to update the incorrectly classiﬁed examples in the target domain while using a different strategy from AdaBoost to update the incorrectly classiﬁed source examples in the source domain. In the ﬁrst stage. only the weights of the source domain data are adjusted downwards gradually until reaching a certain point. the weights of source domain data are ﬁxed while the weights of the target domain data are updated as normal in the TrAdaBoost algorithm. Jiang and Zhai [87] proposed a heuristic method to remove “misleading” training examples from the source domain based 16 . Furthermore.1. In detail. 2. some of the source domain data may be useful in learning for the target domain but some of them may not and could even be harmful. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels. Furthermore. The idea is to apply the techniques which have been proposed for modifying AdaBoost for regression [56] on TrAdaBoost. but the distributions of the data in the two domains are different.
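The asymmetric weight-update scheme of TrAdaBoost can be sketched as follows. This is our own simplified re-implementation for a scalar feature with decision stumps as base learners; it follows the update rules described in [49] (error measured on target data only; source weights shrink on mistakes, target weights grow) but is not the authors' code.

```python
import numpy as np

def stump_fit(x, y, p):
    """Weighted decision stump on a scalar feature; labels in {0, 1}."""
    best = (0.0, 1, np.inf)
    for thr in np.unique(x):
        for pol in (1, -1):
            pred = (pol * (x - thr) >= 0).astype(int)
            err = np.sum(p * (pred != y))
            if err < best[2]:
                best = (thr, pol, err)
    return best[0], best[1]

def stump_predict(x, thr, pol):
    return (pol * (x - thr) >= 0).astype(int)

def tradaboost(xs, ys, xt, yt, rounds=10):
    """Simplified TrAdaBoost: asymmetric updates for source and target weights."""
    n_s = len(xs)
    x = np.concatenate([xs, xt])
    y = np.concatenate([ys, yt])
    w = np.ones(len(x))
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / rounds))
    hyps = []
    for _ in range(rounds):
        p = w / w.sum()
        thr, pol = stump_fit(x, y, p)
        miss = (stump_predict(x, thr, pol) != y).astype(float)
        # Error is measured on the target portion only.
        eps = float(np.sum(p[n_s:] * miss[n_s:]) / p[n_s:].sum())
        eps = min(max(eps, 1e-10), 0.499)
        beta_t = eps / (1.0 - eps)
        hyps.append((thr, pol, beta_t))
        w[:n_s] *= beta_src ** miss[:n_s]   # shrink weights of misleading source points
        w[n_s:] *= beta_t ** -miss[n_s:]    # grow weights of hard target points
    def predict(xq):
        half = len(hyps) // 2               # vote over the later hypotheses, as in [49]
        score, offset = np.zeros(len(xq)), 0.0
        for thr, pol, bt in hyps[half:]:
            score += -np.log(bt) * stump_predict(xq, thr, pol)
            offset += -0.5 * np.log(bt)
        return (score >= offset).astype(int)
    return predict

rng = np.random.default_rng(2)
xs = rng.uniform(-1, 1, 200); ys = (xs > 0.3).astype(int)   # source boundary at 0.3
xt = rng.uniform(-1, 1, 40);  yt = (xt > 0.0).astype(int)   # target boundary at 0.0
clf = tradaboost(xs, ys, xt, yt, rounds=10)
xq = np.linspace(-0.95, 0.95, 80)
acc = float((clf(xq) == (xq > 0.0)).mean())
```

On this toy pair of tasks, the source data place the decision boundary at 0.3; the re-weighting gradually suppresses the source points that contradict the target boundary at 0.0.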

Gao et al. [67] proposed a graph-based locally weighted ensemble framework to combine multiple models for transfer learning. The idea is to assign weights to the various models dynamically based on local structures in a graph; the weighted models are then used to make predictions on test data. Wu and Dietterich [202] integrated the source domain (auxiliary) data in a Support Vector Machine (SVM) [189] framework for improving the classification performance in the target domain.

2.2.2 Transferring Knowledge of Feature Representations

The feature-representation-transfer approach to the inductive transfer learning problem aims at finding "good" feature representations to minimize domain divergence and classification or regression model error. Strategies to find "good" feature representations differ for different types of source domain data. If a lot of labeled data in the source domain are available, supervised learning methods can be used to construct a feature representation. This is similar to common feature learning in the field of multi-task learning [31]. If no labeled data in the source domain are available, unsupervised learning methods are proposed to construct the feature representation.

Supervised Feature Construction

Supervised feature construction methods for the inductive transfer learning setting are similar to those used in multi-task learning. The basic idea is to learn a low-dimensional representation that is shared across related tasks. In addition, the learned new representation can reduce the classification or regression model error of each task as well. Argyriou et al. [6] proposed a sparse feature learning method for multi-task learning. In the inductive transfer learning setting, the common features can be learned by solving an optimization problem, given as follows:

    arg min_{A,U}  Σ_{t∈{S,T}} Σ_{i=1}^{n_t} L(y_{ti}, ⟨a_t, U^T x_{ti}⟩) + γ ‖A‖²_{2,1}    (2.1)
    s.t.  U ∈ O^d

In this equation, S and T denote the tasks in the source domain and target domain, respectively. A = [a_S, a_T] ∈ R^{d×2} is a matrix of parameters, and U is a d × d orthogonal matrix (mapping function) for mapping the original high-dimensional data to low-dimensional representations. The (r, p)-norm of A is defined as ‖A‖_{r,p} := (Σ_{i=1}^{d} ‖a^i‖_r^p)^{1/p}. The optimization problem (2.1) estimates the low-dimensional representations U^T X_T, U^T X_S and the parameters, A, of the model at the same time. The optimization problem (2.1) can be further transformed into an equivalent convex optimization formulation and solved efficiently.
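The benefit of a shared low-dimensional representation can be illustrated with a heavily simplified sketch. Instead of solving (2.1) jointly, the shared direction is estimated from the well-sampled source task alone and the target task is then solved inside that subspace; all data, dimensions and noise levels below are our own toy choices, not from [6].

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
u_true = rng.normal(size=d)
u_true /= np.linalg.norm(u_true)
w_src = 2.0 * u_true      # both tasks' true weights share one latent direction
w_tgt = -1.5 * u_true

Xs = rng.normal(size=(200, d)); ys = Xs @ w_src + 0.1 * rng.normal(size=200)
Xt = rng.normal(size=(7, d));   yt = Xt @ w_tgt + 0.3 * rng.normal(size=7)

# Step 1: estimate the shared one-dimensional subspace U from the source task.
w_hat_src = np.linalg.lstsq(Xs, ys, rcond=None)[0]
U = (w_hat_src / np.linalg.norm(w_hat_src)).reshape(d, 1)

# Step 2: solve the data-poor target task in the shared space U^T x.
a_t = np.linalg.lstsq(Xt @ U, yt, rcond=None)[0]
w_transfer = (U @ a_t).ravel()

# Baseline: fit the target task directly in the full d-dimensional space.
w_direct = np.linalg.lstsq(Xt, yt, rcond=None)[0]

err_transfer = float(np.linalg.norm(w_transfer - w_tgt))
err_direct = float(np.linalg.norm(w_direct - w_tgt))
```

With only seven target examples in six dimensions, the direct fit is badly conditioned, while the one-parameter fit in the shared subspace recovers the target weights much more accurately; this is the intuition behind reducing model error through a common representation.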

In a follow-up work, Argyriou et al. [8] proposed a spectral regularization framework on matrices for multi-task structure learning.

Another famous common feature learning method for multi-task learning is the alternating structure optimization (ASO) algorithm, proposed by Ando and Zhang [4]. In ASO, a linear classifier is first trained for each of the multiple tasks. The weight vectors of the multiple classifiers are then used to construct a predictor space, and Singular Value Decomposition (SVD) is applied on the space to recover a low-dimensional predictive space as a common feature space underlying the multiple tasks. The ASO algorithm has been applied successfully to several applications [25, 5]. However, it is non-convex and does not guarantee to find a global optimum. More recently, Chen et al. [37] presented an improved formulation (iASO) based on ASO by proposing a novel regularizer. Furthermore, in order to convert the new formulation into a convex formulation, Chen et al. proposed a convex alternating structure optimization (cASO) algorithm to solve the optimization problem in [37].

Jebara [84] proposed to select features for multi-task learning with SVMs. Ruckert et al. [160] designed a kernel-based approach to inductive transfer, which aims at finding a suitable kernel for the target data. Lee et al. [99] proposed a convex optimization algorithm for simultaneously learning meta-priors and feature weights from an ensemble of related prediction tasks. The meta-priors can be transferred among different tasks.

Unsupervised Feature Construction

In [150], Raina et al. proposed to apply sparse coding [98], which is an unsupervised feature construction method, for learning higher-level features for transfer learning. The basic idea of this approach consists of two steps. In the first step, higher-level basis vectors b = {b_1, b_2, . . . , b_s} are learned on the source domain data by solving the optimization problem (2.2), shown as follows:

    min_{a,b}  Σ_i ‖x_{Si} − Σ_j a_{Si}^j b_j‖²_2 + β ‖a_{Si}‖_1    (2.2)
    s.t.  ‖b_j‖_2 ≤ 1, ∀ j ∈ {1, . . . , s}

In this equation, a_{Si}^j is a new representation of basis b_j for input x_{Si}, and β is a coefficient to balance the feature construction term and the regularization term. After learning the basis vectors b, in the second step, an optimization algorithm (2.3) is applied on the target domain data to learn higher-level features based on the basis vectors b:

    a*_{Ti} = arg min_{a_{Ti}}  ‖x_{Ti} − Σ_j a_{Ti}^j b_j‖²_2 + β ‖a_{Ti}‖_1    (2.3)

Finally, discriminative algorithms can be applied to the {a*_{Ti}}'s with corresponding labels to train classification or regression models for use in the target domain. One drawback of this method is that the so-called higher-level basis vectors learned on the source domain in the optimization problem (2.2) may not be suitable for use in the target domain.
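Problems (2.2) and (2.3) can be sketched with a small alternating procedure: an ISTA-style soft-thresholding update for the sparse codes and a least-squares update for the basis under the norm constraint. This is our own simplified implementation on synthetic data, not the code of [150], and the hyper-parameters are arbitrary.

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_codes(X, B, beta, n_iter=100):
    """ISTA for min_a ||x - B a||_2^2 + beta ||a||_1, vectorized over rows of X."""
    L = np.linalg.norm(B, 2) ** 2              # step-size control (squared spectral norm)
    A = np.zeros((X.shape[0], B.shape[1]))
    for _ in range(n_iter):
        grad = (A @ B.T - X) @ B               # half of the true gradient
        A = soft_threshold(A - grad / L, beta / (2.0 * L))
    return A

def learn_basis(X, k, beta, n_outer=20, seed=0):
    """Alternate the code step of (2.2) with a basis update under ||b_j||_2 <= 1."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(X.shape[1], k))
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    for _ in range(n_outer):
        A = sparse_codes(X, B, beta)
        B = np.linalg.lstsq(A, X, rcond=None)[0].T
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    return B

rng = np.random.default_rng(4)
B_true = rng.normal(size=(10, 5))
B_true /= np.linalg.norm(B_true, axis=0)
A_true = rng.normal(size=(100, 5)) * (rng.random((100, 5)) < 0.3)
X_src = A_true @ B_true.T + 0.01 * rng.normal(size=(100, 10))

B = learn_basis(X_src, k=5, beta=0.1)      # step 1: basis from source data, as in (2.2)
A_new = sparse_codes(X_src[:10], B, 0.1)   # step 2: codes for new inputs, as in (2.3)
rel_err = float(np.linalg.norm(X_src - sparse_codes(X_src, B, 0.1) @ B.T)
                / np.linalg.norm(X_src))
```

The learned codes can then be fed to any discriminative learner, mirroring the final step described above; when the target data distribution differs strongly from the source, the learned basis may reconstruct target inputs poorly, which is exactly the drawback noted in the text.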

Recently, manifold learning methods have been adapted for transfer learning. In [193], Wang and Mahadevan proposed a Procrustes analysis based approach to manifold alignment without correspondences, which can be used to transfer the knowledge across domains via the aligned manifolds.

2.2.3 Transferring Knowledge of Parameters

Most parameter-transfer approaches to the inductive transfer learning setting assume that individual models for related tasks should share some parameters or prior distributions of hyper-parameters. Lawrence and Platt [97] proposed an efficient algorithm known as MT-IVM, which is based on Gaussian Processes (GP), to handle the multi-task learning case. MT-IVM tries to learn the parameters of a Gaussian Process over multiple tasks by sharing the same GP prior. Bonilla et al. [28] also investigated multi-task learning in the context of GP, where a GP prior is used to induce correlations between tasks. The authors proposed to use a free-form covariance matrix over tasks to model inter-task dependencies. Schwaighofer et al. [165] proposed to use a hierarchical Bayesian framework (HB) together with GP for multi-task learning.

Besides transferring the priors of GP models, some researchers have also proposed to transfer the parameters of SVMs under a regularization framework. Evgeniou and Pontil [63] borrowed the idea of HB for SVMs for multi-task learning. The proposed method assumes that the parameter, w, in SVMs for each task can be separated into two terms: one is a common term over tasks and the other is a task-specific term. By assuming f_t = w_t · x to be a hyper-plane for task t, we have

    w_S = w_0 + v_S  and  w_T = w_0 + v_T,

where w_S and w_T are the parameters of the SVMs for the source task and the target learning task, respectively; w_0 is a common parameter, while v_S and v_T are specific parameters for the source task and the target task, respectively.

Most approaches described in this section, including the regularization framework and the hierarchical Bayesian framework, are designed to work under multi-task learning. However, they can be easily modified for transfer learning. Intuitively, multi-task learning tries to learn both the source and target tasks simultaneously and equally well, while transfer learning only aims at boosting the performance of the target domain by utilizing the source domain data. Thus, in multi-task learning, the weights of the loss functions for the source and target data are the same. In contrast, in transfer learning, the weights in the loss functions for different domains can be different. Intuitively, we may assign a larger weight to the loss function of the target domain to make sure that we can achieve better performance in the target domain.

For example, an extension of SVMs to the multi-task learning case can be written as follows:

    min_{w_0, v_t, ξ_ti}  J(w_0, v_t, ξ_ti) = Σ_{t∈{S,T}} Σ_{i=1}^{n_t} ξ_ti + λ_1 Σ_{t∈{S,T}} ‖v_t‖² + λ_2 ‖w_0‖²
    s.t.  y_ti (w_0 + v_t) · x_ti ≥ 1 − ξ_ti,
          ξ_ti ≥ 0,  i ∈ {1, . . . , n_t},  t ∈ {S, T},

where the x_ti's, i ∈ {1, . . . , n_t} and t ∈ {S, T}, are the input feature vectors in the source and target domains, and the y_ti's are the corresponding labels. The ξ_ti's are slack variables that measure the degree of misclassification of the data x_ti, as in standard SVMs, and λ_1 and λ_2 are positive regularization parameters that trade off the importance between the misclassification and regularization terms. By solving the optimization problem above, we can learn the parameters w_0, v_S and v_T simultaneously.

2.2.4 Transferring Relational Knowledge

Different from the other three contexts, the relational-knowledge-transfer approach deals with transfer learning problems in relational domains, where the data are non-i.i.d. and can be represented by multiple relations, such as networked data and social network data. This approach does not assume that the data drawn from each domain are independent and identically distributed (i.i.d.), as traditionally assumed. It tries to transfer the relationships among the data from a source domain to a target domain. In this context, statistical relational learning techniques are proposed to solve these problems.

Mihalkova et al. [124] proposed an algorithm, TAMAR, that transfers relational knowledge with Markov Logic Networks (MLNs) [155] across relational domains. MLNs are a powerful formalism for statistical relational learning, which combines the compact expressiveness of first-order logic with the flexibility of probability. In an MLN, entities in a relational domain are represented by predicates and their relationships are represented in first-order logic. TAMAR is motivated by the fact that if two domains are related to each other, there may exist mappings to connect entities and their relationships from a source domain to a target domain. For example, a professor can be considered as playing a similar role in an academic domain as a manager in an industrial management domain. In addition, the relationship between a professor and his or her students is similar to the relationship between a manager and his or her workers. Thus, there may exist a mapping from professor to manager, and a mapping from the professor-student relationship to the manager-worker relationship. In this vein, TAMAR tries to use an MLN learned for a source domain to aid in the learning of an MLN for a target domain. Basically, TAMAR is a two-stage algorithm. In the first stage, a mapping is constructed from a source MLN to the target domain based on the weighted pseudo log-likelihood measure (WPLL).
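The multi-task SVM objective above can be approximated with plain subgradient descent on the hinge losses. The sketch below is our own illustration (hyper-parameters and data are arbitrary): it learns w_0, v_S and v_T jointly, with a factor mu that up-weights the target-domain loss as discussed in the text.

```python
import numpy as np

def multitask_svm(Xs, ys, Xt, yt, lam1=0.1, lam2=0.01, mu=2.0, epochs=500, lr=0.05):
    """Subgradient descent for the shared-plus-specific SVM, w_t = w0 + v_t.
    mu > 1 up-weights the target-domain loss; labels must be in {-1, +1}."""
    d = Xs.shape[1]
    w0, vS, vT = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(epochs):
        g0 = 2.0 * lam2 * w0
        grads = {}
        for name, X, y, v, wt in (("S", Xs, ys, vS, 1.0), ("T", Xt, yt, vT, mu)):
            margins = y * (X @ (w0 + v))
            active = margins < 1.0                     # points violating the margin
            g_loss = -wt * (y[active, None] * X[active]).sum(axis=0) / len(y)
            g0 = g0 + g_loss                           # shared parameter sees both losses
            grads[name] = g_loss + 2.0 * lam1 * v      # specific parameter sees its own
        w0 -= lr * g0
        vS -= lr * grads["S"]
        vT -= lr * grads["T"]
    return w0, vS, vT

rng = np.random.default_rng(5)
Xs = rng.normal(size=(200, 2)); ys = np.sign(Xs @ np.array([1.0, 0.3]))   # source task
Xt = rng.normal(size=(20, 2));  yt = np.sign(Xt @ np.array([1.0, -0.3]))  # related target task
w0, vS, vT = multitask_svm(Xs, ys, Xt, yt)
acc_t = float((np.sign(Xt @ (w0 + vT)) == yt).mean())
```

The shared component w0 absorbs what the two separating hyper-planes have in common, while v_S and v_T capture their task-specific deviations, mirroring the decomposition w_S = w0 + v_S, w_T = w0 + v_T.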

At the second stage, a revision is done for the mapped structure in the target domain through the FORTE algorithm [152], which is an inductive logic programming (ILP) algorithm for revising first-order theories. The revised MLN can be used as a relational model for inference or reasoning in the target domain. In a follow-up work [125], Mihalkova et al. extended TAMAR to the single-entity-centered setting of transfer learning, where only one entity in a target domain is available. Davis et al. [52] proposed an approach to transferring relational knowledge based on a form of second-order Markov logic. The basic idea of the algorithm is to discover structural regularities in the source domain in the form of Markov logic formulas with predicate variables, and then to instantiate these formulas with predicates from the target domain.

2.3 Transductive Transfer Learning

The term transductive transfer learning was first proposed by Arnold et al. [9], where they required that the source and target tasks be the same, although the domains may be different. On top of these conditions, they further required that all unlabeled data in the target domain be available at training time. We believe that this condition can be relaxed; instead, in our definition of the transductive transfer learning setting, we only require that part of the unlabeled target domain data be seen at training time in order to obtain the marginal probability for the target domain data.

Note that the word "transductive" is used with several meanings. In the traditional machine learning setting, transductive learning [90] refers to the situation where all test data are required to be seen at training time, and the learned model cannot be reused for future data. Thus, when some new test data arrive, they must be classified together with all existing data. In our categorization of transfer learning, in contrast, we use the term "transductive" to emphasize the concept that in this type of transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain.

Definition 3. (Transductive Transfer Learning) Given a source domain DS and a corresponding learning task TS, a target domain DT and a corresponding learning task TT, transductive transfer learning aims to improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT and TS = TT. In addition, some unlabeled target domain data must be available at training time.

This definition covers the work of Arnold et al. [9], since the latter considered domain adaptation, where the difference lies between the marginal probability distributions of the source and target data. In the transductive transfer learning setting, the tasks are the same but the domains are different.

For example, assume our task is to classify whether a document describes information about laptops or not. The source domain documents are downloaded from news websites and are annotated by humans, while the target domain documents are downloaded from shopping websites. The goal of transductive transfer learning is to make use of some unlabeled (without human annotation) target documents, together with a lot of labeled source domain documents, to train a good classifier to make predictions on the documents in the target domain (including documents unseen in training).

In the above definition of transductive transfer learning, the source and target tasks are the same, but the data distributions may be different across domains. This is similar to the requirements in domain adaptation and sample selection bias. Similar to the traditional transductive learning setting, which aims to make the best use of the unlabeled test data for learning, in our classification scheme under transductive transfer learning, we also assume that some target-domain unlabeled data are given. As mentioned in Chapter 2, this setting can be split into two cases: (a) the feature spaces between the source and target domains are different, XS ≠ XT; and (b) the feature spaces between domains are the same, XS = XT, but the marginal probability distributions of the input data are different, P(XS) ≠ P(XT), which implies that one can adapt the predictive function learned in the source domain for use in the target domain through some unlabeled target-domain data. Most approaches described in the following sections are related to case (b) above.

2.3.1 Transferring the Knowledge of Instances

Most instance-transfer approaches to the transductive transfer learning setting are motivated by importance sampling. To see how importance-sampling-based methods may help in this setting, we first review the problem of empirical risk minimization (ERM) [189]. In general, we might want to learn the optimal parameters θ* of the model by minimizing the expected risk,

    θ* = arg min_{θ∈Θ} E_{(x,y)∈P} [l(x, y, θ)],

where l(x, y, θ) is a loss function that depends on the parameter θ. However, since it is hard to estimate the probability distribution P, we choose to minimize the ERM instead,

    θ* = arg min_{θ∈Θ} (1/n) ∑_{i=1}^{n} l(xi, yi, θ),

where the xi's are input feature vectors, the yi's are the corresponding labels, and n is the size of the training data. In the transductive transfer learning setting, we want to learn an optimal model for the target domain by minimizing the expected risk,

    θ* = arg min_{θ∈Θ} ∑_{(x,y)∈DT} P(DT) l(x, y, θ).
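As a concrete illustration of minimizing a (possibly instance-weighted) empirical risk, the sketch below is an illustrative assumption, not code from the thesis: it runs gradient descent on a log-loss ERM objective where each instance's loss can be scaled by a weight; uniform weights recover the plain ERM above, and the weights anticipate the re-weighted risks used by the instance-transfer methods in this section.

```python
import numpy as np

def weighted_erm_logreg(X, y, weights, lr=0.1, epochs=500):
    """Gradient descent on the importance-weighted empirical risk
    (1/n) * sum_i weights[i] * logloss(x_i, y_i; theta), with y in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ theta))        # predicted probabilities
        grad = X.T @ (weights * (p - y)) / len(y)   # weighted log-loss gradient
        theta -= lr * grad
    return theta
```

Setting all weights to 1 is ordinary ERM on the source data; the instance-transfer approaches below differ mainly in how the per-instance weights are estimated.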

If P(DS) = P(DT), then we may simply learn the model by solving the following optimization problem for use in the target domain:

    θ* = arg min_{θ∈Θ} ∑_{(x,y)∈DS} P(DS) l(x, y, θ).

Otherwise, when P(DS) ≠ P(DT), we need to modify the above optimization problem to learn a model with high generalization ability for the target domain:

    θ* = arg min_{θ∈Θ} ∑_{(x,y)∈DS} (P(DT)/P(DS)) l(x, y, θ)                    (2.4)
       ≈ arg min_{θ∈Θ} ∑_{i=1}^{nS} (PT(xTi, yTi)/PS(xSi, ySi)) l(xSi, ySi, θ).

Therefore, by adding different penalty values to each instance (xSi, ySi) with the corresponding weight PT(xTi, yTi)/PS(xSi, ySi), we can learn a precise model for the target domain. Since P(YT|XT) = P(YS|XS), the difference between P(DS) and P(DT) is caused by P(XS) and P(XT), and

    PT(xTi, yTi)/PS(xSi, ySi) = P(xTi)/P(xSi).

If we can estimate P(xTi)/P(xSi) for each instance, we can solve the transductive transfer learning problems. However, since no labeled data in the target domain are observed in the training data, we have to learn the model from the source domain data instead.

There exist various ways to estimate P(xTi)/P(xSi). Zadrozny [219] proposed to estimate the terms P(xSi) and P(xTi) independently by constructing simple classification problems. Huang et al. [82] proposed a kernel-mean matching (KMM) algorithm to learn the ratio directly by matching the means between the source domain data and the target domain data in a reproducing-kernel Hilbert space (RKHS). KMM can be rewritten as the following quadratic programming (QP) optimization problem:

    min_β  (1/2) βᵀ K β − κᵀ β                                                  (2.5)
    s.t.  βi ∈ [0, B]  and  |∑_{i=1}^{nS} βi − nS| ≤ nS ε,

where K = [KS,S KS,T; KT,S KT,T] with Kij = k(xi, xj) and xi, xj ∈ XS ∪ XT. KS,S and KT,T are kernel matrices for the source domain data and the target domain data, respectively, i.e., (KS,S)ij = k(xi, xj) with xi, xj ∈ XS, and (KT,T)ij = k(xi, xj) with xi, xj ∈ XT. KS,T (with KT,S = KS,Tᵀ) is a kernel matrix across the source and target domain data, i.e., (KS,T)ij = k(xi, xj) with xi ∈ XS and xj ∈ XT. Finally, κi = (nS/nT) ∑_{j=1}^{nT} k(xi, xj), where xi ∈ XS ∪ XT and xj ∈ XT. It can be proved that βi = P(xTi)/P(xSi) [82]. An advantage of using KMM is that it can avoid performing density estimation of either P(xSi) or P(xTi), which is difficult when the size of the data set is small.
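The QP above can be handed to any off-the-shelf solver; the following sketch is an illustrative assumption (an RBF kernel and scipy's general-purpose SLSQP routine rather than a dedicated QP solver) for estimating the weights β on a toy problem:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kmm_weights(Xs, Xt, B=10.0, eps=None, gamma=0.5):
    """Solve min_b 0.5 b'Kb - kappa'b  s.t.  0 <= b_i <= B and
    |sum_i b_i - nS| <= nS*eps, with K the source-source kernel and
    kappa_i = (nS/nT) * sum_j k(x_Si, x_Tj)."""
    nS, nT = len(Xs), len(Xt)
    if eps is None:
        eps = B / np.sqrt(nS)
    K = rbf(Xs, Xs, gamma)
    kappa = (nS / nT) * rbf(Xs, Xt, gamma).sum(axis=1)
    cons = [{'type': 'ineq', 'fun': lambda b: nS * eps - (b.sum() - nS)},
            {'type': 'ineq', 'fun': lambda b: nS * eps + (b.sum() - nS)}]
    res = minimize(lambda b: 0.5 * b @ K @ b - kappa @ b,
                   np.ones(nS), bounds=[(0.0, B)] * nS,
                   constraints=cons, method='SLSQP')
    return res.x
```

Source points lying where the target distribution has mass receive large weights; the resulting β then plugs directly into a weighted learner such as the re-weighted ERM sketched earlier.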

Sugiyama et al. [180] proposed an algorithm known as the Kullback-Leibler Importance Estimation Procedure (KLIEP) to estimate P(xTi)/P(xSi) directly, based on the minimization of the Kullback-Leibler divergence. KLIEP can be integrated with cross-validation to perform model selection automatically in two steps: (1) estimating the weights of the source domain data; (2) training models on the re-weighted data. Bickel et al. [21] combined these two steps in a unified framework by deriving a kernel-logistic regression classifier. Kanamori et al. [91] proposed a method called unconstrained least-squares importance fitting (uLSIF) to estimate the importance efficiently, by formulating the direct importance estimation problem as a least-squares function fitting problem. More recently, Sugiyama et al. [179] further extended the uLSIF algorithm by estimating importance in a non-stationary subspace, which performs well even when the dimensionality of the data domains is high. However, this method is focused on estimating the importance in a latent space instead of learning a latent space for adaptation. It is to be emphasized that, besides covariate shift adaptation, these importance estimation techniques have also been applied to various other applications, such as independent component analysis (ICA) [181], outlier detection [78] and change-point detection [92]. Besides sample re-weighting techniques, Dai et al. [48] extended a traditional Naive Bayesian classifier for the transductive transfer learning problems. For more information on importance sampling and re-weighting methods for covariate shift or sample selection bias, readers can refer to a recently published book [148] by Quionero-Candela et al.

2.3.2 Transferring Knowledge of Feature Representations

Most feature-representation-transfer approaches to the transductive transfer learning setting are under unsupervised learning frameworks. Blitzer et al. [26] proposed a structural correspondence learning (SCL) algorithm, which modifies the ASO [5] algorithm for transductive transfer learning. As described in Chapter 2, ASO was proposed for multi-task learning. Thus, in SCL, a first step is to construct some pseudo related tasks, using labeled source domain and unlabeled target domain data, in order to make use of the unlabeled data from the target domain to extract some common features that reduce the difference between the source and target domains. More specifically, Blitzer et al. proposed to first define a set of pivot features (the number of pivot features is denoted by m), which are common features that occur frequently and similarly across domains. For example, the words "good", "nice" and "bad", which are sentiment words used commonly across different domains (e.g., in reviews on different products), are examples of pivot features. Then, SCL treats each pivot feature as a new output vector to construct a task, with the non-pivot features as inputs. The m linear classifiers are trained to model the relationship between the non-pivot features and the pivot features as follows:

    fl(x) = sgn(wlᵀ · x),  l = 1, ..., m.
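The pivot-prediction step, and the projection built from it, can be sketched compactly; the code below is an illustrative assumption (ridge regression stands in for the modified-Huber pivot predictors of the SCL paper, and h denotes the number of shared features kept):

```python
import numpy as np

def scl_theta(X, pivot_idx, h=2, lam=1.0):
    """Train one linear predictor per pivot feature from the non-pivot
    features, stack the weight vectors into W (one column per pivot task),
    and take the top-h left singular vectors of W as the projection theta."""
    nonpivot = [j for j in range(X.shape[1]) if j not in set(pivot_idx)]
    Xn = X[:, nonpivot]
    A = Xn.T @ Xn + lam * np.eye(len(nonpivot))     # ridge normal equations
    W = np.column_stack([np.linalg.solve(A, Xn.T @ X[:, p])
                         for p in pivot_idx])
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :h].T, nonpivot                      # theta maps non-pivots to R^h

# augmented representation of an instance x: concatenate x with theta @ x[nonpivot]
```

Because the pivot tasks are built from unlabeled data in both domains, the learned theta captures correspondences shared across domains, which is exactly what makes the augmented representation transferable.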

As in ASO, singular value decomposition (SVD) is then applied to the weight matrix W = [w1 w2 ... wm] ∈ R^{q×m}, where q is the number of features of the original data, such that W = UDVᵀ. Here U_{q×r} and V_{r×m} are the matrices of the left and right singular vectors, respectively, and D_{r×r} is a diagonal matrix consisting of non-negative singular values, ranked in non-increasing order. Let θ = U[1:h,:]ᵀ, where h is the number of the shared features; then θ is a transformation mapping the non-pivot features to a latent space, where the difference between domains can be reduced. Finally, standard classification algorithms can be applied on the new representations to train classifiers. In [26], Blitzer et al. used a heuristic method to select pivot features for natural language processing (NLP) problems, such as tagging of sentences. In their follow-up work [25], Mutual Information (MI) is proposed for choosing the pivot features.

Daumé III [51] proposed a simple feature augmentation method for transfer learning problems in the NLP area. It aims to augment each of the feature vectors of the different domains to a high-dimensional feature vector as follows:

    x̃S = [xS, xS, 0]  and  x̃T = [xT, 0, xT],

where xS and xT are original feature vectors of the source and target domains, respectively, and 0 is a vector of zeros whose length is equivalent to that of the original feature vector.

Dai et al. [47] proposed a co-clustering based algorithm to discover common feature clusters, such that label information can be propagated across different domains by using the common clusters as a bridge. Xing et al. [205] proposed a novel algorithm known as bridged refinement to correct the labels predicted by a shift-unaware classifier towards a target distribution, taking the mixture distribution of the training and test data as a bridge to better transfer from the training data to the test data. Ling et al. [108] proposed a spectral classification framework for the cross-domain transfer learning problem, where the objective function is introduced to seek consistency between the in-domain supervision and the out-of-domain intrinsic structure. The idea is to reduce the difference between domains while ensuring that the similarity between data within a domain is larger than that across different domain data. Xue et al. [207] proposed a cross-domain text classification algorithm that extended the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data from different but related domains into a unified probabilistic model.

Most feature-based transductive transfer learning methods do not minimize the distance in distributions between domains directly. Recently, von Bünau et al. [191] proposed a method known as stationary subspace analysis (SSA) to match distributions in a latent space for time series data analysis. SSA theoretically studied the conditions under which a stationary space can be identified from multivariate time series. They also proposed the SSA procedure to find stationary components by matching the first two moments of the data distributions in different epochs. However, SSA is focused on the identification of a stationary subspace, without considering the preservation of properties such as data variance in the subspace.
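The augmentation is a one-liner in practice; the sketch below (the function name is an illustrative assumption) builds the shared, source-specific and target-specific copies:

```python
import numpy as np

def augment(x, domain):
    """Daume III feature augmentation: a shared copy plus a domain-specific
    copy. Source: [x, x, 0]; target: [x, 0, x]."""
    z = np.zeros_like(x)
    parts = [x, x, z] if domain == 'source' else [x, z, x]
    return np.concatenate(parts)
```

A standard classifier trained on the augmented vectors can then learn shared behavior in the first block and per-domain corrections in the other two blocks.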

However, SSA may map the data to some noisy factors which are stationary across domains but completely irrelevant to the target supervised task. Then classifiers trained on the new representations learned by SSA may not achieve good performance for transductive transfer learning. As a result, SSA is focused on how to identify a stationary subspace, without considering how to preserve data properties in the latent space as well.

2.4 Unsupervised Transfer Learning

Definition 4. (Unsupervised Transfer Learning) Given a source domain DS with a learning task TS, a target domain DT and a corresponding learning task TT, unsupervised transfer learning aims to help improve the learning of the target predictive function fT(·) (4) in DT using the knowledge in DS and TS, where TS ≠ TT and YS and YT are not observable.

(4) In unsupervised transfer learning, no labeled data are observed in the source and target domains in training. The predicted labels are latent variables, such as clusters or reduced dimensions.

Based on the definition of the unsupervised transfer learning setting, no labeled data are observed in the source and target domains in training. So far, there is little research work on this setting.

2.4.1 Transferring Knowledge of Feature Representations

Dai et al. [50] studied a new case of clustering problems, known as self-taught clustering, which aims at clustering a small collection of unlabeled data in the target domain with the help of a large amount of unlabeled data in the source domain. For example, assume we have a lot of documents downloaded from news websites, which are referred to as source domain documents, and a few documents downloaded from shopping websites, which are referred to as target domain documents. The task is to cluster the target domain documents into some hidden categories. Note that it is usually not possible to find precise clusters if the data are sparse. Thus, the goal of unsupervised transfer learning here is to make use of the documents in the source domain, where precise clusters can be obtained because of the sufficiency of training data, to guide clustering on the target domain documents.

Self-taught clustering (STC) is an instance of unsupervised transfer learning. STC tries to learn a common feature space across domains, which helps in clustering in the target domain. The objective function of STC is shown as follows:

    J(X̃T, X̃S, Z̃) = I(XT, Z) − I(X̃T, Z̃) + λ [ I(XS, Z) − I(X̃S, Z̃) ],

where XS and XT are the source and target domain data, respectively, Z is a feature space shared by XS and XT, and I(·,·) is the mutual information between two random variables. Suppose that there exist three clustering functions CXT : XT → X̃T, CXS : XS → X̃S and CZ : Z → Z̃.
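The mutual-information terms in the STC objective can be computed from co-occurrence tables; the following is an illustrative sketch (assumptions: hard cluster assignments given as index arrays, and function names invented here) that evaluates one bracket of the objective, I(X, Z) − I(X̃, Z̃):

```python
import numpy as np

def mutual_info(P):
    """I(X;Z) for a joint probability table P(x, z)."""
    px = P.sum(axis=1, keepdims=True)
    pz = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log(P[nz] / (px @ pz)[nz])).sum())

def mi_loss(P, row_cluster, col_cluster, R, C):
    """I(X,Z) - I(X~,Z~): the information lost by clustering the rows of X
    into R clusters and the feature values Z into C clusters."""
    Pc = np.zeros((R, C))
    for i, r in enumerate(row_cluster):
        for j, c in enumerate(col_cluster):
            Pc[r, c] += P[i, j]
    return mutual_info(P) - mutual_info(Pc)
```

STC alternately updates the clusterings of XT, XS and Z so as to reduce mi_loss(PT, ...) + λ * mi_loss(PS, ...); by the data-processing inequality each bracket is non-negative, so the loss measures how much structure the shared feature clustering preserves.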

Here X̃T, X̃S and Z̃ are the corresponding clusters of XT, XS and Z, respectively. The goal of STC is to learn X̃T by solving the optimization problem

    arg min_{X̃T, X̃S, Z̃} J(X̃T, X̃S, Z̃).                                        (2.6)

An iterative algorithm for solving the optimization function (2.6) was presented in [50].

Wang et al. [197] proposed a transferred discriminative analysis (TDA) algorithm to solve the transfer dimensionality reduction problem. TDA first applies clustering methods to generate pseudo-class labels for the target domain unlabeled data. It then applies dimensionality reduction methods to the target domain data and the labeled source domain data to reduce the dimensionalities. These two steps run iteratively to find the best subspace for the target domain data. More recently, Bhattacharya et al. [19] proposed a new clustering algorithm to transfer supervision to a clustering task in a target domain from a different source domain, where a relevant supervised data partition is provided. The main idea is to define a cross-domain similarity measure to align clusters in the target domain with the partitions in the source domain, such that the performance of clustering in the target domain can be improved.

2.5 Transfer Bounds and Negative Transfer

An important issue is to recognize the limit of the power of transfer learning. To date, few works have been proposed to study the issue of transferability. In inductive transfer learning, Hassan Mahmud and Ray [120] analyzed the case of transfer learning using Kolmogorov complexity, where some theoretical bounds are proved. In particular, the authors used conditional Kolmogorov complexity to measure relatedness between tasks and transfer the "right" amount of information in a sequential transfer learning task under a Bayesian framework. Recently, Eaton et al. [60] proposed a novel graph-based method for knowledge transfer, where the relationships between source tasks are modeled by embedding the set of learned source models in a graph using transferability as the metric. Transferring to a new task proceeds by mapping the problem into the graph and then learning a function on this graph that automatically determines the parameters to transfer to the new learning task.

In transductive transfer learning, there are some research works focusing on the generalization bound when the training and test data distributions are different [15, 16, 17, 24]. Though the generalization bounds proved in the different works differ slightly, there is a common conclusion: the generalization bound of a learning model in transductive transfer learning consists of two terms. One is the error bound of the learning model on the labeled source domain data; the other is the distance between the source and target domains, more specifically the distance between the marginal probability distributions of input features across domains.

How to avoid negative transfer and thereby ensure a "safe transfer" of knowledge is another crucial issue in transfer learning. Negative transfer happens when the source domain data and task contribute to reduced performance of learning in the target domain. Despite the fact that how to avoid negative transfer is a very important issue, little research work was published on this topic in the past. Rosenstein et al. [159] empirically showed that if two tasks are too dissimilar, then brute-force transfer may hurt the performance of the target task. More recently, Seah et al. [167] empirically studied the negative transfer problem by proposing a predictive distribution matching classifier based on Support Vector Machines (SVMs) to identify the regions of relevant source domain data where the predictive distributions maximally align with those of the target domain data, and then avoid negative transfer.

Some works have been exploited to analyze relatedness among tasks and task clustering techniques, such as [7, 11, 18, 83], which may help provide guidance on how to avoid negative transfer automatically. Argyriou et al. [7] considered situations in which the learning tasks can be divided into groups. Tasks within each group are related by sharing a low-dimensional representation, which differs among different groups. As a result, tasks within a group can find it easier to transfer useful knowledge. Jacob et al. [83] presented a convex approach to clustered multi-task learning by designing a new spectral norm to penalize over a set of weights, each of which is associated to a task. Bakker and Heskes [11] adopted a Bayesian approach in which some of the model parameters are shared for all tasks and others are more loosely connected through a joint prior distribution that can be learned from the data. In their approach, the data are clustered based on the task parameters, where tasks in the same cluster are supposed to be related to each other. Bonilla et al. [28] proposed a multi-task learning method based on Gaussian Processes (GPs), which provides a global approach to model and learn task relatedness in the form of a task covariance matrix. However, the optimization procedure introduced in [28] is non-convex, so its results may be sensitive to parameter initialization. Zhang and Yeung [224] proposed an improved regularization framework to model the negative and positive correlation between tasks, where the resultant optimization procedure is convex.

As described above, most research works on modeling task correlation are from the context of multi-task learning [7, 11, 18, 28, 83, 224]. However, in inductive transfer learning, one may be particularly interested in transferring knowledge from one or more source tasks to a target task rather than learning these tasks simultaneously. The main concern of inductive transfer learning is the learning performance in the target task only. Thus, we need to answer the question of whether, given a target task and a source task, transfer learning techniques should be applied or not. Motivated by [28], Cao et al. [30] proposed an Adaptive Transfer learning algorithm based on GP (AT-GP), which aims to adapt the transfer learning schemes by automatically estimating the similarity between the source and target tasks. In AT-GP, a new semi-parametric kernel is designed to model the correlation between tasks, and the learning procedure targets improving the performance of the target task only.

2.6 Other Research Issues of Transfer Learning

Besides the negative transfer issue, in recent years several research issues of transfer learning have attracted more and more attention from the machine learning and data mining communities. We summarize them as follows.

• Transfer learning from multiple source domains. Most transfer learning methods introduced in the previous chapters focus on one-to-one transfer, which means there is only one source domain and one target domain. However, in some real-world scenarios, we may have multiple sources at hand, each with its own distinct feature space. Developing algorithms to make use of multiple sources to help learn models in the target domain is useful in practice. Mansour et al. [122] proposed a distribution-weighted linear combining framework for learning from multiple sources. The main idea is to estimate the data distribution of each source to reweight the data from the different source domains. Luo et al. [119] proposed to train a classifier for use in the target domain by maximizing the consensus of predictions from multiple sources. Yang et al. [209] and Duan et al. [57] proposed algorithms to learn a new SVM for the target domain by adapting some SVMs learned from multiple source domains. Theoretical studies on transfer learning from multiple source domains have also been presented in [15, 122, 123].

• Transfer learning across different feature spaces. As described in the definition of transfer learning, the source and target domain data may come from different feature spaces, such as image vs. text. Thus, how to transfer knowledge successfully across different feature spaces is another interesting issue. This is unlike multi-view learning [27], which assumes that the features of each instance can be divided into several views; though multi-view learning techniques can be applied to model multi-modality data, they require that each instance in one view have its correspondence in the other views. In contrast, transfer learning across different feature spaces aims to solve the problem when the source and target domain data belong to two different feature spaces, without correspondences across the feature spaces. Ling et al. [109] proposed an information-theoretic approach for transfer learning to address the cross-language classification problem. The approach addressed the setting where there are plenty of labeled English text data whereas there are only a small number of labeled Chinese text documents; transfer learning across the two feature spaces is achieved by designing a suitable mapping function across the feature spaces as a bridge. Dai et al. [45] proposed a new risk minimization framework based on a language model for machine translation [44].

This framework deals with the problem of learning heterogeneous data that belong to quite different feature spaces.

• Transfer learning with active learning. As mentioned at the beginning of Chapter 2, active learning and transfer learning both aim at learning a precise model with minimal human supervision for a target task. Several researchers have proposed to combine these two techniques in order to learn a more precise model with even less supervision. Liao et al. [107] proposed a new active learning method to select the unlabeled data in a target domain to be labeled with the help of the source domain data. Shi et al. [169] applied an active learning algorithm to select important instances for transfer learning with TrAdaBoost [49] and standard SVMs. In [33], Chan and Ng proposed to adapt existing Word Sense Disambiguation (WSD) systems to a target domain by using domain adaptation techniques and employing an active learning strategy [101] to actively select examples to be annotated from the target domain. Harpale and Yang [75] proposed an active learning framework for the multi-task adaptive filtering [157] problem. They first proposed to apply multi-task learning techniques to adaptive filtering, and then explored various active learning approaches to the multi-task adaptive filter to improve the performance.

• Transfer learning for other tasks. Besides classification, regression, clustering and dimensionality reduction, transfer learning has also been proposed for other tasks, such as metric learning [221, 226], structure learning [79] and online learning [227]. Zha et al. [221] proposed a regularization to learn a new distance metric in a target domain by leveraging pre-learned distance metrics from auxiliary domains. Zhang and Yeung [226] first proposed a convex formulation for multi-task metric learning by modeling the task relationships in the form of a task covariance matrix, and then adapted it to the transfer learning setting. In [79], Honorio and Samaras proposed to apply multi-task learning techniques to learn structures across multiple Gaussian graphical models simultaneously. In [227], Zhao and Hoi investigated a framework to transfer knowledge from a source domain to an online learning task in a target domain. Yang et al. [211] and Chen et al. [40] proposed probabilistic models to leverage textual information to help image clustering and image advertising, respectively.

2.7 Real-world Applications of Transfer Learning

Recently, transfer learning has been applied successfully to many application areas, such as Natural Language Processing (NLP), Information Retrieval (IR), recommendation systems, computer vision, image analysis, multimedia data mining, bioinformatics, activity recognition and wireless sensor networks.

In the fields of Natural Language Processing (NLP), transfer learning, which is known as domain adaptation, has been widely studied for solving various tasks [42], such as name entity recognition [2, 9, 71, 161], part-of-speech tagging [4, 26, 51, 185], sentiment classification [25, 29, 72, 102, 113, 156], word sense disambiguation [33, 47, 86] and information extraction [35, 38, 200].

Information Retrieval (IR) is another application area where transfer learning techniques have been widely studied and applied. Web applications of transfer learning include text classification across domains [48, 87, 151, 158, 195, 204, 207], learning to rank across domains [10, 36, 39], advertising [40, 68] and multi-domain collaborative filtering [103, 104, 177].

Besides NLP and IR, in the past two years transfer learning techniques have attracted more and more attention in the fields of computer vision, image and multimedia analysis. Applications of transfer learning in these fields include image classification [94, 115, 202, 213], image retrieval [73, 136, 201], image semantic segmentation [190], face verification from images [196], age estimation from facial images [225], video retrieval [208, 218, 223, 230], video concept detection [58, 209] and event recognition from videos [59].

In Bioinformatics, motivated by the fact that different biological entities, such as organisms and genes, may be related to each other from a biological point of view, some research works have been proposed to apply transfer learning techniques to solve various computational biological problems, such as identifying molecular associations of phenotypic responses [222], mRNA splicing [166], splice site recognition of eukaryotic genomes [199], protein subcellular location prediction [206] and genetic association analysis [214].

Transfer learning techniques have also been explored to solve WiFi localization and sensor-based activity recognition problems [210]. For example, transfer learning techniques were proposed to extract knowledge from WiFi localization models across time periods [192, 194, 217], space [138, 143] and mobile devices [229], to benefit WiFi localization tasks in other settings. Rashidi and Cook [153] and Zheng et al. [228] proposed to apply transfer learning techniques for solving indoor sensor-based activity recognition problems.

Furthermore, Zhuo et al. [235] studied how to transfer domain knowledge to learn relational action models across domains in automated planning. In [154], Raykar et al. proposed a novel Bayesian multiple-instance learning algorithm for computer aided design (CAD), which can automatically identify the relevant feature subset and use inductive transfer for learning multiple, but conceptually related, classifiers. Alamgir et al. [3] applied multi-task learning techniques to solve brain-computer interface problems. Chai et al. [32] studied how to apply a GP based multi-task learning method to solve the inverse dynamics problem for a robotic manipulator.

CHAPTER 3

TRANSFER LEARNING VIA DIMENSIONALITY REDUCTION

In this chapter, we aim at proposing a novel and general dimensionality reduction framework for transductive transfer learning. Our proposed framework can be referred to as a feature-representation-transfer approach. As reviewed in Chapter 2, most previous feature-based transductive transfer learning methods do not explicitly minimize the distance between different domains in the new feature space, which limits the transferability across domains. In contrast, in the proposed dimensionality reduction framework, one objective is to minimize the distance between domains explicitly. Another objective of the proposed framework is to preserve the properties of the original data as much as possible. Note that this objective is important for the final target learning tasks in the latent space.

This chapter is organized as follows. We first introduce our motivation and some preliminaries used in the proposed methods in Chapter 3.1 and Chapter 3.2, respectively. We then present the proposed dimensionality reduction framework in Chapter 3.3, and three algorithms based on the framework, known as Maximum Mean Discrepancy Embedding (MMDE), Transfer Component Analysis (TCA) and Semi-supervised Transfer Component Analysis (SSTCA), in Chapters 3.4-3.6, respectively. The relationship between the proposed methods and other methods is discussed in Chapter 3.7. Finally, in Chapter 4 and Chapter 5, we apply the proposed methods to two diverse applications, cross-domain WiFi localization and cross-domain text classification, respectively.

3.1 Motivation

In many real-world applications, observed data are controlled by a few latent factors. If two domains are related to each other, they may share some latent factors (or components). Some of these common latent factors may cause the data distributions between domains to be different, while others may not. Meanwhile, some of these factors may capture the intrinsic structure or discriminative information underlying the original data, while others may not. Our goal is to recover those common latent factors that do not cause much difference between the data distributions and do preserve the properties of the original data. Then the subspace spanned by these latent factors can be treated as a bridge to make knowledge transfer possible, such that the difference between the different domain data can be reduced in the latent space. We illustrate our idea using a learning-based indoor localization problem as an example, where a client moving in a WiFi environment wishes to use the received signal strength (RSS) for localization.

In an indoor building, RSS values are affected by many latent factors, such as the temperature, the building structure, the properties of access points (APs), human movement, etc. Among these hidden factors, the building structure and the properties of APs are relatively stable, while the temperature and human movement may vary over time, resulting in changes in RSS values over time. Thus, if we use the two stable factors to represent the RSS data, the distributions of the data collected in different time periods may be close to each other. This is the latent space in which we can ensure the transfer of a learned localization model from one time period to another.

Another example is text classification across domains. If two text-classification domains have different distributions but are related to each other (e.g., news articles and blogs), there may be some latent topics shared by these domains. Some of them may be relatively stable while others may not. If we use the stable latent topics to represent documents, the distance between the distributions of documents in the related domains may be small. Then, in the latent space spanned by the latent topics, we can transfer the text-classification knowledge.

Motivated by the observations mentioned above, in this chapter we propose a novel and general dimensionality reduction framework for transductive transfer learning. The key idea is to learn a low-dimensional latent feature space where the distributions of the source domain data and the target domain data are close to each other and the data properties can be preserved as much as possible. We then project the data in both domains to this latent feature space, on which we subsequently apply standard learning algorithms to train classification or regression models from the labeled source domain data to make predictions on the unlabeled target domain data. Before presenting our proposed dimensionality reduction framework in detail, we first introduce some preliminaries that are used in our proposed solutions.

3.2 Preliminaries

3.2.1 Dimensionality Reduction

Dimensionality reduction has been studied widely in the machine learning community; van der Maaten et al. [188] gave a recent survey on various dimensionality reduction methods. Traditional dimensionality reduction methods try to project the original data to a low-dimensional latent space while preserving some properties of the original data. A powerful dimensionality reduction technique is principal component analysis (PCA) and its kernelized version, kernel PCA (KPCA), which aims at extracting a low-dimensional representation of the data by maximizing the variance of the embedding in a kernel space [163]. KPCA is an unsupervised method. However, PCA and KPCA cannot directly be used to solve domain adaptation problems, since they cannot guarantee that the distributions of different domain data are similar in the reduced latent space. Thus we need to develop a new dimensionality reduction algorithm for domain adaptation.
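For reference, the KPCA embedding that the discussion above builds on can be sketched in a few lines: eigendecompose the centered kernel matrix and keep the leading components. This is a minimal illustration, not the thesis's code; the RBF kernel, its bandwidth, and the random data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Pairwise squared Euclidean distances, then the RBF kernel.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca(X, n_components=2, gamma=1.0):
    """Kernel PCA: eigendecomposition of the centered kernel matrix."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = H @ K @ H                           # centered kernel matrix
    w, V = np.linalg.eigh(Kc)                # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_components]
    # Embedding coordinates: eigenvectors scaled by sqrt(eigenvalue).
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
Z = kpca(X, n_components=2)
print(Z.shape)  # (40, 2)
```

Note that nothing in this procedure looks at which domain a sample came from, which is exactly why KPCA alone cannot guarantee that the two domains become similar in the latent space.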

Nonetheless, it has been shown that in many real-world applications, PCA or KPCA works pretty well as a preprocessing step for supervised learning tasks [186, 162].

3.2.2 Hilbert Space Embedding of Distributions

Given samples X = {x_i} and Z = {z_i} drawn from two distributions P and Q, respectively, there exist many criteria (such as the Kullback-Leibler (KL) divergence) that can be used to estimate their distance. However, many of these estimators are parametric and require an intermediate density estimate. To avoid such a non-trivial task, a non-parametric distance estimate between distributions that does not require density estimation is more desirable. Recently, such a non-parametric distance estimate was designed by embedding distributions in a reproducing kernel Hilbert space (RKHS) [172].

3.2.3 Maximum Mean Discrepancy

Gretton et al. [69] introduced the Maximum Mean Discrepancy (MMD) for comparing distributions based on the corresponding RKHS distance. The estimate of MMD between {x_1, ..., x_{n_1}} and {z_1, ..., z_{n_2}}, which follow distributions P and Q, respectively, is defined as follows:

    Dist(X, Z) = MMD(X, Z) = sup_{∥f∥_H ≤ 1} ( (1/n_1) ∑_{i=1}^{n_1} f(x_i) − (1/n_2) ∑_{i=1}^{n_2} f(z_i) ),    (3.1)

where ∥ · ∥_H is the RKHS norm. It can be shown that when the RKHS is universal [178], MMD asymptotically approaches zero if and only if the two distributions are the same. Using the fact that in an RKHS, function evaluation can be written as f(x) = ⟨ϕ(x), f⟩, where ϕ(x) : X → H, the empirical estimate of MMD can be rewritten as follows:

    Dist(X, Z) = MMD(X, Z) = ∥ (1/n_1) ∑_{i=1}^{n_1} ϕ(x_i) − (1/n_2) ∑_{i=1}^{n_2} ϕ(z_i) ∥²_H .    (3.2)

Therefore, the distance between the two distributions is simply the distance between their two mean elements in the RKHS.
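By the kernel trick, the squared empirical MMD in (3.2) can be computed from kernel evaluations alone, without ever forming ϕ explicitly. A minimal sketch follows; the RBF kernel, its bandwidth, and the Gaussian sample data are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Z, gamma=1.0):
    """Squared empirical MMD between samples X and Z (Eq. 3.2 via the kernel trick)."""
    Kxx = rbf_kernel(X, X, gamma)
    Kzz = rbf_kernel(Z, Z, gamma)
    Kxz = rbf_kernel(X, Z, gamma)
    return Kxx.mean() + Kzz.mean() - 2.0 * Kxz.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))
Z_same = rng.normal(0.0, 1.0, size=(100, 2))   # same distribution as X
Z_shift = rng.normal(3.0, 1.0, size=(100, 2))  # mean-shifted distribution
print(mmd2(X, Z_same) < mmd2(X, Z_shift))  # True: MMD grows with distribution mismatch
```

The estimate is close to zero for two samples from the same distribution and grows as the distributions drift apart, which is the behavior the universality result above guarantees asymptotically.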

3.2.4 Dependence Measure: Hilbert-Schmidt Independence Criterion

Related to the MMD, the Hilbert-Schmidt Independence Criterion (HSIC) [70] is a simple yet powerful non-parametric criterion for measuring the dependence between two sets X and Y. As its name implies, it computes the Hilbert-Schmidt norm of a cross-covariance operator in the RKHS:

    Dep(X, Y) = HSIC(X, Y) = ∥C_xy∥²_H,    (3.3)

where C_xy = (1/n) ∑_{i=1}^{n} [ (ϕ(x_i) − (1/n) ∑_{k=1}^{n} ϕ(x_k)) ⊗ (ψ(y_i) − (1/n) ∑_{k=1}^{n} ψ(y_k)) ] is the cross-covariance operator, ⊗ denotes a tensor product, ϕ : X → H, ψ : Y → H, and H is an RKHS. A (biased) empirical estimate can be easily obtained from the corresponding kernel matrices:

    HSIC(X, Y) = (1/(n − 1)²) tr(HKHK_yy),    (3.4)

where K and K_yy are kernel matrices defined on X and Y, respectively, H = I − (1/n)11⊤ is the centering matrix, and n is the number of samples in X and Y. Similar to MMD, it can be shown that if the RKHS is universal, HSIC asymptotically approaches zero if and only if X and Y are independent [178]. Conversely, a large HSIC value suggests strong dependence.

Embedding via HSIC. In embedding or dimensionality reduction, it is often desirable to preserve the local data geometry while at the same time maximally aligning the embedding with available side information (such as labels). For example, in Colored Maximum Variance Unfolding (colored MVU) [174], the local geometry is captured in the form of local distance constraints on the target embedding kernel matrix K, while the alignment with the side information (represented as a kernel matrix K_yy) is measured by the HSIC criterion in (3.4). Mathematically, this leads to the following semi-definite program (SDP) [95]:

    max_{K ≽ 0} tr(HKHK_yy)
    s.t. K_ii + K_jj − 2K_ij = d²_ij, ∀(i, j) ∈ N,
         K1 = 0,    (3.5)

where d_ij denotes the distance between x_i and x_j in the original feature space, and N is the set of neighbor pairs: if x_i and x_j are k-nearest neighbors, we denote this by (i, j) ∈ N. Note that (3.5) reduces to Maximum Variance Unfolding (MVU) [198], an unsupervised dimensionality reduction method, when no side information is given (i.e., K_yy = I). In that case, MVU aims at maximizing the variance in the high-dimensional feature space instead of the label dependence.
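The biased empirical HSIC estimate (3.4) is a one-liner once the two kernel matrices are available. The sketch below uses linear kernels and synthetic one-dimensional data as illustrative assumptions, simply to show that dependent pairs score higher than independent ones.

```python
import numpy as np

def hsic(K, Kyy):
    """Biased empirical HSIC (Eq. 3.4): tr(HKHKyy) / (n-1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ K @ H @ Kyy) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)   # strongly dependent on x
y_ind = rng.normal(size=200)             # independent of x
K = np.outer(x, x)                       # linear kernel matrix on x
print(hsic(K, np.outer(y_dep, y_dep)) > hsic(K, np.outer(y_ind, y_ind)))  # True
```

With linear kernels this reduces to the squared empirical covariance, which is why the dependent pair scores much higher; richer kernels extend the same statistic to nonlinear dependence.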

3.3 A Novel Dimensionality Reduction Framework

Recall that we are given a set of labeled source domain data D_S = {(x_S1, y_S1), ..., (x_Sn_S, y_Sn_S)}, where x_Si ∈ X is the input and y_Si ∈ Y is the corresponding output¹, and a set of unlabeled target domain data D_T = {x_T1, ..., x_Tn_T}, where the input x_Ti is also in X. Our task is then to predict the labels y_Ti's corresponding to the inputs x_Ti's in the target domain. Let P(X_S) and Q(X_T) (or P and Q in short) be the marginal distributions of the input sets X_S = {x_Si} and X_T = {x_Ti} from the source and target domains, respectively. In general, P and Q can be different.

As mentioned in Chapter 2, a major assumption in most transductive transfer learning methods is that P ≠ Q, but P(Y_S|X_S) = P(Y_T|X_T). However, in many real-world applications, the conditional probability P(Y|X) may also change across domains due to noisy or dynamic factors underlying the observed data. In this chapter, we use a weaker assumption: P ≠ Q, but there exists a transformation ϕ such that P(ϕ(X_S)) ≈ P(ϕ(X_T)). We then assume that such a ϕ satisfies P(Y_S|ϕ(X_S)) ≈ P(Y_T|ϕ(X_T)). We believe that domain adaptation under this assumption is more realistic, though also much more challenging. In this thesis, we study domain adaptation under this assumption empirically, and apply it to solve two real-world applications successfully.

Under this assumption, standard supervised learning methods can be applied on the mapped source domain data ϕ(X_S), together with the corresponding labels Y_S, to train models for use on the mapped target domain data ϕ(X_T). Hence, a key issue is how to find this transformation ϕ. Since we have no labeled data in the target domain, ϕ cannot be learned by directly minimizing the distance between P(Y_S|ϕ(X_S)) and P(Y_T|ϕ(X_T)). Instead, we propose a new dimensionality reduction framework to learn a ϕ such that (1) the distance between the marginal distributions P(ϕ(X_S)) and P(ϕ(X_T)) is small (Chapter 3.3.1), and (2) ϕ(X_S) and ϕ(X_T) preserve some important properties of X_S and X_T (Chapter 3.3.2).

¹In general, D_S may contain unlabeled data. Here, for simplicity, we assume that D_S is fully labeled.

3.3.1 Minimizing the Distance between P(ϕ(X_S)) and P(ϕ(X_T))

To reduce the distance between domains, a first objective in our dimensionality reduction framework is to discover a desired mapping ϕ by minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)), denoted by Dist(P(ϕ(X_S)), P(ϕ(X_T))).

3.3.2 Preserving Properties of X_S and X_T

However, learning the transformation ϕ by only minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)) may not be enough. An example is shown in Figure 3.1. Figure 3.1(a) shows a simple two-dimensional example, where x1 is the discriminative direction that can separate the positive and negative samples, while x2 is a noisy dimension with small variance. For both the source and target domains, positive and negative samples are shown, with the source domain data in red and the target domain data in blue. By focusing only on minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)), one would select the noisy component x2, which is completely irrelevant to the target supervised learning task. Hence, besides reducing the distance between the two marginal distributions, ϕ should also preserve data properties that are useful for the target supervised learning task. An obvious choice is to maximally preserve the data variance, as is performed by the well-known PCA and KPCA (Chapter 3.2.1). However, focusing only on the data variance is again not desirable in domain adaptation. As shown in Figure 3.1(b), the direction with the largest variance (x1) cannot be used to reduce the distance between the distributions across domains and is thus not useful in boosting the performance of domain adaptation.

[Figure 3.1: Motivating examples for the dimensionality reduction framework. (a) Only minimizing the distance between P(ϕ(X_S)) and P(ϕ(X_T)); (b) only maximizing the data variance.]

Thus, in transductive transfer learning, an effective dimensionality reduction framework should ensure that, in the reduced latent space, the distance between the data distributions across domains is reduced and, at the same time, the data properties are preserved as much as possible.
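The two failure modes can be reproduced numerically in a few lines. The sketch below uses synthetic numbers of my own choosing (not the thesis's data): the mean gap along an axis stands in for the domain distance, and per-axis variance for the PCA criterion.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# x1: discriminative direction, shifted across domains; x2: low-variance noise.
src = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 0.1, n)])
tgt = np.column_stack([rng.normal(2.0, 1.0, n), rng.normal(0.0, 0.1, n)])

def mean_gap(a, b):
    # Simple proxy for the domain distance along one axis.
    return abs(a.mean() - b.mean())

gaps = [mean_gap(src[:, d], tgt[:, d]) for d in (0, 1)]
variances = src.var(axis=0)
print(int(np.argmin(gaps)))      # 1: min-distance alone keeps only the noise axis
print(int(np.argmax(variances))) # 0: max-variance alone keeps x1, ignoring the shift
```

Each single-objective criterion picks a different, and differently useless, direction, which is precisely why the framework combines the two objectives.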

Then standard supervised learning techniques can be applied in the reduced latent space to train classification or regression models from the labeled source domain data for use in the target domain. The proposed framework is summarized in Algorithm 3.1.

Algorithm 3.1 A dimensionality reduction framework for transfer learning
Require: A labeled source domain data set D_S = {(x_Si, y_Si)}; an unlabeled target domain data set D_T = {x_Ti}.
Ensure: Predicted labels Y_T of the unlabeled data X_T in the target domain.
1: Learn a transformation mapping ϕ such that Dist(ϕ(X_S), ϕ(X_T)) is small, and ϕ(X_S) and ϕ(X_T) preserve properties of X_S and X_T, respectively.
2: Train a classification or regression model f on ϕ(X_S) with the corresponding labels Y_S.
3: For the unlabeled data x_Ti's in D_T, map them to the latent space to get new representations ϕ(x_Ti)'s; then use the model f to make predictions f(ϕ(x_Ti))'s.
4: return ϕ and the f(ϕ(x_Ti))'s.

As can be seen, the first step is the key step: once the transformation mapping ϕ is learned, one just needs to apply existing machine learning and data mining methods to the mapped data ϕ(X_S) with the corresponding labels Y_S to train models for use in the target domain. Therefore, one advantage of this framework is that most existing machine learning methods can be easily integrated into it. Another advantage is that it works for diverse machine learning tasks, such as classification and regression problems.

3.4 Maximum Mean Discrepancy Embedding (MMDE)

As introduced in Chapter 3.2, on using the empirical measure of MMD in (3.2), the distance between the two distributions P(ϕ(X_S)) and P(ϕ(X_T)) can be empirically measured by the (squared) distance between the empirical means of the two domains:

    Dist(ϕ(X_S), ϕ(X_T)) = ∥ (1/n_S) ∑_{i=1}^{n_S} φ∘ϕ(x_Si) − (1/n_T) ∑_{i=1}^{n_T} φ∘ϕ(x_Ti) ∥²_H,    (3.6)

for some φ ∈ H, where φ is the feature map induced by a universal kernel and we denote φ(ϕ(x)) by φ∘ϕ(x). As a result, a desired nonlinear mapping ϕ can be found by minimizing the quantity in (3.6). However, φ is usually highly nonlinear, so a direct optimization of (3.6) with respect to ϕ may be intractable and can get stuck in poor local minima. Note that, in practice, the corresponding kernel may not need to be universal [173]. Therefore, in this section, we propose a kernel-based dimensionality reduction method, called Maximum Mean Discrepancy Embedding (MMDE) [135], to construct the ϕ implicitly.

3.4.1 Kernel Learning for the Transfer Latent Space

Instead of finding the transformation ϕ explicitly,

we translate the problem of learning ϕ into the problem of learning the composite feature map φ∘ϕ, which, by the kernel trick, makes the optimization problem tractable. Before describing the details of MMDE, we first introduce the following lemma.

Lemma 1. Let φ be the feature map induced by a kernel. Then φ∘ϕ is also the feature map induced by a kernel for any arbitrary map ϕ.

Proof. Denote x′ = ϕ(x). For any finite sample X = {x_1, ..., x_n}, let X′ = {ϕ(x_1), ..., ϕ(x_n)} = {x′_1, ..., x′_n}. Since φ is the feature map induced by a kernel, we can write ⟨φ∘ϕ(x_i), φ∘ϕ(x_j)⟩ = ⟨φ(x′_i), φ(x′_j)⟩ = k(x_i, x_j), where k is the corresponding kernel, and the matrix [k(x_i, x_j)] is positive semi-definite for the sample X′. By Mercer's theory [164], k(·, ·) is a valid kernel function. Thus, φ∘ϕ is also the feature map induced by a kernel for any arbitrary map ϕ.

Given that φ ∈ H, our goal becomes finding the feature map φ∘ϕ of some kernel such that (3.6) is minimized. By using the kernel trick, k(x_i, x_j) = ⟨φ∘ϕ(x_i), φ∘ϕ(x_j)⟩, Equation (3.6) can be written in terms of the kernel matrices defined by k, as:

    Dist(X′_S, X′_T) = (1/n_S²) ∑_{i,j=1}^{n_S} k(x_Si, x_Sj) + (1/n_T²) ∑_{i,j=1}^{n_T} k(x_Ti, x_Tj) − (2/(n_S n_T)) ∑_{i=1}^{n_S} ∑_{j=1}^{n_T} k(x_Si, x_Tj) = tr(KL),    (3.7)

where

    K = [ K_S,S  K_S,T ; K_T,S  K_T,T ] ∈ R^{(n_S+n_T)×(n_S+n_T)}    (3.8)

is a composite kernel matrix, with K_S and K_T being the kernel matrices defined by k on the data in the source and target domains, respectively, and L = [L_ij] ≽ 0 with L_ij = 1/n_S² if x_i, x_j ∈ X_S; L_ij = 1/n_T² if x_i, x_j ∈ X_T; and L_ij = −1/(n_S n_T) otherwise.

Recall that in transductive transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain. Moreover, in the transductive setting², when some new test data arrive, they must be classified together with all of the existing data; therefore, one can always find the corresponding samples of the test data in the latent space. Thus, instead of learning the universal kernel k, we can learn the kernel matrix K by minimizing the distance (measured w.r.t. the MMD) between the projected source and target domain data.

²Note that, here, the word "transductive" is taken from the transductive learning setting [90] in traditional machine learning, which refers to the situation where all test data are required to be seen at training time and the learned model cannot be reused for future data. In transfer learning, the term "transductive" is instead used to emphasize the concept that, in this type of transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain.
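The identity Dist(X′_S, X′_T) = tr(KL) in (3.7) can be checked numerically. A short sketch, with an RBF kernel and Gaussian samples as illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
nS, nT = 30, 20
XS = rng.normal(0.0, 1.0, size=(nS, 2))
XT = rng.normal(1.0, 1.0, size=(nT, 2))

# Composite kernel matrix K (Eq. 3.8) over the pooled source and target samples.
X = np.vstack([XS, XT])
K = rbf_kernel(X, X)

# Coefficient matrix L: blocks 1/nS^2, 1/nT^2 and -1/(nS*nT).
L = np.zeros((nS + nT, nS + nT))
L[:nS, :nS] = 1.0 / nS**2
L[nS:, nS:] = 1.0 / nT**2
L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)

# Direct squared-MMD computation for comparison.
mmd2 = (rbf_kernel(XS, XS).mean() + rbf_kernel(XT, XT).mean()
        - 2.0 * rbf_kernel(XS, XT).mean())
print(np.isclose(np.trace(K @ L), mmd2))  # True
```

Since tr(KL) is linear in K, the domain distance becomes a linear function of the quantity being learned, which is what makes the kernel-matrix formulation of MMDE tractable.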

Meanwhile, besides minimizing tr(KL), we also impose the following objectives/constraints to preserve the properties of the original data:

1. The distances are preserved, i.e., K_ii + K_jj − 2K_ij = d²_ij for all i, j with (i, j) ∈ N, as proposed in MVU and colored MVU.

2. The embedded data are centered, i.e., K1 = 0, where 0 and 1 are the vectors of zeros and ones, respectively. The centering constraint is a standard condition for PCA being applied to the resultant kernel matrix.

3. The trace of K is maximized, which aims to preserve as much of the variance in the feature space as possible. As mentioned in Chapter 3.2, similar to MVU, maximizing the trace of K is equivalent to maximizing the variance of the embedded data.

The optimization problem of MMDE can then be written as:

    min_{K ≽ 0} tr(KL) − λ tr(K)
    s.t. K_ii + K_jj − 2K_ij = d²_ij, ∀(i, j) ∈ N,
         K1 = 0,    (3.9)

where the first term in the objective minimizes the distance between the distributions, the second term maximizes the variance in the feature space, and λ ≥ 0 is a trade-off parameter. Computationally, this leads to an SDP involving K [95]. After learning the kernel matrix K in (3.9), we can apply PCA to the resultant kernel matrix and select the leading eigenvectors to reconstruct the desired mapping φ∘ϕ implicitly; in other words, PCA on the learned kernel matrix reconstructs the low-dimensional representations X′_S and X′_T of X_S and X_T, respectively, in a latent space shared across domains.

Note that the optimization problem (3.9) is similar to that of colored MVU, as introduced in Chapter 3.2. However, there are two major differences between MMDE and colored MVU. First, the L matrix in colored MVU is a kernel matrix that encodes the label information of the data, while the L matrix in MMDE can be treated as a kernel matrix that encodes the distribution information of the different data sets. Second, besides minimizing tr(KL), MMDE also aims to unfold the high-dimensional data by maximizing the trace of K. In Chapter 3.7, we will discuss the relationship between these methods in detail.

3.4.2 Making Predictions in the Latent Space

After obtaining the new representations X′_S and X′_T, we can train a classification or regression model f on X′_S with the corresponding labels Y_S. This model can then be used to make predictions on X′_T, as y_Ti = f(x′_Ti). However, since we do not learn an explicit mapping to project the original data X_S and X_T to the embeddings X′_S and X′_T, we need to apply other techniques to make predictions on out-of-sample test data in the target domain. Here, we use the method of harmonic functions [234] to estimate the labels of new test data in the target domain:

    f_i = ( ∑_{j∈N} w_ij f_j ) / ( ∑_{j∈N} w_ij ),    (3.10)

where N is the set of k nearest neighbors of x_Ti in X_T, w_ij is the similarity between x_Ti and x_Tj, which can be obtained based on the Euclidean distance or other similarity measures, and the f_j's are the predicted labels of the x_Tj's. The MMDE algorithm for transfer learning is summarized in Algorithm 3.2.

Algorithm 3.2 Transfer learning via Maximum Mean Discrepancy Embedding (MMDE)
Require: A labeled source domain data set D_S = {(x_Si, y_Si)}; an unlabeled target domain data set D_T = {x_Ti}; λ > 0.
Ensure: Predicted labels Y_T of the unlabeled data X_T in the target domain.
1: Solve the SDP problem in (3.9) to obtain a kernel matrix K.
2: Apply PCA to the learned K to get new representations {x′_Si} and {x′_Ti} of the original data {x_Si} and {x_Ti}, respectively.
3: Learn a classifier or regressor f : x′_Si → y_Si.
4: For the unlabeled data x_Ti's in D_T, use the learned classifier or regressor to predict their labels.
5: For new test data x_T's in the target domain, use harmonic functions with {x_Ti, f(x′_Ti)} to make predictions, as defined in (3.10).
6: return Predicted labels Y_T.

3.4.3 Summary

In MMDE, we translate the problem of learning the transformation mapping ϕ in the dimensionality reduction framework into a kernel matrix learning problem. This translation makes the problem of learning ϕ tractable. Moreover, since MMDE learns the kernel matrix from the data automatically, it can fit the data well. As can be seen in Algorithm 3.2, the key step of MMDE is to apply an SDP solver to the optimization problem in (3.9). Standard PCA is then applied to the resultant kernel matrix to reconstruct low-dimensional representations of the different domain data, a model is trained on X′_S with the corresponding labels Y_S, and the method of harmonic functions in (3.10) is used to make predictions on unseen target domain data. In general, since there are O((n_S + n_T)²) variables in K, the overall time complexity of solving the SDP is O((n_S + n_T)^6.5) [129].
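The out-of-sample rule (3.10) can be sketched as a similarity-weighted vote over the k nearest already-labeled target points. The RBF similarity, the value of k, and the synthetic data below are illustrative assumptions, not prescribed by the thesis.

```python
import numpy as np

def harmonic_predict(x_new, XT, fT, k=5, gamma=1.0):
    """Estimate the label of x_new as the similarity-weighted average of the
    predicted labels of its k nearest neighbors in the target domain (Eq. 3.10)."""
    d2 = ((XT - x_new) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                  # indices of the k nearest neighbors
    w = np.exp(-gamma * d2[nn])              # similarities w_ij
    return float(np.dot(w, fT[nn]) / w.sum())

rng = np.random.default_rng(0)
XT = rng.normal(size=(50, 2))                # seen target-domain points
fT = (XT[:, 0] > 0).astype(float)            # labels already predicted for them
x_new = np.array([2.0, 0.0])                 # unseen point, deep in the positive region
pred = harmonic_predict(x_new, XT, fT)
print(pred)
```

Because every neighbor of the unseen point lies on the positive side here, the weighted average lands near 1; in general the rule simply interpolates the neighbors' predicted labels.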

However, MMDE has several limitations. First, it is transductive and cannot generalize to unseen data: for any unseen target domain data, we need to apply the MMDE method again on the source domain data and the new target domain data to reconstruct their representations, or resort to other techniques, such as the method of harmonic functions, to make predictions on the out-of-sample data. Second, the criterion (3.9) in MMDE requires K to be positive semi-definite, and the resultant kernel learning problem has to be solved by expensive SDP solvers; the overall time complexity is O((n_S + n_T)^6.5) in general, which becomes computationally prohibitive even for small-sized problems. Third, the obtained K has to be further post-processed by PCA, which may discard potentially useful information in K.

3.5 Transfer Component Analysis (TCA)

In order to overcome the limitations of MMDE described in the previous section, in this section we propose an efficient framework to find the nonlinear mapping φ∘ϕ based on empirical kernel feature extraction. In particular, instead of using a two-step approach as in MMDE, we propose a unified kernel learning method which utilizes an explicit low-rank representation. It avoids the use of SDPs, and thus their high computational burden, and the learned kernel can be generalized to out-of-sample data directly.

3.5.1 Parametric Kernel Map for Unseen Data

First, note that the kernel matrix K in (3.8) can be decomposed as K = (KK^{−1/2})(K^{−1/2}K), which is often known as the empirical kernel map [163]. Consider the use of a (n_S + n_T) × m matrix W̃ that transforms the empirical kernel map features to an m-dimensional space (where m ≪ n_S + n_T). The resultant kernel matrix³ is then

    K̃ = (KK^{−1/2}W̃)(W̃⊤K^{−1/2}K) = KWW⊤K,    (3.11)

where W = K^{−1/2}W̃ ∈ R^{(n_S+n_T)×m}. Moreover, the corresponding kernel evaluation between any two patterns x_i and x_j is given by

    k̃(x_i, x_j) = k⊤_{x_i} WW⊤ k_{x_j},    (3.12)

where k_x = [k(x_1, x), ..., k(x_{n_S+n_T}, x)]⊤ ∈ R^{n_S+n_T}.

³As is common practice, one can ensure that the kernel matrix K is positive definite by adding a small ϵ > 0 to its diagonal [135].
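The parametric kernel (3.11)-(3.12) agrees with the in-sample kernel matrix by construction, which is what makes out-of-sample evaluation "free". A quick numerical check follows; the RBF base kernel, the random data, and the random W (standing in for a learned transformation) are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))        # pooled source + target samples
K = rbf_kernel(X, X)
W = rng.normal(size=(25, 4))        # stands in for a learned (nS+nT) x m matrix

K_tilde = K @ W @ W.T @ K           # Eq. (3.11): low-rank parametric kernel matrix

def k_tilde(x, z):
    # Eq. (3.12): out-of-sample evaluation via kernel vectors against the training set.
    kx = rbf_kernel(x[None, :], X).ravel()
    kz = rbf_kernel(z[None, :], X).ravel()
    return kx @ W @ W.T @ kz

# For training points, (3.12) reproduces the entries of (3.11).
print(np.isclose(k_tilde(X[3], X[7]), K_tilde[3, 7]))  # True
```

Note that K_tilde has rank at most m, which is the explicit low-rank structure that lets TCA replace the SDP over a full kernel matrix with an optimization over the much smaller W.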

Hence, the kernel k̃ in (3.12) facilitates a readily parametric form for out-of-sample kernel evaluations.

3.5.2 Unsupervised Transfer Component Extraction

Combining the parametric kernel representations for the distance between distributions and for the data variance, we now develop a new dimensionality reduction method such that, in the latent space spanned by the learned components, the variance of the data can be preserved as much as possible and the distance between the distributions of the different domains can be reduced. On using the definition of K̃ in (3.11), the MMD distance between the empirical means of the two domains X′_S and X′_T can be rewritten as:

    Dist(X′_S, X′_T) = tr((KWW⊤K)L) = tr(W⊤KLKW).    (3.13)

In minimizing the objective (3.13), a regularization term tr(W⊤W) is usually needed to control the complexity of W. As will be shown later in this section, this regularization term can also avoid the rank deficiency of the denominator in the generalized eigenvalue decomposition. Besides reducing the distance between the two marginal distributions in (3.13), we also need to preserve data properties such as the data variance, as is performed by the well-known PCA and KPCA (Chapter 3.2.1). Note from (3.11) that the embedding of the data in the latent space is W⊤K, where the i-th column [W⊤K]_i provides the embedding coordinates of x_i. Hence, the variance of the projected samples is W⊤KHKW, where H = I_{n_S+n_T} − (1/(n_S + n_T))11⊤ is the centering matrix, 1 ∈ R^{n_S+n_T} is the column vector with all ones, and I_{n_S+n_T} ∈ R^{(n_S+n_T)×(n_S+n_T)} is the identity matrix. The kernel learning problem then becomes:

    min_W tr(W⊤KLKW) + µ tr(W⊤W)
    s.t. W⊤KHKW = I_m,    (3.14)

where µ > 0 is a trade-off parameter and I_m ∈ R^{m×m} is the identity matrix. For notational simplicity, we will drop the subscript m from I_m in the sequel. Though this optimization problem involves a non-convex norm constraint W⊤KHKW = I, it can still be solved efficiently via the following trace optimization problem.

Proposition 1. The optimization problem (3.14) can be re-formulated as

    min_W tr((W⊤KHKW)† W⊤(KLK + µI)W),    (3.15)

or, equivalently,

    max_W tr((W⊤(KLK + µI)W)^{−1} W⊤KHKW).    (3.16)

Proof. The Lagrangian of (3.14) is

    tr(W⊤(KLK + µI)W) − tr((W⊤KHKW − I)Z),    (3.17)

where Z is a diagonal matrix containing the Lagrange multipliers. Setting the derivative of (3.17) w.r.t. W to zero, we obtain

    (KLK + µI)W = KHKWZ.    (3.18)

Multiplying both sides on the left by W⊤, solving for Z, and substituting it back into (3.17), we obtain the equivalent trace maximization problem (3.16).

Similar to kernel Fisher discriminant analysis (KFD) [126, 216], the W that solves (3.16) consists of the m leading eigenvectors of (KLK + µI)^{−1}KHK, where m ≤ n_S + n_T − 1; note that the matrix KLK + µI is non-singular thanks to the regularization term. In the sequel, this method will be referred to as Transfer Component Analysis (TCA), and the extracted components are called the transfer components. Computing the m leading eigenvectors takes only O(m(n_S + n_T)²) time [175], which is much more efficient than MMDE. Furthermore, TCA can be generalized to out-of-sample data: once W is learned, new test data x_T from the target domain can be mapped to the latent space directly as x′ = κW, where κ is a row vector with κ_t = k(x_T, x_t), t = 1, ..., n_S + n_T. The TCA algorithm for transfer learning is summarized in Algorithm 3.3.

Algorithm 3.3 Transfer learning via Transfer Component Analysis (TCA)
Require: A source domain data set D_S = {(x_Si, y_Si)}_{i=1}^{n_S} and a target domain data set D_T = {x_Tj}_{j=1}^{n_T}.
Ensure: Transformation matrix W and predicted labels Y_T of the unlabeled data X_T in the target domain.
1: Construct the kernel matrix K from {x_Si}_{i=1}^{n_S} and {x_Tj}_{j=1}^{n_T} based on (3.8), the matrix L from (3.7), and the centering matrix H.
2: Compute the matrix (KLK + µI)^{−1}KHK.
3: Perform an eigen-decomposition and select the m leading eigenvectors to construct the transformation matrix W.
4: Map the data x_Si's and x_Tj's to x′_Si's and x′_Tj's via X′_S = [K_S,S K_S,T]W and X′_T = [K_T,S K_T,T]W, respectively.
5: Train a model f on the x′_Si's with the y_Si's.
6: For new test data x_T from the target domain, map them to the latent space as x′ = κW, where κ_t = k(x_T, x_t), t = 1, ..., n_S + n_T, and use f to make predictions f(x′).
7: return the transformation matrix W and the predictions f(x′)'s.

As can be seen in the algorithm, the key steps of TCA are the second and third: performing an eigen-decomposition of the matrix (KLK + µI)^{−1}KHK to find the m leading eigenvectors that construct the transformation W.
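Algorithm 3.3 can be sketched compactly. The following is a sketch of the procedure, not the thesis's reference implementation; the linear kernel, µ = 1, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def tca(XS, XT, m=2, mu=1.0):
    """Sketch of Transfer Component Analysis (Algorithm 3.3) with a linear kernel."""
    nS, nT = len(XS), len(XT)
    n = nS + nT
    X = np.vstack([XS, XT])
    K = X @ X.T                                    # composite kernel matrix (Eq. 3.8)
    # MMD coefficient matrix L (Eq. 3.7).
    L = np.zeros((n, n))
    L[:nS, :nS] = 1.0 / nS**2
    L[nS:, nS:] = 1.0 / nT**2
    L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    # m leading eigenvectors of (KLK + mu*I)^{-1} KHK give W (Proposition 1).
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    w, V = np.linalg.eig(M)
    idx = np.argsort(-w.real)[:m]
    W = V[:, idx].real
    Z = K @ W                                      # embeddings of all n samples
    return Z[:nS], Z[nS:], W

rng = np.random.default_rng(0)
XS = rng.normal(0.0, 1.0, size=(30, 3))
XT = rng.normal(1.0, 1.0, size=(20, 3))
ZS, ZT, W = tca(XS, XT)
print(ZS.shape, ZT.shape)  # (30, 2) (20, 2)
```

A trained classifier would then be fit on ZS with the source labels and applied to ZT, exactly as in steps 5-7 of the algorithm; new target points are embedded via their kernel vector κ against the pooled training set.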

3.5.3 Experiments on Synthetic Data

As described in the previous sections, TCA has two main objectives: minimizing the distance between domains and maximizing the data variance in the latent space. In this section, we perform experiments on synthetic data to demonstrate the effectiveness of these two objectives of TCA in learning a 1D latent space from 2D data. For TCA, we use the linear kernel on the inputs and fix µ = 1. We further apply the one-nearest-neighbor (1-NN) classifier to make predictions on the target domain data. For a fuller test of the effectiveness of TCA, we will conduct more detailed experiments comparing with other existing methods on two real-world datasets in Chapter 4 and Chapter 5.

Only Minimizing the Distance between Distributions. As discussed in Chapter 3.3, it is not desirable to learn the transformation ϕ by only minimizing the distance between the marginal distributions P(ϕ(X_S)) and P(ϕ(X_T)). Here, we reproduce Figure 3.1(a) in Figure 3.2(a), and compare TCA with the method of stationary subspace analysis (SSA), as introduced in Chapter 2, which is an empirical method to find an identical stationary latent space for the source and target domain data. As can be seen from Figures 3.2(b) and 3.2(c), the distance between the distributions of the different domain data in the one-dimensional space learned by SSA is small; however, the positive and negative samples overlap in the latent space, which is not useful for making predictions on the mapped target domain data. On the other hand, although the distance between the distributions of the different domain data in the latent space learned by TCA is larger than that learned by SSA, the two classes are now more separated. Thus, TCA leads to significantly better accuracy than SSA.

Only Maximizing the Data Variance. As discussed in Chapter 3.3, learning the transformation ϕ by only maximizing the data variance may also not be useful for domain adaptation. Here, we illustrate this by using the synthetic data from the example in Figure 3.1(b), which is reproduced in Figure 3.3(a). As can be seen from Figures 3.3(b) and 3.3(c), the variance of the mapped data in the 1D space learned by PCA is very large; however, the distance between the mapped data across the different domains is still large, and the positive and negative samples overlap in the latent space, which is not useful for domain adaptation. On the other hand, although the variance of the mapped data in the 1D space learned by TCA is smaller than that learned by PCA, the distance between the different domain data in the latent space is reduced and the positive and negative samples are more separated.

Figure 3.2: Illustrations of the proposed TCA and SSA on synthetic dataset 1. (a) Data set 1 (acc: 82%); (b) 1D projection by SSA (acc: 60%); (c) 1D projection by TCA (acc: 86%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

3.5.4 Summary

In TCA, we propose to learn a low-rank parametric kernel for transfer learning instead of the entire kernel matrix. As can be seen from the algorithm, the parametric kernel is a composite kernel that consists of an empirical kernel and a linear transformation. Compared to MMDE, TCA has two advantages. (1) It is much more efficient: TCA only requires a simple and efficient eigenvalue decomposition, which takes only O(m(nS + nT)²) time when m nonzero eigenvectors are to be extracted. (2) It can be generalized to out-of-sample target domain data naturally: once the transformation W is learned, data from the source and target domains, including unseen data, can be mapped to the latent space directly. However, parameter values of the empirical kernel need to be tuned by human experience or by using the cross-validation method, which may be sensitive to different application areas.
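The eigenvalue computation just summarized can be sketched as follows. This is our own illustrative implementation with a linear kernel; the function name and simplifications are ours, and it is not the code used in the experiments:

```python
import numpy as np

def tca_fit(Xs, Xt, m=1, mu=1.0):
    """Sketch of unsupervised TCA with a linear kernel (our assumptions).

    Eigendecomposes (K L K + mu I)^{-1} K H K and keeps the m leading
    eigenvectors as the transformation W; the embedded data are the
    rows of K W.
    """
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    K = X @ X.T                                     # empirical (linear) kernel
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)                              # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)[:m]
    W = eigvecs[:, order].real                      # (n x m) transformation
    return K, W
```

Because W is an explicit matrix rather than a learned kernel matrix, new points only require kernel evaluations against the training data, which is the out-of-sample property noted above.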

Figure 3.3: Illustrations of the proposed TCA and PCA on synthetic dataset 2. (a) Data set 2 (accuracy: 50%); (b) 1D projection by PCA (acc: 48%); (c) 1D projection by TCA (acc: 82%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

3.6 Semi-Supervised Transfer Component Analysis (SSTCA)

As Ben-David et al. [16], Blitzer et al. [24] and Mansour et al. [121] mentioned in their works, a good representation should (1) reduce the distance between the distributions of the source and target domain data, and (2) minimize the empirical error on the labeled data in the source domain (recall that in our setting, there is no labeled data in the target domain). However, the unsupervised TCA proposed in Chapter 3.5 does not consider the label information in learning the components. As a result, the transfer components learned by TCA may not be useful for the target classification task. As shown in Figure 3.4, the direction with the largest variance (x1) is orthogonal to the discriminative direction (x2). A solution to overcome this is to encode the source domain label information into the embedding learning, such that the learned components are discriminative to the labels in the source domain and can be used to reduce the distance between the source and target domains as well.

Figure 3.4: The direction with the largest variance is orthogonal to the discriminative direction.

Moreover, in many real-world applications (such as WiFi localization), there exists an intrinsic low-dimensional manifold underlying the high-dimensional observations (Figure 3.5). The effective use of manifold information is an important component in many semi-supervised learning algorithms [34, 233]. Note that in traditional semi-supervised learning settings [233, 34], the labeled and unlabeled data are from the same domain. However, in the context of domain adaptation here, the labeled and unlabeled data are from different domains.

In this section, we extend the unsupervised TCA in Chapter 3.5 to the semi-supervised learning setting. Motivated by the kernel target alignment [43], a representation that maximizes its dependence with the data labels may lead to better generalization performance. Hence, we can maximize the label dependence instead of minimizing the empirical error (Chapter 3.6.1). Moreover, we encode the manifold structure into the embedding learning so as to propagate label information from the labeled (source domain) data to the unlabeled (target domain) data (Chapter 3.6.1).

Figure 3.5: There exists an intrinsic manifold structure underlying the observed data.

3.6.1 Optimization Objectives

In this section, we delineate three desirable properties for this semi-supervised embedding, namely, (1) maximal alignment of distributions between the source and target domain data in the embedded space, (2) high dependence on the label information, and (3) preservation of the local geometry.

Objective 1: Distribution Matching

As in the unsupervised TCA, our first objective is to minimize the MMD (3.13) between the source and target domain data in the embedded space.

Objective 2: Label Dependence

Our second objective is to maximize the dependence (measured w.r.t. HSIC) between the embedding and the labels. Recall that while the source domain data are fully labeled, the target domain data are unlabeled. We propose to maximally align the embedding (which is represented by K~ in (3.11)) with

    K~yy = γ Kl + (1 − γ) Kv,                                  (3.19)

where γ ≥ 0,

    [Kl]ij = { kyy(yi, yj)  if i, j ≤ nS;  0 otherwise },      (3.20)

and

    Kv = I,                                                    (3.21)

which is in line with MVU [198]. Note that γ is a tradeoff parameter that balances the label dependence and data variance terms: Kl in (3.20) serves to maximize the label dependence on the labeled data, while Kv in (3.21) serves to maximize the variance on both the source and target domain data. Intuitively, if there are sufficient labeled data in the source domain, the dependence between features and labels can be estimated more precisely via HSIC, and a large γ may be used. Otherwise, when there are only a few labeled data in the source domain and a lot of unlabeled data in the target domain, we may use a small γ. Empirically, simply setting γ = 0.5 works well on all the data sets. The sensitivity of the performance to γ will be studied in more detail in Chapters 4 and 5.

By substituting K~ (3.11) and K~yy (3.19) into HSIC (3.4), our objective is thus to maximize

    tr(H (K W W⊤ K) H K~yy) = tr(W⊤ K H K~yy H K W).           (3.22)
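As a concrete sketch of (3.19)-(3.21), the label kernel can be assembled as below. This is our own minimal implementation; the linear label kernel on +/-1 labels is an assumed choice of kyy, and all names are ours:

```python
import numpy as np

def kyy_tilde(y_src, n_total, gamma=0.5):
    """Build K~yy = gamma*Kl + (1-gamma)*Kv as in (3.19)-(3.21).

    Kl carries label agreement only on the nS labeled source points
    (a linear label kernel on +/-1 labels, our illustrative choice);
    Kv = I keeps the variance term on all points.
    """
    ns = len(y_src)
    Kl = np.zeros((n_total, n_total))
    Kl[:ns, :ns] = np.outer(y_src, y_src)   # kyy(yi, yj) for i, j <= nS
    Kv = np.eye(n_total)                    # identity: variance term
    return gamma * Kl + (1.0 - gamma) * Kv
```

With gamma = 0 the label block vanishes and only the variance term remains, matching the reduction of SSTCA to TCA discussed later.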

Objective 3: Locality Preserving

As reviewed in Chapters 2 and 3.4, Colored MVU and MMDE preserve the local geometry of the manifold by enforcing distance constraints on the desired kernel matrix K. More specifically, let N = {(xi, xj)} be the set of sample pairs that are k-nearest neighbors of each other, and let dij = ∥xi − xj∥ be the distance between xi, xj in the original input space. For each (xi, xj) in N, a constraint Kii + Kjj − 2Kij = d²ij is added to the optimization problem. Hence, the resultant SDP will typically have a very large number of constraints. To avoid this problem, we make use of the locality preserving property of the Laplacian Eigenmap [13]. First, we construct a graph with the affinity mij = exp(−d²ij / 2σ²) if xi is one of the k nearest neighbors of xj, or vice versa, and mij = 0 otherwise. Let M = [mij]. The graph Laplacian matrix is ℒ = D − M, where D is the diagonal matrix with entries dii = Σj mij. Note from (3.11) that the embedding of the data in Rᵐ is W⊤K, where the ith column [W⊤K]i provides the embedding coordinates of xi. Intuitively, if xi, xj are neighbors in the input space, the distance between the embedding coordinates of xi and xj should be small. Hence, our third objective is to minimize

    Σ_{(i,j)∈N} mij ∥[W⊤K]i − [W⊤K]j∥² = tr(W⊤KℒKW).           (3.23)

3.6.2 Formulation and Optimization Procedure

In this section, we present how to combine the three objectives to find a W that maximizes (3.22) while simultaneously minimizing (3.13) and (3.23). The final optimization problem can be written as

    min_W  tr(W⊤KLKW) + µ tr(W⊤W) + (λ/n²) tr(W⊤KℒKW)          (3.24)
    s.t.   W⊤KH K~yy HKW = I,

where L is the MMD matrix used in (3.13), ℒ is the graph Laplacian defined above, λ ≥ 0 is another tradeoff parameter, and n² = (nS + nT)² is a normalization term. For simplicity, we use λ to denote λ/n² in the rest of this paper. It is well known that (3.24) can be formulated as the following quotient trace problem:

    max_W  tr{ (W⊤K(L + λℒ)KW + µI)⁻¹ (W⊤KH K~yy HKW) },       (3.25)

which can be solved by eigendecomposing (K(L + λℒ)K + µI)⁻¹ KH K~yy HK. In the sequel, this will be referred to as Semi-Supervised Transfer Component Analysis (SSTCA). The procedure for both the unsupervised and semi-supervised TCA is summarized in Algorithm 3.2.
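The affinity graph and Laplacian entering the locality-preserving term can be sketched as follows. This is an illustrative implementation; the neighborhood size and bandwidth defaults are our own placeholders:

```python
import numpy as np

def knn_graph_laplacian(X, k=2, sigma=1.0):
    """Graph Laplacian L = D - M for the locality-preserving objective.

    m_ij = exp(-d_ij^2 / (2 sigma^2)) when x_i is among the k nearest
    neighbours of x_j, or vice versa; otherwise m_ij = 0. The diagonal
    of D holds the row sums of M (a sketch; parameter values are ours).
    """
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared dists
    M = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(D2[:, j])[1:k + 1]                  # skip self
        M[nbrs, j] = np.exp(-D2[nbrs, j] / (2.0 * sigma ** 2))
    M = np.maximum(M, M.T)                                    # "or vice versa"
    return np.diag(M.sum(axis=1)) - M
```

Each row of the resulting Laplacian sums to zero, which is what makes the quadratic form in (3.23) penalize only differences between neighboring embedding coordinates.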

Algorithm 3.2 Transfer learning via Semi-Supervised Transfer Component Analysis (SSTCA).
Require: Source domain data set DS = {(xSi, ySi)}, i = 1, ..., nS, and target domain data set DT = {xTj}, j = 1, ..., nT.
Ensure: Transformation matrix W and predicted labels YT of the unlabeled data XT in the target domain.
1: Construct kernel matrix K from {xSi} and {xTj} based on (3.7), matrix L from (3.8), and centering matrix H.
2: Compute the matrix (K(L + λℒ)K + µI)⁻¹ KH K~yy HK.
3: Do eigen-decomposition and select the m leading eigenvectors to construct the transformation matrix W.
4: Map the data xSi's and xTj's to x′Si's and x′Tj's via X′S = [K_{S,S} K_{S,T}]W and X′T = [K_{T,S} K_{T,T}]W.
5: Train a model f on the x′Si's with the ySi's.
6: For new test data xT from the target domain, map it to the latent space via x′ = κW, where κ is a row vector with κt = k(xT, xt), t = 1, ..., nS + nT.
7: return transformation matrix W and the predictions f(x′Tj)'s.

As can be seen in the algorithm, the procedure of SSTCA is very similar to that of TCA. The only difference between these algorithms is that in SSTCA, the matrix that needs to be eigen-decomposed is (K(L + λℒ)K + µI)⁻¹ KH K~yy HK instead of (KLK + µI)⁻¹ KHK. Consequently, the overall time complexity is still O(m(nS + nT)²). In addition, because the transformation W is learned explicitly, SSTCA is also easy to generalize to out-of-sample data.

3.6.3 Experiments on Synthetic Data

As described in previous sections, in SSTCA we aim to optimize three objectives simultaneously: minimizing the distance between domains, maximizing the source domain label dependence in the latent space, and preserving the local geometric structure in the latent space. In this section, we perform experiments on synthetic data to demonstrate the effectiveness of these three objectives of SSTCA in learning a 1D latent space from the 2D data. For SSTCA, we use the linear kernel on both inputs and outputs, set γ = 0.5, and fix µ = 1. Since the focus here is not on locality preserving, we set the λ in SSTCA to zero. We will fully test the effectiveness of SSTCA in two real-world applications in Chapter 4 and Chapter 5.

Label Information

In this experiment, we demonstrate the advantage of using label information in the source domain data to improve classification performance (Figure 3.6); the difference between TCA and SSTCA here is in the use of label information. As can be seen from Figure 3.6(b), the positive and negative samples overlap significantly in the latent space learned by TCA. On the other hand, with the use of label information, the positive and negative samples are more separated in the latent space learned by SSTCA (Figure 3.6(c)).
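The out-of-sample mapping of step 6 can be sketched in a few lines. This is an illustrative fragment assuming a linear kernel; the helper name is ours:

```python
import numpy as np

def out_of_sample_embed(x_new, X_train, W):
    """Step 6 of the procedure: an unseen target point is embedded as
    x' = kappa W, where kappa_t = k(x_new, x_t) ranges over all
    nS + nT training points (linear kernel assumed in this sketch)."""
    kappa = X_train @ np.asarray(x_new)   # row vector of kernel evaluations
    return kappa @ W
```

Because W is explicit, no re-optimization is needed for new points; this is the out-of-sample property that distinguishes TCA/SSTCA from the transductive MMDE.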

Figure 3.6: Illustrations of the proposed TCA and SSTCA on synthetic dataset 3. (a) Data set 3 (acc: 69%); (b) 1D projection by TCA (acc: 56%); (c) 1D projection by SSTCA (acc: 79%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

In summary, when the discriminative directions across different domains are similar, SSTCA can outperform TCA by encoding label information into the embedding learning, and thus classification also becomes easier. However, in some applications, it may be possible that the discriminative direction of the source domain data is quite different from that of the target domain data. In this case, encoding label information from the source domain (as SSTCA does) may not help or may even hurt the classification performance as compared to the unsupervised TCA. An example is shown in Figure 3.7(a). As can be seen from Figures 3.7(b) and 3.7(c), the positive and negative samples in the target domain are more separated in the latent space learned by TCA than in that learned by SSTCA. Hence, when the discriminative directions across different domains are different, SSTCA may not improve the performance or may even perform worse than TCA. Nevertheless, compared to non-adaptive methods, both SSTCA and TCA can obtain better performance.

Figure 3.7: Illustrations of the proposed TCA and SSTCA on synthetic dataset 4. (a) Data set 4 (accuracy: 60%); (b) 1D projection by TCA (acc: 90%); (c) 1D projection by SSTCA (acc: 68%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

Manifold Information

In this experiment, we demonstrate the advantage of using manifold information to improve classification performance. Both the source and target domain data have the well-known two-moon manifold structure [232] (Figure 3.8(a)). SSTCA is used with and without Laplacian smoothing (by setting λ in (3.24) to 1000 and 0, respectively). As can be seen from Figures 3.8(b) and 3.8(c), Laplacian smoothing can indeed help improve classification performance when a manifold structure underlies the observed data.

3.6.4 Summary

In SSTCA, we extend the unsupervised TCA in a semi-supervised manner by maximizing the source domain label dependence in embedding learning. Similar to TCA, the time complexity of SSTCA is O(m(nS + nT)²).

Figure 3.8: Illustrations of the proposed SSTCA on synthetic dataset 5. (a) Data set 5 (acc: 70%); (b) 1D projection by SSTCA without Laplacian smoothing (acc: 83%); (c) 1D projection by SSTCA with Laplacian smoothing (acc: 91%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.

Experiments on synthetic data show that when the discriminative directions of the source and target domains are the same or close to each other, SSTCA can boost the classification accuracy compared to TCA. However, when the discriminative directions of the source and target domains are different, SSTCA does not work well. Furthermore, when the source and target domain data have an intrinsic manifold structure, SSTCA with Laplacian smoothing is more effective. More complete comparisons between MMDE, TCA and SSTCA will be conducted in Chapters 4-5.

3.7 Further Discussion

As mentioned in Chapter 3.2, embedding approaches, such as MVU, Colored MVU, MMDE and the proposed TCA and SSTCA, are all based on Hilbert space embedding of distributions via MMD and HSIC.

Table 3.1: Summary of dimensionality reduction methods based on Hilbert space embedding of distributions.

method    | setting         | out-of-sample | label | distribution matching | geometry | variance | kernel
MVU       | unsupervised    |               |       |                       | √        | √        | nonparametric
Color MVU | supervised      |               | √     |                       | √        |          | nonparametric
MMDE      | unsupervised    |               |       | √                     | √        | √        | nonparametric
TCA       | unsupervised    | √             |       | √                     |          | √        | parametric
SSTCA     | semi-supervised | √             | √     | √                     | √        | √        | parametric

Essentially, MVU and Colored MVU are dimensionality reduction approaches for single-domain data analysis and visualization: they learn a nonparametric kernel matrix by maximizing the data variance or the label dependence, where data properties can be preserved. MMDE learns a nonparametric kernel matrix by minimizing the domain distance for cross-domain learning. These methods are all transductive and computationally expensive; moreover, gradient descent is required to refine the embedding space, and thus the solution can still get stuck in a local minimum. In contrast, the proposed TCA and SSTCA are dimensionality reduction approaches for domain adaptation that learn parametric kernel matrices by simultaneously minimizing the distribution distance and maximizing the data variance and label dependence. MMDE and TCA are unsupervised and do not utilize side information, while SSTCA is semi-supervised and exploits side information in embedding learning. We summarize the relationships among these approaches in Table 3.1.

Note that criterion (3.7) in the kernel learning problem of MMDE is similar to the recently proposed supervised dimensionality reduction method Colored MVU [174]. Furthermore, SSTCA reduces to TCA when γ = 0 and λ = 0, and SSTCA can be reduced to the semi-supervised Colored MVU when we drop the distribution matching term and set γ = 1 and λ > 0. If we further drop Objective 1 described in Chapter 3.6.1, TCA reduces to MVU (note that MVU is a transductive dimensionality reduction method, while TCA can be generalized to out-of-sample patterns even in this restricted setting). In summary, to balance prediction performance with computational complexity in cross-domain learning, TCA is a generalized version of MMDE in which a low-rank approximation is used to reduce the number of constraints and variables in the SDP; this reduces the time complexity from O(m(nS + nT)^6.5) to O(m(nS + nT)²).

Last but not least, besides the embedding approaches, instance re-weighting methods, such as KMM, KLIEP, uLSIF, etc., have also been proposed to solve the transductive transfer learning problems by matching data distributions, as mentioned in Chapter 2. The main difference between these methods and our proposed methods is that we aim to match the data distributions between domains in a latent space, instead of matching them in the original feature space. This is beneficial as real-world data are often noisy, while the latent space has been de-noised.

As a result, in practice, matching data distributions in the de-noised latent space may be more useful than matching them in the noisy original feature space for the target learning tasks in the transductive transfer learning setting.

CHAPTER 4

APPLICATIONS TO WIFI LOCALIZATION

In this section, we apply the proposed dimensionality reduction framework to an application in wireless sensor networks: indoor WiFi localization.

4.1 WiFi Localization and Cross-Domain WiFi Localization

Location estimation is a major task of pervasive computing and AI applications that range from context-aware computing [80, 106] to robotics [12]. While GPS is widely used, it cannot work well in many outdoor places where high buildings can block its signals, and in most indoor places. In recent years, locating and tracking a user or cargo with wireless signal strength or received signal strength (RSS) is becoming a reality [74, 65]. With the increasing availability of 802.11 WiFi networks in cities and buildings, machine-learning-based methods have been applied to the indoor WiFi localization problem [134, 100, 132, 133, 65]. The objective of indoor WiFi localization is to estimate the location yi of a mobile device based on the RSS values xi = (xi1, xi2, ..., xik) received from k Access Points (APs). Here, we consider the two-dimensional coordinates of a location (extension to three-dimensional localization is straightforward). Note that indoor WiFi localization is intrinsically a regression problem.

In general, these methods work in two phases. In an offline phase, a mobile device (or multiple ones) moving around the wireless environment is used to collect wireless signals from the APs, which periodically send out wireless signals. The RSS values, i.e., the signal vectors xi = (xi1, xi2, ..., xik), together with the physical location information (i.e., the location coordinates in the 2D floor where the mobile device is), are used as labeled training data to learn a computational or statistical model. As an example, Figure 4.1 shows an indoor 802.11 wireless environment of size about 60m × 50m. Five APs (AP1, ..., AP5) are set up in the environment. A user with an IBM T42 laptop that is equipped with an Intel Pro/2200BG internal wireless card walks through the environment from location A to F at times tA, ..., tF. Then, six signal vectors are collected, each of which is 5-dimensional, as shown in Table 4.1. The corresponding labels of the signal vectors xtA, ..., xtF are the 2D coordinates of the locations A, ..., F in the building. Note that the blank cells in the table denote missing values, which we can fill in with a small default value, e.g., −100dBm.

Figure 4.1: An indoor wireless environment example, with five APs (AP1-AP5) and locations A-F.

Table 4.1: An example of signal vectors (unit: dBm). Rows correspond to AP1-AP5 and columns to the signal vectors xtA, ..., xtF; entries are rounded RSS values between about −40 and −80 dBm, and blank cells denote missing values. (All values are rounded for illustration.)

However, most machine-learning methods rely on collecting a lot of labeled data to train an accurate localization model offline for use online, and on assuming that the distributions of RSS data over different time periods are static. It is expensive to calibrate a localization model in a large environment. Moreover, even in the same environment, the RSS values are noisy and can vary with time [217, 136]. As a result, the RSS data collected in one time period may differ from those collected in another. An example is shown in Figure 4.2. As can be seen, the contours of the RSS values received from the same AP at different time periods are very different. Hence, domain adaptation is necessary for indoor WiFi localization. Here, WiFi data collected from different time periods are considered as different domains, and "label" refers to the location information at which the WiFi data are received.

4.2 Experimental Setup

We use a public data set from the 2007 IEEE ICDM Contest (the 2nd Task). This contains a few labeled WiFi data collected in time period T1 (the source domain) and a large amount of unlabeled WiFi data collected in time period T2 (the target domain). For more details on the data set, readers may refer to the contest report article [212]. The task is to predict the labels of the WiFi data collected in time period T2.

Figure 4.2: Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods. (a) WiFi RSS received in T1 from two APs (unit: dBm); (b) WiFi RSS received in T2 from two APs (unit: dBm). Different colors denote different signal strength values. Note that the original signal strength values are non-positive (the larger the stronger); we shift them to positive values for visualization.

Denote the data collected in time period T1 and time period T2 by DS and DT, respectively. In the experiments, we have |DS| = 621 and |DT| = 3,128. We consider two evaluation settings:

1. In the out-of-sample evaluation setting, we randomly split DT into DT^u (whose label information is removed in training) and DT^o. Our goal is to learn a model from DS and DT^u, and then evaluate the model on DT^o (out-of-sample patterns). All the source domain data (621 instances in total) are used for training, 2,328 patterns are sampled to form DT^o, and a variable number of patterns are sampled from the remaining 800 patterns to form DT^u.

2. In the transductive evaluation setting, our goal is to learn a model from DS and DT^u, and then evaluate the model on DT^u.

For each experiment, we repeat 10 times and then report the average performance using the Average Error Distance (AED):

    AED = Σ_{(xi, yi) ∈ D} |f(xi) − yi| / N,

where xi is a vector of RSS values, f(xi) is the predicted location, yi is the corresponding ground-truth location, N is the number of patterns in D, and D = DT^u in the transductive setting and D = DT^o in the out-of-sample evaluation setting. For parameter tuning of TCA, SSTCA and all the other baseline methods, 50 labeled data are sampled from the source domain as a validation set.

The following methods will be compared.

1. Traditional regression models that do not perform domain adaptation. These include the (supervised) regularized least squares regression (RLSR), which is a standard regression model and is trained on DS only, and the (semi-supervised) Laplacian RLSR (LapRLSR) [14], which is trained on both DS and DT^u but without considering the difference in distributions.
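The AED metric above is simply a mean Euclidean distance over predicted locations; a minimal sketch (function name ours):

```python
import numpy as np

def average_error_distance(pred, truth):
    """Average Error Distance (AED): the mean Euclidean distance between
    predicted and ground-truth 2D locations, as in the formula above."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.linalg.norm(pred - truth, axis=1).mean())
```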

2. Sample selection bias (or covariate shift) methods: KMM and KLIEP as introduced in Chapter 2 (the code of KLIEP is downloaded from http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/KLIEP/index.html). They use both DS and DT^u to learn weights of the patterns in DS, and then train a RLSR model on the weighted data. Following [82], we set the ε parameter in KMM as B/√n1, where n1 is the number of training data in the source domain. For KLIEP, we use the likelihood cross-validation method in [180] to automatically select the kernel width. Preliminary results suggest that the final performance of KLIEP can be sensitive to the initialization of the kernel width; thus, its initial value is also tuned on the validation set.

3. A state-of-the-art domain adaptation method: SCL as introduced in Chapter 2. It learns a set of new cross-domain features from both DS and DT^u, and then augments the features of the source domain data in DS with the new features. A RLSR model is then trained. Following [25], the pivot features are selected by mutual information, while the number of pivots and the other SCL parameters are determined by the validation data.

4. A traditional dimensionality reduction method: kernel PCA (KPCA) as introduced in Chapter 3. It first learns a projection from both DS and DT^u via KPCA. RLSR is then applied on the projected DS to learn a localization model.

5. Methods that only perform distribution matching in a latent space: SSA as introduced in Chapter 2 (the code of SSA is provided by Paul von Bünau, the first author of [191]), and TCAReduced, which replaces the constraint W⊤KHKW = I in TCA by W⊤W = I.

6. The proposed TCA and SSTCA. First, we apply TCA / SSTCA on both DS and DT^u to learn the transfer components and map the data in DS to the latent space. Afterwards, a RLSR model is trained on the projected source domain data. There are two tunable parameters in TCA: the kernel width σ of the Laplace kernel k(xi, xj) = exp(−∥xi − xj∥/σ), which has been shown to be a suitable kernel for the WiFi data [139], and the parameter µ. We first set µ = 1 and search for the best σ value (based on the validation set) in the range [10⁻⁵, 10⁵]. Then, we fix σ and search for the best µ value in [10⁻³, 10³]. For SSTCA, we use the linear kernel for kyy in (3.19) on the labels, and there are four tunable parameters (σ, µ, λ, and γ). We set σ and µ in the same manner as for TCA, set γ = 0.5, and search for the best λ value in [10⁻⁶, 10⁶]. Note that this parameter tuning strategy may not find a globally optimal combination of parameter values; however, we find that the resulting performance is satisfactory in practice.
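The Laplace kernel used on the WiFi inputs has a one-line form; a sketch (the default sigma is only a placeholder, since sigma is tuned on the validation set as described above):

```python
import numpy as np

def laplace_kernel(xi, xj, sigma=1.0):
    """Laplace kernel k(xi, xj) = exp(-||xi - xj|| / sigma), the kernel
    the text reports as well suited to WiFi RSS data."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.linalg.norm(diff) / sigma))
```

Compared with the Gaussian kernel, the Laplace kernel decays more slowly in the exponent's argument, which is often a better match for heavy-tailed, noisy RSS measurements.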

7. A closely related dimensionality reduction method: MMDE. This is a state-of-the-art method on the ICDM-07 contest data set [135, 139]. Since MMDE is transductive, the first six methods will be compared in the out-of-sample setting in Chapters 4.3.1-4.3.3, and we will compare the performance of TCA, SSTCA and MMDE in the transductive setting in Chapter 4.3.4.

4.3 Results

4.3.1 Comparison with Dimensionality Reduction Methods

We first compare TCA and SSTCA with some dimensionality reduction methods, including KPCA, SSA and TCAReduced, in the out-of-sample setting. The number of unlabeled patterns in DT^u is fixed at 400, while the dimensionality of the latent space varies from 5 to 50. Figure 4.3 shows the results. As can be seen, TCA and SSTCA outperform all the other methods. TCA performs better than KPCA because KPCA can only de-noise the data but cannot ensure that the distance between the data distributions of the two domains is reduced. On the other hand, TCAReduced aims to find a transformation W that minimizes the distance between the different distributions without maximizing the variance in the latent space. Thus, though TCAReduced and SSA aim to reduce the distance between domains, they may lose important information of the original data in the latent space, which in turn may hurt the performance of the target learning tasks, and they do not obtain good performance. Moreover, we observe that SSTCA obtains better performance than TCA. As demonstrated in previous research [134], the manifold assumption holds on the WiFi data; thus, the graph Laplacian term in SSTCA can effectively exploit label information from the labeled data to the unlabeled data across domains.

4.3.2 Comparison with Non-Adaptive Methods

In this experiment, we compare TCA and SSTCA with learning-based localization models that do not perform domain adaptation, including RLSR, LapRLSR and KPCA. Note that the Laplacian RLSR has been applied to WiFi localization and is one of the state-of-the-art localization models [134]. The dimensionalities of the latent spaces for KPCA, TCA and SSTCA are fixed at 15; these values are determined based on the first experiment in Chapter 4.3.1. TCA and SSTCA, though simple, can lead to significantly improved performance. This is because the WiFi data are highly noisy, and thus localization models learned in the de-noised latent space can be more accurate than those learned in the original input space.

4. domain adaptation methods that are based on feature extraction (including SCL.100 55 30 KPCA SSTCA TCA TCAReduced SSA Average Error Distance (unit: m) 15 8 4. Results are shown in Figure 4. SCL and SSA. As can be seen.4 shows the performance when the number of unlabeled patterns in DT varies. TCA and SSTCA) perform much better than instance re-weighting methods (including KMM and KLIEP).4: Comparison with localization methods that do not perform domain adaption. even with only a few unlabeled data in the target domain. TCA and SSTCA can perform well for domain adaptation. all the source domain data are used and varying amount u of the target domain data are sampled as |DT |. KLIEP. As can be seen.5 2. u Figure 4. 8 RLSR LapRLSR KPCA SSTCA TCA Average Error Distance (unit: m) 7 6 5 4 3 2 0 100 200 300 400 500 600 700 800 900 # Unlabeled Data in the Target Domain for Training Figure 4.5.5 0 5 10 15 20 25 30 35 # Dimensions 40 45 50 55 Figure 4.3.3 Comparison with Domain Adaptation Methods In this subchapter. while we ﬁx the dimensionalities of the latent space in SSA and TCAReduced at 50. including KMM. For training.3: Comparison with dimensionality reduction methods.5 1. We ﬁx the dimensionalities of the latent space in TCA and SSTCA at 15. we compare TCA and SSTCA with some state-of-the-art domain adaptation methods. This is again because the WiFi data are 62 .

Figure 4.5: Comparison of TCA, SSTCA and the various baseline methods (KLIEP, SCL, KMM, TCAReduced and SSA) in the inductive setting on the WiFi data. Average Error Distance (unit: m) versus the number of unlabeled data in the target domain used for training.

Indeed, SCL may suffer from a bad choice of pivot features due to the noisy observations. On the other hand, TCA and SSTCA match distributions in the latent space, where the WiFi data have been implicitly de-noised.

4.3.4 Comparison with MMDE

In this subchapter, we compare TCA and SSTCA with MMDE in the transductive setting. The latent space is learned from DS and a subset of the unlabeled target domain data sampled from DT^u; the performance is then measured on the same unlabeled data subset. Figure 4.6(a) shows the results for different dimensionalities of the latent space with |DT^u| = 400, while Figure 4.6(b) shows the results for different amounts of unlabeled target domain data, with the dimensionalities of MMDE, TCA and SSTCA fixed at 15. As can be seen, MMDE outperforms TCA and SSTCA.

Figure 4.6: Comparison with MMDE in the transductive setting on the WiFi data. Average Error Distance (unit: m): (a) varying the dimensionality of the latent space; (b) varying the number of unlabeled data.

This may be due to the limitation that the kernel matrix used in TCA / SSTCA is parametric. we need tune the kernel parameters using cross-validation. tradeoff parameter µ. Experimental results veriﬁed the effectiveness of our proposed methods in WiFi localization compared several cross-domain methods. TCA or SSTCA may be a better choice than MMDE for cross-domain adaptation. as mentioned in Chapter 3. 4. the two additional parameters γ and λ. SSTCA performs better than 64 .4. for real-world applications. MMDE performs best. However. we investigate the effects of the parameters on the regression performance. and another 400 samples to form DT . the kernel matrix is learned from the data automatically. In practice. From the results.outperforms TCA and SSTCA. These include the kernel width σ in the Laplacian kernel. and we sample 2. which is useful for learning tasks.4 Summary In this section. both TCA and SSTCA are insensitive to the settings of the various parameters.328 samples from the target domain data to form o u DT . MMDE SSTCA TCA 10 Running Time (unit: sec) 5 10 4 10 3 10 2 10 1 10 0 0 100 200 300 400 500 600 700 800 # Unlabeled Data in the Target Domain for Training 900 Figure 4.7: Training time with varying amount of unlabeled data for training. based on the results in WiFi localization. The out-of-sample evaluation setting is used. we applied MMDE. we may observe that in the transductive experimental setting. However. 4. As a result. All the source domain data are used. Furthermore. As can be seen from Figure 4. while in MMDE.7. TCA and SSTCA are more desirable.3. MMDE is computationally expensive because it involves a SDP. TCA and SSTCA to the WiFi localization problem. and for SSTCA.8. Thus. MMDE may be not applicable in practice because of its expensive computational cost. This is conﬁrmed in the training time comparison in Figure 4. The reason is that in TCA and SSTCA. the kernel matrix in MMDE can more ﬁt the data. 
The dimensionalities of the latent spaces in TCA and SSTCA are ﬁxed at 15.5 Sensitivity to Model Parameters In this subchapter.

1 0.4 0.3 0.7 Values of γ1 (γ2=1−γ1) 0.9 1 1.6 0. Therefore.1 0 0. when the manifold assumption does hold on the source and target domain data.4 2.6 2. TCA. 65 .3 2.55 2.8 0.25 2. The reason may be that the manifold assumption hold strongly on the WiFi data.2 −0. Figure 4.7 2.7 SSTCA 2.5 0.3 2. 2.8: Sensitivity analysis of the TCA / SSTCA parameters on the WiFi data. we suggest to use SSTCA for learning the latent space for transfer learning.2 0. (d) varying λ in SSTCA.9 2.65 Average Error Distance (unit: m) 2.2 −4 TCA SSTCA −3 −2 −1 0 log10µ 1 2 3 4 (a) varying σ of the Laplacian kernel.1 Average Error Distance (unit: m) 40 35 30 25 20 15 10 5 0 45 (b) varying µ.8 2.4 2.5 2.5 2. SSTCA −7 −6 −5 −4 −3 −2 −1 0 log10λ 1 2 3 4 5 6 7 (c) varying γ in SSTCA.55 50 Average Error Distrance (unit: m) 45 40 35 30 25 20 15 10 5 0 −5 −4 −3 −2 −1 0 1 log10σ 2 3 4 5 TCA SSTCA Average Error Distance (unit: m) 2.45 2.35 2.6 2. As a result the manifold regularization term in SSTCA can be able to propagate label information from the source domain to the target domain effectively.
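As noted in the sensitivity study, the parametric kernel in TCA / SSTCA must be tuned by cross-validation. The sketch below illustrates that tuning loop on a stand-in model: a kernel ridge regressor with a Laplacian kernel on synthetic data. The data, parameter grids and regressor are illustrative assumptions, not the thesis setup; TCA itself is not reimplemented here.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    # k(a, b) = exp(-||a - b|| / sigma)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-d / sigma)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(120)

# Hold out a validation split for tuning.
X_tr, y_tr = X[:80], y[:80]
X_va, y_va = X[80:], y[80:]

best = None
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0]:          # kernel width grid
    for lam in [1e-3, 1e-2, 1e-1, 1.0]:          # regularization grid
        K = laplacian_kernel(X_tr, X_tr, sigma)
        # Kernel ridge regression: alpha = (K + lam I)^{-1} y
        alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
        y_hat = laplacian_kernel(X_va, X_tr, sigma) @ alpha
        err = np.mean((y_hat - y_va) ** 2)
        if best is None or err < best[0]:
            best = (err, sigma, lam)

best_err, best_sigma, best_lam = best
print(best_sigma, best_lam, best_err)
```

The same loop applies to any model with free kernel parameters; only the inner fit/predict step changes.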

CHAPTER 5

APPLICATIONS TO TEXT CLASSIFICATION

In the previous chapter, we showed the effectiveness of the proposed dimensionality reduction framework on the cross-domain WiFi localization problem. The proposed framework can be used for general cross-domain regression or classification problems, and thus can be applied to diverse applications. In this chapter, we apply the framework to another application: text classification.

5.1 Text Classification and Cross-domain Text Classification

Text classification, also known as document classification, aims to assign a document to one or more categories based on its content. It is a fundamental task for Web mining and has been widely studied in the fields of machine learning, data mining and information retrieval [89, 207, 215]. Traditional supervised learning approaches to text classification require sufficient labeled instances in a domain of interest in order to train a high-quality classifier, and test data are assumed to come from the same domain, following the same data distribution as the training domain data. However, in many real-world applications, labeled data are in short supply in the domain of interest. As a result, cross-domain learning algorithms are desirable in text classification. In recent years, transfer learning techniques have been proposed to address cross-domain text classification problems [48, 47, 49, 204], which have been reviewed in Chapter 2. For example, [48] proposed a modified naive Bayes algorithm for cross-domain text classification. However, most of these methods are designed for the text classification area only; the method in [48], for instance, is based on naive Bayes, and thus is hard to apply to other applications, such as WiFi localization. In the following sections, we test the effectiveness of the proposed framework on a real-world cross-domain text classification dataset.

5.2 Experimental Setup

In this section, we perform cross-domain text classification experiments on a preprocessed data set of the 20-Newsgroups¹ [96]. This is a text collection of approximately 20,000 newsgroup documents hierarchically partitioned into 20 newsgroups. We follow the preprocessing strategy in [47, 48] to create six data sets from this collection. For each data set, two top categories are chosen, one as positive and the other as negative, and the binary classification task is defined as top category classification. We then split the data based on sub-categories: different sub-categories are considered as different domains, and documents from different sub-categories but under the same parent category are considered to be related domains. This splitting strategy ensures that the domains of labeled and unlabeled data are related, since they are under the same top categories, while the domains are also ensured to be different, since they are drawn from different sub-categories. The six data sets created are "comp vs. sci", "rec vs. talk", "comp vs. rec", "comp vs. talk", "rec vs. sci" and "sci vs. talk" (Table 5.1). All alphabets are converted to lower case and stemmed using the Porter stemmer. Stop words are also removed, and the document frequency feature is used to cut down the number of features.

Table 5.1: Summary of the six data sets constructed from the 20-Newsgroups data. [Table omitted: for each task, both the source and target domains contain 1,500 positive and 1,500 negative documents (6,000 documents per data set); the number of features varies per task.]

From each of these six data sets, we randomly sample 40% of the documents from the source domain as D_S, sample 40% of the documents from the target domain to form the unlabeled subset D_T^u, and use the remaining 60% in the target domain to form the out-of-sample subset D_T^o. Hence, in each cross-domain classification task, |D_S| = 1200, |D_T^u| = 1200 and |D_T^o| = 1800. All experiments are performed in the out-of-sample setting. We run 10 repetitions and report the average results. The evaluation criterion is the classification accuracy. Similar to Chapter 4, a linear SVM is used as the classifier in the latent space.

In this experiment, we perform a series of experiments to compare TCA and SSTCA with the following methods:

1. Linear support vector machine (SVM) in the original input space.
2. KPCA. A linear SVM is then trained in the latent space.
3. Three domain adaptation methods: KMM, KLIEP and SCL.

Note that we do not show the performance of MMDE in this experiment, because it results in "out of memory" on learning the kernel matrix using SDP solvers.

¹ http://people.csail.mit.edu/jrennie/20Newsgroups/
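The per-task sampling scheme described above can be sketched as follows. The 3,000-document domain size follows Table 5.1 (1,500 positive plus 1,500 negative documents per domain); the index arrays are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_docs = 3000  # 1,500 positive + 1,500 negative documents per domain

# Labeled source subset D_S: 40% of the source domain.
src_idx = rng.permutation(n_docs)
D_S = src_idx[: int(0.4 * n_docs)]

# Target domain: 40% unlabeled for training (D_T^u); the remaining 60%
# is held out as the out-of-sample evaluation subset (D_T^o).
tgt_idx = rng.permutation(n_docs)
D_T_u = tgt_idx[: int(0.4 * n_docs)]
D_T_o = tgt_idx[int(0.4 * n_docs):]

print(len(D_S), len(D_T_u), len(D_T_o))  # 1200 1200 1800
```

Repeating this sampling 10 times with different seeds and averaging the resulting accuracies reproduces the evaluation protocol described in the text.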

5.3 Results

5.3.1 Comparison with Other Methods

Results are shown in Tables 5.2 and 5.3. We experiment with the Laplacian, RBF and linear kernels² for feature extraction or re-weighting in KPCA, KMM, TCA and SSTCA. Here, we use linear kernels for both the inputs and outputs, i.e., the kernel k_yy in (3.19) is the linear kernel. Note that we do not compare with SSA because it results in "out of memory" on computing the covariance matrices.

Overall, we can obtain a similar conclusion as in Chapter 4: feature extraction methods outperform instance re-weighting methods. However, one may notice two main differences between the results on the WiFi data and those on the text data. First, the linear kernel performs better than the RBF and Laplacian kernels here. This agrees with the well-known observation that the linear kernel is often adequate for high-dimensional text data. Second, TCA performs better than SSTCA on the text data. This may be because the manifold assumption is weaker in the text domain than in the WiFi domain. In addition, on some tasks linear TCA performs much better than PCA, while on other tasks the performance of PCA is comparable to that of linear TCA. This agrees with our motivation and the previous conclusion on the WiFi experiments, namely that mapping data from different domains to a latent space spanned by the principal components may not work well, as PCA cannot guarantee a reduction in the distance between the two domain distributions.

² The RBF and linear kernels are defined as k(x_i, x_j) = exp(−∥x_i − x_j∥² / σ) and k(x_i, x_j) = x_i x_j⊤, respectively.

Table 5.2: Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation). [Table omitted: rows are SVM, TCA, SSTCA, KPCA and SCL (each with Laplacian, RBF and linear kernels) at 5, 10, 20 and 30 latent dimensions, plus KLIEP and KMM; columns are three of the six cross-domain tasks.]

Table 5.3: Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation). [Table omitted: same layout as Table 5.2 for the remaining three cross-domain tasks.]

5.3.2 Sensitivity to Model Parameters

Figure 5.1 shows how the various parameters in TCA and SSTCA affect the classification performance. The µ parameters in TCA and SSTCA are set to 1, the λ parameter in SSTCA is set to 0.0001, and the dimensionalities of the latent spaces are fixed at 10. The remaining free parameters are µ for TCA, and µ, γ and λ for SSTCA. Figure 5.1(a) shows the sensitivity w.r.t. µ for TCA, and Figure 5.1(b) shows the sensitivity w.r.t. µ for SSTCA (with γ = 0.5 and λ = 10⁻⁴). As can be seen, both TCA and SSTCA are insensitive to the setting of µ. Figure 5.1(c) shows the sensitivity of SSTCA w.r.t. γ (with µ = 1 and λ = 10⁻⁴); there is a wide range of γ for which the performance of SSTCA is quite stable. Finally, Figure 5.1(d) shows the sensitivity w.r.t. λ (with γ = 0.5 and µ = 1). Different from the WiFi results in Figure 4.8(d), where SSTCA performs well when λ ≤ 10², here it performs well only when λ is very small (λ ≤ 10⁻⁴). This indicates that manifold regularization may not be useful on the text data.

Figure 5.1: Sensitivity analysis of the TCA / SSTCA parameters on the text data: (a) varying µ in TCA; (b) varying µ in SSTCA; (c) varying γ in SSTCA; (d) varying λ in SSTCA. [figure omitted]

5.4 Summary

In this section, we applied TCA and SSTCA to the text classification problem. Experimental results verified that our proposed methods based on the dimensionality reduction framework can achieve promising results in cross-domain text classification. However, different from the conclusion in WiFi localization, TCA slightly outperforms SSTCA in the text classification problem. The reason may be that the manifold assumption does not hold on the text data. In this case, the manifold regularization term in SSTCA may not be able to propagate label information from the source domain to the target domain, and SSTCA is not supposed to outperform TCA; in a worse case, SSTCA may even perform worse than TCA if the manifold regularization hurts the learning performance. Therefore, when the manifold assumption does not hold on the source and target domain data, using unsupervised TCA for learning the latent space for transfer learning may be a better choice.
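For reference, the three kernels compared in this chapter can be computed as follows. This is only a sketch assuming the common forms exp(−∥x_i − x_j∥ / σ) for the Laplacian kernel, exp(−∥x_i − x_j∥² / σ) for the RBF kernel, and the plain inner product for the linear kernel.

```python
import numpy as np

def pairwise_dists(X):
    # Euclidean distance matrix: entry (i, j) is ||x_i - x_j||.
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def laplacian_kernel(X, sigma):
    return np.exp(-pairwise_dists(X) / sigma)

def rbf_kernel(X, sigma):
    return np.exp(-pairwise_dists(X) ** 2 / sigma)

def linear_kernel(X):
    return X @ X.T

X = np.random.default_rng(0).standard_normal((5, 3))
for K in (laplacian_kernel(X, 1.0), rbf_kernel(X, 1.0), linear_kernel(X)):
    assert np.allclose(K, K.T)  # kernel matrices are symmetric
```

For high-dimensional sparse text features, the linear kernel has the practical advantage that it can be computed with a sparse matrix product, which is one reason it is the default choice for text data.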

CHAPTER 6

DOMAIN-DRIVEN FEATURE SPACE TRANSFER FOR SENTIMENT CLASSIFICATION

In the previous chapters, we have considered transfer learning when the source and target domains have implicit overlap, which means the domain knowledge shared across these domains is hidden. Our strategy was to exploit a general dimensionality reduction framework to discover common latent factors that are useful for knowledge transfer across domains. We proposed three methods based on the framework to learn the transfer latent space and applied them successfully to two diverse real-world applications: indoor WiFi localization and text classification. However, in some domain areas, we have explicit domain knowledge that can help to design a more effective latent space for specific applications. In this chapter, we highlight one such application, sentiment classification, to show that domain knowledge can be encoded in feature space learning for transfer learning when it is available.

6.1 Sentiment Classification

With the explosion of Web 2.0 services, more and more user-generated resources have become available on the World Wide Web. They exist in the form of online user reviews on shopping or opinion sites, in posts of personal blogs, or as user feedback on news. This rich opinion information on the Web can help an individual make a decision to buy a product or plan a traveling route, help a company do a survey on its products or services, or even help the government get feedback on a policy. However, in many cases, opinions are hidden in long forum posts and blogs. How to find, extract or summarize useful information from opinion sources on the Web is thus an important and challenging task. In sentiment analysis, also known as opinion mining, there are several interesting tasks, e.g., opinion extraction and summarization [81, 93], opinion integration [117], review quality prediction [116, 114], review spam identification [88], and opinion analysis in multiple languages [1]. Among these tasks, sentiment classification, also known as subjective classification [145, 111], which aims at classifying text segments (e.g., text sentences and review articles) into polarity categories (e.g., positive or negative), has attracted much attention in the natural language processing and information retrieval areas [110, 118, 184, 145, 111]. It is an important task and has been widely studied because many users do not explicitly indicate their sentiment polarity, so we need to predict it from the text data generated by users.

Sentiment data are text segments containing user opinions about objects, events and their properties. User sentiment may exist in the form of a sentence, paragraph or article. Without loss of generality, we represent user sentiment data with a bag-of-words method. Each sentiment data item, denoted by x_j, corresponds to a sequence of words w_1 w_2 ... w_{x_j}, where w_i is a word from a vocabulary W, and we use c(w_i, x_j) to denote the frequency of word w_i in x_j. Here, either single words or N-grams can be used as features to represent sentiment data; in the rest of this chapter, we use word and feature interchangeably. We use a unified vocabulary W for all domains, and |W| = m. In sentiment classification, a domain D denotes a class of objects, events and their properties in the world. Different types of products, such as books, dvds and furniture, can be regarded as different domains.

For each sentiment data item x_j, there is a corresponding label y_j: y_j = +1 if the overall sentiment expressed in x_j is positive, and y_j = −1 if the overall sentiment expressed in x_j is negative. A pair of sentiment text and its corresponding sentiment polarity {x_j, y_j} is called labeled sentiment data. If x_j has no polarity assigned, it is unlabeled sentiment data. Besides positive and negative sentiment, there are also neutral and mixed sentiment data in practical applications. Neutral polarity means that there is no sentiment expressed by users; mixed polarity means user sentiment is positive in some aspects but negative in others. In this chapter, we focus only on positive and negative sentiment data, but it is not hard to extend the proposed solution to address multi-category sentiment classification problems.

6.2 Existing Works in Cross-Domain Sentiment Classification

In recent years, various machine learning techniques have been proposed for sentiment classification [145, 111]. To date, supervised learning algorithms [146] have been proven promising and are widely used in sentiment classification, since they are more accurate than unsupervised learning methods [187], and there is a large body of research work on improving the classification performance in the supervised setting [144, 128]. However, these methods rely heavily on manually labeled training data to train an accurate sentiment classifier. In order to reduce the human annotation cost, Goldberg and Zhu adapted a graph-based semi-supervised learning method to make use of unlabeled data for sentiment classification. Sindhwani et al. [171] and Li et al. [105] proposed to incorporate lexical knowledge into graph-based semi-supervised learning and non-negative matrix tri-factorization approaches to sentiment classification with a few labeled data.
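The bag-of-words representation described above can be sketched as follows. The whitespace tokenization is a simplification; c(w_i, x_j) is the frequency of word w_i in x_j.

```python
from collections import Counter

def build_vocab(corpus):
    # Unified vocabulary W shared by all domains.
    return sorted({w for doc in corpus for w in doc.lower().split()})

def bow_vector(doc, vocab):
    # x_j is a length-m vector of word frequencies c(w_i, x_j).
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

corpus = ["very good game", "good plots very realistic"]
W = build_vocab(corpus)
x = bow_vector(corpus[1], W)
print(dict(zip(W, x)))
```

Replacing the whitespace tokenizer with an N-gram extractor yields the N-gram variant mentioned in the text without changing anything else.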

However, these semi-supervised learning methods still require a few labeled data in the target domain to train an accurate sentiment classifier. In practice, we may have hundreds or thousands of domains at hand; it would be desirable to annotate data in only a few domains and train sentiment classifiers that can make accurate predictions in all other domains. Furthermore, similar to supervised learning methods, these approaches are domain dependent, owing to changes in vocabulary: users may use domain-specific words to express their sentiment in different domains. Table 6.1 shows several user review sentences from two domains: electronics and video games. In the electronics domain, we may use words like "compact" and "sharp" to express our positive sentiment and use "blurry" to express our negative sentiment, while in the video game domain, words like "hooked" and "realistic" indicate positive opinion and the word "boring" indicates negative opinion. Due to the mismatch between domain-specific words, a sentiment classifier trained in one domain may not work well when directly applied to other domains. Thus cross-domain sentiment classification algorithms are highly desirable to reduce the domain dependency and the cost of manual labeling.

Table 6.1: Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. Italic words are some domain-independent words, which occur frequently in both domains. "+" denotes positive sentiment, and "-" denotes negative sentiment.

  electronics
  + Compact; easy to operate; very good picture quality; looks sharp!
  + I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
  - It is also quite blurry in very dark settings. I will never buy HP again.

  video games
  + A very good game! It is action packed and full of excitement. I am very much hooked on this game.
  + Very realistic shooting action and good plots. We played this and were hooked.
  - The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

As a result, a sentiment classifier trained in one domain cannot be applied to another domain directly. To address this problem, Blitzer et al. [25] proposed the structural correspondence learning (SCL) algorithm, which has been introduced in Chapter 2.3, to exploit domain adaptation techniques for sentiment classification; it is a state-of-the-art method in cross-domain sentiment classification. SCL is motivated by a multi-task learning algorithm, alternating structural optimization (ASO), proposed by Ando and Zhang [4]. SCL tries to construct a set of related tasks to model the relationship between "pivot features" and "non-pivot features". Then "non-pivot features" with similar weights among tasks tend to be close to each other in a low-dimensional latent space. However, in practice, it is hard to construct a reasonable number of related tasks from data, which may limit the transfer ability of SCL for cross-domain sentiment classification. More recently, Li et al. [104] proposed to transfer common lexical knowledge across domains via matrix factorization techniques.

In this chapter, we aim to find an effective approach to the cross-domain sentiment classification problem. In particular, we propose a spectral feature alignment (SFA) algorithm to find a new representation for cross-domain sentiment data, such that the gap between domains can be reduced. SFA uses some domain-independent words as a bridge to construct a bipartite graph that models the co-occurrence relationship between domain-specific words and domain-independent words. The idea is that if two domain-specific words have connections to more common domain-independent words in the graph, they tend to be aligned together with higher probability. Similarly, if two domain-independent words have connections to more common domain-specific words in the graph, they tend to be aligned together with higher probability. We adapt a spectral clustering algorithm, which is based on graph spectral theory [41], to the bipartite graph to co-align domain-specific and domain-independent words into a set of feature clusters. In this way, the clusters can be used to reduce the mismatch between the domain-specific words of both domains. Finally, we represent all data examples with these clusters and train sentiment classifiers based on the new representation.

**6.3 Problem Statement and A Motivating Example**

Problem Definition. Assume we are given two specific domains D_S and D_T, where D_S and D_T are referred to as a source domain and a target domain respectively. Suppose we have a set of labeled sentiment data D_S = {(x_{S_i}, y_{S_i})}_{i=1}^{n_S} in D_S, and some unlabeled sentiment data D_T = {x_{T_j}}_{j=1}^{n_T} in D_T. The task of cross-domain sentiment classification is to learn an accurate classifier to predict the polarity of unseen sentiment data from D_T.

In this section, we use an example to introduce the motivation of our solution to the cross-domain sentiment classification problem. First of all, we assume the sentiment classifier f is a linear function, which can be written as y* = f(x) = sgn(x w⊤), where x ∈ R^{1×m}, and sgn(x w⊤) = +1 if x w⊤ ≥ 0 and sgn(x w⊤) = −1 otherwise. Here w is the weight vector of the classifier, which can be learned from a set of training data (pairs of sentiment data and their corresponding polarity labels).

Consider the example shown in Table 6.1 to illustrate our idea. We use a standard bag-of-words method to represent sentiment data of the electronics (E) and video games (V) domains. From Table 6.2, we can see that the difference between domains is caused by the frequency of the domain-specific words. Domain-specific words in the E domain, such as compact, sharp and blurry, do not occur in the V domain. On the other hand, domain-specific words in the V domain, such as hooked, realistic and boring, do not occur in the E domain. Suppose the E domain is the source domain and the V domain is the target domain; our goal is to train a vector of weights w* with labeled data from the E domain, and use it to predict sentiment polarity for the V domain data.¹ Based on the three training sentences in the E domain, the weights of features such as compact and sharp should be positive, the weight of features such as blurry should be negative, and the weights of features such as hooked, realistic and boring can be arbitrary, or zero if an L1 regularizer is applied on w for model training. However, an ideal weight vector in the V domain should have positive weights for features such as hooked and realistic and a negative weight for the feature boring, while the weights of features such as compact, sharp and blurry may take arbitrary values. That is why the classifier learned from the E domain may not work well in the V domain.

Table 6.2: Bag-of-words representations of electronics (E) and video games (V) reviews. Only domain-specific features are considered; "..." denotes all other words.

         compact  sharp  blurry  hooked  realistic  boring
  E  +      1       1      0       0        0         0
     +      0       1      0       0        0         0
     -      0       0      1       0        0         0
  V  +      0       0      0       1        0         0
     +      0       0      0       1        1         0
     -      0       0      0       0        0         1
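The mismatch can be made concrete with a tiny numeric sketch. The weights below are hypothetical, chosen only to mimic a classifier fit on the E domain over the six domain-specific features of Table 6.2; on V-domain reviews every score is zero, so sgn(x w⊤) falls back to its default of +1 and the negative review is misclassified.

```python
import numpy as np

# Feature order: compact, sharp, blurry, hooked, realistic, boring
w_E = np.array([1.0, 1.0, -1.0, 0.0, 0.0, 0.0])  # hypothetical E-domain weights

def predict(x, w):
    # y* = sgn(x w^T), with sgn(0) defined as +1
    return 1 if x @ w >= 0 else -1

x_V_pos = np.array([0, 0, 0, 1, 1, 0])  # "very realistic ... hooked" review
x_V_neg = np.array([0, 0, 0, 0, 0, 1])  # "the game is so boring" review

print(predict(x_V_pos, w_E), predict(x_V_neg, w_E))  # both +1: no signal on V
```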

In order to reduce the mismatch between features of the source and target domains, a straightforward solution is to make them more similar by adopting a new representation. Table 6.3 shows an ideal representation of domain-specific features. Here, sharp_hooked denotes a cluster consisting of sharp and hooked, compact_realistic denotes a cluster consisting of compact and realistic, and blurry_boring denotes a cluster consisting of blurry and boring. We can use these clusters as high-level features to represent domain-specific words. Based on the new representation, the weight vector w* trained in the E domain should also be an ideal weight vector in the V domain. In this way, based on the new representation, the classifier learned from one domain can be easily adapted to another one.

Table 6.3: Ideal representations of domain-specific words. "..." denotes all other words.

         sharp_hooked  compact_realistic  blurry_boring
  E  +        1                1                0
     +        1                0                0
     -        0                0                1
  V  +        1                0                0
     +        1                1                0
     -        0                0                1

The problem is how to construct such an ideal representation as shown in Table 6.3. Clearly, if we directly apply traditional clustering algorithms such as k-means [76] on Table 6.2, we

¹ For simplicity, we only discuss domain-specific words here and ignore all other words.

are not able to align sharp and hooked into one cluster, since the distance between them is large. In order to reduce the gap and align domain-specific words from different domains, we can utilize domain-independent words as a bridge. As shown in Table 6.1, words such as sharp, hooked, compact and realistic often co-occur with other words such as good and exciting, while words such as blurry and boring often co-occur with never_buy. Since the words good, exciting and never_buy occur frequently in both the E and V domains, they can be treated as domain-independent features. Table 6.4 shows the co-occurrences between domain-independent and domain-specific words. It is easy to find that, by applying clustering algorithms such as k-means [76] on Table 6.4, we can get the feature clusters shown in Table 6.3: sharp_hooked, blurry_boring and compact_realistic.

Table 6.4: A co-occurrence matrix of domain-specific and domain-independent words.

              good  exciting  never_buy
  compact      1       0         0
  realistic    1       0         0
  sharp        1       1         0
  hooked       1       1         0
  blurry       0       0         1
  boring       0       0         1

The example described above suggests that the co-occurrence relationship between domain-specific and domain-independent features is useful for feature alignment across different domains. We propose to use a bipartite graph to represent this relationship and then adapt spectral clustering techniques to find a new representation for domain-specific features. In the following section, we present the spectral domain-specific feature alignment algorithm in detail.
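In this toy example the co-occurrence rows of Table 6.4 are identical within each cluster, so the alignment can be recovered by simply grouping words by their co-occurrence profile, as the sketch below does. In realistic data the rows only approximately agree, which is why an actual clustering algorithm such as k-means, or the spectral approach developed next, is needed.

```python
from collections import defaultdict

# Rows of Table 6.4: co-occurrence with (good, exciting, never_buy)
cooc = {
    "compact":   (1, 0, 0), "realistic": (1, 0, 0),
    "sharp":     (1, 1, 0), "hooked":    (1, 1, 0),
    "blurry":    (0, 0, 1), "boring":    (0, 0, 1),
}

# Group domain-specific words that share the same co-occurrence profile.
groups = defaultdict(set)
for word, profile in cooc.items():
    groups[profile].add(word)

print(sorted(sorted(g) for g in groups.values()))
```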

**6.4 Spectral Domain-Specific Feature Alignment**

In this section, we describe our algorithm for adapting spectral clustering techniques to align domain-specific features from different domains for cross-domain sentiment classification. As mentioned above, our proposed method consists of two steps: (1) identifying domain-independent features, and (2) aligning domain-specific features. In the first step, we aim to learn a feature selection function ϕDI(·) to select l domain-independent features, which occur frequently and act similarly across the domains DS and DT. These domain-independent features are used as a bridge to make knowledge transfer across domains possible. After identifying the domain-independent features, we can use ϕDS(·) to denote a feature selection function for selecting domain-specific features, which can be defined as the complement of the domain-independent ones. In the second step, we aim to learn an alignment function φ : R^(m−l) → R^k that aligns the domain-specific features from both domains into k predefined feature clusters z1, z2, ..., zk, such that the difference between domain-specific features from different domains, under the new representation constructed by the learned clusters, is dramatically reduced.

6.4.1 Domain-Independent Feature Selection

For simplicity, we use WDI and WDS to denote the vocabulary of domain-independent and domain-specific features, respectively. Then each sentiment datum xi can be divided into two disjoint views: one view consists of the features in WDI, and the other is composed of the features in WDS. We use ϕDI(xi) and ϕDS(xi) to denote the two views, respectively.

First of all, we need to identify which features are domain independent. As mentioned above, domain-independent features should occur frequently and act similarly in both the source and target domains; such features are referred to as domain-independent features in this thesis. In this section, we present several strategies for selecting them.

A first strategy is to select domain-independent features based on their frequency in both domains. More specifically, given the number l of domain-independent features to be selected, we choose features that occur more than k times in both the source and target domains, where k is set to be the largest number such that we can get at least l such features. However, there is no guarantee that the selected features act similarly in both domains.

A second strategy is based on the mutual dependence between features and labels on the source domain data. In information theory, mutual information is used to measure the mutual dependence between two random variables, and feature selection using mutual information can help identify features relevant to the source domain labels. In [25], mutual information is applied on source domain labeled data to select features as "pivots", which is shown in (6.1):

I(X^i, y) = \sum_{y \in \{+1, -1\}} \sum_{x \in X^i} p(x, y) \log_2 \left( \frac{p(x, y)}{p(x)\,p(y)} \right),   (6.1)

where X^i and y denote a feature and the class label, respectively.

Here we propose a third strategy for selecting domain-independent features. Motivated by the supervised feature selection criteria, we can use mutual information to measure the dependence between features and domains. More specifically, we modify the mutual information criterion to measure the dependence between features and domains as follows:

I(X^i, D) = \sum_{d \in D} \sum_{x \in X^i, x \neq 0} p(x, d) \log_2 \left( \frac{p(x, d)}{p(x)\,p(d)} \right),   (6.2)

where D is a domain variable, and we only sum over the non-zero values of a specific feature X^i. The smaller I(X^i, D) is, the more likely it is that X^i can be treated as a domain-independent feature: if a feature has high mutual information with the domain variable, then it is domain specific; otherwise, it is domain independent.
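The criterion of Eq. (6.2) can be sketched in a few lines of Python for a binary (presence/absence) feature. This simplified form, the function name and the count-based probability estimates are our own assumptions for illustration, not the thesis's implementation:

```python
import math

def domain_mutual_information(n_src, n_tgt, total_src, total_tgt):
    """Mutual information between feature occurrence (x != 0) and the
    domain variable D, in the spirit of Eq. (6.2): only the non-zero
    value of the feature is summed over, and log base 2 is used.
    n_src/n_tgt: documents containing the feature in each domain;
    total_src/total_tgt: total number of documents per domain."""
    total = total_src + total_tgt
    p_x = (n_src + n_tgt) / total            # p(x != 0)
    mi = 0.0
    for n_d, total_d in ((n_src, total_src), (n_tgt, total_tgt)):
        p_d = total_d / total                # p(d)
        p_xd = n_d / total                   # p(x != 0, d)
        if p_xd > 0:
            mi += p_xd * math.log2(p_xd / (p_x * p_d))
    return mi
```

A word that occurs with equal frequency in both domains gets a score near zero (likely domain independent), while a word skewed toward one domain gets a larger score (likely domain specific), matching the interpretation of I(X^i, D) above.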

g. 6. The reason is that in the electronics domain. domain-independent feature selection is a challenge task. each vertex in VDS corresponds to a domain-speciﬁc word in WDS . However. In a worse case. we can identify which features are domain independent and which ones are domain speciﬁc. which is relevant to class labels. Using the ﬁrst and third strategies. we may say “this unit is very sharp”. We leave the issue that how to develop a criteria to select domain-independent word precisely in our future work. where we denote VDS VDI and E the vertices or edges of the graph G. we focus more on addressing the problem of how to model the correlation between domain-independent and domain-speciﬁc words for transfer learning. some words may have opposing sentiment polarity in different domains. However. An edge in E connects two vertexes in VDS and VDI respectively.We still use the example shown in Table 6. Thus. and each vertex in VDI corresponds to a domain-independent word in WDI . each edge eij ∈ E is associated with a non-negative weight mij . Besides using the co-occurrence frequency of words within documents. If the word “very” is used as a bridge then the words “sharp” and “boring” may be aligned in a same clustering. For example.4. the word “sharp” has high mutual value to class labels in the electronics domain. A bipartite graph example is shown in Figure 6. “never buy” and “very” as domain-independent features. which is constructed based on the example shown in Table 6. we may select the words such as “good”. the word “very” should not be used a bridge for aligning words from different domains. which is not what we expect.1 to explain the proposed three strategies for domain-independent features selection. E) between ∪ them. such as “good”.1. we can construct a bipartite graph G = (VDS VDI . which will be presented next. the total number of co-occurrence of wi ∈ WDS and wj ∈ WDI in DS and DT ). in this chapter. we may say “this game is very boring”. 
Here “good” and “never buy” may be good domain-independent features for serving as a bridge across domains. Hence. we can deﬁne a reasonable “window 80 . Note that there is no intra-set edges linking two vertexes in VDS or VDI . we can also adopt more meaningful methods to estimate mij . For example. In G. Furthermore. using the second strategy. while in the video games domain.. The score of mij measures the relationship between word wi ∈ WDS and wj ∈ WDI in DS and DT (e. In contrast. “nice” and “sharp” (assume the electronics domain is a source domain). we can use the constructed bipartite graph to model the intrinsic relationship between domain-speciﬁc and domain-independent features.4. Given domain-independent ∪ and domain-speciﬁc features. we may be guaranteed to select sentiment words. Thus.2 Bipartite Feature Graph Construction Based on the above strategies for selecting domain-independent features. which makes the task more challenge. the polarity of the word “thin” may be positive in the electronics domain but negative in the furniture domain. Note that “sharp” should be selected because it is domain-dependent word.
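As a rough illustration of the edge weights mij, a document-level co-occurrence count between the two vocabularies might be computed as follows. The word lists and the single-token treatment of "never_buy" are hypothetical simplifications; the thesis pools counts over the documents of both DS and DT:

```python
def cooccurrence_matrix(documents, ds_words, di_words):
    """Document-level co-occurrence counts m_ij between domain-specific
    rows (w_i in W_DS) and domain-independent columns (w_j in W_DI),
    pooled over documents from both domains. Using the whole document
    as the co-occurrence window matches the simplification in the text."""
    ds_idx = {w: i for i, w in enumerate(ds_words)}
    di_idx = {w: j for j, w in enumerate(di_words)}
    M = [[0] * len(di_words) for _ in ds_words]
    for doc in documents:
        words = set(doc.split())
        for wi in words & ds_idx.keys():
            for wj in words & di_idx.keys():
                M[ds_idx[wi]][di_idx[wj]] += 1
    return M
```

Restricting the inner loops to word pairs within a fixed window, or down-weighting by distance, would implement the refinements discussed above.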

If a domain-specific word and a domain-independent word co-occur within the "window size", then there is an edge connecting them; the underlying assumption is that if the distance between two words exceeds the "window size", then the correlation between them is very weak. Furthermore, we can also use the distance between wi and wj to adjust the score mij: the smaller their distance, the larger the weight we can assign to the corresponding edge. In this chapter, for simplicity, we set the "window size" to be the maximum length of all documents, and we do not consider word positions when determining the weights of edges.

[Figure 6.1: A bipartite graph example of domain-specific and domain-independent features: the domain-specific words compact, realistic, sharp, hooked, blurry and boring on one side, and the domain-independent words good, exciting and never buy on the other, connected by unit-weight edges.]

6.4.3 Spectral Feature Clustering

In the previous section, we presented how to construct a bipartite graph between domain-specific and domain-independent features. In this section, we show how to adapt a spectral clustering algorithm on the feature bipartite graph to align domain-specific features. In spectral graph theory [41], there are two main assumptions: (1) if two nodes in a graph are connected to many common nodes, then these two nodes should be very similar (or quite related); and (2) there is a low-dimensional latent space underlying a complex graph, in which two nodes are similar to each other if they are similar in the original graph. Based on these two assumptions, spectral graph theory has been widely applied to many problems, e.g., dimensionality reduction and clustering [130, 13, 54]. In our case, we assume that (1) if two domain-specific features are connected to many common domain-independent features, then they tend to be very related and will be aligned to the same cluster with high probability; and (2) if two domain-independent features are connected to many common domain-specific features, then they also tend to be very related and will be aligned to the same cluster with high probability. With these assumptions, we expect that the mismatch between domain-specific features can be alleviated by applying graph spectral techniques on the feature bipartite graph to discover a new representation for domain-specific features. We want to show that by constructing a simple bipartite graph and adapting spectral clustering techniques on it, we can align domain-specific features effectively.

Before presenting how to adapt a spectral clustering algorithm to align domain-specific features, we first briefly review a standard spectral clustering algorithm [130]. Given a set of points V = {v1, v2, ..., vn} and their corresponding weighted graph G, the goal is to cluster the points into k clusters, where k is an input parameter:

1. Form an affinity matrix A ∈ R^(n×n) for V, where Aij = mij if i ≠ j, and Aii = 0.
2. Form a diagonal matrix D, where Dii = Σj Aij, and construct the matrix L = D^(−1/2) A D^(−1/2).
3. Find the k largest eigenvectors of L, u1, u2, ..., uk, and form the matrix U = [u1 u2 ... uk] ∈ R^(n×k).
4. Normalize U such that Uij = Uij / (Σj U²ij)^(1/2).
5. Apply the k-means algorithm to the rows of U to cluster the n points into k clusters.

(In spectral graph theory [41] and Laplacian Eigenmaps [13], the Laplacian matrix L̄ = I − L is used instead, where I is an identity matrix. This change only shifts the eigenvalues from λi to 1 − λi and has no impact on the eigenvectors; thus, selecting the k smallest eigenvectors of L̄ in [41, 13] is equivalent to selecting the k largest eigenvectors of L here.)

The standard spectral clustering algorithm assigns the n points to k discrete cluster indicators, which can be referred to as "discrete clustering". Zha et al. [220] and Ding and He [55] have proven that the k principal components of a term-document co-occurrence matrix are actually the continuous solution of the cluster membership indicators of documents in the k-means clustering method. In other words, the k principal components can automatically perform data clustering in the subspace spanned by those components. This implies that a mapping function constructed from the k principal components can cluster the original data and map them into a new space spanned by the clusters simultaneously. Motivated by this discovery, and based on the above assumptions, our goal is to learn a feature alignment mapping function φ(·) : R^(m−l) → R^k, which can reduce the gap between domains, where m is the number of all features, l is the number of domain-independent features, and m − l is the number of domain-specific features.
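The five steps above can be sketched as follows, assuming NumPy is available. The tiny k-means with farthest-point initialization is a stand-in for a full k-means implementation, not part of the cited recipe:

```python
import numpy as np

def spectral_clustering(A, k):
    """Sketch of the standard spectral clustering recipe [130]:
    normalize the affinity matrix, take the k largest eigenvectors,
    row-normalize them, and run a small k-means on the rows."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ A @ D_inv_sqrt          # L = D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    U = vecs[:, -k:]                         # k largest eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # Minimal k-means on the rows of U (farthest-point initialization).
    centers = [U[0]]
    for _ in range(1, k):
        d2 = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(20):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

On a graph with two disconnected cliques, the two leading eigenvectors separate the components exactly, so the k-means step recovers the two clusters.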

Now we adapt this procedure to align domain-specific features using the feature bipartite graph G. Though our goal is only to cluster the domain-specific features, it has been proved that clustering two related sets of points simultaneously can often achieve better results than clustering one single set of points only [54]. We therefore co-align the domain-specific and domain-independent features as follows:

1. Form a weight matrix M ∈ R^((m−l)×l), where Mij corresponds to the co-occurrence relationship between a domain-specific word wi ∈ WDS and a domain-independent word wj ∈ WDI.
2. Form an affinity matrix A = [[0, M], [M^T, 0]] ∈ R^(m×m) of the bipartite graph, where the first m − l rows and columns correspond to the m − l domain-specific features, and the last l rows and columns correspond to the l domain-independent features.
3. Form a diagonal matrix D, where Dii = Σj Aij, and construct the matrix L = D^(−1/2) A D^(−1/2).
4. Find the k largest eigenvectors of L, u1, u2, ..., uk, and form the matrix U = [u1 u2 ... uk] ∈ R^(m×k).
5. Define the feature alignment mapping function as φ(x) = xU[1:m−l,:], where U[1:m−l,:] denotes the first m − l rows of U and x ∈ R^(1×(m−l)).

Note that the affinity matrix A constructed in Step 2 is similar to the affinity matrix of the term-document bipartite graph proposed in [54], which is used for spectrally co-clustering terms and documents simultaneously. Given the feature alignment mapping function φ(·), for a data example xi from either the source or the target domain, we can first apply ϕDS(·) to identify the view associated with the domain-specific features of xi, and then apply φ(·) to obtain a new representation φ(ϕDS(xi)) of this view.

6.4.4 Feature Augmentation

If we had selected the domain-independent features and aligned the domain-specific features perfectly, then we could simply augment the domain-independent features with the features obtained via feature alignment to generate a perfect representation for cross-domain sentiment classification. However, in practice we may not be able to identify the domain-independent features correctly, and may thus fail to perform feature alignment perfectly. Similar to the strategy used in [4, 25], we therefore augment all original features with the features learned by feature alignment. More specifically, for each data example xi, the new feature representation is defined as x̄i = [xi, γφ(ϕDS(xi))], where xi ∈ R^(1×m), x̄i ∈ R^(1×(m+k)) and 0 ≤ γ ≤ 1. The tradeoff parameter γ is used in this feature augmentation to balance the effect of the original and the new features. In general, the value of γ can be determined by evaluation on some heldout data.
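The alignment steps and the augmentation x̄i = [xi, γφ(ϕDS(xi))] can be sketched as a minimal dense NumPy version, assuming the co-occurrence matrix M has already been computed; this is an illustrative sketch, not the thesis's code:

```python
import numpy as np

def sfa_mapping(M, K):
    """Sketch of the SFA alignment steps: build the bipartite affinity
    matrix A = [[0, M], [M^T, 0]], normalize it as L = D^{-1/2} A D^{-1/2},
    and keep the K largest eigenvectors. The first (m - l) rows of U give
    the alignment mapping phi(x) = x @ U_ds for a domain-specific view x."""
    m_l, l = M.shape                        # (m - l) DS words, l DI words
    A = np.block([[np.zeros((m_l, m_l)), M],
                  [M.T, np.zeros((l, l))]])
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ A @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    U = vecs[:, -K:]                        # K largest eigenvectors
    return U[:m_l, :]                       # rows for the domain-specific words

def augment(x_orig, x_ds, U_ds, gamma):
    """Feature augmentation: new representation [x_i, gamma * phi(x_DS)]."""
    return np.concatenate([x_orig, gamma * (x_ds @ U_ds)])
```

Applied to the toy co-occurrence matrix of Table 6.4, words with identical co-occurrence patterns (e.g., blurry and boring) receive identical rows in the learned mapping, so they land in the same cluster of the new representation.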

The whole process of our proposed framework for cross-domain sentiment classification is presented in Algorithm 6.1.

Algorithm 6.1 Spectral Feature Alignment (SFA) for cross-domain sentiment classification
Require: A labeled source domain data set DS = {(xSi, ySi)}, i = 1, ..., nS; an unlabeled target domain data set DT = {xTj}, j = 1, ..., nT; the number of clusters K; and the number of domain-independent features l.
Ensure: Predicted labels YT of the unlabeled data XT in the target domain.
1: Apply the criteria mentioned in Chapter 6.4.1 on DS and DT to select l domain-independent features. The remaining m − l features are treated as domain-specific features.
2: Using ΦDI = [ϕDI(XS); ϕDI(XT)] and ΦDS = [ϕDS(XS); ϕDS(XT)], calculate the (DS-word)-(DI-word) co-occurrence matrix M ∈ R^((m−l)×l).
3: Construct the matrix L = D^(−1/2) A D^(−1/2), where A = [[0, M], [M^T, 0]].
4: Find the K largest eigenvectors of L, u1, u2, ..., uK, and form the matrix U = [u1 u2 ... uK] ∈ R^(m×K). Let the mapping be φ(xi) = xi U[1:m−l,:], where xi ∈ R^(m−l).
5: Train a classifier f on {([xSi, γφ(ϕDS(xSi))], ySi)}, i = 1, ..., nS.
6: return f(x̄Ti)'s, where x̄Ti = [xTi, γφ(ϕDS(xTi))].

As can be seen in the algorithm, in the first step we select a set of domain-independent features using one of the three strategies introduced in Chapter 6.4.1. In the second step, we calculate the corresponding co-occurrence matrix M and then construct the normalized matrix L corresponding to the bipartite graph of the domain-independent and domain-specific features. Eigen-decomposition is performed on L to find the K leading eigenvectors, which are used to construct the mapping φ. Finally, training and testing are both performed on the augmented representations.

6.4.5 Computational Issues

Note that the computational cost of the SFA algorithm is dominated by the eigen-decomposition of the matrix L ∈ R^(m×m), where m is the total number of features and K is the dimensionality of the latent space. In practice, this takes O(Km²) time [175]. If m is large, a full eigen-decomposition would be computationally expensive. However, if the matrix L is sparse, then we can apply the Implicitly Restarted Arnoldi method to solve the eigen-decomposition of L iteratively and efficiently [175]; the computational time is then approximately O(Kpm), where p is the number of iterations of the method, specified by the user.
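As a sketch of the sparse route, SciPy's `eigsh` (which wraps ARPACK's implicitly restarted Lanczos iteration, the symmetric counterpart of the Arnoldi method cited above) computes only the K leading eigenvectors without a full dense decomposition. Whether the thesis used ARPACK or another solver is an assumption here:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def leading_eigenvectors_sparse(M, K):
    """When m is large but the bipartite graph is sparse, build L in
    sparse form and ask an iterative solver for only the K largest
    eigenpairs instead of performing a full eigen-decomposition."""
    M = sp.csr_matrix(M)
    A = sp.bmat([[None, M], [M.T, None]], format="csr")
    d = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ A @ D_inv_sqrt
    vals, vecs = eigsh(L, k=K, which="LA")   # K largest (algebraic) eigenvalues
    return vals, vecs
```

For a connected bipartite graph, the leading eigenvalue of the normalized affinity matrix is exactly 1, which gives a quick sanity check on the construction.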

In practice, the bipartite feature graph defined in Chapter 6.4.2 is extremely sparse; as a result, the matrix L is sparse as well. Hence, the eigen-decomposition of L can be solved efficiently.

6.4.6 Connection to Other Methods

Different from state-of-the-art cross-domain sentiment classification algorithms such as the SCL algorithm, which only learns common structures underlying domain-specific words without fully exploiting the relationship between domain-independent and domain-specific words, SFA can better capture the intrinsic relationship between the two kinds of words via the bipartite graph representation, and can learn a more compact and meaningful representation underlying the graph by co-aligning domain-independent and domain-specific words. Experiments on two real-world datasets indicate that SFA is indeed promising, obtaining better accuracy than several baselines, including SCL, for cross-domain sentiment classification.

**6.7 Experiments**

In this section, we describe our experiments on two real-world datasets and show the effectiveness of our proposed SFA for cross-domain sentiment classification.

6.7.1 Experimental Setup

Datasets

We first describe the datasets used in our experiments. The first dataset is from Blitzer et al. [25]. It contains a collection of product reviews from Amazon.com about four product domains: books (B), dvds (D), electronics (E) and kitchen appliances (K). In each domain, there are 1,000 positive reviews and 1,000 negative ones. Each review is assigned a sentiment label, −1 (negative review) or +1 (positive review), based on the rating score given by the review author. The sentiment classification task on this dataset is document-level sentiment classification. Hence, we can construct 12 cross-domain sentiment classification tasks: D → B, E → B, K → B, B → D, E → D, K → D, B → E, D → E, K → E, B → K, D → K and E → K, where the domain before an arrow is the source domain and the domain after the arrow is the target domain. We use RevDat to denote this dataset.

The other dataset was collected by us for experimental purposes. We crawled a set of reviews from the Amazon (http://www.amazon.com/) and TripAdvisor (http://www.tripadvisor.com/) websites. The reviews from Amazon are about three product domains: video games (V), electronics (E) and software (S); the TripAdvisor reviews are about the hotel (H) domain. Instead of assigning each review a label, we split the reviews into sentences and manually assigned a polarity label to each sentence, so the sentiment classification task on this dataset is sentence-level sentiment classification. In each domain, we randomly select 1,500 positive sentences and 1,500 negative ones for the experiments. Similarly, we construct 12 cross-domain sentiment classification tasks: V → E, S → E, H → E, E → V, S → V, H → V, E → S, V → S, H → S, E → H, V → H and S → H. We use SentDat to denote this dataset. The summary of the two datasets is given in Table 6.5. For both datasets, we use unigram and bigram features to represent each data example (a review in RevDat and a sentence in SentDat).

Table 6.5: Summary of datasets for evaluation.

  Dataset   Domain        # Reviews   # Pos    # Neg
  RevDat    books         2,000       1,000    1,000
            dvds          2,000       1,000    1,000
            electronics   2,000       1,000    1,000
            kitchen       2,000       1,000    1,000
  SentDat   electronics   3,000       1,500    1,500
            video game    3,000       1,500    1,500
            software      3,000       1,500    1,500
            hotel         3,000       1,500    1,500

Baselines

In order to investigate the effectiveness of our method, we compare it with several algorithms, described in this section. One baseline, denoted by NoTransf, is a classifier trained directly on the source domain training data; for example, for the D → B task, NoTransf means that we train a classifier with the labeled data of the D domain. The gold standard, denoted by upperBound, is an in-domain classifier trained with labeled data from the target domain; for example, for the D → B task, upperBound corresponds to a classifier trained with the labeled data of the B domain. The performance of upperBound on the D → B task can thus also be regarded as an upper bound for the E → B and K → B tasks. Another baseline, denoted by PCA, is a classifier trained on a new representation that augments the original features with new features learned by applying latent semantic analysis (which can also be referred to as principal component analysis) [53] on the original view of the domain-specific features (as shown in Table 6.2).

A third baseline, denoted by FALSA, is a classifier trained on a new representation that augments the original features with new features learned by applying latent semantic analysis on the co-occurrence matrix of domain-independent and domain-specific features. We compare our method with PCA and FALSA in order to investigate whether spectral feature clustering is effective in aligning domain-specific features. We have also compared our algorithm with structural correspondence learning (SCL), proposed in [25]; we follow the details described in Blitzer's thesis [23] to implement SCL with logistic regression for constructing the auxiliary tasks. Note that SCL, PCA, FALSA and our proposed SFA all use unlabeled data from the source and target domains to learn a new representation, and train classifiers using the labeled source domain data under the new representations.

Parameter Settings & Evaluation Criteria

For NoTransf, upperBound, PCA, FALSA and SFA, we use logistic regression as the basic sentiment classifier. The tradeoff parameter C in logistic regression [64] is set to 10000, which is equivalent to setting λ = 0.0001 in [23]. The library implemented in [64] is used in all our experiments. For all experiments on RevDat, we randomly split each domain's data into a training set of 1,600 instances and a test set of 400 instances. For all experiments on SentDat, we randomly split each domain's data into a training set of 2,000 instances and a test set of 1,000 instances. We report the average results of 5 random trials. The evaluation of the cross-domain sentiment classification methods is conducted on the test set of the target domain, for which no labeled training data are available. The parameters of each model are tuned on some heldout data of the E → B task of RevDat and the H → S task of SentDat, and are then fixed in all experiments. We use accuracy, the percentage of correctly classified examples over all test examples, to evaluate the sentiment classification results:

Accuracy = |{x | x ∈ Dtst ∧ f(x) = y}| / |{x | x ∈ Dtst}|,

where Dtst denotes the test data, y is the ground-truth sentiment polarity and f(x) is the predicted sentiment polarity.

6.7.2 Results

Overall Comparison Results between SFA and Baselines

In this section, we compare the accuracy of SFA with NoTransf, PCA, FALSA and SCL on the 24 tasks of the two datasets.

For SFA, we use Eqn. (6.2) defined in Chapter 6.4.1 to identify the domain-independent and domain-specific features, and we adopt the following settings: the number of domain-independent features l = 500, the number of domain-specific feature clusters k = 100, and the feature augmentation parameter γ = 0.6. For SCL, we use mutual information to select the "pivots"; the number of "pivots" is set to 500, and the dimensionality h in [25] is set to 50. All these parameters and the domain-independent feature (or "pivot") selection methods are determined based on results on the heldout data mentioned in the previous section. Studies of the SFA parameters are presented in the following subsections.

Figure 6.2 shows the comparison results of the different methods on the two datasets. In each plot, a group of bars represents one cross-domain sentiment classification task, each bar color represents one method, and the horizontal lines show the accuracies of upperBound.

[Figure 6.2: Comparison results (unit: %) on the two datasets: (a) comparison results on RevDat; (b) comparison results on SentDat.]

From Figure 6.2(a), we can observe that the four domains of RevDat can be roughly divided into two groups: the B and D domains are similar to each other, as are the K and E domains, but the two groups differ from each other. Adapting a classifier from the K domain to the E domain is therefore much easier than adapting one from the B domain. Clearly, our proposed SFA performs better than the other methods, including the state-of-the-art method SCL, in most tasks.

From the comparison results on SentDat, shown in Figure 6.2(b), we can draw a similar conclusion: SFA outperforms the other methods in most tasks. One interesting observation is that SCL does not work as well here as it does on RevDat. One reason may be that in sentence-level sentiment classification the data are quite sparse, and it is hard to construct a reasonable number of auxiliary tasks that are useful for modeling the relationship between "pivots" and "non-pivots"; the performance of SCL relies heavily on the auxiliary tasks, and in this dataset SCL even performs worse than FALSA in some tasks.

It is not surprising to find that FALSA achieves a significant improvement compared to NoTransf and PCA. The reason is that representing domain-specific features via domain-independent features can reduce the gap between domains and thus produce a reasonable representation for cross-domain sentiment classification. In contrast, clustering domain-specific features under a bag-of-words representation may fail to find a meaningful new representation, so PCA only outperforms NoTransf slightly in some tasks. Our proposed SFA can not only utilize the co-occurrence relationship between domain-independent and domain-specific features to reduce the gap between domains, but also use graph spectral clustering techniques to co-align both kinds of features and discover meaningful clusters of domain-specific features. Though our goal is only to cluster domain-specific features, it has been proved that clustering two related sets of points simultaneously can often get better results than clustering one single set of points only [54]. We perform t-tests on the comparison results of the two datasets and find that SFA outperforms the other methods at the 0.95 confidence level.

Effect of Domain-Independent Features on SFA

In this section, we conduct two experiments to study the effect of domain-independent features on the performance of SFA. The first experiment tests the effect of domain-independent features identified by different methods on the overall performance of SFA; the second tests the effect of different numbers of domain-independent features. As mentioned in Chapter 6.4.1, besides using Eqn. (6.2), we can also use the other two strategies to identify domain-independent features. We use SFA_DI, SFA_FQ and SFA_MI to denote SFA using Eqn. (6.2), the frequency of features in both domains, and the mutual information between features and labels in the source domain, respectively. In Table 6.6, we summarize the comparison results of SFA using these different methods to identify domain-independent features. From the table, we can observe that SFA_DI and SFA_FQ achieve comparable results and are stable in most tasks, while SFA_MI may work very well on some tasks, such as K → D and E → B of RevDat, but works very badly on others, such as E → D and D → E of RevDat. The reason is that applying mutual information on source domain data can find features that are relevant to the source domain labels, but cannot guarantee that the selected features are domain independent.

[Table 6.6: Experiments with different domain-independent feature selection methods; the numbers are accuracies in percentage for SFA_DI, SFA_FQ, SFA_MI, SFA_DI+MI and SFA_FQ+MI on all 24 tasks of RevDat and SentDat.]

In addition, the features selected by mutual information on the source domain may be irrelevant to the labels of the target domain. To test the effect of the number of domain-independent features on the performance of SFA, we apply SFA to all 24 tasks of the two datasets, changing the value of l from 300 to 700 with step length 100 and fixing k = 100. The results are shown in Figures 6.3(a) and 6.3(b). From the figures, we find that when l is in the range [400, 700], SFA performs well and stably in most tasks. Thus SFA is robust to the quality and the number of domain-independent features.

Model Parameter Sensitivity of SFA

Besides the number of domain-independent features l, there are two other parameters in SFA: the number of domain-specific feature clusters k, and the tradeoff parameter γ in feature augmentation. In this section, we further test the sensitivity of these two parameters with respect to the overall performance of SFA.

We first test the sensitivity of the parameter k. In this experiment, we fix l = 500 and γ = 0.6, and change the value of k from 50 to 200 with step length 25. Figures 6.4(a) and 6.4(b) show the results of SFA under varying values of k. Apparently, when the cluster number k falls in the range from 75 to 175, SFA works stably in most tasks, though for some tasks, such as V → H, H → S and E → H, SFA gets worse performance as the value of k increases.

[Figure 6.3: Study on varying numbers of domain-independent features of SFA: (a) results on RevDat; (b) results on SentDat.]

Finally, we test the sensitivity of the parameter γ. In this experiment, we fix l = 500 and k = 100, and change the value of γ from 0.1 to 1 with step length 0.1. Results are shown in Figures 6.5(a) and 6.5(b). When γ is very small, the original features dominate the effect of the new feature representation and SFA works badly in most tasks; once γ is large enough, SFA works stably in most tasks, which implies that the features learned by the spectral feature alignment algorithm do boost the performance of cross-domain sentiment classification.

Figure 6.4: Model parameter sensitivity study of k on two datasets. (a) Results on RevDat under varying values of k; (b) Results on SentDat under varying values of k.

6.8 Summary

In this section, we have studied the problem of how to develop a latent space for transfer learning when explicit domain knowledge is available in some specific applications. We proposed a spectral feature alignment (SFA) algorithm for cross-domain sentiment classification. Experimental results showed that our proposed method outperforms a state-of-the-art method in cross-domain sentiment classification, which implies that the features learned by the spectral feature alignment algorithm do boost the performance in cross-domain sentiment classification. One limitation of SFA is that it is domain-driven, which means it is hard to apply to other applications. However, the success of SFA in sentiment classification suggests that in some areas where domain knowledge is available or easy to obtain, developing a domain-driven method may be more effective than applying general solutions for learning the latent space for transfer learning.

Figure 6.5: Model parameter sensitivity study of γ on two datasets. (a) Results on RevDat under varying values of γ; (b) Results on SentDat under varying values of γ.

**CHAPTER 7 CONCLUSION AND FUTURE WORK**

7.1 Conclusion

Transfer learning is still at an early but promising stage, and many research issues remain to be addressed. In this thesis, we first surveyed the field of transfer learning: we gave definitions of transfer learning, summarized transfer learning into different settings, categorized transfer learning approaches into four contexts, analyzed the relationship between transfer learning and other related areas, discussed some interesting research issues in transfer learning, and introduced some applications of transfer learning. We then focused on the transductive transfer learning setting and proposed two different feature-based transfer learning frameworks for two different situations: (1) domain knowledge is hidden or hard to capture, and (2) domain knowledge can be observed or is easy to encode into learning. For the first situation, we proposed a novel and general dimensionality reduction framework for transfer learning, which aims to learn a latent space that reduces the distance between domains and preserves properties of the original data simultaneously. Based on this framework, we proposed three different algorithms to learn the latent space, known as Maximum Mean Discrepancy Embedding (MMDE) [135], Transfer Component Analysis (TCA) [139] and Semi-supervised Transfer Component Analysis (SSTCA) [140]. MMDE learns the latent space through a non-parametric kernel matrix, which may be more precise. However, the non-parametric kernel matrix learning requires solving a semi-definite program (SDP), which is extremely expensive (O((nS + nT)^6.5)). In contrast, TCA and SSTCA learn a parametric kernel matrix instead, which may make the latent space less precise but avoids the SDP solver and is much more efficient (O(m(nS + nT)^2), where m ≪ nS + nT). Hence, in practice, TCA and SSTCA are more applicable. We applied these three algorithms to two diverse applications, WiFi localization and text classification, and achieved promising results.
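The practical difference between MMDE and TCA comes down to how the kernel is obtained: MMDE learns the whole kernel matrix via an SDP, while TCA fixes a parametric kernel and reduces the problem to an eigendecomposition. The following is a simplified, unregularized sketch of a TCA-style computation with a linear kernel — an illustration of the idea, not the thesis implementation:

```python
import numpy as np

def tca(Xs, Xt, m=2, mu=1.0):
    """Simplified Transfer Component Analysis with a linear kernel.

    Finds m components that trade off a small MMD between source and
    target in the embedded space against preserved data variance.
    """
    ns, nt = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    n = ns + nt
    K = X @ X.T                          # parametric (linear) kernel

    # MMD coefficient matrix L: 1/ns^2, 1/nt^2 blocks, -1/(ns*nt) cross terms
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)

    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix

    # Eigendecomposition of (K L K + mu I)^{-1} K H K; keep top-m eigenvectors
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    W = vecs[:, np.argsort(-vals.real)[:m]].real
    return K @ W                         # embedded source + target points

rng = np.random.default_rng(1)
Xs = rng.normal(0.0, 1.0, (20, 5))       # source sample
Xt = rng.normal(0.5, 1.0, (25, 5))       # shifted target sample
Z = tca(Xs, Xt, m=2)
print(Z.shape)  # (45, 2)
```

The eigendecomposition here replaces the SDP solve that makes MMDE so expensive, which is the efficiency contrast described above.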
In order to capture and apply the domain knowledge inherent in many application areas, we also proposed a Spectral Feature Alignment (SFA) [137] algorithm for cross-domain sentiment classification, which aims at aligning domain-specific words from different domains in a latent space by modeling the correlation between the domain-independent and domain-specific words in a bipartite graph, using the domain-independent features as a bridge. Experimental results demonstrate the strong classification performance of SFA in cross-domain sentiment classification.
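The core of the spectral alignment step can be sketched in a few lines: build a co-occurrence matrix between domain-specific and domain-independent words, form the bipartite graph's normalized adjacency, and take its leading eigenvectors as the aligned representation. The co-occurrence counts below are toy values assumed purely for illustration:

```python
import numpy as np

# Toy co-occurrence counts: rows = domain-specific words,
# columns = domain-independent (pivot) words.
M = np.array([
    [3.0, 0.0, 1.0],   # e.g. "blurry"
    [2.0, 1.0, 0.0],   # e.g. "boring"
    [0.0, 4.0, 2.0],   # e.g. "sharp"
    [1.0, 3.0, 3.0],   # e.g. "readable"
])

ds, di = M.shape
A = np.zeros((ds + di, ds + di))   # bipartite adjacency matrix
A[:ds, ds:] = M
A[ds:, :ds] = M.T

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
S = D_inv_sqrt @ A @ D_inv_sqrt    # symmetrically normalized adjacency

# The top-k eigenvectors span the aligned feature space; rows 0..ds-1
# are the new representations of the domain-specific words.
vals, vecs = np.linalg.eigh(S)
k = 2
aligned = vecs[:, np.argsort(-vals)[:k]][:ds]
print(aligned.shape)  # (4, 2)
```

Domain-specific words that co-occur with the same domain-independent words end up close in this spectral space, which is exactly the alignment that lets a classifier trained on one domain generalize to the other.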


7.2 Future Work

In the future, we plan to continue our work on transfer learning along the following directions.

• We plan to study the proposed dimensionality reduction framework for transfer learning theoretically. Research issues include: (1) how to determine the number of reduced dimensions adaptively; (2) how to derive a generalization error bound for the proposed framework; and (3) how to develop an efficient algorithm for kernel choice in TCA and SSTCA based on the recent theoretical study in [176].

• We plan to extend the dimensionality reduction framework to relational learning, which means that data in the source and target domains can be relational instead of independently distributed. Most previous transfer learning methods assumed data from different domains to be independently distributed. However, in many real-world applications, such as link prediction in social networks or classifying Web pages within a Web site, data are often intrinsically relational, which presents a major challenge to transfer learning [52]. We plan to develop a relational dimensionality reduction framework that can encode the relational information among data into the latent space learning, while also reducing the difference between domains and preserving properties of the original data.

• We plan to study the negative transfer issue. It has been shown that when the source and target tasks are unrelated, the knowledge extracted from a source task may not help, and may even hurt, the performance on the target task [159]. Thus, how to avoid negative transfer and ensure safe transfer of knowledge is crucial in transfer learning. Recently, in [30], we proposed an Adaptive Transfer learning algorithm based on Gaussian Processes (AT-GP), which adapts the transfer learning scheme by automatically estimating the similarity between a source and a target task.
Preliminary experimental results show that the proposed algorithm is promising for avoiding negative transfer. In the future, we plan to develop a criterion to measure whether there exists a latent space onto which knowledge can be safely transferred across domains or tasks. Based on this criterion, we can develop a general framework to automatically identify whether the source and target tasks are related, and to determine whether we should project the data from the source and target domains onto the latent space.
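The idea of estimating task similarity before transferring can be illustrated with a toy multi-task Gaussian process whose inter-task correlation ρ is chosen by marginal likelihood. This is only a simplified illustration of the idea behind AT-GP, not the algorithm from [30]; all data, kernels and grids below are assumptions for the sake of a runnable example:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def log_marginal(y, K, noise=0.1):
    """GP log marginal likelihood of observations y under covariance K."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(2)
Xs = rng.uniform(-3, 3, (15, 1)); ys = np.sin(Xs[:, 0])        # source task
Xt = rng.uniform(-3, 3, (10, 1)); yt = np.sin(Xt[:, 0]) + 0.1  # related target task

X = np.vstack([Xs, Xt]); y = np.concatenate([ys, yt])
task = np.array([0] * 15 + [1] * 10)

def joint_kernel(rho):
    # Scale cross-task covariance by the inter-task correlation rho.
    B = np.where(task[:, None] == task[None, :], 1.0, rho)
    return rbf(X, X) * B

rhos = np.linspace(0.0, 0.95, 20)
best_rho = max(rhos, key=lambda r: log_marginal(y, joint_kernel(r)))
print(round(float(best_rho), 2))  # a large rho is preferred for related tasks
```

When the estimated ρ is near zero, cross-task covariance vanishes and the model effectively refuses to transfer, which is the "safe transfer" behavior the criterion above aims to formalize.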


REFERENCES

[1] Ahmed Abbasi, Hsinchun Chen, and Arab Salem. Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Transactions on Information Systems, 26(3):1–34, June 2008.

[2] Eneko Agirre and Oier Lopez de Lacalle. On robustness and domain adaptation using SVD for word sense disambiguation. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 17–24. ACL, June 2008.

[3] Morteza Alamgir, Moritz Grosse-Wentrup, and Yasemin Altun. Multitask learning for brain-computer interfaces. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 17–24. JMLR W&CP, May 2010.

[4] Rie K. Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.

[5] Rie Kubota Ando and Tong Zhang. A high-performance semi-supervised learning method for text chunking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 1–9. ACL, June 2005.

[6] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48. MIT Press, 2007.

[7] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 71–85. Springer, September 2008.

[8] Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. A spectral regularization framework for multi-task structure learning. In Annual Conference on Neural Information Processing Systems 20, pages 25–32. MIT Press, 2008.

[9] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Proceedings of the 7th IEEE International Conference on Data Mining Workshops, pages 77–82. IEEE Computer Society, 2007.


4:83–99. Ke Zhou. In Proceedings of the 25th International Conference on Machine learning. ACM. [18] Shai Ben-David and Reba Schuller. [14] Mikhail Belkin. 2003. In Annual Conference on Neural Information Processing Systems 19. and Tobias Scheffer. volume 9. [11] Bart Bakker and Tom Heskes. Gordon Sun. In Proceedings of the Sixteenth Annual Conference on Learning Theory. Alex Kulesza. Analysis of representations for domain adaptation. Jasmina Bogojeska. pages 129–136. pages 137–144. 2007. A theory of learning from different domains. and Ashish Verma. May 2010. Fernando Pereira. Task clustering and gating for bayesian multitask learning. pages 825–830. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples..[10] Jing Bai. [19] Indrajit Bhattacharya. In Proceedings of IEEE International Conference on Robotics and Automation. Exploiting task relatedness for multiple task learning. Crossguided clustering: Transfer of relevant supervision across domains for improved clustering. 15(6):1373–1396. and Yi Chang. Journal of Machine Learning Reserch. 2009. Neural Computation. John Blitzer. Guirong Xue. November 2006. Hongyuan Zha. Zhaohui Zheng. [16] Shai Ben-David. Machine Learning. pages 636–642. Journal of Machine Learning Research. and Fernando Pereira. Teresa Luu. Laplacian eigenmaps for dimensionality reduction and data representation. In Proceedings of the 9th IEEE International Conference on Data Mining. Sachindra Joshi. pages 56–63. 2003. John Blitzer. MIT Press. Shantanu Godbole. Multi-task learning for HIV therapy screening. pages 41–50. and David Pal. December 2009. and Vikas Sindhwani. Belle Tseng. [17] Shai Ben-David. Morgan Kaufmann Publishers Inc. 7:2399–2434. Impossibility theorems for domain adaptation. Tyler Lu. 97 . [12] Maxim Batalin. JMLR W&CP. August 2003. In Proceedings of the 13th International Conference on Artiﬁcial Intelligence and Statistics. 2010. April 2003. Mobile robot navigation using a sensor network. 
ACM. Koby Crammer. July 2008. Koby Crammer. and Myron Hattig. IEEE Computer Society. [15] Shai Ben-David. Partha Niyogi. In Proceeding of the 18th ACM conference on Information and knowledge management. Multi-task learning for learning to rank in web search. Thomas Lengauer. [20] Steffen Bickel. [13] Mikhail Belkin and Partha Niyogi. and Jenn Wortman. Gaurav Sukhatme. 79(12):151–175. pages 1549–1552.

Learning bounds for domain adaptation. and Qiang Yang. MIT Press. and Jenn Wortman. July 1998. 2008. 2007.[21] Steffen Bickel. ACM. Mark Dredze. and Chris Williams. [23] John Blitzer. [28] Edwin Bonilla. June 2007. Yu Zhang. ACL. Alex Kulesza. Domain Adaptation of Natural Language Processing Systems. The University of Pennsylvania. pages 92–100. 2009. Adaptive transfer learning. Kian Ming Chai. ACM. [29] Bin Cao. pages 153–160. July 2010. and Fernando Pereira. pages 145–152. bollywood. and Tobias Scheffer. pages 81–88. June 2010. Dit-Yan Yeung. In Proceedings of the Conference on Empirical Methods in Natural Language. Multi-task gaussian process prediction. 1997. AAAI Press. Fernando Pereira. Ryan McDonald. In Annual Conference on Neural Information Processing Systems 20. [27] Avrim Blum and Tom Mitchell. PhD thesis. Koby Crammer. July 2006. In Proceedings of the 24th international conference on Machine learning. In Proceedings of the 11th Annual Conference on Learning Theory. In Proceedings of the 27th International Conference on Machine learning. pages 432–439. Domain adaptation with structural correspondence learning. Discriminative learning for difu fering training and test distributions. [24] John Blitzer. pages 120–128. Machine Learning. and Qiang Yang. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. and Fernando Pereira. 98 . Nathan N. In Annual Conference on Neural Information Processing Systems 20. June 2007. Biographies. 2008. [30] Bin Cao. Liu. In Advances in Neural Information Processing Systems 21. boom-boxes and blenders: Domain adaptation for sentiment classiﬁcation. MIT Press. ACL. Transfer learning by distribution matching for targeted advertising. 28(1):41–75. Combining labeled and unlabeled data with co-training. In Proceedings of the 24th AAAI Conference on Artiﬁcial Intelligence. [25] John Blitzer. Transfer learning for collective link prediction in multiple heterogenous domains. and Tobias Scheffer. 
Michael Br¨ ckner. pages 129–136. Christoph Sawade. [31] Rich Caruana. [22] Steffen Bickel. [26] John Blitzer. Multitask learning. Sinno Jialin Pan.

Visual contextual advertising: Bringing textual advertisements to images. In Proceedings of the 26th Annual International Conference on Machine Learning. In Proceedings of the 25th international conference on Machine learning. Tag-based web photo retrieval improved by batch mode re-tagging. Chapelle. Semi-Supervised Learning. ACM. Stefan Klanke. Wai Lam. I. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. July 2010. [39] Tianqi Chen. Ya Zhang. July 2010. April 2010. In Advances in Neural Information Processing Systems 21. [38] Lin Chen. In Proceedings of the 24th AAAI Conference on Artiﬁcial Intelligence. Multi-task learning for boosting with application to web search ranking. 99 . Christopher K. Lei Tang. 1997. USA. American Mathematical Society. June 2010. June 2007. Domain adaptation with active learning for word sense disambiguation. [34] O. [37] Jianhui Chen. pages 1077–1078. AAAI Press. A uniﬁed architecture for natural language processing: deep neural networks with multitask learning. Williams. [35] Olivier Chapelle. [40] Yuqiang Chen. Jia Chen. Ivor W. 2006. and Qiang Yang. ACL. Number 92 in CBMS Regional Conference Series in Mathematics. A convex formulation for learning shared structures from multiple tasks. and Sethu Vijayakumar. MA. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Sch¨ lkopf. IEEE. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. pages 49–56. Extracting discriminative concepts for domain adaptation in text mining. ACM. [36] Bo Chen. Spectral Graph Theory. and Jieping Ye. Ou Jin. Jun Yan.[32] Kian Ming A. pages 265–272. June 2009. In Proceedings of the 19th International Conference on World Wide Web. 2009. pages 137–144. June 2009. Tsang. July 2008. and Jiebo Luo. pages 1189–1198. Camo bridge. [33] Yee Seng Chan and Hwee Tou Ng. [41] Fan R. Gui-Rong Xue. pages 160–167. 
and Tak-Lam Wong. Pannagadatta Shivaswamy. and Belle Tseng. Jun Liu. Multi-task gaussian process learning of robot inverse dynamics. Tsang. Gui-Rong Xue. Ivor W. ACM. MIT Press. and A. Chung. Chai. Dong Xu. [42] Ronan Collobert and Jason Weston. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Zien. B. Transfer learning for behavioral targeting. and Zheng Chen. Srinivas Vadrevu. ACM. Kilian Weinberger. pages 179–188. K.

Guirong Xue. [53] Scott Deerwester. In Proceedings of the 22nd AAAI Conference on Artiﬁcial Intelligence. pages 353–360. June 2009. [51] Hal Daum´ III. pages 200–207. Guirong Xue. [52] Jesse Davis and Pedro Domingos. pages 217–224. Thomas K. In Annual Conference on Neural Information Processing Systems 21. 2002. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining. Qiang Yang. In Advances in Neural Information Processing Systems 14. On kerneltarget alignment. Norwell. pages 540–545. Kluwer Academic Publishers. and Yong Yu. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM. Susan T. pages 256–263. Furnas. Language Modeling for Information Retrieval. 100 . pages 193– 200. [50] Wenyuan Dai. Qiang Yang. and Yong Yu. Boosting for transfer learning. Gui-Rong Xue. Landauer. ACL. In Proceedings of the 45th Annual e Meeting of the Association of Computational Linguistics. Gui-Rong Xue. Jaz Kandola. Qiang Yang. Qiang Yang. June 2009. Self-taught clustering. Bruce Croft and John Lafferty. [46] Wenyuan Dai. [47] Wenyuan Dai. Guirong Xue. August 2007. and Yong Yu. MA. Qiang Yang. 2003. June 2007. Co-clustering based classiﬁcation for out-of-domain documents. Transferring naive bayes classiﬁers for text classiﬁcation. pages 25–31. July 2007. 2009. 41:391–407. USA. ACM. In Proceedings of the 24th International Conference on Machine Learning. Eigentransfer: a uniﬁed framework for transfer learning. ACM. Yuqiang Chen. Translated learning: Transfer learning across different feature spaces. MIT Press. [48] Wenyuan Dai. Deep transfer via second-order markov logic. Ou Jin. ACM. [49] Wenyuan Dai. Dumais. George W. and Yong Yu. Andre Elisseeff. MIT Press. [44] W. In Proceedings of the 26th Annual International Conference on Machine Learning. June 2007. AAAI Press. ACM. pages 367– 373. and John Shawe-Taylor. and Yong Yu. Indexing by latent semantic analysis. [45] Wenyuan Dai. 
Journal of the American Society for Information Science. In Proceedings of the 25th International Conference of Machine Learning.[43] Nello Cristianini. and Yong Yu. Frustratingly easy domain adaptation. Guirong Xue. July 2008. and Richard Harshman. Qiang Yang. 1990.

Improving regressors using boosting techniques. Journal of Machine Learning Research. June 2009. Xiang-Rui Wang. [60] Eric Eaton. and Jiebo Luo. [59] Lixin Duan. ACM. IEEE. 101 . 2008. Springer. In Proceedings of the twenty-ﬁrst international conference on Machine learning. July 2001. Domain transfer svm for video concept detection. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. ACM. July 2004. [63] Theodoros Evgeniou and Massimiliano Pontil. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases.. September 2008. June 2010. and Terran Lane. and Chih-Jen Lin. Modeling transfer relationships between learning tasks for improved inductive transfer. K-means clustering via principal component analysis. pages 109–117. Ivor W. The Transfer of Learning. [58] Lixin Duan. Dong Xu. August 2004. The Macmillan Company. In Proceedings of the Fourteenth International Conference on Machine Learning. Ivor W. In Proceedings of the 22nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. and Tat-Seng Chua. Dong Xu. Visual event recognition in videos by learning from web data. Lecture Notes in Computer Science. [61] Henry C. Dong Xu. and Massimiliano Pontil. and Stephen J. [64] Rong-En Fan. July 1997.[54] Inderjit S. [57] Lixin Duan. Learning multiple tasks with kernel methods. Dhillon. Morgan Kaufmann Publishers Inc. Maybank. Domain adaptation from multiple sources via auxiliary classiﬁers. In Proceedings of the 26th International Conference on Machine Learning. Ivor W. [62] Theodoros Evgeniou. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Cho-Jui Hsieh. ACM. Co-clustering documents and words using bipartite spectral graph partitioning. [55] Chris Ding and Xiaofeng He. Marie desJardins. [56] Harris Drucker. ACM. 2005. Ellis. pages 317–332. 1965. 6:615–637. Tsang. 9:1871–1874. 
In Proceedings of the 7th ACM SIGKDD international conference on Knowledge discovery and data mining. June 2009. Tsang. pages 107–115. pages 225–232. Micchelli. IEEE. pages 289–296. pages 269–274. LIBLINEAR: A library for large linear classiﬁcation. Journal of Machine Learning Research. Tsang. New York. Charles A. Kai-Wei Chang. pages 1375–1381. Regularized multi-task learning.

Olivier Bousquet. [74] Andreas Haeberlen. 1995. [71] Hong Lei Guo. pages 162–169. Wallach. Learning to rank only using training data from related domain.11 wireless networks. and Zhong Su. Eliot Flannery. In Proceedings of the 10th Annual International Conference on Mobile Computing and Networking. pages 2480–2485. [69] A. A kernel method o for the two-sample problem. Schapire. Smola. AAAI Press. pages 70–84. [66] Yoav Freund and Robert E. ACM. and Neil Lawrence. Measuring o statistical dependence with Hilbert-Schmidt norms. Kam-Fai Wong. and Svetha Venkatesh. [68] Wei Gao. pages 509–516.[65] Brian Ferris. Algis Rudys. pages 283–291. In Advances in Neural Information Processing Systems 19. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Jing Jiang. Li Zhang. In Proceedings of the Second European Conference on Computational Learning Theory. In Proceedings of the 16th International Conference on Algorithmic Learning Theory. ACL. In Proceedings of the 23rd national conference on Artiﬁcial intelligence. pages 23–37. Springer. MIT Press. July 2010. pages 842–847. Borgwardt. and A. ACM. M. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. Text categorization with knowledge transfer from heterogeneous data sources. Dinh Phung. Sch¨ lkopf. October 2005. Alex J. Rasch. and Bernhard Sch¨ lkopf. [73] Sunil Kumar Gupta. Dan S. July 2006. Smola. July 2010. ACM. Nonnegative shared subspace learning and its application to social media retrieval. Kavraki. pages 63–77. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval. and Aoying Zhou. Ladd. January 2007. Empirical study on the performance stability of named entity recognition model across domains. K. Andrew M. In Proceedings of the 20th International Joint Conference on Artiﬁcial Intelligence. [70] Arthur Gretton. Wei Fan. Truyen Tran. 
Knowledge transfer via multiple model local structure mapping. pages 513–520. Peng Cai. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Gretton. WiFi-SLAM using gaussian process latent variable models. September 2004. 102 . Dieter Fox. 2007. [67] Jing Gao. pages 1169–1178. Brett Adams. B. August 2008. A decision-theoretic generalization of on-line learning and an application to boosting. volume 3734 of Lecture Notes in Computer Science. [72] Rakesh Gupta and Lev Ratinov. and Jiawei Han. July 2008. and Lydia E. Springer-Verlag. Practical robust localization over large-scale 802.

Active learning for multi-task adaptive ﬁltering. inference and prediction. and Takafumi Kanamori. The University of Illinois at Urbana-Champaign. Borgwardt. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th 103 . Multi-task learning of gaussian graphical models. Hu. Clustered multi-task learning: A convex formulation. MIT Press. [79] Jean Honorio and Dimitris Samaras. [86] Jing Jiang. Applied Statistics. Mining and summarizing customer reviews. Anthony Wong. Arthur Gretton. Multi-task feature and kernel selection for SVMs. 28:100–108. Karsten M. In Proceedings of 10th ACM International Conference on Ubiquitous Computing. July 2004. The elements of statistical learning: data mining. In Proceedings of the 21st International Conference on Machine Learning. [82] Jiayuan Huang. Yuta Tsuboi.[75] Abhay Harpale and Yiming Yang. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. September 2008. December 2008. Nathan N. [83] Laurent Jacob. volume 69. pages 168–177. Correcting sample selection bias by unlabeled data. [85] Jing Jiang. pages 601–608. Real world activity recognition with multiple goals. and Jean-Philippe Vert. Domain Adaptation in Natural Language Processing. In Proceedings of the 27th International Conference on Machine learning. Haifa. Alex Smola. Zheng. ACM. August 2004. [80] Derek H. pages 223–232. [76] John A. [81] Minqing Hu and Bing Liu. In Advances in Neuo ral Information Processing Systems 19. In Proceedings of the 8th IEEE International Conference on Data Mining. ACM. 2007. Israel. A k-means clustering algorithm. ACM. Francis Bach. ACM. pages 30–39. and Jerome Friedman. 2009. Sinno Jialin Pan. 1979. Robert Tibshirani. 2 edition. pages 745–752. Hisashi Kashima. June 2010. In Advances in Neural Information Processing Systems 21. Hartigan and M. [84] Tony Jebara. [77] Trevor Hastie. Vincent W. 
In Proceedings of the 27th International Conference on Machine learning. Multi-task transfer learning for weakly-supervised relation extraction. [78] Shohei Hido. and Bernhard Sch¨ lkopf. PhD thesis. and Qiang Yang. Liu. Springer. Masashi Sugiyama. 2008. June 2010. 2009. ACM. Inlier-based outlier detection via direct density ratio estimation.

Journal of Machine Learning Research. Yu-Ting Liang. April 2009. [96] Ken Lang. ACL. Shohei Hido. In Proceedings of the 12th International Conference on Machine Learning. ACL. Peter L. Nello Cristianini. pages 389–400. [95] Gert R. Opinion extraction. September 2010. pages 1012–1020. Lecture Notes in Computer Science. Lanckriet. 104 . [91] Takafumi Kanamori. 2009. April 2008. June 2007. Learning the kernel matrix with semideﬁnite programming. Instance weighting for domain adaptation in NLP. Springer. 5:27–72. A least-squares approach to direct importance estimation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. and Hsin-Hsi Chen. 10:1391–1445. G. pages 219–230. 2009. Weakly-paired maximum covariance analysis o for multimodal dimensionality reduction and transfer learning. Opinion spam and analysis. Bartlett. 1995. [88] Nitin Jindal and Bing Liu. Jordan. [89] Thorsten Joachims. In Proceedings of the Sixteenth International Conference on Machine Learning. pages 264–271. pages 137–142. Lampert and Oliver Kr¨ mer. [87] Jing Jiang and ChengXiang Zhai. In Proceedings of the 11th European Conference on Computer Vision. Laurent El Ghaoui. Newsweeder: Learning to ﬁlter netnews.. In Proceedings of the international conference on Web search and web data mining. Journal of Machine Learning Research. 2004. Transductive inference for text classiﬁcation using support vector machines. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. 2006. In Proceedings of 2009 SIAM International Conference on Data Mining. April 1998. In Proceedings of the 10th European Conference on Machine Learning. ACM. volume 2. pages 331–339. summarization and tracking in news and blog corpora. [92] Yoshinobu Kawahara and Masashi Sugiyama. 
Change-point detection in time-series data by direct density-ratio estimation.International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. [94] Christoph H. [93] Lun-Wei Ku. June 1999. and Masashi Sugiyama. pages 200–209. and Michael I. [90] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. Morgan Kaufmann Pulishers Inc.

In Proceedings of the 24th International Conference on Machine Learning. Efﬁcient sparse coding algorithms. ACM / Springer. Ng. pages 617–624. July 2004. David Vickrey. Transfer learning for collaborative ﬁltering via a rating-matrix generative model. June 2009. and Henry A. pages 489–496. and Andrew Y. pages 716– 717. Rajat Raina. pages 2052–2057. July 2009. Dieter Fox. Donald J. In Annual Conference on Neural Information Processing Systems 19. Large-scale localization from wireless signal strength. A non-negative matrix tri-factorization approach to sentiment classiﬁcation with lexical prior knowledge. In Proceedings of the 20th National Conference on Artiﬁcial Intelligence. July 1994. Morgan Kaufmann Publishers Inc. pages 244–252. Gale. pages 3–12. 105 . Qiang Yang. Can movies and books collaborate?: crossdomain collaborative ﬁltering for sparsity reduction. [106] Lin Liao. and Daphne Koller. [98] Honglak Lee. Learning to learn with the informative vector machine. Banff. MIT Press. [105] Tao Li. August 2009. [99] Su-In Lee. 171(5-6):311–331. In Proceedings of the 21st international jont conference on Artiﬁcal intelligence. July 2009. ACM. [103] Bin Li. Canada. and Xiangyang Xue. Vassil Chatalbashev. Dieter Fox. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. pages 15–20. Kautz. Alexis Battle. [101] David D. [102] Bin Li. [100] Julie Letchner. and Vikas Sindhwani. July 2007. 2007. ACM. Patterson. AAAI Press / The MIT Press. Alberta. Learning and inferring transportation routines. Lewis and William A. Knowledge transformation for cross-domain sentiment classiﬁcation. and Xiangyang Xue. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association of Computational Linguistics. In Proceedings of the 21st International Conference on Machine Learning. Lawrence and John C. [104] Tao Li. 2007. Yi Zhang. Platt. Chris Ding. and Anthony LaMarca. and Yi Zhang. ACM. Vikas Sindhwani.. 
In Proceedings of the 26th Annual International Conference on Machine Learning. Qiang Yang. July 2005. A sequential algorithm for training text classiﬁers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. ACL. Artiﬁcial Intelligence. pages 801–808.[97] Neil D. ACM. Learning a metalevel prior for feature relevance from multiple related tasks.

[107] Xuejun Liao, Ya Xue, and Lawrence Carin. Logistic regression with an auxiliary data source. In Proceedings of the 22nd International Conference on Machine Learning, pages 505–512. ACM, August 2005.
[108] Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Spectral domain-transfer learning. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 488–496. ACM, August 2008.
[109] Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, and Yong Yu. Can Chinese web pages be classified with English data source? In Proceedings of the 17th International Conference on World Wide Web, pages 969–978. ACM, April 2008.
[110] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, January 2007.
[111] Bing Liu. Sentiment analysis and subjectivity. In Handbook of Natural Language Processing, Second Edition. 2010.
[112] Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, June 2009.
[113] Kang Liu and Jun Zhao. Cross-domain sentiment classification using a two-stage method. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1717–1720. ACM, November 2009.
[114] Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. Modeling and predicting the helpfulness of online reviews. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 443–452. IEEE Computer Society, December 2008.
[115] Yiming Liu, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Using large-scale web data to facilitate textual query based retrieval of consumer photos. In Proceedings of the 17th ACM International Conference on Multimedia, pages 55–64. ACM, October 2009.
[116] Yue Lu, Panayiotis Tsaparas, Alexandros Ntoulas, and Livia Polanyi. Exploiting social context for review quality prediction. In Proceedings of the 19th International Conference on World Wide Web, pages 691–700. ACM, April 2010.
[117] Yue Lu and Chengxiang Zhai. Opinion integration through semi-supervised topic modeling. In Proceedings of the 17th International Conference on World Wide Web, pages 121–130. ACM, April 2008.

[118] Yue Lu, ChengXiang Zhai, and Neel Sundaresan. Rated aspect summarization of short comments. In Proceedings of the 18th International Conference on World Wide Web. ACM, April 2009.
[119] Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, and Qing He. Transfer learning from multiple source domains via consensus regularization. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 103–112. ACM, October 2008.
[120] M. M. Hassan Mahmud and Sylvian R. Ray. Transfer learning using Kolmogorov complexity: Basic theory and empirical evaluations. In Annual Conference on Neural Information Processing Systems 20, pages 985–992. MIT Press, 2008.
[121] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory, June 2009.
[122] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems 21, pages 1041–1048. 2009.
[123] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, June 2009.
[124] Lilyana Mihalkova, Tuyen Huynh, and Raymond J. Mooney. Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pages 608–614. AAAI Press, July 2007.
[125] Lilyana Mihalkova and Raymond J. Mooney. Transfer learning from minimal target data by mapping across relational domains. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1163–1168. Morgan Kaufmann Publishers Inc., July 2009.
[126] Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, and Klaus-Robert Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, pages 41–48. IEEE, 1999.
[127] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[128] Tony Mullen and Nigel Collier. Sentiment analysis using support vector machines with diverse information sources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 412–418. ACL, July 2004.

[129] Yurii Nesterov and Arkadii Nemirovskii. Interior-point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1994.
[130] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856. MIT Press, 2001.
[131] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134, 2000.
[132] Jeffrey J. Pan, James T. Kwok, Qiang Yang, and Yiqiang Chen. Multidimensional vector regression for accurate and low-cost location estimation in pervasive computing. IEEE Transactions on Knowledge and Data Engineering, 18(9):1181–1193, September 2006.
[133] Jeffrey J. Pan, Qiang Yang, Hong Chang, and Dit-Yan Yeung. A manifold regularization approach to calibration reduction for sensor-network based tracking. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 988–993. AAAI Press, July 2006.
[134] Jeffrey J. Pan and Qiang Yang. Co-localization from labeled and unlabeled data using graph Laplacian. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2166–2171. January 2007.
[135] Sinno Jialin Pan, James T. Kwok, Qiang Yang, and Jeffrey J. Pan. Adaptive localization in a dynamic WiFi environment through multi-view learning. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pages 1108–1113. AAAI Press, July 2007.
[136] Sinno Jialin Pan, James T. Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 677–682. AAAI Press, July 2008.
[137] Sinno Jialin Pan, Dou Shen, Qiang Yang, and James T. Kwok. Transferring localization models across space. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 1383–1388. AAAI Press, July 2008.
[138] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1187–1192. July 2009.
[139] Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International Conference on World Wide Web, pages 751–760. ACM, April 2010.

[140] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks. Submitted.
[141] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October 2010.
[142] Sinno Jialin Pan, Vincent W. Zheng, Qiang Yang, and Derek H. Hu. Transfer learning for WiFi-based indoor localization. In Proceedings of the Workshop on Transfer Learning for Complex Task of the 23rd AAAI Conference on Artificial Intelligence, July 2008.
[143] Weike Pan, Evan W. Xiang, Nathan N. Liu, and Qiang Yang. Transfer learning in collaborative filtering for sparsity reduction. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. AAAI Press, July 2010.
[144] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 271–278. ACL, July 2004.
[145] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.
[146] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 79–86. ACL, July 2002.
[147] David Pardoe and Peter Stone. Boosting for regression transfer. In Proceedings of the 27th International Conference on Machine Learning, June 2010.
[148] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.
[149] Piyush Rai and Hal Daumé III. Infinite predictor subspace models for multitask learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, July 2010.
[150] Rajat Raina, Andrew Y. Ng, and Daphne Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 713–720. ACM, June 2006.
[151] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766. ACM, June 2007.

[152] Sowmya Ramachandran and Raymond J. Mooney. Theory refinement of Bayesian networks with hidden variables. In Proceedings of the 15th International Conference on Machine Learning, pages 454–462. July 1998.
[153] Parisa Rashidi and Diane J. Cook. Activity recognition based on home to home transfer learning. In Proceedings of the Workshop on Plan, Activity, and Intent Recognition of the 24th AAAI Conference on Artificial Intelligence, July 2010.
[154] Vikas C. Raykar, Balaji Krishnapuram, Jinbo Bi, Murat Dundar, and R. Bharat Rao. Bayesian multiple instance learning: automatic feature selection and inductive transfer. In Proceedings of the 25th International Conference on Machine Learning, pages 808–815. ACM, July 2008.
[155] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
[156] Alexander E. Richman and Patrick Schone. Mining Wiki resources for multilingual named entity recognition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technology, pages 1–9. ACL, June 2008.
[157] Stephen Robertson and Ian Soboroff. The TREC 2002 filtering track report. In Text REtrieval Conference, 2002.
[158] Marcus Rohrbach, Michael Stark, György Szarvas, Iryna Gurevych, and Bernt Schiele. What helps where - and why? Semantic relatedness for knowledge transfer. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, June 2010.
[159] Michael T. Rosenstein, Zvika Marx, and Leslie Pack Kaelbling. To transfer or not to transfer. In NIPS-05 Workshop on Inductive Transfer: 10 Years Later, December 2005.
[160] Ulrich Rückert and Stefan Kramer. Kernel-based inductive transfer. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 220–233. Springer, September 2008.
[161] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision, Lecture Notes in Computer Science. Springer, September 2010.
[162] B. Schölkopf, S. Mika, C.J.C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A.J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, September 1999.

[163] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
[164] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[165] Anton Schwaighofer, Volker Tresp, and Kai Yu. Learning Gaussian process kernels via hierarchical Bayes. In Annual Conference on Neural Information Processing Systems 17, pages 1209–1216. MIT Press, 2005.
[166] Gabriele Schweikert, Christian Widmer, Bernhard Schölkopf, and Gunnar Rätsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In Advances in Neural Information Processing Systems 21, pages 1433–1440. MIT Press, 2009.
[167] Chun-Wei Seah, Ivor W. Tsang, Yew-Soon Ong, and Kee-Khoon Lee. Predictive distribution matching SVM for multi-domain learning. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, September 2010.
[168] Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[169] Xiaoxiao Shi, Wei Fan, and Jiangtao Ren. Actively transfer domain knowledge. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, pages 342–357. Springer, September 2008.
[170] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244, 2000.
[171] Vikas Sindhwani and Prem Melville. Document-word co-regularization for semi-supervised sentiment analysis. In Proceedings of the 8th IEEE International Conference on Data Mining, pages 1025–1030. IEEE, December 2008.
[172] Alex J. Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer, October 2007.
[173] Le Song. Learning via Hilbert Space Embedding of Distributions. PhD thesis, The University of Sydney, 2008.

[174] Le Song, Arthur Gretton, Karsten Borgwardt, and Alex Smola. Colored maximum variance unfolding. In Advances in Neural Information Processing Systems 20, pages 1385–1392. MIT Press, 2008.
[175] Danny C. Sorensen. Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. Technical Report TR-96-40, Department of Computational and Applied Mathematics, Rice University, Houston, Texas, 1996.
[176] Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Gert Lanckriet, and Bernhard Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems 22, pages 1750–1758. 2009.
[177] Michael Stark, Michael Goesele, and Bernt Schiele. A shape-based object class model for knowledge transfer. In Proceedings of the 12th IEEE International Conference on Computer Vision, September 2009.
[178] Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
[179] Masashi Sugiyama, Motoaki Kawanabe, and Pui Ling Chui. Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Networks, 23(1):44–59, 2010.
[180] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Von Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, pages 1433–1440. MIT Press, 2008.
[181] Taiji Suzuki and Masashi Sugiyama. Estimating squared-loss mutual information for independent component analysis. In Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation, pages 130–137. March 2009.
[182] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
[183] Sebastian Thrun and Lorien Pratt, editors. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[184] Ivan Titov and Ryan McDonald. A joint model of text and aspect ratings for sentiment summarization. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technology, pages 308–316. ACL, June 2008.

[185] Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, June 2010.
[186] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[187] Peter Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417–424. ACL, July 2002.
[188] Laurens van der Maaten, Eric Postma, and Jaap van den Herik. Dimensionality reduction: A comparative review. Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
[189] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, September 1998.
[190] Alexander Vezhnevets and Joachim Buhmann. Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, June 2010.
[191] Paul von Bünau, Frank C. Meinecke, Franz C. Király, and Klaus-Robert Müller. Finding stationary subspaces in multivariate time series. Physical Review Letters, 103(21), November 2009.
[192] Bo Wang, Jie Tang, Wei Fan, Songcan Chen, Zi Yang, and Yanzhu Liu. Heterogeneous cross domain ranking in latent space. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 987–996. ACM, November 2009.
[193] Chang Wang and Sridhar Mahadevan. Manifold alignment using Procrustes analysis. In Proceedings of the 25th International Conference on Machine Learning, pages 1120–1127. ACM, July 2008.
[194] Hua-Yan Wang, Vincent W. Zheng, Junhui Zhao, and Qiang Yang. Indoor localization in multi-floor environments with reduced effort. In Proceedings of the 8th Annual IEEE International Conference on Pervasive Computing and Communications, pages 244–252. IEEE, March 2010.
[195] Pu Wang, Carlotta Domeniconi, and Jian Hu. Using Wikipedia for co-clustering based cross-domain text classification. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 1085–1090. IEEE Computer Society, December 2008.

[196] Xiaogang Wang, Cha Zhang, and Zhengyou Zhang. Boosted multi-task learning for face verification with applications to web image and video search. In Proceedings of the 22nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 142–149. IEEE, June 2009.
[197] Zheng Wang, Yangqiu Song, and Changshui Zhang. Transferred dimensionality reduction. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, pages 550–565. Springer, September 2008.
[198] Kilian Q. Weinberger, Fei Sha, and Lawrence K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada. ACM, July 2004.
[199] Christian Widmer, Jose Leiva, Yasemin Altun, and Gunnar Rätsch. Leveraging sequence classification by taxonomy-based multitask learning. In Proceedings of the 14th Annual International Conference on Research in Computational Molecular Biology, volume 6044 of Lecture Notes in Computer Science, pages 522–534. Springer, April 2010.
[200] Tak-Lam Wong, Wai Lam, and Bo Chen. Mining employment market via text block detection and adaptive cross-domain information extraction. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 283–290. ACM, July 2009.
[201] Dan Wu, Wee Sun Lee, Nan Ye, and Hai Leong Chieu. Domain adaptive bootstrapping for named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1523–1532. ACL, August 2009.
[202] Pengcheng Wu and Thomas G. Dietterich. Improving SVM accuracy by training on auxiliary data sources. In Proceedings of the 21st International Conference on Machine Learning. ACM, July 2004.
[203] Evan W. Xiang, Bin Cao, Derek H. Hu, and Qiang Yang. Bridging domains using world wide knowledge for transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:770–783, 2010.
[204] Sihong Xie, Wei Fan, Jing Peng, Olivier Verscheure, and Jiangtao Ren. Latent space domain transfer between high dimensional overlapping distributions. In 18th International World Wide Web Conference, pages 91–100. April 2009.
[205] Dikan Xing, Wenyuan Dai, Gui-Rong Xue, and Yong Yu. Bridged refinement for transfer learning. In 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 324–335. Springer, September 2007.

[206] Qian Xu, Sinno Jialin Pan, Hannah Hong Xue, and Qiang Yang. Multitask learning for protein subcellular location prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 99(PrePrints), 2010.
[207] Gui-Rong Xue, Wenyuan Dai, Qiang Yang, and Yong Yu. Topic-bridged PLSA for cross-domain text classification. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 627–634. ACM, July 2008.
[208] Rong Yan and Jian Zhang. Transfer learning using task-level features with application to information retrieval. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1315–1320. Morgan Kaufmann Publishers Inc., July 2009.
[209] Jun Yang, Rong Yan, and Alexander G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th International Conference on Multimedia, pages 188–197. ACM, September 2007.
[210] Qiang Yang, Yuqiang Chen, Gui-Rong Xue, Wenyuan Dai, and Yong Yu. Heterogeneous transfer learning for image clustering via the social web. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 1–9. ACL, 2009.
[211] Qiang Yang. Activity recognition: Linking low-level sensors to high-level intelligence. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 20–25. Morgan Kaufmann Publishers Inc., July 2009.
[212] Qiang Yang, Sinno Jialin Pan, and Vincent W. Zheng. Estimating location using Wi-Fi. IEEE Intelligent Systems, 23(1):8–13, 2008.
[213] Tianbao Yang, Rong Jin, Anil K. Jain, Yang Zhou, and Wei Tong. Unsupervised transfer classification: application to text categorization. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1159–1168. ACM, July 2010.
[214] Xiaolin Yang, Seyoung Kim, and Eric Xing. Heterogeneous multitask learning with joint sparsity constraints. In Advances in Neural Information Processing Systems 22, pages 2151–2159. 2009.

[215] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, pages 412–420. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, July 1997.
[216] Jieping Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005.
[217] Jie Yin, Qiang Yang, and L. M. Ni. Learning adaptive temporal radio maps for signal-strength-based location estimation. IEEE Transactions on Mobile Computing, 7(7):869–883, 2008.
[218] Xiao-Tong Yuan and Shuicheng Yan. Visual classification with multi-task joint sparse representation. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, June 2010.
[219] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the 21st International Conference on Machine Learning. ACM, July 2004.
[220] Hongyuan Zha, Xiaofeng He, Chris Ding, Horst Simon, and Ming Gu. Spectral relaxation for k-means clustering. In Advances in Neural Information Processing Systems 14, pages 1057–1064. MIT Press, 2001.
[221] Zheng-Jun Zha, Tao Mei, Meng Wang, Zengfu Wang, and Xian-Sheng Hua. Robust distance metric learning with auxiliary knowledge. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1327–1332. July 2009.
[222] Kai Zhang, Joe W. Gray, and Bahram Parvin. Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics, 26(12):i97–i105, June 2010.
[223] Yu Zhang, Bin Cao, and Dit-Yan Yeung. Multi-domain collaborative filtering. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, July 2010.
[224] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, July 2010.
[225] Yu Zhang and Dit-Yan Yeung. Multi-task warped Gaussian process for personalized age estimation. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, June 2010.

[226] Yu Zhang and Dit-Yan Yeung. Transfer metric learning by learning task relationships. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1199–1208. ACM, July 2010.
[227] Peilin Zhao and Steven C.H. Hoi. OTL: A framework of online transfer learning. In Proceedings of the 27th International Conference on Machine Learning, June 2010.
[228] Vincent W. Zheng, Derek H. Hu, and Qiang Yang. Cross-domain activity recognition. In Proceedings of the 11th International Conference on Ubiquitous Computing. ACM, September 2009.
[229] Vincent W. Zheng, Sinno Jialin Pan, Qiang Yang, and Jeffrey J. Pan. Transferring multi-device localization models using latent multi-task learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 1427–1432. AAAI Press, July 2008.
[230] Vincent W. Zheng, Evan W. Xiang, Qiang Yang, and Dou Shen. Transferring localization models over time. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 1421–1426. AAAI Press, July 2008.
[231] Erheng Zhong, Wei Fan, Jing Peng, Kun Zhang, Jiangtao Ren, Deepak Turaga, and Olivier Verscheure. Cross domain distribution adaptation via kernel mapping. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1027–1036. ACM, June 2009.
[232] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328. MIT Press, 2004.
[233] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[234] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, pages 912–919. August 2003.
[235] Hankui Zhuo, Qiang Yang, Derek H. Hu, and Lei Li. Transferring knowledge from another domain for learning action models. In Proceedings of the 10th Pacific Rim International Conference on Artificial Intelligence, pages 1110–1115. Springer-Verlag, December 2008.
