
FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by

JIALIN PAN

A Thesis Submitted to The Hong Kong University of Science and Technology in Partial Fulﬁllment of the Requirements for the Degree of Doctor of Philosophy in Computer Science and Engineering

September 2010, Hong Kong

Copyright © by Jialin Pan 2010

Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

JIALIN PAN


FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by JIALIN PAN

This is to certify that I have examined the above Ph.D. thesis and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the thesis examination committee have been made.

PROF. QIANG YANG, THESIS SUPERVISOR

PROF. SIU-WING CHENG, ACTING HEAD OF DEPARTMENT
Department of Computer Science and Engineering
17 September 2010

ACKNOWLEDGMENTS

First of all, I would like to express my thanks to my supervisor Prof. Qiang Yang sincerely and gratefully for his valuable advice and in-depth guidance throughout my Ph.D. study. His research passion made me become more and more interested in doing research on machine learning. With his help, I learned how to survey a field of interest, how to discover interesting research topics, how to write research papers and how to do presentations clearly. I also learned that as a researcher, one needs to be always positive and active, and work harder and harder. What I learned from him would be great wealth in my future research career.

I would like to express my special thanks to Dr. Rong Pan. When I was still a master student in Sun Yat-Sen University, Dr. Rong Pan brought me to the field of machine learning. I also thank him for his useful advice and information on my job hunting. I am also very thankful to Prof. James Kwok, who I worked with closely at HKUST during the past four years. James deeply impressed me, from his passion for doing research to his sense for research ideas. I learned a lot from him.

I would also like to thank Prof. Dit-Yan Yeung, Prof. Brian Mak, Prof. Fugee Tsung, Prof. Zhihua Zhang, Prof. Jieping Ye and Prof. Raymond Wong for serving as committee members of my proposal and thesis defenses. Their valuable comments are of great help in my research work. Special thanks also goes to Prof. Carlos Guestrin and Prof. Andreas Krause, who hosted and supervised me during my visits to Carnegie Mellon University and California Institute of Technology, respectively. I thank them for their advice and discussion on my research work during my visits.

Furthermore, during my internship at Microsoft Research Asia, I got generous help from my mentor Dr. Jian-Tao Sun. I also got great help from Xiaochuan Ni, Dr. Zheng Chen, Dr. Weizhu Chen, Dr. Jingdong Wang, Dr. Yi Wang and Dr. Gang Wang. I thank them very much.

In the past four years, I am grateful to current and previous members in our research group. They are Dr. Dou Shen, Dr. Jie Yin, Dr. Ivor Tsang, Dr. Hong Chang, Dr. Bin Cao, Dr. Junfeng Pan, Dr. Nan Liu, Dr. Wei Xiang, Dr. Qian Xu, Weike Pan, Yu Zhang, Erheng Zhong, Yin Zhu, Si Shen and so on. I am also grateful to a number of colleagues at HKUST. Special thanks goes to Guang Dai, Wu-Jun Li, Bingsheng He, Hao Hu, Chak-Keung Chan, Wenchen Zheng, Jian Xia, Pingzhong Tang, Qi Wang, Yi Zhen and so on.

Last but not least, I would like to give my deepest gratitude to my parents Yunyu Pan and Jinping Feng, who always encourage and support me when I feel depressed. I would also like to give my great thanks to my wife Zhiyun Zhao. Her kind help and support make everything I have possible. I dedicate this dissertation to my parents, my wife and my little son Zhuoxuan Pan.

TABLE OF CONTENTS

Title Page
Authorization Page
Signature Page
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract

Chapter 1  Introduction
  1.1  The Contribution of This Thesis
  1.2  The Organization of This Thesis

Chapter 2  A Survey on Transfer Learning
  2.1  Overview
    2.1.1  A Brief History of Transfer Learning
    2.1.2  Notations and Definitions
    2.1.3  A Categorization of Transfer Learning Techniques
  2.2  Inductive Transfer Learning
    2.2.1  Transferring Knowledge of Instances
    2.2.2  Transferring Knowledge of Feature Representations
    2.2.3  Transferring Knowledge of Parameters
    2.2.4  Transferring Relational Knowledge
  2.3  Transductive Transfer Learning
    2.3.1  Transferring the Knowledge of Instances
    2.3.2  Transferring Knowledge of Feature Representations
  2.4  Unsupervised Transfer Learning
    2.4.1  Transferring Knowledge of Feature Representations
  2.5  Transfer Bounds and Negative Transfer
  2.6  Other Research Issues of Transfer Learning
  2.7  Real-world Applications of Transfer Learning

Chapter 3  Transfer Learning via Dimensionality Reduction
  3.1  Motivation
  3.2  Preliminaries
    3.2.1  Dimensionality Reduction
    3.2.2  Hilbert Space Embedding of Distributions
    3.2.3  Maximum Mean Discrepancy
    3.2.4  Dependence Measure
  3.3  A Novel Dimensionality Reduction Framework
    3.3.1  Minimizing Distance between P(ϕ(XS)) and P(ϕ(XT))
    3.3.2  Preserving Properties of XS and XT
    3.3.3  Kernel Learning for Transfer Latent Space
    3.3.4  Make Predictions in Latent Space
    3.3.5  Summary
  3.4  Maximum Mean Discrepancy Embedding (MMDE)
    3.4.1  Optimization Objectives
    3.4.2  Formulation and Optimization Procedure
    3.4.3  Experiments on Synthetic Data
    3.4.4  Summary
  3.5  Transfer Component Analysis (TCA)
    3.5.1  Parametric Kernel Map for Unseen Data
    3.5.2  Unsupervised Transfer Component Extraction
    3.5.3  Experiments on Synthetic Data
    3.5.4  Summary
  3.6  Semi-Supervised Transfer Component Analysis (SSTCA)
  3.7  Further Discussion

Chapter 4  Applications to WiFi Localization
  4.1  WiFi Localization and Cross-Domain WiFi Localization
  4.2  Experimental Setup
  4.3  Results
    4.3.1  Comparison with Dimensionality Reduction Methods
    4.3.2  Comparison with Non-Adaptive Methods
    4.3.3  Comparison with Domain Adaptation Methods
    4.3.4  Comparison with MMDE
    4.3.5  Sensitivity to Model Parameters
  4.4  Summary

Chapter 5  Applications to Text Classification
  5.1  Text Classification and Cross-domain Text Classification
  5.2  Experimental Setup
  5.3  Results
    5.3.1  Comparison to Other Methods
    5.3.2  Sensitivity to Model Parameters
  5.4  Summary

Chapter 6  Domain-Driven Feature Space Transfer for Sentiment Classification
  6.1  Sentiment Classification
  6.2  Existing Works in Cross-Domain Sentiment Classification
  6.3  Problem Statement and A Motivating Example
  6.4  Spectral Domain-Specific Feature Alignment
    6.4.1  Domain-Independent Feature Selection
    6.4.2  Bipartite Feature Graph Construction
    6.4.3  Spectral Feature Clustering
    6.4.4  Feature Augmentation
  6.5  Computational Issues
  6.6  Connection to Other Methods
  6.7  Experiments
    6.7.1  Experimental Setup
    6.7.2  Results
  6.8  Summary

Chapter 7  Conclusion and Future Work
  7.1  Conclusion
  7.2  Future Work

References

LIST OF FIGURES

1.1  Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods and received by different mobile devices. Different colors denote different signal strength values (unit: dBm). Note that the original signal strength values are non-positive (the larger the stronger); here, we shift them to positive values for visualization.
1.2  The organization of thesis.
2.1  Different learning processes between traditional machine learning and transfer learning.
2.2  An overview of different settings of transfer.
3.1  Motivating examples for the dimensionality reduction framework: the direction with the largest variance is orthogonal to the discriminative direction; there exists an intrinsic manifold structure underlying the observed data.
3.2  Illustrations of the proposed TCA and SSTCA on synthetic dataset 1. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.3  Illustrations of the proposed TCA and SSTCA on synthetic dataset 2. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.4  Illustrations of the proposed TCA and SSTCA on synthetic dataset 3. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.5  Illustrations of the proposed TCA and SSTCA on synthetic dataset 4. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
3.6  Illustrations of the proposed TCA and SSTCA on synthetic dataset 5. Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.
4.1  An indoor wireless environment example.
4.2  Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods. Different colors denote different signal strength values.
4.3  Comparison with dimensionality reduction methods.
4.4  Comparison of TCA, SSTCA and the various baseline methods in the inductive setting on the WiFi data.
4.5  Comparison with localization methods that do not perform domain adaption.
4.6  Comparison with MMDE in the transductive setting on the WiFi data.
4.7  Training time with varying amount of unlabeled data for training.
4.8  Sensitivity analysis of the TCA / SSTCA parameters on the WiFi data.
5.1  Sensitivity analysis of the TCA / SSTCA parameters on the text data.
6.1  A bipartite graph example of domain-specific and domain-independent features.
6.2  Comparison results (unit: %) on two datasets.
6.3  Study on varying numbers of domain-independent features of SFA.
6.4  Model parameter sensitivity study of k on two datasets.
6.5  Model parameter sensitivity study of γ on two datasets.

LIST OF TABLES

1.1  Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. “+” denotes positive sentiment, and “-” denotes negative sentiment.
2.1  Relationship between traditional machine learning and transfer learning settings.
2.2  Different settings of transfer learning.
2.3  Different approaches to transfer learning.
2.4  Different approaches in different settings.
3.1  Summary of dimensionality reduction methods based on Hilbert space embedding of distributions.
4.1  An example of signal vectors (unit: dBm).
5.1  Summary of the six data sets constructed from the 20-Newsgroups data.
5.2  Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation).
5.3  Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation).
6.1  Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. Italic words are some domain-independent words, which occur frequently in both domains. “+” denotes positive sentiment, and “-” denotes negative sentiment.
6.2  Bag-of-words representations of electronics (E) and video games (V) reviews. Only domain-specific features are considered.
6.3  A co-occurrence matrix of domain-specific and domain-independent words. “.” denotes all other words.
6.4  Ideal representations of domain-specific words.
6.5  Summary of datasets for evaluation.
6.6  Experiments with different domain-independent feature selection methods. Numbers in the table are accuracies in percentage.

FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS

by JIALIN PAN

Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

ABSTRACT

Transfer learning is a new machine learning and data mining framework that allows the training and test data to come from different distributions and/or feature spaces. We can find many novel applications of machine learning and data mining where transfer learning is helpful, especially when we have limited labeled data in our domain of interest. In this thesis, we first survey different settings and approaches of transfer learning and give a big picture of the field. We then focus on latent space learning for transfer learning, which aims at discovering a “good” common feature space across domains, such that knowledge transfer becomes possible.

In our study, we propose a novel dimensionality reduction framework for transfer learning, which tries to reduce the distance between different domains while preserving data properties as much as possible. This framework is general for many transfer learning problems when domain knowledge is unavailable. Based on this framework, we propose three effective solutions to learn the latent space for transfer learning. We apply these methods to two diverse applications, cross-domain WiFi localization and cross-domain text classification, and achieve promising results.

Furthermore, for a specific application area, such as sentiment classification, where domain knowledge is available for encoding into transfer learning methods, we propose a spectral feature alignment algorithm for cross-domain learning. In this algorithm, we try to align domain-specific features from different domains by using some domain-independent features as a bridge. Experimental results show that this method outperforms a state-of-the-art algorithm on two real-world datasets for cross-domain sentiment classification.

CHAPTER 1

INTRODUCTION

Supervised data mining and machine learning technologies have already been widely studied and applied to many knowledge engineering areas. However, most traditional supervised algorithms work well only under a common assumption: the training and test data are drawn from the same feature space and the same distribution. Furthermore, the performance of these algorithms heavily relies on collecting high-quality and sufficient labeled training data to train a statistical or computational model to make predictions on the future data [127, 77, 131, 168]. Nevertheless, in many real-world scenarios, labeled training data are in short supply or can only be obtained with expensive cost. This problem has become a major bottleneck of making machine learning and data mining methods more applicable in practice.

In the last decade, semi-supervised learning [233, 34, 27, 90] techniques have been proposed to address the problem that the labeled training data may be too few to build a good classifier, by making use of a large amount of unlabeled data to discover a powerful structure together with a small amount of labeled data to train models. However, most semi-supervised methods require that the training data, including labeled and unlabeled data, and the test data are both from the same domain of interest, which implicitly assumes the training and test data are still represented in the same feature space and drawn from the same data distribution. Active learning, which is another branch in machine learning for reducing the annotation effort of supervised learning, tries to design an active learner to pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator). The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns [101, 189]. However, most active learning methods assume that there is a budget for the active learner to pose queries in the domain of interest. In some real-world applications, the budget may be quite limited, where active learning methods may not work in learning accurate classifiers in the domain of interest.

Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and testing to be different. The main idea behind transfer learning is to borrow labeled data or knowledge extracted from some related domains to help a machine learning algorithm to achieve greater performance in the domain of interest [183]. Thus, compared to semi-supervised and active learning, transfer learning can be referred to as a different strategy for learning models with minimal human supervision: instead of exploring unlabeled data to train a precise model, it reuses labeled data or knowledge from related domains. In the real world, we can observe many examples of transfer learning.

For example, we may find that learning to recognize apples might help to recognize pears. Similarly, learning to play the electronic organ may help facilitate learning the piano. Many examples in knowledge engineering can be found where transfer learning can truly be beneficial. One example is Web document classification, where our goal is to classify a given Web document into several predefined categories. As an example in the area of Web-document classification (see [49]), the labeled examples may be the university Web pages that are associated with category information obtained through previous manual-labeling efforts. For a classification task on a newly created Web site where the data features or data distributions may be different, there may be a lack of labeled training data. As a result, we may not be able to directly apply the Web-page classifiers learned on the university Web site to the new Web site. In such cases, it would be helpful if we could transfer the classification knowledge into the new domain.

The need for transfer learning may also arise when the data can be easily outdated. In this case, the labeled data obtained in one time period may not follow the same distribution in a later time period. Furthermore, in many engineering applications, it is expensive or impossible to collect sufficient training data to train a model for use in each domain of interest, and transfer learning is also desirable when the features between domains change. For example, in indoor WiFi localization problems, which aim to detect a user's current location based on previously collected WiFi data, it is very expensive to calibrate WiFi data for building localization models in a large-scale environment, because a user needs to label a large collection of WiFi signal data at each location. However, the WiFi signal-strength values may be a function of time, device or other dynamic factors. As shown in Figure 4.2, values of received signal strength (RSS) may differ across time periods and mobile devices. As a result, a model trained in one time period or on one device may cause the performance for location estimation in another time period or on another device to be reduced. To reduce the re-calibration effort, we might wish to adapt the localization model trained in one time period (the source domain) for a new time period (the target domain), or to adapt the localization model trained on a mobile device (the source domain) for a new mobile device (the target domain), as introduced in [142].

As a third example, consider the problem of sentiment classification, where our task is to automatically classify the reviews on a product, such as a brand of camera, into polarity categories (e.g., positive or negative). For this task, in the literature, supervised learning algorithms [146] have proven to be promising and widely used in sentiment classification. However, these methods are domain dependent.

The reason is that users may use domain-specific words to express sentiment in different domains. For example, in the electronics domain, we may use words like “compact”, “sharp” to express our positive sentiment and use “blurry” to express our negative sentiment. While in the video game domain, words like “hooked”, “realistic” indicate positive opinion and the word “boring” indicates negative opinion. Table 1.1 shows several user review sentences from two domains: electronics and video games. Due to the mismatch among domain-specific words, a sentiment classifier trained in one domain may not work well when directly applied to other domains. Thus, cross-domain sentiment classification algorithms are highly desirable to reduce domain dependency and manually labeling cost by transferring knowledge from related domains to the domain of interest [25].

Figure 1.1: Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods and received by different mobile devices: (a) WiFi RSS received by device A in T1; (b) WiFi RSS received by device A in T2; (c) WiFi RSS received by device B in T1; (d) WiFi RSS received by device B in T2. Different colors denote different signal strength values.

I am extremely unI will never buy HP again. in transfer learning. brute-force transfer may be unsuccessful. picture. transductive transfer and unsupervised transfer. It is really nice and sharp. Likewise. + + electronics Compact. which corresponds to the “how to transfer” issue. very good picture quality. easy to operate. which is ﬁrst described in our survey article [141] and will be introduced in detail in Chapter 2.1 The Contribution of This Thesis Generally speaking.Table 1. (2) How to transfer. Note that in this setting. transferring skills should be done. The game is so boring. (3) When to transfer [141]. We played this and were hooked. learning algorithms need to be developed to transfer the knowledge. In this thesis. Some knowledge is speciﬁc for individual domains or tasks. where we are given a lot of labeled data in a source domain and some unlabeled data in a target domain. and “-” denotes negative sentiment. Boldfaces are domain-speciﬁc words. I am very much hooked on this game. We leave the issue on how to avoid 4 . It is also quite blurry in very dark settings. “What to transfer” asks which part of knowledge can be transferred across domains or tasks. we have the following three main research issues: (1) What to transfer. happy and will probably never buy UbiSoft again. we are interested in knowing in which situations. and some knowledge may be common between different domains such that they may help improve performance for the target domain or task. which are much more frequent in one domain than in the other one. our goal is to learn an accurate model for use in the target domain. In some situations. which will be introduced in detail in Chapter 2 as well. “+” denotes positive sentiment. no labeled data in the target domain are available for training. “When to transfer” asks in which situations. knowledge should not be transferred. After discovering which knowledge can be transferred. 
Furthermore.1: Cross-domain sentiment classiﬁcation examples: reviews of electronics and video games products. 1. looks sharp! video games A very good game! It is action packed and full of excitement. In this thesis. we focus on the transductive transfer learning setting. I purchased this unit from Circuit City and Very realistic shooting action and good I was very excited about the quality of the plots. it may even hurt the performance of learning in the target domain. we focus on “What to transfer” and “How to transfer” by implicitly assuming that the source and target domains are related to each other. a situation which is often referred to as negative transfer. transfer learning can be categorized into three settings: inductive transfer. In the worst case. when the source domain and target domain are not related to each other.
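The transductive setting described above can be made concrete with a toy sketch. Everything below is hypothetical: the data are synthetic Gaussians, a nearest-centroid rule stands in for a real classifier, and per-domain mean-centering is only a crude stand-in for the latent-space methods developed later in this thesis. It shows a source-trained model degrading on a shifted, unlabeled target domain, and a simple shared representation recovering most of the accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Labeled source domain: two classes separated along the first feature.
Xs = np.vstack([rng.normal([-2, 0], 1, (100, 2)), rng.normal([2, 0], 1, (100, 2))])
ys = np.array([0] * 100 + [1] * 100)

# Target domain: same task, but the feature distribution has shifted.
# Its labels are held out and used only to evaluate accuracy.
shift = np.array([4.0, 0.0])
Xt = np.vstack([rng.normal([-2, 0], 1, (100, 2)), rng.normal([2, 0], 1, (100, 2))]) + shift
yt = np.array([0] * 100 + [1] * 100)

def nearest_centroid_predict(X, centroids):
    # Assign each point to the class with the closest centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

# Train on the source domain only, then apply directly to the target.
centroids = np.vstack([Xs[ys == c].mean(0) for c in (0, 1)])
acc_no_transfer = (nearest_centroid_predict(Xt, centroids) == yt).mean()

# A crude "feature-based transfer" step: map both domains into a shared
# space by centering each domain at its own mean (no target labels needed).
Xs_c = Xs - Xs.mean(0)
Xt_c = Xt - Xt.mean(0)
centroids_c = np.vstack([Xs_c[ys == c].mean(0) for c in (0, 1)])
acc_transfer = (nearest_centroid_predict(Xt_c, centroids_c) == yt).mean()
```

On this toy data the unadapted classifier is near chance level, while the shared-space version recovers almost all of the accuracy, which is exactly the kind of gap the latent-space methods in this thesis aim to close with far weaker assumptions about the shift.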

We leave the issue of how to avoid negative transfer to our future work.

For “What to transfer”, we propose to discover a latent feature space for transfer learning, where the distance between domains can be reduced and the important information of the original data can be preserved simultaneously. Thus, the latent space can be treated as a bridge across domains to make knowledge transfer possible and successful. For “How to transfer”, we propose two embedding learning frameworks to learn the latent space, based on two different situations: (1) domain knowledge is hidden or hard to capture, and (2) domain knowledge can be observed or is easy to encode in embedding learning.

In most application areas, such as text classification or WiFi localization, the domain knowledge is hidden. For example, text data may be controlled by some latent topics, and WiFi data may be controlled by some hidden factors, such as the structure of a building. In this case, we propose a novel and general dimensionality reduction framework for transfer learning. Our framework aims to learn a latent space underlying different domains, such that the distance between data distributions can be dramatically reduced and the original data properties, such as variance and local geometric structure, can be preserved as much as possible. Thus, when data from different domains are projected onto the latent space, standard machine learning and data mining methods can be applied directly in the latent space to train models for making predictions on the target domain data. Based on this framework, we propose three different algorithms to learn the latent space: Maximum Mean Discrepancy Embedding (MMDE) [135], Transfer Component Analysis (TCA) [139] and Semi-supervised Transfer Component Analysis (SSTCA) [140]. More specifically, in MMDE we translate latent space learning for transfer learning into a non-parametric kernel matrix learning problem. The resultant kernel in a free form may be more precise for transfer learning, but suffers from expensive computational cost. In contrast, in TCA and SSTCA, we propose to learn parametric kernel based embeddings for transfer learning instead. The main difference between TCA and SSTCA is that TCA is an unsupervised feature extraction method, while SSTCA is a semi-supervised feature extraction method. We apply these three algorithms to two diverse application areas: wireless sensor networks and Web mining.

In contrast to the general framework for transfer learning without domain knowledge, in some application areas, such as sentiment classification, some domain knowledge can be observed and used for learning the latent space across domains. For example, in sentiment classification, though users may use some domain-specific words as shown in Table 1.1, they may use some domain-independent sentiment words, such as “good”, “never buy”, etc. Meanwhile, some domain-specific and domain-independent words may co-occur in reviews frequently, which means there may be a correlation between these words. This observation motivates us to propose a spectral feature clustering framework [137] to align domain-specific words from different domains in a latent space, by modeling the correlation between the domain-independent and domain-specific words in a bipartite graph and using the domain-independent features as a bridge for cross-domain sentiment classification.

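The dimensionality reduction framework above hinges on measuring, and then reducing, the distance between the source and target distributions; the distance used throughout this thesis is the Maximum Mean Discrepancy (MMD). As a rough, self-contained illustration (the Gaussian samples, the RBF kernel and its bandwidth below are arbitrary choices for this sketch, not the thesis's experimental setup), the biased empirical estimate of the squared MMD between two samples can be computed as follows.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2).
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased empirical estimate of the squared MMD between samples Xs and Xt:
    # mean k(xs, xs') + mean k(xt, xt') - 2 * mean k(xs, xt).
    Kss = rbf_kernel(Xs, Xs, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    return Kss.mean() + Ktt.mean() - 2.0 * Kst.mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 2))        # source sample
tgt_same = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution
tgt_shift = rng.normal(2.0, 1.0, size=(200, 2))  # shifted distribution
```

Under this estimate, the shifted target sample is much farther from the source than the identically distributed one; MMDE, TCA and SSTCA all search for a latent space in which this quantity becomes small while the data's structure is preserved.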
In this thesis, we study the problem of feature-based transfer learning and its real-world applications, such as WiFi localization, text classification and sentiment classification. Based on the different situations of whether domain knowledge is available, we propose two feature space learning frameworks for transfer learning. The first framework is focused on the situation where domain knowledge is hidden and hard to encode in embedding learning. The other framework is focused on the situation where domain knowledge is explicit and easy to encode in feature space learning. The main contributions of this thesis can be summarized as follows.

• We give a comprehensive survey on transfer learning, where we give some definitions of transfer learning, summarize transfer learning into three settings, categorize transfer learning approaches into four contexts, analyze the relationship between transfer learning and other related areas, discuss some interesting research issues in transfer learning and introduce some applications of transfer learning. Other researchers may get a big picture of transfer learning by reading the survey.

• We propose a general dimensionality reduction framework for transfer learning without any domain knowledge. Based on the framework, we propose three solutions to learn the latent space for transfer learning. Furthermore, we apply them to solve the WiFi localization and text classification problems and achieve promising results.

• We propose a specific latent space learning method for sentiment classification, which encodes the domain knowledge in a spectral feature alignment framework. The proposed method outperforms a state-of-the-art cross-domain method in the field of sentiment classification.

1.2 The Organization of This Thesis

The organization of this thesis is shown in Figure 1.2. In Chapter 2, we survey the field of transfer learning, where we summarize different transfer learning settings and approaches and discuss the relationship between transfer learning and other related areas. Note that there has been a large amount of work on transfer learning for reinforcement learning in the machine learning literature (e.g., a recent survey article [182]). However, in this thesis, we only focus on transfer learning for classification and regression tasks that are related more closely to machine learning and data mining tasks. In Chapter 3, we present our proposed general dimensionality reduction framework and the three proposed algorithms, Maximum Mean Discrepancy Embedding (MMDE) (Chapter 3.4), Transfer Component Analysis (TCA) (Chapter 3.5) and Semi-supervised Transfer Component Analysis (SSTCA) (Chapter 3.6). Then we apply these three methods to two diverse applications: WiFi localization (Chapter 4) and text classification (Chapter 5).

In Chapter 6, we present a spectral feature alignment (SFA) algorithm for sentiment classification across domains. Finally, we conclude this thesis and discuss some thoughts on future work in Chapter 7.

Figure 1.2: The organization of this thesis.
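The spectral feature alignment idea from Section 1.1 can be illustrated with a toy sketch. The word lists and co-occurrence counts below are invented for illustration, and the plain SVD-based bipartite embedding here is only a simplified stand-in for the actual SFA algorithm presented in Chapter 6.

```python
import numpy as np

# Toy co-occurrence counts: rows are domain-specific words (from two
# different domains), columns are domain-independent words (the "bridge").
specific = ["sharp", "blurry", "hooked", "boring"]
independent = ["good", "never_buy"]
M = np.array([
    [8, 1],   # "sharp"  (electronics) co-occurs mostly with "good"
    [1, 9],   # "blurry" (electronics) co-occurs mostly with "never_buy"
    [7, 2],   # "hooked" (video games) co-occurs mostly with "good"
    [2, 8],   # "boring" (video games) co-occurs mostly with "never_buy"
], dtype=float)

# Normalize the bipartite affinity matrix, as in spectral co-clustering:
# A_n = D1^{-1/2} M D2^{-1/2}, then embed the rows with the top
# left singular vectors.
d1 = np.sqrt(M.sum(axis=1))
d2 = np.sqrt(M.sum(axis=0))
An = M / d1[:, None] / d2[None, :]
U, s, Vt = np.linalg.svd(An)
embedding = U[:, :2]  # low-dimensional representation of specific words

def closest(i):
    # Index of the nearest other word in the spectral embedding.
    d = ((embedding - embedding[i]) ** 2).sum(axis=1)
    d[i] = np.inf
    return int(np.argmin(d))
```

In the embedding, words that share sentiment polarity end up close together even though they come from different domains ("sharp" aligns with "hooked", "blurry" with "boring"), which is what lets a classifier trained on one domain generalize to the other.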

CHAPTER 2

A SURVEY ON TRANSFER LEARNING

In this chapter, we give a comprehensive survey of transfer learning for classification, regression and clustering developed in machine learning and data mining areas. In fact, this chapter originated as our survey article [141], which is the first survey in the field of transfer learning. Compared to the previous survey article, in this chapter, we add some up-to-date materials on transfer learning, including both methodologies and applications.

2.1 Overview

2.1.1 A Brief History of Transfer Learning

The study of transfer learning is motivated by the fact that people can intelligently apply knowledge learned previously to solve new problems faster or with better solutions [61]. The fundamental motivation for transfer learning in the field of machine learning was discussed in a NIPS-95 workshop on "Learning to Learn"¹, which focused on the need for lifelong machine-learning methods that retain and reuse previously learned knowledge. Research on transfer learning has attracted more and more attention since 1995 under different names: learning to learn, life-long learning, knowledge transfer, inductive transfer, multi-task learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, meta learning, and incremental/cumulative learning [183]. Among these, a closely related learning technique to transfer learning is the multi-task learning framework [31], which tries to learn multiple tasks simultaneously even when they are different. A typical approach for multi-task learning is to uncover the common (latent) features that can benefit each individual task.

In 2005, the Broad Agency Announcement (BAA) 05-29 of Defense Advanced Research Projects Agency (DARPA)'s Information Processing Technology Office (IPTO)² gave a new mission of transfer learning: the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks. In this definition, transfer learning aims to extract the knowledge from one or more source tasks and apply the knowledge to a target task. In contrast to multi-task learning, rather than learning all of the source and target tasks simultaneously, transfer learning cares most about the target task.

¹ http://socrates.acadiau.ca/courses/comp/dsilver/NIPS95_LTL/transfer.workshop.1995.html
² http://www.darpa.mil/ipto/programs/tl/tl.asp

As we can see. for example). International World Wide Web Conference (WWW). Joint Conference of the 47th Annual Meeting of the Association of Computational Linguistics (ACL). for example) and applications (Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).of the source and target tasks simultaneously. ACM international conference on Multimedia (MM). IEEE International Conference on Data Mining (ICDM) and European Conference on Knowledge Discovery in Databases (PKDD). transfer learning cares most about the target task. while transfer learning techniques try to transfer the knowledge from some previous tasks to a target task when the latter has fewer high-quality training data. Annual Conference on Neural Information Processing Systems (NIPS) and European Conference on Machine Learning (ECML) for example).1 shows the difference between the learning processes of traditional and transfer learning techniques.1: Different learning processes between traditional machine learning and transfer learning Today. Annual International Conference on Research in Computational Molecular Biology (RECOMB) and Annual International 9 . traditional machine learning techniques try to learn each task from scratch. Figure 2. artiﬁcial intelligence (AAAI Conference on Artiﬁcial Intelligence (AAAI) and International Joint Conference on Artiﬁcial Intelligence (IJCAI). transfer learning methods appear in several top venues. The roles of the source and target tasks are no longer symmetric in transfer learning. International Conference on Computational Linguistics (COLING). European Conference on Computer Vision (ECCV). most notably in data mining (ACM International Conference on Knowledge Discovery and Data Mining (KDD). IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). machine learning (International Conference on Machine Learning (ICML). Conference on Empirical Methods in Natural Language Processing (EMNLP). 
(a) Traditional Machine Learning (b) Transfer Learning Figure 2. ACM International Conference on Ubiquitous Computing (UBICOMP). IEEE International Conference on Computer Vision (ICCV).

2.1.2 Notations and Definitions

In this section, we first describe the notations and definitions used in this chapter. Before we give different categorizations of transfer learning, we give the definitions of a "domain" and a "task", respectively.

A domain D consists of two components: a feature space X and a marginal probability distribution P(X), where X = {x1, . . . , xn} ∈ X. For example, if our learning task is document classification, and each term is taken as a binary feature, then X is the space of all term vectors, xi is the ith term vector corresponding to some document, and X is a particular learning sample.

Given a specific domain, D = {X, P(X)}, a task T consists of two components: a label space Y and an objective predictive function f(·) (denoted by T = {Y, f(·)}), which is not observed but can be learned from the training data, which consist of pairs {xi, yi}, where xi ∈ X and yi ∈ Y. The function f(·) can be used to predict the corresponding label, f(x), of a new instance x. From a probabilistic viewpoint, f(x) can be written as P(y|x). In our document classification example, Y is the set of all labels, which is {True, False} for a binary classification task, and yi is "True" or "False".

For simplicity, in this chapter we only consider the case where there is one source domain DS and one target domain DT, as this is by far the most popular setting of the research works in the literature. The issue of transfer learning from multiple source domains will be discussed in Section 2.6. More specifically, we denote DS = {(xS1, yS1), . . . , (xSnS, ySnS)} the source domain data, where xSi ∈ XS is the data instance and ySi ∈ YS is the corresponding class label. In our document classification example, DS can be a set of term vectors together with their associated true or false class labels. Similarly, we denote the target domain data as DT = {(xT1, yT1), . . . , (xTnT, yTnT)}, where the input xTi is in XT and yTi ∈ YT is the corresponding output. In most cases, 0 ≤ nT ≪ nS.

We now give a unified definition of transfer learning.

Definition 1. (Transfer Learning) Given a source domain DS and learning task TS, a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT, or TS ≠ TT.

In the above definition, a domain is a pair D = {X, P(X)}. Thus the condition DS ≠ DT implies that either XS ≠ XT or PS(X) ≠ PT(X). If two domains are different, then they may have different feature spaces or different marginal probability distributions.

3 We summarize a list of transfer learning papers published in these few years and a list of workshops that are related to transfer learning at the following URL for reference: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm

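To make the notation above concrete, the domain and task definitions can be written down as plain data structures. This is only an illustrative sketch: the class and field names are our own and are not part of the formal definitions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Domain:
    """D = {X, P(X)}: a feature space plus a sample X drawn from P(X)."""
    feature_space: List[str]            # e.g. the vocabulary of term features
    samples: List[Tuple[float, ...]]    # term vectors drawn from P(X)

@dataclass
class Task:
    """T = {Y, f(.)}: a label space plus an objective predictive function."""
    label_space: List[str]
    predict: Callable[[Tuple[float, ...]], str]

# A source domain with many documents and a target domain with few,
# mirroring the usual situation 0 <= nT << nS.
D_S = Domain(["laptop", "desktop"], [(1.0, 0.0)] * 100)
D_T = Domain(["laptop", "desktop"], [(0.0, 1.0)] * 5)
n_S, n_T = len(D_S.samples), len(D_T.samples)
assert 0 <= n_T < n_S
```
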
More specifically, the condition DS ≠ DT implies that either (1) the feature spaces between the domains are different, i.e. XS ≠ XT, or (2) the feature spaces between the domains are the same but the marginal probability distributions between the domain data are different, i.e. P(XS) ≠ P(XT), where XSi ∈ XS and XTi ∈ XT. In our document classification example, case (1) corresponds to when the two sets of documents are described in different languages, and case (2) may correspond to when the source domain documents and the target domain documents focus on different topics.

Similarly, when the learning tasks TS and TT are different, a task is defined as a pair T = {Y, P(Y|X)}. Thus the condition TS ≠ TT implies that either (1) the label spaces between the domains are different, i.e. YS ≠ YT, or (2) the conditional probability distributions between the domains are different, i.e. P(YS|XS) ≠ P(YT|XT), where YSi ∈ YS and YTi ∈ YT. In our document classification example, case (1) corresponds to the situation where the source domain has binary document classes, whereas the target domain has ten classes to classify the documents to. Case (2) corresponds to the situation where the document classes are defined subjectively, as with tagging: different users may define different tags for the same document, resulting in P(Y|X) changing across different users.

In addition, when there exists some relationship, explicit or implicit, between the two domains or tasks, we say that the source and target domains or tasks are related. When the target and source domains are the same, i.e. DS = DT, and their learning tasks are the same, i.e. TS = TT, the learning problem becomes a traditional machine learning problem. When the domains or tasks are different, the learning tasks may still be related to each other. As an example, the task of classifying documents into the categories {book, laptop} may be related to the task of classifying documents into the categories {book, desktop}. This is because, from a semantic point of view, the terms "laptop" and "desktop" are close to each other. Note that it is hard to define the term "relationship" mathematically; how to measure the relatedness between domains or tasks is an important research issue in transfer learning, which will be discussed in Section 2.5. In fact, most transfer learning methods introduced in the following sections assume that the source and target domains or tasks are related.

2.1.3 A Categorization of Transfer Learning Techniques

In transfer learning, we have the following three main research issues: (1) what to transfer; (2) how to transfer; and (3) when to transfer.

"What to transfer" asks which part of knowledge can be transferred across domains or tasks. Some knowledge is specific for individual domains or tasks, and some knowledge may be common between different domains such that it may help improve performance for the target domain or task.

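The two ways in which the condition DS ≠ DT can hold are easy to check mechanically on finite samples. A minimal sketch (the helper name and the returned string labels are our own):

```python
from collections import Counter

def domains_differ(X_S, X_T, feats_S, feats_T):
    """Report which condition of D_S != D_T holds on finite samples:
    different feature spaces, or different empirical marginals P(X)."""
    if set(feats_S) != set(feats_T):
        return "feature spaces differ"          # X_S != X_T
    if Counter(X_S) != Counter(X_T):
        return "marginal distributions differ"  # P_S(X) != P_T(X)
    return "same domain"

# Same vocabulary, but the topic proportions shift between document sets:
feats = ["laptop", "desktop"]
X_S = ["laptop"] * 8 + ["desktop"] * 2
X_T = ["laptop"] * 2 + ["desktop"] * 8
print(domains_differ(X_S, X_T, feats, feats))   # prints: marginal distributions differ
```
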
After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge, which corresponds to the "how to transfer" issue.

"When to transfer" asks in which situations transferring skills should be done. Likewise, we are interested in knowing in which situations knowledge should not be transferred. In some situations, when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer. Most current work on transfer learning focuses on "what to transfer" and "how to transfer", by implicitly assuming that the source and target domains are related to each other. However, how to avoid negative transfer is an important open issue that is attracting more and more attention.

Based on the definition of transfer learning, we summarize the relationship between traditional machine learning and various transfer learning settings in Table 2.1, where we categorize transfer learning under three settings, inductive transfer learning, transductive transfer learning and unsupervised transfer learning, based on different situations between the source and target domains and tasks.

Table 2.1: Relationship between traditional machine learning and transfer learning settings

  Learning Settings                                Source & Target Domains   Source & Target Tasks
  Traditional Machine Learning                     the same                  the same
  Transfer Learning: Inductive Transfer Learning   the same                  different but related
  Transfer Learning: Unsupervised Transfer Learning  different but related   different but related
  Transfer Learning: Transductive Transfer Learning  different but related   the same

1. In the inductive transfer learning setting, the target task is different from the source task, no matter whether the source and target domains are the same or not. In this case, some labeled data in the target domain are required to induce an objective predictive model fT(·) for use in the target domain. In addition, according to different situations of labeled and unlabeled data in the source domain, we can further categorize the inductive transfer learning setting into two cases:

(1.1) A lot of labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the multi-task learning setting [31]. However, the inductive transfer learning setting only aims at achieving high performance in the target task by transferring knowledge from the source task, while multi-task learning tries to learn the target and source tasks simultaneously.

(1.2) No labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the self-taught learning setting, which was first proposed by Raina et al. [150]. In the self-taught learning setting, the label spaces between the source and target domains may be different, which implies that the side information of the source domain cannot be used directly. Thus, it is similar to the inductive transfer learning setting where the labeled data in the source domain are unavailable.

2. In the transductive transfer learning setting, the source and target tasks are the same, while the source and target domains are different. In this situation, no labeled data in the target domain are available while a lot of labeled data in the source domain are available. In addition, according to different situations between the source and target domains, we can further categorize the transductive transfer learning setting into two cases:

(2.1) The feature spaces between the source and target domains are different, XS ≠ XT.

(2.2) The feature spaces between domains are the same, XS = XT, but the marginal probability distributions of the input data are different, P(XS) ≠ P(XT). The latter case of the transductive transfer learning setting is related to domain adaptation for knowledge transfer in Natural Language Processing (NLP) [23, 46, 48, 85] and to sample selection bias [219] or co-variate shift [170], whose assumptions are similar.

3. Finally, in the unsupervised transfer learning setting, similar to the inductive transfer learning setting, the target task is different from but related to the source task. However, the unsupervised transfer learning setting focuses on solving unsupervised learning tasks in the target domain, such as clustering, dimensionality reduction and density estimation [50, 51, 197]. In this case, there are no labeled data available in either the source or the target domain in training.

The relationship between the different settings of transfer learning and the related areas is summarized in Table 2.2 and Figure 2.2.

Table 2.2: Different settings of transfer learning

  Transfer Learning Settings       Related Areas              Source Domain Labels   Target Domain Labels   Tasks
  Inductive Transfer Learning      Multi-task Learning        Available              Available              Regression, Classification
                                   Self-taught Learning       Unavailable            Available              Regression, Classification
  Transductive Transfer Learning   Domain Adaptation, Sample
                                   Selection Bias, Co-variate
                                   Shift                      Available              Unavailable            Regression, Classification
  Unsupervised Transfer Learning                              Unavailable            Unavailable            Clustering, Dimensionality Reduction

Figure 2.2: An overview of the different settings of transfer learning

Approaches to transfer learning in the above three different settings can be summarized into four contexts based on "what to transfer". Table 2.3 shows these four cases with brief descriptions.

The first context can be referred to as the instance-based transfer learning (or instance-transfer) approach (see, e.g., [21, 22, 26, 47, 49, 82, 83, 87, 107, 112, 219, 231]), which assumes that certain parts of the data in the source domain can be reused for learning in the target domain by re-weighting. Instance re-weighting and importance sampling are two major techniques in this context.

A second context can be referred to as the feature-representation-transfer approach (see, e.g., [4, 6, 8, 20, 31, 37, 48, 50, 63, 84, 99, 147, 148, 149, 150, 180]). The intuitive idea behind this case is to learn a "good" feature representation for the target domain. In this case, the knowledge used to transfer across domains is encoded into the learned feature representation. With the new feature representation, the performance of the target task is expected to improve significantly.

A third context can be referred to as the parameter-transfer approach (see, e.g., [28, 30, 97, 165]), which assumes that the source tasks and the target tasks share some parameters or prior distributions of the hyper-parameters of the models. The transferred knowledge is encoded into the shared parameters or priors. Thus, by discovering the shared parameters or priors, knowledge can be transferred across tasks.

Finally, the last context can be referred to as the relational-knowledge-transfer problem [124], which deals with transfer learning for relational domains. The basic assumption behind this context is that some relationships among the data in the source and target domains are similar. Thus, the knowledge to be transferred is the relationship among the data. Recently, statistical relational learning techniques dominate this context [52, 125].

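The label-availability view of the three settings amounts to a small decision rule. A sketch (the function name is ours, and it covers only the coarse cases of the categorization):

```python
def transfer_setting(source_labeled, target_labeled, same_task):
    """Coarse router over the three transfer learning settings, keyed on
    label availability and whether the source and target tasks coincide."""
    if not source_labeled and not target_labeled:
        return "unsupervised transfer learning"
    if source_labeled and not target_labeled and same_task:
        return "transductive transfer learning"
    if target_labeled:
        return "inductive transfer learning"
    return "undetermined"

assert transfer_setting(True, False, True) == "transductive transfer learning"
```
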
Table 2.3: Different approaches to transfer learning

  Instance-transfer: Re-weight some labeled data in the source domain for use in the target domain (see, e.g., [21, 22, 26, 47, 49, 82, 83, 87, 107, 112, 219, 231]).
  Feature-representation-transfer: Find a "good" feature representation that reduces the difference between the source and the target domains and the error of classification and regression models (see, e.g., [4, 6, 8, 20, 31, 37, 48, 50, 63, 84, 99, 147, 148, 149, 150, 180]).
  Parameter-transfer: Discover shared parameters or priors between the source domain and target domain models, which can benefit transfer learning (see, e.g., [28, 30, 97, 165]).
  Relational-knowledge-transfer: Build a mapping of relational knowledge between the source domain and the target domain. Both domains are relational domains, and the i.i.d. assumption is relaxed in each domain (see, e.g., [52, 124, 125]).

Table 2.4 shows the cases where the different approaches are used for each transfer learning setting. We can see that the inductive transfer learning setting has been studied in many research works, while the unsupervised transfer learning setting is a relatively new research topic that has only been studied in the context of the feature-representation-transfer case. In addition, the feature-representation-transfer problem, which we discuss in detail in the following chapters, has been proposed for all three settings of transfer learning. However, the parameter-transfer and the relational-knowledge-transfer approaches have only been studied in the inductive transfer learning setting.

Table 2.4: Different approaches used in different settings

  Approaches                        Inductive Transfer   Transductive Transfer   Unsupervised Transfer
  Instance-transfer                 √                    √
  Feature-representation-transfer   √                    √                       √
  Parameter-transfer                √
  Relational-knowledge-transfer     √

2.2 Inductive Transfer Learning

Definition 2. (Inductive Transfer Learning) Given a source domain DS and a learning task TS, a target domain DT and a learning task TT, inductive transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where TS ≠ TT. In this setting, a few labeled data in the target domain are required as the training data to induce the target predictive function.

Based on the above definition of the inductive transfer learning setting, this setting has two cases: (1) labeled data in the source domain are available; and (2) labeled data in the source domain are unavailable while unlabeled data in the source domain are available. Most transfer learning approaches in this setting focus on the former case.

2.2.1 Transferring Knowledge of Instances

The instance-transfer approach to the inductive transfer learning setting is intuitively appealing: although the source domain data cannot be reused directly, there are certain parts of the data that can still be reused together with a few labeled data in the target domain. Dai et al. [49] proposed a boosting algorithm, TrAdaBoost, which is an extension of the AdaBoost [66] algorithm, to address classification problems in the inductive transfer learning setting. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels, but that the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that, due to the difference in distributions between the source and the target domains, some of the source domain data may be useful in learning for the target domain, but some of them may not be and could even be harmful. TrAdaBoost attempts to iteratively re-weight the source domain data to reduce the effect of the "bad" source data while encouraging the "good" source data to contribute more to the target domain. In each round of iteration, TrAdaBoost trains the base classifier on the weighted source and target data. The error is only calculated on the target data. Furthermore, TrAdaBoost uses the same strategy as AdaBoost to update the incorrectly classified examples in the target domain, while using a different strategy from AdaBoost to update the incorrectly classified source examples in the source domain. Theoretical analysis of TrAdaBoost is also presented in [49].

More recently, Pardoe and Stone [147] presented a two-stage algorithm, TrAdaBoost.R2, to extend the TrAdaBoost algorithm to regression problems. The idea is to apply the techniques which have been proposed for modifying AdaBoost for regression [56] to TrAdaBoost. Furthermore, to avoid the over-fitting problem in TrAdaBoost, Pardoe and Stone proposed to adjust the weights of the data in two stages. In the first stage, only the weights of the source domain data are adjusted downwards gradually until reaching a certain point. Then, in the second stage, the weights of the source domain data are fixed, while the weights of the target domain data are updated as normal in the TrAdaBoost algorithm.

Besides modifying boosting algorithms for transfer learning, Jiang and Zhai [87] proposed a heuristic method to remove "misleading" training examples from the source domain based on the difference between the conditional probabilities P(yT|xT) and P(yS|xS).

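The TrAdaBoost loop described above can be sketched in a few lines: the error is measured on the target portion only, misclassified source points are only multiplied downwards, and target points follow the usual AdaBoost-style update. This is a simplified reading of the algorithm with a trivial threshold stump as the base learner, not a faithful reimplementation of [49].

```python
import numpy as np

def stump(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, sign) with
    the lowest weighted 0/1 error; returns a predict function."""
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (1, -1):
                pred = (s * (X[:, f] - t) >= 0).astype(int)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best, best_err = (f, t, s), err
    f, t, s = best
    return lambda Z: (s * (Z[:, f] - t) >= 0).astype(int)

def tradaboost(Xs, ys, Xt, yt, iters=10):
    """Simplified TrAdaBoost loop: train on weighted source+target data,
    compute the error on the target data only, shrink the weights of
    misclassified source points and grow those of misclassified target
    points."""
    nS = len(ys)
    X, y = np.vstack([Xs, Xt]), np.concatenate([ys, yt])
    w = np.ones(len(y)) / len(y)
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(nS) / iters))
    learners, betas = [], []
    for _ in range(iters):
        w = w / w.sum()
        h = stump(X, y, w)
        pred = h(X)
        eps = np.sum(w[nS:] * (pred[nS:] != yt)) / w[nS:].sum()
        eps = min(max(eps, 1e-10), 0.499)
        beta_t = eps / (1.0 - eps)
        w[:nS] *= beta_src ** (pred[:nS] != ys)                 # shrink bad source
        w[nS:] *= beta_t ** (-(pred[nS:] != yt).astype(float))  # grow bad target
        learners.append(h)
        betas.append(beta_t)
    half = len(learners) // 2
    def predict(Z):
        score = sum(-np.log(b) * L(Z) for L, b in zip(learners[half:], betas[half:]))
        thresh = 0.5 * sum(-np.log(b) for b in betas[half:])
        return (score >= thresh).astype(int)
    return predict
```

As in the original algorithm, the final hypothesis votes only over the second half of the boosting rounds.
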
Wu and Dietterich [202] integrated the source domain (auxiliary) data in a Support Vector Machine (SVM) [189] framework for improving the classification performance in the target domain. In addition, Gao et al. [67] proposed a graph-based locally weighted ensemble framework to combine multiple models for transfer learning. The idea is to assign weights to the various models dynamically based on local structures in a graph; the weighted models are then used to make predictions on test data.

2.2.2 Transferring Knowledge of Feature Representations

The feature-representation-transfer approach to the inductive transfer learning problem aims at finding "good" feature representations to minimize domain divergence and classification or regression model error. Strategies to find "good" feature representations differ for different types of source domain data. If a lot of labeled data in the source domain are available, supervised learning methods can be used to construct a feature representation; this is similar to common feature learning in the field of multi-task learning [31]. If no labeled data in the source domain are available, unsupervised learning methods are proposed to construct the feature representation.

Supervised Feature Construction

Supervised feature construction methods for the inductive transfer learning setting are similar to those used in multi-task learning. The basic idea is to learn a low-dimensional representation that is shared across related tasks. In addition, the learned new representation can reduce the classification or regression model error of each task as well. Argyriou et al. [6] proposed a sparse feature learning method for multi-task learning. In the inductive transfer learning setting, the common features can be learned by solving the following optimization problem:

    arg min_{A,U}  ∑_{t∈{S,T}} ∑_{i=1}^{nt} L(yti, ⟨at, U^T xti⟩) + γ ∥A∥²_{2,1}
    s.t.  U ∈ O^d                                                          (2.1)

In this equation, S and T denote the tasks in the source domain and the target domain, respectively, A = [aS, aT] ∈ R^{d×2} is a matrix of parameters, and U is a d × d orthogonal matrix (mapping function) for mapping the original high-dimensional data to low-dimensional representations. The (r, p)-norm of A is defined as ∥A∥_{r,p} := (∑_{i=1}^{d} ∥a^i∥_r^p)^{1/p}. The optimization problem (2.1) estimates the low-dimensional representations U^T XT, U^T XS and the parameters, A, of the model at the same time. The optimization problem (2.1) can be further transformed into an equivalent convex optimization formulation and be solved efficiently.

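A toy stand-in for the shared-representation idea in problem (2.1): instead of the joint convex optimization, the sketch below fixes U by PCA on the pooled inputs and then fits the per-task parameters. It conveys the two ingredients, a shared mapping U and task-specific columns of A, but it is not the actual algorithm.

```python
import numpy as np

def shared_subspace_fit(tasks, k=1, lam=0.1):
    """Two-step stand-in for problem (2.1): pick a shared projection U by
    PCA on the pooled inputs, then fit a ridge regressor per task on U^T x.
    (The actual method learns U and the task parameters A jointly.)"""
    X_all = np.vstack([X for X, _ in tasks])
    X_all = X_all - X_all.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_all, full_matrices=False)
    U = Vt[:k].T                                  # d x k shared mapping
    models = []
    for X, y in tasks:
        Z = X @ U                                 # low-dimensional features
        a = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)
        models.append(a)                          # a_t: one column of A
    return U, models

# Two toy regression tasks sharing the same relevant direction (feature 0):
x0 = np.linspace(-3, 3, 21)
X = np.c_[x0, 0.01 * np.ones_like(x0)]
U, models = shared_subspace_fit([(X, x0), (X, 2.0 * x0)], k=1)
```
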
Another famous common feature learning method for multi-task learning is the alternating structure optimization (ASO) algorithm, proposed by Ando and Zhang [4]. In ASO, at the first step, a linear classifier is trained for each of the multiple tasks. Then, in the second step, the weight vectors of the multiple classifiers are used to construct a predictor space, and Singular Value Decomposition (SVD) is applied on this space to recover a low-dimensional predictive space as a common feature space underlying the multiple tasks. The ASO algorithm has been applied successfully to several applications [5, 25]. However, it is non-convex and does not guarantee to find a global optimum. More recently, Chen et al. [37] presented an improved formulation (iASO) based on ASO by proposing a novel regularizer. Furthermore, in order to convert the new formulation into a convex formulation, Chen et al. proposed in [37] a convex alternating structure optimization (cASO) algorithm to solve the optimization problem. Argyriou et al. [8] proposed a spectral regularization framework on matrices for multi-task structure learning.

In addition, Jebara [84] proposed to select features for multi-task learning with SVMs. Lee et al. [99] proposed a convex optimization algorithm for simultaneously learning meta-priors and feature weights from an ensemble of related prediction tasks; the meta-priors can be transferred among different tasks. Ruckert et al. [160] designed a kernel-based approach to inductive transfer, which aims at finding a suitable kernel for the target data.

Unsupervised Feature Construction

In [150], Raina et al. proposed to apply sparse coding [98], which is an unsupervised feature construction method, for learning higher-level features for transfer learning. The basic idea of this approach consists of two steps. In the first step, higher-level basis vectors b = {b1, b2, . . . , bs} are learned on the source domain data by solving the optimization problem (2.2), shown as follows:

    min_{a,b}  ∑_i ∥xSi − ∑_j aSi^j bj∥²₂ + β ∥aSi∥₁
    s.t.  ∥bj∥₂ ≤ 1,  ∀j ∈ 1, . . . , s                                    (2.2)

In this equation, aSi^j is a new representation of basis bj for the input xSi, and β is a coefficient to balance the feature construction term and the regularization term. After learning the basis vectors b, in the second step, the optimization problem (2.3) is applied on the target domain data to learn higher-level features based on the basis vectors b:

    a*Ti = arg min_{aTi}  ∥xTi − ∑_j aTi^j bj∥²₂ + β ∥aTi∥₁               (2.3)

Finally, discriminative algorithms can be applied to the {a*Ti}'s with corresponding labels to train classification or regression models for use in the target domain. One drawback of this method is that the so-called higher-level basis vectors learned on the source domain in the optimization problem (2.2) may not be suitable for use in the target domain.

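The two-step scheme of (2.2) and (2.3) can be illustrated with a deliberately easy special case: if the basis b is orthonormal, the lasso problem (2.3) has a closed-form soft-thresholding solution. Real implementations also learn the basis from the unlabeled source data, which this sketch skips.

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def encode(X, B, beta=0.1):
    """Step (2.3): sparse codes a minimizing ||x - a B||^2 + beta ||a||_1.
    When the rows of B are orthonormal this is solved exactly by
    soft-thresholding the projections x . b_j at beta / 2."""
    return soft_threshold(X @ B.T, beta / 2.0)

# Step (2.2) would learn the basis B from unlabeled source-domain data;
# here we fix a trivial orthonormal basis and only show the target-side
# encoding of one input vector.
B = np.eye(3)
codes = encode(np.array([[2.0, 0.05, -1.0]]), B, beta=0.1)
# codes -> [[1.95, 0.0, -0.95]]
```
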
2.2.3 Transferring Knowledge of Parameters

Most parameter-transfer approaches to the inductive transfer learning setting assume that individual models for related tasks should share some parameters or prior distributions of hyper-parameters. Most approaches described in this section, including a regularization framework and a hierarchical Bayesian framework, are designed to work under multi-task learning. However, they can be easily modified for transfer learning. As mentioned above, multi-task learning tries to learn both the source and target tasks simultaneously and perfectly, while transfer learning only aims at boosting the performance of the target domain by utilizing the source domain data. Thus, in multi-task learning, the weights of the loss functions for the source and target data are the same, whereas in transfer learning the weights in the loss functions for different domains can be different. Intuitively, we may assign a larger weight to the loss function of the target domain to make sure that we can achieve better performance in the target domain.

Lawrence and Platt [97] proposed an efficient algorithm known as MT-IVM, which is based on Gaussian Processes (GP), to handle the multi-task learning case. MT-IVM tries to learn the parameters of a Gaussian Process over multiple tasks by sharing the same GP prior. Schwaighofer et al. [165] proposed to use a hierarchical Bayesian framework (HB) together with GP for multi-task learning. Besides transferring the priors of the GP models, Bonilla et al. [28] also investigated multi-task learning in the context of GP. The authors proposed to use a free-form covariance matrix over tasks to model inter-task dependencies, where a GP prior is used to induce correlations between tasks.

Recently, manifold learning methods have been adapted for transfer learning. In [193], Wang and Mahadevan proposed a Procrustes-analysis-based approach to manifold alignment without correspondences, which can be used to transfer the knowledge across domains via the aligned manifolds.

Besides transferring the priors of the GP models, some researchers have also proposed to transfer the parameters of SVMs under a regularization framework. Evgeniou and Pontil [63] borrowed the idea of HB to SVMs for multi-task learning. The proposed method assumes that the parameter, w, in the SVM for each task can be separated into two terms: one is a common term over tasks and the other is a task-specific term. By assuming ft = wt · x to be a hyper-plane for task t, the parameters of the SVMs for the source and target tasks can be written as

    wS = w0 + vS,  and  wT = w0 + vT,

where w0 is a common parameter while vS and vT are specific parameters for the source task and the target task, respectively.

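The wt = w0 + vt decomposition can be demonstrated numerically. The sketch below swaps the SVM hinge loss for a squared loss so that plain gradient descent suffices; the two regularizers play the same roles as λ1 and λ2 in the SVM formulation of Evgeniou and Pontil.

```python
import numpy as np

def fit_shared_svm_style(XS, yS, XT, yT, lam1=1.0, lam2=0.1,
                         lr=0.005, steps=4000):
    """Minimize sum_t ||X_t (w0 + v_t) - y_t||^2 + lam1 (||vS||^2 + ||vT||^2)
    + lam2 ||w0||^2 by gradient descent. Squared loss stands in for the
    hinge loss; w0 is shared, vS and vT are task-specific."""
    d = XS.shape[1]
    w0, vS, vT = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(steps):
        gS = 2.0 * XS.T @ (XS @ (w0 + vS) - yS)   # source data gradient
        gT = 2.0 * XT.T @ (XT @ (w0 + vT) - yT)   # target data gradient
        w0 -= lr * (gS + gT + 2.0 * lam2 * w0)    # shared term sees both tasks
        vS -= lr * (gS + 2.0 * lam1 * vS)
        vT -= lr * (gT + 2.0 * lam1 * vT)
    return w0, vS, vT
```
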
With this assumption, an extension of SVMs to the multi-task learning case can be written as follows:

    min_{w0, vt, ξti}  J(w0, vt, ξti) = ∑_{t∈{S,T}} ∑_{i=1}^{nt} ξti + λ1 ∑_{t∈{S,T}} ∥vt∥² + λ2 ∥w0∥²
    s.t.  yti (w0 + vt) · xti ≥ 1 − ξti,
          ξti ≥ 0,  i ∈ {1, 2, . . . , nt},  t ∈ {S, T},

where the xti's, t ∈ {S, T}, are input feature vectors in the source and target domains, respectively, and the yti's are the corresponding labels. The ξti's are slack variables that measure the degree of misclassification of the data xti, as in standard SVMs, and λ1 and λ2 are positive regularization parameters that trade off the importance between the misclassification and regularization terms. By solving the optimization problem above, we can learn the parameters w0, vS and vT simultaneously.

2.2.4 Transferring Relational Knowledge

Different from the other three contexts, the relational-knowledge-transfer approach deals with transfer learning problems in relational domains, where the data are non-i.i.d. and can be represented by multiple relations, such as networked data and social network data. This approach does not assume that the data drawn from each domain are independent and identically distributed (i.i.d.), as traditionally assumed. It tries to transfer the relationships among data from a source domain to a target domain. In this context, statistical relational learning techniques are proposed to solve these problems.

Mihalkova et al. [124] proposed an algorithm, TAMAR, that transfers relational knowledge with Markov Logic Networks (MLNs) [155] across relational domains. MLNs is a powerful formalism for statistical relational learning, which combines the compact expressiveness of first-order logic with the flexibility of probability. In an MLN, entities in a relational domain are represented by predicates and their relationships are represented in first-order logic. TAMAR is motivated by the fact that if two domains are related to each other, there may exist mappings to connect entities and their relationships from a source domain to a target domain. For example, a professor can be considered as playing a similar role in an academic domain as a manager in an industrial management domain. In addition, the relationship between a professor and his or her students is similar to the relationship between a manager and his or her workers. Thus, there may exist a mapping from professor to manager, and a mapping from the professor-student relationship to the manager-worker relationship. In this vein, TAMAR tries to use an MLN learned for a source domain to aid in the learning of an MLN for a target domain. Basically, TAMAR is a two-stage algorithm. At the first stage, a mapping is constructed from the source MLN to the target domain based on the weighted pseudo log-likelihood measure (WPLL).

At the second stage, a revision is done for the mapped structure in the target domain through the FORTE algorithm [152], which is an inductive logic programming (ILP) algorithm for revising first-order theories. The revised MLN can be used as a relational model for inference or reasoning in the target domain.

In a follow-up work [125], Mihalkova et al. extended TAMAR to the single-entity-centered setting of transfer learning, where only one entity in a target domain is available. Davis et al. [52] proposed an approach to transferring relational knowledge based on a form of second-order Markov logic. The basic idea of the algorithm is to discover structural regularities in the source domain in the form of Markov logic formulas with predicate variables, and then to instantiate these formulas with predicates from the target domain.

2.3 Transductive Transfer Learning

The term transductive transfer learning was first proposed by Arnold et al. [9], where they required that the source and target tasks be the same, although the domains may be different. On top of these conditions, they further required that all unlabeled data in the target domain be available at training time. We believe that this condition can be relaxed; instead, in our definition of the transductive transfer learning setting, we only require that part of the unlabeled target domain data be seen at training time, in order to obtain the marginal probability for the target domain data.

Note that the word "transductive" is used with several meanings. In the traditional machine learning setting, transductive learning [90] refers to the situation where all test data are required to be seen at training time, and the learned model cannot be reused for future data. Thus, when some new test data arrive, they must be classified together with all existing data. In contrast, in our categorization of transfer learning, we use the term "transductive" to emphasize the concept that in this type of transfer learning, the tasks must be the same and some unlabeled data must be available in the target domain.

Definition 3. (Transductive Transfer Learning) Given a source domain DS and a corresponding learning task TS, a target domain DT and a corresponding learning task TT, transductive transfer learning aims to improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT and TS = TT. In addition, some unlabeled target domain data must be available at training time.

This definition covers the work of Arnold et al. [9], since the latter considered domain adaptation, where the difference lies between the marginal probability distributions of the source and target data. For example, assume our task is to classify whether a document describes information about laptops or not. The source domain documents are downloaded from news websites and annotated by humans, while the target domain documents are downloaded from shopping websites. The goal of transductive transfer learning is to make use of some unlabeled (without human annotation) target documents, together with a lot of labeled source domain documents, to train a good classifier to make predictions on the documents in the target domain (including documents unseen in training).

In the above definition of transductive transfer learning, the source and target tasks are the same, while the data distributions may be different across domains. This is similar to the requirements in domain adaptation and sample selection bias. Similar to the traditional transductive learning setting, which aims to make the best use of the unlabeled test data for learning, in our classification scheme under transductive transfer learning, we also assume that some target-domain unlabeled data be given. This setting can be split into two cases: (a) the feature spaces between the source and target domains are different, XS ≠ XT; and (b) the feature spaces between domains are the same, XS = XT, but the marginal probability distributions of the input data are different, P(XS) ≠ P(XT), which implies that one can adapt the predictive function learned in the source domain for use in the target domain through some unlabeled target-domain data. Most approaches described in the following sections are related to case (b) above.

2.3.1 Transferring the Knowledge of Instances

Most instance-transfer approaches to the transductive transfer learning setting are motivated by importance sampling. To see how importance-sampling-based methods may help in this setting, we first review the problem of empirical risk minimization (ERM) [189]. In general, we might want to learn the optimal parameters $\theta^*$ of the model by minimizing the expected risk,

$$\theta^* = \arg\min_{\theta \in \Theta} \mathbb{E}_{(x,y) \in P} [l(x, y, \theta)],$$

where $l(x, y, \theta)$ is a loss function that depends on the parameter $\theta$. However, since it is hard to estimate the probability distribution $P$, we choose to minimize the ERM instead,

$$\theta^* = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} [l(x_i, y_i, \theta)],$$

where $n$ is the size of the training data, the $x_i$'s are input feature vectors and the $y_i$'s are the corresponding labels. In the transductive transfer learning setting, we want to learn an optimal model for the target domain by minimizing the expected risk,

$$\theta^* = \arg\min_{\theta \in \Theta} \sum_{(x,y) \in \mathcal{D}_T} P(\mathcal{D}_T)\, l(x, y, \theta).$$
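To make the reweighting idea behind these instance-transfer approaches concrete, here is a minimal numerical sketch under assumed known densities (source P_S = N(0, 1), target P_T = N(1, 1); the variable names and the squared-loss toy model are invented for the example): weighting each source instance's loss by the density ratio pT(x)/pS(x) moves the empirical minimizer toward the target-optimal parameter.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x_src = rng.normal(0.0, 1.0, size=5000)   # source samples from P_S = N(0, 1)
y = lambda x: 2.0 * x                      # same labeling function in both domains
# Target marginal P_T = N(1, 1); importance weight w(x) = p_T(x) / p_S(x).
w = norm.pdf(x_src, 1, 1) / norm.pdf(x_src, 0, 1)

# Fit a constant predictor theta under squared loss l(x, y, theta) = (y - theta)^2.
theta_unweighted = np.mean(y(x_src))               # minimizes source empirical risk
theta_weighted = np.sum(w * y(x_src)) / np.sum(w)  # minimizes reweighted risk
# The target-optimal constant is E_T[y] = 2 * 1 = 2; the unweighted fit stays
# near the source optimum E_S[y] = 0, while reweighting pulls theta toward 2.
```

This is exactly the self-normalized importance-sampling view of the reweighted empirical risk: the minimizer of the weighted loss approximates the minimizer of the target expected risk even though only source samples are used.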

However, since no labeled data in the target domain are observed in the training data, we have to learn a model from the source domain data instead. If P(DS) = P(DT), then we may simply learn the model by solving the following optimization problem for use in the target domain,

$$\theta^* = \arg\min_{\theta \in \Theta} \sum_{(x,y) \in \mathcal{D}_S} P(\mathcal{D}_S)\, l(x, y, \theta).$$

Otherwise, when P(DS) ≠ P(DT), we need to modify the above optimization problem to learn a model with high generalization ability for the target domain:

$$\theta^* = \arg\min_{\theta \in \Theta} \sum_{(x,y) \in \mathcal{D}_S} \frac{P(\mathcal{D}_T)}{P(\mathcal{D}_S)} P(\mathcal{D}_S)\, l(x, y, \theta) \approx \arg\min_{\theta \in \Theta} \sum_{i=1}^{n_S} \frac{P_T(x_{T_i}, y_{T_i})}{P_S(x_{S_i}, y_{S_i})}\, l(x_{S_i}, y_{S_i}, \theta). \tag{2.4}$$

Therefore, by adding different penalty values to each instance $(x_{S_i}, y_{S_i})$ with the corresponding weight $\frac{P_T(x_{T_i}, y_{T_i})}{P_S(x_{S_i}, y_{S_i})}$, we can learn a precise model for the target domain. Furthermore, since P(YT|XT) = P(YS|XS), the difference between P(DS) and P(DT) is caused by P(XS) and P(XT), and

$$\frac{P_T(x_{T_i}, y_{T_i})}{P_S(x_{S_i}, y_{S_i})} = \frac{P(x_{T_i})}{P(x_{S_i})}.$$

If we can estimate this ratio for each instance, we can solve the transductive transfer learning problems. There exist various ways to estimate $\frac{P(x_{T_i})}{P(x_{S_i})}$. Zadrozny [219] proposed to estimate the terms $P(x_{S_i})$ and $P(x_{T_i})$ independently by constructing simple classification problems, which is difficult when the size of the data set is small. Huang et al. [82] proposed a kernel-mean matching (KMM) algorithm to learn $\frac{P(x_{T_i})}{P(x_{S_i})}$ directly by matching the means between the source domain data and the target domain data in a reproducing-kernel Hilbert space (RKHS). KMM can be rewritten as the following quadratic programming (QP) optimization problem:

$$\min_{\beta} \; \frac{1}{2} \beta^T K \beta - \kappa^T \beta \quad \text{s.t.} \quad \beta_i \in [0, B] \;\text{and}\; \Big| \sum_{i=1}^{n_S} \beta_i - n_S \Big| \le n_S \epsilon, \tag{2.5}$$

where $K = \begin{bmatrix} K_{S,S} & K_{S,T} \\ K_{T,S} & K_{T,T} \end{bmatrix}$ and $K_{ij} = k(x_i, x_j)$, where $x_i, x_j \in X_S \cup X_T$. $K_{S,S}$ and $K_{T,T}$ are kernel matrices for the source domain data and the target domain data, respectively. That means $K_{S,T} = K_{T,S}^T$ is a kernel matrix across the source and target domain data. In addition, $\kappa_i = \frac{n_S}{n_T} \sum_{j=1}^{n_T} k(x_i, x_{T_j})$, where $x_i \in X_S \cup X_T$, while $x_{T_j} \in X_T$. It can be proved that $\beta_i = \frac{P(x_{T_i})}{P(x_{S_i})}$ [82]. An advantage of using KMM is that it can avoid performing density estimation of either $P(x_{S_i})$ or $P(x_{T_i})$, which is difficult when the size of the data set is small.
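The KMM quadratic program can be prototyped with a general-purpose constrained solver. The sketch below uses a Gaussian kernel and scipy's SLSQP as a stand-in for a dedicated QP solver; the data, parameter values and function names are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kmm(Xs, Xt, B=10.0, eps=0.01, sigma=1.0):
    """Kernel mean matching:
    min_beta 0.5 * beta' K beta - kappa' beta
    s.t. 0 <= beta_i <= B, |sum_i beta_i - n_S| <= n_S * eps,
    where K is the source kernel matrix and kappa matches the target mean."""
    nS, nT = len(Xs), len(Xt)
    K = gaussian_kernel(Xs, Xs, sigma)
    kappa = (nS / nT) * gaussian_kernel(Xs, Xt, sigma).sum(axis=1)
    obj = lambda b: 0.5 * b @ K @ b - kappa @ b
    grad = lambda b: K @ b - kappa
    cons = [{'type': 'ineq', 'fun': lambda b: b.sum() - nS * (1 - eps)},
            {'type': 'ineq', 'fun': lambda b: nS * (1 + eps) - b.sum()}]
    res = minimize(obj, np.ones(nS), jac=grad, bounds=[(0, B)] * nS,
                   constraints=cons, method='SLSQP')
    return res.x

rng = np.random.default_rng(2)
Xs = rng.normal(0.0, 1.0, size=(60, 1))  # source domain samples
Xt = rng.normal(1.0, 1.0, size=(60, 1))  # target domain shifted to the right
beta = kmm(Xs, Xt)
# Source points lying where the target density is high get larger weights,
# so beta should increase with the first coordinate of Xs in this example.
```

Note how no density is ever estimated: the weights come purely from matching kernel means, which is the advantage discussed above.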

Sugiyama et al. [180] proposed an algorithm known as Kullback-Leibler Importance Estimation Procedure (KLIEP) to estimate the ratio $\frac{P(x_{T_i})}{P(x_{S_i})}$ directly, based on the minimization of the Kullback-Leibler divergence. KLIEP can be integrated with cross-validation to perform model selection automatically in two steps: (1) estimating the weights of the source domain data; (2) training models on the re-weighted data. Bickel et al. [21] combined the two steps in a unified framework by deriving a kernel-logistic regression classifier. Dai et al. [48] extended a traditional Naive Bayesian classifier for the transductive transfer learning problems. More recently, Kanamori et al. [91] proposed a method called unconstrained least-squares importance fitting (uLSIF) to estimate the importance efficiently by formulating the direct importance estimation problem as a least-squares function fitting problem, which performs well even when the dimensionality of the data domains is high. Sugiyama et al. [179] further extended the uLSIF algorithm by estimating importance in a non-stationary subspace. However, this method is focused on estimating the importance in a latent space instead of learning a latent space for adaptation. It should be emphasized that, besides covariate shift adaptation, these importance estimation techniques have also been applied to various applications, such as independent component analysis (ICA) [181], outlier detection [78] and change-point detection [92]. For more information on importance sampling and re-weighting methods for covariate shift or sample selection bias, readers can refer to a recently published book [148] by Quionero-Candela et al.

2.3.2 Transferring Knowledge of Feature Representations

Most feature-representation-transfer approaches to the transductive transfer learning setting are under unsupervised learning frameworks, which aim to make use of the unlabeled data from the target domain to extract some common features that reduce the difference between the source and target domains. Blitzer et al. [26] proposed a structural correspondence learning (SCL) algorithm, which modifies the ASO [5] algorithm, proposed for multi-task learning, for transductive transfer learning. In SCL, a first step is to construct some pseudo related tasks: Blitzer et al. proposed to first define a set of pivot features (the number of pivot features is denoted by m), which are common features that occur frequently and similarly across domains, using labeled source domain and unlabeled target domain data. For example, in sentiment classification (e.g., reviews on different products), the words "good", "nice" and "bad", which are sentiment words used commonly across different domains, are examples of pivot features. SCL then treats each pivot feature as a new output vector to construct a task, with the non-pivot features as inputs. The m linear classifiers are trained to model the relationship between the non-pivot features and the pivot features, as follows:

$$f_l(x) = \mathrm{sgn}(w_l^T \cdot x), \quad l = 1, \dots, m.$$
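The least-squares formulation behind uLSIF admits a very short prototype, since the regularized fit has a closed-form solution. The sketch below is a simplified illustration under assumptions not fixed by the text: Gaussian basis functions centered at the target samples, invented hyperparameter values, and negative coefficients rounded up to zero.

```python
import numpy as np

def ulsif(x_src, x_tgt, centers=None, sigma=1.0, lam=0.1):
    """Sketch of unconstrained least-squares importance fitting:
    model w(x) = sum_l alpha_l * k(x, c_l) and minimize
    0.5 * alpha' H alpha - h' alpha + 0.5 * lam * ||alpha||^2,
    whose minimizer is alpha = (H + lam I)^{-1} h."""
    c = x_tgt if centers is None else centers
    k = lambda x: np.exp(-(x[:, None] - c[None, :]) ** 2 / (2 * sigma ** 2))
    Phi_s, Phi_t = k(x_src), k(x_tgt)
    H = Phi_s.T @ Phi_s / len(x_src)     # empirical E_S[psi psi']
    h = Phi_t.mean(axis=0)               # empirical E_T[psi]
    alpha = np.linalg.solve(H + lam * np.eye(len(c)), h)
    alpha = np.maximum(alpha, 0.0)       # clip negative coefficients to zero
    return lambda x: k(np.atleast_1d(x)) @ alpha

rng = np.random.default_rng(3)
x_src = rng.normal(0.0, 1.0, 300)        # source: N(0, 1)
x_tgt = rng.normal(1.0, 1.0, 300)        # target: N(1, 1)
w = ulsif(x_src, x_tgt)
# The estimated importance should be larger near the target mean (x = 1)
# than far out in the source-only region (x = -2).
```

The efficiency claim in the text corresponds to the single linear solve above: unlike KLIEP, no iterative optimization is needed once H and h are formed.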

Then, as in ASO, singular value decomposition (SVD) is applied on the weight matrix $W = [w_1\, w_2 \dots w_m] \in \mathbb{R}^{q \times m}$, where q is the number of features of the original data, such that $W = UDV^T$, where $U_{q \times r}$ and $V_{r \times m}$ are the matrices of the left and right singular vectors, respectively. The matrix $D_{r \times r}$ is a diagonal matrix consisting of non-negative singular values, which are ranked in non-increasing order. Let $\theta = U_{[1:h,:]}^T$ (h is the number of the shared features); then θ is a transformation that maps non-pivot features to a latent space. Finally, standard classification algorithms can be applied on the new representations to train classifiers. In [26], Blitzer et al. used a heuristic method to select pivot features for natural language processing (NLP) problems, such as tagging of sentences. In their follow-up work [25], Mutual Information (MI) is proposed for choosing the pivot features.

Daumé III [51] proposed a simple feature augmentation method for transfer learning problems in the Natural Language Processing (NLP) area. It aims to augment each of the feature vectors of different domains to a high-dimensional feature vector as follows:

$$\tilde{x}_S = [x_S, x_S, \mathbf{0}] \;\;\&\;\; \tilde{x}_T = [x_T, \mathbf{0}, x_T],$$

where $x_S$ and $x_T$ are original feature vectors of the source and target domains, respectively, and $\mathbf{0}$ is a vector of zeros whose length is equivalent to that of the original feature vector.

Xue et al. [207] proposed a cross-domain text classification algorithm that extended the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data from different but related domains into a unified probabilistic model. Dai et al. [47] proposed a co-clustering based algorithm to discover common feature clusters, such that label information can be propagated across different domains by using the common clusters as a bridge. Xing et al. [205] proposed a novel algorithm known as bridged refinement to correct the labels predicted by a shift-unaware classifier towards a target distribution, taking the mixture distribution of the training and test data as a bridge to better transfer from the training data to the test data. Ling et al. [108] proposed a spectral classification framework for cross-domain transfer learning problems, where the objective function is introduced to seek consistency between the in-domain supervision and the out-of-domain intrinsic structure. A related idea is to reduce the difference between domains while ensuring that the similarity between data within each domain remains larger than the similarity across domains.

Most feature-based transductive transfer learning methods do not minimize the distance in distributions between domains directly. Recently, von Bünau et al. [191] proposed a method known as stationary subspace analysis (SSA) to match distributions in a latent space for time-series data analysis. SSA theoretically studied the conditions under which a stationary space can be identified from multivariate time series. They also proposed the SSA procedure to find stationary components by matching the first two moments of the data distributions in different epochs.
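The SCL pipeline described earlier in this section — train m pivot predictors on the non-pivot features, stack their weights, and keep the top-h left singular vectors as the mapping θ — can be prototyped compactly. This sketch substitutes least-squares regression for the linear classifiers and uses synthetic data with invented names; it is an illustration of the construction, not the cited implementation.

```python
import numpy as np

def scl_projection(X_np, X_p, h=2):
    """Sketch of structural correspondence learning:
    X_np: non-pivot features (n x q), X_p: pivot features (n x m).
    Fit one linear predictor per pivot feature, stack the weights into
    W (q x m), and take the top-h left singular vectors as theta."""
    # Least-squares surrogate for the m linear pivot predictors f_l(x).
    W, *_ = np.linalg.lstsq(X_np, X_p, rcond=None)     # (q x m)
    U, D, Vt = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :h].T                                 # (h x q) map to latent space
    return theta

rng = np.random.default_rng(4)
n, q, m = 200, 10, 4
X_np = rng.normal(size=(n, q))
X_p = X_np[:, :2] @ rng.normal(size=(2, m))   # pivots driven by 2 latent directions
theta = scl_projection(X_np, X_p, h=2)
X_aug = np.hstack([X_np, X_np @ theta.T])     # original + shared latent features
```

Because the pivot features here depend only on the first two non-pivot coordinates, the recovered θ concentrates on exactly those directions — the "correspondence" that SCL is designed to find.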

However, SSA is focused on the identification of a stationary subspace, without considering the preservation of properties such as data variance in the subspace. As a result, SSA may map the data to some noisy factors which are stationary across domains but completely irrelevant to the target supervised task. Then classifiers trained on the new representations learned by SSA may not achieve good performance for transductive transfer learning.

2.4 Unsupervised Transfer Learning

Definition 4. (Unsupervised Transfer Learning) Given a source domain DS with a learning task TS, a target domain DT and a corresponding learning task TT, unsupervised transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where TS ≠ TT and YS and YT are not observable. (In unsupervised transfer learning, no labeled data are observed in the source and target domains in training; the predicted labels are latent variables, such as clusters or reduced dimensions.)

Based on the definition of the unsupervised transfer learning setting, no labeled data are observed in the source and target domains in training. So far, there is little research work in this setting. For example, assume we have a lot of documents downloaded from news websites, which are referred to as source domain documents, and a few documents downloaded from shopping websites, which are referred to as target domain documents. The task is to cluster the target domain documents into some hidden categories. Note that it is usually not possible to find precise clusters if the data are sparse. Thus, the goal of unsupervised transfer learning is to make use of the documents in the source domain, where precise clusters can be obtained because of the sufficiency of training data, to guide clustering on the target domain documents.

2.4.1 Transferring Knowledge of Feature Representations

Dai et al. [50] studied a new case of clustering problems, known as self-taught clustering, which aims at clustering a small collection of unlabeled data in the target domain with the help of a large amount of unlabeled data in the source domain. Self-taught clustering (STC) is an instance of unsupervised transfer learning. STC tries to learn a common feature space across domains, which helps in clustering in the target domain. The objective function of STC is shown as follows:

$$J(\tilde{X}_T, \tilde{X}_S, \tilde{Z}) = I(X_T, Z) - I(\tilde{X}_T, \tilde{Z}) + \lambda \left[ I(X_S, Z) - I(\tilde{X}_S, \tilde{Z}) \right],$$

where $X_S$ and $X_T$ are the source and target domain data, respectively, Z is a feature space shared by $X_S$ and $X_T$, and $I(\cdot, \cdot)$ is the mutual information between two random variables. Suppose that there exist three clustering functions $C_{X_T}: X_T \rightarrow \tilde{X}_T$, $C_{X_S}: X_S \rightarrow \tilde{X}_S$ and $C_Z: Z \rightarrow \tilde{Z}$, where $\tilde{X}_T$, $\tilde{X}_S$ and $\tilde{Z}$ are the corresponding clusters of $X_T$, $X_S$ and $Z$.
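The mutual-information terms in the STC objective can be evaluated directly from joint co-occurrence tables. The toy sketch below (invented helper names and a hand-built joint distribution) computes the objective for given co-cluster assignments, and shows that the information loss vanishes when the merged rows are proportional — i.e., when clustering discards no feature information.

```python
import numpy as np

def mutual_info(P):
    """I(X;Z) for a joint probability table P (rows: x, cols: z)."""
    px = P.sum(1, keepdims=True)
    pz = P.sum(0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / (px @ pz)[mask])).sum())

def clustered_joint(P, row_cl, col_cl):
    """Aggregate a joint table according to row/column cluster labels."""
    R = np.zeros((row_cl.max() + 1, col_cl.max() + 1))
    for i, ci in enumerate(row_cl):
        for j, cj in enumerate(col_cl):
            R[ci, cj] += P[i, j]
    return R

def stc_objective(P_T, P_S, rows_T, rows_S, cols, lam=1.0):
    """Information loss J = [I(X_T;Z) - I(X~_T;Z~)] + lam*[I(X_S;Z) - I(X~_S;Z~)]."""
    loss_T = mutual_info(P_T) - mutual_info(clustered_joint(P_T, rows_T, cols))
    loss_S = mutual_info(P_S) - mutual_info(clustered_joint(P_S, rows_S, cols))
    return loss_T + lam * loss_S

# Joint table where rows {0,1} and {2,3} are proportional: merging each pair
# into one row-cluster (features kept separate) loses no mutual information.
P = np.array([[0.2, 0.05], [0.2, 0.05], [0.05, 0.2], [0.05, 0.2]])
P = P / P.sum()
J = stc_objective(P, P, np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]),
                  np.array([0, 1]))
```

Minimizing J over the three clustering functions therefore favors co-clusterings that preserve as much of the data-feature dependence as possible in both domains simultaneously.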

The goal of STC is to learn $\tilde{X}_T$ by solving the optimization problem (2.6):

$$\arg\min_{\tilde{X}_T, \tilde{X}_S, \tilde{Z}} \; J(\tilde{X}_T, \tilde{X}_S, \tilde{Z}). \tag{2.6}$$

An iterative algorithm for solving the optimization function (2.6) was presented in [50].

Similarly, Wang et al. [197] proposed a transferred discriminative analysis (TDA) algorithm to solve the transfer dimensionality reduction problem. TDA first applies clustering methods to generate pseudo-class labels for the target domain unlabeled data. It then applies dimensionality reduction methods to the target domain data and labeled source domain data to reduce the dimensionalities. These two steps run iteratively to find the best subspace for the target domain data.

Recently, Bhattacharya et al. [19] proposed a new clustering algorithm to transfer supervision to a clustering task in a target domain from a different source domain, where a relevant supervised data partition is provided. The main idea is to define a cross-domain similarity measure to align clusters in the target domain with the partitions in the source domain, such that the performance of clustering in the target domain can be improved.

2.5 Transfer Bounds and Negative Transfer

An important issue is to recognize the limit of the power of transfer learning. So far, few works have been proposed to study the issue of transferability. Hassan Mahmud and Ray [120] analyzed the case of transfer learning using Kolmogorov complexity, where some theoretical bounds are proved. In particular, the authors used conditional Kolmogorov complexity to measure relatedness between tasks and transfer the "right" amount of information in a sequential transfer learning task under a Bayesian framework. More recently, Eaton et al. [60] proposed a novel graph-based method for knowledge transfer, where the relationships between source tasks are modeled by embedding the set of learned source models in a graph using transferability as the metric. Transferring to a new task proceeds by mapping the problem into the graph and then learning a function on this graph that automatically determines the parameters to transfer to the new learning task.

In transductive transfer learning, there are some research works focusing on the generalization bound when the training and test data distributions are different [16, 15, 17, 24]. Though the generalization bounds proved in the different works differ slightly, there is a common conclusion: the generalization bound of a learning model in transductive transfer learning consists of two terms. One is the error bound of the learning model on the labeled source domain data; the other is the distance between the source and target domains, more specifically the distance between the marginal probability distributions of input features across domains.

How to avoid negative transfer and thus ensure a "safe transfer" of knowledge is another crucial issue in transfer learning. Negative transfer happens when the source domain data and task contribute to reduced performance of learning in the target domain. Despite the fact that how to avoid negative transfer is a very important issue, little research work was published on this topic in the past. Rosenstein et al. [159] empirically showed that if two tasks are too dissimilar, then brute-force transfer may hurt the performance of the target task. Some works have been exploited to analyze relatedness among tasks and task clustering techniques, such as [18, 11], which may help provide guidance on how to avoid negative transfer automatically.

As described above, most research works on modeling task correlation are from the context of multi-task learning [18, 7, 11, 28, 83, 224]. Bakker and Heskes [11] adopted a Bayesian approach in which some of the model parameters are shared for all tasks and others are more loosely connected through a joint prior distribution that can be learned from the data. Thus, the data are clustered based on the task parameters, where tasks in the same cluster are supposed to be related to each other. Argyriou et al. [7] considered situations in which the learning tasks can be divided into groups. Tasks within each group are related by sharing a low-dimensional representation, which differs among the different groups. As a result, tasks within a group can find it easier to transfer useful knowledge. Bonilla et al. [28] proposed a multi-task learning method based on Gaussian Processes (GP), which provides a global approach to model and learn task relatedness in the form of a task covariance matrix. However, the optimization procedure introduced in [28] is non-convex, so its results may be sensitive to parameter initialization. Zhang and Yeung [224] proposed an improved regularization framework to model the negative and positive correlation between tasks, where the resultant optimization procedure is convex. Jacob et al. [83] presented a convex approach to clustered multi-task learning by designing a new spectral norm to penalize over a set of weights, each of which is associated to a task.

The main concern of inductive transfer learning is the learning performance in the target task only. Thus, in inductive transfer learning, one may be particularly interested in transferring knowledge from one or more source tasks to a target task rather than learning these tasks simultaneously. To decide whether transfer learning techniques should be applied, we need to answer the question of how similar a given target task is to a source task. Motivated by [28], Cao et al. [30] proposed an Adaptive Transfer learning algorithm based on GP (AT-GP), which aims to adapt the transfer learning schemes by automatically estimating the similarity between the source and target tasks. In AT-GP, a new semi-parametric kernel is designed to model correlation between tasks, and the learning procedure targets at improving the performance of the target task only. More recently, Seah et al. [167] empirically studied the negative transfer problem by proposing a predictive distribution matching classifier based on Support Vector Machines (SVMs) to identify the regions of relevant source domain data where the predictive distributions maximally align with those of the target domain data, and then avoid negative transfer.

2.6 Other Research Issues of Transfer Learning

Besides the negative transfer issue, in recent years several research issues of transfer learning have attracted more and more attention from the machine learning and data mining communities. We summarize them as follows.

• Transfer learning from multiple source domains. Most transfer learning methods introduced in the previous sections focus on one-to-one transfer, which means there are only one source domain and one target domain. However, in some real-world scenarios, we may have multiple sources at hand. Developing algorithms that make use of multiple sources to help learn models in the target domain is useful in practice. Yang et al. [209] and Duan et al. [57] proposed algorithms to learn a new SVM for the target domain by adapting some SVMs learned from multiple source domains. Luo et al. [119] proposed to train a classifier for use in the target domain by maximizing the consensus of predictions from multiple sources. Mansour et al. [122] proposed a distribution-weighted linear combining framework for learning from multiple sources. The main idea is to estimate the data distribution of each source to reweight the data from the different source domains. Theoretical studies on transfer learning from multiple source domains have also been presented in [122, 123, 15].

• Transfer learning across different feature spaces. As described in the definition of transfer learning, the source and target domain data may come from different feature spaces. Thus, how to transfer knowledge successfully across different feature spaces is another interesting issue. It is unlike multi-view learning [27], which assumes the features for each instance can be divided into several views, each with its own distinct feature space, but requires that each instance in one view have its correspondence in the other views. Though multi-view learning techniques can be applied to model multi-modality data, transfer learning across different feature spaces aims to solve the problem when the source and target domain data belong to two quite different feature spaces, such as image vs. text, without correspondences across feature spaces. Ling et al. [109] proposed an information-theoretic approach for transfer learning to address the cross-language classification problem. The approach addressed the problem when there are plenty of labeled English text data whereas there are only a small number of labeled Chinese text documents. Transfer learning across the two feature spaces is achieved by designing a suitable mapping function across feature spaces as a bridge. Dai et al. [45] proposed a new risk minimization framework based on a language model for machine translation [44] for
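The distribution-weighted combining rule for multiple sources can be sketched in a few lines. The Gaussian source densities and linear per-source experts below are invented for the example; the point is only that each source hypothesis is weighted by its own input density, so the locally reliable expert dominates the combined prediction.

```python
import numpy as np
from scipy.stats import norm

# Two source-domain predictors, each accurate on its own region of the input.
h1 = lambda x: 2.0 * x          # fits source 1 (data concentrated near x = -2)
h2 = lambda x: -1.0 * x + 3.0   # fits source 2 (data concentrated near x = +2)
p1 = lambda x: norm.pdf(x, -2.0, 0.7)   # source 1 input density
p2 = lambda x: norm.pdf(x, 2.0, 0.7)    # source 2 input density

def combined(x, lam=(0.5, 0.5)):
    """Distribution-weighted combination: weight each hypothesis by its
    (scaled) source density at x and renormalize."""
    w1, w2 = lam[0] * p1(x), lam[1] * p2(x)
    return (w1 * h1(x) + w2 * h2(x)) / (w1 + w2)

# Near x = -2 the combination follows h1; near x = +2 it follows h2.
```

This illustrates the reweighting intuition behind combining frameworks of this kind: rather than averaging hypotheses uniformly, the mixture adapts pointwise to whichever source domain actually covers the query region.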

dealing with the problem of learning heterogeneous data that belong to quite different feature spaces.

• Transfer learning for other tasks. Besides classification, regression, clustering and dimensionality reduction, transfer learning has also been proposed for other tasks, such as metric learning [221, 226], structure learning [79] and online learning [227]. Zha et al. [221] proposed a regularization framework to learn a new distance metric in a target domain by leveraging pre-learned distance metrics from auxiliary domains. Zhang and Yeung [226] first proposed a convex formulation for multi-task metric learning by modeling the task relationships in the form of a task covariance matrix, and then adapted it to the transfer learning setting. In [79], Honorio and Samaras proposed to apply multi-task learning techniques to learn structures across multiple Gaussian graphical models simultaneously. In [227], Zhao and Hoi investigated a framework to transfer knowledge from a source domain to an online learning task in a target domain.

• Transfer learning with active learning. As mentioned at the beginning of Chapter 2, active learning and transfer learning both aim at learning a precise model with minimal human supervision for a target task. Several researchers have proposed to combine these two techniques in order to learn a more precise model with even less supervision. Liao et al. [107] proposed a new active learning method to select the unlabeled data in a target domain to be labeled with the help of the source domain data. Shi et al. [169] applied an active learning algorithm to select important instances for transfer learning with TrAdaBoost [49] and standard SVMs. In [33], Chan and Ng proposed to adapt existing Word Sense Disambiguation (WSD) systems to a target domain by using domain adaptation techniques and employing an active learning strategy [101] to actively select examples to be annotated from the target domain. Harpale and Yang [75] proposed an active learning framework for the multi-task adaptive filtering [157] problem. They first proposed to apply multi-task learning techniques for adaptive filtering, and then explored various active learning approaches for the multi-task adaptive filter to improve performance.

2.7 Real-world Applications of Transfer Learning

Recently, transfer learning has been applied successfully to many application areas, such as Natural Language Processing (NLP), Information Retrieval (IR), recommendation systems, computer vision, image analysis, multimedia data mining, bioinformatics, activity recognition and wireless sensor networks.

In the field of Natural Language Processing (NLP), transfer learning, which is known as domain adaptation, has been widely studied for solving various tasks [42], such as name entity recognition [71, 2], part-of-speech tagging [4, 9, 51], sentiment classification [25, 26, 104, 156, 177] and word sense disambiguation [33, 35].

Information Retrieval (IR) is another application area where transfer learning techniques have been widely studied and applied. Web applications of transfer learning include text classification across domains [151, 47, 48, 207, 102, 113], learning to rank across domains [10, 36, 29, 72, 68], advertising [40, 38], information extraction [200, 201] and multi-domain collaborative filtering [103, 143]. For example, Yang et al. [211] and Chen et al. [40] proposed probabilistic models to leverage textual information to help image clustering and image advertising, respectively.

In the past two years, transfer learning techniques have attracted more and more attention in the fields of computer vision and image and multimedia analysis. Applications of transfer learning in these fields include image classification [202, 192, 161, 185, 94], image retrieval [73, 223, 203], face verification from images [196], age estimation from facial images [225], image semantic segmentation [190], video retrieval [208, 115], video concept detection [209, 87, 58, 86, 39] and event recognition from videos [59]. In [154], for computer aided design (CAD), Raykar et al. proposed a novel Bayesian multiple-instance learning algorithm, which can automatically identify the relevant feature subset and use inductive transfer for learning multiple, but conceptually related, classifiers.

In Bioinformatics, some research works have been proposed to apply transfer learning techniques to solve various computational biological problems, motivated by the fact that different biological entities, such as organisms, genes, etc., may be related to each other from a biological point of view. Applications include identifying molecular associations of phenotypic responses [222], mRNA splicing [166], splice site recognition of eukaryotic genomes [199], protein subcellular location prediction [206] and genetic association analysis [214].

Transfer learning techniques have also been explored to solve WiFi localization and sensor-based activity recognition problems [210]. For example, transfer learning techniques were proposed to extract knowledge from WiFi localization models across time periods [217, 218, 213], space [138, 136, 230, 194] and mobile devices [229], to benefit WiFi localization tasks in other settings. Rashidi and Cook [153] and Zheng et al. [228] proposed to apply transfer learning techniques for solving indoor sensor-based activity recognition problems. Furthermore, Alamgir et al. [3] applied multi-task learning techniques to solve brain-computer interface problems. Chai et al. [32] studied how to apply a GP-based multi-task learning method to solve the inverse dynamics problem for a robotic manipulator [32]. Zhuo et al. [235] studied how to transfer domain knowledge to learn relational action models across domains in automated planning.

CHAPTER 3

TRANSFER LEARNING VIA DIMENSIONALITY REDUCTION

In this chapter, we aim at proposing a novel and general dimensionality reduction framework for transductive transfer learning. Our proposed framework can be referred to as a feature-representation-transfer approach. As reviewed in Chapter 2, most previous feature-based transductive transfer learning methods do not explicitly minimize the distance between different domains in the new feature space, which limits the transferability across domains. In contrast, in the proposed dimensionality reduction framework, one objective is to minimize the distance between domains explicitly, such that the difference between different domain data can be reduced in the latent space. Another objective of the proposed framework is to preserve the properties of the original data as much as possible. Note that this objective is important for the final target learning tasks in the latent space.

This chapter is organized as follows. We first introduce our motivation and some preliminaries used in the proposed methods in Chapter 3.1 and Chapter 3.2, respectively. Then we present the proposed dimensionality reduction framework in Chapter 3.3, and three algorithms based on the framework, known as Maximum Mean Discrepancy Embedding (MMDE), Transfer Component Analysis (TCA) and Semi-supervised Transfer Component Analysis (SSTCA), in Chapters 3.4-3.6, respectively. The relationship between the proposed methods and other methods is discussed in Chapter 3.7. Finally, in Chapter 4 and Chapter 5, we apply the proposed methods to two diverse applications, cross-domain WiFi localization and cross-domain text classification, respectively.

3.1 Motivation

In many real-world applications, observed data are controlled by a few latent factors. If two domains are related to each other, they may share some latent factors (or components). Some of these common latent factors may cause the data distributions between domains to be different, while others may not. Meanwhile, some of these factors may capture the intrinsic structure or discriminative information underlying the original data, while others may not. Our goal is to recover those common latent factors that do not cause much difference between data distributions and do preserve properties of the original data. Then the subspace spanned by these latent factors can be treated as a bridge to make knowledge transfer possible.

We illustrate our idea using a learning-based indoor localization problem as an example, where a client moving in a WiFi environment wishes to use the received signal strength (RSS)

values to locate itself. In an indoor building, RSS values are affected by many latent factors, such as temperature, properties of access points (APs), building structure, human movement, etc. Among these hidden factors, some of them may be relatively stable while others may not. For example, the temperature and the human movement may vary in time, resulting in changes in RSS values over time, while the building structure and the properties of APs are relatively stable. Thus, if we use the latter two factors to represent the RSS data, the distributions of the data collected in different time periods may be close to each other. In other words, this is the latent space in which we can ensure a transferring of a learned localization model from one time period to another.

Another example is to learn to do text classification across domains. If two text-classification domains have different distributions, but are related to each other (e.g., news articles and blogs), there may be some latent topics shared by these domains. If we use the stable latent topics to represent documents, the distance between the distributions of documents in related domains may be small. Then, in the latent space spanned by the latent topics, we can transfer the text-classification knowledge.

Motivated by the observations mentioned above, in this chapter, we propose a novel and general dimensionality reduction framework for transductive transfer learning. The key idea is to learn a low-dimensional latent feature space where the distributions of the source domain data and target domain data are close to each other and the data properties can be preserved as much as possible. We then project the data in both domains to this latent feature space, on which we subsequently apply standard learning algorithms to train classification or regression models from labeled source domain data to make predictions on the unlabeled target domain data.

3.2 Preliminaries

Before presenting our proposed dimensionality reduction framework in detail, we first introduce some preliminaries that are used in our proposed solutions based on the framework.

3.2.1 Dimensionality Reduction

Dimensionality reduction has been studied widely in the machine learning community. Traditional dimensionality reduction methods try to project the original data to a low-dimensional latent space while preserving some properties of the original data. van der Maaten et al. [188] gave a recent survey on various dimensionality reduction methods. A powerful dimensionality reduction technique is principal component analysis (PCA) and its kernelized version, kernel PCA (KPCA), which aims at extracting a low-dimensional representation of the data by maximizing the variance of the embedding in a kernel space [163]. However, KPCA is an unsupervised method. Since such methods cannot guarantee that the distributions of the different domain data are similar in the reduced latent space, they cannot directly be used to solve domain adaptation problems. Thus, we need to develop a new dimensionality reduction algorithm for domain adaptation.

It has been shown that in many real world applications, PCA or KPCA works pretty well as a preprocessing step for supervised target tasks [186, 162].

3.2.2 Hilbert Space Embedding of Distributions

Given samples X = {xi} and Z = {zi} drawn from two distributions, there exist many criteria (such as the Kullback-Leibler (KL) divergence) that can be used to estimate their distance. However, many of these estimators are parametric and require an intermediate density estimate. To avoid such a non-trivial task, a non-parametric distance estimate between distributions without density estimation is more desirable. Recently, such a non-parametric distance estimate was designed by embedding distributions in a RKHS [172].

3.2.3 Maximum Mean Discrepancy

Gretton et al. [69] introduced the Maximum Mean Discrepancy (MMD) for comparing distributions based on the corresponding RKHS distance. The estimate of the MMD between samples {x1, . . . , xn1} and {z1, . . . , zn2}, which follow distributions P and Q, respectively, is defined as follows:

    Dist(X, Z) = MMD(X, Z) = sup_{∥f∥_H ≤ 1} [ (1/n1) Σ_{i=1}^{n1} f(xi) − (1/n2) Σ_{i=1}^{n2} f(zi) ],    (3.1)

where ∥·∥_H is the RKHS norm. It can be shown that when the RKHS is universal [178], the MMD will asymptotically approach zero if and only if the two distributions are the same. Let the kernel-induced feature map be ϕ, where ϕ(x) : X → H. By the fact that in a RKHS, function evaluation can be written as f(x) = ⟨ϕ(x), f⟩, the empirical estimate of the MMD can be rewritten as follows:

    Dist(X, Z) = MMD(X, Z) = ∥ (1/n1) Σ_{i=1}^{n1} ϕ(xi) − (1/n2) Σ_{i=1}^{n2} ϕ(zi) ∥²_H.    (3.2)

Therefore, the distance between the two distributions is simply the distance between the two mean elements in the RKHS.

3.2.4 Dependence Measure: Hilbert-Schmidt Independence Criterion

Related to the MMD, the Hilbert-Schmidt Independence Criterion (HSIC) [70] is a simple yet powerful non-parametric criterion for measuring the dependence between two sets X and Y. As its name implies, it computes the Hilbert-Schmidt norm of a cross-covariance operator in the RKHS, which is defined as follows:

    HSIC(X, Y) = ∥Cxy∥²_H,  with  Cxy = (1/n) Σ_{i=1}^{n} [ (ϕ(xi) − (1/n) Σ_{k=1}^{n} ϕ(xk)) ⊗ (ψ(yi) − (1/n) Σ_{k=1}^{n} ψ(yk)) ],    (3.3)

where ⊗ denotes a tensor product, ϕ : X → H, ψ : Y → H, and H is a RKHS. Similar to the MMD, it can be shown that if the RKHS is universal, HSIC asymptotically approaches zero if and only if X and Y are independent [178]. Conversely, a large HSIC value suggests strong dependence. A (biased) empirical estimate can be easily obtained from the corresponding kernel matrices, as

    Dep(X, Y) = HSIC(X, Y) = (1/(n − 1)²) tr(HKHKyy),    (3.4)

where K, Kyy are the kernel matrices defined on X and Y, respectively, H = I − (1/n)11⊤ is the centering matrix, and n is the number of samples in X and Y.

Embedding via HSIC. In embedding or dimensionality reduction, it is often desirable to preserve the local data geometry while at the same time maximally aligning the embedding with available side information (such as labels). For example, in Colored Maximum Variance Unfolding (colored MVU) [174], the local geometry is captured in the form of local distance constraints on the target embedding K, while the alignment with the side information (represented as the kernel matrix Kyy) is measured by the HSIC criterion in (3.4). Mathematically, this leads to the following semi-definite program (SDP) [95]:

    max_{K ⪰ 0}  tr(HKHKyy)
    s.t.  Kii + Kjj − 2Kij = d²_ij, ∀(i, j) ∈ N;  K1 = 0,    (3.5)

where d_ij denotes the distance between xi and xj in the original feature space, and we write (i, j) ∈ N if xi and xj are k-nearest neighbors. Note that (3.5) reduces to Maximum Variance Unfolding (MVU) [198] when no side information is given (i.e., Kyy = I). In that case, MVU, which is an unsupervised dimensionality reduction method, aims at maximizing the variance in the high-dimensional feature space instead of the label dependence.
3.3 A Novel Dimensionality Reduction Framework

Recall that we are given a set of labeled source domain data DS = {(xS1, yS1), . . . , (xSnS, ySnS)}, where xSi ∈ X is the input and ySi ∈ Y is the corresponding output¹, and some unlabeled target domain data DT = {xT1, . . . , xTnT}, where the input xTi is also in X. Our task is then to predict the labels yTi's corresponding to the inputs xTi's in the target domain.

Let P(XS) and Q(XT) (or P and Q in short) be the marginal distributions of the input sets XS = {xSi} and XT = {xTi} from the source and target domains, respectively. In general, P and Q can be different. As mentioned in Chapter 2, a major assumption in most transductive transfer learning methods is that P ≠ Q, but P(YS|XS) = P(YT|XT). However, in many real-world applications, the conditional probability P(Y|X) may also change across domains due to noisy or dynamic factors underlying the observed data. In this chapter, we use a new, weaker assumption that P ≠ Q, but there exists a transformation ϕ such that P(ϕ(XS)) ≈ P(ϕ(XT)) and P(YS|ϕ(XS)) ≈ P(YT|ϕ(XT)). We believe that domain adaptation under this assumption is more realistic, though also much more challenging. In this thesis, we study domain adaptation under this assumption empirically, and apply it to solve two real-world applications successfully.

We propose a new framework to learn a transformation ϕ such that P(ϕ(XS)) ≈ P(ϕ(XT)), and ϕ(XS) and ϕ(XT) preserve some important properties of XS and XT. Standard supervised learning methods can then be applied on the mapped source domain data ϕ(XS), together with the corresponding labels YS, to train models for use on the mapped target domain data ϕ(XT). Here, a key issue is how to find this transformation ϕ. Since we have no labeled data in the target domain, ϕ cannot be learned by directly minimizing the distance between P(YS|ϕ(XS)) and P(YT|ϕ(XT)). Hence, we propose a new dimensionality reduction framework to learn ϕ such that (1) the distance between the marginal distributions P(ϕ(XS)) and P(ϕ(XT)) is small (Chapter 3.3.1), and (2) ϕ(XS) and ϕ(XT) preserve some important properties of XS and XT (Chapter 3.3.2).

3.3.1 Minimizing the Distance between P(ϕ(XS)) and P(ϕ(XT))

To reduce the distance between domains, a first objective in our dimensionality reduction framework is to discover a desired mapping ϕ by minimizing the distance between P(ϕ(XS)) and P(ϕ(XT)), denoted by Dist(P(ϕ(XS)), P(ϕ(XT))). We then assume that such a ϕ satisfies P(YS|ϕ(XS)) ≈ P(YT|ϕ(XT)).

¹ In general, DS may contain unlabeled data. Here, for simplicity, we assume that DS is fully labeled.
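As a toy numerical illustration of this first objective (our own sketch, not from the thesis), note that with a linear kernel the empirical MMD reduces to the squared distance between the projected domain means, so candidate 1-D maps ϕ can be compared directly:

```python
import numpy as np

def linear_mmd2(A, B):
    # With a linear kernel, the squared MMD is the squared distance
    # between the two sample means.
    gap = A.mean(axis=0) - B.mean(axis=0)
    return float(gap @ gap)

rng = np.random.RandomState(0)
Xs = rng.randn(100, 2)                              # source domain
Xt = rng.randn(100, 2) + np.array([3.0, 0.0])       # target domain, shifted along axis 0

phi_bad = np.array([1.0, 0.0])    # projects onto the shifted direction
phi_good = np.array([0.0, 1.0])   # projects onto the direction the domains share

print(linear_mmd2(Xs @ phi_bad[:, None], Xt @ phi_bad[:, None]))    # large domain gap
print(linear_mmd2(Xs @ phi_good[:, None], Xt @ phi_good[:, None]))  # near zero
```

A mapping minimizing Dist(P(ϕ(XS)), P(ϕ(XT))) would thus pick the second direction; the next subsection explains why this criterion alone is not enough.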

3.3.2 Preserving Properties of XS and XT

However, learning the transformation ϕ by only minimizing the distance between P(ϕ(XS)) and P(ϕ(XT)) may not be enough; ϕ should also preserve data properties that are useful for the target supervised learning task. An obvious choice is to maximally preserve the data variance, as is performed by the well-known PCA and KPCA (Chapter 3.2.1).

Figure 3.1(a) shows a simple two-dimensional example. Here, x1 is the discriminative direction that can separate the positive and negative samples, while x2 is a noisy dimension with small variance. By focusing only on minimizing the distance between P(ϕ(XS)) and P(ϕ(XT)), one would select the noisy component x2, which is completely irrelevant to the target supervised learning task. However, focusing only on the data variance is again not desirable in domain adaptation. An example of this opposite case is shown in Figure 3.1(b), where the direction with the largest variance (x1) cannot be used to reduce the distance between the distributions across domains and is not useful in boosting the performance for domain adaptation. Thus, an effective dimensionality reduction framework for transfer learning should satisfy that, in the reduced latent space, (1) the distance between the data distributions across domains is reduced, and (2) the data properties are preserved as much as possible.

[Figure 3.1: Motivating examples for the dimensionality reduction framework. (a) Only minimizing the distance between P(ϕ(XS)) and P(ϕ(XT)). (b) Only maximizing the data variance. For both examples, the source domain data are shown in red, and the target domain data are in blue.]

Then standard supervised techniques can be applied in the reduced latent space to train classification or regression models from the source domain labeled data for use in the target domain. The proposed framework is summarized in Algorithm 3.1.

Algorithm 3.1 A dimensionality reduction framework for transfer learning
Require: A labeled source domain data set DS = {(xSi, ySi)} and an unlabeled target domain data set DT = {xTi}.
Ensure: Predicted labels YT of the unlabeled data XT in the target domain.
1: Learn a transformation mapping ϕ, such that Dist(ϕ(XS), ϕ(XT)) is small, and ϕ(XS) and ϕ(XT) preserve properties of XS and XT.
2: Train a classification or regression model f on ϕ(XS) with the corresponding labels YS.
3: For unlabeled data xTi's in DT, map them to the latent space to get new representations ϕ(xTi)'s, and use the model f to make predictions f(ϕ(xTi))'s.
4: return ϕ and f(ϕ(xTi))'s.

As can be seen, the first step is the key step. One advantage of this framework is that most existing machine learning methods can be easily integrated into it: once the transformation mapping ϕ is learned in the first step, one just needs to apply existing machine learning and data mining methods to the mapped data ϕ(XS), with the corresponding labels YS, to train models for use in the target domain. Another advantage is that it works for diverse machine learning tasks, such as classification and regression problems.

3.4 Maximum Mean Discrepancy Embedding (MMDE)

As introduced in Chapter 3.2, on using the empirical measure of the MMD (3.2), the distance between the two distributions P(ϕ(XS)) and P(ϕ(XT)) can be empirically measured by the (squared) distance between the empirical means of the two domains:

    Dist(ϕ(XS), ϕ(XT)) = ∥ (1/nS) Σ_{i=1}^{nS} φ ∘ ϕ(xSi) − (1/nT) Σ_{i=1}^{nT} φ ∘ ϕ(xTi) ∥²_H,    (3.6)

for some φ ∈ H, which is the feature map induced by a universal kernel; here, we denote φ(ϕ(x)) by φ ∘ ϕ(x). A desired nonlinear mapping ϕ could then be found by minimizing the above quantity. However, φ is usually highly nonlinear, so a direct optimization of (3.6) with respect to ϕ may be intractable and can get stuck in poor local minima. Note that in practice, the corresponding kernel may not need to be universal [173].

3.4.1 Kernel Learning for the Transfer Latent Space

Instead of finding the transformation ϕ explicitly, we propose a kernel based dimensionality reduction method called Maximum Mean Discrepancy Embedding (MMDE)

[135] to construct ϕ implicitly. Here, we translate the problem of learning ϕ into the problem of learning φ ∘ ϕ. That is, our goal becomes finding the feature map φ ∘ ϕ of some kernel such that (3.6) is minimized, which makes the optimization problem tractable. Before describing the details of MMDE, we first introduce the following lemma.

Lemma 1. Let φ be the feature map induced by a kernel. Then φ ∘ ϕ is also the feature map induced by a kernel, for any arbitrary map ϕ.

Proof. Denote x' = ϕ(x). For any finite sample X = {x1, . . . , xn}, one can find its corresponding sample in the latent space, X' = {ϕ(x1), . . . , ϕ(xn)} = {x'1, . . . , x'n}. Since φ is the feature map induced by a kernel, we can write ⟨φ ∘ ϕ(xi), φ ∘ ϕ(xj)⟩ = ⟨φ(x'i), φ(x'j)⟩, and thus the matrix K with Kij = k(xi, xj) = ⟨φ ∘ ϕ(xi), φ ∘ ϕ(xj)⟩ is positive semi-definite for the sample X'. Based on Mercer's theory [164], k(·, ·) is a valid kernel function. Therefore, φ ∘ ϕ is also the feature map induced by a kernel for any arbitrary map ϕ.

By using the kernel trick, (3.6) can be written in terms of the kernel matrices defined by k, as:

    Dist(X'S, X'T) = (1/n²S) Σ_{i,j=1}^{nS} k(xSi, xSj) + (1/n²T) Σ_{i,j=1}^{nT} k(xTi, xTj) − (2/(nS nT)) Σ_{i=1}^{nS} Σ_{j=1}^{nT} k(xSi, xTj) = tr(KL),    (3.7)

where

    K = [ KS,S  KS,T ; KT,S  KT,T ] ∈ R^{(nS+nT)×(nS+nT)}    (3.8)

is a composite kernel matrix, with KS,S and KT,T being the kernel matrices defined by k on the data in the source and target domains, respectively, and L = [Lij] ⪰ 0 with Lij = 1/n²S if xi, xj ∈ XS; Lij = 1/n²T if xi, xj ∈ XT; and Lij = −1/(nS nT) otherwise.

In the transductive setting², we can learn the kernel matrix K, instead of the universal kernel k, by minimizing the distance (measured w.r.t. the MMD) between the projected source and target domain data.

² Note that, here, the word "transductive" is from the transductive learning [90] setting in traditional machine learning, which refers to the situation where all test data are required to be seen at training time; when some new test data arrive, they must be classified together with all existing data, and the learned model cannot be reused for future data. Recall that in transductive transfer learning, the term "transductive" is instead used to emphasize that, in this type of transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain.
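To make (3.7) concrete, the following sketch (our own illustration; the function name and the linear kernel are assumptions) builds the L matrix and checks that tr(KL) matches the kernel-trick expansion of the MMD, which for a linear kernel is just the squared distance between the domain means:

```python
import numpy as np

def mmd_matrix(n_s, n_t):
    # The L matrix of Eq. (3.7): 1/n_s^2 on source-source entries,
    # 1/n_t^2 on target-target entries, -1/(n_s n_t) elsewhere.
    L = np.full((n_s + n_t, n_s + n_t), -1.0 / (n_s * n_t))
    L[:n_s, :n_s] = 1.0 / n_s**2
    L[n_s:, n_s:] = 1.0 / n_t**2
    return L

rng = np.random.RandomState(0)
Xs, Xt = rng.randn(10, 3), rng.randn(15, 3) + 1.0
X = np.vstack([Xs, Xt])
K = X @ X.T                        # composite linear kernel matrix (Eq. 3.8)
L = mmd_matrix(10, 15)
mmd_via_trace = np.trace(K @ L)
mean_gap = Xs.mean(0) - Xt.mean(0)
print(np.isclose(mmd_via_trace, mean_gap @ mean_gap))  # True for a linear kernel
```

The identity tr(KL) = (squared) MMD is what lets MMDE optimize over K directly instead of over the mapping φ ∘ ϕ.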

However, as discussed in Chapter 3.3, for transductive transfer learning it may not be sufficient to learn the transformation by only minimizing the distance between the projected source and target domain data. Thus, in MMDE, besides minimizing the trace of KL, and similar to MVU as introduced in Chapter 3.2.4, we also have the following objectives/constraints to preserve the properties of the original data:

1. The trace of K is maximized. As mentioned in Chapter 3.2.4, maximizing the trace of K is equivalent to maximizing the variance of the embedded data, which aims to unfold the high-dimensional data.
2. The distances are preserved, i.e., Kii + Kjj − 2Kij = d²_ij for all i, j such that (i, j) ∈ N, as proposed in MVU and colored MVU. This distance preservation constraint can also make the kernel matrix learning more tractable.
3. The embedded data are centered, i.e., K1 = 0, which is a standard condition for the post-processing PCA applied on the resultant kernel matrix.

The optimization problem of MMDE can then be written as:

    min_{K ⪰ 0}  tr(KL) − λ tr(K)
    s.t.  Kii + Kjj − 2Kij = d²_ij, ∀(i, j) ∈ N;  K1 = 0,    (3.9)

where the first term in the objective minimizes the distance between distributions, while the second term maximizes the variance in the feature space; 0 and 1 are vectors of zeros and ones, respectively, and λ ≥ 0 is a trade-off parameter. Computationally, this leads to a SDP involving K [95]. After learning the kernel matrix K in (3.9), we can apply PCA on the resultant kernel matrix, selecting the leading eigenvectors to reconstruct the desired mapping φ ∘ ϕ implicitly and thus obtain the low-dimensional representations X'S and X'T across domains.

Note that the optimization problem (3.9) is similar to that of colored MVU as introduced in Chapter 3.2.4. However, there are two major differences between MMDE and colored MVU. First, the L matrix in colored MVU is a kernel matrix that encodes label information of the data, while the L in MMDE can be treated as a kernel matrix that encodes distribution information of the different data sets. Second, colored MVU maximizes the dependence between the embedding and the side information, while MMDE minimizes the distance between the two domain distributions (together with maximizing the variance term tr(K)). In Chapter 3.7, we will discuss the relationship between these methods in detail.

3.4.2 Making Predictions in the Latent Space

After obtaining the new representations X'S and X'T, we can train a classification or regression model f from X'S with the corresponding labels YS. This model can then be used to make predictions on X'T, as yTi = f(x'Ti). However, since we do not learn an explicit mapping to project the original data XS and XT to the embeddings X'S and X'T, for out-of-sample data in the target domain we need to apply other techniques to make predictions. Here, we use the method of harmonic functions [234] to estimate the labels of new test data in the target domain:

    fi = ( Σ_{j∈N} wij fj ) / ( Σ_{j∈N} wij ),    (3.10)

where N is the set of k nearest neighbors of xTi in XT, wij is the similarity between xTi and xTj, which can be obtained based on the Euclidean distance or other similarity measures, and the fj's are the predicted labels of the xTj's.

The MMDE algorithm for transfer learning is summarized in Algorithm 3.2.

Algorithm 3.2 Transfer learning via Maximum Mean Discrepancy Embedding (MMDE)
Require: A labeled source domain data set DS = {(xSi, ySi)}, an unlabeled target domain data set DT = {xTi} and λ > 0.
Ensure: Predicted labels YT of the unlabeled data XT in the target domain.
1: Solve the SDP problem in (3.9) to obtain a kernel matrix K.
2: Apply PCA to the learned K to get new representations {x'Si} and {x'Ti} of the original data {xSi} and {xTi}, respectively.
3: Learn a classifier or regressor f : x'Si → ySi.
4: For unlabeled data xTi's in DT, use f to make predictions.
5: For new test data xT's in the target domain, use harmonic functions with {xTi, f(x'Ti)} to make predictions.
6: return Predicted labels YT.

As can be seen, the key step of MMDE is to apply a SDP solver to the optimization problem in (3.9). As there are O((nS + nT)²) variables in K, the overall time complexity is O((nS + nT)^6.5) in general [129]. In the second step, standard PCA is applied on the learned kernel matrix to construct the latent representations X'S and X'T of XS and XT, respectively. A model is then trained on X'S with the corresponding labels YS, and the learned classifier or regressor is used to predict the labels of DT. Finally, for out-of-sample test data in the target domain, the method of harmonic functions introduced in (3.10) is used to make predictions.

3.4.3 Summary

In MMDE, we translate the problem of learning the transformation mapping ϕ in the dimensionality reduction framework into a kernel matrix learning problem; this translation makes the problem of learning ϕ tractable. We then apply the standard PCA method on the resultant kernel matrix to reconstruct low-dimensional representations for the different domain data.
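The out-of-sample rule (3.10) used in step 5 of Algorithm 3.2 can be sketched in a few lines; this is our own illustration (the RBF similarity and the function name are assumptions, not thesis code):

```python
import numpy as np

def harmonic_predict(x_new, X_t, f_t, k=5, gamma=1.0):
    # Eq. (3.10): f_i = sum_j w_ij f_j / sum_j w_ij over the k nearest
    # neighbours of x_new among the target points X_t with predicted labels f_t.
    d2 = ((X_t - x_new) ** 2).sum(axis=1)    # squared Euclidean distances
    nn = np.argsort(d2)[:k]                  # indices of the k nearest neighbours
    w = np.exp(-gamma * d2[nn])              # RBF similarities as the weights w_ij
    return float(w @ f_t[nn] / w.sum())

# Two target clusters whose (predicted) labels are 0 and 1.
X_t = np.vstack([np.zeros((5, 2)), np.full((5, 2), 5.0)])
f_t = np.array([0.0] * 5 + [1.0] * 5)
print(harmonic_predict(np.array([0.2, 0.1]), X_t, f_t))  # close to 0
print(harmonic_predict(np.array([4.8, 5.1]), X_t, f_t))  # close to 1
```

New test points thus inherit a weighted average of the labels predicted for the nearby embedded target data, without re-solving the SDP.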

However, there are several limitations of MMDE. Firstly, the criterion (3.9) requires K to be positive semi-definite, and the resultant kernel learning problem has to be solved by expensive SDP solvers; the overall time complexity is O((nS + nT)^6.5) in general, which becomes computationally prohibitive even for small-sized problems. Secondly, in order to construct the low-dimensional representations X'S and X'T, the obtained K has to be further post-processed by PCA, which may discard potentially useful information in K. Finally, MMDE cannot generalize to unseen data: to make predictions on unseen target domain unlabeled data, other techniques, such as the method of harmonic functions, need to be used. Besides, since MMDE learns the kernel matrix from the data automatically, it can fit the observed data more perfectly, but for any unseen target domain data we would need to re-apply the MMDE method on the source domain data together with the new target domain data to reconstruct their representations.

3.5 Transfer Component Analysis (TCA)

In order to overcome the limitations of MMDE described in the previous section, in this section we propose an efficient framework to find the nonlinear mapping φ ∘ ϕ based on empirical kernel feature extraction. In particular, instead of using a two-step approach as in MMDE, we propose a unified kernel learning method which utilizes an explicit low-rank representation. It avoids the use of SDP, and thus its high computational burden. Moreover, the learned kernel can be generalized to out-of-sample data directly.

3.5.1 Parametric Kernel Map for Unseen Data

First, note that the kernel matrix K in (3.8) can be decomposed as K = (KK^{−1/2})(K^{−1/2}K), which is often known as the empirical kernel map [163]. (As is common practice, one can ensure that K is positive definite, so that K^{−1/2} exists, by adding a small ϵ > 0 to its diagonal [135].) Consider the use of a (nS + nT) × m matrix W̃ that transforms the empirical kernel map features to an m-dimensional space (where m ≪ nS + nT). The resultant kernel matrix is then

    K̃ = (KK^{−1/2}W̃)(W̃⊤K^{−1/2}K) = KWW⊤K,    (3.11)

where W = K^{−1/2}W̃ ∈ R^{(nS+nT)×m}. In particular, the corresponding kernel evaluation between any two patterns xi and xj is given by

    k̃(xi, xj) = k⊤_{xi} WW⊤ k_{xj},    (3.12)

where k_x = [k(x1, x), . . . , k(xnS+nT, x)]⊤ ∈ R^{nS+nT}. Hence, this kernel k̃ readily facilitates a parametric form for out-of-sample kernel evaluations. Moreover, on using the definition of K̃ in (3.11), the MMD distance (3.7) between the empirical means of the two domains X'S and X'T can be rewritten as:

    Dist(X'S, X'T) = tr((KWW⊤K)L) = tr(W⊤KLKW).    (3.13)

3.5.2 Unsupervised Transfer Component Extraction

Besides reducing the distance between the two marginal distributions via (3.13), we also need to preserve the data properties, such as the data variance, as is performed by the well-known PCA and KPCA (Chapter 3.2.1). Combining the parametric kernel representations for the distance between distributions and for the data variance, we develop a new dimensionality reduction method such that, in the latent space spanned by the learned components, the variance of the data can be preserved as much as possible while the distance between the different distributions across domains is reduced. Note from (3.11) that the embedding of the data in the latent space is W⊤K, where the i-th column [W⊤K]_i provides the embedding coordinates of xi. Hence, the variance of the projected samples is W⊤KHKW, where H = I − (1/(nS + nT)) 11⊤ is the centering matrix, 1 ∈ R^{nS+nT} is the column vector of all ones, and I ∈ R^{(nS+nT)×(nS+nT)} is the identity matrix.

In minimizing the objective (3.13), a regularization term tr(W⊤W) is usually needed to control the complexity of W. As will be shown later in this section, this regularization term can also avoid the rank deficiency of the denominator in the generalized eigenvalue decomposition. The kernel learning problem then becomes:

    min_W  tr(W⊤KLKW) + µ tr(W⊤W)
    s.t.  W⊤KHKW = I_m,    (3.14)

where µ > 0 is a trade-off parameter and I_m ∈ R^{m×m} is the identity matrix (for notational simplicity, we drop the subscript m from I_m in the sequel). Though this optimization problem involves a non-convex norm constraint W⊤KHKW = I, it can still be solved efficiently via the following trace optimization problem:

Proposition 1. The optimization problem (3.14) can be re-formulated as

    min_W  tr((W⊤KHKW)† W⊤(KLK + µI)W),    (3.15)

or

    max_W  tr((W⊤(KLK + µI)W)^{−1} W⊤KHKW).    (3.16)

Proof. The Lagrangian of (3.14) is

    tr(W⊤(KLK + µI)W) − tr((W⊤KHKW − I)Z),    (3.17)

where Z is a diagonal matrix containing the Lagrange multipliers. Setting the derivative of (3.17) w.r.t. W to zero, we have

    (KLK + µI)W = KHKW Z.    (3.18)

Multiplying both sides of (3.18) on the left by W⊤ and then substituting the result into (3.17), we obtain the equivalent trace maximization problem (3.16). Since the matrix KLK + µI is non-singular, the solution W of (3.16) is given by the m ≤ nS + nT − 1 leading eigenvectors of (KLK + µI)^{−1}KHK, similar to kernel Fisher discriminant analysis (KFD) [126, 216].

In the sequel, this method will be referred to as Transfer Component Analysis (TCA), and the extracted components are called the transfer components. The TCA algorithm for transfer learning is summarized in Algorithm 3.3.

Algorithm 3.3 Transfer learning via Transfer Component Analysis (TCA)
Require: A source domain data set DS = {(xSi, ySi)}_{i=1}^{nS} and a target domain data set DT = {xTj}_{j=1}^{nT}.
Ensure: Transformation matrix W and predicted labels YT of the unlabeled data XT in the target domain.
1: Construct the kernel matrix K from {xSi}_{i=1}^{nS} and {xTj}_{j=1}^{nT} based on (3.8), the matrix L from (3.7), and the centering matrix H.
2: Compute the matrix (KLK + µI)^{−1}KHK.
3: Do an eigen-decomposition and select the m leading eigenvectors to construct the transformation matrix W.
4: Map the data xSi's and xTj's to x'Si's and x'Tj's via X'S = [KS,S KS,T]W and X'T = [KT,S KT,T]W, respectively.
5: Train a model f on the x'Si's with the ySi's.
6: For new test data xT from the target domain, map them to the latent space via x' = κW, where κ is a row vector with κt = k(xT, xt), t = 1, . . . , nS + nT, and make predictions f(x')'s.
7: return Transformation matrix W and f(x')'s.

As can be seen, the key steps of TCA are the second and third ones: to do an eigen-decomposition on the matrix (KLK + µI)^{−1}KHK and find its m leading eigenvectors to construct the transformation W. In general, this takes only O(m(nS + nT)²) time when m nonzero eigenvectors are to be extracted [175], which is much more efficient than MMDE. Moreover, once W is learned, TCA can be generalized to out-of-sample data directly: for new test data xT from the target domain, we can use W to map them to the latent space without re-training.
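Steps 1-4 of Algorithm 3.3 can be sketched in a few lines of NumPy. The following is our own illustrative implementation under simplifying assumptions (a linear kernel, small dense matrices, plain eigen-decomposition), not the thesis code:

```python
import numpy as np

def tca_fit(Xs, Xt, m=1, mu=1.0):
    # Steps 1-3 of Algorithm 3.3: build K (Eq. 3.8), L (Eq. 3.7) and H,
    # then take the m leading eigenvectors of (KLK + mu*I)^{-1} KHK.
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    K = X @ X.T                                    # linear kernel matrix
    L = np.full((n, n), -1.0 / (ns * nt))
    L[:ns, :ns] = 1.0 / ns**2
    L[ns:, ns:] = 1.0 / nt**2
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    lead = np.argsort(-vals.real)[:m]              # m leading eigenvectors
    W = vecs[:, lead].real
    return K, W

rng = np.random.RandomState(0)
Xs = rng.randn(30, 2)                              # source domain
Xt = rng.randn(30, 2) + np.array([5.0, 0.0])       # target domain, shifted along x1
K, W = tca_fit(Xs, Xt, m=1)
Z = K @ W                                          # step 4: latent representations
```

In this toy setting the domains differ only along x1, so the leading transfer component essentially projects onto x2, where the two domains agree: the gap between the embedded domain means, relative to the spread, shrinks compared with the raw x1 coordinate. Out-of-sample points would be mapped as in step 6, via x' = κW with κt = k(xT, xt).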

3.5.3 Experiments on Synthetic Data

As described in the previous sections, TCA has two main objectives: minimizing the distance between domains and maximizing the data variance in the latent space. In this section, we perform experiments on synthetic data to demonstrate the effectiveness of these two objectives in learning a 1D latent space from 2D data. For TCA, we use the linear kernel on the inputs and fix µ = 1. We apply the one-nearest-neighbor (1-NN) classifier to make predictions on the target domain data in the original 2D space and in the latent spaces learned by the compared methods, respectively. To test the effectiveness of TCA more fully, we will conduct more detailed experiments, comparing with other existing methods, on two real-world datasets in Chapter 4 and Chapter 5.

Only Minimizing the Distance between Distributions. As discussed in Chapter 3.3.2, it is not desirable to learn the transformation ϕ by only minimizing the distance between the marginal distributions P(ϕ(XS)) and P(ϕ(XT)). Here, we illustrate this using the synthetic data from the example in Figure 3.1(a), which is also reproduced in Figure 3.2(a). We compare TCA with the method of stationary subspace analysis (SSA) as introduced in Chapter 2, which is an empirical method to find an identical stationary latent space of the source and target domain data. As can be seen from Figures 3.2(a) and 3.2(b), the distance between the distributions of the different domain data in the one-dimensional space learned by SSA is small. However, the positive and negative samples are overlapped in that latent space, which is not useful for making predictions on the mapped target domain data. On the other hand, as can be seen from Figure 3.2(c), though the distance between the distributions of the different domain data in the latent space learned by TCA is larger than that learned by SSA, the two classes are now more separated. As a result, TCA leads to significantly better accuracy than SSA.

Only Maximizing the Data Variance. As discussed in Chapter 3.3.2, learning the transformation ϕ by only maximizing the data variance may not be useful for domain adaptation. Here, we reproduce Figure 3.1(b) in Figure 3.3(a). As can be seen from Figures 3.3(b) and 3.3(c), the variance of the mapped data in the 1D space learned by PCA is very large. However, the distance between the mapped data across the different domains is still large and the positive and negative samples are overlapped in the latent space, which is not useful for domain adaptation. In contrast, though the variance of the mapped data in the 1D space learned by TCA is smaller than that learned by PCA, the distance between the different domain data in the latent space is reduced and the positive and negative samples are more separated.

[Figure 3.2: Illustrations of the proposed TCA and SSTCA on synthetic dataset 1. (a) Data set 1 (acc: 82%); (b) 1D projection by SSA (acc: 60%); (c) 1D projection by TCA (acc: 86%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.]

3.5.4 Summary

In TCA, we propose to learn a low-rank parametric kernel for transfer learning instead of the entire kernel matrix. The parametric kernel is a composite kernel that consists of an empirical kernel and a linear transformation. As can be seen from the TCA algorithm, TCA only requires a simple and efficient eigenvalue decomposition, which takes only O(m(nS + nT)²) time when m nonzero eigenvectors are to be extracted. Compared to MMDE, TCA has two advantages: (1) it is much more efficient; and (2) it can be generalized to out-of-sample target domain data naturally. Once the transformation W is learned, data from the source and target domains, including unseen data, can be mapped to the latent space directly. However, the parameter values of the empirical kernel need to be tuned by human experience or by using the cross-validation method, which may be sensitive to different application areas.

[Figure 3.3: Illustrations of the proposed TCA and SSTCA on synthetic dataset 2. (a) Data set 2 (acc: 50%); (b) 1D projection by PCA (acc: 48%); (c) 1D projection by TCA (acc: 82%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.]

3.6 Semi-Supervised Transfer Component Analysis (SSTCA)

As Ben-David et al. [16], Blitzer et al. [24] and Mansour et al. [121] mentioned in their works, a good representation should (1) reduce the distance between the distributions of the source and target domain data, and (2) minimize the empirical error on the labeled data in the source domain⁴. However, the unsupervised TCA proposed in Chapter 3.5 does not consider the label information in learning the components. As shown in Figure 3.4, it may happen that the direction with the largest variance (x1) is orthogonal to the discriminative direction (x2). As a result, the transfer components learned by TCA may not be useful for the target classification task. A solution to overcome this is to encode the source domain label information into the embedding learning, such that the learned components are discriminative with respect to the labels in the source domain and can also be used to reduce the distance between the source and target domains.

⁴ Recall that in our setting, there is no labeled data in the target domain.

[Figure 3.4: The direction with the largest variance is orthogonal to the discriminative direction.]

In this section, we extend the unsupervised TCA in Chapter 3.5 to the semi-supervised learning setting. Motivated by the kernel target alignment [43], a representation that maximizes its dependence with the data labels may lead to better generalization performance. Hence, we can maximize the label dependence instead of minimizing the empirical error (Chapter 3.6.1). Moreover, in many real-world applications (such as WiFi localization), there exists an intrinsic low-dimensional manifold underlying the high-dimensional observations. The effective use of manifold information is an important component in many semi-supervised learning algorithms [34, 233]. Note that in traditional semi-supervised learning settings [233, 34], the labeled and unlabeled data are from the same domain, whereas in the context of domain adaptation here, the labeled and unlabeled data are from different domains. Hence, we encode the manifold structure into the embedding learning so as to propagate label information from the labeled (source domain) data to the unlabeled (target domain) data (Chapter 3.6.1).

[Figure 3.5: There exists an intrinsic manifold structure underlying the observed data.]

3.6.1 Optimization Objectives

In this section, we delineate three desirable properties for this semi-supervised embedding, namely, (1) maximal alignment of distributions between the source and target domain data in the embedded space, (2) high dependence on the label information, and (3) preservation of the local geometry.

Objective 1: Distribution Matching

As in the unsupervised TCA, our first objective is to minimize the MMD (3.13) between the source and target domain data in the embedded space.

Objective 2: Label Dependence

Our second objective is to maximize the dependence (measured w.r.t. HSIC) between the embedding and the labels. Recall that while the source domain data are fully labeled, the target domain data are unlabeled. We propose to maximally align the embedding (which is represented by K in (3.11)) with

K̃yy = γKl + (1 − γ)Kv,   (3.19)

where γ ≥ 0,

[Kl]ij = kyy(yi, yj) if i, j ≤ nS, and 0 otherwise,   (3.20)

and Kv = I.   (3.21)

Note that γ is a tradeoff parameter that balances the label dependence and data variance terms. Intuitively, (3.20) serves to maximize the label dependence on the labeled data, while (3.21) serves to maximize the variance on both the source and target domain data, which is in line with MVU [198]. By substituting K (3.11) and K̃yy (3.19) into HSIC (3.4), our objective is thus to maximize

tr(H(KWW⊤K)H K̃yy) = tr(W⊤KH K̃yy HKW).   (3.22)

Intuitively, when there are only a few labeled data in the source domain and a lot of unlabeled data in the target domain, we may use a small γ. Otherwise, if there are sufficient labeled data in the source domain, the dependence between features and labels can be estimated more precisely via HSIC, and a large γ may be used. Empirically, simply setting γ = 0.5 works well on all the data sets. The sensitivity of the performance to γ will be studied in more detail in Chapters 4 and 5.
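The construction of K̃yy from (3.19)-(3.21) with a linear kernel on the labels can be sketched as follows (the helper name and the toy labels are ours, not from the thesis):

```python
import numpy as np

def build_kyy_tilde(y_src, n_total, gamma=0.5):
    """K~_yy = gamma*K_l + (1-gamma)*K_v, as in (3.19)-(3.21).

    y_src:   labels of the n_S source points (the first block of the data);
    n_total: n_S + n_T.  K_l applies a linear kernel to the labels over the
    labeled block and is zero elsewhere; K_v = I keeps the variance term."""
    n_s = len(y_src)
    K_l = np.zeros((n_total, n_total))
    K_l[:n_s, :n_s] = np.outer(y_src, y_src)   # linear kernel on labels
    K_v = np.eye(n_total)
    return gamma * K_l + (1.0 - gamma) * K_v

# Toy usage: 3 labeled source points, 2 unlabeled target points.
y_src = np.array([1.0, -1.0, 1.0])
Kyy = build_kyy_tilde(y_src, n_total=5, gamma=0.5)
```

With γ = 0 this reduces to the identity (pure variance term), recovering the unsupervised objective; with γ = 1 only the source-label block remains.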

Objective 3: Locality Preserving

As reviewed in the previous chapters, Colored MVU and MMDE preserve the local geometry of the manifold by enforcing distance constraints on the desired kernel matrix K: for each (xi, xj) in N, a constraint Kii + Kjj − 2Kij = d²ij will be added to the optimization problem. Hence, the resultant SDP will typically have a very large number of constraints. To avoid this problem, we make use of the locality preserving property of the Laplacian Eigenmap [13]. More specifically, let N = {(xi, xj)} be the set of sample pairs that are k-nearest neighbors of each other, and dij = ∥xi − xj∥ be the distance between xi and xj in the original input space. First, we construct a graph with affinity mij = exp(−d²ij / 2σ²) if xi is one of the k nearest neighbors of xj, or vice versa. Let M = [mij]. The graph Laplacian matrix is 𝓛 = D − M, where D is the diagonal matrix with entries dii = Σⁿⱼ₌₁ mij. Note from (3.11) that the embedding of the data in Rᵐ is W⊤K, where the ith column [W⊤K]i provides the embedding coordinates of xi. Intuitively, if xi and xj are neighbors in the input space, the distance between their embedding coordinates should be small. Hence, our third objective is to minimize

Σ_{(i,j)∈N} mij ∥[W⊤K]i − [W⊤K]j∥² = tr(W⊤K𝓛KW).   (3.23)

3.6.2 Formulation and Optimization Procedure

In this section, we present how to combine the three objectives to find a W that maximizes (3.22), while simultaneously minimizing (3.13) and (3.23). The final optimization problem can be written as

min_W  tr(W⊤KLKW) + µ tr(W⊤W) + (λ/n²) tr(W⊤K𝓛KW)   (3.24)
s.t.   W⊤KH K̃yy HKW = I,

where L is the MMD matrix in (3.13), λ ≥ 0 is another tradeoff parameter, and n² = (nS + nT)² is a normalization term. For simplicity, we use λ to denote λ/n² in the rest of this thesis. Similar to the unsupervised TCA, (3.24) can be formulated as the following quotient trace problem:

max_W  tr{(W⊤K(L + λ𝓛)KW + µI)⁻¹ (W⊤KH K̃yy HKW)}.   (3.25)

It is well-known that (3.25) can be solved by eigendecomposing (K(L + λ𝓛)K + µI)⁻¹ KH K̃yy HK. In the sequel, this will be referred to as Semi-Supervised Transfer Component Analysis (SSTCA). The procedure for both the unsupervised and semi-supervised TCA is summarized in Algorithm 3.4.
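The eigendecomposition solution of (3.25) can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming (with γ = 0 and a zero graph Laplacian it degenerates to unsupervised TCA); it is not the thesis code:

```python
import numpy as np

def sstca_components(K, L_mmd, L_graph, Kyy_tilde, H, m, lam=1.0, mu=1.0):
    """Return W as the m leading eigenvectors of
    (K (L_mmd + lam*L_graph) K + mu*I)^{-1} K H Kyy_tilde H K, cf. (3.25)."""
    n = K.shape[0]
    A = K @ (L_mmd + lam * L_graph) @ K + mu * np.eye(n)
    B = K @ H @ Kyy_tilde @ H @ K
    vals, vecs = np.linalg.eig(np.linalg.solve(A, B))   # solve A^{-1} B w = rho w
    order = np.argsort(-vals.real)
    return vecs[:, order[:m]].real                      # columns of W

# Toy data: 3 source and 3 target points in 4 dimensions, linear kernel.
rng = np.random.default_rng(1)
nS = nT = 3
n = nS + nT
X = rng.normal(size=(n, 4))
K = X @ X.T
# MMD matrix for two equal-sized blocks: 1/nS^2, 1/nT^2 within blocks,
# -1/(nS*nT) across blocks.
L = np.block([[np.full((nS, nS), 1 / nS**2), np.full((nS, nT), -1 / (nS * nT))],
              [np.full((nT, nS), -1 / (nS * nT)), np.full((nT, nT), 1 / nT**2)]])
H = np.eye(n) - np.ones((n, n)) / n                     # centering matrix
W = sstca_components(K, L, np.zeros((n, n)), np.eye(n), H, m=2)
emb = W.T @ K                                           # m x n embedding, cf. (3.11)
```

Note that the matrix being decomposed is not symmetric, so `np.linalg.eig` is used and the (numerically tiny) imaginary parts are discarded; a production implementation might instead solve the equivalent symmetric generalized eigenproblem.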

Algorithm 3.4 Transfer learning via Semi-Supervised Transfer Component Analysis (SSTCA)

Require: Source domain data set DS = {(xSi, ySi)}, i = 1, ..., nS, and target domain data set DT = {xTj}, j = 1, ..., nT.
Ensure: Transformation matrix W and predicted labels YT of the unlabeled data XT in the target domain.
1: Construct the kernel matrix K from the xSi's and xTj's based on (3.11), the matrix L from (3.13), and the centering matrix H.
2: Compute the matrix (K(L + λ𝓛)K + µI)⁻¹ KH K̃yy HK.
3: Do eigen-decomposition and select the m leading eigenvectors to construct the transformation matrix W.
4: Map the data xSi's and xTj's to x′Si's and x′Tj's via X′S = [KS,S KS,T]W and X′T = [KT,S KT,T]W, respectively.
5: Train a model f on the x′Si's with the ySi's.
6: For new test data xT from the target domain, x′ = κW, where κ is a row vector with κt = k(xT, xt), t = 1, ..., nS + nT.
7: return the transformation matrix W and the predictions f(x′)'s.

As can be seen in the algorithm, it is very similar to that of TCA. The only difference between these algorithms is that in SSTCA, the matrix that needs to be eigen-decomposed is (K(L + λ𝓛)K + µI)⁻¹ KH K̃yy HK instead of (KLK + µI)⁻¹ KHK. Consequently, the overall time complexity is still O(m(nS + nT)²). In addition, because the transformation W is learned explicitly, SSTCA can also easily be generalized to out-of-sample data.

3.6.3 Experiments on Synthetic Data

As described in previous sections, in SSTCA we aim to optimize three objectives simultaneously: minimizing the distance between domains, maximizing the source domain label dependence in the latent space, and preserving the local geometric structure in the latent space. In this section, we perform experiments on synthetic data to demonstrate the effectiveness of these three objectives of SSTCA in learning a 1D latent space from 2D data. For SSTCA, we use the linear kernel on both inputs and outputs, and fix µ = 1 and γ = 0.5. We will fully test the effectiveness of SSTCA in two real-world applications in Chapter 4 and Chapter 5.

Label Information

In this experiment, we demonstrate the advantage of using label information in the source domain data to improve classification performance (Figures 3.6 and 3.7). Since the focus here is not on locality preserving, we set the λ in SSTCA to zero. As can be seen from Figure 3.6(b), the positive and negative samples overlap significantly in the latent space learned by TCA. On the other hand, with the use of label information, the positive and negative samples are more separated in
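Step 6 of Algorithm 3.4, the out-of-sample mapping x′ = κW, is possible precisely because W is learned explicitly. A minimal sketch (function names and the toy data are ours; the Laplace kernel with a hypothetical width σ is assumed, as used later for the WiFi data):

```python
import numpy as np

def laplace_kernel_row(x_new, X_train, sigma=1.0):
    """kappa_t = k(x_new, x_t) for all nS + nT training points,
    using the Laplace kernel exp(-||x - x'|| / sigma)."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    return np.exp(-d / sigma)

def embed_out_of_sample(x_new, X_train, W, sigma=1.0):
    """Out-of-sample embedding x' = kappa W (step 6 of Algorithm 3.4)."""
    kappa = laplace_kernel_row(x_new, X_train, sigma)
    return kappa @ W

# Toy usage: 5 training points in 3D, a learned 5x2 transformation W.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))   # the nS + nT points used to build K
W = rng.normal(size=(5, 2))         # transformation learned offline (m = 2)
x_new = rng.normal(size=3)
x_embedded = embed_out_of_sample(x_new, X_train, W)
```

This is the property MMDE lacks: MMDE learns the kernel matrix itself, so a new point would require re-solving the SDP, whereas here only one kernel row and one matrix-vector product are needed.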

[Figure 3.6: Illustrations of the proposed TCA and SSTCA on synthetic dataset 3. (a) Data set 3 (acc: 69%); (b) 1D projection by TCA (acc: 56%); (c) 1D projection by SSTCA (acc: 79%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.]

However, in some applications, it may be possible that the discriminative direction of the source domain data is quite different from that of the target domain data. An example is shown in Figure 3.7(a). As can be seen from Figures 3.7(b) and 3.7(c), the positive and negative samples in the target domain are more separated in the latent space learned by TCA than in that learned by SSTCA. In this case, encoding label information from the source domain (as SSTCA does) may not help, or may even hurt, the classification performance as compared to the unsupervised TCA. In summary, compared to non-adaptive methods, both SSTCA and TCA can obtain better performance. When the discriminative directions across different domains are similar, SSTCA can outperform TCA by encoding label information into the embedding learning, and thus classification also becomes easier. Nevertheless, when the discriminative directions across different domains are different, SSTCA may not improve the performance, or may even perform worse than TCA.

[Figure 3.7: Illustrations of the proposed TCA and SSTCA on synthetic dataset 4. (a) Data set 4 (acc: 60%); (b) 1D projection by TCA (acc: 90%); (c) 1D projection by SSTCA (acc: 68%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.]

Manifold Information

In this experiment, we demonstrate the advantage of using manifold information to improve classification performance. Both the source and target domain data have the well-known two-moon manifold structure [232] (Figure 3.8(a)). SSTCA is used with and without Laplacian smoothing (by setting λ in (3.24) to 1000 and 0, respectively). As can be seen from Figures 3.8(b) and 3.8(c), Laplacian smoothing can indeed help improve classification performance when a manifold structure underlies the observed data.

3.6.4 Summary

In SSTCA, we extend the unsupervised TCA in a semi-supervised manner by maximizing the source domain label dependence in embedding learning. Similar to TCA, the time complexity

of SSTCA is O(m(nS + nT)²). More complete comparisons between MMDE, TCA and SSTCA will be conducted in Chapters 4-5.

[Figure 3.8: Illustrations of the proposed TCA and SSTCA on synthetic dataset 5. (a) Data set 5 (acc: 70%); (b) 1D projection by SSTCA without Laplacian smoothing (acc: 83%); (c) 1D projection by SSTCA with Laplacian smoothing (acc: 91%). Accuracy of the 1-NN classifier in the original input space / latent space is shown inside brackets.]

Experiments on synthetic data show that when the discriminative directions of the source and target domains are the same or close to each other, SSTCA can boost the classification accuracy compared to TCA. However, when the discriminative directions of the source and target domains are different, in some cases SSTCA does not work well. Furthermore, when the source and target domain data have an intrinsic manifold structure, SSTCA is more effective.

3.7 Further Discussion

As mentioned in Chapter 3.2, embedding approaches, such as MVU, Colored MVU, MMDE and the proposed TCA and SSTCA, are all based on Hilbert space embedding of distributions

via MMD and HSIC. MVU and Colored MVU learn a nonparametric kernel matrix by maximizing the data variance or the label dependence for data analysis in a single domain, while MMDE learns a nonparametric kernel matrix by minimizing the domain distance for cross-domain learning. Note that criterion (3.7) in the kernel learning problem of MMDE is similar to the recently proposed supervised dimensionality reduction method Colored MVU [174]. In contrast, the proposed TCA and SSTCA learn parametric kernel matrices by simultaneously minimizing the distribution distance and maximizing the data variance and label dependence. Essentially, TCA is a generalized version of MMDE that balances prediction performance with computational complexity in cross-domain learning, reducing the time complexity from O(m(nS + nT)^6.5) to O(m(nS + nT)²). Note that SSTCA reduces to TCA when γ = 0 and λ = 0. If we further drop Objective 1 described in Chapter 3.6, TCA reduces to MVU⁵, while SSTCA can be reduced to the semi-supervised Colored MVU when we drop the distribution matching term and set γ = 1 and λ > 0. MVU, Colored MVU and MMDE are all transductive and computationally expensive, while TCA can be generalized to out-of-sample patterns even in this restricted setting. Furthermore, MMDE and TCA are unsupervised and do not utilize side information, while SSTCA is semi-supervised and exploits side information in embedding learning. We summarize the relationships among these approaches in Table 3.1.

⁵ Note that MVU is a transductive dimensionality reduction method, in which a low-rank approximation is used to reduce the number of constraints and variables in the SDP. However, gradient descent is required to refine the embedding space, and thus the solution can still get stuck in a local minimum.

Table 3.1: Summary of dimensionality reduction methods based on Hilbert space embedding of distributions.

method       | setting         | out-of-sample | label | distribution matching | geometry | variance | kernel
MVU          | unsupervised    |               |       |                       |    √     |    √     | nonparametric
Colored MVU  | supervised      |               |   √   |                       |    √     |          | nonparametric
MMDE         | unsupervised    |               |       |           √           |    √     |    √     | nonparametric
TCA          | unsupervised    |       √       |       |           √           |          |    √     | parametric
SSTCA        | semi-supervised |       √       |   √   |           √           |    √     |    √     | parametric

Last but not least, as mentioned in Chapter 2, besides the embedding approaches, instance re-weighting methods, such as KMM, KLIEP, uLSIF, etc., have also been proposed to solve transductive transfer learning problems by matching data distributions. The main difference between these methods and our proposed methods is that we aim to match data distributions between domains in a latent space, where data properties can be preserved, instead of matching them in the original feature space. This is beneficial as real-world data are often noisy, while the latent space has


been de-noised. As a result, matching data distributions in the de-noised latent space may be more useful than matching them in the noisy original feature space for target learning tasks in the transductive transfer learning setting.

CHAPTER 4

APPLICATIONS TO WIFI LOCALIZATION

In this section, we apply the proposed dimensionality reduction framework to an application in wireless sensor networks: indoor WiFi localization.

4.1 WiFi Localization and Cross-Domain WiFi Localization

Location estimation is a major task of pervasive computing and AI applications that range from context-aware computing [80, 133, 106] to robotics [12]. While GPS is widely used, it cannot work well in many outdoor places where high buildings can block signals, and in most indoor places. With the increasing availability of 802.11 WiFi networks in cities and buildings, locating and tracking a user or cargo with wireless signal strength or received signal strength (RSS) is becoming a reality [74, 132, 100, 65]. In recent years, machine-learning-based methods have been applied to indoor WiFi localization [134, 100]. The objective of indoor WiFi localization is to estimate the location yi of a mobile device based on the RSS values xi = (xi1, xi2, ..., xik) received from k Access Points (APs), which periodically send out wireless signals to others. Here, we consider the two-dimensional coordinates of a location¹, and indoor WiFi localization is intrinsically a regression problem.

In general, these methods work in two phases. In an offline phase, a mobile device (or multiple ones) moving around the wireless environment is used to collect wireless signals from the APs. The RSS values (i.e., the signal vectors xi = (xi1, xi2, ..., xik)) together with the physical location information (i.e., the location coordinates in the 2D floor where the mobile device is) are used as labeled training data to learn a computational or statistical model. As an example, Figure 4.1 shows an indoor 802.11 wireless environment of size about 60m × 50m. Five APs (AP1, ..., AP5) are set up in the environment. A user with an IBM T42 laptop that is equipped with an Intel Pro/2200BG internal wireless card walks through the environment from location A to F at times tA, ..., tF. Then, six signal vectors are collected, each of which is 5-dimensional, as shown in Table 4.1. The corresponding labels of the signal vectors xtA, ..., xtF are the 2D coordinates of the locations A, ..., F in the building.

¹ Extension to three-dimensional localization is straightforward.
[Figure 4.1: An indoor wireless environment example, with five APs (AP1-AP5) and locations A-F.]

[Table 4.1: An example of signal vectors (unit: dBm), recording the RSS values (roughly −40 to −80 dBm) received from AP1-AP5 at the six locations xtA-xtF; blank cells denote missing readings. All values are rounded for illustration.]

Note that the blank cells in the table denote missing values, which we can fill in with a small default value, e.g., −100dBm. However, the RSS values are noisy and can vary with time [217, 136]. Even in the same environment, the RSS data collected in one time period may differ from those collected in another. An example is shown in Figure 4.2. As can be seen, the contours of the RSS values received from the same AP at different time periods are very different. However, most machine-learning methods rely on collecting a lot of labeled data to train an accurate localization model offline for use online, assuming that the distributions of RSS data over different time periods are static. Moreover, it is expensive to calibrate a localization model in a large environment. Hence, domain adaptation is necessary for indoor WiFi localization. Here, WiFi data collected from different time periods are considered as different domains, and "label" refers to the location information at which the WiFi data are received.

4.2 Experimental Setup

We use a public data set from the 2007 IEEE ICDM Contest (the 2nd Task). This contains a few labeled WiFi data collected in time period T1 (the source domain) and a large amount of unlabeled WiFi data collected in time period T2 (the target domain). The task is to predict the labels of the WiFi data collected in time period T2.
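Filling the blank cells with a small default value, as noted above, is a one-line preprocessing step. The matrix below is a hypothetical example in the spirit of Table 4.1, not the actual contest data:

```python
import numpy as np

# Hypothetical signal vectors; np.nan marks a blank cell, i.e. an AP that
# was not heard at that location. Values are illustrative only.
X = np.array([[-40.0, np.nan, -60.0, np.nan, -70.0],
              [-50.0, -60.0, np.nan, -40.0, np.nan]])

# Replace every missing reading with the small default value of -100 dBm.
X_filled = np.where(np.isnan(X), -100.0, X)
```

Using a value well below any observed RSS encodes "AP not heard" as a very weak signal, which keeps the feature vectors fixed-length without inventing plausible readings.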

[Figure 4.2: Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods. (a) WiFi RSS received in T1 from two APs (unit: dBm); (b) WiFi RSS received in T2 from two APs (unit: dBm). Different colors denote different signal strength values. Note that the original signal strength values are non-positive (the larger, the stronger); we shift them to positive values for visualization.]

Denote the data collected in time period T1 and time period T2 by DS and DT, respectively. In the experiments, we have |DS| = 621 and |DT| = 3,128. For more details on the data set, readers may refer to the contest report article [212]. In the transductive evaluation setting, our goal is to learn a model from DS and DT^u, and then evaluate the model on DT^u. In the out-of-sample evaluation setting, our goal is to learn a model from DS and DT^u, and then evaluate the model on DT^o (out-of-sample patterns). For each experiment, we repeat 10 times and then report the average performance using the Average Error Distance (AED):

AED = (1/N) Σ_{(xi, yi) ∈ D} ∥f(xi) − yi∥,

where xi is a vector of RSS values, yi is the corresponding ground truth location, and f(xi) is the predicted location. The following methods will be compared:

1. Traditional regression models that do not perform domain adaptation. These include the (supervised) regularized least squares regression (RLSR), which is a standard regression model and is trained on DS only, and the (semi-supervised) Laplacian RLSR (LapRLSR) [14], which is trained on both DS and DT^u but without considering the difference in distributions. Note that the Laplacian RLSR has been applied to WiFi localization and is one of the state-of-the-art localization models [134].
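The AED metric above can be sketched as follows for 2D locations (the function name and toy values are ours):

```python
import numpy as np

def average_error_distance(pred, true):
    """AED: mean Euclidean distance between predicted and ground-truth
    2D location coordinates, in the same unit as the coordinates (meters)."""
    return float(np.linalg.norm(pred - true, axis=1).mean())

# Toy check: one exact prediction and one prediction 5 m off
# (a 3-4-5 triangle), so the AED is (0 + 5) / 2 = 2.5.
pred = np.array([[0.0, 0.0], [3.0, 4.0]])
true = np.array([[0.0, 0.0], [0.0, 0.0]])
aed = average_error_distance(pred, true)
```

Averaging distances (rather than, say, squared errors) keeps the metric in meters, which is what the localization figures in this chapter report.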
we randomly split u o DT into DT (the label information is removed in training) and DT .2: Contours of RSS values over a 2-dimensional environment collected from the same AP but in different time periods. Denote the data collected in time period T1 and time period T2 by DS and DT . while D = DT in the transductive setting.

It learns u a set of new cross-domain features from both DS and DT . we set the ϵ parameter in √ KMM as B/ n1 . 3.5 and search for the best λ value in [10−6 . we ﬁx λ and search for γ in [0. For SSTCA.3 and TCAReduced. 103 ]. we apply TCA / SSTCA on both DS and DT to learn transfer components. µ. u 60 . It ﬁrst learns a projection from both DS and DT via KPCA. the ﬁrst author of [191]. a RLSR model is trained on the projected source domain data. Thus. 5 The code of SSA is provided by Paul von B¨ nau. There are two parameters in TCA. which replaces the constraint W ⊤ KHKW = I in TCA by 2 The code of KLIEP is downloaded from http://sugiyama-www. We set σ and µ in the same manner as TCA. 4. Following [82].3. A traditional dimensionality reduction method: Kernel PCA (KPCA) as introduced in u Chapter 3. We ﬁrst set µ = 1 and search for the best σ value (based on the validation set) in the range [10−5 . kernel width4 σ and parameter µ.1. 2. we ﬁnd that the resulting performance is satisfactory in practice. The proposed TCA and SSTCA. Afterwards.titech. However. Note that the Laplacian RLSR has been applied to WiFi localization and is one of the state-of-the-art localization models [134]. Then. Methods that only perform distribution matching in a latent space: SSA5 as introduced in 2. First.2. Sample selection bias (or covariate shift) methods: KMM and KLIEP2 as introduced in u Chapter 2. RLSR is then applied on the projected DS to learn a localization model. and then augments features on the source domain data in DS with the new features. For KLIEP. 106 ].jp/˜sugi/software/KLIEP/index. and γ). and there are four tunable parameters (σ. 6. A state-of-the-art domain adaptation method: SCL3 as introduced in Chapter 2. A RLSR model is then trained. λ. which has been shown to be a suitable kernel for the WiFi data [139]. Note that this parameter tuning strategy may not get a global optimal combination of values of parameters. 
its initial value is also tuned on the validation set. Preliminary results suggest that the ﬁnal performance of KLIEP can be sensitive to the initialization of the kernel width.19) on the labels. the pivot features are selected by mutual information while the number of pivots and other SCL parameters are determined by the validation data. 105 ]. we set γ = 0. and then train a RLSR model on the weighted data.u which is trained on both DS and DT but without considering the difference in distribu- tions. u 5. xj ) = exp − i σ j . Afterwards. Finally. 1].cs. we use the likelihood cross-validation method in [180] to automatically select the kernel width.html 3 Following [25]. we use linear kernel for kyy in (3. They use both DS and DT to learn weights of the patterns in DS . ( ) ∥x −x ∥ 4 We use the Laplace kernel k(xi .3. where n1 is the number of training data in the source domain. and map data in DS to the latent space.ac. we ﬁx σ and search for the best µ value in [10−3 .
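The sequential parameter-tuning strategy described above — fix all parameters at their current values and sweep one grid at a time on the validation set — can be sketched as follows. `coordinate_tune`, the toy error surface and the exact grids are our illustrative assumptions, not code from the thesis:

```python
import numpy as np

def coordinate_tune(evaluate, grids, init):
    """Coordinate-wise grid search: sweep one parameter grid at a time while
    the other parameters stay fixed at their current best values.
    `evaluate` maps a parameter dict to a validation error (lower is better)."""
    params = dict(init)
    for name, grid in grids.items():
        best_err, best_val = np.inf, params[name]
        for value in grid:
            err = evaluate({**params, name: value})
            if err < best_err:
                best_err, best_val = err, value
        params[name] = best_val
    return params

# Toy check: a hypothetical "validation error" minimized at sigma=1, mu=10.
err = lambda p: (np.log10(p["sigma"]))**2 + (np.log10(p["mu"]) - 1.0)**2
grids = {"sigma": 10.0**np.arange(-5, 6),   # 1e-5 ... 1e5
         "mu":    10.0**np.arange(-3, 4)}   # 1e-3 ... 1e3
best = coordinate_tune(err, grids, {"sigma": 1.0, "mu": 1.0})
```

As the text notes, such a coordinate-wise sweep is not guaranteed to find the globally optimal parameter combination, but it keeps the number of model fits linear (rather than multiplicative) in the grid sizes.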

W⊤W = I. TCAReduced thus aims to find a transformation W that minimizes the distance between the different distributions without maximizing the variance in the latent space. Afterwards, a RLSR model is trained on the projected source domain data.

6. The proposed TCA and SSTCA. First, we apply TCA / SSTCA on both DS and DT^u to learn the transfer components, and map the data in DS to the latent space. A RLSR model is then trained. There are two parameters in TCA, the kernel width⁴ σ and the parameter µ. We first set µ = 1 and search for the best σ value (based on the validation set) in the range [10⁻⁵, 10⁵]. Then, we fix σ and search for the best µ value in [10⁻³, 10³]. For SSTCA, there are four tunable parameters (σ, µ, λ, and γ). We set σ and µ in the same manner as TCA, set γ = 0.5 and search for the best λ value in [10⁻⁶, 10⁶], and finally fix λ and search for γ in [0, 1]. Note that this parameter tuning strategy may not find a globally optimal combination of parameter values. However, we find that the resulting performance is satisfactory in practice.

7. A closely related dimensionality reduction method: MMDE. This is a state-of-the-art method on the ICDM-07 contest data set [135, 139].

² The code of KLIEP is downloaded from http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/KLIEP/index.html
³ Following [25], the pivot features are selected by mutual information, while the number of pivots and other SCL parameters are determined by the validation data.
⁴ We use the Laplace kernel k(xi, xj) = exp(−∥xi − xj∥/σ), which has been shown to be a suitable kernel for the WiFi data [139].
⁵ The code of SSA is provided by Paul von Bünau, the first author of [191].

Since MMDE (the last method) is transductive, the first six methods will be compared in the out-of-sample setting in Chapters 4.3.1-4.3.3, while the performance of TCA, SSTCA and MMDE will be compared in the transductive setting in Chapter 4.3.4.

4.3 Results

4.3.1 Comparison with Dimensionality Reduction Methods

We first compare TCA and SSTCA with some dimensionality reduction methods, including KPCA, SSA and TCAReduced, in the out-of-sample setting. The number of unlabeled patterns in DT^u is fixed at 400, while the dimensionality of the latent space varies from 5 to 50. Figure 4.3 shows the results. As can be seen, TCA and SSTCA outperform all the other methods. TCA performs better than KPCA. This is because the WiFi data are highly noisy: KPCA can only de-noise the data but cannot ensure that the distance between the data distributions of the two domains is reduced, whereas localization models learned in the de-noised latent space of TCA can be more accurate than those learned in the original input space. Moreover, though TCAReduced and SSA aim to reduce the distance between domains, they do not maximize the data variance; they may thus lose important information of the original data in the latent space, which in turn may hurt the performance of the target learning tasks. As a result, they do not obtain good performance. Finally, we observe that SSTCA obtains better performance than TCA. As demonstrated in previous research [134], the manifold assumption holds on the WiFi data. Thus, the graph Laplacian term in SSTCA, though simple, can effectively propagate label information from the labeled data to the unlabeled data across domains, which can lead to significantly improved performance.

[Figure 4.3: Comparison with dimensionality reduction methods (KPCA, SSTCA, TCA, TCAReduced, SSA); Average Error Distance (unit: m) versus the dimensionality of the latent space.]

4.3.2 Comparison with Non-Adaptive Methods

In this experiment, we compare TCA and SSTCA with learning-based localization models that do not perform domain adaptation, including RLSR, LapRLSR and KPCA. The dimensionalities of the latent spaces for KPCA, TCA and SSTCA are fixed at 15. These values are determined based on the first experiment in Chapter 4.3.1.
4: Comparison with localization methods that do not perform domain adaption. We ﬁx the dimensionalities of the latent space in TCA and SSTCA at 15.100 55 30 KPCA SSTCA TCA TCAReduced SSA Average Error Distance (unit: m) 15 8 4. This is again because the WiFi data are 62 . domain adaptation methods that are based on feature extraction (including SCL. For training. TCA and SSTCA can perform well for domain adaptation.3 Comparison with Domain Adaptation Methods In this subchapter. while we ﬁx the dimensionalities of the latent space in SSA and TCAReduced at 50. 8 RLSR LapRLSR KPCA SSTCA TCA Average Error Distance (unit: m) 7 6 5 4 3 2 0 100 200 300 400 500 600 700 800 900 # Unlabeled Data in the Target Domain for Training Figure 4. TCA and SSTCA) perform much better than instance re-weighting methods (including KMM and KLIEP).4 shows the performance when the number of unlabeled patterns in DT varies. Results are shown in Figure 4. SCL and SSA. As can be seen. 4. u Figure 4. KLIEP. even with only a few unlabeled data in the target domain.3: Comparison with dimensionality reduction methods. we compare TCA and SSTCA with some state-of-the-art domain adaptation methods. As can be seen.5 0 5 10 15 20 25 30 35 # Dimensions 40 45 50 55 Figure 4. including KMM.5 1.5 2.3.5. all the source domain data are used and varying amount u of the target domain data are sampled as |DT |.

SCL may suffer from the bad choice of pivot features due to the noisy observations. 4.6: Comparison with MMDE in the transductive setting on the WiFi data.5 4 3.5 3 2.5 0 5 10 15 20 25 30 35 # Dimensions 40 45 50 55 MMDE SSTCA TCA 3 2. where the WiFi data have been implicitly de-noised.6 0 MMDE SSTCA TCA Average Error Distance (unit: m) 100 200 300 400 500 600 700 800 # Unlabeled Data in the Target Domain for Training 900 (a) Varying the dimensionality of the latent space.4 Comparison with MMDE In this subchapter. SSTCA and the various baseline methods in the inductive setting on the WiFi data. Figure 4.5 2 1.3.6(a) shows the results for different dimensionalities of the latent space with |DT | = 400. u Figure 4.8 2.6 2. with the dimensionalities of MMDE. Indeed. The latent space is learned from DS and a subset of the unlabeled target domain data sampled from u DT . while Figure 4. (b) Varying the number of unlabeled data. 6 5. and so matching distributions directly based on the noisy observations may not be useful. we compare TCA and SSTCA with MMDE in the transductive setting. As can be seen. On the other hand.4 2.8 1. TCA and SSTCA ﬁxed at 15. The performance is then measured on the same unlabeled data subset.5: Comparison of TCA.5 Average Error Distance (unti: m) 5 4.2 2 1. highly noisy. MMDE 63 .6(b) shows the results for different amounts of unlabeled target domain data. TCA and SSTCA match distributions in the latent space.90 KLIEP SCL KMM SSTCA TCA TCAReduced SSA Average Error Distance (unit: m) 40 22 13 8 5 3 2 0 100 200 300 400 500 600 700 800 900 # Unlabeled Data in the Target Domain for Training Figure 4.

outperforms TCA and SSTCA. This may be due to the limitation that the kernel matrix used in TCA / SSTCA is parametric. In TCA and SSTCA, we need to tune the kernel parameters using cross-validation, while in MMDE, the kernel matrix is learned from the data automatically. As a result, the kernel matrix in MMDE can fit the data better, which is useful for learning tasks. However, as mentioned in Chapter 3, MMDE is computationally expensive because it involves an SDP. This is confirmed in the training time comparison in Figure 4.7. Thus, for real-world applications, TCA and SSTCA are more desirable.

Figure 4.7: Training time with varying amount of unlabeled data for training.

4.3.5 Sensitivity to Model Parameters

In this subsection, we investigate the effects of the parameters on the regression performance. These include the kernel width σ in the Laplacian kernel, the tradeoff parameter µ, and, for SSTCA, the two additional parameters γ and λ. The out-of-sample evaluation setting is used. All the source domain data are used, and we sample 2,328 samples from the target domain data to form D_T^o and another 400 samples to form D_T^u. The dimensionalities of the latent spaces in TCA and SSTCA are fixed at 15. As can be seen from Figure 4.8, both TCA and SSTCA are insensitive to the settings of the various parameters.

4.4 Summary

In this section, we applied MMDE, TCA and SSTCA to the WiFi localization problem. Experimental results verified the effectiveness of our proposed methods in WiFi localization compared with several cross-domain methods. From the results, we may observe that in the transductive experimental setting, MMDE performs best. However, MMDE may not be applicable in practice because of its expensive computational cost. Thus, TCA or SSTCA may be a better choice than MMDE for cross-domain adaptation. Furthermore, based on the results in WiFi localization, SSTCA performs better than

TCA when the manifold assumption does hold on the source and target domain data. The reason may be that the manifold assumption holds strongly on the WiFi data. As a result, the manifold regularization term in SSTCA is able to propagate label information from the source domain to the target domain effectively. Therefore, when the manifold assumption does hold on the source and target domain data, we suggest using SSTCA for learning the latent space for transfer learning.

Figure 4.8: Sensitivity analysis of the TCA / SSTCA parameters on the WiFi data. (a) varying σ of the Laplacian kernel; (b) varying µ; (c) varying γ in SSTCA; (d) varying λ in SSTCA.

CHAPTER 5

APPLICATIONS TO TEXT CLASSIFICATION

In the previous section, we have shown the effectiveness of the proposed dimensionality reduction framework for the cross-domain WiFi localization problem. In this section, we apply the framework to another application: text classification. In the following sections, we test the effectiveness of the proposed framework on a real-world cross-domain text classification dataset.

5.1 Text Classification and Cross-domain Text Classification

Text classification, also known as document classification, aims to assign a document to one or more categories based on its content. It is a fundamental task for Web mining and has been widely studied in the fields of machine learning, data mining and information retrieval [89, 207, 215]. Traditional supervised learning approaches to text classification require sufficient labeled instances in a domain of interest in order to train a high-quality classifier. Test data are assumed to come from the same domain, following the same data distribution as the training domain data. However, in many real-world applications, labeled data are in short supply in the domain of interest. As a result, cross-domain learning algorithms are desirable in text classification. In recent years, transfer learning techniques have been proposed to address cross-domain text classification problems [48, 47, 49, 204], which have been reviewed in Chapter 2. For example, the authors of [48] proposed a modified naive Bayes algorithm for cross-domain text classification. However, the proposed method is based on naive Bayes, and thus it is hard to apply it to other applications. Indeed, most of these methods are designed for the text classification area only, and are unable to solve cross-domain problems in other areas. In contrast, our proposed dimensionality reduction framework can be used for general cross-domain regression or classification problems, and can thus be applied to various diverse applications, such as WiFi localization.

5.2 Experimental Setup

In this section, we perform cross-domain text classification experiments on a preprocessed data set of the 20-Newsgroups collection1 [96]. This is a text collection of approximately 20,000 newsgroup documents hierarchically partitioned into 20 newsgroups.

1 http://people.csail.mit.edu/jrennie/20Newsgroups/

We follow the preprocessing strategy in [47, 48] to create six data sets from this collection. For each data set, two top categories are chosen, one as positive and the other as negative, and the binary classification task is defined as top category classification. We then split the data based on sub-categories: different sub-categories are considered as different domains, and documents from different sub-categories under the same parent category are considered to be related domains. The six data sets created are "comp vs sci" (C vs S), "rec vs talk" (R vs T), "comp vs talk" (C vs T), "comp vs rec" (C vs R), "rec vs sci" (R vs S) and "sci vs talk" (S vs T) (Table 5.1). This splitting strategy ensures that the domains of labeled and unlabeled data are related, since they are under the same top categories. Besides, the domains are also ensured to be different, since they are drawn from different sub-categories. All alphabetic characters are converted to lower case and stemmed using the Porter stemmer. Stop words are removed, and the document frequency is used to cut down the number of features.

Table 5.1: Summary of the six data sets constructed from the 20-Newsgroups data. For each task, the table lists the number of features, the number of documents, and the numbers of positive and negative documents (1,500 each) in the source and target domains.

Similar to Chapter 4, from each of these six data sets we randomly sample 40% of the documents from the source domain as D_S, sample 40% of the documents from the target domain to form the unlabeled subset D_T^u, and use the remaining 60% of the target domain to form the out-of-sample subset D_T^o. Hence, in each cross-domain classification task, |D_S| = 1200, |D_T^u| = 1200 and |D_T^o| = 1800. All experiments are performed in the out-of-sample setting. We run 10 repetitions and report the average results. The evaluation criterion is the classification accuracy.

In this experiment, we compare TCA and SSTCA with the following methods:

1. Linear support vector machine (SVM) in the original input space.

2. KPCA. A linear SVM is then trained in the latent space.

3. Three domain adaptation methods: KMM, KLIEP and SCL. Again, a linear SVM is used as the classifier in the latent space.

Note that we do not show the performance of MMDE in this experiment, because it results in "out of memory" on learning the kernel matrix using SDP solvers.
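The preprocessing pipeline described above (lower-casing, stop-word removal, Porter stemming, and a document-frequency cutoff) can be sketched as follows. The stop-word list, the toy suffix-stripping stemmer, and the `min_df` threshold used here are illustrative stand-ins, not the exact settings used in the thesis experiments or in [47, 48]:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of", "to"}  # illustrative subset

def crude_stem(token):
    # Toy stand-in for the Porter stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())          # lower-case + tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [crude_stem(t) for t in tokens]               # stemming

def build_vocabulary(docs, min_df=2):
    # Document frequency: the number of documents a term appears in.
    df = Counter()
    for doc in docs:
        df.update(set(preprocess(doc)))
    # Cut down the feature set by keeping terms with df >= min_df.
    return sorted(t for t, n in df.items() if n >= min_df)

docs = [
    "The spacecraft is orbiting and sending images",
    "New images of the spacecraft released",
    "Hockey playoffs started",
]
print(build_vocabulary(docs, min_df=2))  # ['image', 'spacecraft']
```

Only terms that survive in at least `min_df` documents become features, which is how the document frequency cuts down the feature set.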

5.3 Results

5.3.1 Comparison to Other Methods

Results are shown in Tables 5.2 and 5.3. We experiment with the Laplacian, RBF and linear kernels2 for feature extraction or re-weighting in KPCA, KMM, TCA and SSTCA. For SSTCA, we use linear kernels for both the inputs and outputs, i.e., the kernel kyy in (3.19) is the linear kernel. The µ parameters in TCA and SSTCA are set to 1, and the λ parameter in SSTCA is set to 0.0001. Note that we do not compare with SSA because it results in "out of memory" on computing the covariance matrices.

Overall, we can obtain a similar conclusion as in Chapter 4: feature extraction methods outperform instance-reweighting methods. Moreover, on some of the tasks, linear TCA performs much better than PCA. This agrees with our motivation and the previous conclusion on the WiFi experiments, namely that mapping data from different domains to a latent space spanned by the principal components may not work well, as PCA cannot guarantee a reduction in the distance between the two domain distributions. However, on the other tasks, the performance of PCA is comparable to that of linear TCA.

Next, one may notice two main differences between the results on the WiFi data and those on the text data. First, TCA performs better than SSTCA on the text data. This may be because the manifold assumption is weaker in the text domain than in the WiFi domain. Second, the linear kernel performs better than the RBF and Laplacian kernels here. This agrees with the well-known observation that the linear kernel is often adequate for high-dimensional text data.

5.3.2 Sensitivity to Model Parameters

Figure 5.1 shows how the various parameters in TCA and SSTCA affect the classification performance. Here, the remaining free parameters are µ for TCA, and µ, γ and λ for SSTCA, and the dimensionalities of the latent spaces are fixed at 10. Figure 5.1(a) shows the sensitivity w.r.t. µ for TCA, and Figure 5.1(b) shows the sensitivity w.r.t. µ for SSTCA (with γ = 0.5 and λ = 10^-4). As can be seen, both TCA and SSTCA are insensitive to the setting of µ. Figure 5.1(c) shows the sensitivity of SSTCA w.r.t. γ (with µ = 1 and λ = 10^-4). There is a wide range of γ for which the performance of SSTCA is quite stable. Finally, Figure 5.1(d) shows the sensitivity w.r.t. λ (with γ = 0.5 and µ = 1). Different from the WiFi results in Figure 4.8(d), where SSTCA performs well when λ ≤ 10^2, here it performs well only when λ is very small (λ ≤ 10^-4). This indicates that manifold regularization may not be useful on the text data.

2 The RBF and linear kernels are defined as k(xi, xj) = exp(−‖xi − xj‖²/σ) and k(xi, xj) = xi xj⊤, respectively.
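The three kernel functions compared in this subsection can be computed as below. This is a generic sketch of the standard kernel formulas (with the RBF written as exp(−‖xi − xj‖²/σ), matching the footnote, and the Laplacian using the Euclidean norm), not code from the thesis experiments:

```python
import math

def linear_kernel(x, y):
    # k(x, y) = x . y
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / sigma)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma)

def laplacian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y|| / sigma), using the Euclidean norm here
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-dist / sigma)

x, y = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]
print(linear_kernel(x, y))   # 4.0
print(rbf_kernel(x, x))      # 1.0: the RBF kernel is maximal at x = y
print(laplacian_kernel(x, x))  # 1.0
```

For bag-of-words text vectors, the linear kernel reduces to a raw dot product of term counts, which is one reason it is often adequate in high dimensions.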

Table 5.2: Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation). Accuracies are reported for SVM, TCA, SSTCA, KPCA and SCL (each with Laplacian, RBF and linear kernels), KLIEP and KMM, at latent-space dimensionalities of 5, 10, 20 and 30.

Table 5.3: Classification accuracies (%) of the various methods (the number inside parentheses is the standard deviation). Accuracies are reported for SVM, TCA, SSTCA, KPCA and SCL (each with Laplacian, RBF and linear kernels), KLIEP and KMM, at latent-space dimensionalities of 5, 10, 20 and 30.

Figure 5.1: Sensitivity analysis of the TCA / SSTCA parameters on the text data. (a) varying µ in TCA; (b) varying µ in SSTCA; (c) varying γ in SSTCA; (d) varying λ in SSTCA.

5.4 Summary

In this section, we applied TCA and SSTCA to the text classification problem. Experimental results verified that our proposed methods based on the dimensionality reduction framework can achieve promising results in cross-domain text classification. However, different from the conclusion in WiFi localization, TCA slightly outperforms SSTCA in the text classification problem. The reason may be that the manifold assumption does not hold on the text data. In this case, the manifold regularization term in SSTCA may not be able to propagate label information from the source domain to the target domain, and SSTCA is not supposed to outperform TCA when the manifold assumption does not hold on the source and target domain data. In a worse case, SSTCA may even perform worse than TCA if the manifold regularization hurts the learning performance. Therefore, when the manifold assumption does not hold, using unsupervised TCA for learning the latent space for transfer learning may be a better choice than using SSTCA.

CHAPTER 6

DOMAIN-DRIVEN FEATURE SPACE TRANSFER FOR SENTIMENT CLASSIFICATION

In the previous chapters, we have considered transfer learning when the source and target domains have only implicit overlap, which means the domain knowledge across these domains is hidden. Our strategy is to exploit a general dimensionality reduction framework to discover common latent factors that are useful for knowledge transfer across domains. We have proposed three methods based on the framework to learn the transfer latent space and applied them successfully to two diverse real-world applications: indoor WiFi localization and text classification. However, in some domain areas, we have some explicit domain knowledge that can help to design a more effective latent space for specific applications. In this chapter, we highlight one such application, sentiment classification, to show that domain knowledge can be encoded in feature space learning for transfer learning when it is available.

6.1 Sentiment Classification

With the explosion of Web 2.0 services, more and more user-generated resources have become available on the World Wide Web. They exist in the form of online user reviews on shopping or opinion sites, in posts of personal blogs, or as user feedback on news, etc. This rich opinion information on the Web can help an individual make a decision to buy a product or plan a traveling route, help a company do a survey on its products or services, or even help the government get feedback on a policy. However, opinions are hidden in long forum posts and blogs, and in many cases the sentiment data are large-scale. As a result, how to find, extract or summarize useful information from opinion sources on the Web is an important and challenging task. Recently, some interesting tasks have been studied in sentiment analysis, also known as opinion mining, e.g., opinion extraction and summarization [81, 93], opinion integration [117], review quality prediction [116, 114], review spam identification [88], and opinion analysis in multiple languages [1]. Among these tasks, sentiment classification, also known as subjective classification [145, 111], which aims at classifying text segments, such as text sentences and review articles, into polarity categories (for example, positive or negative), has attracted much attention in the natural language processing and information retrieval areas [110, 184]. It is an important task and has been widely studied because many users

do not explicitly indicate their sentiment polarity, and thus we need to predict it from the text data generated by them.

Sentiment data are text segments containing user opinions about objects, events and their properties. User sentiment may exist in the form of a sentence, a paragraph or an article. In this chapter, we represent user sentiment data with a bag-of-words method: a piece of sentiment text, denoted by xj, corresponds to a sequence of words w1 w2 ... w|xj|, where each wi is a word from a vocabulary W, and c(wi, xj) denotes the frequency of word wi in xj. In the literature, either single words or N-grams can be used as features to represent sentiment data; in the rest of this chapter, we will use word and feature interchangeably. Without loss of generality, we use a unified vocabulary W for all domains, with |W| = m. In sentiment classification, a domain D denotes a class of objects, events and their properties in the world. For example, different types of products, such as books, dvds and furniture, can be regarded as different domains.

For each piece of sentiment data xj, there is a corresponding label yj: yj = +1 if the overall sentiment expressed in xj is positive, and yj = -1 if the overall sentiment expressed in xj is negative. A pair consisting of a sentiment text and its corresponding sentiment polarity, {xj, yj}, is called labeled sentiment data. If xj has no polarity assigned, it is unlabeled sentiment data. Besides positive and negative sentiment, there are also neutral and mixed sentiment data in practical applications. Neutral polarity means that there is no sentiment expressed by users, while mixed polarity means user sentiment is positive in some aspects but negative in other ones. In this chapter, we focus only on positive and negative sentiment data, but it is not hard to extend the proposed solution to address multi-category sentiment classification problems.

6.2 Existing Works in Cross-Domain Sentiment Classification

In recent years, various machine learning techniques have been proposed for sentiment classification [145, 111]. Compared to unsupervised learning methods [187], supervised learning algorithms [146] have been proven promising and are widely used in sentiment classification, and a lot of research work has been proposed to improve the classification performance in the supervised setting [144, 128]. However, these methods rely heavily on manually labeled training data to train an accurate sentiment classifier. In order to reduce the human-annotating cost, Goldberg and Zhu adapted a graph-based semi-supervised learning method to make use of unlabeled data for sentiment classification. Furthermore, Sindhwani et al. [171] and Li et al. [105] proposed to incorporate lexical knowledge into graph-based semi-supervised learning and non-negative matrix tri-factorization approaches, respectively, for sentiment classification with a few labeled data.
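The bag-of-words representation described above can be sketched as follows; the tiny vocabulary and review snippets are made-up examples, not data from this chapter:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # c(w_i, x_j): the frequency of each vocabulary word w_i in text x_j.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

# A unified vocabulary W shared by all domains (|W| = m = 5 here).
W = ["good", "sharp", "blurry", "hooked", "boring"]

x_pos = "very good picture , looks sharp , really sharp"
x_neg = "blurry picture , boring menus"

print(bag_of_words(x_pos, W))  # [1, 2, 0, 0, 0]
print(bag_of_words(x_neg, W))  # [0, 0, 1, 0, 1]
```

Each sentiment text becomes a fixed-length count vector over W, so texts from any domain live in the same m-dimensional feature space.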

However, these semi-supervised learning methods still require a few labeled data in the target domain to train an accurate sentiment classifier. In practice, we may have hundreds or thousands of domains at hand; it would be preferable to annotate data in only a few domains and train sentiment classifiers that can make accurate predictions in all other domains. Furthermore, similar to supervised learning methods, these approaches are domain dependent, which is caused by changes in vocabulary: users may use domain-specific words to express their sentiment in different domains. Table 6.1 shows several user review sentences from two domains: electronics and video games. In the electronics domain, we may use words like "compact" and "sharp" to express our positive sentiment and use "blurry" to express our negative sentiment, while in the video game domain, words like "hooked" and "realistic" indicate positive opinion and the word "boring" indicates negative opinion. Due to the mismatch between domain-specific words, a sentiment classifier trained in one domain may not work well when directly applied to other domains. Thus cross-domain sentiment classification algorithms are highly desirable to reduce domain dependency and manual labeling cost.

Table 6.1: Cross-domain sentiment classification examples: reviews of electronics and video games products. Boldfaces are domain-specific words, which are much more frequent in one domain than in the other one. Italic words are some domain-independent words, which occur frequently in both domains. "+" denotes positive sentiment, and "-" denotes negative sentiment.

  (+) electronics: "Compact; easy to operate; very good picture quality; looks sharp!"
  (+) electronics: "I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp."
  (-) electronics: "It is also quite blurry in very dark settings. I will never buy HP again."
  (+) video games: "A very good game! It is action packed and full of excitement. I am very much hooked on this game."
  (+) video games: "Very realistic shooting action and good plots. We played this and were hooked."
  (-) video games: "The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again."

As a result, a sentiment classifier trained in one domain cannot be applied to another domain directly. To address this problem, Blitzer et al. [25] proposed the structural correspondence learning (SCL) algorithm, which has been introduced in Chapter 2.3, to exploit domain adaptation techniques for sentiment classification; it is a state-of-the-art method in cross-domain sentiment classification. SCL is motivated by a multi-task learning algorithm, alternating structural optimization (ASO), proposed by Ando and Zhang [4]. SCL tries to construct a set of related tasks to model the relationship between "pivot features" and "non-pivot features". Then non-pivot features with similar weights among the tasks tend to be close to each other in a low-dimensional latent space. However, in practice it is hard to construct a reasonable number of related tasks from data, which may limit the transfer ability of SCL for cross-domain sentiment classification. More recently, Li et al. [104] proposed to transfer common lexical knowledge across domains via matrix factorization techniques.

In this chapter, we aim to find an effective approach for the cross-domain sentiment classification problem. In particular, we propose a spectral feature alignment (SFA) algorithm to find a new representation for cross-domain sentiment data, such that the gap between domains can be reduced. SFA uses some domain-independent words as a bridge to construct a bipartite graph that models the co-occurrence relationship between domain-specific words and domain-independent words. The idea is that if two domain-specific words have connections to more common domain-independent words in the graph, they tend to be aligned together with higher probability. Similarly, if two domain-independent words have connections to more common domain-specific words in the graph, they tend to be aligned together with higher probability. We adapt a spectral clustering algorithm, which is based on graph spectral theory [41], to the bipartite graph to co-align domain-specific and domain-independent words into a set of feature clusters. In this way, the clusters can be used to reduce the mismatch between domain-specific words of both domains. Finally, we represent all data examples with these clusters and train sentiment classifiers based on the new representation.

6.3 Problem Statement and A Motivating Example

Problem Definition. Assume we are given two specific domains DS and DT, referred to as the source domain and the target domain respectively. Suppose we have a set of labeled sentiment data DS = {(xSi, ySi)}_{i=1}^{nS} in DS, and some unlabeled sentiment data DT = {xTj}_{j=1}^{nT} in DT. The task of cross-domain sentiment classification is to learn an accurate classifier to predict the polarity of unseen sentiment data from DT.

In this section, we use an example to introduce the motivation of our solution to the cross-domain sentiment classification problem. First of all, we assume the sentiment classifier f is a linear function, which can be written as y* = f(x) = sgn(xw^T), where x ∈ R^{1×m}, sgn(xw^T) = +1 if xw^T ≥ 0 and sgn(xw^T) = -1 otherwise, and w is the weight vector of the classifier, which can be learned from a set of training data (pairs of sentiment data and their corresponding polarity labels).

Consider the example shown in Table 6.1 to illustrate our idea. We use a standard bag-of-words method to represent sentiment data of the electronics (E) and video games (V) domains. From Table 6.2, we can see that the difference between domains is caused by the frequency of the domain-specific words. Domain-specific words in the E domain, such as compact, sharp and blurry, do not occur in the V domain. On the other hand, domain-specific words in the V domain, such as hooked, realistic and boring, do not occur in the E domain. Suppose the E domain is the source domain and the V domain is the target domain; our goal is to train a vector of

weights w* with labeled data from the E domain, and use it to predict sentiment polarity for the V domain data.1 Based on the three training sentences in the E domain, the weights of features such as compact and sharp should be positive, the weight of features such as blurry should be negative, and the weights of features such as hooked, realistic and boring can be arbitrary, or zero if an L1 regularizer is applied on w for model training. However, an ideal weight vector in the V domain should have positive weights for features such as hooked and realistic and a negative weight for the feature boring, while the weights of features such as compact, sharp and blurry may take arbitrary values. That is why the classifier learned from the E domain may not work well in the V domain.

Table 6.2: Bag-of-words representations of electronics (E) and video games (V) reviews. Only domain-specific features are considered. "..." denotes all other words.

         ...  compact  sharp  blurry  hooked  realistic  boring
  E  +   ...  1        1      0       0       0          0
  E  +   ...  0        1      0       0       0          0
  E  -   ...  0        0      1       0       0          0
  V  +   ...  0        0      0       1       0          0
  V  +   ...  0        0      0       1       1          0
  V  -   ...  0        0      0       0       0          1
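The argument above can be made concrete with a small sketch. The weight values below are illustrative choices consistent with the discussion (positive for compact/sharp, negative for blurry, zero for the V-domain words); they are not weights actually learned in this chapter:

```python
def sgn(score):
    # sgn(x w^T): +1 if the score is non-negative, -1 otherwise.
    return 1 if score >= 0 else -1

def predict(x, w):
    # y* = f(x) = sgn(x w^T) for a bag-of-words row vector x.
    return sgn(sum(xi * wi for xi, wi in zip(x, w)))

# Feature order: [compact, sharp, blurry, hooked, realistic, boring].
# Weights learnable from the E domain only: V-specific features stay zero.
w_E = [1.0, 1.0, -1.0, 0.0, 0.0, 0.0]

# Rows of Table 6.2 for the V domain (target).
v_pos = [0, 0, 0, 1, 1, 0]   # positive review ("hooked", "realistic")
v_neg = [0, 0, 0, 0, 0, 1]   # negative review ("boring")

# Both V reviews score 0, so both are labeled +1: the E-trained
# classifier cannot separate positive from negative V reviews.
print(predict(v_pos, w_E), predict(v_neg, w_E))  # 1 1
```

On the E rows of Table 6.2 the same weights classify perfectly, which illustrates how domain-specific vocabulary makes a classifier domain dependent.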

In order to reduce the mismatch between features of the source and target domains, a straightforward solution is to make them more similar by adopting a new representation. Table 6.3 shows an ideal representation of domain-specific features. Here, sharp_hooked denotes a cluster consisting of sharp and hooked, compact_realistic denotes a cluster consisting of compact and realistic, and blurry_boring denotes a cluster consisting of blurry and boring. We can use these clusters as high-level features to represent domain-specific words. Based on the new representation, a weight vector w* trained in the E domain should also be an ideal weight vector in the V domain. In this way, the classifier learned from one domain can be easily adapted to another one.

Table 6.3: Ideal representations of domain-specific words.

         ...  sharp_hooked  compact_realistic  blurry_boring
  E  +   ...  1             1                  0
  E  +   ...  1             0                  0
  E  -   ...  0             0                  1
  V  +   ...  1             0                  0
  V  +   ...  1             1                  0
  V  -   ...  0             0                  1


The problem is how to construct such an ideal representation as shown in Table 6.3. Clearly, if we directly apply traditional clustering algorithms such as k-means [76] on Table 6.2, we are not able to align sharp and hooked into one cluster, since the distance between them is large.

1 For simplicity, we only discuss domain-specific words here and ignore all other words.

In order to reduce the gap and align domain-specific words from different domains, we can utilize domain-independent words as a bridge. As shown in Table 6.1, words such as sharp, hooked, compact and realistic often co-occur with words such as good and exciting, while words such as blurry and boring often co-occur with never buy. Since the words good, exciting and never buy occur frequently in both the E and V domains, they can be treated as domain-independent features. Table 6.4 shows the co-occurrences between domain-independent and domain-specific words. It is easy to find that, by applying clustering algorithms such as k-means on Table 6.4, we can get the feature clusters shown in Table 6.3: sharp_hooked, blurry_boring and compact_realistic.

Table 6.4: A co-occurrence matrix of domain-specific and domain-independent words.

              compact  realistic  sharp  hooked  blurry  boring
  good        1        1          1      1       0       0
  exciting    0        0          1      1       0       0
  never buy   0        0          0      0       1       1
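Clustering the columns of Table 6.4 can be sketched with a minimal k-means (Lloyd's algorithm) in pure Python. For this toy matrix the three distinct column profiles are used directly as initial centroids, which is an illustrative simplification rather than a general initialization scheme:

```python
def kmeans(points, centroids, iters=10):
    # Minimal Lloyd's algorithm: assign each point to its nearest
    # centroid, then recompute centroids as cluster means.
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

# Columns of Table 6.4: co-occurrence with (good, exciting, never buy).
features = ["compact", "realistic", "sharp", "hooked", "blurry", "boring"]
cols = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]

# Initialize with the three distinct column profiles.
init = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = kmeans(cols, init)
clusters = {}
for f, lab in zip(features, labels):
    clusters.setdefault(lab, []).append(f)
print(sorted(clusters.values()))
# [['blurry', 'boring'], ['compact', 'realistic'], ['sharp', 'hooked']]
```

Because the co-occurrence profiles (rather than the raw bag-of-words rows) are clustered, sharp and hooked end up in the same cluster even though they never co-occur in a document.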

The example described above suggests that the co-occurrence relationship between domain-specific and domain-independent features is useful for aligning features across different domains. We propose to use a bipartite graph to represent this relationship and then adapt spectral clustering techniques to find a new representation for domain-specific features. In the following section, we present the spectral domain-specific feature alignment algorithm in detail.
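Such a bipartite co-occurrence graph can be assembled as below. The A = [[0, M], [M^T, 0]] adjacency layout is the standard construction for a bipartite graph; the spectral steps that operate on this graph are developed in Section 6.4, so this sketch only builds the graph itself:

```python
# M[i][j]: co-occurrence count of domain-independent word i with
# domain-specific word j (the rows and columns of Table 6.4).
di_words = ["good", "exciting", "never buy"]
ds_words = ["compact", "realistic", "sharp", "hooked", "blurry", "boring"]
M = [
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]

l, p = len(di_words), len(ds_words)
n = l + p

# Adjacency of the bipartite graph: DI vertices first, then DS vertices.
# A = [[0, M], [M^T, 0]], so every edge connects the two vertex sets.
A = [[0] * n for _ in range(n)]
for i in range(l):
    for j in range(p):
        A[i][l + j] = M[i][j]
        A[l + j][i] = M[i][j]

# Vertex degrees: how strongly each word is tied to the other side.
degrees = [sum(row) for row in A]
print(degrees)  # [4, 2, 2, 1, 1, 2, 2, 1, 1]
```

The resulting matrix is symmetric with empty diagonal blocks, i.e., there are no edges among domain-independent words or among domain-specific words; all alignment information flows through the bridge.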

**6.4 Spectral Domain-Specific Feature Alignment**

In this section, we describe our algorithm for adapting spectral clustering techniques to align domain-specific features from different domains for cross-domain sentiment classification. As mentioned above, our proposed method consists of two steps: (1) identifying domain-independent features, and (2) aligning domain-specific features. In the first step, we aim to learn a feature selection function ϕDI(·) to select l domain-independent features, which occur frequently and act similarly across the domains DS and DT. These domain-independent features are used as a bridge to make knowledge transfer across domains possible. After identifying the domain-independent features, we use ϕDS(·) to denote a feature selection function for selecting domain-specific features, defined as the complement of the domain-independent features. In the second step, we aim to learn an alignment function φ : R^(m−l) → R^k that aligns the domain-specific features of both domains into k predefined feature clusters z1, z2, ..., zk, such that the difference between the domain-specific features of different domains, under the new representation constructed from the learned clusters, is dramatically reduced.

**6.4.1 Domain-Independent Feature Selection**

First of all, we need to identify which features are domain independent. As mentioned above, domain-independent features should occur frequently and act similarly in both the source and target domains. In this section, we present several strategies for selecting domain-independent features.

A first strategy is to select domain-independent features based on their frequency in both domains. More specifically, we choose features that occur more than k times in both the source and target domains. Given the number l of domain-independent features to be selected, k is set to be the largest number such that we can get at least l such features. However, there is no guarantee that the selected features act similarly in both domains.

A second strategy is based on the mutual dependence between features and labels on the source domain data. In information theory, mutual information is used to measure the mutual dependence between two random variables. In [25], mutual information is applied on source domain labeled data to select features as "pivots". Feature selection using mutual information, as shown in (6.1), can help identify features relevant to the source domain labels:

    I(X^i, y) = \sum_{y \in \{+1,-1\}} \sum_{x \in X^i} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)},    (6.1)

where X^i and y denote a feature and the class label, respectively.

Here we propose a third strategy for selecting domain-independent features. Motivated by supervised feature selection criteria, we can use mutual information to measure the dependence between features and domains. More specifically, we modify the mutual information criterion to operate between features and domains as follows:

    I(X^i, D) = \sum_{d \in D} \sum_{x \in X^i, x \neq 0} p(x, d) \log_2 \frac{p(x, d)}{p(x)\, p(d)},    (6.2)

where D is a domain variable and we only sum over the non-zero values of a specific feature X^i. The smaller I(X^i, D) is, the more likely X^i can be treated as a domain-independent feature. If a feature has a high mutual information value with the domains, then it is domain specific; otherwise, it is domain independent.

For simplicity, we use WDI and WDS to denote the vocabularies of domain-independent and domain-specific features, respectively. A sentiment example xi can then be divided into two disjoint views: one view consists of the features in WDI, and the other is composed of the features in WDS. We use ϕDI(xi) and ϕDS(xi) to denote the two views, respectively.
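One possible reading of Eq. (6.2) for binary presence features is sketched below; `domain_mi` is a hypothetical helper and the toy documents are invented for illustration, with probabilities estimated from document counts:

```python
import math

def domain_mi(docs_src, docs_tgt, word):
    """I(X^i, D) of Eq. (6.2) for a binary presence feature, summing only
    over the non-zero value x = 1. Lower scores suggest the word behaves
    similarly in both domains (i.e., is a domain-independent candidate)."""
    n_s, n_t = len(docs_src), len(docs_tgt)
    n = n_s + n_t
    c_s = sum(word in d for d in docs_src)   # source docs containing the word
    c_t = sum(word in d for d in docs_tgt)   # target docs containing the word
    p_x = (c_s + c_t) / n                    # p(x = 1)
    mi = 0.0
    for c_d, n_d in ((c_s, n_s), (c_t, n_t)):
        p_xd, p_d = c_d / n, n_d / n         # p(x = 1, d) and p(d)
        if p_xd > 0:
            mi += p_xd * math.log2(p_xd / (p_x * p_d))
    return mi

src = [{"good", "sharp"}, {"good", "hooked"}, {"never_buy", "blurry"}]
tgt = [{"good", "compact"}, {"good", "realistic"}, {"never_buy", "boring"}]
print(domain_mi(src, tgt, "good"))   # ~0: occurs evenly, domain independent
print(domain_mi(src, tgt, "sharp"))  # larger: occurs only in the source domain
```

A word distributed evenly across domains scores near zero, while a word concentrated in one domain scores higher, matching the selection rule stated above.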

We still use the example shown in Table 6.1 to explain the three proposed strategies for domain-independent feature selection. Using the first and third strategies, we may select words such as "good", "never buy" and "very" as domain-independent features. Here, "good" and "never buy" may serve well as a bridge across domains. The word "very", however, should not be used as a bridge for aligning words from different domains: in the electronics domain we may say "this unit is very sharp", while in the video game domain we may say "this game is very boring", so if "very" were used as a bridge, the words "sharp" and "boring" might be aligned into the same cluster, which is not what we expect. In contrast, using the second strategy, we are guaranteed to select sentiment words that are relevant to the class labels, such as "good", "nice" and "sharp" (assuming the electronics domain is the source domain). Note that "sharp" would be selected because it has a high mutual information value with the class labels in the electronics domain, even though it is a domain-dependent word. In an even worse case, some words may have opposite sentiment polarity in different domains, which makes the task more challenging; for example, the polarity of the word "thin" may be positive in the electronics domain but negative in the furniture domain. Hence, domain-independent feature selection is itself a challenging task. We leave the issue of how to develop a criterion for selecting domain-independent words precisely to future work; in this chapter, we focus on the problem of how to model the correlation between domain-independent and domain-specific words for transfer learning.

**6.4.2 Bipartite Feature Graph Construction**

Based on the above strategies for selecting domain-independent features, we can identify which features are domain independent and which ones are domain specific. Given the domain-independent and domain-specific features, we can construct a bipartite graph G = (VDS ∪ VDI, E) between them, where VDS ∪ VDI and E denote the vertices and edges of the graph G. In G, each vertex in VDS corresponds to a domain-specific word in WDS, and each vertex in VDI corresponds to a domain-independent word in WDI. An edge in E connects a vertex in VDS and a vertex in VDI; note that there are no intra-set edges linking two vertices within VDS or within VDI. Furthermore, each edge eij ∈ E is associated with a non-negative weight mij. The score mij measures the relationship between the words wi ∈ WDS and wj ∈ WDI in DS and DT (e.g., the total number of co-occurrences of wi and wj in DS and DT). A bipartite graph example, constructed from the example shown in Table 6.1, is shown in Figure 6.1.

Besides using the co-occurrence frequency of words within documents, we can adopt more refined methods to estimate mij. For example, we can define a reasonable "window size" and connect a domain-specific word and a domain-independent word by an edge only if they co-occur within the window, assuming that if the distance between two words exceeds the "window size", then the correlation between them is very weak. We can also use the distance between wi and wj to adjust the score mij: the smaller their distance, the larger the weight we assign to the corresponding edge. In this chapter, for simplicity, we set the "window size" to be the maximum length of all documents, and we do not consider word positions when determining the edge weights.

Figure 6.1: A bipartite graph example of domain-specific and domain-independent features.

**6.4.3 Spectral Feature Clustering**

In the previous section, we presented how to construct a bipartite graph between domain-specific and domain-independent features. In this section, we show how to adapt a spectral clustering algorithm on the feature bipartite graph to align domain-specific features. Spectral graph theory has been widely applied in many problems, e.g., dimensionality reduction and clustering [130, 13, 54], under the assumptions that (1) there is a low-dimensional latent space underlying a complex graph, and (2) two nodes that are similar in the original graph should remain similar in the latent space. In spectral graph theory [41], if two nodes in a graph are connected to many common nodes, then these two nodes should be very similar (or closely related). In our case, we accordingly assume that (1) if two domain-specific features are connected to many common domain-independent features, then they tend to be closely related and will be aligned to the same cluster with high probability; and (2) if two domain-independent features are connected to many common domain-specific features, then they also tend to be closely related and will be aligned to the same cluster with high probability. Based on these two assumptions, we expect that the mismatch between domain-specific features can be alleviated by applying graph spectral techniques on the feature bipartite graph to discover a more compact and meaningful representation for domain-specific features, which can reduce the gap between domains. We want to show that, by constructing a simple bipartite graph and adapting spectral clustering techniques on it, we can align domain-specific features effectively.
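To make the bipartite construction of Section 6.4.2 concrete, here is a small sketch in which the documents and word lists are invented and the whole document serves as the "window" (the simple setting adopted above); the weight matrix M and the intra-edge-free affinity matrix are built as:

```python
import numpy as np

# Toy corpus: each document is the set of words it contains.
docs = [{"sharp", "good"}, {"hooked", "good", "exciting"},
        {"blurry", "never_buy"}, {"compact", "good"}]
W_DS = ["sharp", "hooked", "compact", "blurry"]   # domain-specific words
W_DI = ["good", "exciting", "never_buy"]          # domain-independent words

# m_ij = number of documents in which w_i (domain-specific) co-occurs
# with w_j (domain-independent).
M = np.zeros((len(W_DS), len(W_DI)))
for d in docs:
    for i, wi in enumerate(W_DS):
        for j, wj in enumerate(W_DI):
            if wi in d and wj in d:
                M[i, j] += 1

# Affinity matrix of the bipartite graph: zero blocks on the diagonal
# because there are no intra-set edges.
A = np.block([[np.zeros((len(W_DS),) * 2), M],
              [M.T, np.zeros((len(W_DI),) * 2)]])
print(M)
```

The zero diagonal blocks encode exactly the "no intra-set edges" property of the bipartite graph, and A is symmetric by construction.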

Given the feature bipartite graph G, our goal is to learn a feature alignment mapping function φ(·) : R^(m−l) → R^k, where m is the number of all features, l is the number of domain-independent features, and m − l is the number of domain-specific features. Before we present how to adapt a spectral clustering algorithm to align domain-specific features, we first briefly review a standard spectral clustering algorithm [130]. Given a set of points V = {v1, v2, ..., vn} and their corresponding weighted graph G, the goal is to cluster the points into k clusters, where k is an input parameter:

1. Form an affinity matrix A ∈ R^(n×n) for V, where Aij = mij if i ≠ j, and Aii = 0.
2. Form a diagonal matrix D with Dii = Σj Aij, and construct the matrix L = D^(−1/2) A D^(−1/2).
3. Find the k largest eigenvectors u1, u2, ..., uk of L, and form the matrix U = [u1 u2 ... uk] ∈ R^(n×k).
4. Normalize the rows of U, such that Uij = Uij / (Σj Uij²)^(1/2).
5. Apply the k-means algorithm on the rows of U to cluster the n points into k clusters.

Note that in spectral graph theory [41] and Laplacian Eigenmaps [13], the Laplacian matrix is defined as I − L, where I is an identity matrix. This change only shifts the eigenvalues (from λi to 1 − λi) and has no impact on the eigenvectors; selecting the k smallest eigenvectors of I − L in [41, 13] is therefore equivalent to selecting the k largest eigenvectors of L in this chapter.

The standard spectral clustering algorithm assigns the n points to k discrete cluster indicators, which can be referred to as "discrete clustering". Zha et al. [220] and Ding and He [55] have proven that the k principal components of a term-document co-occurrence matrix are actually the continuous solution of the cluster membership indicators of documents in the k-means clustering method. In other words, the k principal components can automatically perform data clustering in the subspace spanned by the k principal components. This implies that a mapping function constructed from the k principal components can cluster the original data and map them to a new space spanned by the clusters simultaneously. Motivated by this discovery, in the following we show how to adapt the spectral clustering algorithm for cross-domain feature alignment.
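The five steps above can be sketched in plain NumPy as follows; the deterministic center initialization in step 5 is our own simplification for reproducibility, not part of the original algorithm:

```python
import numpy as np

def spectral_clustering(A, k, iters=20):
    """Steps 1-5 of the standard spectral clustering algorithm [130],
    as a plain NumPy sketch."""
    A = A - np.diag(np.diag(A))                       # step 1: A_ii = 0
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ A @ D_inv_sqrt                   # step 2: D^-1/2 A D^-1/2
    U = np.linalg.eigh(L)[1][:, -k:]                  # step 3: k largest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # step 4: row normalization
    centers = []                                      # step 5: k-means on rows of U,
    for row in U:                                     # initialized with k distinct rows
        if all(np.linalg.norm(row - c) > 1e-6 for c in centers):
            centers.append(row.copy())
        if len(centers) == k:
            break
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([U[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

# Two groups of points connected only within their own group.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
labels = spectral_clustering(A, 2)
print(labels)  # first three points share one label, the last three the other
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the last k columns are the k largest eigenvectors required by step 3.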

In addition, it has been proven that clustering two related sets of points simultaneously can often produce better results than clustering a single set of points alone [54]. Thus, although our goal is only to cluster domain-specific features, we co-cluster the domain-specific and domain-independent features. The adapted algorithm is as follows:

1. Form a weight matrix M ∈ R^((m−l)×l), where Mij corresponds to the co-occurrence relationship between a domain-specific word wi ∈ WDS and a domain-independent word wj ∈ WDI.
2. Form the affinity matrix A = [0 M; Mᵀ 0] ∈ R^(m×m) of the bipartite graph, where the first m − l rows and columns correspond to the m − l domain-specific features, and the last l rows and columns correspond to the l domain-independent features.
3. Form a diagonal matrix D with Dii = Σj Aij, and construct the matrix L = D^(−1/2) A D^(−1/2).
4. Find the k largest eigenvectors u1, u2, ..., uk of L, and form the matrix U = [u1 u2 ... uk] ∈ R^(m×k).
5. Define the feature alignment mapping function as φ(x) = x U[1:m−l,:], where U[1:m−l,:] denotes the first m − l rows of U and x ∈ R^(1×(m−l)).

Note that the affinity matrix A constructed in Step 2 is similar to the affinity matrix of the term-document bipartite graph proposed in [54], which is used for spectrally co-clustering terms and documents simultaneously.

**6.4.4 Feature Augmentation**

If we could select domain-independent features and align domain-specific features perfectly, then we could simply augment the domain-independent features with the features obtained via feature alignment to generate a perfect representation for cross-domain sentiment classification. However, in practice, we may not be able to identify the domain-independent features correctly, and may thus fail to perform feature alignment perfectly. Therefore, similar to the strategy used in [4, 25], we augment all the original features with the features learned by feature alignment to construct a new representation, using a tradeoff parameter γ to balance the effect of the original features and the new features. Given the feature alignment mapping function φ(·), for a data example xi in either the source or the target domain, we first apply ϕDS(·) to extract the view associated with the domain-specific features of xi, and then apply φ(·) to obtain a new representation φ(ϕDS(xi)) of this view.

Thus, for each data example xi, the new feature representation is defined as x̃i = [xi, γφ(ϕDS(xi))], where xi ∈ R^(1×m), x̃i ∈ R^(1×(m+k)) and 0 ≤ γ ≤ 1. In practice, the value of γ can be determined by evaluation on some heldout data. The whole process of our proposed framework for cross-domain sentiment classification is presented in Algorithm 6.1.

Algorithm 6.1 Spectral Feature Alignment (SFA) for cross-domain sentiment classification
Require: a labeled source domain data set DS = {(xSi, ySi)}, i = 1, ..., nS; an unlabeled target domain data set DT = {xTj}, j = 1, ..., nT; the number of clusters K; and the number of domain-independent features l.
Ensure: predicted labels YT of the unlabeled data XT in the target domain.
1: Apply the criteria described in Chapter 6.4.1 on DS and DT to select l domain-independent features, treating the remaining m − l features as domain-specific features, and form the two views ΦDI = [ϕDI(xS); ϕDI(xT)] and ΦDS = [ϕDS(xS); ϕDS(xT)].
2: Using ΦDI and ΦDS, calculate the (DI-word)-(DS-word) co-occurrence matrix M ∈ R^((m−l)×l).
3: Construct the matrix L = D^(−1/2) A D^(−1/2), where A = [0 M; Mᵀ 0] and Dii = Σj Aij.
4: Find the K largest eigenvectors u1, u2, ..., uK of L, form the matrix U = [u1 u2 ... uK] ∈ R^(m×K), and let the mapping be φ(xi) = xi U[1:m−l,:], where xi ∈ R^(1×(m−l)).
5: Train a classifier f on {([xSi, γφ(ϕDS(xSi))], ySi)}, i = 1, ..., nS.
6: return f(xTi)'s.

As can be seen in the algorithm, in the first step we select a set of domain-independent features using one of the three strategies introduced in Chapter 6.4.1. In the second step, we calculate the corresponding co-occurrence matrix M and then construct the normalized matrix L corresponding to the bipartite graph of the domain-independent and domain-specific features. Eigen-decomposition is then performed on L to find the K leading eigenvectors, which are used to construct the mapping φ. Finally, training and testing are both performed on the augmented representations.

**6.4.5 Computational Issues**

Note that the computational cost of the SFA algorithm is dominated by the eigen-decomposition of the matrix L ∈ R^(m×m). In general, this takes O(Km²) time [175], where m is the total number of features and K is the dimensionality of the latent space.
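Steps 3–4 of Algorithm 6.1, together with the feature augmentation of Section 6.4.4, might look as follows under assumed sizes and random stand-in data (a sketch, not the thesis implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m_l, l, K, gamma = 8, 4, 2, 0.6          # hypothetical sizes: m - l = 8, l = 4
M = rng.random((m_l, l))                 # stand-in co-occurrence matrix

# Steps 3-4: normalized bipartite affinity and its K largest eigenvectors.
A = np.block([[np.zeros((m_l, m_l)), M],
              [M.T, np.zeros((l, l))]])
d = A.sum(axis=1)
L = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
U = np.linalg.eigh(L)[1][:, -K:]         # eigh sorts ascending; take last K

def phi(x_ds):
    """Feature alignment mapping: phi(x) = x U[1:m-l, :]."""
    return x_ds @ U[:m_l, :]

# Feature augmentation: keep all original features, append the aligned view.
x = rng.random(m_l + l)                  # an example: [domain-specific | domain-independent]
x_aug = np.concatenate([x, gamma * phi(x[:m_l])])
print(x_aug.shape)  # (14,)
```

The augmented vector has the original m features followed by the K aligned cluster features, matching the [xi, γφ(ϕDS(xi))] construction above.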

If m is large, this eigen-decomposition can be computationally expensive. However, in practice, the bipartite feature graph defined in Chapter 6.4.2 is extremely sparse, and hence the matrix L is sparse as well. In this case, we can apply the Implicitly Restarted Arnoldi method to solve the eigen-decomposition of L iteratively and efficiently [175]. The computational time is then approximately O(Kpm), where p is the number of iterations of the Implicitly Restarted Arnoldi method, specified by the user.

**6.4.6 Connection to Other Methods**

Different from state-of-the-art cross-domain sentiment classification algorithms such as the SCL algorithm, which only learns common structures underlying domain-specific words without fully exploiting the relationship between domain-independent and domain-specific words, SFA can better capture the intrinsic relationship between domain-independent and domain-specific words via the bipartite graph representation, and can learn a more compact and meaningful representation underlying the graph by co-aligning the domain-independent and domain-specific words. Experiments on two real-world datasets indicate that SFA is indeed promising, obtaining better accuracy for cross-domain sentiment classification than several baselines, including SCL.

**6.7 Experiments**

In this section, we describe our experiments on two real-world datasets and show the effectiveness of our proposed SFA for cross-domain sentiment classification.

**6.7.1 Experimental Setup**

Datasets. We first describe the datasets used in our experiments. The first dataset is from Blitzer et al. [25]. It contains a collection of product reviews from Amazon.com about four product domains: books (B), dvds (D), electronics (E) and kitchen appliances (K). In each domain, there are 1,000 positive reviews and 1,000 negative ones. Each review is assigned a sentiment label, −1 (negative review) or +1 (positive review), based on the rating score given by the review author. As a result, we can construct 12 cross-domain sentiment classification tasks: D → B, E → B, K → B, B → D, E → D, K → D, B → E, D → E, K → E, B → K, D → K and E → K,
where the word before an arrow denotes the source domain and the word after an arrow denotes the target domain.

We use RevDat to denote this dataset; the sentiment classification task on RevDat is document-level sentiment classification. The other dataset was collected by us for experimental purposes. We crawled a set of reviews from the Amazon (http://www.amazon.com/) and TripAdvisor (http://www.tripadvisor.com/) websites. The reviews from Amazon are about three product domains: video games (V), electronics (E) and software (S); the TripAdvisor reviews are about the hotel (H) domain. Instead of assigning each review a label, we split the reviews into sentences and manually assigned a polarity label to each sentence, so the sentiment classification task on this dataset is sentence-level sentiment classification. In each domain, we randomly selected 1,500 positive sentences and 1,500 negative ones for the experiments. We use SentDat to denote this dataset. Similarly, we construct 12 cross-domain sentiment classification tasks: V → E, S → E, H → E, E → V, S → V, H → V, E → S, V → S, H → S, E → H, V → H and S → H. For both datasets, we use unigram and bigram features to represent each data example (a review in RevDat and a sentence in SentDat). A summary of the datasets is given in Table 6.5.

Table 6.5: Summary of datasets for evaluation.

  Dataset  Domain       # Reviews  # Pos  # Neg  # Features
  RevDat   books        2,000      1,000  1,000  473,856
           dvds         2,000      1,000  1,000
           electronics  2,000      1,000  1,000
           kitchen      2,000      1,000  1,000
  SentDat  hotel        3,000      1,500  1,500  287,504
           video game   3,000      1,500  1,500
           electronics  3,000      1,500  1,500
           software     3,000      1,500  1,500

Baselines. In order to investigate the effectiveness of our method, we compare it with several algorithms. One baseline, denoted by NoTransf, is a classifier trained directly on the source domain training data; e.g., for the D → B task, NoTransf is trained with labeled data from the D domain. The gold standard, denoted by upperBound, is an in-domain classifier trained with labeled data from the target domain; for the D → B task, upperBound corresponds to a classifier trained with the labeled data from the B domain, and its performance on the D → B task can also be regarded as an upper bound for the E → B and K → B tasks. Another baseline, denoted by PCA, is a classifier trained on a new representation that augments the original

features with new features learned by applying latent semantic analysis (which can also be referred to as principal component analysis) [53] on the original view of the domain-specific features (as shown in Table 6.2). A third baseline, denoted by FALSA, is a classifier trained on a new representation that augments the original features with new features learned by applying latent semantic analysis on the co-occurrence matrix of domain-independent and domain-specific features. We compare our method with PCA and FALSA in order to investigate whether spectral feature clustering is effective in aligning domain-specific features. We also compare our algorithm with structural correspondence learning (SCL), proposed in [25]. We follow the details described in Blitzer's thesis [23] and implement SCL with logistic regression to construct the auxiliary tasks. Note that SCL, FALSA and our proposed SFA all use unlabeled data from the source and target domains to learn a new representation, and then train classifiers on the labeled source domain data under the new representations.

Parameter Settings & Evaluation Criteria. For NoTransf, upperBound, PCA, FALSA and SFA, we use logistic regression as the basic sentiment classifier. The tradeoff parameter C in logistic regression [64] is set to 10000, which is equivalent to setting λ = 0.0001 in [23]. The library implemented in [64] is used in all our experiments. For SFA, we use Eqn. (6.2) defined in Chapter 6.4.1 to identify domain-independent and domain-specific features. For all experiments on RevDat, we randomly split each domain's data into a training set of 1,600 instances and a test set of 400 instances. For all experiments on SentDat, we randomly split each domain's data into a training set of 2,000 instances and a test set of 1,000 instances. The evaluation of cross-domain sentiment classification methods is conducted on the test set of the target domain, without labeled training data from that domain. The parameters of each model are tuned on some heldout data of the E → B task of RevDat and the H → S task of SentDat, and are then fixed for all experiments. We report the average results of 5 random trials.

We use accuracy to evaluate the sentiment classification results: the percentage of correctly classified examples over all test examples,

    Accuracy = |{x | x ∈ Dtst ∧ f(x) = y}| / |{x | x ∈ Dtst}|,

where Dtst denotes the test data, y is the ground-truth sentiment polarity and f(x) is the predicted sentiment polarity.
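The accuracy measure above is straightforward to compute; a minimal helper might look like:

```python
def accuracy(preds, labels):
    """Accuracy: fraction of test examples with f(x) = y."""
    assert len(preds) == len(labels) and len(labels) > 0
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))  # 0.75
```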

**6.7.2 Results**

Overall Comparison between SFA and Baselines. In this section, we compare the accuracy of SFA with NoTransf, upperBound, PCA, FALSA and SCL on the 24 tasks of the two datasets. We adopt the following settings for SFA: the number of domain-independent features l = 500, the number of domain-specific feature clusters k = 100, and the feature augmentation parameter γ = 0.6. For SCL, we use mutual information to select "pivots"; the number of "pivots" is set to 500, and the dimensionality h in [25] is set to 50. All these parameters and the domain-independent feature (or "pivot") selection methods are determined from results on the heldout data mentioned in the previous section. Studies of the SFA parameters are presented later in this section.

Figure 6.2: Comparison results (unit: %) on the two datasets: (a) RevDat, (b) SentDat. Each group of bars represents a cross-domain sentiment classification task, each bar in a specific color represents a specific method, and the horizontal lines are the accuracies of upperBound.

Figure 6.2(a) shows the comparison results of the different methods on RevDat. From the figure, we can observe that the four domains of RevDat can be roughly divided into two groups: B and D are similar to each other, as are K and E, but the two groups differ from each other. Adapting a classifier from the K domain to the E domain is thus much easier than adapting one from the B domain. Clearly, our proposed SFA performs better than the other methods,
it is hard to construct a reasonable number of auxiliary tasks that are useful to model the relationship between “pivots” and “non-pivots”. (6. One interesting observation from the results is that SCL does not work well compared to its performance on RevDat.2) to identify domain-independent and domain-speciﬁc features. The second one is to test the effect of different numbers of domain-independent features on SFA performance.95 conﬁdence intervals. it has been proved that clustering two related sets of points simultaneously can often get better results than clustering one single set of points only [54].6. but work very bad in some tasks such as E → D and D → E of RevDat. but also use graph spectral clustering techniques to co-align both kinds of features to discover meaningful clusters for domain-speciﬁc features. frequency of features in both domains and mutual information between features and labels in the source domain respectively. The reason is that applying mutual information on source domain data 89 . we summarize the comparison results of SFA using different methods to identify domainindependent features. we can observe that SFADI and SFAF Q achieve comparable results and they are stable in most tasks.1. The reason is that representing domain-speciﬁc features via domain-independent features can reduce the gap between domains and thus ﬁnd a reasonable representation for cross-domain sentiment classiﬁcation. The performance of SCL highly relies on the auxiliary tasks. From the table. (6.2(b). the data are quite sparse. SFAF Q and SFAM I to denote SFA using Eqn. Thus in this dataset.4. While SFAM I may work very well in some tasks such as K → D and E → B of RevDat. As mentioned in Chapter 6. Effect of Domain Independent Features of SFA In this section. we can also use the other two strategies to identify them. One reason may be that in sentence-level sentiment classiﬁcation. but its performance may even drop on other tasks. 
including the state-of-the-art method SCL, in most tasks. We perform t-tests on the comparison results of the two datasets and find that SFA outperforms the other methods at the 0.95 confidence level. It is not surprising that FALSA achieves significant improvement over NoTransf and PCA: representing domain-specific features via domain-independent features can reduce the gap between domains and thus yields a reasonable representation for cross-domain sentiment classification. In contrast, clustering domain-specific features under a bag-of-words representation may fail to find a meaningful new representation for cross-domain sentiment classification, so PCA only outperforms NoTransf slightly in some tasks. Our proposed SFA not only utilizes the co-occurrence relationship between domain-independent and domain-specific features to reduce the gap between domains, but also uses graph spectral clustering techniques to co-align both kinds of features and discover meaningful clusters of domain-specific features; although our goal is only to cluster domain-specific features, it has been proven that clustering two related sets of points simultaneously can often produce better results than clustering a single set alone [54].

From the comparison results on SentDat shown in Figure 6.2(b), we can draw a similar conclusion: SFA outperforms the other methods in most tasks. One interesting observation is that SCL does not work as well here as on RevDat, even performing worse than FALSA in some tasks. One reason may be that in sentence-level sentiment classification the data are quite sparse, so it is hard to construct a reasonable number of auxiliary tasks that are useful for modeling the relationship between "pivots" and "non-pivots"; the performance of SCL relies heavily on the auxiliary tasks.

Effect of Domain-Independent Features on SFA. In this section, we conduct two experiments to study the effect of domain-independent features on the performance of SFA. The first experiment tests the effect of domain-independent features identified by different methods on the overall performance of SFA; the second tests the effect of different numbers of domain-independent features. As mentioned in Chapter 6.4.1, besides using Eqn. (6.2) to identify domain-independent features, we can also use the other two strategies: feature frequency in both domains, and mutual information between features and labels in the source domain. We use SFADI, SFAFQ and SFAMI to denote SFA using Eqn. (6.2), frequency, and label mutual information, respectively. In Table 6.6, we summarize the comparison results of SFA using the different methods to identify domain-independent features. From the table, we can observe that SFADI and SFAFQ achieve comparable results and are stable in most tasks, while SFAMI may work very well in some tasks, such as K → D and E → B of RevDat, but works badly in others, such as E → D and D → E of RevDat. The reason is that applying mutual information on source domain data

can find features that are relevant to the source domain labels, but it cannot guarantee that the selected features are domain independent: the selected features may be irrelevant to the labels of the target domain. Overall, SFA is robust to the quality and number of domain-independent features.

Table 6.6: Experiments with different domain-independent feature selection methods (SFADI, SFAFQ, SFAMI, and the combinations SFADI+MI and SFAFQ+MI) on all 24 tasks of RevDat and SentDat. Numbers in the table are accuracies in percentage.

Model Parameter Sensitivity of SFA. Besides the number of domain-independent features l, there are two other parameters in SFA: the number of domain-specific feature clusters k, and the tradeoff parameter γ in feature augmentation. In this section, we further test the sensitivity of these parameters on the overall performance of SFA.

To test the effect of the number of domain-independent features, we apply SFA to all 24 tasks of the two datasets, fixing k = 100 and γ = 0.6, and varying l from 300 to 700 with step length 100. The results are shown in Figures 6.3(a) and 6.3(b). From the figures, we can see that when l is in the range [400, 700], SFA performs well and stably in most tasks.

Figure 6.3: Study on varying numbers of domain-independent features of SFA: (a) results on RevDat, (b) results on SentDat.

We first test the sensitivity of the parameter k. In this experiment, we fix l = 500 and γ = 0.6, and change the value of k from 50 to 200 with step length 25. Figures 6.4(a) and 6.4(b) show the results of SFA under varying values of k. When the cluster number k falls in the range from 75 to 175, SFA works stably in most tasks. Note that for some tasks, such as V → H, H → S and E → H, SFA's performance degrades as the value of k increases; however, in most tasks SFA performs well and stably.

Finally, we test the sensitivity of the parameter γ. In this experiment, we fix l = 500 and k = 100, and change the value of γ from 0.1 to 1 with step length 0.1. Results are shown in Figures 6.5(a) and 6.5(b). Apparently, when γ < 0.3, SFA works badly in most tasks, which means the original features dominate the effect of the new feature representation; when γ ≥ 0.3, SFA performs well and stably in most tasks, which implies that the features learned by the spectral feature alignment algorithm do boost the performance of cross-domain sentiment classification.
One limitation of SFA is that it is domain-driven, which means it is hard to apply to other applications. However, the success of SFA in sentiment classification suggests that in some domain areas, where domain knowledge is available or can be easily obtained, developing a domain-driven method may be more effective than general solutions for learning the latent space for transfer learning.

Figure 6.4: Model parameter sensitivity study of k on two datasets. (a) Results on RevDat under varying values of k. (b) Results on SentDat under varying values of k.

6.8 Summary

In this section, we have studied the problem of how to develop a latent space for transfer learning when explicit domain knowledge is available in some specific applications. We proposed a spectral feature alignment (SFA) algorithm for cross-domain sentiment classification. Experimental results showed that our proposed method outperforms a state-of-the-art method in cross-domain sentiment classification.

Figure 6.5: Model parameter sensitivity study of γ on two datasets. (a) Results on RevDat under varying values of γ. (b) Results on SentDat under varying values of γ.

**CHAPTER 7 CONCLUSION AND FUTURE WORK**

7.1 Conclusion

Transfer learning is still at an early but promising stage, and many research issues remain to be addressed. In this thesis, we first surveyed the field of transfer learning: we gave definitions of transfer learning, summarized transfer learning into different settings, categorized transfer learning approaches into four contexts, analyzed the relationship between transfer learning and other related areas, discussed some interesting research issues in transfer learning, and introduced some applications of transfer learning.

In this thesis, we focus on the transductive transfer learning setting, and propose two different feature-based transfer learning frameworks based on two different situations: (1) domain knowledge is hidden or hard to capture, and (2) domain knowledge can be observed or is easy to encode into learning. For the first situation, we proposed a novel and general dimensionality reduction framework for transfer learning, which aims to learn a latent space that simultaneously reduces the distance between domains and preserves properties of the original data. Based on this framework, we proposed three different algorithms to learn the latent space, known as Maximum Mean Discrepancy Embedding (MMDE) [135], Transfer Component Analysis (TCA) [139] and Semi-supervised Transfer Component Analysis (SSTCA) [140]. MMDE learns the latent space by learning a non-parametric kernel matrix, which may be more precise. However, the non-parametric kernel matrix learning requires solving a semi-definite program (SDP), which is extremely expensive (O((nS + nT)^6.5)). In contrast, TCA and SSTCA learn a parametric kernel matrix instead, which may make the latent space less precise, but avoids the use of an SDP solver and is thus much more efficient (O(m(nS + nT)^2), where m ≪ nS + nT). Hence, in practice, TCA and SSTCA are more applicable. We applied these three algorithms to two diverse applications, WiFi localization and text classification, and achieved promising results.
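To make the efficiency contrast concrete, the core of TCA reduces to one linear solve plus one eigendecomposition rather than an SDP. The sketch below is a simplified reading of the published formulation; the kernel blocks, the regularization constant `mu`, and the dense eigensolver are illustrative choices, not the thesis implementation:

```python
import numpy as np

def tca_components(Ks, Kt, Kst, m=5, mu=1.0):
    """Simplified TCA sketch: find m transfer components from kernel blocks.

    Ks: (ns, ns) source kernel, Kt: (nt, nt) target kernel,
    Kst: (ns, nt) cross-domain kernel.  Returns W of shape (ns+nt, m);
    K @ W gives the m-dimensional embedding of all points.
    """
    ns, nt = Ks.shape[0], Kt.shape[0]
    n = ns + nt
    K = np.block([[Ks, Kst], [Kst.T, Kt]])

    # MMD coefficient matrix L: tr(KL) equals the squared MMD between the
    # empirical source and target distributions in the kernel space.
    L = np.empty((n, n))
    L[:ns, :ns] = 1.0 / ns**2
    L[ns:, ns:] = 1.0 / nt**2
    L[:ns, ns:] = -1.0 / (ns * nt)
    L[ns:, :ns] = -1.0 / (ns * nt)

    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix

    # W maximizes preserved variance (tr W'KHKW) while minimizing the MMD
    # term (tr W'KLKW): top-m eigenvectors of (KLK + mu*I)^{-1} KHK.
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)[:m]
    return eigvecs[:, order].real
```

Everything here costs at most a cubic matrix operation on an (nS + nT)-sized matrix, which is what makes TCA practical where MMDE's SDP is not.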
In order to capture and apply the domain knowledge inherent in many application areas, we also proposed a Spectral Feature Alignment (SFA) [137] algorithm for cross-domain sentiment classification, which aims at aligning domain-specific words from different domains in a latent space by modeling the correlation between the domain-independent and domain-specific words in a bipartite graph, using the domain-independent features as a bridge. Experimental results show that SFA achieves strong classification performance in cross-domain sentiment classification.
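The bipartite construction behind SFA can be illustrated with a small sketch. This is not the thesis code; the co-occurrence matrix M and the symmetric normalization below are assumptions that follow the textual description (spectral clustering on the bipartite graph of domain-independent and domain-specific words):

```python
import numpy as np

def sfa_alignment(M, k=3):
    """Bipartite spectral alignment sketch (not the thesis implementation).

    M: (n_di, n_ds) co-occurrence counts between n_di domain-independent
    words and n_ds domain-specific words.  Returns an (n_ds, k) mapping of
    the domain-specific words into a shared k-dimensional space.
    """
    n_di, n_ds = M.shape
    # Affinity matrix of the bipartite graph.
    A = np.zeros((n_di + n_ds, n_di + n_ds))
    A[:n_di, n_di:] = M
    A[n_di:, :n_di] = M.T
    # Symmetric normalization D^{-1/2} A D^{-1/2}, spectral-clustering style.
    d = A.sum(axis=1)
    d[d == 0] = 1.0          # guard against isolated words
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt
    # Top-k eigenvectors give coordinates for every word; keep the rows
    # corresponding to the domain-specific words.
    eigvals, eigvecs = np.linalg.eigh(A_norm)
    U = eigvecs[:, np.argsort(-eigvals)[:k]]
    return U[n_di:, :]
```

Domain-specific words from different domains that co-occur with the same domain-independent words land close together in the k-dimensional space, which is the alignment effect described above; the trade-off parameter γ then weights these new features against the original ones.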


7.2 Future Work

In the future, we plan to continue to explore our work on transfer learning along the following directions.

• We plan to study the proposed dimensionality reduction framework for transfer learning theoretically. Research issues include: (1) how to adaptively determine the number of reduced dimensions; (2) how to derive a generalization error bound for the proposed framework; and (3) how to develop an efficient algorithm for kernel choice in TCA and SSTCA based on the recent theoretical study in [176].

• We plan to extend the dimensionality reduction framework in a relational learning manner, which means that data in the source and target domains can be relational instead of independently distributed. Most previous transfer learning methods assumed data from different domains to be independently distributed. However, in many real-world applications, such as link prediction in social networks or classifying Web pages within a Web site, data are often intrinsically relational, which presents a major challenge to transfer learning [52]. We plan to develop a relational dimensionality reduction framework, which can encode the relational information among data into the latent space learning, while also reducing the difference between domains and preserving properties of the original data.

• We plan to study the negative transfer issue. It has been shown that when the source and target tasks are unrelated, the knowledge extracted from a source task may not help, and may even hurt, the performance on the target task [159]. Thus, how to avoid negative transfer and ensure safe transfer of knowledge is crucial in transfer learning. Recently, in [30], we proposed an Adaptive Transfer learning algorithm based on Gaussian Processes (AT-GP), which can adapt the transfer learning scheme by automatically estimating the similarity between a source and a target task.
Preliminary experimental results show that the proposed algorithm is promising for avoiding negative transfer. In the future, we plan to develop a criterion to measure whether there exists a latent space onto which knowledge can be safely transferred across domains or tasks. Based on this criterion, we can develop a general framework to automatically identify whether the source and target tasks are related, and determine whether we should project the data from the source and target domains onto the latent space.
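One simple ingredient for such a criterion is the Maximum Mean Discrepancy used throughout this thesis: a large empirical MMD between two domains under a characteristic kernel is a warning sign that direct transfer may be unsafe. A minimal sketch, where the RBF kernel, its bandwidth, and the biased estimator are illustrative assumptions rather than a validated test:

```python
import numpy as np

def mmd2_rbf(Xs, Xt, sigma=1.0):
    """Biased empirical squared MMD between samples Xs and Xt (RBF kernel).

    Near zero when the two samples come from the same distribution;
    grows as the domains drift apart.
    """
    def k(A, B):
        # Pairwise squared Euclidean distances, then a Gaussian kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return k(Xs, Xs).mean() + k(Xt, Xt).mean() - 2 * k(Xs, Xt).mean()
```

In a safe-transfer check, one would compare this statistic (or a permutation-test p-value built on it) against a threshold before deciding to project both domains onto a shared latent space.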


REFERENCES

[1] Ahmed Abbasi, Hsinchun Chen, and Arab Salem. Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Transactions on Information Systems, 26(3):1–34, June 2008.
[2] Eneko Agirre and Oier Lopez de Lacalle. On robustness and domain adaptation using SVD for word sense disambiguation. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 17–24. ACL, June 2008.
[3] Morteza Alamgir, Moritz Grosse-Wentrup, and Yasemin Altun. Multitask learning for brain-computer interfaces. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 17–24. JMLR W&CP, May 2010.
[4] Rie K. Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[5] Rie Kubota Ando and Tong Zhang. A high-performance semi-supervised learning method for text chunking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 1–9. ACL, June 2005.
[6] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48. MIT Press, 2007.
[7] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 71–85. Springer, September 2008.
[8] Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. A spectral regularization framework for multi-task structure learning. In Annual Conference on Neural Information Processing Systems 20, pages 25–32. MIT Press, 2008.
[9] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Proceedings of the 7th IEEE International Conference on Data Mining Workshops, pages 77–82. IEEE Computer Society, 2007.


[10] Jing Bai, Ke Zhou, Guirong Xue, Hongyuan Zha, Gordon Sun, Belle Tseng, Zhaohui Zheng, and Yi Chang. Multi-task learning for learning to rank in web search. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009.
[11] Bart Bakker and Tom Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.
[12] Maxim Batalin, Gaurav Sukhatme, and Myron Hattig. Mobile robot navigation using a sensor network. In Proceedings of IEEE International Conference on Robotics and Automation. IEEE, 2004.
[13] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[14] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
[15] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[16] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[17] Shai Ben-David, Tyler Lu, Teresa Luu, and David Pal. Impossibility theorems for domain adaptation. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9. JMLR W&CP, 2010.
[18] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Proceedings of the Sixteenth Annual Conference on Learning Theory. 2003.
[19] Indrajit Bhattacharya, Shantanu Godbole, Sachindra Joshi, and Ashish Verma. Cross-guided clustering: Transfer of relevant supervision across domains for improved clustering. In Proceedings of the 9th IEEE International Conference on Data Mining. IEEE Computer Society, 2009.
[20] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for HIV therapy screening. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[21] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
[22] Steffen Bickel, Christoph Sawade, and Tobias Scheffer. Transfer learning by distribution matching for targeted advertising. In Advances in Neural Information Processing Systems 21. MIT Press, 2009.
[23] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. Learning bounds for domain adaptation. In Annual Conference on Neural Information Processing Systems 20. MIT Press, 2008.
[24] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. ACL, 2007.
[25] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 2006.
[26] John Blitzer. Domain Adaptation of Natural Language Processing Systems. PhD thesis, The University of Pennsylvania, 2008.
[27] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Learning Theory. 1998.
[28] Edwin Bonilla, Kian Ming Chai, and Chris Williams. Multi-task gaussian process prediction. In Annual Conference on Neural Information Processing Systems 20. MIT Press, 2008.
[29] Bin Cao, Nathan N. Liu, and Qiang Yang. Transfer learning for collective link prediction in multiple heterogenous domains. In Proceedings of the 27th International Conference on Machine Learning. ACM, 2010.
[30] Bin Cao, Sinno Jialin Pan, Yu Zhang, Dit-Yan Yeung, and Qiang Yang. Adaptive transfer learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. AAAI Press, 2010.
[31] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[32] Kian Ming A. Chai, Christopher K. I. Williams, Stefan Klanke, and Sethu Vijayakumar. Multi-task gaussian process learning of robot inverse dynamics. In Advances in Neural Information Processing Systems 21. MIT Press, 2009.
[33] Yee Seng Chan and Hwee Tou Ng. Domain adaptation with active learning for word sense disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. ACL, 2007.
[34] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
[35] Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang, and Belle Tseng. Multi-task learning for boosting with application to web search ranking. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
[36] Bo Chen, Wai Lam, Ivor W. Tsang, and Tak-Lam Wong. Extracting discriminative concepts for domain adaptation in text mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.
[37] Jianhui Chen, Lei Tang, Jun Liu, and Jieping Ye. A convex formulation for learning shared structures from multiple tasks. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[38] Lin Chen, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Tag-based web photo retrieval improved by batch mode re-tagging. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010.
[39] Tianqi Chen, Jun Yan, Gui-Rong Xue, and Zheng Chen. Transfer learning for behavioral targeting. In Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
[40] Yuqiang Chen, Ou Jin, Gui-Rong Xue, Jia Chen, and Qiang Yang. Visual contextual advertising: Bringing textual advertisements to images. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. AAAI Press, 2010.
[41] Fan R. K. Chung. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[42] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[43] Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. On kernel-target alignment. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[44] W. Bruce Croft and John Lafferty. Language Modeling for Information Retrieval. Kluwer Academic Publishers, Norwell, MA, USA, 2003.
[45] Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated learning: Transfer learning across different feature spaces. In Annual Conference on Neural Information Processing Systems 21. MIT Press, 2009.
[46] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Co-clustering based classification for out-of-domain documents. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining. ACM, 2007.
[47] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Transferring naive bayes classifiers for text classification. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence. AAAI Press, 2007.
[48] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
[49] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. Self-taught clustering. In Proceedings of the 25th International Conference of Machine Learning. ACM, 2008.
[50] Wenyuan Dai, Ou Jin, Gui-Rong Xue, Qiang Yang, and Yong Yu. Eigentransfer: a unified framework for transfer learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[51] Hal Daumé III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. ACL, 2007.
[52] Jesse Davis and Pedro Domingos. Deep transfer via second-order markov logic. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[53] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.
[54] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001.
[55] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.
[56] Harris Drucker. Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 1997.
[57] Lixin Duan, Ivor W. Tsang, Dong Xu, and Stephen J. Maybank. Domain transfer svm for video concept detection. In Proceedings of the 22nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
[58] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In Proceedings of the 26th International Conference on Machine Learning. ACM, 2009.
[59] Lixin Duan, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Visual event recognition in videos by learning from web data. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010.
[60] Eric Eaton, Marie desJardins, and Terran Lane. Modeling transfer relationships between learning tasks for improved inductive transfer. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science. Springer, 2008.
[61] Henry C. Ellis. The Transfer of Learning. The Macmillan Company, New York, 1965.
[62] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
[63] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004.
[64] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[65] Brian Ferris, Dieter Fox, and Neil Lawrence. WiFi-SLAM using gaussian process latent variable models. In Proceedings of the 20th International Joint Conference on Artificial Intelligence. 2007.
[66] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory. Springer-Verlag, 1995.
[67] Jing Gao, Wei Fan, Jing Jiang, and Jiawei Han. Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008.
[68] Wei Gao, Peng Cai, Kam-Fai Wong, and Aoying Zhou. Learning to rank only using training data from related domain. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2010.
[69] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[70] Arthur Gretton, Olivier Bousquet, Alex J. Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, volume 3734 of Lecture Notes in Computer Science. Springer, 2005.
[71] Hong Lei Guo, Li Zhang, and Zhong Su. Empirical study on the performance stability of named entity recognition model across domains. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. ACL, 2006.
[72] Rakesh Gupta and Lev Ratinov. Text categorization with knowledge transfer from heterogeneous data sources. In Proceedings of the 23rd National Conference on Artificial Intelligence. AAAI Press, 2008.
[73] Sunil Kumar Gupta, Dinh Phung, Brett Adams, Truyen Tran, and Svetha Venkatesh. Nonnegative shared subspace learning and its application to social media retrieval. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
[74] Andreas Haeberlen, Eliot Flannery, Andrew M. Ladd, Algis Rudys, Dan S. Wallach, and Lydia E. Kavraki. Practical robust localization over large-scale 802.11 wireless networks. In Proceedings of the 10th Annual International Conference on Mobile Computing and Networking. ACM, 2004.
[75] Abhay Harpale and Yiming Yang. Active learning for multi-task adaptive filtering. In Proceedings of the 27th International Conference on Machine Learning. ACM, 2010.
[76] John A. Hartigan and M. Anthony Wong. A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
[77] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, second edition, 2009.
[78] Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Inlier-based outlier detection via direct density ratio estimation. In Proceedings of the 8th IEEE International Conference on Data Mining. 2008.
[79] Jean Honorio and Dimitris Samaras. Multi-task learning of gaussian graphical models. In Proceedings of the 27th International Conference on Machine Learning. ACM, 2010.
[80] Derek H. Hu, Sinno Jialin Pan, Vincent W. Zheng, Nathan N. Liu, and Qiang Yang. Real world activity recognition with multiple goals. In Proceedings of the 10th ACM International Conference on Ubiquitous Computing. ACM, 2008.
[81] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004.
[82] Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[83] Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems 21. MIT Press, 2009.
[84] Tony Jebara. Multi-task feature and kernel selection for SVMs. In Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.
[85] Jing Jiang. Domain Adaptation in Natural Language Processing. PhD thesis, The University of Illinois at Urbana-Champaign, 2008.
[86] Jing Jiang. Multi-task transfer learning for weakly-supervised relation extraction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. ACL, 2009.
[87] Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. ACL, 2007.
[88] Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining. ACM, 2008.
[89] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning. Springer, 1998.
[90] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 1999.
[91] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.
[92] Yoshinobu Kawahara and Masashi Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In Proceedings of 2009 SIAM International Conference on Data Mining. 2009.
[93] Lun-Wei Ku, Yu-Ting Liang, and Hsin-Hsi Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. AAAI Press, 2006.
[94] Christoph H. Lampert and Oliver Krämer. Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning. In Proceedings of the 11th European Conference on Computer Vision, Lecture Notes in Computer Science. Springer, 2010.
[95] Gert R. G. Lanckriet, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[96] Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 1995.
[97] Neil D. Lawrence and John C. Platt. Learning to learn with the informative vector machine. In Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.
[98] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Annual Conference on Neural Information Processing Systems 19. MIT Press, 2007.
[99] Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
[100] Julie Letchner, Dieter Fox, and Anthony LaMarca. Large-scale localization from wireless signal strength. In Proceedings of the 20th National Conference on Artificial Intelligence. AAAI Press / The MIT Press, 2005.
[101] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM / Springer, 1994.
[102] Bin Li, Qiang Yang, and Xiangyang Xue. Can movies and books collaborate?: cross-domain collaborative filtering for sparsity reduction. In Proceedings of the 21st International Joint Conference on Artificial Intelligence. 2009.
[103] Bin Li, Qiang Yang, and Xiangyang Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[104] Tao Li, Vikas Sindhwani, Chris Ding, and Yi Zhang. Knowledge transformation for cross-domain sentiment classification. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2009.
[105] Tao Li, Yi Zhang, and Vikas Sindhwani. A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association of Computational Linguistics. ACL, 2009.
[106] Lin Liao, Donald J. Patterson, Dieter Fox, and Henry A. Kautz. Learning and inferring transportation routines. Artificial Intelligence, 171(5-6):311–331, 2007.
[107] Xuejun Liao, Ya Xue, and Lawrence Carin. Logistic regression with an auxiliary data source. In Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.
[108] Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Spectral domain-transfer learning. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008.
[109] Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, and Yong Yu. Can chinese web pages be classified with english data source? In Proceedings of the 17th International Conference on World Wide Web. ACM, 2008.
[110] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2007.
[111] Bing Liu. Sentiment analysis and subjectivity. In Handbook of Natural Language Processing, Second Edition. 2010.
[112] Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. 2009.
[113] Kang Liu and Jun Zhao. Cross-domain sentiment classification using a two-stage method. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009.
[114] Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. Modeling and predicting the helpfulness of online reviews. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. IEEE Computer Society, 2008.
[115] Yiming Liu, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Using large-scale web data to facilitate textual query based retrieval of consumer photos. In Proceedings of the Seventeenth ACM International Conference on Multimedia. ACM, 2009.
[116] Yue Lu, Panayiotis Tsaparas, Alexandros Ntoulas, and Livia Polanyi. Exploiting social context for review quality prediction. In Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
[117] Yue Lu and Chengxiang Zhai. Opinion integration through semi-supervised topic modeling. In Proceedings of the 17th International Conference on World Wide Web. ACM, 2008.
[118] Yue Lu, ChengXiang Zhai, and Neel Sundaresan. Rated aspect summarization of short comments. In Proceedings of the 18th International Conference on World Wide Web. ACM, 2009.
[119] Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, and Qing He. Transfer learning from multiple source domains via consensus regularization. In Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, 2008.
[120] M. M. Hassan Mahmud and Sylvian R. Ray. Transfer learning using kolmogorov complexity: Basic theory and empirical evaluations. In Annual Conference on Neural Information Processing Systems 20. MIT Press, 2008.
[121] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the renyi divergence. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. 2009.
[122] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems 21. MIT Press, 2009.
[123] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory. 2009.
[124] Lilyana Mihalkova, Tuyen Huynh, and Raymond J. Mooney. Mapping and revising markov logic networks for transfer learning. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence. AAAI Press, 2007.
[125] Lilyana Mihalkova and Raymond J. Mooney. Transfer learning from minimal target data by mapping across relational domains. In Proceedings of the 21st International Joint Conference on Artificial Intelligence. 2009.
[126] Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, and Klaus-Robert Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX. 1999.
[127] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[128] Tony Mullen and Nigel Collier. Sentiment analysis using support vector machines with diverse information sources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2004.
July 2007. [120] M. Sentiment analysis using support vector machines with diverse information sources. Bernhard Sch¨ lkopf. October 2008. Transfer learning from multiple source domains via consensus regularization. July 2004. 2008. 2009. July 2009. Hui Xiong. Mooney.

pages 1187–1192. Adaptive localization in a dynamic WiFi environment through multi-view learning. pages 988–993. Qiang Yang. Pan. September 2006. [131] Kamal Nigam. July 2008. July 2008. Ng. A manifold regularization approach to calibration reduction for sensor-network based tracking. pages 677–682. Domain adaptation via transfer component analysis. In Proceedings of the 19th International Conference on World Wide Web.[129] Yurii Nesterov and Arkadii Nemirovskii. AAAI Press. James T. Jordan. and James T. Text classiﬁcation from labeled and unlabeled documents using EM. pages 1383–1388. IEEE Transactions on Knowledge and Data Engineering. January 2007. AAAI Press. Tsang. James T. and Chen Zheng. and Yair Weiss. In Advances in Neural Information Processing Systems 14. Kwok. Pan and Qiang Yang. [135] Sinno Jialin Pan. [138] Sinno Jialin Pan. Ivor W. ACM. On spectral clustering: Analysis and an algorithm. Transferring localization models across space. In Proceedings of the 21st National Conference on Artiﬁcial Intelligence. and Qiang Yang. Hong Chang. Philadelphia. AAAI Press. 2001. In Proceedings of the 22nd AAAI Conference on Artiﬁcial Intelligence. 39(2-3):103–134. and Tom Mitchell. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. 108 . pages 2166–2171. Society for Industrial and Applied Mathematics. Kwok. Crossdomain sentiment classiﬁcation via spectral feauture alignment. [139] Sinno Jialin Pan. April 2010. Transfer learning via dimensionality reduction. and Qiang Yang. [137] Sinno Jialin Pan. AAAI Press. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. July 2007. and Jeffrey J. Kwok. July 2009. James T. Kwok. Pan. Sebastian Thrun. 2000. Xiaochuan Ni. Michael I. Jian-Tao Sun. Interior-point Polynomial Algorithms in Convex Programming. 1994. [133] Jeffrey J. pages 751–760. and Yiqiang Chen. July 2006. and Dit-Yan Yeung. pages 1108–1113. James T. Kwok. 
In Proceedings of the 21st International Joint Conference on Artiﬁcial Intelligence. [136] Sinno Jialin Pan. Qiang Yang. Dou Shen. [134] Jeffrey J. PA. Qiang Yang. Qiang Yang. Andrew Kachites McCallum. MIT Press. [132] Jeffrey J. Machine Learning. Multidimensional vector regression for accurate and low-cost location estimation in pervasive computing. Qiang Yang. 18(9):1181–1193. Co-localization from labeled and unlabeled data using graph laplacian. [130] Andrew Y. pages 849–856. Pan. In Proceedings of the 20th International Joint Conference on Artiﬁcial Intelligence.

Lawrence. and Qiang Yang. Submitted. Opinion mining and sentiment analysis. pages 79–86. Liu. In Proceedings of the 27th International Conference on Machine learning. [150] Rajat Raina. 2010. Domain adaptation via transfer component analysis. Constructing informative priors using transfer learning. Ivor W. October 2010. 109 . 2(1-2):1–135. [144] Bo Pang and Lillian Lee. July 2008. Evan W. [145] Bo Pang and Lillian Lee. ACL. and Daphne Koller. Qiang Yang. Kwok. pages 713–720. Foundations and Trends in Information Retrieval. Transfer learning for WiFi-based indoor localization. Boosting for regression transfer. e In Proceedings of the 26th Conference on Uncertainty in Artiﬁcial Intelligence. Thumbs up? Sentiment classiﬁcation using machine learning techniques. [151] Rajat Raina. and Qiang Yang. In Proceedings of the Workshop on Transfer Learning for Complex Task of the 23rd AAAI Conference on Artiﬁcial Intelligence. AAAI Press. In Proceedings of the 24th AAAI Conference on Artiﬁcial Intelligence. In Proceedings of the 42nd Annual Meeting of the Association of Computational Linguistics. Nathan N. Alexis Battle. 22(10):1345–1359. 2008. June 2006. [141] Sinno Jialin Pan and Qiang Yang. Ng. ACM. [143] Weike Pan. [146] Bo Pang. MIT Press. In Proceedings of the 23rd International Conference on Machine Learning. Selftaught learning: Transfer learning from unlabeled data. Andrew Y. pages 759–766. A survey on transfer learning. [148] Joaquin Quionero-Candela. Ng. 2009. and Shivakumar Vaithyanathan. [147] David Pardoe and Peter Stone. In Proceedings of the 24th International Conference on Machine Learning. James T. Anton Schwaighofer. Zheng. IEEE Transactions on Neural Networks. Xiang. and Neil D.[140] Sinno Jialin Pan. and Derek H. Inﬁnite predictor subspace models for multitask learning. June 2007. Hu. July 2010. ACL. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. Lillian Lee. ACM. July 2002. pages 271–278. Tsang. ACM. 
IEEE Transactions on Knowledge and Data Engineering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Transfer learning in collaborative ﬁltering for sparsity reduction. July 2010. July 2004. Masashi Sugiyama. Dataset Shift in Machine Learning. June 2010. [149] Piyush Rai and Hal Daum´ III. Benjamin Packer. [142] Sinno Jialin Pan. Vincent W. and Andrew Y. Honglak Lee.

Markov logic networks. [158] Marcus Rohrbach. Cook. [156] Alexander E. and o u a A. and Leslie Pack Kaelbling. 62(1-2):107–136. Jinbo Bi.C. Input space versus feature space in kernel-based methods. S. R¨ tsch. IEEE. and Bernt Schiele. Activity. Theory reﬁnement of bayesian networks with hidden variables. In Proceedings of the 11th European Conference on Computer Vision. [157] Stephen Robertson. Brian Kulis. [155] Matthew Richardson and Pedro Domingos. July 2010. and Ian Soboroff. Lecture Notes in Computer Science. [159] Michael T. Rosenstein. In Proceedings of the 25th International Conference on Machine learning. ACL. Sch¨ lkopf. o What helps where . To transfer or not to transfer. In Text REtrieval Conference 2001. July 1998. June 2010. In Proceedings of the Workshop on Plan. [154] Vikas C. September 1999. The trec 2002 ﬁltering track report. and R. Springer. June 2008. Gy¨ rgy Szarvas. Mario Fritz.J. Adapting visual category models to new domains. Burges. Richman and Patrick Schone. Michael Stark. pages 808– 815. 10(5):1000–1017. pages 1–9. IEEE Transactions on Neural Networks. Bayesian multiple instance learning: automatic feature selection and inductive transfer. Mika. 110 .and why? semantic relatedness for knowledge transfer. C. Balaji Krishnapuram. ACM. [160] Ulrich R¨ ckert and Stefan Kramer. pages 220–233. In Proceedings of the 14th International Conference on Machine Learning. Zvika Marx. pages 454–462.J. Activity recognition based on home to home transfer learning. Smola. 2006. Iryna Gurevych. G. Mining Wiki resources for multilingual named entity recognition. and Trevor Darrell. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. September 2010. In a NIPS-05 Workshop on Inductive Transfer: 10 Years Later. September 2008. [153] Parisa Rashidi and Diane J. December 2005. Klaus-Robert M¨ ller. P. July 2008. Murat Dundar. Stephen Robertson. [161] Kate Saenko.[152] Sowmya Ramachandran and Raymond J. 
and Intent Recognition of the 24th AAAI Conference on Artiﬁcial Intelligence. Mooney. 2001. Raykar. Machine Learning Journal. Kernel-based inductive transfer. [162] B. In Proceedings u of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases. Knirsch. AAAI Press. Bharat Rao. In Proceedings of 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technology.

Computer Sciences Technical Report 1648. Nonlinear component o u analysis as a kernel eigenvalue problem. and Kee-Khoon Lee. MIT Press. Regularization. pages 1209–1216. [171] Vikas Sindhwani and Prem Melville. September 2010. Springer. Le Song. IEEE Computer Society. pages 13–31. Optimization. Yew-Soon Ong Ivor W. Learning with Kernels: Support Vector Machines. [168] Burr Settles. MIT Press. and Bernhard Sch¨ lkopf. Learning via Hilbert Space Embedding of Distributions. 1998. and Klaus-Robert M¨ ller. Learning gaussian process kernels via hierarchical bayes. University of Wisconsin–Madison. 2008. Arthur Gretton. [164] Bernhard Scholkopf and Alexander J. In Proceedings of the 18th International Conference on Algorithmic Learning Theory. In Proceedings of the 8th IEEE International Conference on Data Mining. An o a empirical analysis of domain adaptation algorithms for genomic sequence analysis. MA. Neural Computation. 2005. Cambridge. pages 1025–1030. Smola. October 2007. 90:227–244. Improving predictive inference under covariate shift by weighting the log-likelihood function. [173] Le Song. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases. In Annual Conference on Neural Information Processing Systems 17. Document-word co-regularization for semi- supervised sentiment analysis. USA. and Jiangtao Ren. Christian Widmer. and Kai Yu. Journal of Statistical Planning and Inference. Wei Fan. Alexander Smola. In Advances in Neural Information Processing Systems 21. December 2007.[163] Bernhard Sch¨ lkopf. Volker Tresp. 111 . [169] Xiaoxiao Shi. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases 2010. 2009. pages 1433–1440. [167] Chun-Wei Seah. pages 342–357. Active learning literature survey. Tsang. [165] Anton Schwaighofer. [172] Alex J. and Beyond. 2009. [166] Gabriele Schweikert. [170] Hidetoshi Shimodaira. September 2008. 
A Hilbert space o embedding for distributions. Bernhard Sch¨ lkopf. 2000. and Gunnar R¨ tsch. The University of Sydney. 2001. Smola. 10(5):1299–1319. PhD thesis. Lecture Notes in Computer Science. Predictive distribution matching svm for multi-domain learning. Actively transfer domain knowledge.

Hisashi Kashima. and Motoaki Kawanabe. In Advances in Neural Information Processing Systems 20. A shape-based object class model for knowledge transfer. Journal of Machine Learning Research. Rice University. pages 308–316. and Arthur Gretton. [182] Matthew E. In Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation.[174] Le Song. Motoaki Kawanabe. Journal of Machine Learning Research. MA. September 2009. Direct importance estimation with model selection and its application to covariate shift adaptation. 2009. In Proceedings of 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technology. Paul Von Buenau. In Proceedings of 12th IEEE International Conference on Computer Vision. Alex Smola. ACL. and Bernt Schiele. 112 . 2001. Learning to learn. editors. [181] Taiji Suzuki and Masashi Sugiyama. pages 1385–1392. [176] Bharath Sriperumbudur. A joint model of text and aspect ratings for sentiment summarization. pages 130–137. 10(1):1633– 1685. [183] Sebastian Thrun and Lorien Pratt. Michael Goesele. Kenji Fukumizu. Houston. 2008. Taylor and Peter Stone. [184] Ivan Titov and Ryan McDonald. Department of Computational and Applied Mathematics. USA. pages 1750–1758. 2008. 2010. Shinichi Nakajima. 2009. Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. Estimating squared-loss mutual information for independent component analysis. MIT Press. Kluwer Academic Publishers. and Pui Ling Chui. Technical Report TR-96-40. [175] Danny C. and Bernhard Sch¨ lkopf. Gert Lanckriet. Norwell. pages 1433–1440. Kernel choice and classiﬁability for RKHS embeddings of probability distrio butions. On the inﬂuence of the kernel on the consistency of support vector machines. Karsten Borgwardt. [180] Masashi Sugiyama. [177] Michael Stark. March 2009. 1996. Arthur Gretton. Colored maximum variance unfolding. 23(1):44–59. 2:67–93. June 2008. [179] Masashi Sugiyama. 
Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Network. Sorensen. MIT Press. 1998. In Annual Conference on Neural Information Processing Systems 20. [178] Ingo Steinwart. In Advances in Neural Information Processing Systems 22. Texas.

September 1998. Eric Postma. Heterogeneous cross domain ranking in latent space. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. November 2009. [194] Hua-Yan Wang. In Proceedings of the 8th Annual IEEE International Conference on Pervasive Computing and Communications. IEEE Computer Society. Francesco Orabona. pages 244–252. and Barbara Caputo. . September 2009. and Jian Hu. Zi Yang. Wiley-Interscience. [189] Vladimir N. Wei Fan.[185] Tatiana Tommasi. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. IEEE Computer Society. ACM. Frank C. [190] Alexander Vezhnevets and Joachim Buhmann. July 2008. Kir´ ly. 2008. Thumbs up or thumbs down? semantic orientation applied to unsupervised classiﬁcation of reviews. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Eigenfaces for recognition. pages 417–424. [193] Chang Wang and Sridhar Mahadevan. Finding stau u u tionary subspaces in multivariate time series. Vincent W. [188] Laurens van der Maaten. Franz C. and Qiang Yang. Physical Review Letters. Carlotta Domeniconi. [186] Matthew Turk and Alex Pentland. Jie Tang. 2002. Indoor localization in multi-ﬂoor environments with reduced effort. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. and Yanzhu Liu. M¨ ller. Songcan Chen. March 2010. 3(1):71–86. IEEE. New York. Technical Report TiCC-TR 2009-005. June 2010. Journal of Cognitive Neuroscience. ACM. Junhui Zhao. and Klaus R. [191] Paul von B¨ nau. Meinecke. Tilburg University. Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. pages 1120– 1127. [192] Bo Wang. pages 1085–1090. [187] Peter Turney. Zheng. 2009. Manifold alignment using procrustes analysis. In Proceedings of the 25th International Conference on Machine learning. 113 Statistical Learning Theory. 1991. [195] Pu Wang. 
and Jaap van den Herik. Dimensionality reduction: A comparative review. Vapnik. 103(21). pages 987–996. Using Wikipedia for co-clustering based cross-domain text classiﬁcation. IEEE. In Proceedings of the 40th Annual Meeting fo the Association for Computational Linguistics. In Proceeding of the 18th ACM conference on Information and knowledge management. June 2010.

April 2010. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. and Hai Leong Chieu. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. IEEE. In Proceedings of the 22nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Springer. pages 142–149. September 2008. IEEE Transactions on Knowledge and Data Engineering. June 2009. Weinberger. and Gunnar R¨ tsch. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of 14th Annual International Conference on Research in Computational Molecular Biology. Canada. pages 1523–1532. Fei Sha. and Qiang Yang. and Jiangtao Ren. and Bo Chen. Nan Ye. and Zhengyou Zhang. pages 283–290. Banff. In Proceedings of the 21st International Conference on Machine Learning. [205] Dikan Xing. Wei Fan. Springer. Mining employment market via text block detection and adaptive cross-domain information extraction.[196] Xiaogang Wang. [203] Evan W. June 2010. and Lawrence K. August 2009. July 2004. 22:770–783. Latent space domain transfer between high dimensional overlapping distributions. [204] Sihong Xie. Derek H. Leveraging sequence a classiﬁcation by taxonomy-based multitask learning. Bridged reﬁnement for transfer learning. Wai Lam. [198] Kilian Q. In 11th European Conference on Principles and Practice of Knowledge 114 . In 18th International World Wide Web Conference. [199] Christian Widmer. Jing Peng. ACM. Improving svm accuracy by training on auxiliary data sources. pages 91–100. April 2009. Olivier Verscheure. ACM. Alberta. pages 522–534. [201] Dan Wu. ACL. In Proceedings of the 21st International Conference on Machine Learning. Cha Zhang. Boosted multi-task learning for face veriﬁcation with applications to web image and video search. Dietterich. ACM. Hu. Domain adaptive bootstrapping for named entity recognition. Yangqiu Song. and Changshui Zhang. Saul. 
volume 6044 of Lecture Notes in Computer Science. In Proceedings of the 2008 European Conference on European Conference on Machine Learning and Knowledge Discovery in Databases. and Yong Yu. pages 550–565. [197] Zheng Wang. 2009. July 2004. Bin Cao. Jose Leiva. Wenyuan Dai. Bridging domains using world wide knowledge for transfer learning. Yasemin Altun. [202] Pengcheng Wu and Thomas G. [200] Tak-Lam Wong. Wee Sun Lee. Gui-Rong Xue. Xiang. Transferred dimensionality reduction.

and Yong Yu. pages 1159–1168. 99(PrePrints). ACM. Hannah Hong Xue. Activity recognition: Linking low-level sensors to high-level intelligence. Topic-bridged plsa for crossdomain text classiﬁcation. ACL. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. September 2007. In Proceedings of the 21st international jont conference on Artiﬁcal intelligence. Morgan Kaufmann Publishers Inc. IEEE Intelligent Systems. [207] Gui-Rong Xue. [212] Qiang Yang. Unsupervised transfer classiﬁcation: application to text categorization. ACM. July 2008. Anil K. Transfer learning using task-level features with application to information retrieval. Lecture Notes in Computer Science. 2009. In Advances in Neural Information Processing Systems 22. [209] Jun Yang. pages 1–9. 2009. and Vincent W. Sinno Jialin Pan. Qiang Yang.. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Jain. 2008. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Heterogeneous transfer learning for image clustering via the social web. 2009. 2010. Estimating location using Wi-Fi. Yuqiang Chen. [210] Qiang Yang. Rong Yan. pages 188–197. and Qiang Yang. In Proceedings of the 21st International Joint Conference on Artiﬁcial Intelligence. In Proceedings of the 15th international conference on Multimedia. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. and Yong Yu. July 2009. Hauptmann. Gui-Rong Xue. 115 . Cross-domain video concept detection using adaptive svms. pages 20–25. [208] Rong Yan and Jian Zhang. [211] Qiang Yang. Springer. and Wei Tong. and Alexander G. volume 1. pages 1315–1320. Seyoung Kim. 23(1):8–13. Yang Zhou. [213] Tianbao Yang. Morgan Kaufmann Publishers Inc. [206] Qian Xu. pages 324–335. and Eric Xing. 
Wenyuan Dai. pages 2151–2159. Multitask learning for protein subcellular location prediction.. Sinno Jialin Pan. pages 627–634.Discovery in Databases. 2007. ACM. Rong Jin. July 2010. [214] Xiaolin Yang. Wenyuan Dai. Heterogeneous multitask learning with joint sparsity constraints. Zheng.

In Proceedings of the 21st international jont conference on Artiﬁcal intelligence. Zengfu Wang. June 2010. pages 412–420. Qiang Yang. Gray. and L. [221] Zheng-Jun Zha. 6:483–502.. Learning adaptive temporal radio maps for signalstrength-based location estimation. and Dit-Yan Yeung. In Proceedings of the 26th Conference on Uncertainty in Artiﬁcial Intelligence.[215] Yiming Yang and Jan O. [225] Yu Zhang and Dit-Yan Yeung. [218] Xiao-Tong Yuan and Shuicheng Yan. 2005. pages 1327–1332. IEEE. A convex formulation for learning task relationships in multi-task learning. 26(12):i97–i105. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Joe W. MIT Press. Bin Cao. and Xian-Sheng Hua. 2009. Tao Mei.M. In Proceedings of the 21st International Conference on Machine Learning. Learning and evaluating classiﬁers under sample selection bias. Meng Wang. [216] Jieping Ye. July 2010. Morgan Kaufmann Publishers Inc. Spectral relaxation for k-means clustering. Sparse multitask regression for identifying common mechanism of response to therapeutic targets. [217] Jie Yin. Visual classiﬁcation with multi-task joint sparse representation. USA. Journal of Machine Learning Research. Bioinformatics. 2001. Morgan Kaufmann Publishers Inc. [222] Kai Zhang. July 2010. and Bahram Parvin. Multi-domain collaborative ﬁltering. 116 . July 1997. Chris Ding. [219] Bianca Zadrozny. IEEE. A comparative study on feature selection in text categorization. pages 1057–1064. In Proceedings of the 26th Conference on Uncertainty in Artiﬁcial Intelligence. Horst Simon. Xiaofeng He. Pedersen. [224] Yu Zhang and Dit-Yan Yeung. [220] Hongyuan Zha. [223] Yu Zhang. San Francisco. In Advances in Neural Information Processing Systems 14. Ni. June 2010. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. In Proceedings of the 14th International Conference on Machine Learning. July 2008. 
In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CA. Robust distance metric learning with auxiliary knowledge. July 2004. 2010. and Ming Gu. IEEE Transactions on Mobile Computing. 7(7):869– 883. Multi-task warped gaussian process for personalized age estimation.

Cross domain distribution adaptation via kernel mapping. and Qiang Yang. July 2010. Computer Sciences. Lafferty. 2004. University of Wisconsin-Madison. In Advances in Neural Informao tion Processing Systems 16. and Dou Shen. and Olivier Verscheure. August 2003. June 2010. Derek H. and John D. AAAI Press. September 2009. and Jeffrey J. Jing Peng. June 2009. [235] Hankui Zhuo. December 2008. In Proceedings of the 11th International Conference on Ubiquitous Computing. Cross-domain activity recognition. MIT Press. Pan. Deepak Turaga. pages 321–328. In Proceeding of The 22nd International Conference on Machine Learning. Hu. Transferring localization models over time. and Bernhard Sch¨ lkopf. Learning with local and global consistency. July 2008. ACM. Qiang Yang. Semi-supervised learning literature survey. Thomas Navin Lal. [227] Peilin Zhao and Steven C. 2005. pages 912–919. 117 . pages 1027–1036. Hu. OTL: A framework of online transfer learning. Zoubin Ghahramani. Qiang Yang. [230] Vincent W. pages 1421–1426. Zheng.[226] Yu Zhang and Dit-Yan Yeung. Zheng. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. [234] Xiaojin Zhu. Hoi. Transfer metric learning by learning task relationships. [228] Vincent W. [233] Xiaojin Zhu. ACM.H. pages 1427–1432. Xiang. Derek H. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. Springer-Verlag. Transferring multidevice localization models using latent multi-task learning. Qiang Yang. Jason Weston. In Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence. ACM. [232] Dengyong Zhou. pages 1110–1115. Sinno Jialin Pan. Technical Report 1530. In Proceedings of the 27th International Conference on Machine learning. Transferring knowledge from another domain for learning action models. and Lei Li. In Proceedings of 10th Paciﬁc Rim International Conference on Artiﬁcial Intelligence. 
In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. Semi-supervised learning using gaussian ﬁelds and harmonic functions. Kun Zhang. Wei Fan. July 2008. Zheng. [229] Vincent W. Jiangtao Ren. [231] Erheng Zhong. Olivier Bousquet. AAAI Press. Evan W. pages 1199–1208.
