
Deep Transfer Learning and Its Applications in Remote Sensing and Computer Vision

by

Jianzhe Lin

B.Sc., Huazhong University of Science and Technology, 2013
M.Sc., The University of Chinese Academy of Sciences, 2016

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

April 2020

© Jianzhe Lin, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled "Deep Transfer Learning and Its Applications in Remote Sensing and Computer Vision," submitted by Jianzhe Lin in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining Committee:

Z. Jane Wang, Professor, Electrical and Computer Engineering, UBC (Supervisor)
Rabab Ward, Professor, Electrical and Computer Engineering, UBC (Co-Supervisor)
Panos Nasiopoulos, Professor, Electrical and Computer Engineering, UBC (Supervisory Committee Member)
Shuo Tang, Professor, Electrical and Computer Engineering, UBC (Supervisory Committee Member)
Bhushan Gopaluni, Professor, Chemical and Biological Engineering, UBC (University Examiner)
Purang Abolmaesumi, Professor, Electrical and Computer Engineering, UBC (University Examiner)

Abstract

Many machine learning tasks rely on the availability of large amounts of data, and annotated data samples in particular are crucial for building robust machine learning systems. For computer vision tasks, the shortage of annotated training data has been a persistent hindrance. One of the most popular approaches to this problem is deep transfer learning (DTL). DTL methods transfer information from large annotated datasets to scarce unannotated ones: deep learning is used to extract features from both the annotated and unannotated image datasets, and the labels of unannotated samples are inferred from annotated samples that share similar features.

This thesis proposes deep transfer learning models for problems involving three types of image data: aerial images, satellite images, and ground-view images. Based on these datasets, the transfer learning tasks include transfer between different types of regular images, between different types of remote sensing images, and between remote sensing and regular images. The underlying relationships are obtained by correlating the deep transfer learning models corresponding to the different types of images.

The proposed models address three research tasks. The first addresses the "what to transfer" problem, i.e., finding the appropriate content for transfer. For this task, we propose an active-learning-incorporated deep transfer learning model that explores the relationships among different remote sensing images. The second studies the "where to transfer" problem, finding the correspondence between the deep networks of the annotated and the unannotated images; for this task, we consider regular images. The third investigates the "how to transfer" problem for three types of images (aerial, satellite, and ground-view), and involves finding the image relationships and the best deep neural network models for knowledge transfer.
Several models, including the Dual Space structure-preserving Transfer Learning (DSTL) model, Xnet, and the Dual Adversarial Network (DuAN), are proposed.

Lay Summary

Much of the impressive recent progress in artificial intelligence is due to methods based on deep learning. These methods, however, need large amounts of annotated data. For computer vision tasks, the shortage of annotated training images has been a hindrance. Deep transfer learning methods address this problem by using the information in labeled datasets to annotate unlabeled ones.

This thesis develops advanced deep transfer learning (DTL) methods for applications in remote sensing and computer vision, using aerial, satellite, and ground-view images. These methods find the relationships among different labeled and unlabeled images and the correlations between their corresponding deep neural networks. The proposed methods address three challenging problems: how to transfer the knowledge from the annotated to the unannotated images, what knowledge to transfer, and where in the target deep neural network layers the knowledge should be transferred.

Preface

This dissertation is written based on a collection of manuscripts. The majority of the research, including the literature study, algorithm development and implementation, numerical studies, and report writing, was conducted by the candidate, with suggestions from Prof. Z. Jane Wang and Prof. Rabab Ward. The manuscripts were primarily drafted by the candidate, with helpful revisions and comments from Prof. Z. Jane Wang (papers in Chapters 2-6) and Prof. Rabab Ward (papers in Chapters 2-3 and 5-6).

Chapter 2 is based on the following manuscripts:

• J. Lin, L. Zhao, S. Li, R. Ward, and Z. Wang, "Active-learning-incorporated deep transfer learning for hyperspectral image classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 11, pp. 4048-4062, 2018.
• J. Lin, R. Ward, and Z. Wang, "Deep transfer learning for hyperspectral image classification," in Proc. International Workshop on Multimedia Signal Processing (MMSP), pp. 1-5, 2018.

The author was responsible for algorithm development and implementation and for manuscript writing for these works, which were conducted under the guidance of Dr. Z. Jane Wang and Dr. Rabab Ward. Dr. L. Zhao gave many suggestions on the algorithm design. Dr. S. Li provided the data for the experiments.

Chapter 3 is based on the following manuscript:

• J. Lin, L. Zhao, Q. Wang, Z. Wang, and R. Ward, "DT-LET: Deep transfer learning by exploring where to transfer," accepted, Neurocomputing, 2020.

The author was responsible for algorithm development and implementation and for manuscript writing. The work was conducted under the guidance of Dr. Z. Jane Wang and Dr. Rabab Ward. Dr. L. Zhao gave many suggestions on the algorithm design. Dr. Q. Wang provided the data for the experiments.

Chapter 4 is based on the following manuscript:

• J. Lin, C. He, Z. Wang, and S. Li, "Structure preserving transfer learning for unsupervised hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, 2017.

The author was responsible for algorithm development and implementation and for manuscript writing. The work was conducted under the guidance of Dr. Z. Jane Wang. Dr. C. He gave many suggestions on the algorithm design. Dr. S. Li provided the data for the experiments.

Chapter 5 is based on the following manuscript:
• J. Lin, K. Yuan, R. Ward, and Z. Wang, "Xnet: Task-specific attentional domain adaptation for satellite-to-aerial scene," accepted, Neurocomputing, 2020.

The author was responsible for algorithm development and implementation and for manuscript writing. The work was conducted under the guidance of Dr. Z. Jane Wang and Dr. Rabab Ward. Mr. K. Yuan was responsible for the coding part of this work.

Chapter 6 is based on the following manuscript:

• J. Lin, L. Mou, T. Yu, X. Zhu, and Z. Wang, "Dual adversarial network for unsupervised ground/satellite-to-aerial scene adaptation," in Proc. ACM MM, submitted, 2020.

The author was responsible for algorithm development and implementation and for manuscript writing. The work was conducted under the guidance of Dr. Z. Jane Wang. Mr. T. Yu was responsible for the coding part. Dr. L. Mou and Dr. X. Zhu provided the data for the experiments.

Other publications related to my Ph.D. work are:

• J. Lin, T. Yu, L. Mou, X. Zhu, R. Ward, and Z. Wang, "Unifying top-down view by few-shot task-specific domain adaptation," IEEE Transactions on Geoscience and Remote Sensing, submitted, 2020.

The author was responsible for algorithm development and implementation and for manuscript writing. The work was conducted under the guidance of Dr. Z. Jane Wang and Dr. Rabab Ward. Mr. T. Yu was responsible for the coding part. Dr. L. Mou and Dr. X. Zhu provided the data for the experiments.

• J. Lin, L. Mou, X. Zhu, X. Ji, R. Ward, and Z. Wang, "Attention-aware pseudo-3D convolutional neural network for hyperspectral image classification," under major revision, IEEE Transactions on Geoscience and Remote Sensing, 2020.

The author was responsible for algorithm development and implementation and for manuscript writing. The work was conducted under the guidance of Dr. Z. Jane Wang and Dr. Rabab Ward. Dr. X. Ji gave many suggestions on the algorithm design. Dr. L. Mou and Dr. X. Zhu provided the data for the experiments.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
  1.1 Introduction to deep transfer learning and its application in computer vision
  1.2 Related work
    1.2.1 What to transfer
    1.2.2 Where to transfer
    1.2.3 How to transfer
    1.2.4 Transfer learning for remote sensing
  1.3 Research objectives, methodologies, and thesis outline
2 What to Transfer: Active Learning Incorporated Deep Transfer Learning for Hyperspectral Image Classification
  2.1 Contribution summary
  2.2 Method
    2.2.1 Density peaks selection
    2.2.2 Active data augmentation
    2.2.3 Deep mapping mechanism
    2.2.4 Classification on common semantic subspace
  2.3 Experiments
    2.3.1 Experimental dataset descriptions
    2.3.2 Comparative methods and evaluation
    2.3.3 Experiment 1: Urban dataset and Washington DC Mall Area 1
    2.3.4 Experiment 2: Pavia University data and Washington DC Mall Area 2
    2.3.5 Experiment 3: Pavia University data and Pavia Center data
    2.3.6 Effect of co-occurrence data
    2.3.7 Parameter sensitivity
  2.4 Conclusion
3 Where to Transfer: Deep Transfer Learning by Exploring Where to Transfer (DT-LET)
  3.1 Contribution summary
  3.2 Method: deep mapping mechanism
    3.2.1 Network setting up
    3.2.2 Correlation maximization
    3.2.3 Layer matching
  3.3 Method: model training
    3.3.1 Step 1: updating V^S, V^T with fixed Θ^S, Θ^T
    3.3.2 Step 2: updating Θ^S, Θ^T with fixed V^S, V^T
    3.3.3 Optimization of R_{s,t}
    3.3.4 Classification on common semantic subspace
  3.4 Experiments
    3.4.1 Experimental dataset descriptions
    3.4.2 Comparative methods and evaluation
    3.4.3 Task 1: handwritten digit recognition
    3.4.4 Task 2: text-to-image classification
    3.4.5 Parameter sensitivity
  3.5 Conclusion
4 How to Transfer: Structure Preserving Transfer Learning for Unsupervised Hyperspectral Image Classification
  4.1 Contribution summary
  4.2 Method
    4.2.1 Definitions
    4.2.2 Constraints on the subspace
    4.2.3 Constraints on the original target space
  4.3 Experiments
    4.3.1 Data sets
    4.3.2 Experiments on Salinas
    4.3.3 Experiments on Pavia University and Center
    4.3.4 Influence of labeled data
  4.4 Conclusion
5 How to Transfer: Xnet
  5.1 Contribution summary
  5.2 Method
    5.2.1 Network structure
    5.2.2 The objective functions
    5.2.3 Task-specific attentional adaptations
  5.3 Experiments
    5.3.1 Transfer learning benchmarks
    5.3.2 Detailed network structure
    5.3.3 Performance comparisons for digit recognition
    5.3.4 Comparisons for remote sensing task
  5.4 Conclusion
6 How to Transfer: Dual Adversarial Network (DuAN)
  6.1 Contribution summary
  6.2 Method
    6.2.1 Overview
    6.2.2 Model initialization
    6.2.3 Model training
  6.3 Experiments
    6.3.1 Datasets
    6.3.2 Digit recognition
    6.3.3 Satellite-to-aerial scene adaptation
    6.3.4 Ground-to-aerial scene adaptation
  6.4 Conclusion
  6.5 Conclusion for "How to Transfer" (Chapters 4 to 6)
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future work
    7.2.1 Few-shot transfer learning
    7.2.2 Multi-label transfer learning
    7.2.3 Multi-label transfer learning for aerial scene
Bibliography
A Supporting Materials
  A.0.1 Updating V^S, V^T with fixed Θ^S, Θ^T
  A.0.2 Updating Θ^S, Θ^T with fixed V^S, V^T

List of Tables

Table 2.1: Datasets used in experiment 1
Table 2.2: Classification accuracy results on Washington DC Mall Area 1
Table 2.3: Datasets used in experiment 2
Table 2.4: Classification accuracy results on Washington DC Mall Area 2
Table 2.5: Datasets used in experiment 3
Table 2.6: Classification accuracy results on Pavia Center
Table 2.7: Effects of the number of neurons at the last layer
Table 2.8: Effects of the number of layers
Table 3.1: Classification accuracy results on the Multi Feature dataset
Table 3.2: Classification accuracy results on the NUS-WIDE dataset
Table 3.3: Effects of the number of neurons at the last layer
Table 4.1: Classification accuracy results on Salinas
Table 4.2: Classification accuracy results on PU and PC
Table 4.3: Classification accuracy results when using different numbers of labeled data from the source domain
Table 5.1: A comparison of features for the classification task
Table 5.2: Comparisons of different methods on digit dataset transfer learning. The coverage is shown in brackets. "Source only" and "target only" refer to training only on the respective dataset (supervised, without transfer learning) and evaluating on the target dataset. The first four comparison methods belong to non-adversarial transfer learning; the rest belong to adversarial learning. The best performance is highlighted in bold. The results of Xnet with a single adversarial learning support the effectiveness of the proposed Xnet structure, while the results of Xnet without the TSA feature further illustrate the effectiveness of TSA feature learning.
Table 5.3: Accuracy (%) of DTR for the satellite-to-aerial scene adaptation task
Table 6.1: Performance comparisons of different deep adversarial transfer learning methods. "Source" and "target" refer to training only on the source/target dataset.
Table 6.2: Accuracy (%) results for the satellite-to-aerial scene adaptation task with ResNet-101 as the base network
Table 6.3: Accuracy (%) results for the ground-to-aerial scene adaptation task with ResNet-101 as the base network
Table 6.4: Accuracy (%) results for the ground-to-aerial scene adaptation task with ResNet-101 as the base network

List of Figures

Figure 1.1: Examples of HSI scenes collected from different satellites
Figure 1.2: Visualized examples for the digit recognition tasks
Figure 1.3: Examples of scenes from different views. From top to bottom are scenes from the satellite view, the aerial view, and the ground view, respectively. Scenes from the satellite view have much lower resolution and clarity than the aerial view. Scenes from the ground view and the aerial view have a huge domain gap even with the same semantic labels.
Figure 1.4: An example of the classification result on HSI. (a) Pseudo-color image of the PU dataset; (b) the classification result of the PU dataset.
Figure 1.5: The aerial-view image and the ground-view image for the same scene [104]
Figure 1.6: The overview layout of the thesis
Figure 2.1: The flowchart of the proposed DTSE framework for hyperspectral image classification
Figure 2.2: An illustrative comparison of HSI sample selection based on different criteria
Figure 2.3: An example of active local peaks querying
Figure 2.4: Urban dataset vs. Washington DC Mall Area 1: classification accuracy results of 15 tasks with 5 co-occurrence instances and 80 random testing instances
Figure 2.5: Pavia University data and Washington DC Mall Area 2: classification accuracy results of 6 tasks with 5 co-occurrence instances and 80 random testing instances
Figure 2.6: Pavia University data and Pavia Center: classification accuracy results of 6 tasks with 5 co-occurrence instances and 80 random testing instances
Figure 2.7: Effects of the co-occurrence data size on four different methods when tested on the Pavia Center dataset
Figure 3.1: The flowchart of the proposed DT-LET framework. The two neural networks are first trained with the co-occurrence data C_s and C_t. After network training, the common subspace is found, and the training data D^l_S is transferred to this space to train an SVM classifier to classify D_T.
Figure 3.2: Comparison of different layer-matching settings for different frameworks on the Multi Feature dataset
Figure 3.3: Comparison of different layer-matching settings for different frameworks on the NUS-WIDE dataset
Figure 4.1: Framework of the proposed method. The source data X_s are HSI data from the Pavia Center domain, whereas the target data X_t are HSI from the Pavia University domain. We first project the data of both domains into a new feature subspace and find the transformation matrix T. With T, we can obtain the initial classes of pixels. The data of the target domain then return to the original space to be optimized by an MRF, which imposes the structure constraint.
Figure 5.1: A general structure comparison of three types of deep transfer learning. S: the source domain; T: the target domain.
Figure 5.2: The flowchart of the proposed Xnet. Red highlights the task-specific attentional adaptation process of the network structure. GRL stands for Gradient Reversal Layer [30]. Detailed parameter definitions can be found in Sec. 5.2.
Figure 5.3: Visualized examples for the digit recognition tasks
Figure 5.4: Visualized examples for the satellite-to-aerial scene adaptation tasks
Figure 6.1: (Best viewed in color.) Illustration of the mechanism comparison between the classical adaptation approach and the proposed DuAN. (a) The classifier cannot classify target domain data well even though the two domains are aligned well, as such approaches may fail to consider task-specific classifiers during adaptation. (b) Two individual task-specific classifiers, first trained on the source domain data, provide inconsistent classification results for the target domain data. This discrepancy is minimized iteratively: 1. the source data feature mimics the target data feature; 2. the classifiers are updated based on the new source data distribution and provide a new discrepancy; 3. the target data feature is updated to minimize this discrepancy. The target data finally suit various task-specific classifiers.
Figure 6.2: The flowchart of the proposed DuAN. Two adversarial processes exist: one for feature adaptation, realized by the source flow (orange), and the other for the classification task, realized by the target flow (purple). Flow here means the forward and backward propagation in the neural network. Steps 1-3 refer to the three iterative training steps; components in the corresponding step are updated iteratively. "Ini" is the abbreviation for model initialization.
Figure 6.3: Examples from the traditional digit recognition datasets
Figure 6.4: Left: examples from the proposed satellite-to-aerial transfer learning datasets with 9 categories. Right: examples from the proposed ground-to-aerial transfer learning datasets with 15 categories (except for classes in Fig. 1.3).
Figure 6.5: (a)-(b) t-SNE [69] visualization results of transfer learning methods for the satellite-to-aerial scene adaptation. (c)-(d) t-SNE [69] visualization results of transfer learning methods for the ground-to-aerial scene adaptation. After applying our adaptation methods, the target samples are more discriminative.
Figure 6.6: The classification accuracies on validation data
Figure 7.1: Dataset (a) includes only three partial categories with the single label "car". Dataset (b), with two partial categories, has the label "person". By transferring from the two single-labeled datasets, we can get the final multi-labels for dataset (c).
Figure 7.2: UCM multi-label aerial image dataset
Figure 7.3: AID multi-label aerial image dataset

Glossary

ADDA: Adversarial Deep Domain Adaptation
CCA: Canonical Correlation Analysis
CDTL-SVM: CCA and Deep Transfer Learning based SVM
DCCA: Deep Canonical Correlation Analysis
DCNN: Deep Convolutional Neural Network
DM: Deep Mapping
DNNs: Deep Neural Networks
DSTL: Dual Space structure preserving Transfer Learning
DT-LET: Deep Transfer Learning by Exploring Where to Transfer
DTL: Deep Transfer Learning
DTSE: Deep mapping based heterogeneous Transfer learning model via querying Salient Examples
DuAN: Dual Adversarial Network
GAN: Generative Adversarial Network
GMM: Gaussian Mixture Model
GRL: Gradient Reversal Layer
GSAS: Ground/Satellite-to-Aerial Scene
GSSA: Satellite/Ground-to-aerial Scene Adaptation
GTSRB: German Traffic Signs Recognition Benchmark
HSI: Hyperspectral Image
IALM: Inexact Augmented Lagrange Multiplier
IP: Indian Pines
MCD: Maximum Classifier Discrepancy
MMD: Maximum Mean Discrepancy
MNIST: Modified National Institute of Standards and Technology database
MRF: Markov Random Field
mSDA: Marginalized Stacked Denoising Autoencoder
NADDA: Non-Adversarial Deep Domain Adaptation
NMF: Nonnegative Matrix Factorization
OT: Optimal Transport
PC: Pavia Center
PU: Pavia University
ROSIS: Reflective Optics System Imaging Spectrometer
SDA: Stacked Denoising Autoencoder
SGD: Stochastic Gradient Descent
SSQ: Salient Sample Querying
STL: Subspace Transfer Learning
SVHN: Street View House Numbers
UTL: Unsupervised Transfer Learning

Acknowledgments

Doctoral studies are challenging. I would like to thank those who have helped and supported me during my Ph.D. study.

Special thanks go to my supervisors, Prof. Z. Jane Wang and Prof. Rabab Ward. Their guidance and patience inspire me to move on, not only in my Ph.D. study but also in other aspects of my life. Without their generous support and wise suggestions, I would not have been able to make it to this point.

I owe a debt of gratitude to my supervisory committee. Suggestions and inspiration from these great professors are valued the most by a student like me, someone who is new to the research field.

I would also like to thank my lab-mates, friends, and co-authors: Prof. Yi Yang, Prof. Xiangyang Ji, Dr. Lichao Mou, Dr. Jiayue Cai, Yongwei Wang, Xinrui Cui, Dan Wang, and Kaiwen Yuan. I would also like to express special thanks to my senior lab-mates, including Prof. Liang Zhao, Prof. Liang Zou, and Dr. Jiannan Zheng. Without their professional suggestions and personal encouragement, I would never have grown up as fast as I did.

Lastly, I am forever thankful to my beloved parents.
I deeply thank them for their love and support.

Chapter 1

Introduction

"Everything is related to everything else, but near things are more related than distant things." (Waldo R. Tobler, 1970)

Different types of images record and present the world we see in different ways. Beyond the widely used RGB images, other types of images show the world from different views, in different modalities, and at different resolutions. In this thesis, our interest is to better process and understand such images. A domain in image processing refers to the state of the world reflected by a specific type of image at a particular moment. An interesting research topic is exploring the underlying correlations among images from different domains; the related research direction is referred to as transfer learning, or domain adaptation. The objective of transfer learning is not only to figure out the correlations among different domains, but also to transfer information from the annotation-rich domain (the source domain) to the annotation-scarce domain (the target domain), to help the latter with its classification, segmentation, or recognition tasks.

The necessity of transfer learning mainly comes from the lack of annotation for the large volume of newly collected image data. With the explosive increase of multi-source image data from Internet services such as YouTube and Flickr, a large number of web databases can easily be crawled. The much easier access to more types of images, in huge quantities, makes the challenge of finding a more efficient way to annotate data even more urgent. Annotating a large number of newly collected target images is costly, consuming considerable human labor and time, and it is increasingly becoming unrealistic. A possible way to automatically annotate these unannotated image data is to borrow and leverage the prior knowledge of semantically related image data with sufficient annotations. The annotation-rich data and the annotation-scarce data are defined as the source domain and the target domain, respectively. The primary problem, however, is the distribution mismatch and domain shift across the source and target domains, owing to factors such as resolution, illumination, viewpoint, and background. From the statistical learning perspective, the domain gap also reflects the fact that the fundamental independent-and-identically-distributed condition no longer holds across the two domains. Directly applying prior knowledge of the source domain to the target domain is therefore not a feasible solution. These observations have motivated the emergence of transfer learning / domain adaptation, whose objective is to bridge the gap between the two domains. In this thesis, our main goal is to develop new models that relieve or overcome the existing concerns of transfer learning. More specifically, the proposed models aim to address three issues for transfer learning: what to transfer, how to transfer, and where to transfer.

1.1 Introduction to deep transfer learning and its application in computer vision
Transfer learning, or domain adaptation, aims at extracting potential information from the auxiliary source domain to assist the learning task in the target domain, where there is often insufficient labeled data [73]. Especially for tasks like image classification and recognition, a large number of labeled samples is required but often unavailable in the target domain, since the labeling process can be tedious and costly. Without the help of related data (the source domain data), these learning tasks can fail. Making better use of the auxiliary source domain data through efficient transfer learning methods has therefore attracted increasing attention from researchers.

It should be noted that directly applying labeled source domain data to a new scene in the target domain for a classification or segmentation task often results in poor performance, due to the semantic gap between the two domains, even when they represent the same objects [108][27][66][101]. The semantic gap can result from different acquisition methods, including different conditions (illumination or view angle) and the use of different cameras or sensors. Images can be collected from satellites, airplanes, or regular hand-held cameras, recording the same scene with different fields of view: the outer-space view, the aerial view, and the ground view. In this thesis, using advanced transfer learning methods, we explore both the correlations between images from the same view collected under different conditions and the correlations between images from different fields of view.

The first scenario considered for transfer learning in this thesis is transfer learning for a special type of satellite image, the hyperspectral image (HSI). This scenario explores the correlation between different remote sensing images. Unlike regular RGB images with three channels, hyperspectral images consist of hundreds of channels. Since HSI data collected from different satellites have different characteristics, exploring the correlations between them is a challenging task, and an effective transfer learning model for HSI can help with the annotation of newly collected HSI data. Examples of HSI data from different satellites are shown in Fig. 1.1. Three public real hyperspectral datasets with different spatial resolutions serve as examples here and also appear in the experimental parts of later chapters. The first is the AVIRIS Indian Pines (IP) data. It contains 145 × 145 pixels and 202 spectral bands; there are 16 mutually exclusive ground-truth classes, with a total of 10366 marked pixels. The second is the ROSIS Pavia University (PU) data. This dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over the urban area of the University of Pavia in northern Italy on July 8, 2002. The original dataset contains 115 spectral bands ranging from 430 to 860 nm. The spatial dimension of this scene is 610 × 340, with 9 land-cover classes, and its spatial resolution is 1.3 m per pixel. The last is the Pavia Center (PC) data, also collected by ROSIS. These scenes were provided by Prof. Paolo Gamba of the Telecommunications and Remote Sensing Laboratory, Pavia University (Italy). Pavia Center is a 1096 × 1096 pixel image with a geometric resolution of 1.3 meters.

[Figure 1.1: Examples of HSI scenes collected from different satellites: (a) HSI-IP, (b) HSI-PC, (c) HSI-PU.]
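To make the data layout concrete, the following minimal NumPy sketch (illustrative only; the shapes follow the PU description above, and a random cube stands in for real data) shows how an HSI cube differs from an RGB image and how it is commonly flattened into per-pixel spectra for classification:

```python
import numpy as np

# Stand-in for a loaded hyperspectral cube: the ROSIS Pavia University
# scene is 610 x 340 pixels with 115 spectral bands, versus 3 for RGB.
height, width, bands = 610, 340, 115
cube = np.random.rand(height, width, bands).astype(np.float32)

# Per-pixel (spectral) classification treats each pixel as one spectral
# vector, so the cube is flattened to a (num_pixels, num_bands) matrix.
pixels = cube.reshape(-1, bands)                    # (207400, 115)

# Band-wise standardization, a common preprocessing step before feeding
# spectra to a classifier or an autoencoder.
pixels = (pixels - pixels.mean(axis=0)) / (pixels.std(axis=0) + 1e-8)
print(pixels.shape)
```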
The second scenario considered in this thesis is transfer learning for digit recognition tasks, which explores the correlation between different RGB images. RGB images collected under different conditions also exhibit a large semantic gap. Exploring the relationship between digits collected under different conditions is useful not only for annotating unknown digit datasets, but also in handwriting recognition and archaeological discovery. Examples of digit datasets are shown in Fig. 1.2, where six traditional digit recognition tasks are presented in pairs for comparison. The first example is MNIST vs. MNIST-M: MNIST-M contains RGB images with three color channels, while MNIST contains single-channel grey images with a much simpler representation, so there is a distinct difference between the two datasets. The second example is SVHN vs. MNIST: the Street View House Numbers (SVHN) dataset [115] and the MNIST dataset have a much larger distribution gap. The third is Synthetic Numbers vs. SVHN: compared with SVHN, the Syn. Numbers dataset has different positionings, orientations, and backgrounds. The fourth example is MNIST vs. USPS: both datasets contain white digits on a solid black background. The last example is Synthetic Signs vs. GTSRB, with the settings of both datasets the same as in [31]: Synthetic Signs is a synthetic dataset generated from common street signs after various artificial transformations, and GTSRB is the German Traffic Signs Recognition Benchmark.

[Figure 1.2: Visualized examples for the digit recognition tasks: MNIST vs. MNIST-M, SVHN vs. MNIST, Syn. Digits vs. SVHN, MNIST vs. USPS, Syn. Signs vs. GTSRB.]

The third scenario considered is transfer learning between images from different views. This scenario includes two sub-scenarios: transfer learning between the satellite-view scene and the aerial-view scene, and transfer learning between the ground-view scene and the aerial-view scene. Images from the different views represent the same object but are collected by different sensors. Exploring such correlations can not only reveal the relationship between outer-space data and in-space data, but also allow ground-view data with sufficient prior information (e.g., labels) to help annotate a newly collected large volume of aerial-view data with very limited prior knowledge (e.g., no label information). Examples are shown in Fig. 1.3, which presents five types of objects from three different views. Establishing the correlation among images from different views enables knowledge exchange between them.

[Figure 1.3: Examples of scenes from different views (harbor, forest, residence, beach, parking lot), from the satellite view, the aerial view, and the ground view.]

To achieve knowledge transfer between images in different domains, numerous transfer learning methods have been proposed to overcome the semantic gap between images [20][60][98][100]. Traditionally, these methods adopt linear or non-linear transformations with kernel functions to learn a common subspace on which the gap is bridged [109][86][102]. However, recent advances have shown that the features learned on such a common subspace can be inefficient for many tasks. Therefore, deep learning-based models have also been introduced recently, owing to their power in high-level feature representation.
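The most common deep transfer learning baseline implied here, reusing features learned on a large annotated source dataset and retraining only a small classifier head on the scarce target data, can be sketched in a few lines of PyTorch. This is an illustrative baseline under assumed shapes, not one of the models proposed in this thesis:

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative baseline (not a model proposed in this thesis): transfer
# ImageNet-pretrained features to a label-scarce target task by freezing
# the convolutional backbone and retraining only the classifier head.
backbone = models.resnet18(pretrained=True)
for p in backbone.parameters():
    p.requires_grad = False                 # keep source-domain features fixed

num_target_classes = 10                     # e.g., ten digit classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)

optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One hypothetical training step on a small labeled target batch.
images = torch.randn(8, 3, 224, 224)        # stand-in for target images
labels = torch.randint(0, num_target_classes, (8,))
loss = criterion(backbone(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Plain fine-tuning of this kind ignores the domain shift discussed above, which is exactly what the what/where/how-to-transfer questions below address.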
Current deep learning-based transfer learning methods focus on two major research directions: what to transfer and how to transfer [53]. Regarding "what to transfer," researchers mainly concentrate on instance-based transfer learning and parameter-transfer approaches. Instance-based transfer learning methods assume that only certain parts of the source data can be reused for learning in the target domain, through re-weighting [35]. Parameter-transfer approaches mainly try to find the pivotal parameters of a deep network to transfer, in order to accelerate the transfer process. Regarding "how to transfer," different deep networks have been introduced to carry out the transfer learning process. In both research directions, however, the correct correspondence of layers is generally ignored. In this thesis, we plan to conduct the following research: explore the application potential of "what to transfer" and "how to transfer" in the computer vision area, specifically for satellite images, aerial images, and ground-view images; and propose a novel research direction, "where to transfer," to find the correct correspondence of layers between the deep neural networks of the source and target domains.

To conclude, this thesis conducts thorough research on three directions (what to transfer, where to transfer, and how to transfer) in order to serve transfer learning applications in computer vision and remote sensing, specifically transfer learning between remote sensing scenes, between RGB images, and between ground/satellite-view and aerial-view scenes.

1.2 Related work

1.2.1 What to transfer

Sample selection is a promising research front for improving the inferior quality of training samples [122][59][103]. Its main purpose is to locate the most valuable data to benefit the supervised training process for classification or segmentation. The concept can also be applied to transfer learning, mainly for choosing training samples from the source domain. Two major criteria for selecting valuable samples are the informativeness and the representativeness of data samples. These two criteria are widely used in active learning, one of the best-known sample selection approaches. Unlike traditional supervised methods, which generally set parameters based on randomly selected training samples, active learning gives the learner the freedom to augment the training data based on specifically designed criteria.

Given a large pool of unlabeled HSI data, active learning iteratively selects the most valuable samples to label; this selection process is also called querying [42]. A widely acknowledged branch of active learning queries the most informative samples. One representative method of this branch is uncertainty sampling, which queries the samples with the lowest certainty [3][95][52]. However, this method may fail due to outliers, which have the highest uncertainty.
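A minimal sketch of uncertainty sampling, assuming class-probability predictions for an unlabeled pool are already available (the function name and toy numbers are hypothetical), is shown below; note that a near-uniform prediction, which may well be an outlier, receives the top score, which is exactly the failure mode noted above:

```python
import numpy as np

def entropy_uncertainty_query(probs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k pool samples whose predictive
    distribution has the highest entropy, i.e., the least certain ones.
    probs: (n_samples, n_classes) class probabilities from the model.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:]

# Toy pool of five unlabeled samples over three classes.
probs = np.array([[0.98, 0.01, 0.01],   # confident: never queried
                  [0.34, 0.33, 0.33],   # near-uniform: queried first
                  [0.70, 0.20, 0.10],
                  [0.50, 0.49, 0.01],
                  [0.90, 0.05, 0.05]])
print(entropy_uncertainty_query(probs, k=2))   # -> [2 1]
```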
Query-by-committee [19][29][107], another well-known approach in this branch, measures the informativeness of samples by calculating the degree of agreement among several model variants, called committee members. This method reduces the size of the version space, the subset of the parameter space consistent with the labeled examples. However, the effort required to choose the appropriate parameter space that most influences the error rate increases the workload remarkably. Another active learning method, error-reduction-based sampling [77], uses sampling estimation to estimate the future error directly, addressing the efficiency problem. Guo and Schuurmans [38] propose a batch mode that selects multiple instances in every iteration to further improve efficiency. However, these efficient incremental training procedures search for the underlying queries solely based on the available small number of labeled examples, regardless of the distribution of the unlabeled ones. This criterion may lead to sample bias.

In contrast, the other branch of active learning queries the most representative samples, which exploit the distribution of the unlabeled data efficiently. One popular scheme is the coarse-to-fine strategy: first cluster the initial data and choose representative samples to label manually, then propagate the learned decision. However, this scheme depends heavily on the clustering quality, and no measure is taken to avoid repeatedly labeling samples from the same cluster. To overcome this problem, Sanjoy et al. [22] propose a hierarchical clustering method that can detect and exploit clusters whose structures are loosely aligned with class labels, and Rita et al. [14] propose a criterion that concurrently selects a set of query samples by directly minimizing the difference between the distributions of labeled and unlabeled data. We believe, however, that a more realistic way to pinpoint the dominant features of the data distribution is to enhance the clustering process itself. Based on this motivation, and differently from the former clustering frameworks, we propose measuring the representativeness of each candidate sample by its data density, since density measurement has been demonstrated to be an efficient method with strong generalization performance [76].

Recent research suggests that querying samples which are both most informative and most representative yields training data of much higher quality. A typical approach seeks data points that are hard to predict and, at the same time, representative enough to capture the distribution of the testing data [112]. During this process, the uncertainty and the density of the data are dynamically balanced to acquire the optimal candidate training samples [26]. A limitation of this approach is that instance density is measured only on the distribution of unlabeled data. Huang et al. [42] further discovered that the representativeness of samples can be enhanced by taking the distributions of both labeled and remaining unlabeled data into account. The remaining problems are that informativeness and representativeness must still be measured separately, and balancing them may yield suboptimal solutions.
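In the spirit of the density-weighted querying just described, the following hypothetical sketch (an illustration only, not the querying algorithm proposed in Chapter 2) scores each pool sample by its predictive entropy multiplied by a kernel-density estimate over the pool, so that uncertain but isolated outliers are down-weighted:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def density_weighted_query(X_pool, probs, k, beta=1.0):
    """Combine informativeness (predictive entropy) with
    representativeness (average Gaussian-kernel similarity to the
    rest of the pool); beta trades off the two criteria."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    dist = pairwise_distances(X_pool)                    # (n, n) distances
    sigma = np.median(dist) + 1e-12                      # kernel bandwidth
    density = np.exp(-(dist / sigma) ** 2).mean(axis=1)  # representativeness
    scores = entropy * density ** beta
    return np.argsort(scores)[-k:]                       # top-k combined score

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(100, 10))                      # 100 unlabeled spectra
probs = rng.dirichlet(np.ones(3), size=100)              # mock class posteriors
print(density_weighted_query(X_pool, probs, k=5))
```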
The main assumption of our work on sample selection for transfer learning is that the informativeness and representativeness criteria are never independent of each other. Representativeness controls the macrostructure of the data, informativeness reflects its micro-features, and the two should be correlated: the micro-features should be considered in light of the data macrostructure, and the macrostructure should not neglect data details when estimating the data distribution. This principle is the primary motivation of the active learning-based deep transfer learning method proposed in this thesis. We introduce this principle into the salient sample querying process on both the source and target domain data. Unlike most existing active transfer learning frameworks, which consider only the informativeness of the transferred data [120][32] and generally ignore its representativeness, here the most informative and most representative samples of every class are queried and correlated simultaneously during the deep transfer learning process. A robust transformation matrix is then acquired from these actively acquired training samples.

1.2.2 Where to transfer

Deep learning aims to learn a non-linear representation of raw data that reveals hidden features [63][114]. However, a large amount of labeled data is required to avoid over-fitting during the feature learning process. Transfer learning has been introduced to augment the data with prior knowledge: by aligning data from different domains in a high-level correlation space, information can be shared across domains. To find this correlation space, many deep transfer learning frameworks have been proposed in recent years, with the main motivation of bridging the semantic gap between the deep neural networks of the source domain and the target domain. Due to the complexity of transfer learning, however, some transfer mechanisms still lack a satisfying interpretation, and quite a few interesting ideas have been proposed in response. To tackle the problem of determining which domain should be the source and which the target, Fabio et al. [12] propose aligning the source and target domains automatically. To boost transfer efficiency and find extra benefit during the transfer process, deep mutual learning [119] has been proposed to transfer knowledge bidirectionally. The function of each layer in transfer learning is explored in [16], and transfer learning with unequal classes and unequal data is studied in [74] and [5], respectively. All of the above works, however, mainly address the "what to transfer" and "how to transfer" problems; they generally do not interpret the matching mechanism between the layers of the source-domain and target-domain deep networks. To address this "where to transfer" concern, in this thesis we name the problem DT-LET: Deep Transfer Learning by Exploring Where to Transfer, and we adopt the stacked denoising autoencoder (SDA) as the baseline deep network for transfer learning.

Glorot et al. first employed stacked denoising autoencoders to learn similar features in a joint space for sentiment classification [33]. The computational complexity was further reduced by Chen et al. with the Marginalized Stacked Denoising Autoencoder (mSDA) [2]; in that approach, some components of the word vector are set to zero in the expectation equations to optimize the representation. By matching the marginal as well as the conditional distributions, Zhang et al. and Zhuang et al. also developed SDA-based homogeneous transfer learning frameworks [124][118].
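Since the SDA serves as the baseline network here, the following minimal PyTorch sketch shows one denoising autoencoder layer, the building block that is stacked to form an SDA. The layer sizes and noise level are assumptions for illustration, not the configuration used in Chapter 3:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One layer of a stacked denoising autoencoder (SDA): corrupt the
    input with masking noise, then reconstruct the clean input, forcing
    the hidden code to capture features robust to corruption."""
    def __init__(self, n_in: int, n_hidden: int, noise: float = 0.2):
        super().__init__()
        self.noise = noise
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x):
        # Masking noise: randomly zero a fraction `noise` of input features.
        corrupted = x * (torch.rand_like(x) > self.noise).float()
        code = self.encoder(corrupted)
        return self.decoder(code), code

dae = DenoisingAutoencoder(n_in=115, n_hidden=64)   # e.g., 115 spectral bands
x = torch.rand(32, 115)                             # a batch of inputs in [0, 1]
reconstruction, code = dae(x)
loss = nn.functional.mse_loss(reconstruction, x)    # target is the CLEAN input
loss.backward()
# Stacking: train one layer, then feed its `code` as input to the next layer.
```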
For the heterogeneous case, Zhou et al. [121] proposed an extension of mSDA that bridges the semantic gap by finding the cross-domain corresponding instances in advance. The Google Brain team recently introduced the generative adversarial network into the SDA framework and proposed Wasserstein Auto-Encoders [93] to generate higher-quality samples in the target domain. The SDA thus shows high potential, and our work also chooses the SDA as the primary neural network for addressing the "where to transfer" problem. Recent work on the "where to transfer" problem can also be found in [43].

1.2.3 How to transfer

The increasing interest in exploring deep neural networks for transfer learning can be attributed to their capability to learn abstract representations, which help disentangle the different explanatory factors of variation behind the data, and to extract invariant features from data in different domains [117][83][41].

Existing deep transfer learning methods can be divided into two major types, adversarial and non-adversarial, the primary distinction being whether adversarial learning is present in the deep neural network architecture.

Non-adversarial transfer learning

Many traditional non-adversarial methods exist, including Maximum Mean / Classifier Discrepancy (MMD/MCD) [79][96][37][61], Central Moment Discrepancy (CMD) [116], and Optimal Transport (OT) [21][72][17][15], which learn projections that align the data spaces with each other. Many existing non-adversarial transfer learning algorithms attempt to quantify domain shift by designing specific statistical distances between the two domains, guided by the convergence learning bounds in [4]. Correlation alignment [87] directly uses the difference between the means and covariances of the two distributions as the domain divergence and attempts to match them during training.

Adversarial deep transfer learning

Recent years have witnessed the growth of adversarial deep transfer learning. The general structure of such methods stems from the work of Ganin et al. [30]. A clear distinction of adversarial from non-adversarial learning is that, in the former, the data from both domains are aligned directly inside a single deep neural network. Aligning at the very beginning can be realized by simple batch normalization statistics, which align the distributions of the source and target domain data to a canonical one [55][13]. The adversarial method then introduces an adversarial loss to mix up the data from both domains, making it impossible for a domain classifier to recognize which domain a sample comes from [82]. This method assumes there exists a shared feature space between domains with small distribution divergence; for cases with larger domain divergence, an importance-weighted adversarial network has been proposed [117]. Other advances in adversarial deep transfer learning can be found in recent papers; the most popular trend is to make the adaptation process task-specific, meaning that the adapted features should better serve the classification/recognition task [80][50][49][81][47].
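The hallmark of the Ganin-style adversarial methods above is the gradient reversal layer (GRL): the domain classifier is trained to separate source from target, while the reversed gradient trains the feature extractor to confuse it. Below is a minimal PyTorch sketch of a GRL with assumed feature sizes, an illustration of the published mechanism [30] rather than code from this thesis:

```python
import torch
import torch.nn as nn
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal: identity on the forward pass; multiplies the
    gradient by -lambda on the backward pass, so the feature extractor
    learns to FOOL the domain classifier that is trained through it."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Features from a shared extractor -> GRL -> domain classifier.
features = torch.randn(16, 128, requires_grad=True)   # stand-in features
domain_classifier = nn.Linear(128, 2)
domain_logits = domain_classifier(grad_reverse(features))
domain_labels = torch.randint(0, 2, (16,))            # 0 = source, 1 = target
loss = nn.functional.cross_entropy(domain_logits, domain_labels)
loss.backward()   # gradients reaching `features` are sign-flipped
```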
In this thesis, we first explore a non-adversarial transfer learning framework, named the Dual Space Structure Preserving Transfer Learning (DSTL) method, and then propose two task-specific adversarial transfer learning methods. The first is Xnet, which introduces attention learning to connect the transfer learning and the classification/recognition task directly; the second is the Dual Adversarial Network (DuAN), applied to the ground/satellite-to-aerial view transfer learning task.

1.2.4 Transfer learning for remote sensing

Remote sensing data can generally be divided into satellite images and aerial images; satellite images can be further divided into hyperspectral images, multispectral images, SAR images, etc. In this thesis, we mainly test the proposed transfer learning models on remote sensing data, specifically hyperspectral images (HSI) and aerial images.

Transfer learning for hyperspectral images

A prevalent trend in hyperspectral image (HSI) classification is to exploit the spatial and spectral information of HSI data to the maximum (an example of HSI classification is shown in Figure 1.4). A major assumption of the dominant supervised process, however, is that the training and testing data lie in the same feature space and follow the same distribution [46][99]. In many real-world problems, especially the processing of newly collected HSI data without ample training samples, this assumption may not hold, and existing supervised methods fail to work. Traditional unsupervised methods that barely consider the data distribution of HSI may yield unsatisfying classification performance. In this situation, we may transfer the knowledge obtained from another relevant dataset with sufficient training samples (the source domain), possibly with a different feature space or learning task, to assist the unsupervised HSI processing in the target domain.

[Figure 1.4: An example of the classification result on HSI. (a) Pseudo-color image of the PU dataset; (b) the classification result of the PU dataset.]

Arduous efforts have been made to tackle this problem, and probably the most prevalent solution is the transfer learning-based approach, which uses prior information from a relevant domain to learn new tasks in the objective domain [85]. A general framework transfers the HSI data from the source domain (the auxiliary HSI) and the target domain (the objective HSI) to a common subspace to overcome the cross-domain semantic disparity, where the goal is to find the transformation matrix. This semantic disparity is large among HSI data: due to different acquisition conditions and sensors, the spectra observed in a new scene can be quite different from those of an existing scene, even when they represent the same type of objects. Transfer learning in many cases may therefore not provide decent results. One crucial research issue is how to reduce the difference between the data while preserving the original data characteristics. This remains challenging, especially for heterogeneous transfer learning, where the source- and target-domain data have different dimensionalities in different feature spaces.
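The common-subspace idea can be illustrated with Canonical Correlation Analysis (CCA, which appears in this thesis's glossary). In the hypothetical sketch below, paired "co-occurrence" samples from two HSI domains with different band counts are projected into one shared low-dimensional subspace; all dimensions and data are made up for illustration:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical heterogeneous domains: e.g., 115-band source spectra and
# 102-band target spectra, with n paired co-occurrence samples that are
# known to depict the same materials.
rng = np.random.default_rng(0)
n, d_src, d_tgt, d_sub = 200, 115, 102, 10
shared = rng.normal(size=(n, d_sub))               # latent common factors
Xs = shared @ rng.normal(size=(d_sub, d_src)) + 0.1 * rng.normal(size=(n, d_src))
Xt = shared @ rng.normal(size=(d_sub, d_tgt)) + 0.1 * rng.normal(size=(n, d_tgt))

# CCA finds one projection per domain that maximizes the correlation of
# the projected pairs: a linear "transformation matrix" for each domain.
cca = CCA(n_components=d_sub)
Zs, Zt = cca.fit_transform(Xs, Xt)                 # both now live in R^10

# In this shared subspace, a classifier trained on projected labeled
# source data can be applied to projected target data.
print(Zs.shape, Zt.shape)                          # (200, 10) (200, 10)
```

The deep models of Chapters 2 and 3 replace this kind of linear projection with learned deep mappings.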
Transfer learning for ground/satellite-to-aerial scenes (GSAS)

In this thesis, we also apply transfer learning to aerial images. With much easier access to such remote sensing images nowadays, their annotation is an urgent problem. We first explore the relationship between different types of remote sensing images through transfer learning between satellite scenes and aerial scenes; we then apply transfer learning to help annotate remote sensing images by taking advantage of ground scene data. We name these two tasks GSAS adaptation tasks; examples are shown in Fig. 1.3. We assume that images captured from different views of the same scene class share consistent underlying intrinsic semantic characteristics, despite a large feature gap. With rich information transferred from ground-view data, which can easily be obtained from ImageNet [23] or SUN [105], the understanding and annotation of label-scarce aerial images becomes much easier.

Formerly, work addressing this cross-view (ground-to-aerial) domain adaptation problem was mainly based on image geolocalization [104] (an example is shown in Fig. 1.5). Other works [88][89][90][24] treated the scene transfer from ground to aerial as a special case of cross-domain transfer in which the divergence across domains is caused by viewpoint changes. However, all existing methods were based on quite basic models and were tested on small, randomly collected data lacking a unified benchmark. In this thesis, we propose, for the first time, a unified GSAS benchmark for such transfer learning tasks.

[Figure 1.5: The aerial-view image and the ground-view image of the same scene [104].]

1.3 Research objectives, methodologies, and thesis outline

The major research objective of this thesis is to propose advanced deep transfer learning frameworks that address three key problems (what to transfer, where to transfer, and how to transfer) and to investigate the proposed frameworks in specific learning tasks. Specifically, for the "what to transfer" problem, we propose a novel active-learning-based sample selection method that selects samples in both the source and target domains for transfer; for the "where to transfer" problem, we propose the DT-LET method to explore the correspondence between the layers of the source-domain and target-domain deep neural networks; and for the "how to transfer" problem, we propose three novel methods: the DSTL, Xnet, and DuAN methods.

The presentation outline of the dissertation is shown in Fig. 1.6, where the general contents of every chapter are briefly listed. The organization of the dissertation is as follows.

In Chapter 2, to explore the "what to transfer" problem, we propose an active-learning-based transfer learning method and apply it to explore the correlation between hyperspectral images (HSIs). Existing transfer learning methods for HSIs, which mainly concentrate on overcoming the divergence among images, may fail to carefully consider the content to be transferred, which limits their performance. Motivated by this observation, we present two novel ideas: 1) we introduce, for the first time, an active learning process to initialize the salient samples of the HSI data, which are transferred later; and 2) we propose constructing and connecting higher-level features of the source and target HSI data to further overcome the cross-domain disparity. Unlike existing methods, the proposed classification framework needs no prior knowledge of the target domain, and it works for both homogeneous and heterogeneous HSI data. Experimental results on three real-world hyperspectral images demonstrate the significance of the proposed method for HSI classification.

In Chapter 3, to explore the "where to transfer" problem, we propose the novel DT-LET method and apply it to explore the correlation between digital images.
Generally, previous transfer learning methods based on deep networks assume that knowledge should be transferred between the same hidden layers of the source domain and the target domain. This assumption does not always hold true, especially when the data from the two domains are heterogeneous with different resolutions. We propose a new mathematical model, named the DT-LET model, to solve this heterogeneous transfer learning problem. In order to select the best matching of layers for transferring knowledge, we define a specific loss function to estimate the corresponding relationship between high-level features of data in the source domain and the target domain. To verify this proposed cross-layer model, experiments on two cross-domain recognition/classification tasks were conducted, and the superior results demonstrate the necessity of layer correspondence searching.

In order to explore the "how to transfer" problem, we present three different methods in Chapters 4, 5, and 6 of this thesis. The methods are applied to explore the correlations between three different types of images respectively. In Chapter 4, a transfer learning method, referred to as dual space unsupervised structure preserving transfer learning (DSTL), is proposed. This method mainly addresses the problem that the data structure is ignored in "how to transfer". To the best of our knowledge, this method for the first time applies the transfer learning framework to HSI classification without requiring training samples. The recent advances in remote sensing techniques allow easier access to imaging spectrometer data. Manually labeling and processing such collected hyperspectral images, with vast quantities of samples and a large number of bands, is laborious and time-consuming. We propose DSTL to relieve these manual processes. The proposed DSTL framework has three main contributions: 1) to the best of our knowledge, this is the first time deep transfer learning is used for the classification of totally unknown target HSI data with no training samples; 2) the characteristics of HSI are learned in dual spaces to exploit its structure knowledge to better label HSI samples; and 3) two specific new scenarios suitable for transfer learning are investigated in this chapter.

In Chapter 5, we propose the second method, Xnet, to tackle the "how to transfer" problem. The method is named Xnet given the X-shape of the proposed deep neural network. This method mainly addresses the challenging domain shift problem in "how to transfer". Although many state-of-the-art models have been proposed recently, non-adversarial and adversarial transfer learning, the two major types of deep transfer learning frameworks, cannot fully address this concern. Our proposed Xnet model attempts to address this challenge by leveraging the desired structural properties of both non-adversarial and adversarial models. The contributions of the proposed Xnet are two-fold: Xnet solves the domain shift problem via a domain-specific deep feature generator, and we further adapt the generated features to the classification task by introducing task-specific attention learning. The superiority of the proposed model is demonstrated both on traditional digit recognition tasks and on the newly proposed satellite-to-aerial scene adaptation task.

In Chapter 6, we propose the third method, which is named the Dual Adversarial Network (DuAN).
This method mainly addresses transfer learning for domains with a significant semantic gap. Sharing information among datasets with a large domain gap is one of the most fundamental problems in "how to transfer". In this chapter, we propose the novel idea that the source domain data, which mainly serves the adaptation purpose, should be supplementary, whereas the target domain data needs to be task-specific. Motivated by this, we propose a dual adversarial network for domain adaptation, in which two adversarial learning processes with different objective functions and losses are conducted iteratively, corresponding to the adaptation purpose and the task-specific purpose¹ respectively. Features of the source and target domain data are extracted separately for domain-specific purposes. The efficacy of the proposed method is demonstrated not only on existing image classification datasets, but also on two newly introduced Ground/Satellite-to-Aerial Scene adaptation tasks. Since the semantic gap between the ground/satellite scene and the aerial scene is much larger than that between ground scenes, the newly proposed tasks are more challenging than traditional transfer learning tasks.

As "how to transfer" is the most essential part of this thesis, we also introduce the three proposed methods from the perspective of adversarial learning, as adversarial and non-adversarial transfer learning are the two major types of transfer learning methods. In Chapters 4-6, we in total propose one non-adversarial method and two adversarial methods, corresponding to three transfer learning tasks respectively: transfer learning for HSI classification, transfer learning for digit recognition, and satellite/ground-to-aerial transfer. To be more specific, the three methods explore the correlation between remote sensing images, the correlation between regular RGB images, and the correlation between remote sensing and regular RGB images, respectively.

Finally, Chapter 7 summarizes the contributions of this dissertation and discusses future research directions. An overview layout of the thesis can be found in Fig. 1.6.

¹Task-specific purpose means the transfer learning model is trained for a specific purpose such as object classification or semantic segmentation. In this section, our task is image classification, and the two classifiers serve this task.

Figure 1.6: The overview layout of the thesis.

Chapter 2

What to Transfer: Active Learning Incorporated Deep Transfer Learning for Hyperspectral Image Classification

In this chapter, we discuss the task "what to transfer". The proposed model is evaluated on hyperspectral image (HSI) data.
By exploring the "what to transfer" problem, the proposed method finds the best content for transfer learning, to help with HSI classification under scarce annotation. Many newly collected HSI images come with no annotation, which makes their classification difficult. In this chapter, we explore the possibility of borrowing information from existing annotated HSI images to help classify new unannotated ones.

Supervised hyperspectral image classification has long been investigated for hyperspectral image analysis, and generally provides satisfying performance, assuming that enough training samples are available and that both training and testing data are in the same feature space. However, such assumptions may not always hold true. Obtaining sufficient training samples for newly collected HSI data is time-consuming and requires extensive human labor. Considering that an HSI always contains vast quantities of samples, a large number of bands, and close relations between them, this training sample requirement is even more demanding. To address this concern, researchers recently tried to resort to pre-existing related HSIs as auxiliary information, exploiting this prior knowledge for the classification of newly collected ones. However, one of the foremost problems is the semantic disparity between the auxiliary and objective HSIs.

Arduous efforts have been made to tackle this problem, and probably the most prevalent solution is the transfer learning based approach. A general framework is to transfer the HSI data from the source domain (the auxiliary HSI) and the target domain (the objective HSI) to a common subspace to overcome the cross-domain semantic disparity, where the goal is to find the transformation matrix. This semantic disparity is significant among HSI data. Due to different acquisition conditions and sensors, the spectra observed in a new scene can be quite different from those in an existing scene, even if they represent the same type of objects. Therefore, transfer learning in many cases may not provide decent results. One crucial research issue is how to reduce the difference between the data while preserving the original data characteristics. This remains a challenging issue, especially for heterogeneous transfer learning, where the source and target domain data have different dimensionality in different feature spaces.

One major issue for heterogeneous transfer learning is the distribution divergence and the feature bias between the two domains. Existing transfer learning methods that adopt linear or non-linear kernel functions to map the data of both domains to a common space may not be effective enough to bridge this cross-domain gap. Recently, researchers reported that deep neural networks (DNNs), which exploit high-level features of data, can facilitate the minimization of this semantic gap. In the high-level feature space, the data from both domains are more likely to have fewer differences and biases.

A representative direction is the stacked auto-encoder (SAE) based methods, which search for high-level common features across domains; the subsequent supervised classification is based on such common representations of the data. However, as most state-of-the-art methods currently fail to consider the layer-by-layer relationship between the source and target domains in the deep network, the data bias between the two domains accumulates at each layer.
Another limitation is that the current SAE does not address the problem of "what to transfer": most current works simply transfer randomly selected training samples from the source domain to the target domain, which introduces noise. Some methods introduce a low-rank framework [84] to suppress the noise during the transfer process; however, this approach cannot eliminate the noise fundamentally, whereas selecting the content to transfer can overcome the influence of this noise more effectively.

To address the above noise problem, this chapter proposes a Deep mapping based heterogeneous Transfer learning model via querying Salient Examples (referred to as DTSE). First, salient samples in both the source and target domains are queried actively. These informative samples can, to a great extent, capture the structure information of the data in each domain and have low correlation with each other. Then, these salient samples in both domains are used to construct their auto-encoders with multiple domain-based layers. The source and target domains are correlated layer by layer in this deep network by canonical correlation analysis (CCA). Such correlations further propagate back and fine-tune the layers of the network. The forward and backward propagation iterate until the final output features of the two auto-encoders have the lowest divergence and the correlation is maximized. A more detailed framework can be found in Fig. 2.1.

2.1 Contribution summary

• To our knowledge, this is the first attempt at using deep active transfer learning for HSI classification, which explores the high-level feature correlation between remote sensing images that have different dimensions and are obtained by different sensors. Through this deep mapping process, the HSIs with few labeled samples in the target domain can be classified effectively with the knowledge and information actively learned and transferred from the source domain. This proposed deep active transfer learning framework can be a promising research direction for HSI classification.

• We design a query principle that searches for salient samples for every class of the training samples in both the source and target domains. Such salient training samples are the most informative for representing their corresponding classes. The preferred characteristics of these salient samples are as follows. First, the samples are the density peaks in their corresponding classes, meaning that these samples have the most informative signatures. Secondly, these salient ones have a relatively high Euclidean distance to other samples with high data densities [76]. This means that salient samples are prominent and sparse enough, with low correlations, and thus facilitate more robust and informative transfer learning.

• We propose a new principle for fine-tuning the auto-encoder neural networks on the HSI dataset. During the training process, the auto-encoders in both domains are set up, and high-level features are explored in both domains. The correlation between the two domains is sought by CCA layer by layer. Since the sample pairs with the same labels and the co-occurrence data in both domains have a higher correlation, the neural networks in both domains are further fine-tuned based on this principle via the CCA restriction. The final common feature space is found after this iterative fine-tuning process, taking advantage of the correlation between the co-occurrence data instead of prior knowledge of labeled training data in the target domain.
The testing samples are then classified on such a common feature space.

2.2 Method

The major components of the DTSE framework are illustrated in Fig. 2.1, including the salient sample querying (SSQ) and the deep mapping (DM). Firstly, salient samples I_l^s in the source domain and I_l^t in the target domain are queried respectively. The sample pairs with the same class label in both domains are recognized as co-occurrence data, and they are further used for the fine-tuning process of deep mapping. Secondly, we exploit deep features of the data in both domains, and the deep neural networks are fine-tuned based on the correlation constructed from the co-occurrence data. We elaborate on the two parts in the following subsections.

2.2.1 Density peaks selection

Fig. 2.2 shows a synthesized example that emphasizes the importance of SSQ in HSI classification. Fig. 2.2(a) shows a binary classification problem in which different legends represent two kinds of instances. The objective is to query 1% of the instances in the two datasets (the datasets in the source and target domains) to generate the correlation model. The results of traditional active learning methods are illustrated in Fig. 2.2(b). In comparison, the results of the proposed method are shown step by step in Fig. 2.2(c)-(e). As indicated by Fig. 2.2(b), the approach favoring informative instances, represented by the yellow legend, tends to select the samples with the highest uncertainty and thus may lead to sample bias; the approach favoring representative instances, represented by the blue legend, mainly considers the data structure while ignoring the details, and thus is likely to produce errors, especially for instances near the boundary.

Our proposed SSQ framework is detailed in Fig. 2.2(c)-(e). Each of the two major steps shown in Fig. 2.2(c) and (d) has its own effect, but they are taken as an integrated procedure because neither step is independent of the other. To be more specific, in Fig. 2.2(c), we first initialize the selected salient instances by searching for density peaks, which consider both the overall structure and local details; these points are the most representative ones. Then, in Fig. 2.2(d), we actively augment the selection to 1% of the instances, which are used for the supervision of the subsequent fine-tuning of the deep network. We sequentially select the most informative data by the min-max criterion [42], based on the previously learned representative ones. Finally, based on the selected training instances, we obtain the classification result for all the data as shown in Fig. 2.2(e), which achieves the highest accuracy compared with the results shown in Fig. 2.2(b).

We formulate the problem as follows. Suppose the HSI data is denoted by I = {(x_1, y_1), (x_2, y_2), ..., (x_{n_l}, y_{n_l}), x_{n_l+1}, ..., x_n}, consisting of n_l labeled points and n_u = n - n_l unlabeled ones. Each x_i represents a pixel in the HSI, which is a d-dimensional vector, and y_i ∈ {-1, +1} is the label of x_i. In our method, we denote the currently labeled samples by I_l. We represent the unlabeled ones by I_a = I_u ∪ {x_a}, including the unlabeled set I_u and the current salient pixel x_a, which is selected based on the active learning framework [42]. The corresponding labels are denoted by Y = {Y_l, y_a, Y_u}, in which Y_a = {y_a, Y_u} is assigned after the learning process.

When we choose the salient examples, we need to consider both the overall structure of the data and the local details.
One of the most widely acknowledged strategies is based on clustering, which selects the cluster centers as the salient samples. This strategy traditionally considers only the overall data distribution while neglecting the local distribution of the data. As a result, it can only find the representative data, while their informativeness is ignored. To overcome this drawback, we propose the following density-peaks-based sample selection idea.

Global density and local peaks

To illustrate this idea, we first introduce two concepts: global density and local peaks.

To find the instance representatives, we first calculate the density of each instance. For a certain radius r_i centering on the instance x_i, the number of neighboring instances is the global density d_i of x_i. The instances with higher global densities are more likely to be selected.

However, one problem exists in this density-based selection: we are likely to select instances with similar characteristics. Several neighboring instances with similarly high global densities are chosen based on the global density alone, and choosing them decreases other instances' chances of being selected, even if those are quite informative and representative. Based on this observation, another concept, local peaks, is explored. Although it is difficult to define the local area when considering each instance individually, it is much easier if we take every two instances as a group and compare their global densities and the distance between them. To be more specific, besides d_i of x_i, we assign another characteristic to each instance, the co-distance s_i, by considering the instance group. We first rank the density of each instance and re-arrange them as X_d = {x_{d_1}, x_{d_2}, ..., x_{d_n}}, in which x_{d_1} has the highest density and x_{d_n} the lowest. The distance between x_{d_i} and x_{d_{i+1}} is defined as s_i. In this group, the density of x_{d_i} is a little higher than, but the most proximate to, that of x_{d_{i+1}}. After obtaining s_i, we define the local peaks as the instances with both the highest global density d and co-distance s. The local peaks will never be neighboring, as they need to have a high co-distance. The final instances we select are these local peaks.

Active local peaks querying

To find the local peaks, we introduce the objective function in Eq. 2.1:

    a^* = \arg\max_{n_l < a < n} (s_a + \lambda d_a),    (2.1)

where a^* is the index of the selected instance, s_a and d_a are the co-distance and density of instance a, and λ is the coefficient balancing the two terms. However, for this objective function, the globally optimal λ is hard to find when considering all the instances. Therefore, we obtain the solution for this problem actively [76], which skips the optimization of λ. Fig. 2.3 shows a simple example of actively obtaining the solution with 20 randomly chosen instances.

In Fig. 2.3, all data in the left figure are first put on the coordinate system shown in the right figure, where the x-axis represents the global density d and the y-axis represents the co-distance s. Here, we have two categories of data, indicated by red and blue respectively. Suppose we need to select 15% of the data for training in total. We note that samples like x_{d_5} and x_{d_6}, which have quite high global density but low co-distance and might be chosen by traditional clustering methods, will not be selected in our framework, since they are quite similar to x_{d_2}. In contrast, x_{d_7}, whose overall density is not high enough, will be selected since it is quite informative: from the perspective of global density, it is quite similar to the red type; from the perspective of location, it is quite close to the blue type and belongs to this type. Therefore, we consider this point to have quite high uncertainty. Finally, x_{d_1}, x_{d_3}, and x_{d_7} are selected actively, and we get I_l = {(x_{d_1}, y_{d_1}), (x_{d_3}, y_{d_3}), (x_{d_7}, y_{d_7})}, where I_l denotes the labeled training data.

We also note that not all instances can be successfully introduced into this model; two extreme cases may occur during the querying process. The first case is that the densities of many instances are zero, as they may have no nearby neighbors. Suppose these instances are denoted by X_n = {x_{n_1}, x_{n_2}, ..., x_{n_p}}. For x_{n_i}, suppose the Euclidean distance between x_{n_i} and its nearest neighboring instance is e_{x_{n_i}}; we then define the density of x_{n_i} as d_{x_{n_i}} = 1 / e_{x_{n_i}}. Since d_{x_{n_i}} is quite small under this definition, x_{n_i} will not be chosen in the end. The second case is that, for x_{d_1} with the highest global density but zero co-distance (as no other sample has a higher global density than x_{d_1}), we heuristically assign s_{d_1} the highest value, because it must be selected.
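To make the querying procedure above concrete, the following minimal Python/NumPy sketch computes the global density d_i, the co-distance s_i, and the score of Eq. 2.1, including the two extreme cases; the names query_local_peaks, radius, lam, and n_select are our own illustrative choices, and the fully active variant described above would replace the fixed λ here with the interactive selection on the (d, s) plane of Fig. 2.3.

import numpy as np
from scipy.spatial.distance import cdist

def query_local_peaks(X, radius, lam, n_select):
    """Sketch of the density-peaks querying of Section 2.2.1.

    X: (n, d) array of pixels; radius: neighborhood radius for the
    global density; lam: the lambda of Eq. 2.1; n_select: number of
    salient instances to return.
    """
    D = cdist(X, X)                          # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)
    density = (D < radius).sum(axis=1).astype(float)

    # Extreme case 1: isolated instances get density 1 / (distance to
    # the nearest neighbor), which stays small, so they are not chosen.
    isolated = density == 0
    if isolated.any():
        density[isolated] = 1.0 / D[isolated].min(axis=1)

    # Co-distance s_i: distance to the most proximate instance with a
    # strictly higher density, following the pairing described above.
    order = np.argsort(-density)
    s = np.zeros(len(X))
    for rank, i in enumerate(order):
        if rank > 0:
            s[i] = D[i, order[:rank]].min()
    # Extreme case 2: the globally densest instance must be selected.
    s[order[0]] = s.max() + 1.0

    score = s + lam * density                # Eq. 2.1 with a fixed lambda
    return np.argsort(-score)[:n_select]

# Usage: idx = query_local_peaks(X, radius=0.5, lam=1.0, n_select=5)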
2.2.2 Active data augmentation

Even though the above querying process considers the informativeness of each instance in a local area, the few already-labeled instances are not enough for finding the most informative instances to learn the final classification model. Therefore, after obtaining the labeled data by the density peaks querying, we select the current pixel x_a based on the active learning framework [42] via a sequential augmentation process. By augmenting the data, the ultimate goal is to learn the best classification model. Since we employ the SVM for this learning process, we first briefly review the SVM.

Brief review of SVM

The key of the traditional SVM is to learn an optimal hyperplane separating the labeled instances with the maximum margin. Taking advantage of the kernel, the SVM finds the hyperplane

    f(x) = w^T x + b,    (2.2)

by solving the optimization problem

    \min_{w,b} \frac{1}{2} \|w\|^2 \quad s.t. \; y_i (w^T x_i + b) \ge 1, \; i = 1, ..., n.    (2.3)

Here we do not describe the slack variable of the traditional SVM, as it is not relevant to our current problem. By introducing the Lagrange multipliers α, we can further transform Eq. 2.3 into the Lagrange function

    \varphi(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i (w^T x_i + b) - 1 \big),    (2.4)

and by maximizing this function, the optimal f can be learned as shown in Eq. 2.5:

    \theta(w) = \min_{\alpha_i \ge 0} \varphi(w, b, \alpha).    (2.5)

As w = \sum_{i=1}^{n} \alpha_i y_i x_i, we have f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b. We hold the view that the constraint \min \frac{1}{2}\|w\|^2 is realized if \min_{f \in \mathcal{H}} \frac{1}{2} |f|_{\mathcal{H}}^2 is obtained. Therefore, we further rewrite Eq. 2.5 as

    \theta(x) = \min_{f \in \mathcal{H}} \frac{1}{2} |f|_{\mathcal{H}}^2 + \sum_{i=1}^{n} l(y_i, f(x_i)),    (2.6)

where \mathcal{H} is a reproducing kernel Hilbert space and l(y, f(x)) is the loss function.

Framework for data augmentation

To motivate the following framework, we separate x in Eq. 2.5 into two sets, I_l and x_a. Moreover, to identify the most informative example, we consider the worst case for analysis, by selecting the unlabeled instance a that leads to a small value of the objective function in Eq. 2.7 regardless of its assigned class label y_a. To achieve this goal, we consider the new objective function:

    \theta(I_l, x_a) = \max_{y_a \in \{-1,+1\}} \min_{f \in \mathcal{H}} \frac{1}{2} |f|_{\mathcal{H}}^2 + \sum_{i=1}^{n_l} l(y_i, f(x_i)) + l(y_a, f(x_a)).    (2.7)

In order to find the most informative instance, we have:

    a = \arg\min_{n_l < a < n} \theta(I_l, x_a).    (2.8)

In this way, we are more likely to select the instance closest to the decision boundary, and x_a tends to be more informative. We select the new unlabeled instances sequentially in this way until the co-occurrence data are augmented to the expected percentage.
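A minimal sketch of this sequential min-max selection (Eqs. 2.7-2.8) is given below, assuming a linear SVM from scikit-learn as the base learner; the helper names svm_objective and augment_minmax, the choice of SVC, and the brute-force re-fitting per candidate are our own illustrative simplifications, not the thesis's implementation. For each candidate, the instance is tentatively added with each possible label, the regularized objective is re-evaluated, the worse (max) label is kept, and the candidate minimizing this worst-case value is queried.

import numpy as np
from sklearn.svm import SVC

def svm_objective(clf, X, y):
    """0.5*||w||^2 plus the hinge losses (the Eq. 2.6 objective, linear kernel)."""
    margin = y * clf.decision_function(X)
    hinge = np.maximum(0.0, 1.0 - margin).sum()
    w = clf.coef_.ravel()
    return 0.5 * w @ w + hinge

def augment_minmax(X_lab, y_lab, X_pool, n_new):
    """Sequentially query instances by the min-max criterion of Eqs. 2.7-2.8.

    Note: this is O(|pool|) SVM re-fits per queried instance; it is a
    sketch for clarity, not efficiency. y_lab must contain both classes.
    """
    pool = list(range(len(X_pool)))
    picked = []
    for _ in range(n_new):
        worst = {}
        for j in pool:
            vals = []
            for ya in (-1, +1):                  # worst case over the unknown label
                Xa = np.vstack([X_lab, X_pool[j]])
                ya_all = np.append(y_lab, ya)
                clf = SVC(kernel="linear", C=1.0).fit(Xa, ya_all)
                vals.append(svm_objective(clf, Xa, ya_all))
            worst[j] = max(vals)                 # the max of Eq. 2.7
        j_star = min(worst, key=worst.get)       # the argmin of Eq. 2.8
        picked.append(j_star)
        pool.remove(j_star)
    return picked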
2.2.3 Deep mapping mechanism

The previously chosen salient samples I_l^s in the source domain are denoted as D_C^S = {C_i^S}_{i=1}^{n_s}, and I_l^t in the target domain as D_C^T = {C_i^T}_{i=1}^{n_t};
the labeled data in the source domain, denoted as D_S = {X_i^S, Y_i^S}_{i=1}^{n_s}, are used to supervise the deep mapping based classification;
the unlabeled data in the target domain are denoted as D_T = {X_i^T}_{i=1}^{n_t};
the deep network in the source domain is denoted by Θ^S = {W^S, b^S};
the deep network in the target domain is denoted by Θ^T = {W^T, b^T};
the common subspace is represented by Ω, and the final classifier by Ψ. The labeled data D_S from the source domain are used to predict the labels of D_T by applying Ψ(Ω(D_T)).

Inspired by canonical correlation analysis (CCA), which can maximize the correlation between two domains, we apply CCA within both deep neural networks Θ^S and Θ^T to construct a multi-layer correlation model. As shown in Fig. 2.1, a deep neural network is first set up in the source domain and another in the target domain, by forward propagation based on the co-occurrence data C^S and C^T. The correlation coefficients between the hidden layers of these two domains are found using CCA. After setting up Θ^S in the source domain and Θ^T in the target domain, the common high-level subspace is finally obtained. D_S and D_T are both projected to the common subspace, on which the labeled D_S are used for training and predicting the labels of D_T. A more detailed mathematical formulation is given as follows.

We employ stacked auto-encoders (SAEs) in the source domain Θ^S and in the target domain Θ^T. For the hidden layers of Θ^S and Θ^T, the hidden features A^S and A^T can be represented by

    A^{S(n+1)} = f(W^{S(n)} A^{S(n)} + b^{S(n)}), n > 1;
    A^{S(n)} = f(W^{S(n)} C^S + b^{S(n)}), n = 1.    (2.9)

    A^{T(n+1)} = f(W^{T(n)} A^{T(n)} + b^{T(n)}), n > 1;
    A^{T(n)} = f(W^{T(n)} C^T + b^{T(n)}), n = 1.    (2.10)

Here W^S and b^S are the parameters of the neural network Θ^S, and W^T and b^T are the parameters of the neural network Θ^T. A^{S(n)} and A^{T(n)} denote the co-occurrence n-th hidden layers in the source domain and in the target domain respectively. The correlation matrices V^{S(n)} and V^{T(n)}, obtained by CCA, project the features of D_S and D_T to a correlated common subspace Ω. Therefore, to set up the optimal neural networks in the source and target domains, we need to satisfy two objectives: to minimize the reconstruction errors of the neural networks in the source domain and in the target domain, and to maximize the correlation between the two networks. This objective function can be formulated as

    \min J = J_S(W^S, b^S) + J_T(W^T, b^T) - \Gamma(V^S, V^T),    (2.11)

in which J_S(W^S, b^S) and J_T(W^T, b^T) are the reconstruction errors in the source domain and in the target domain respectively, defined as follows:

    J_S(W^S, b^S) = \Big[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \big\| h_{W^S,b^S}(C_i^S) - C_i^S \big\|^2 \Big] + \frac{\lambda}{2} \sum_{l=1}^{n_S-1} \sum_{j=1}^{n_l^S} \sum_{k=1}^{n_{l+1}^S} \big( W_{kj}^{S(l)} \big)^2    (2.12)

    J_T(W^T, b^T) = \Big[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \big\| h_{W^T,b^T}(C_i^T) - C_i^T \big\|^2 \Big] + \frac{\lambda}{2} \sum_{l=1}^{n_T-1} \sum_{j=1}^{n_l^T} \sum_{k=1}^{n_{l+1}^T} \big( W_{kj}^{T(l)} \big)^2,    (2.13)

where h_{W^S,b^S}(C_i^S) and h_{W^T,b^T}(C_i^T) are the outputs of the two neural networks, n_S and n_T are the numbers of their layers, n_l^S and n_l^T are the numbers of neurons in layer l, and λ is the trade-off parameter.

The third term Γ(V^S, V^T) in Eq. 2.11 is the correlation matching term between the source domain and the target domain. The objective is to optimize the correlation matrices V^S and V^T by maximizing the correlation between the source domain data and the target domain data, as defined in Eq. 2.14:

    \Gamma(V^S, V^T) = \sum_{l=2}^{n_S-1} \frac{ V^{S(l)T} \Sigma_{ST} V^{T(l)} }{ \sqrt{ V^{S(l)T} \Sigma_{SS} V^{S(l)} } \sqrt{ V^{T(l)T} \Sigma_{TT} V^{T(l)} } },    (2.14)

where \Sigma_{ST} = A^{S(l)} A^{T(l)T}, \Sigma_{SS} = A^{S(l)} A^{S(l)T}, and \Sigma_{TT} = A^{T(l)} A^{T(l)T}. By minimizing Eq. 2.11, we can collectively train the two neural networks Θ^T = {W^T, b^T} and Θ^S = {W^S, b^S}.

After constructing the multiple layers of the networks by Eq. 2.11, a final CCA is employed at the top layer to fine-tune both neural networks by back-propagation, and the high-level common subspace is obtained. On such a common subspace, the classification of the unlabeled D_T can be conducted under the supervision of the labeled D_S.
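As a concrete illustration of the per-layer correlation step (Eq. 2.14), the sketch below computes the CCA projections between one pair of hidden layers using scikit-learn; the function name correlate_layers and the regularization-free setup are our own simplifications. In the full model, the resulting correlation enters Eq. 2.11 and drives the back-propagation fine-tuning.

import numpy as np
from sklearn.cross_decomposition import CCA

def correlate_layers(A_s, A_t, n_components=30):
    """Correlate one source hidden layer with one target hidden layer.

    A_s: (n_pairs, d_s) source activations of the co-occurrence data;
    A_t: (n_pairs, d_t) target activations of the same pairs.
    n_components must not exceed min(n_pairs, d_s, d_t).
    Returns the projection matrices (V_s, V_t) and the per-component
    canonical correlations, i.e. the summands of Eq. 2.14.
    """
    cca = CCA(n_components=n_components)
    cca.fit(A_s, A_t)
    H_s, H_t = cca.transform(A_s, A_t)
    # Canonical correlation of each projected component pair.
    corrs = np.array([
        np.corrcoef(H_s[:, k], H_t[:, k])[0, 1]
        for k in range(n_components)
    ])
    return cca.x_rotations_, cca.y_rotations_, corrs

# Usage: V_s, V_t, rho = correlate_layers(A_s, A_t, n_components=30)
# rho.sum() contributes one layer's term to Gamma of Eq. 2.14.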
2.2.4 Classification on common semantic subspace

The final classification is performed on the common subspace Ω. The unlabeled data D_T in the target domain and the labeled D_S are both projected onto the common subspace Ω by the correlation coefficients V^{S(n_S)} and V^{T(n_T)}. The projection is formulated as H^S = A^{S(n_S)} V^{S(n_S)} and H^T = A^{T(n_T)} V^{T(n_T)}. The standard SVM algorithm is applied on Ω. The classifier Ψ is trained on {H_i^S, Y_i^S}_{i}^{n_S}, and this trained classifier is applied to D_T as Ψ(H^T). The pseudo-code of the proposed approach can be found in Alg. 1 below.

We want to emphasize that the proposed deep mapping framework does not require labeled samples from the target domain; the transferred data from the source domain are sufficient for training the classifier. The transfer process is only under the supervision of the co-occurrence data, and no prior label information from the target domain is transferred. Therefore, the classification process can be viewed as an unsupervised classification. This is a major advantage and innovation of the proposed method compared with traditional transfer learning frameworks, for which labeled data from the target domain are needed.

Algorithm 1 Classification on Common Semantic Subspace
Input: X^S, Y^S, V^S, X^T, V^T
Input: Θ(W^S, b^S), Θ(W^T, b^T), n_s, n_t
Output: Y^T
1: function SVMTRAINING(X^S, Y^S, V^S, Θ(W^S, b^S), n_s)
2:   for i = 1, 2, 3, ..., n_s do
3:     Calculate A^{S(n_s)} for X^{S(n_s)} by Θ(W^S, b^S) as in Eq. (2.9)
4:     H^S ← A^{S(n_s)} V^{S(n_s)}
5:   end for
6:   Ψ ← {H^S, Y^S}
7: end function
8: function SVMTESTING(X^T, V^T, Θ(W^T, b^T), n_t)
9:   for j = 1, 2, 3, ..., n_t do
10:    Calculate A^{T(n_t)} for X^{T(n_t)} by Θ(W^T, b^T) as in Eq. (2.10)
11:    H^T ← A^{T(n_t)} V^{T(n_t)}
12:    Y^T ← Ψ(H^T)
13:  end for
14: end function
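For readers who prefer an executable form, the following short Python sketch mirrors Algorithm 1, assuming the helper forward implements the encoder passes of Eqs. 2.9-2.10 and that the projections V have already been learned; all names here are illustrative, not the thesis's code.

import numpy as np
from sklearn.svm import SVC

def forward(X, weights, biases):
    """Encoder pass of Eq. 2.9 / 2.10 with sigmoid activations."""
    A = X
    for W, b in zip(weights, biases):
        A = 1.0 / (1.0 + np.exp(-(A @ W + b)))
    return A

def classify_on_subspace(Xs, Ys, Xt, theta_s, theta_t, V_s, V_t):
    """Algorithm 1: train on projected source data, label the target."""
    H_s = forward(Xs, *theta_s) @ V_s        # H^S = A^{S(n_s)} V^{S(n_s)}
    H_t = forward(Xt, *theta_t) @ V_t        # H^T = A^{T(n_t)} V^{T(n_t)}
    psi = SVC(kernel="rbf").fit(H_s, Ys)     # the classifier Psi
    return psi.predict(H_t)                  # Y^T = Psi(H^T)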
2.3 Experiments

Experiments are carried out on three HSI datasets: the Pavia dataset, the Washington DC Mall dataset, and the Urban dataset. We focus on the co-occurrence data supervised heterogeneous transfer learning problem for HSI classification. The co-occurrence data in the two domains are first chosen by SSQ for the transfer process. After that, a number of randomly chosen labeled training samples in the source domain are used for the prediction of the unlabeled target domain data. In the experiments, the transfer is conducted between Pavia University and Washington DC Mall, between Urban and Washington DC Mall, and between Pavia University and Pavia Center. These three pairs also divide the experiments into three major parts. More detailed settings and results are described in the following sections.

2.3.1 Experimental dataset descriptions

We choose three sets of data, ordered from the most related to the least related. The detailed descriptions are as follows. Pavia Center and Pavia University are two scenes acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. The number of spectral bands is 102 for Pavia Center and 103 for Pavia University. Pavia Center is a 1096×1096 image, and Pavia University has 610×610 pixels, but some samples in both images contain no information and have to be discarded before the analysis. The geometric resolution is 1.3 meters.

The Washington DC Mall dataset is a hyperspectral digital imagery collection experiment (HYDICE) image of the Washington DC Mall collected in 1995. The number of spectral bands is 210, and 191 channels are left after discarding the water absorption channels. The original large HSI is divided into Washington DC Mall Area 1, with 307×850 pixels, and Washington DC Mall Area 2, with 305×280 pixels. The geometric resolution is 2.8 meters.

The Urban dataset was also captured by HYDICE in 1995. The imaged area is located at Copperas Cove near Fort Hood, TX, USA. The image includes 307×307 pixels with 210 bands, and 162 bands remain after removing the noisy and water absorption bands. The geometric resolution is 2 meters.

Three data pairs are used in the experiments. The first and second are the Urban dataset with Washington DC Mall Area 1 and the Pavia University data with Washington DC Mall Area 2, which have quite low correlation; the third is Pavia Center with Pavia University, which is more correlated and easier for the transfer process.

2.3.2 Comparative methods and evaluation

The proposed DTSE framework mainly exploits the salient samples, obtains their high-level features, constructs the correlation between the domain-specific networks, and completes the classification process. Therefore, CCA-SVM [40][111] and the CCA and deep transfer learning based SVM (CDTL-SVM) are adopted as the baseline methods. In addition, we also compare the proposed method with the currently best supervised method, IRHTL [54], which yields the highest overall classification accuracy as far as we know, and with the recent dual space unsupervised structure preserving transfer learning method (DSTL), which provides the best performance among unsupervised frameworks [56].

The DSTL is an unsupervised HSI classification method that exploits the structure information of HSI data in dual spaces. As this method is designed for homogeneous data, we first reduce our heterogeneous data to the same dimension by dimension reduction; for the classification process, the SVM is applied. We need to point out that the dimension reduction process significantly affects the performance of DSTL. This method achieves, to date, the best performance for unsupervised transfer learning based classification.

The CCA-SVM uses CCA to derive the correlation between the source and target domains and to find the common feature subspace based on a linear transformation; a standard SVM is then employed for classification. As this method is based on shallow transfer learning, the comparison between it and the proposed deep transfer learning based method can verify the effectiveness of the deep neural network.

The CDTL-SVM is another baseline method also proposed in this chapter for HSI classification. Without salient sample querying, this method is mainly used for comparison: the co-occurrence data are chosen randomly, a joint representation of the data in the source and target domains is found by CDTL, and the SVM is used for the final classification.
This comparison method is used to verify the effectiveness of the active learning process.

The IRHTL learns two projection matrices to find the similarity and the new data representation in the source and target domains based on the weighted SVM. This method achieves, to date, the best performance for supervised transfer learning based classification.

All parameters are fine-tuned with the training data; detailed settings can be found in each experiment. The proposed DTSE method is trained with 4 layers, including 2 hidden layers.

For performance evaluation, the overall accuracy is chosen as the performance measure. As the training samples in the SVM classification process are all randomly chosen, each classification process is repeated 100 times with 100 sets of randomly chosen training and testing data to avoid data bias, and the final average overall accuracies are used for evaluation. In the following subsections, three experiments with different datasets are reported.

Table 2.1: Datasets used in experiment 1
  domain          Source (Urban)   Target (Washington DC Mall)
  #total samples  3272             9847
  #bands          162              191
  #classes        6                6

2.3.3 Experiment 1: Urban dataset and Washington DC Mall Area 1

In the first experiment, we conduct our study on the Urban dataset and Washington DC Mall Area 1, with the detailed information shown in Table 2.1. The classification accuracies obtained by the different methods are shown in Table 2.2. We construct 15 (C_6^2) binary classification tasks from the 6 categories for comparison. For each task, we select 5 co-occurrence data samples from each class by the salient sample querying process to set up the correlation between the source and target domains. Then another 5 samples per class from the source domain are selected randomly for training on the common correlated subspace. We repeat the experiment 100 times with 100 sets of randomly chosen training and testing data to avoid data bias. This setting applies to DSTL, CCA-SVM, CDTL-SVM, and the proposed DTSE.

As for the deep network setting, we train 4-layer neural networks correlated by CCA, in which the number of neurons is 162 → 118 → 74 → 30 for the source domain and 191 → 137 → 81 → 30 for the target SAE. This deep network setting applies to CDTL-SVM and the proposed method. The CCA-SVM does not have this deep correlation process, and the DSTL framework, which does not apply CCA, also uses this deep network to transform the original heterogeneous data into homogeneous data.

For the method IRHTL, we choose 5 training samples from each class in both the source and target domains, the same number as the co-occurrence data. This setting actually favors IRHTL, since the co-occurrence data are unlabeled.

For the final classification, we apply the same one-against-one SVM classifier for all studied methods, and the Gaussian kernel is chosen for the SVM.

The average classification accuracy results of all 15 binary tasks and the 6 categories (Road, Grass, Trails, Trees, Shadow, and Roofs) are shown in Fig. 2.4 and Table 2.2. From the figure, we can note that overall the proposed DTSE method yields the best performance, followed by IRHTL, CDTL-SVM, and CCA-SVM, while DSTL is the poorest, probably because the dimension reduction process affects the performance of this unsupervised method.
From the table, we can note that Roofs and Trails are the hardest to classify, while the proposed DTSE method still yields 93.48% and 97.80% for these two categories.

From the comparison between the proposed DTSE and IRHTL, we note that the performance of DTSE is more stable, with 90%+ accuracy for each task. IRHTL also provides satisfying results for most tasks; however, for tasks 5, 9, and 14, its accuracy decreases to less than 90%, and it achieves a poor 18.6% for task 12. IRHTL cannot classify "Roofs" accurately and thus brings poor results for the related tasks. The significance of the active learning and deep learning processes can be respectively supported by comparing the proposed DTSE, CDTL-SVM, and CCA-SVM. It can be noted that the active learning process brings an increase of 4% accuracy on average when comparing DTSE and CDTL-SVM. As for the deep learning process, another 3 to 4% accuracy increase can be found when comparing CDTL-SVM and CCA-SVM, between which the only difference is the deep mapping process. The effectiveness of the co-occurrence data can be illustrated by comparing CCA-SVM with the unsupervised DSTL method: with the training process based on co-occurrence data, the accuracy is steadier for each task, and an overall 5% accuracy increase is noted.

It needs to be emphasized that the larger the semantic difference between the source and target domains, the more performance improvement is found for the proposed DTSE, which is probably its most important advantage. This observation can be drawn by comparing DTSE and IRHTL. The difference of "Roofs" between the two domains is the largest when observing the spectral information of the two HSI images. IRHTL often fails on the classification of this category, while DTSE still works well, likely due to its deep mapping process. By mapping between each corresponding layer of the auto-encoders in the source and target domains, the semantic gap between the two domains is minimized. The final spectral distributions of "Roofs" in these two HSIs are almost the same, which benefits the subsequent classification process.

Table 2.2: Classification accuracy results on Washington DC Mall Area 1.
  Type/method  Road    Grass   Trails  Trees   Shadow  Roofs
  DSTL         0.8335  0.8275  0.7000  0.8645  0.7460  0.8050
  CCA-SVM      0.9113  0.9395  0.9535  0.9490  0.9440  0.9088
  CDTL-SVM     0.9330  0.9603  0.9468  0.9708  0.9495  0.8743
  IRHTL        0.9608  0.9581  0.8226  0.9574  0.9660  0.7273
  DTSE         0.9693  0.9773  0.9780  0.9815  0.9848  0.9348

Table 2.3: Datasets used in experiment 2
  domain          Source (Pavia University)  Target (Washington DC Mall)
  #total samples  18168                      11868
  #bands          103                        191
  #classes        4                          4

2.3.4 Experiment 2: Pavia University data and Washington DC Mall Area 2

In the second experiment, we conduct our study on the Pavia University data and Washington DC Mall Area 2, with the detailed information shown in Table 2.3. The classification accuracies obtained by the different methods are shown in Table 2.4. We construct 6 (C_4^2) binary classification tasks from the 4 categories. The number of co-occurrence data samples from each class is still 5, and 80 testing samples from each class are randomly chosen. Another 5 samples per class from the source domain are randomly chosen for training on the common correlated subspace.
The experiments are repeated 100 times with 100 sets of randomly chosen training and testing data to avoid data bias. This data setting applies to all five methods under comparison.

For the deep network setting, the number of neurons in the 4-layer domain networks correlated by CCA is 103 → 79 → 55 → 30 for the source domain and 191 → 137 → 81 → 30 for the target SAE. This deep network setting applies to CDTL-SVM and the proposed method, as in Experiment 1. For the comparison method IRHTL, we choose 5 training samples from each class in both the source and target domains, the same number as the co-occurrence data. The one-against-one SVM classifier is employed for the final classification.

Table 2.4: Classification accuracy results on Washington DC Mall Area 2.
  Type/method  Road    Bare soil  Vegetation  Roofs
  DSTL         0.8458  0.8625     0.8292      0.7542
  CCA-SVM      0.8825  0.8921     0.9246      0.8342
  CDTL-SVM     0.9184  0.8905     0.9259      0.9054
  IRHTL        0.9511  0.8990     0.9964      0.8558
  DTSE         0.9667  0.9542     0.9917      0.9375

The average classification accuracy results of all 6 binary tasks and 4 categories are given in Fig. 2.5 and Table 2.4. From the figure, we can note that overall the proposed DTSE and IRHTL methods provide the best performance, while DTSE yields more consistent results. It is worth mentioning that IRHTL needs the label information of instances in the target domain while the other methods do not. The accuracies of DTSE are all close to or above 90%, while for IRHTL two "Roofs"-related tasks drop to 86.42% and 70.40% respectively, the latter being the poorest performance among all five compared methods. For the remaining 4 tasks, DTSE and IRHTL have almost the same accuracies (the difference is less than 1%). These observations once again suggest that, via the deep mapping process, the semantic gap is effectively handled in the proposed DTSE; thus, DTSE is especially suitable for the transfer between domains with a large semantic difference.

It can be noted that the active learning process brings an increase of 4% to 10% accuracy on average when comparing DTSE and CDTL-SVM. CDTL-SVM performs clearly better than CCA-SVM on tasks 3, 5, and 6, and almost the same on the remaining tasks; this improvement indicates the effectiveness of the deep mapping process. There is a huge performance difference (more than a 10% accuracy gap for most tasks) between CCA-SVM and DSTL, meaning that the training process based on the co-occurrence data cannot be neglected.

2.3.5 Experiment 3: Pavia University data and Pavia Center data

In the third experiment, we conduct our study on the Pavia University data and the Pavia Center data, with the detailed information shown in Table 2.5. The classification accuracies obtained by the different methods are shown in Table 2.6. We construct 6 (C_4^2) binary classification tasks for the 4 categories. The detailed data setting is the same as in Experiment 2.

Table 2.5: Datasets used in experiment 3
  domain          Source (Pavia University)  Target (Pavia Center)
  #total samples  18168                      62046
  #bands          103                        102
  #classes        4                          4

Table 2.6: Classification accuracy results on Pavia Center.
  Type/method  Trees   Self-Blocking Bricks  Bitumen  Bare Soil
  DSTL         0.5550  0.8100                0.7058   0.6608
  CCA-SVM      0.9179  0.8717                0.8950   0.9171
  CDTL-SVM     0.9529  0.9446                0.8946   0.9304
  IRHTL        0.9479  0.9638                0.8619   0.8855
  DTSE         0.9688  0.9750                0.9458   0.9729

For the deep network setting, the number of neurons in the 4-layer domain networks correlated by CCA is 103 → 79 → 55 → 30 for the source domain and 102 → 78 → 54 → 30 for the target SAE.
Other settings are the same as in Experiment 2.

In this experiment, as the two domains are more correlated than in the former two experiments, IRHTL yields relatively more stable results. It only fails on the classification of "Bitumen" and "Bare Soil", with about 70% accuracy. For the proposed DTSE, the accuracy results are almost all above 95%.

The proposed DTSE method still outperforms the other four methods. This observation is consistent with the assertion that the multi-layer semantic mapping model can discover the deeply shared subspace across domains and transfer the knowledge of the sufficiently labeled source domain data to the target domain for label prediction by exploring the co-occurrence information. The deep transfer learning methods (DTSE, CDTL-SVM) perform better than the shallow transfer learning methods (CCA-SVM, DSTL, and IRHTL). This observation further supports that discriminative semantic information can be embedded in multiple layers of the feature hierarchy, and deep neural networks can capture this information through multiple nonlinear transformations. After deep mapping, the semantic divergence and feature bias between the source and target domains are much lower. The active learning process that selects the co-occurrence data improves the accuracy by 5% on average, as seen by comparing DTSE and CDTL-SVM.

2.3.6 Effect of co-occurrence data

In this section, we discuss the effect of the number of co-occurrence data samples on the deep mapping process. Here, we take the experiment on Pavia Center as an example. As the co-occurrence data are not used in the IRHTL method, we study the remaining four methods in this subsection.

The accuracy of each method is the average accuracy over all 6 tasks. From Fig. 2.7, three observations can be noted.

First, the number of co-occurrence data samples matters, but in general makes no significant difference. With an increase in the number of co-occurrence data samples, the classification accuracy increases gradually for all four methods. However, this improvement is not significant: 5 additional co-occurrence data samples may bring less than a 1% accuracy increase.

The second observation is that the effect of the co-occurrence data on the comparison methods is larger than on the proposed method, and the effect on DSTL is especially distinct. The reason is likely as follows: with 5 co-occurrence data samples, the proposed DTSE method can already obtain a robust classification model, while with more co-occurrence data, over-fitting may occur. The accuracy even slightly decreases for DTSE with 20 co-occurrence data samples compared with 15. For the other methods, the classification model may still have room to improve as the co-occurrence data size increases.

The third observation is that, with enough co-occurrence data, CDTL-SVM can yield performance very close to that of the proposed DTSE. The reason is likely as follows: DTSE selects the most efficient co-occurrence data, and 5 co-occurrence samples are enough; CDTL-SVM can achieve the same performance as DTSE as long as the most salient samples are included in the selected co-occurrence data, although the required number of samples is much larger and much redundancy may exist.
This observation, to some extent, reflects the significance of the SSQ process, which makes the proposed method more efficient.

2.3.7 Parameter sensitivity

In this section, we study the effect of different parameters on our networks. Both the number of layers and the number of neurons in each layer affect the final classification result. Here, we take the classification of "Bare soil" and "Roofs" in Experiment 2 as an example. We first set different numbers of neurons in the 4-layer network. We set the numbers of neurons in the first three layers as 162 → 118 → 74 in the source domain and 191 → 137 → 83 in the target domain. For the last layer, the neuron number changes from 10 to 50 in both domains, and the final accuracies are shown in Table 2.7.

Table 2.7: Effects of the number of neurons at the last layer
  #neurons  10             20             30             40             50
  accuracy  0.7625±0.0313  0.8375±0.0142  0.8875±0.0253  0.8250±0.0119  0.7538±0.0283

From this table, we can note that the performance is best when the number of neurons is 30. Therefore, in the former experiments, we use 30 neurons.

Secondly, we also test the effect of the number of layers. We choose the best number of neurons in each layer, and the results are shown in Table 2.8.

Table 2.8: Effects of the number of layers
  #layers   2              3              4              5
  accuracy  0.8063±0.0202  0.8625±0.0232  0.8875±0.0269  0.7338±0.0436

We note that, with the increase in the number of layers, the performance first rises gradually, but when it reaches 5 layers, the accuracy falls suddenly. Based on this observation, we set the number of layers to 4 in the former experiments.

2.4 Conclusion

In this chapter, we propose a novel method, referred to as the Deep mapping based heterogeneous Transfer learning model via querying Salient Examples (DTSE), for the classification of hyperspectral images. In the proposed model, we first query the salient examples to obtain the co-occurrence data and then apply them to construct the deep mapping network on the source and target domains. Canonical correlation analysis is applied at each layer to correlate the data of the two domains. Then, at the top layer, we exploit the correlation matching between the two domains to fine-tune the whole network by back-propagation. The final correlated common subspace is thereby identified, and the data in the source domain are projected to this subspace to train the SVM classifier, which is then used for classification in the target domain.

The proposed framework is tested on three HSI datasets and compared with four other state-of-the-art methods. Experimental results support the effectiveness of the proposed method.

In future work, we plan to apply the proposed method to more HSI datasets, as well as other forms of non-HSI datasets. Also, since the learned deep network is affected profoundly by the parameter setting, more effort will be made to find the best settings particularly suitable for each individual HSI dataset.

Moreover, we have to point out that the current framework is only suitable for binary classification. Extending it to multi-class classification is another future work.
Figure 2.1: The flowchart of the proposed DTSE framework for hyperspectral image classification.

Figure 2.2: An illustrative comparison of HSI sample selection based on different criteria. (a) A binary classification problem with the initial data distribution; (b) traditional approaches favoring only informative or representative instances; (c) step one: salient instance initialization; (d) step two: salient instance augmentation; (e) the result of the proposed method.

Figure 2.3: An example of the active local peaks querying, with the x-axis showing the global density and the y-axis the co-distance.

Figure 2.4: Urban dataset vs. Washington DC Mall Area 1: classification accuracy results of 15 tasks with 5 co-occurrence instances and 80 random testing instances.

Figure 2.5: Pavia University data and Washington DC Mall Area 2: classification accuracy results of 6 tasks with 5 co-occurrence instances and 80 random testing instances.

Figure 2.6: Pavia University data and Pavia Center: classification accuracy results of 6 tasks with 5 co-occurrence instances and 80 random testing instances.

Figure 2.7: Effects of the co-occurrence data size on four different methods when tested on the Pavia Center dataset.

Chapter 3

Where to Transfer: Deep Transfer Learning by Exploring Where to Transfer (DT-LET)

In this chapter, we discuss the task "where to transfer". Previous transfer learning methods based on deep networks assumed that knowledge is transferred between the same hidden layers of the source domain and the target domain. This assumption does not always hold true, especially when the data from the two domains are heterogeneous with different resolutions. In such a case, the most suitable numbers of layers for the source domain data and the target domain data are different. As a result, high-level knowledge from the source domain is transferred to the wrong layer of the target domain. Based on this observation, "where to transfer", proposed in this chapter, may constitute a novel research area. The problem we mainly want to solve in this chapter is transfer learning for RGB images, as RGB image datasets often differ in resolution, color, etc., and there are always datasets with insufficient prior knowledge for their classification/segmentation.

Transfer learning, or domain adaptation, aims at exploiting potential information in the auxiliary source domain to assist the classification/segmentation task in the target domain, where only insufficient labeled data with prior knowledge exist [73]. For tasks like image classification or recognition, labeled data (the target domain data) are highly required but often not enough, as the labeling process is quite tedious and laborious. Without the help of related data (the source domain data), the learning tasks will fail.
Therefore, making better use of auxiliary source domain data through transfer learning methods has attracted researchers' attention.

Direct use of labeled source domain data on a new scene of the target domain results in poor performance due to the semantic gap between the two domains, even when they represent the same objects [108][27][66][101]. The semantic gap can result from different acquisition conditions (illumination or view angle) and from the use of different cameras or sensors. Transfer learning methods are proposed to overcome this semantic gap [20][60][98][100]. Traditionally, these transfer learning methods adopt linear or non-linear transformations with kernel functions to learn a common subspace on which the gap is bridged [109][86][102]. Recent advances have shown that the features learned on such a common subspace are inefficient. Therefore, deep learning based models have been introduced due to their power in high-level feature representation.

Current deep learning based transfer learning topics include two branches: what knowledge to transfer and how to transfer knowledge [53]. For what knowledge to transfer, researchers mainly concentrate on instance-based transfer learning and parameter transfer approaches. Instance-based transfer learning methods assume that only certain parts of the source data can be reused for learning in the target domain, by re-weighting [35]. As for parameter transfer approaches, researchers mainly try to find the pivotal parameters in a deep network to transfer, in order to accelerate the transfer process. For how to transfer knowledge, different deep networks are introduced to complete the transfer learning process. However, in both research areas, the right correspondence of layers is ignored.

The current limitations of transfer learning are summarized below. For the what-knowledge-to-transfer problem, the transferred content might even be negative or wrong. A fundamental problem for current transfer learning work is negative transfer [91]: if the knowledge from the source domain is transferred to the wrong layers of the target domain, the transferred knowledge is quite error-prone, and with wrong prior information added, a bad effect is produced on the target domain data. For the how-to-transfer-knowledge problem, as the two deep networks for the source domain data and the target domain data need to have the same number of layers, the two models cannot be optimal at the same time. This situation is especially important for cross-resolution heterogeneous transfer: for data with different resolutions, the data with higher resolution might need more max-pooling layers than the data with lower resolution, and thus more neural network layers. Based on the above observations and assumptions, we propose a new research topic, where to transfer. In this work, the numbers of layers for the two domains do not need to be the same, and the optimal matching of layers is found by the newly proposed objective function. With the best parameters from the source domain data transferred to the right layer of the target domain, the performance of the target domain learning task can be improved.

3.1 Contribution summary

The proposed work is named Deep Transfer Learning by Exploring where to Transfer (DT-LET), and is based on stacked auto-encoders [125]. A detailed flowchart is shown in Fig. 3.1. The main contributions are summarized as follows.

• This thesis for the first time introduces the where to transfer problem.
The deep networks of the source domain and the target domain no longer need to share the same parameter settings, and cross-layer transfer learning is proposed in this chapter.

• We propose a new principle for finding the correspondence between the neural networks of the source domain and the target domain by defining a new unified objective loss function. By optimizing this objective function, the best settings of the two deep networks, as well as their layer correspondence, can be determined.

3.2 Method: deep mapping mechanism

The general framework of the deep mapping mechanism can be summarized in three steps: network setup, correlation maximization, and layer matching. We first introduce the deep mapping mechanism by defining the variables.

Figure 3.1: The flowchart of the proposed DT-LET framework. The two neural networks are first trained by the co-occurrence data $C_s$ and $C_t$. After network training, the common subspace $\Omega$ is found, and the training data $D^l_S$ are transferred to this space to train the SVM classifier, which classifies $D_T$.

The samples in the source domain are denoted as $D_S = \{I^s_i\}_{i=1}^{n_s}$, in which the labeled data in the source domain are further denoted as $D^l_S = \{X^s_i, Y^s_i\}_{i=1}^{n_l}$; they are used to supervise the classification process. In the target domain, the samples are denoted as $D_T = \{I^t_i\}_{i=1}^{n_t}$. The co-occurrence data [110] (data in the source domain and the target domain belonging to the same classes but with no prior label information) are denoted as $C_S = \{C^s_i\}_{i=1}^{n_c}$ in the source domain and $C_T = \{C^t_i\}_{i=1}^{n_c}$ in the target domain. They are further jointly represented by $D_C = \{C^S_i, C^T_i\}_{i=1}^{n_c}$ and are used to supervise the transfer learning process. The parameters of the deep network in the source domain are denoted by $\Theta^S = \{W^s, b^s\}$, and by $\Theta^T = \{W^t, b^t\}$ in the target domain.

The matching of layers is denoted by $R_{s,t} = \{r^1_{i_1,j_1}, r^2_{i_2,j_2}, \dots, r^m_{i_a,j_b}\}$, in which $a$ represents the total number of layers for the source domain data and $b$ the total number of layers for the target domain data. $m$ is the total number of matched layers. If $m = \min\{a-1, b-1\}$ (since the first layer is the input layer and is not used for transfer, $m$ is compared with $a-1$ or $b-1$ instead of $a$ or $b$), we call the transfer process full-rank transfer learning; if $m < \min\{a-1, b-1\}$, we call it non-full-rank transfer learning.

The common subspace is represented by $\Omega$, and the final classifier by $\Psi$. The labeled data $D^l_S$ from the source domain are used to predict the labels of $D_T$ by applying $\Psi(\Omega(D_T))$.

3.2.1 Network setting up

The stacked auto-encoder (SAE) is first employed in the source domain and the target domain to obtain the hidden feature representations $H^S$ and $H^T$ of the original data, as shown in Eq. 3.1 and Eq. 3.2:

$H^S(n+1) = f(W^S(n) H^S(n) + b^S(n)),\ n > 1; \quad H^S(n) = f(W^S(n) C^S + b^S(n)),\ n = 1.$  (3.1)

$H^T(n+1) = f(W^T(n) H^T(n) + b^T(n)),\ n > 1; \quad H^T(n) = f(W^T(n) C^T + b^T(n)),\ n = 1.$  (3.2)

Here $W^S$ and $b^S$ are parameters of the neural network $\Theta^S$, and $W^T$ and $b^T$ are parameters of the neural network $\Theta^T$. $H^S(n)$ and $H^T(n)$ denote the $n$-th hidden layers in the source domain and in the target domain respectively. The two neural networks are first initialized by the above functions.
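As an illustration, the encoder pass of Eq. 3.1 (the target-domain pass of Eq. 3.2 is identical in form) can be sketched in a few lines of NumPy. This is a minimal sketch only: the helper names are hypothetical, the sigmoid is one common choice for $f$, and the layer sizes follow the source-domain setting used later in Sec. 3.4.3.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(C, weights, biases):
    """Compute the hidden representations H(1), ..., H(n) of Eq. 3.1.

    C       : (d0, nc) co-occurrence samples, one column per sample.
    weights : list of weight matrices, weights[k] of shape (d_{k+1}, d_k).
    biases  : list of bias vectors, biases[k] of shape (d_{k+1}, 1).
    """
    hidden, h = [], C
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)   # H(n+1) = f(W(n) H(n) + b(n))
        hidden.append(h)
    return hidden

# Hypothetical source-domain encoder: 240-D inputs, hidden sizes 170-100-30.
rng = np.random.default_rng(0)
dims = [240, 170, 100, 30]
Ws = [0.01 * rng.standard_normal((dims[k + 1], dims[k])) for k in range(3)]
bs = [np.zeros((dims[k + 1], 1)) for k in range(3)]
HS = encode(rng.standard_normal((240, 5)), Ws, bs)  # one entry per hidden layer
```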
3.2.2 Correlation maximization

To set up the initial relationship between the two neural networks, we resort to Canonical Correlation Analysis (CCA), which maximizes the correlation between two domains [40]. A multi-layer correlation model based on the above deep networks is then constructed. Both $C^S$ and $C^T$ are projected by CCA onto a common subspace $\Omega$, on which a unified representation is generated. The projection matrices obtained by CCA are denoted as $V^S(n)$ and $V^T(n)$. To find the optimal neural networks in the source domain and the target domain, we have two general objectives: to minimize the reconstruction errors of the two networks, and to maximize the correlation between them. To achieve the second objective, we need on the one hand to find the best layer matching, and on the other hand to maximize the correlation between the matched layers. To this end, we minimize the final objective function

$L(R_{s,t}) = \dfrac{L_S(\theta^S) + L_T(\theta^T)}{P(V^S, V^T)}.$  (3.3)

Here the objective function is denoted by $L$, and $L(R_{s,t})$ corresponds to a particular matching $R_{s,t}$; we obtain the best matching by finding the minimum of $L(R_{s,t})$. In $L(R_{s,t})$, $L_S(\theta^S)$ and $L_T(\theta^T)$ represent the reconstruction errors of the data in the source domain and the target domain, defined over the co-occurrence data as

$L_S(\theta^S) = \dfrac{1}{n_c}\sum_{i=1}^{n_c}\dfrac{1}{2}\big\|h_{W^S,b^S}(C^s_i) - C^s_i\big\|^2 + \dfrac{\lambda}{2}\sum_{l=1}^{n_S-1}\sum_{j=1}^{n^S_l}\sum_{k=1}^{n^S_{l+1}}\big(W^{S(l)}_{kj}\big)^2,$  (3.4)

$L_T(\theta^T) = \dfrac{1}{n_c}\sum_{i=1}^{n_c}\dfrac{1}{2}\big\|h_{W^T,b^T}(C^t_i) - C^t_i\big\|^2 + \dfrac{\lambda}{2}\sum_{l=1}^{n_T-1}\sum_{j=1}^{n^T_l}\sum_{k=1}^{n^T_{l+1}}\big(W^{T(l)}_{kj}\big)^2.$  (3.5)

Here $n_S$ and $n_T$ are the numbers of layers of the two networks, $n^S_l$ and $n^T_l$ are the numbers of neurons in layer $l$, and $\lambda$ is the trade-off parameter. The third term $P(V^S, V^T)$ represents the cross-domain correlation after the CCA projection, which we want to maximize. It is defined as

$P(V^S, V^T) = \sum_{l=2}^{n_S-1} \dfrac{V^{S(l)T}\Sigma_{ST}V^{T(l)}}{\sqrt{V^{S(l)T}\Sigma_{SS}V^{S(l)}}\ \sqrt{V^{T(l)T}\Sigma_{TT}V^{T(l)}}},$  (3.6)

where $\Sigma_{ST} = H^{S(l)}H^{T(l)T}$, $\Sigma_{SS} = H^{S(l)}H^{S(l)T}$, and $\Sigma_{TT} = H^{T(l)}H^{T(l)T}$. By minimizing Eq. 3.3, we can collectively train the two neural networks $\theta^T = \{W^T, b^T\}$ and $\theta^S = \{W^S, b^S\}$.

3.2.3 Layer matching

After constructing the multi-layer networks by Eq. 3.3, we further need to find the best matching of layers. As different layer matchings generate different loss values $L$ in Eq. 3.3, we define the objective function for layer matching as

$R_{s,t} = \arg\min_{s,t} L(R_{s,t}).$  (3.7)
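To make the roles of Eqs. 3.3 to 3.7 concrete, a minimal sketch of the loss evaluation for one candidate matching is given below. All helper names are hypothetical: `HS`/`HT` hold the activations of the matched layer pairs, `VS`/`VT` the corresponding CCA direction vectors (1-D arrays), and the ratio in `matching_loss` mirrors Eq. 3.3, so that higher cross-domain correlation lowers the loss of a matching.

```python
import numpy as np

def reconstruction_loss(C, C_hat, weights, lam=1.0):
    """L_S or L_T of Eq. 3.4/3.5: mean squared reconstruction error of the
    co-occurrence data plus an L2 weight-decay penalty."""
    n_c = C.shape[1]
    err = 0.5 * np.sum((C_hat - C) ** 2) / n_c
    decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return err + decay

def correlation_term(HS, HT, VS, VT):
    """P(V^S, V^T) of Eq. 3.6, summed over the matched hidden layers."""
    total = 0.0
    for hs, ht, vs, vt in zip(HS, HT, VS, VT):
        num = vs @ (hs @ ht.T) @ vt
        den = np.sqrt(vs @ (hs @ hs.T) @ vs) * np.sqrt(vt @ (ht @ ht.T) @ vt)
        total += num / den
    return total

def matching_loss(loss_s, loss_t, corr):
    """L(R_{s,t}) of Eq. 3.3 for one candidate layer matching."""
    return (loss_s + loss_t) / corr
```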
3.3 Method: model training

We first optimize Eq. 3.3. As the equation is not jointly convex in all the parameters $\theta^S$, $\theta^T$, $V^S$, and $V^T$, and the parameters $\theta^S$ and $\theta^T$ are not coupled with $V^S$ and $V^T$, we introduce a two-step iterative optimization.

3.3.1 Step 1: updating $V^S, V^T$ with fixed $\Theta^S, \Theta^T$

In Eq. 3.3, the optimization of $V^S, V^T$ relates only to the denominator term. The optimization for each layer pair $V^{S(l_1)}, V^{T(l_2)}$ (suppose layer $l_1$ in the source domain corresponds to layer $l_2$ in the target domain) can be formulated as

$\max_{V^{S(l_1)}, V^{T(l_2)}} \dfrac{V^{S(l_1)T}\Sigma_{ST}V^{T(l_2)}}{\sqrt{V^{S(l_1)T}\Sigma_{SS}V^{S(l_1)}}\ \sqrt{V^{T(l_2)T}\Sigma_{TT}V^{T(l_2)}}}.$  (3.8)

As $V^{S(l_1)T}\Sigma_{SS}V^{S(l_1)} = 1$ and $V^{T(l_2)T}\Sigma_{TT}V^{T(l_2)} = 1$ [40], we can rewrite Eq. 3.8 as

$\max\ V^{S(l_1)T}\Sigma_{ST}V^{T(l_2)}, \quad \text{s.t.}\ V^{S(l_1)T}\Sigma_{SS}V^{S(l_1)} = 1,\ V^{T(l_2)T}\Sigma_{TT}V^{T(l_2)} = 1.$  (3.9)

This is a typical constrained problem, which can be solved as a series of unconstrained minimization problems. We introduce Lagrange multipliers to solve it:

$L(w_l, V^{S(l_1)}, V^{T(l_2)}) = V^{S(l_1)T}\Sigma_{ST}V^{T(l_2)} + w^S_l\big(V^{S(l_1)T}\Sigma_{SS}V^{S(l_1)} - 1\big) + w^T_l\big(V^{T(l_2)T}\Sigma_{TT}V^{T(l_2)} - 1\big).$  (3.10)

Taking the partial derivative of Eq. 3.10 with respect to $V^{S(l_1)}$ gives

$\dfrac{\partial L}{\partial V^{S(l_1)}} = \Sigma_{ST}V^{T(l_2)} - w^S_l\,\Sigma_{SS}V^{S(l_1)} = 0.$  (3.11)

The partial derivative with respect to $V^{T(l_2)}$ is analogous. The final solution is

$\Sigma_{ST}\Sigma_{TT}^{-1}\Sigma_{TS}\,V^{S(l_1)} = w_l^2\,\Sigma_{SS}\,V^{S(l_1)},$  (3.12)

where we assume $w_l = w^S_l = w^T_l$. $V^{S(l_1)}$ and $w_l$ can then be solved by generalized eigenvalue decomposition, and the corresponding $V^{T(l_2)}$ follows.

3.3.2 Step 2: updating $\Theta^S, \Theta^T$ with fixed $V^S, V^T$

As $\Theta^S$ and $\Theta^T$ are mutually independent and have the same form, we only demonstrate the solution for $\Theta^S$ in the source domain (the solution for $\Theta^T$ is derived similarly). Since the division in the objective plays the same role as a subtraction for this update, we reformulate the objective function as

$\min_{\theta^S}\ \phi(\theta^S) = L_S(\theta^S) - \Gamma(V^S, V^T),$  (3.13)

where $\Gamma(V^S, V^T)$ is the correlation term of Eq. 3.6. We apply gradient descent to adjust the parameters:

$W^{S(l_1)} \leftarrow W^{S(l_1)} - \mu^S \dfrac{\partial\phi}{\partial W^{S(l_1)}}, \quad \dfrac{\partial\phi}{\partial W^{S(l_1)}} = \dfrac{\partial L_S(\theta^S)}{\partial W^{S(l_1)}} - \dfrac{\partial\Gamma(V^S, V^T)}{\partial W^{S(l_1)}} = \dfrac{\big(\alpha^{S(l_1+1)} - \beta^{S(l_1+1)} + \omega_l\gamma^{S(l_1+1)}\big) H^{S(l_1)}}{n_c} + \lambda^S W^{S(l_1)},$  (3.14)

$b^{S(l_1)} \leftarrow b^{S(l_1)} - \mu^S \dfrac{\partial\phi}{\partial b^{S(l_1)}}, \quad \dfrac{\partial\phi}{\partial b^{S(l_1)}} = \dfrac{\alpha^{S(l_1+1)} - \beta^{S(l_1+1)} + \omega_l\gamma^{S(l_1+1)}}{n_c},$  (3.15)

in which

$\alpha^{S(l_1)} = \begin{cases} -(D^l_S - H^{S(l_1)}) \cdot H^{S(l_1)} \cdot (1 - H^{S(l_1)}), & l = n_S \\ W^{S(l_1)T}\alpha^{S(l_1+1)} \cdot H^{S(l_1)} \cdot (1 - H^{S(l_1)}), & l = 2, \dots, n_S - 1 \end{cases}$  (3.16)

$\beta^{S(l_1)} = \begin{cases} 0, & l = n_S \\ H^{T(l_2)}V^{T(l_2)}V^{S(l_1)T} \cdot H^{S(l_1)} \cdot (1 - H^{S(l_1)}), & l = 2, \dots, n_S - 1 \end{cases}$  (3.17)

$\gamma^{S(l_1)} = \begin{cases} 0, & l = n_S \\ H^{S(l_1)}V^{S(l_1)}V^{S(l_1)T} \cdot H^{S(l_1)} \cdot (1 - H^{S(l_1)}), & l = 2, \dots, n_S - 1. \end{cases}$  (3.18)

The operator $\cdot$ here stands for the dot product. The same optimization process works for $\Theta^T$ in the target domain. After these two per-layer optimizations, the two whole networks (the source domain network and the target domain network) are further fine-tuned by back-propagation. The forward and backward propagations iterate until convergence.

Algorithm 2 Deep Mapping Model Training
Input: $D_C = \{C^s_i, C^t_i\}_{i=1}^{n_c}$; $\lambda^S = 1$, $\lambda^T = 1$, $\mu^S = 0.5$, $\mu^T = 0.5$
Output: $\Theta(W^S, b^S)$, $\Theta(W^T, b^T)$, $V^S$, $V^T$
1: function NetworkSetup
2:   Initialize $\Theta(W^S, b^S)$, $\Theta(W^T, b^T)$ with random numbers
3:   repeat
4:     for $l = 1, 2, \dots, n_S$ do
5:       $V^S \leftarrow \arg\min L(\omega_l, V^{S(l)})$
6:     end for
7:     for $l = 1, 2, \dots, n_T$ do
8:       $V^T \leftarrow \arg\min L(\omega_l, V^{T(l)})$
9:     end for
10:    $\theta^S = \arg\min \phi(\theta^S)$, $\theta^T = \arg\min \phi(\theta^T)$
11:  until convergence
12: end function
13: function LayerMatching
14:   Initialize $R_{s,t}$ with a random matching; $m \leftarrow 0$, $\sigma \leftarrow 1$
15:   while $m < m_{it}$ and $\sigma > 0.05$ do
16:     $R^m_{s,t} = E(m) = \{e_1(m), e_2(m), \dots, e_{a-1}(m)\}$
17:     Clone and mutate: $E = \{E^1, E^2, \dots, E^{n_c}\}$
18:     $E(m+1) = \arg\max\{S(E^1), S(E^2), \dots, S(E^{n_c})\}$
19:     $m = m + 1$
20:   end while
21: end function
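The per-layer CCA updates in lines 5 and 8 of Alg. 2 reduce, by Eq. 3.12, to a generalized eigenvalue problem. A minimal sketch follows, assuming SciPy is available; the small ridge term `reg` is our own addition for numerical stability and is not part of the derivation above.

```python
import numpy as np
from scipy.linalg import eigh

def cca_directions(HS, HT, reg=1e-6):
    """Solve Eq. 3.12 for one matched layer pair.

    HS : (d_s, n) source-layer activations; HT : (d_t, n) target-layer
    activations. Returns the leading canonical directions and correlation.
    """
    S_ss = HS @ HS.T + reg * np.eye(HS.shape[0])
    S_tt = HT @ HT.T + reg * np.eye(HT.shape[0])
    S_st = HS @ HT.T
    # Sigma_ST Sigma_TT^{-1} Sigma_TS v_S = w^2 Sigma_SS v_S
    M = S_st @ np.linalg.solve(S_tt, S_st.T)
    w2, V = eigh(M, S_ss)            # generalized symmetric eigenproblem
    v_s = V[:, -1]                   # top canonical direction (largest w^2)
    w = np.sqrt(max(w2[-1], 0.0))
    # Paired target direction: v_T proportional to Sigma_TT^{-1} Sigma_TS v_S / w
    v_t = np.linalg.solve(S_tt, S_st.T @ v_s)
    if w > 0:
        v_t /= w
    return v_s, v_t, w
```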
3.3.3 Optimization of $R_{s,t}$

The minimized $L(R_{s,t})$ is finally obtained by the above procedure. As described previously, $a-1$ layers are available in the source domain and $b-1$ layers in the target domain, so the space of candidate layer matchings grows quickly with $a$ and $b$, and exhaustive search is impractical. To obtain a tractable search, we introduce the Immune Clonal Strategy (ICS). We take Eq. 3.7 as the affinity function of the ICS: the source domain layers are regarded as the antibodies, and the target domain layers as the antigen. Different antibodies have different effects on the antigen, and by maximizing the affinity function the best antibodies are chosen. The optimization is modeled as an iterative process consisting of three phases: clone, mutation, and selection.

Clone phase

First, we set the $a-1$ layers of the source domain as the antibody $E(m) = \{e_1(m), e_2(m), \dots, e_{a-1}(m)\}$. Each $e_i(m)$ can take $b$ values ($e_i(m) = 0$, meaning no correspondence, is also one choice), representing the corresponding layer in the target domain. For notational simplicity, we omit the iteration index $m$ in the following explanation without loss of clarity. The initial $E(m)$ is cloned $n_c$ times, yielding $E = \{E^1, E^2, \dots, E^{n_c}\}$.

Mutation phase

The randomly chosen antibodies are not the best, so a mutation phase after the clone phase is necessary. For example, for the antibody $E^i$, we randomly replace $N_m$ representatives in its clone by the same number of new elements. These newly introduced elements differ from the former representatives and enrich the diversity of the original antibodies. After this phase, the mutated antibodies $E = \{E^1, E^2, \dots, E^{n_c}\}$ are obtained.

Selection phase

From the obtained antibodies, which are manifestly more diverse than the original set, we select the most promising ones for the next round of processing. The selection principle is again defined by the affinity values; higher values indicate better fitness. Therefore, we have

$E(m+1) = \arg\max_E \{S(E^1), S(E^2), \dots, S(E^{n_c})\},$  (3.19)

which means the antibody with the largest affinity value is taken as $E(m+1)$ for the next iteration. The iteration terminates when the change between $S(E(m))$ and $S(E(m+1))$ is smaller than a threshold, or when the maximum number of iterations $m_{it}$ is reached. The final $E(m)$ is then output as the minimizer $R_{s,t}$. After our experiments, we heuristically find that the number of matched layers is roughly proportional to the resolution of the images. The training process is summarized in Alg. 2.

3.3.4 Classification on the common semantic subspace

The final classification is performed on the common subspace $\Omega$. The unlabeled data of the target domain $D_T$ and the labeled $D_S$ are both projected onto $\Omega$ by the correlation coefficients $V^{S(n_S)}$ and $V^{T(n_T)}$. The projection is formulated as $A^S = H^{S(n_S)}V^{S(n_S)}$ in the source domain and $A^T = H^{T(n_T)}V^{T(n_T)}$ in the target domain. The standard SVM algorithm is applied on $\Omega$: the classifier $\Psi$ is trained on $\{A^S_i, Y^S_i\}_{i=1}^{n_s}$ and applied to $D_T$ as $\Psi(A^T)$. The pseudo-code of this step is given in Alg. 3 below.

Algorithm 3 Classification on the Common Semantic Subspace
Input: $X^S$, $Y^S$, $V^S$, $X^T$, $V^T$; $\Theta(W^S, b^S)$, $\Theta(W^T, b^T)$, $n_s$, $n_t$
Output: $Y^T$
1: function SVMTraining($X^S$, $Y^S$, $V^S$, $\Theta(W^S, b^S)$, $n_s$)
2:   for $i = 1, 2, \dots, n_s$ do
3:     Calculate $H^{S(n_s)}$ for $X^S$ by $\Theta(W^S, b^S)$ as in Eq. 3.1
4:     $A^S \leftarrow H^{S(n_s)} V^{S(n_s)}$
5:   end for
6:   $\Psi \leftarrow$ SVM trained on $\{A^S, Y^S\}$
7: end function
8: function SVMTesting($X^T$, $V^T$, $\Theta(W^T, b^T)$, $n_t$)
9:   for $j = 1, 2, \dots, n_t$ do
10:    Calculate $H^{T(n_t)}$ for $X^T$ by $\Theta(W^T, b^T)$ as in Eq. 3.2
11:    $A^T \leftarrow H^{T(n_t)} V^{T(n_t)}$
12:    $Y^T \leftarrow \Psi(A^T)$
13:  end for
14: end function
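A minimal sketch of Alg. 3 using scikit-learn's SVC (one-against-one multi-class classification, as used in Sec. 3.4.3). The matrix orientations are assumptions, with activations stored column-wise as in Sec. 3.2.

```python
import numpy as np
from sklearn.svm import SVC

def classify_on_subspace(HS_top, YS, HT_top, VS_top, VT_top):
    """Project both domains onto the common subspace with the CCA
    coefficients, train an SVM on the labeled source projections,
    and predict the target labels (Alg. 3).

    HS_top/HT_top : (d, n) top-layer activations of source/target data.
    VS_top/VT_top : (d, k) CCA projection matrices for the last layers.
    YS            : (n_s,) source labels.
    """
    AS = HS_top.T @ VS_top      # source samples on the common subspace
    AT = HT_top.T @ VT_top      # target samples on the same subspace
    clf = SVC(kernel="linear")  # multi-class SVC is one-vs-one internally
    clf.fit(AS, YS)
    return clf.predict(AT)
```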
3.4 Experiments

We evaluate the DT-LET framework on two cross-domain recognition tasks: handwritten digit recognition and text-to-image classification.

3.4.1 Experimental dataset descriptions

Handwritten digit recognition: For this task, we conduct the experiment on the Multiple Features dataset from the UCI machine learning repository. This dataset consists of features of handwritten numerals (0-9, 10 classes in total) extracted from a collection of Dutch utility maps. Six feature sets exist for each numeral, and we choose the two most popular, the 216-D profile correlations and the 240-D pixel averages in 2x3 windows, to complete the transfer-learning-based recognition task.

Text-to-image classification: For this task, we use the NUS-WIDE dataset, which includes 269,648 images and the associated tags from Flickr, with 5,018 unique tags in total. In our experiment, the images are represented by 500-D visual features and annotated with 1000-D text tags from Flickr. Ten categories of instances are included in this classification task: birds, building, cars, cat, dog, fish, flowers, horses, mountain, and plane.

3.4.2 Comparative methods and evaluation

As the proposed DT-LET framework has four main components (deep learning, CCA, layer matching, and the SVM classifier), we first select the baseline method Deep-CCA-SVM (DCCA-SVM) [2] for comparison. We also run experiments without layer matching (the numbers of layers are the same in the source and target domains) while keeping all other parameters identical to DT-LET; we call this variant NoneDT-LET.

The other deep-learning-based comparison methods are duft-tDTNs [92], DeepCoral [87], DANN [31], and ADDA [97]. Among these, duft-tDTNs is the most representative, being the heterogeneous transfer learning method with the best performance to date. DeepCoral is the first deep-learning-based transfer learning framework, DANN first introduced the adversarial domain concept to transfer learning, and ADDA is the most well-known unsupervised transfer learning method.

For the deep-network-based methods, DCCA-SVM, duft-tDTNs, DeepCoral, DANN, and NoneDT-LET all use 4 layers for both the source and target domain data, as we find that more or fewer layers yield worse performance.

Finally, as the evaluation metric, we use the classification accuracy on the target domain data over the two pairs of datasets.

3.4.3 Task 1: handwritten digit recognition

In the first experiment, we study handwritten digit recognition. The source domain data are the 240-D pixel averages in 2x3 windows, while the target domain data are the 216-D profile correlations. As there are 10 classes in total, we complete 45 ($C^2_{10}$) binary classification tasks; for each category, the accuracy is the average over its 9 binary tasks. We use 60% of the data as co-occurrence data to complete the transfer learning process and find the common subspace, 20% of the labeled samples in the source domain as training samples, and the remaining samples in the target domain as testing samples. The experiments are repeated 100 times with 100 sets of randomly chosen training and testing data to avoid data bias [94], and the final accuracy is the average over the 100 repetitions. This data setting applies to all methods under comparison.

For the deep networks, the numbers of neurons of the 4-layer networks are 240-170-100-30 for the source domain data and 216-154-92-30 for the target domain data; this setting applies to all comparison methods. For the proposed DT-LET, the two layer matchings with the lowest loss after 25 iterations are $r^2_{4,3}$ and $r^3_{5,4}$. The numbers of neurons for $r^2_{4,3}$ are 240-170-100-30 for the source domain data and 216-123-30 for the target domain data. The average objective function loss over all 45 binary classification tasks for these two matchings is 0.856 and 0.832 respectively.
The numbers of neurons for $r^3_{5,4}$ are 240-185-130-75-30 for the source domain data and 216-154-92-30 for the target domain data. One-against-one SVM classification is applied for the final classification. The average classification accuracies of the 10 categories are shown in Table 3.1, and the matching correlations are detailed in Fig. 3.2.

Table 3.1: Classification accuracy results on the Multiple Features dataset.

numeral  DCCA-SVM  duft-tDTNs  DeepCoral  DANN   ADDA   NoneDT-LET  DT-LET($r^3_{5,4}$)  DT-LET($r^2_{4,3}$)
0        0.961     0.972       0.864      0.923  0.966  0.983       0.989                0.984
1        0.943     0.956       0.805      0.941  0.978  0.964       0.976                0.982
2        0.955     0.972       0.855      0.911  0.982  0.979       0.980                0.989
3        0.945     0.956       0.873      0.961  0.973  0.966       0.976                0.975
4        0.956     0.969       0.881      0.933  0.980  0.980       0.987                0.983
5        0.938     0.949       0.815      0.946  0.970  0.958       0.971                0.977
6        0.958     0.966       0.893      0.968  0.961  0.978       0.988                0.986
7        0.962     0.968       0.847      0.929  0.979  0.978       0.975                0.985
8        0.948     0.954       0.904      0.968  0.971  0.965       0.968                0.975
9        0.944     0.958       0.915      0.963  0.969  0.970       0.976                0.961

As shown in Table 3.1, the best performances (highlighted in the original table) all occur within the DT-LET framework. However, the best performances for different categories do not occur under the same layer matching. Overall, $r^3_{5,4}$ and $r^2_{4,3}$ are the two best layer matchings compared with the other settings. Based on these results, we heuristically conclude that the best layer-matching ratio (5/4, 4/3) is roughly proportional to the dimension ratio of the original data (240/216). However, more matched layers do not guarantee better performance: for numerals "1", "2", "5", "7", and "8", DT-LET ($r^2_{4,3}$) with 2 layer matchings outperforms DT-LET ($r^3_{5,4}$) with 3 layer matchings.

Figure 3.2: The comparison of different layer matching settings for different frameworks on the Multiple Features dataset (panels: DCCA-SVM/duft-tDTNs/NoneDT-LET; DT-LET with 4(source)-3(target) layer matching; DT-LET with 5(source)-4(target) layer matching).

3.4.4 Task 2: text-to-image classification

In the second experiment, we study text-to-image classification. The source domain data are the 1000-D text features, while the target domain data are the 500-D image features. As there are 10 classes in total, we again complete 45 ($C^2_{10}$) binary classification tasks. We still use 60% of the data as co-occurrence data [110], 20% of the labeled samples in the source domain as training samples, and the remaining samples in the target domain as testing samples. The same data setting as in Task 1 applies to all methods under comparison.

For the deep networks, the numbers of neurons of the 4-layer networks are 1000-750-500-200 for the source domain data and 500-400-300-200 for the target domain data; this setting applies to all comparison methods. For the proposed DT-LET, the layer matchings with the lowest loss after 25 iterations are $r^2_{5,3}$, $r^3_{5,4}$, and $r^2_{5,4}$ (non-full-rank). The average objective function loss of the 45 binary classification tasks for these layer matchings is 3.231, 3.443, and 3.368 respectively. The numbers of neurons for $r^2_{5,3}$ are 1000-800-600-400-200 for the source domain data and 500-350-200 for the target domain data. The numbers of neurons for both $r^3_{5,4}$ and $r^2_{5,4}$ are 1000-750-500-200 for the source domain data and 500-400-300-200 for the target domain data. As the matching principle also influences the performance of transfer learning, we present two $r^2_{5,3}$ variants with different matching principles, as shown in Fig. 3.3
(the average objective function loss for the two different matching principles is 3.231 and 3.455), in which all the detailed layer matching principles are described. For this task, as the overall accuracies are generally lower than in Task 1, we compare more settings of this cross-layer matching task.

We first verify the effectiveness of the DT-LET framework. The DT-LET framework generally achieves around 85% accuracy, while the comparison methods generally achieve no more than 80%. This observation supports the conclusion that finding the appropriate layer matching is essential.

The second comparison is between the full-rank and non-full-rank frameworks. As shown in the table, $r^2_{5,4}$ actually achieves the highest overall accuracy, although the other non-full-rank DT-LET variants do not perform particularly well. This hints that full-rank transfer is not always best, as negative transfer degrades performance; nevertheless, full-rank transfer is generally good, even if not optimal.

The third comparison is between the same transfer with different matching principles. We present two $r^2_{5,3}$ variants with different matching principles and find that their performances differ: case 1 performs better than case 2. This suggests that continuous transfer may be better than discrete transfer: in case 1, the transfer is in the last two layers of both domains, while in case 2 the transfer is conducted in layers 3 and 5 of the source domain.

Table 3.2: Classification accuracy results on the NUS-WIDE dataset.

categories  DCCA-SVM  DeepCoral  DANN  duft-tDTNs  ADDA  NoneDT-LET  DT-LET $r^2_{5,3}$(1)  $r^2_{5,3}$(2)  $r^2_{5,4}$  $r^3_{5,4}$
birds       0.78      0.77       0.67  0.81        0.81  0.78        0.83                   0.83            0.85         0.83
building    0.81      0.78       0.67  0.83        0.83  0.82        0.88                   0.84            0.88         0.89
cars        0.80      0.77       0.69  0.81        0.85  0.81        0.83                   0.83            0.87         0.85
cat         0.80      0.77       0.78  0.83        0.81  0.81        0.87                   0.87            0.86         0.87
dog         0.80      0.77       0.70  0.82        0.81  0.81        0.85                   0.85            0.86         0.82
fish        0.77      0.75       0.73  0.82        0.79  0.78        0.85                   0.84            0.85         0.84
flowers     0.80      0.78       0.77  0.84        0.81  0.81        0.86                   0.84            0.84         0.88
horses      0.80      0.78       0.72  0.82        0.83  0.81        0.84                   0.81            0.84         0.83
mountain    0.82      0.79       0.75  0.82        0.82  0.83        0.83                   0.81            0.82         0.83
plane       0.82      0.79       0.79  0.82        0.79  0.83        0.81                   0.83            0.83         0.83
average     0.80      0.77       0.77  0.83        0.83  0.81        0.84                   0.83            0.85         0.85

Comparing specific categories, we find that objects with a large semantic difference from the other categories achieve higher accuracy. For the objects that are hard to classify, such as "birds" and "plane", the accuracies remain low even with DT-LET. This confirms that DT-LET only improves the transfer process, which in turn helps the subsequent classification; the classification accuracy still depends on the semantic differences among the 10 categories.

We also point out that the relationship between the average objective function loss and the classification accuracy is not strictly positive. Overall, $r^2_{5,4}$ achieves the highest classification accuracy while its average objective function loss is not the lowest. Based on this observation, the lowest average objective function loss only guarantees the best transfer learning result with the optimal common subspace. On that common subspace, the data projected from the target domain are then classified, and these classification results are also influenced by the classifier and by the training samples projected from the source domain. Therefore, we conclude as follows.
We can only guarantee a good classification performance after obtaining the optimal transfer learning result; the classification accuracy is also influenced by the classification settings.

Figure 3.3: The comparison of different layer matching settings for different frameworks on the NUS-WIDE dataset (panels: DT-LET 5(source)-3(target) layer matching with 2 matchings, cases 1 and 2; DT-LET 5(source)-4(target) layer matching with 2 matchings; DT-LET 5(source)-4(target) layer matching with 3 matchings).

3.4.5 Parameter sensitivity

In this section, we study the effect of different parameters on the networks. We note that even when the layer matching is random, the last layers of the two neural networks of the source domain and the target domain must be correlated to construct the common subspace. The number of neurons in this last layer also affects the final classification result. Taking the Multiple Features dataset as an example, the results are shown in Table 3.3.

Table 3.3: Effects of the number of neurons in the last layer.

layer matching  10      20      30      40      50
$r^3_{5,4}$     0.9082  0.9543  0.9786  0.9771  0.9653
$r^2_{4,3}$     0.8853  0.9677  0.9797  0.9713  0.9522

From this table, the performance is best when the number of neurons is 30; therefore, 30 neurons were used in the preceding experiments. We can also conclude that more neurons are not always better. Based on this observation, the number of neurons in the last layer is set to 30 in Task 1 and to 200 in Task 2.

3.5 Conclusion

In this chapter, we propose a novel framework, referred to as Deep Transfer Learning by Exploring where to Transfer (DT-LET), for handwritten digit recognition and text-to-image classification. In the proposed model, we first find the best matching of deep layers for transfer between the source and target domains. After the matching, the final correlated common subspace is found, on which the classifier is applied. Experimental results support the effectiveness of the proposed framework.

Chapter 4

How to Transfer: Structure Preserving Transfer Learning for Unsupervised Hyperspectral Image Classification

This chapter discusses the "how to transfer" task, specifically on HSI data. A primary assumption in many transfer learning problems is that the training and testing data are in the same feature space and follow the same distribution. However, this assumption does not always hold in real-world problems, especially in HSI processing problems with extremely scarce or even no training samples. The proposed framework aims to find effective ways of transfer learning for HSI, transferring information from annotation-rich data to annotation-scarce data to assist with its classification.

A prevalent trend in HSI classification is to exploit the spatial and spectral information of HSI data to the maximum. However, a major assumption of the dominant supervised process is that the training and testing data are in the same feature space and have the same distribution [46][99]. In many real-world problems, especially the processing of newly collected HSI data whose training samples are not ample, this assumption may not hold, and existing supervised methods fail to work. Traditional unsupervised methods that barely consider the data distribution of HSI may yield unsatisfying classification performances.
In this situation, we may transfer the knowledge obtained from another relevant dataset with sufficient training samples (the source domain), possibly with a different feature space or learning task, to assist the unsupervised HSI processing in the target domain.

Transfer learning makes use of prior information from a relevant domain to learn new tasks in the objective domain [85]; the former and latter domains are named the source domain and the target domain respectively. Depending on whether labeled training samples in the target domain are available, transfer learning methods are further divided into supervised and unsupervised categories. Here we concentrate on unsupervised transfer learning, which is of great interest in real-world problems. A key challenge in transfer learning is how to establish the relevance between the two domains. Generally, the data in the two domains share the same task but follow different distributions. More specifically for our HSI problem, even when they represent the same categories of interest, the data in the two domains can have very different spectra, due to factors such as varying acquisition conditions and the use of different sensors.

Great efforts have been made to adapt the data from the source domain to the target domain. The representative idea is to transfer the representations of both source and target data into a common space. The most prevalent works can be found in [34][36], but they share some common drawbacks. First, they generally do not take into consideration the intrinsic structures of the data, including the global and local features of HSI data. Second, the original data are treated equally and represented with the same principle [106], and the noise cannot be effectively eliminated during the representation process [57].

To address the above limitations, we propose a novel Dual Space Unsupervised Structure Preserving Transfer Learning (DSTL) framework for HSI classification. To address the first concern, a Markov Random Field (MRF) is applied to exploit the intrinsic data structure and obtain the optimal labels for the HSI data in the target domain [113]. For the second concern, we formulate the objective function by simultaneously considering the noise and the transformation matrix, so that the proposed model can effectively handle the data discrepancy between the source domain data and the target domain data. A flowchart of the proposed framework is given in Fig. 4.1.

4.1 Contribution summary

We summarize our major contributions in correspondence with the proposed methods. For the proposed non-adversarial method DSTL, the contributions are as follows:

• We propose a novel transfer learning framework. We apply joint low-rank and sparse constraints for reconstruction to preserve both the local and global structures of the data on the new feature subspace that we specifically introduce [106][25]. Data from both domains are well interlaced by these constraints. We also address the noise concern by using a sparse matrix to model the noise, so that the noise information can be filtered out.

• We explore a new dual-space structure-preserving constraint. We obtain initial classification results on the subspace.
In the original data space, we further refine the results from the subspace by considering the structure of the target data via MRF, to avoid local minima and obtain globally optimal labels.

• We investigate two specific transfer subspace learning scenarios, namely intra-scenario transfer and inter-scenario transfer. Experimental results show the effectiveness of the proposed framework for HSI classification on the target domain without labeled training samples.

4.2 Method

In this section, we present the proposed method in detail, as illustrated in Fig. 4.1. The related definitions are given first.

4.2.1 Definitions

Let $X_s \in \mathbb{R}^{d_1 \times n_s}$ be the data from the source domain and $X_t \in \mathbb{R}^{d_1 \times n_t}$ the data from the target domain, in which $d_1$ is the dimension of the HSI data, and $n_s$ and $n_t$ are the numbers of pixels in the two domains respectively. Suppose the transformation matrix is $T \in \mathbb{R}^{d_1 \times d_2}$, the reconstruction matrix for the source domain data is $R \in \mathbb{R}^{n_s \times n_t}$, and the noise matrix is $E \in \mathbb{R}^{d_2 \times n_t}$, in which $d_2$ is the dimension of the data on the subspace. The label matrices are denoted by $Y_s = [y_1, y_2, \dots, y_{n_s}] \in \mathbb{R}^{m \times n_s}$ and $Y_t = [y_1, y_2, \dots, y_{n_t}] \in \mathbb{R}^{m \times n_t}$, where $m$ is the number of classes. For a random sample $x_i$ from class $k$, its label vector $y_i$ has its $k$-th element $y^k_i = 1$, and the rest are zeros.

Figure 4.1: Framework of the proposed method. The source data $X_s$ are the HSI data from the Pavia Center domain, whereas the target data $X_t$ are HSI from the Pavia University domain. We first project the data of both domains into a new feature subspace and find the transformation matrix $T$. With $T$, we obtain the initial classes of the pixels. The data of the target domain then return to the original space to be optimized by MRF, which imposes the structure constraint.

4.2.2 Constraints on the subspace

We can relate the HSI of the source domain and the target domain on a new common feature subspace by

$T^T X_t = T^T X_s R,$  (4.1)

and the solution for $T$ and $R$ in this equation is formulated as

$\min_{T,R} \|T^T X_t - T^T X_s R\|.$  (4.2)

Since the data of both domains lie on the same subspace, we can find the knowledge that can be effectively transferred based on Eq. 4.2. However, for HSI, which can span multiple subspaces, the transfer process is not unique; moreover, the structure information cannot be exploited by Eq. 4.2. We therefore project the data of both domains onto an optimal common subspace on which the divergence between the data distributions of the two domains is minimized. With minimum divergence, the data from the target domain can be reconstructed from the neighboring data of the source domain. More specifically, the data from both domains have nearly identical distributions and lie on the same manifold subspace, where each sample of the same task can be represented by its neighbors. To achieve this, we note that the reconstruction matrix $R$ should have a block-wise structure, which can be achieved through a low-rank restriction, and we reformulate Eq. 4.2 as

$\min_{T,R}\ \mathrm{rank}(R), \quad \text{s.t.}\ T^T X_t = T^T X_s R.$  (4.3)

This problem is non-convex, a globally optimal solution cannot be obtained, and it is NP-hard.
However, Eq. 4.3 can be relaxed to the equivalent convex problem

$\min_{T,R} \|R\|_*, \quad \text{s.t.}\ T^T X_t = T^T X_s R,$  (4.4)

where $\|R\|_*$ denotes the nuclear norm of $R$, which is a convex function. However, Eq. 4.3 and Eq. 4.4 alone are far from satisfactory: this restriction only models the relationship between the source and target domains through the neighboring relations on the subspace. As we want to represent the target samples on the subspace with the fewest neighbors, so as to preserve the local structure of the data, we further introduce a sparse constraint on the reconstruction matrix $R$:

$\min_{T,R} \|R\|_* + \alpha\|R\|_1, \quad \text{s.t.}\ T^T X_t = T^T X_s R.$  (4.5)

Another variable $E$ is introduced to model the noise of the target HSI on the subspace. To alleviate the noise influence, we impose a sparse constraint on $E$, and Eq. 4.5 becomes

$\min_{T,R,E} \|R\|_* + \alpha\|R\|_1 + \beta\|E\|_1, \quad \text{s.t.}\ T^T X_t = T^T X_s R + E.$  (4.6)

Here $\alpha$ and $\beta$ are both heuristically set to 0.1, as we observed in a preliminary study that the classification performance of our method is relatively stable when the parameters lie in a feasible range. We assume that the training samples can be transformed into the strict binary label matrix $Y_s$ through

$\theta(T, X_s, Y_s) = \|T^T X_s - Y_s\| + \|T\|_1.$  (4.7)

By minimizing Eq. 4.7, we obtain the optimal transformation matrix $T$. The final objective function on the subspace is

$\min_{T,R,E}\ \theta(T, X_s, Y_s) + \|R\|_* + \alpha\|R\|_1 + \beta\|E\|_1, \quad \text{s.t.}\ T^T X_t = T^T X_s R + E.$  (4.8)

We optimize Eq. 4.8 by employing the Inexact Augmented Lagrange Multiplier (IALM) algorithm, as shown in [106]. After finding the optimal subspace, we employ the Support Vector Machine (SVM) [18] as the final classifier on this subspace to obtain the labels $Y_t$ for the data of the target domain.
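IALM alternates closed-form updates of the penalized variables in Eq. 4.8; at its core are two proximal operators, sketched below in NumPy. This is a sketch only: the step ordering, the Lagrange multiplier updates, and the update of $T$ follow [106] and are omitted here.

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1: elementwise shrinkage, used for
    the sparse penalties on R (alpha term) and E (beta term) in Eq. 4.8."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Proximal operator of tau * ||X||_*: singular value thresholding,
    used for the low-rank penalty on the reconstruction matrix R."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```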
4.2.3 Constraints on the original target space

On the above subspace, the data are best represented by exploiting the global and local structures of the HSI in the source domain, but the local structure of the HSI in the target domain is not yet considered. Therefore, besides knowledge transfer, we further optimize the classification by preserving the self-structure of the target domain data through a Markov Random Field. The MRF describes the following energy minimization problem:

$N = N_d + \lambda N_s,$  (4.9)

where $N_d$ is the data term representing the likelihood of the objective data, and $N_s$ is the smoothness term reflecting the joint Gibbs distribution of the label field, which satisfies the Markov property [65]. $\lambda$ weighs the two terms (here we weigh them equally and set $\lambda$ to 1). The energy is initialized with the initial results $Y_t$ obtained on the subspace; by finding the minimum of the energy function $N$, the corresponding label field is acquired.

In this energy function, we define the smoothness term $N_s$ by the ISING model, to capture the structure of neighboring pixels in the HSI, and the data term $N_d$ by a Gaussian Mixture Model (GMM), as an HSI is generally formed by a mixture of several categories. Specifically, $N_s$ formulated by the ISING model is expressed as

$N_s = \sum_{a,b \in C} V_c(Y_{t_a}, Y_{t_b}), \quad \text{with}\ V_{a,b}(Y_{t_a}, Y_{t_b}) = \begin{cases} -\rho/n, & \text{if } Y_{t_a} = Y_{t_b}, \\ \ \ \rho/n, & \text{if } Y_{t_a} \neq Y_{t_b}, \end{cases}$  (4.10)

where $C$ is the set of cliques in a specific neighborhood system, $a$ and $b$ denote a random center pixel and one of its neighboring pixels in the image, and $Y_{t_a}$ and $Y_{t_b}$ are their corresponding labels. $n$ is the order of neighboring, i.e., the distance between $a$ and $b$. This formulation means that the larger the distance between a pixel and the center, the smaller its effect; pixels of different orders are treated differently by assigning them various weights. Such weights reflect the significance of neighboring pixels in the system. The data term $N_d$ penalizes solutions that are inconsistent with the prior knowledge. More specifically,

$N_d = \sum_{a \in C} w_a(Y_{t_a}), \quad \text{with}\ w_a(Y_{t_a}) = f(a \mid \theta_{Y_{t_a}}),$  (4.11)

where $w_a(Y_{t_a})$ is the cost of assigning label $Y_{t_a}$ to pixel $a$, $f(\cdot)$ is the probability density function of the GMM distribution, and $\theta_{Y_{t_a}}$ denotes the parameter set needed to construct the GMM model. We optimize the problem in Eq. 4.9 by the EM method [113] and obtain the final labels for the target domain HSI data based on its structure in the original data space. The whole process is summarized in Alg. 4.

Algorithm 4 Structure Preserving Transfer Learning for HSI Classification
Input: $X_s$, $X_t$, $Y_s$, $\alpha$, $\beta$, $\lambda$
Output: $T$, $R$, $E$, $Y_t$
1: $T, R, E \leftarrow$ Initialization($X_s$, $X_t$, $Y_s$, $\alpha$, $\beta$)
2: Find the subspace by $T$, $R$, $E$
3: On the subspace, get the initial label $Y_t$ by the SVM classifier
4: On the original space, $Y_t \leftarrow$ Optimization($Y_t$, $\lambda$)
5: function Initialization($X_s$, $X_t$, $Y_s$, $\alpha$, $\beta$)
6:   Compute $\arg\min_{T,R,E}\ \theta(T, X_s, Y_s) + \|R\|_* + \alpha\|R\|_1 + \beta\|E\|_1$ in Eq. 4.8 by IALM
7: end function
8: function Optimization($Y_t$, $\lambda$)
9:   $N_s = \sum_{a,b \in C} V_c(Y_{t_a}, Y_{t_b})$, $N_d = \sum_{a \in C} w_a(Y_{t_a})$
10:  $Y_t = \arg\min_{Y_t} N(Y_t) = N_d(Y_t) + \lambda N_s(Y_t)$
11: end function
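Alg. 4 optimizes the energy of Eq. 4.9 with the EM method of [113]. As a simpler illustration of the same energy, the sketch below greedily minimizes it with iterated conditional modes (ICM) on a first-order (4-neighbor) system, assuming the per-pixel class costs (e.g., GMM negative log-likelihoods consistent with the data term of Eq. 4.11) have been precomputed. ICM is our illustrative substitute, not the optimizer used in this chapter.

```python
import numpy as np

def icm(labels, data_cost, lam=1.0, rho=1.0, n_iter=5):
    """Greedy minimization of N = N_d + lam * N_s (Eq. 4.9) on a label image.

    labels    : (H, W) integer label map, initialized with Y_t from the subspace.
    data_cost : (H, W, m) per-pixel, per-class costs standing in for N_d.
    The smoothness term follows the first-order ISING model of Eq. 4.10:
    -rho for an agreeing 4-neighbor, +rho for a disagreeing one.
    """
    H, W, m = data_cost.shape
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                best, best_cost = labels[i, j], np.inf
                for k in range(m):
                    cost = data_cost[i, j, k]
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W:
                            cost += lam * (-rho if labels[ni, nj] == k else rho)
                    if cost < best_cost:
                        best, best_cost = k, cost
                labels[i, j] = best
    return labels
```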
4.3 Experiments

In this section, we present the experimental results of the proposed method in comparison with several existing transfer learning methods. To verify the performance, we apply the proposed Dual Space Unsupervised Structure Preserving Transfer Learning (DSTL) to hyperspectral classification. Although the main classifier, SVM, is theoretically a supervised method, in our problem the training samples come from the source domain and the testing samples come from the target domain; in this sense, we can still regard the classification of the HSI in the target domain as unsupervised. We compare the proposed framework with several approaches, including the recently proposed unsupervised classification method k-means++ [6][123] (which was shown to yield the best performance for unsupervised hyperspectral image classification) and two transfer-learning-based classification methods: direct non-subspace Transfer Learning (DTL) and Subspace Transfer Learning (STL) [106]. It is worth noting that this is the first time DTL and STL are applied here, as transfer learning had not yet been explored for our specific problem.

4.3.1 Data sets

Three publicly available hyperspectral images are tested to illustrate the superiority of the proposed method. We conduct the experiments on two types of data: different categories of the same HSI (the Salinas scene), and data from two different HSIs (Pavia Center and Pavia University). A common characteristic of these two types of data is that they have the same dimension. Detailed descriptions follow.

The Salinas scene was collected by the AVIRIS sensor over Salinas Valley, California. The image has 224 spectral bands and comprises 512x217 samples, containing 16 classes of interest. We select 4 categories of Lettuce romaine (categories 11-14), which are difficult to distinguish on Salinas, to conduct our experiment; these four categories are abbreviated as L1, L2, L3, and L4.

Pavia Center and Pavia University are two scenes acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. They share seven classes of interest, and we choose them to complete the experiment. The number of spectral bands is 102 for Pavia Centre and 103 for Pavia University (we select 102 bands to conduct the experiment). Pavia Centre is a 1096x1096-pixel image, and Pavia University is 610x610 pixels, but some samples in both images contain no information and are discarded before the analysis. The geometric resolution is 1.3 meters. Both images include 9 classes each. We abbreviate the two datasets as PC and PU respectively.

Table 4.1: Classification accuracy results on Salinas.

Salinas          k-means++  DTL     STL     DSTL
L1, L2 -> L3, L4  0.9285    0.2941  0.9388  0.9528
L1, L3 -> L2, L4  0.9833    0.9903  0.9995  0.9998
L1, L4 -> L2, L3  0.9983    0.7081  0.9395  0.9986
L2, L3 -> L1, L4  0.9822    0.9944  0.9963  0.9992
L2, L4 -> L1, L3  0.9985    0.6089  1.0000  1.0000
L3, L4 -> L1, L2  0.9776    0.9530  0.9533  0.9933

4.3.2 Experiments on Salinas

We first conduct the experiments on Salinas and report the results in Table 4.1. From the table, we observe that the proposed DSTL consistently yields the best performance, while DTL is the weakest. The classification between L3 and L4, as well as between L2 and L3, is most challenging, as L2 has characteristics similar to those of L3 and L4.

The significance of structure preserving is supported by comparing STL with DTL: without subspace learning, merely transferring knowledge from the source to the target cannot provide good performance. Take the classification of L3 and L4 as an example: the accuracy is 29.41%, which is quite poor. The performance improves to 93.88% when subspace learning is introduced and the structure information of the HSI in the source domain is exploited. The results improve further with structure preserving of the HSI in the target domain: when MRF is employed, the accuracy rises to 95.28%, which is promising.

By comparing the proposed DSTL with the k-means method, we find that with the assistance of transfer learning and the knowledge learned from the source domain, the classification accuracy of the HSI in the target domain improves, mostly by 1%-2%, with final accuracies near 100%.

4.3.3 Experiments on Pavia University and Center

Second, we conduct our experiment on the PU and PC data. The overall results are shown in Table 4.2. The proposed framework is compared with two traditional unsupervised methods, k-means++ and Nonnegative Matrix Factorization (NMF), as well as with DTL and STL. We compare the overall accuracy of each method on both datasets. For the transfer-learning-based methods (DTL, STL, DSTL), the labeled data from the source domain are 100 samples per class on PU (when Pavia University is the source domain) and 200 samples per class on PC (when PC is the source domain). The influence of the number of labeled samples is discussed in the next section.

Table 4.2: Classification accuracy results on PU and PC.

Pavia      PC -> PU  PU -> PC
k-means++  0.4231    0.5433
NMF        0.5497    0.5623
DTL        0.1822    0.1712
STL        0.6288    0.6677
DSTL       0.6537    0.6971

As shown in Table 4.2, the classification performance on PC is always better than that on PU, since the former has more complex features. By comparing the unsupervised methods k-means and NMF with the proposed DSTL, we note that introducing transfer learning and the knowledge of the source domain data boosts the performance of HSI classification. Although the two datasets have quite different characteristics, the prior knowledge can still be exploited.
Merely applying k-means or NMF, which barely consider the features of the data and use no manual labeling, is far from satisfactory.

The comparison among DTL, STL, and DSTL illustrates the significance of each component of the proposed method. Without subspace learning, DTL performs worst: its accuracy is about 18%, which is almost the random-guess level. Such poor accuracy also suggests that the two datasets are only weakly related. By introducing the subspace learning process, the relationship between the two datasets is established and the accuracy increases to more than 60%. By applying the structure-preserving restriction in the target domain, the classification accuracies on both datasets further increase by about 3%, to roughly 70%. This accuracy is close to that of some supervised classification methods.

Table 4.3: Classification accuracy results when using different numbers of labeled samples per class from the source domain.

Samples/class   50    100   150   200   250   300   350   400
PC -> PU STL    0.41  0.52  0.60  0.63  0.61  0.60  0.59  0.55
PC -> PU DSTL   0.39  0.50  0.64  0.65  0.65  0.64  0.64  0.63
PU -> PC STL    0.59  0.67  0.66  0.66  0.64  0.62  0.61  0.62
PU -> PC DSTL   0.63  0.70  0.68  0.64  0.67  0.67  0.63  0.65

4.3.4 Influence of labeled data

The amount of labeled data in the source domain influences the classification accuracy in the target domain. We take the transfer between PU and PC as an example, as reported in Table 4.3, from which three observations can be made.

First, the performance of transfer learning is not always better with more labeled source domain samples: the accuracy may rise at first with more labeled data but fall later. This phenomenon is likely caused by over-fitting: with too many labeled samples, the learned subspaces are more likely to fit the distribution of the source domain data, so the accuracy on the target domain may decrease. The best number of labeled samples is 200 per class for PC-to-PU transfer learning and 100 per class for PU-to-PC transfer learning.

Second, the structure-preserving process in the target domain is not always beneficial for classification. When the initial classification result $Y_t$ on the subspace is poor, the final accuracy may decrease after structure preserving. One example is the comparison of STL and DSTL for PC -> PU: when the labeled data are no more than 100 samples per class, the accuracy of DSTL is lower than that of STL. This is likely caused by the nature of the structure-preserving process: when the classification accuracy is low and errors occur too frequently, errors may spread during structure preserving. With higher accuracy, this effect is overcome.

Finally, we point out that the best number of labeled source domain samples is not stable and varies with the volumes of the source and target data; we cannot generally conclude which domain matters more. More source domain data provide more knowledge to transfer, and more target domain data require a more robust subspace. This property will be further explored in our future work.

4.4 Conclusion

In this chapter, a novel method named Dual Space Unsupervised Structure Preserving Transfer Learning (DSTL) is proposed for unsupervised HSI classification.
The main idea is to transfer the knowledge of HSI in the source domain to the target domain, to perform classification with no prior information and thus address the time- and labor-consuming HSI labeling problem.

The proposed method consists of two major parts. The first transfers the data of both domains to a specific subspace, on which we obtain initial classification results for the target HSI by exploiting the data structure. The second optimizes the initial results in the original target data space based on its structure, by applying the Markov Random Field (MRF) approach.

As an unsupervised HSI classification method, the proposed DSTL is robust and effective, as supported by the experimental results; extensive comparisons also demonstrate its superiority over the competitors.

However, limitations also exist. Regarding CPU time and computational complexity, the proposed DSTL is not as efficient as some regular unsupervised classification methods due to its subspace learning process, which may take tens of seconds. Our future work is to further address this complexity problem.

Chapter 5

How to Transfer: Xnet

This chapter mainly discusses the "how to transfer" task, for both transfer learning between RGB images and transfer learning between satellite-view and aerial-view images. By exploring the correlation of images over a wide range of image types, we aim to effectively relieve the domain shift problem within the "how to transfer" setting.

Many computer vision problems face changing factors such as illumination, position, and different numbers of data channels, and thus need adaptation strategies. Images collected under different factors make transfer learning suffer from a serious domain shift problem; such a transfer also entails a domain gap due to the varying characteristics of the domain samples [78]. Although one way to minimize this domain difference is to learn domain-invariant representations [70][61][62][64], given the scarcity of data in the target domain, it is not easy to obtain such representations.

Deep transfer learning algorithms are designed for situations where access to the target domain data is costly or even non-existent [78]. The case with fully unlabeled target domain data is referred to as unsupervised deep transfer learning, while the case with both labeled and unlabeled target domain samples is called semi-supervised deep transfer learning. The algorithm proposed in this chapter is designed for unsupervised deep transfer learning.

To our knowledge, the latest advance in unsupervised deep transfer learning is the embedded adversarial learning framework [45][117].

Figure 5.1: A general structural comparison between three types of deep transfer learning: (a) non-adversarial deep transfer learning, (b) adversarial deep transfer learning, and Xnet with attention learning on a common domain. S: the source domain; T: the target domain.
Table 5.1: A comparison of features for the classification task.

Feature characteristic  Xnet  ADDA  NADDA
Domain-specific         yes   no    yes
Adversarial             yes   yes   no
Attentional             yes   no    no
Task-specific           yes   no    no

In the domain adversarial adaptation model, the whole source and target domain data distributions are aligned together [97][1][39]. In current adversarial transfer learning, the domain-aligning process ignores the individual characteristics underlying the unique data structure of each domain, which leads to two limitations.

• First, adversarial transfer learning may fail in practical scenarios where a large semantic gap exists between the source domain and the target domain [7], especially for the proposed practical satellite-to-aerial scene adaptation task.

• Second, well-adapted features may not be suitable for the classification/recognition task: well-adapted domain features change the original data too much to keep their original discriminative characteristics for classification. Adaptation that is harmful for classification/recognition can be viewed as negative transfer [91][11]. This point was mentioned in [31], but no satisfactory solution was proposed.

5.1 Contribution summary

In this chapter, we propose the Xnet framework to address the above limitations. The main contributions of this work are as follows:

• Domain-specific feature generator: Unlike existing models, in the proposed model, data from different domains pass through domain-specific feature generators. The idea is to handle domains with large semantic gaps by employing domain-specific structure/parameter settings. Such a feature generator setting is also important for generating attentional features.

• Task-specific attentional transfer learning: By introducing an attention mechanism, the gap between the transfer learning and classification/recognition tasks can be bridged. The features generated by the adaptation process are further adapted to the classification/recognition task.

• Satellite-to-aerial scene: We investigate a practical scenario for transfer learning. As the collection of satellite/aerial data is much easier nowadays while annotations remain limited, this application is practically important. Comparisons with the state of the art on our benchmark dataset are also provided.

There are also other minor novel parts in the proposed neural network structure, as stated in Sec. 5.3.2 and illustrated in Fig. 5.2.

5.2 Method

In this section, we present each component of the proposed Xnet in detail. As the major novelty lies in the neural network structure, we use only the minimum number of equations to illustrate the work. General terminology is introduced first. Suppose we are given the source domain data $D_s = \{(x^s_i, y^s_i)\}_{i=1}^{n_s}$ with $n_s$ labeled samples, and the target domain data $D_t = \{x^t_i\}_{i=1}^{n_t}$. They are sampled from the distributions $P(X^s, Y^s)$ and $Q(X^t, Y^t)$, referring to the source distribution and the target distribution (source domain and target domain) respectively. Both distributions are assumed complex and unknown, similar but different. For a random data sample $x$, its domain label is represented by $d_x \in \{0, 1\}$, where '0' means that $x$ comes from the source domain and '1' means it comes from the target domain.

5.2.1 Network structure

We now explain the deep feed-forward architecture, as shown in Fig. 5.2.
Four phases are included in the network pipeline.

Figure 5.2: The flowchart of the proposed Xnet. Red color highlights the task-specific attentional adaptation part of the network structure. Here GRL stands for Gradient Reversal Layer [30]. Detailed parameter definitions can be found in Sec. 5.2.

In the first phase, the input $x^s$ (or $x^t$) first passes through several feed-forward layers, which are viewed as the domain-specific deep feature extractor $f^s = G^s_f(x^s, \theta^s_f)$ or $f^t = G^t_f(x^t, \theta^t_f)$, where $f^s$ (or $f^t$) denotes the intermediate output feature and $\theta^s_f$ (or $\theta^t_f$) the parameters of the deep feature extractor. As a large semantic/distribution gap may exist between the characteristics of the source and target domain data, separate feature extractors can be employed to overcome the problems brought by direct data alignment.

In the second phase, $f^s$ and $f^t$ pass through a common deep feature extractor $f^c = G^c_f(f^{s/t}, \theta^c_f)$, in which $f^c$ denotes the common feature and $\theta^c_f$ the parameters of the neural network in this phase.

In the third phase, the common feature vector $f^c$ is mapped by the label predictor $y = G_y(f^c, \theta_y)$ to obtain the classification label. Moreover, we denote by $f^l$ the last representation of the input before it passes through the final softmax layer that produces the class score.

The last phase classifies the domain label: $f^c$ is mapped by the domain classifier $d = G_d(f^c, \theta_d)$ to decide from which domain the input data comes. $\theta_y$ and $\theta_d$ are, respectively, the parameters of the neural networks used in the label prediction and domain classification phases.

5.2.2 The objective functions

We propose three objectives for the learning process.

First, we aim to minimize the classification loss by comparing the predicted label with the annotated label, especially for the source domain data. This objective optimizes $\theta^s_f$, $\theta^c_f$, and $\theta_y$; its purpose is to extract the discriminative common-subspace feature $f^c$ and an accurate label predictor $G_y$.

The second objective is the domain adversarial loss. On the one hand, we want to obtain the domain-invariant feature $f^c$, so we maximize the loss of the domain classifier $G_d$ with respect to the parameters $\theta^s_f$, $\theta^t_f$, $\theta^c_f$. On the other hand, we want to find the parameters $\theta_d$ that minimize the loss of the domain classifier $G_d$.

The third objective concerns the attention learning between features of the target and common domains, specifically $f^t$ and $f^l$. Its purpose is to strike a balance between the classification task and the transfer learning task: we require $f^l$ to be domain-invariant, but we also want to prevent $f^l$ from deviating too far from $f^t$, because $f^t$, which sacrifices no information for the domain-invariance purpose, may be more discriminative for label prediction. To achieve this, we learn the task-specific attentional (tsa) feature $g$, shown in Fig. 5.2, as a trade-off between $f^t$ and $f^l$.
The calculation of $g$ will be further explained in the next part; for now we can formulate the loss function as

$$L(\theta_f,\theta_y,\theta_d) = \sum_{x_i \in P} L_y\big(G_y(G_f^c(G_f^s(x_i,\theta_f^s),\theta_f^c),\theta_y),\, y_i^s\big) - \lambda \sum_{x_i \in \{P,Q\}} L_d\big(G_d(G_f^c(G_f^{s/t}(x_i,\theta_f^{s/t}),\theta_f^c),\theta_d),\, d_{x_i}\big). \quad (5.1)$$

Here $\theta_f = \{\theta_f^s, \theta_f^t, \theta_f^c\}$, and we use $\theta_f$ for short. $L_y$ and $L_d$ represent the label prediction loss and the domain classification loss, respectively. $\lambda$ provides a trade-off between the two losses.

Based on the first two objectives, we can solve the above problem by finding the saddle point

$$(\hat\theta_f, \hat\theta_y) = \arg\min_{\theta_f,\theta_y} L(\theta_f,\theta_y,\hat\theta_d), \qquad \hat\theta_d = \arg\max_{\theta_d} L(\hat\theta_f,\hat\theta_y,\theta_d). \quad (5.2)$$

However, up to now we have only explained the first two objectives; the third, attentional objective is elaborated in the next section.

5.2.3 Task-specific attentional adaptations

Unlike traditional domain adversarial adaptation, which operates only between the source domain and the target domain, in this chapter we introduce another, common domain. Adversarial learning and attention learning are conducted between two domain pairs: one is the pair of the source domain and the target domain, and the other is the pair of the target domain and the common domain. The first pair corresponds to the regular adversarial adaptation described in the previous section.

In this section, we elaborate on the attention learning between the target and common domains. The primary purpose is to exploit the discriminative information of the original target domain data before it is projected to the common domain, and thereby to improve the performance of the label predictor $G_y$. The attention learning here mainly concerns the features $f^t$ and $f^l$, and the goal of the learning process is to find a trade-off between these two.

First, we obtain the target feature $f^t$ and then the final feature before the softmax layer, $f^l$. The discriminative feature $f^t$ preserves the original target domain characteristics without aligning with the source domain feature. The domain-invariant feature $f^l$, which lies on the common domain after feature extraction through several layers, is assumed to be shared by the two domains, and is the final feature used by traditional domain adversarial adaptation methods [31]. In our view, both features are informative and could be input to the label predictor $G_y$, although they have different characteristics: $f^t$ tends to be more private to the target domain, while $f^l$ tends to move away from the target domain after domain alignment. To trade off these two opposite tendencies, we introduce the tsa feature $g$, computed with an attention mechanism. This tsa feature is directly connected to the final softmax layer to produce the class score, replacing $f^l$.

The calculation of $g$ is based on the attention mechanism [44] and proceeds as follows. We first write $f^t = \{f_1^t, f_2^t, \ldots, f_n^t\}$, where $f_i^t$ represents the vector of output activations at spatial location $i$ of the $n$ total spatial locations in the layer. We calculate the compatibility between $f^l$ and $f^t$ by the dot product

$$c_i = \langle \hat f_i^t, f^l \rangle, \quad (5.3)$$

where $\hat f_i^t$ is the linear mapping of $f_i^t$ to the dimensionality of $f^l$, and $\hat f^t = \{\hat f_1^t, \hat f_2^t, \ldots, \hat f_n^t\}$. The set of compatibility scores is denoted by $C(\hat f^t, f^l) = \{c_1, c_2, \ldots, c_n\}$. The compatibility scores are then normalized as

$$a_i = \frac{\exp c_i}{\sum_{j=1}^{n} \exp c_j}. \quad (5.4)$$

The normalized compatibility scores $A = \{a_1, a_2, \ldots, a_n\}$ represent the attention. $A$ is then applied to $f^t$ as weights to produce the single vector $g = \sum_{i=1}^{n} a_i \cdot f_i^t$.
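To make Eqs. (5.3)-(5.4) and the construction of $g$ concrete, the sketch below computes the tsa feature in PyTorch; the tensor shapes and the linear projection module are illustrative assumptions, not the released implementation.

```python
# A sketch of the tsa feature computation (Eqs. 5.3-5.4 and g).
# Shapes and the projection layer are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def tsa_feature(ft, fl, proj):
    """ft: (N, n, d_t) spatial target features; fl: (N, d_l) pre-softmax feature;
    proj: nn.Linear(d_t, d_l) mapping each f_i^t to the dimensionality of f^l."""
    ft_hat = proj(ft)                            # linear map of Eq. 5.3
    c = torch.einsum("nid,nd->ni", ft_hat, fl)   # compatibility scores c_i (Eq. 5.3)
    a = F.softmax(c, dim=1)                      # normalized attention a_i (Eq. 5.4)
    g = torch.einsum("ni,nid->nd", a, ft)        # g = sum_i a_i * f_i^t
    return g

# Example usage with assumed dimensions (a 7x7 grid gives n = 49 locations).
proj = nn.Linear(32, 64)
ft = torch.randn(8, 49, 32)
fl = torch.randn(8, 64)
g = tsa_feature(ft, fl, proj)  # (8, 32); the final FC layer is assumed to accept it
```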
Now $g$ replaces $f^l$ as the input to the last fully connected layer to obtain the final classification result. We can then reformulate the loss function in Eq. 5.1 as

$$L(\theta_f,\theta_y,\theta_d) = \sum_{x_i \in P} L_y\big(G_y(g,\theta_y),\, y_i^s\big) - \lambda \sum_{x_i \in \{P,Q\}} L_d\big(G_d(G_f(x_i,\theta_f),\theta_d),\, d_{x_i}\big). \quad (5.5)$$

The label predictor is now related to the tsa feature $g$, which is calculated from multiple middle-layer features ($f^t$, $f^l$). Eq. 5.5 is our final objective function. For the optimization of Eq. 5.5, as demonstrated in [30], the equation can be optimized by regular stochastic gradient descent (SGD) solvers after introducing the gradient reversal layer (GRL) transformation for the domain classifier.

Therefore, the proposed Xnet can train a more robust label predictor, which works even when data from domains with a large gap cannot be easily aligned at the very beginning of the network pipeline. Separate feature extractors make it much easier to align the features extracted from the source and target domains, compared with aligning the raw input data, since separate fine-tuning narrows the domain gap between the two domains in the extracted features. Also, with separate layers, the original characteristics of the source and target domain data can be preserved before being aligned. The trade-off between the discriminative target domain feature $f^t$ and the domain-invariant feature $f^l$ leads to the performance improvement in classification.

5.3 Experiments

5.3.1 Transfer learning benchmarks

To compare the proposed Xnet with state-of-the-art methods, we investigate common transfer learning tasks for which previous results have been reported in published papers. Examples from each dataset are shown in Fig. 5.3.

We test the proposed framework on six traditional digit recognition tasks. The first experiment, MNIST → MNIST-M, deals with the labeled MNIST dataset (the source) and the unlabeled MNIST-M dataset (the target) [31]. MNIST-M contains RGB images with three color channels, while MNIST contains one-channel grey images with a much simpler representation; there is a distinct difference between the two datasets. The second task is SVHN → MNIST. The Street View House Numbers (SVHN) dataset [115] and the MNIST dataset have a much larger distribution gap. The third experiment is Synthetic numbers → SVHN. Compared with SVHN, the Syn. numbers dataset has different positionings, orientations, and backgrounds. The fourth and fifth tasks are MNIST → USPS and USPS → MNIST; both datasets contain white digits on a solid black background. The last experiment is Synthetic Signs → GTSRB. The settings of both datasets are the same as in [31]. The Synthetic Signs dataset consists of 100,000 images, generated from common street signs after various artificial transformations. For the German Traffic Signs Recognition Benchmark (GTSRB), we use 31,367 random training samples for unsupervised adaptation and the rest for evaluation.

Figure 5.3: Visualized examples for the digit recognition tasks.

After verifying the effectiveness of our method on the traditional tasks, we also test it on the proposed Satellite → Aerial scene adaptation. For this task, we collect 9 classes of interest for the transfer learning problem.
They are River, Parking lot, Overpass, Harbor, Forest, Building, Beach, Residential, and Agricultural. The data are mainly collected from the WHU-RS dataset and the UCMerced dataset, as well as data collected by our collaborators. Compared with the data from the aerial view, the satellite-view data has much lower resolution and clarity. The data from both datasets are rescaled to a resolution of 256×256, and in total 3,600 images are included. Examples for each class are shown in Fig. 5.4. As 1,800 images were collected by ourselves, please contact us directly for the data if interested.

Figure 5.4: Visualized examples for the satellite-to-aerial scene adaptation tasks.

5.3.2 Detailed network structure

The detailed CNN architecture can be found on the GitHub page. The novelty of our Xnet structure settings is as follows.

Domain-specific feature extractor: To preserve the discriminative features of different domains, the data from the two domains first pass through separate feature extractors. This part is shown in Fig. 5.2 in yellow and blue.

Classifier splitting: In a traditional adversarial adaptation neural network [31][78], the domain classifier and label predictor are directly connected to the feature extractor at the same middle layer. However, we must further consider at which middle layer each of the two should be placed. In this work, the domain classifier and label predictor are connected to different stages of the middle layers. This is important since the two tasks fit better at different locations of the CNN pipeline. We apply the domain classifier $G_d$ at an earlier stage (marked "b" in Fig. 5.2), before the label predictor $G_y$, to first minimize the domain gap. Compared with $G_d$, $G_y$ is connected to deeper convolutional layers (marked "a" in Fig. 5.2).

Task-specific attentional feature learning: We generate the attentional feature $g$ by calculating the compatibility between the target and common domain features. The former output feature of the network, $f^l$, is now used only to provide "feature weights" for the attentional feature $g$, and it is $g$ that is connected to the softmax layer. This part is highlighted in red in Fig. 5.2.

Important hyperparameters are set as follows. The learning rate is set to 0.01 for all experiments and gradually decays with the number of epochs. The model is trained on mini-batches of size 64. For each batch, half of the samples are randomly drawn from the source domain with known labels, while the other half come from the target domain with unknown labels. This setting avoids biases and ensures that each mini-batch represents all classes sufficiently. The input images are preprocessed by mean subtraction before sampling.

For evaluation, we calculate the overall accuracy and an informative metric defined in Eq. 5.6:

$$\frac{TL - SO}{TO - SO}, \quad (5.6)$$

where $SO$ denotes the source-only accuracy, $TO$ the target-only accuracy, and $TL$ the transfer learning accuracy. $SO$ reflects the width of the domain gap, while $TO$ reflects how hard the classification task is. This metric measures the degree of success in transferring label information from the source domain to the target domain, and is named coverage as in [39].
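As a concrete example of Eq. 5.6, using the MNIST → MNIST-M accuracies reported later in Table 5.2 ($SO = 56.6$, $TO = 93.6$, and $TL = 96.3$ for Xnet+tsa), a quick computation of the coverage:

```python
# Coverage (Eq. 5.6) for MNIST -> MNIST-M, using the accuracies
# reported in Table 5.2: SO = 56.6, TO = 93.6, TL = 96.3 (Xnet+tsa).
def coverage(tl: float, so: float, to: float) -> float:
    return (tl - so) / (to - so)

print(round(coverage(96.3, 56.6, 93.6), 2))  # 1.07, matching Table 5.2
```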
We need to consider both the coverage and the absolute error in the target domain, since a high coverage value might arise from poor performance in the SO or TO setting.

For the hardware, all experiments are carried out on an NVIDIA Titan X (Pascal). Each experiment takes less than 1 hour to converge.

5.3.3 Performance comparisons for digit recognition

Table 5.2 reports our results together with results obtained from previous studies. We make two major contributions, in terms of the Xnet architecture and the introduction of the tsa feature. It is difficult to determine a fair accuracy for different methods in unsupervised transfer learning, where cross-validation is not feasible (since target labels can only be used in the evaluation phase). To make the comparison fair, we directly compare our results with the results reported in previous papers. Because different methods have been reported on different datasets, we use '-' to mark the datasets that have not been tested by a particular method.

The results show that our proposed methods generally perform well across the transfer learning tasks. In particular, our method yields the best performance on five tasks (MNIST → MNIST-M, SVHN → MNIST, Syn. numbers → SVHN, USPS → MNIST, and Syn. Signs → GTSRB). However, we must make clear that we did not compare against several GAN methods (like PixelGAN [9]) that use extra generated target domain training data to boost their performance, as such a comparison would be unfair here. For the task SVHN → MNIST, we need to point out a unique characteristic of the SVHN dataset: it also provides extra training data (531,131 images), and many methods (e.g., UNIT [58]) use these extra data to obtain better results, which makes the results difficult to interpret. In our case, we only use the standard dataset without extra images. For the task MNIST → USPS, compared with methods such as UNIT [58], our method (where no generated samples exist) is less accurate, since we do not rely on generating extra target domain images (which requires additional hyperparameters) to perform classification. Also, UNIT has more convolutional layers, whereas our method has a much simpler structure.

We also want to point out that, for the two tasks MNIST → MNIST-M and Syn. numbers → SVHN, the proposed unsupervised method can yield even better results than the model trained directly on target samples ('target only' results). It is also worth noting that, for the classification task on SVHN, the adapted feature extracted from the Synthetic numbers data performs better than the feature obtained directly on SVHN: the accuracy improves from 92.2% to 93.2%. This improvement suggests that real-world data might be replaced by synthetic data in the future, as synthetic data are much easier to obtain.

It can be noted that there is no obvious performance difference between non-adversarial and adversarial transfer learning methods. Among the non-adversarial methods, TriDA yields good accuracy on all reported tasks, while among the adversarial ones, several GAN-based methods provide satisfying performance.

5.3.4 Comparisons for remote sensing task

The second task is the practical remote sensing task we propose. The source domain is the satellite data with rich labels, while the target domain is the aerial data with scarce labels.
Table 5.3 reports our results together with the results of comparison methods reproduced by ourselves, as there are no previously reported methods on this benchmark dataset we propose. As all parameter settings and dataset settings are identical for all comparison methods, the comparisons are fair. We mainly compare with selected representative adversarial transfer learning methods in this section, as other methods may not be applicable to this task, and some results of existing methods cannot be reproduced. The accuracy for each class of interest is reported.

As can be found in Table 5.3, the proposed method generally achieves better performance on all categories. These results illustrate the extension ability of the proposed method, and also demonstrate the practical use of the proposed Xnet in the remote sensing area. As shown in the table, the accuracy of the proposed method is generally over 90%, while the other methods do not exceed 90%. The tsa feature further improves the accuracy by about 1.5%. The Harbor class is the most difficult to classify.

By comparing the classification results of transfer learning and the source-only case, we can illustrate the effectiveness of transfer learning for our proposed task. The classification accuracy after transfer learning is generally higher than 50%, compared with 46.11% for the source-only case. Also, we note that the average accuracy of the proposed method is even higher than in the target-only case. We plan to test on larger datasets (with more samples and more classes) to investigate more challenging tasks in the future.

5.4 Conclusion

In this chapter, we propose a novel unsupervised deep transfer learning model to address the "how to transfer" problem. We name this model Xnet given its "X" shape. Since adversarial learning is employed between two pairs of domains, the source domain and the target domain as well as the target domain and the common domain, we also refer to the proposed method as dual domain adversarial adaptation.

For future work, we plan to investigate the potential of the Xnet structure in more transfer learning tasks and to enlarge our proposed satellite-to-aerial scene dataset. The separate feature extractors can have entirely different structures for different domains, in order to overcome potential problems caused by large domain differences. We will further explore the potential of such separate feature extractors.

                 MNIST→    SVHN→    Syn.numbers→  USPS→    MNIST→   Syn.Signs→
Method           MNIST-M   MNIST    SVHN          MNIST    USPS     GTSRB
MMD [96]         57.7      63.1     85.2          -        -        86.9
DeepCORAL [87]   57.7      63.1     85.2          -        -        86.9
TriDA [78]       94.2      86.2     93.1          -        -        96.2
DANN [30]        75.4      70.8     91.1          73.0     77.1     88.6
DSN [7]          83.2      82.7     91.2          -        -        93.1
UNIT [58]        -         90.5     -             93.5     95.9     -
ADDA [97]        78.8      76.0     -             93.8     92.4     -
Self. [28]       -         93.3     -             92.4     88.1     -
I2IAdapt [71]    -         80.3     -             87.2     92.1     -
G.Adapt [82]     -         92.4     -             90.8     95.3     -
Impr.ADDA [1]    93.0      92.7     -             94.8     91.0     -
DIFA [75]        -         89.7     93.0          89.7     96.2     -
ZDDA [45]        94.8      -        -             -        -        -
Source Only      56.6      54.9     86.7          59.7     75.4     79.0
Target Only      93.6      99.4     92.2          97.5     99.9     98.2
Xnet             94.8      91.7     91.2          94.4     93.3     93.4
                 (1.03)    (0.83)   (0.82)        (0.89)   (0.73)   (0.75)
Xnet+tsa         96.3      93.3     93.2          96.9     93.8     96.4
                 (1.07)    (0.86)   (1.18)        (0.98)   (0.75)   (0.91)

Table 5.2: Comparisons of different methods on digit-dataset transfer learning. The coverage is shown in brackets. Source Only and Target Only refer to training only on the respective dataset (supervisedly, without transfer learning) and evaluating on the target dataset.
The first four comparison methods belong to non-adversarial transfer learning, and the rest belong to adversarial learning. The best performance is highlighted in bold. The results of Xnet with a single adversarial learning stage support the effectiveness of the proposed Xnet structure, while the comparison with Xnet without the tsa feature further illustrates the effectiveness of tsa feature learning.

Table 5.3: Accuracy (%) of DTL for the satellite-to-aerial scene adaptation task

Method           River   Parking lot  Overpass  Harbor  Forest
Source only      70.0    55.5         38.0      14.0    37.5
Target only      100.00  92.0         100.00    58.5    96.0
DANN [31]        88.5    65.5         42.0      20.0    55.5
DSN [7]          90.5    82.0         80.5      51.5    69.0
PADA [10]        96.0    90.0         91.5      48.0    95.0
MCD [80]         100.00  88.5         93.5      47.5    96.5
ours (Xnet)      100.00  92.5         100.00    54.0    97.0
                 (1.0)   (1.01)       (1.0)     (0.90)  (1.02)
ours (Xnet+tsa)  100.00  93.5         100.00    63.0    98.0
                 (1.0)   (1.03)       (1.0)     (1.10)  (1.03)

Method           Building  Beach   Residential  Agricultural  Average
Source only      19.0      57.0    10.5         53.0          39.4
Target only      95.5      100.00  94.5         99.0          92.8
DANN [31]        32.0      90.0    16.0         52.0          51.3
DSN [7]          72.5      90.5    96.0         61.0          77.0
PADA [10]        82.0      100.00  75.5         77.5          84.0
MCD [80]         85.0      100.00  96.5         97.5          89.5
ours (Xnet)      93.5      100.00  97.5         95.5          92.2
                 (0.97)    (1.0)   (1.04)       (0.92)        (0.99)
ours (Xnet+tsa)  94.0      100.00  96.5         98.5          93.7
                 (0.98)    (1.0)   (1.02)       (0.99)        (1.02)

Chapter 6

How to Transfer: Dual Adversarial Network (DuAN)

In this chapter, an overview of the proposed Dual Adversarial Network (DuAN) is given to present a comprehensive picture. By proposing a novel task, "Ground/Satellite-to-aerial scene transfer", this chapter finds a way to deal with the "how to transfer" problem for a task with a large domain gap. Through this task, we can address the annotation-scarcity problem for aerial images by exploiting the large amount of prior knowledge in regular ground-view RGB images.

Recent advances in deep learning not only bring impressive performance to image processing, but also aggravate the burden of image data annotation. To train a reliable deep neural network, a great number of annotated images with labels is required. This annotation concern is severe for remote sensing images, especially aerial images. Nowadays, with much easier access to this type of image, the annotation of newly collected remote sensing images has become a major problem, as human labor for annotation is expensive and limited prior knowledge exists for remote sensing data.

Transfer learning might solve this problem in a straightforward manner: through transfer learning, the label-scarce remote sensing data (the target domain) can borrow information directly from the label-rich regular RGB image data (the source domain). As data from these two domains are hard to align, effective adaptation is challenging. The task is even more challenging when the target remote sensing samples are totally unlabeled. In this work, we propose a novel unsupervised transfer learning (UTL) method to tackle this challenge.

A popular research direction of UTL is based on adversarial learning, which aligns data with different distributions in an adversarial manner: a feature generator is trained to generate domain-invariant features for both source and target domain samples, in order to fool a domain discriminator that is trained to discriminate the domain labels of the features generated by the generator [8][68].

However, there are two potential limitations of the above adversarial-learning-based UDA. First, this approach might not be task-specific.
The adapted target domain data will lose its discriminative data distribution, which is essential for its classification [51][67][48]. The generated aligned feature vectors of the target data might not perform well with task-specific classifiers. Second, the source and target domain data are treated in the same way during the adaptation process. To be more specific, raw data from two different domains pass through a single shared feature generator and then a task-specific classifier. Such a process may not be preferred, as the data from the two domains serve different purposes: the target domain data needs to serve task-specific classifiers, whereas the source domain data is supplementary. The objective for the source domain data is mainly related to feature adaptation, not to the classification task. To make the two domains function well for their respective objectives, we propose the dual adversarial network.

In this work, we assign the two domains domain-specific tasks. The source domain mainly serves feature adaptation, whereas the target domain is task-specific. To achieve the task-specific goal with unlabeled target domain data, we introduce two individual classifiers, which classify source samples correctly, to simultaneously provide inconsistent classification results for the target domain data. The model loss generated by this inconsistency is used to optimize the target domain feature generator. Dual adversarial learning is proposed to complete these domain-specific tasks.

The proposed dual adversarial learning method includes four players: two task-specific classifiers, the source feature generator, the target feature generator, and the domain discriminator. In the first adversarial learning phase, the source domain feature generator generates features that mimic the target domain features, which are fixed in this phase, to fool the domain discriminator. In the second adversarial learning phase, the task-specific classifiers, whose weights are initialized by the source domain features generated in the first phase, yield inconsistent classification results to fool the target domain feature generator, making it mistakenly believe that the two classifiers are for different tasks. This feature generator thus acts more like a "task discriminator": it only accepts that the two classifiers are for the same task when they provide the same classification results. These two phases iterate until the domain discriminator is fooled while the target feature generator is not. Compared with traditional adversarial transfer learning, our source domain feature generator only needs to generate features for feature adaptation, so the generated features are better aligned and adapted; the target domain feature generator, which does not participate in adaptation directly but only plays the adversarial game with the classifiers, can generate much more discriminative features.

6.1 Contribution summary

Major contributions of DuAN can be summarized as follows:

• We propose separate feature generators to serve domain-specific purposes (i.e., feature adaptation and the classification task). The generated target domain features can better preserve the discriminative target domain data distribution.

• We propose the Dual Adversarial Network (DuAN). The network is trained in a stepwise manner.
Four "players" play two adversarial games in DuAN, one for feature adaptation and the other for the classification task.

• We investigate a novel, challenging satellite/ground-to-aerial scene adaptation task (GSSA). This task not only explores the effectiveness of transfer learning for remote sensing data (satellite-to-aerial), but also aims to solve the label-scarcity problem for the aerial scene (ground-to-aerial). Examples of data for GSSA are shown in Fig. 1.3.

6.2 Method

In this section, an overview of the proposed Dual Adversarial Network (DuAN) is given to present a comprehensive picture. Afterward, the model initialization and training are described respectively.

6.2.1 Overview

As illustrated in Fig. 6.2, five components exist in our framework: the domain discriminator $D_1$, the source feature generator $G_1$, the target feature generator $G_2$, the classifier $C_1$, and the classifier $C_2$. The general process is separated into two parts: model initialization and parameter learning. The feature generators $G_1$ and $G_2$ and the domain discriminator $D_1$ are initialized by adversarial learning, while the classifiers $C_1$ and $C_2$ are initialized by classification on the source domain features. The parameters of every component are then learned in a stepwise manner. First, $G_2$, acting as the "task discriminator", is optimized based on the classification discrepancy between $C_1$ and $C_2$, and the output feature of $G_2$ is updated. Second, the parameters of $G_1$ and $D_1$ are optimized by the feature discrepancy between the newly generated $G_2$ feature and the former $G_1$ feature, and the new $G_1$ feature is generated. Third, $C_1$ and $C_2$ are optimized by the cross-entropy loss based on the $G_1$ feature. The updated $C_1$ and $C_2$ then return to step one to update $G_2$. The three steps iterate until convergence. In this process, $G_2$ is fully task-specific, whereas the major task of $G_1$ is to generate source domain features that mimic the target domain features. These three steps are illustrated in Fig. 6.2.

The inputs of the general framework are formulated as follows. Let $X_s = \{x_s^i, y_s^i\}_{i=0}^{N_s}$ represent the labeled source domain data, and let $X_t = \{x_t^i\}_{i=0}^{N_t}$ represent the unlabeled target domain data, where $N_s$ and $N_t$ are the numbers of samples in the two domains, respectively. The source domain feature set $F_s = \{f_s^i, y_s^i\}_{i=0}^{N_s}$ with known labels $y_s$ is first generated by $f_s = G_1\{x_s; \theta_{G_1}\}$, where $\theta_{G_1}$ denotes the parameters of $G_1$. The target domain feature set is generated by $f_t = G_2\{x_t; \theta_{G_2}\}$, where $G_2$ is the target feature generator and $\theta_{G_2}$ denotes its parameters.

6.2.2 Model initialization

The model is first initialized conventionally. The source and target domain features are the inputs to the domain discriminator, which is represented as $D_1\{f_s, f_t; \theta_{D_1}\}$. The two generators try to fool $D_1$, while $D_1$ is maximized to classify the features' domain labels. At the same time, the two classifiers assign labels to the source domain features based on the regular cross-entropy loss. These two classifiers are formulated as $C_1\{f_s; \theta_{C_1}\}$ and $C_2\{f_s; \theta_{C_2}\}$. Our first min-max objective is

$$\min_{\theta_{C_1},\theta_{C_2}} \max_{\theta_{G_1},\theta_{G_2},\theta_{D_1}} \alpha_1 \mathcal{L}_{d_1}(D_1,G_1,G_2) + \beta_1 \mathcal{L}_{t_1}(G_1,C_1,C_2), \quad (6.1)$$

where $\alpha_1$ and $\beta_1$ are weights for the two losses, and we define $\mathcal{L}_{d_1}$ and $\mathcal{L}_{t_1}$ as

$$\mathcal{L}_{d_1}(D_1,G_1,G_2) = \mathbb{E}_{x_t}\left[\log D_1(G_2(x_t;\theta_{G_2});\theta_{D_1})\right] + \mathbb{E}_{f_s}\left[\log(1 - D_1(f_s;\theta_{D_1}))\right], \quad (6.2)$$

$$\mathcal{L}_{t_1}(C_1,C_2,G_1) = \mathbb{E}_{f_s,y_s,z}\left[-y_s^{\mathsf T} \log C_1(f_s;\theta_{C_1})\right] + \mathbb{E}_{f_s,y_s,z}\left[-y_s^{\mathsf T} \log C_2(f_s;\theta_{C_2})\right], \quad (6.3)$$

where $y_s$ denotes the one-hot encoding of the source domain labels. In both equations, $f_s = G_1(x_s, z; \theta_{G_1})$ as defined earlier.
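As a small illustration of Eqs. (6.2) and (6.3), the sketch below writes the two initialization losses in PyTorch; the function names and the assumption that $D_1$ outputs a probability (and $C_1$/$C_2$ output logits) are ours, not the released code.

```python
# A sketch of the initialization losses in Eqs. (6.2) and (6.3).
# Assumptions: D1 ends with a sigmoid (probability that a feature comes
# from the target domain); C1/C2 output class logits.
import torch
import torch.nn.functional as F

def L_d1(D1, fs, ft, eps=1e-7):
    """Domain adversarial loss (Eq. 6.2)."""
    return (torch.log(D1(ft) + eps).mean() +
            torch.log(1.0 - D1(fs) + eps).mean())

def L_t1(C1, C2, fs, ys):
    """Source classification loss (Eq. 6.3): cross-entropy for both classifiers."""
    return F.cross_entropy(C1(fs), ys) + F.cross_entropy(C2(fs), ys)
```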
In our implementation, for both $G_1$ and $G_2$, we use ResNet to extract the features, and $D_1$, $C_1$, and $C_2$ are regular ResNet classifiers. For the above min-max objective, we solve the problem by alternately updating $\theta_{G_1}, \theta_{G_2}$ (freezing $\theta_{D_1}, \theta_{C_1}, \theta_{C_2}$) and $\theta_{D_1}, \theta_{C_1}, \theta_{C_2}$ (freezing $\theta_{G_1}, \theta_{G_2}$). We can initialize all parameters of the proposed model in this way.

6.2.3 Model training

After the initialization of the model parameters, we obtain differing classification results from $C_1$ and $C_2$. The subsequent model training is divided into three steps.

Step 1 and the classifier discrepancy loss: In this step, we use the discrepancy loss to train the target feature generator $G_2$, while the other components are frozen. The two classifiers try to fool $G_2$ with inconsistent classification results, whereas $G_2$ tries to generate features that make the two classifiers look the same, to avoid being fooled. Here we introduce $D_2$ to identify the difference between the results of the two classifiers. $D_2$ is only an identifier with no parameters; therefore it is not a component of the DuAN model. The objective of this step is to minimize the discrepancy loss defined in Eq. 6.4:

$$\mathcal{L}_{d_2}(D_2,C_1,C_2) = D_2\big(C_1(f_t;\theta_{C_1}),\, C_2(f_t;\theta_{C_2})\big). \quad (6.4)$$

Here $\mathcal{L}_{d_2}$ is the discrepancy loss between the two classifiers. The only variable in this step is $\theta_{G_2}$. Unlike $D_1$, which is defined by a neural network, $D_2$ is just an identifier, defined as

$$D_2(x,y) = \frac{1}{N}\sum_{n=1}^{N} |x_n - y_n|, \quad (6.5)$$

where $N$ is the total number of elements of $x$ and $y$ ($x$ and $y$ have the same number of elements). We use the L1 norm to calculate the difference between the two inputs.

Step 2 and the feature adversarial loss: In this step, we train the feature generator $G_1$ and the domain discriminator $D_1$ in an adversarial manner, with all other components frozen. Different from traditional UDA, only the features from the feature generator $G_1$ are updated, to appear as if generated by $G_2$, in order to fool $D_1$, which tries its best to discriminate the features of the two domains. The objective of this step is to minimize the discrepancy between the source and target domain features through $D_1$, formulated in Eq. 6.6 as

$$\min_{\theta_{G_1},\theta_{D_1}} \mathcal{L}_{d_1}(D_1,G_1,G_2), \quad (6.6)$$

where $\mathcal{L}_{d_1}$ is the feature adversarial loss defined in Eq. 6.2. This loss optimizes the network parameters in a gradient reversal layer (GRL) [31] manner, as a higher loss means worse adaptation performance. The variables to be optimized in this step are $\theta_{G_1}$ and $\theta_{D_1}$. After this step, the feature output of $G_1$ is updated and will be used to optimize the classifiers. However, as $G_2$ is not involved in this step, the generated target domain feature remains related only to the classification task.

Step 3 and the cross-entropy loss: In this step, we train $C_1$ and $C_2$ with the other components frozen. This step has two objectives: the first is to make the two classifiers as dissimilar as possible, for the adversarial purpose of Step 1; the second is to maximize the classification accuracy of both classifiers on features from $G_1$ by minimizing cross-entropy losses, which is a task-specific objective. To jointly consider these two objectives, the objective function is defined as

$$\max_{\theta_{C_1},\theta_{C_2}} \alpha_2 \mathcal{L}_{d_2}(D_2,C_1,C_2) + \beta_2 \mathcal{L}_{t_2}(G_1,C_1,C_2), \quad (6.7)$$

where $\alpha_2$ and $\beta_2$ are weights for the two losses, $\mathcal{L}_{t_2}$ is defined the same as $\mathcal{L}_{t_1}$ in Eq. 6.3, and $\mathcal{L}_{d_2}$ is defined as in Eq. 6.4. For both $C_1$ and $C_2$, the inputs are the features from $G_1$ and $G_2$.

Dual adversarial network training: For the model training, we have two adversarial objectives. The first is between Step 1 and Step 3, and the second is within Step 2.
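To summarize, the sketch below combines the three steps above into one training iteration. It is a sketch under our own assumptions: the alternating GAN-style update in Step 2 is an equivalent reformulation of the GRL-based objective in Eq. 6.6, the $\alpha_2/\beta_2$ weights are omitted, and all module and optimizer names are illustrative.

```python
# One DuAN training iteration (Steps 1-3), under assumed modules:
# G1/G2 feature generators, D1 a sigmoid-output domain discriminator,
# C1/C2 logit-output classifiers; opt_* are their respective optimizers.
import torch
import torch.nn.functional as F

def discrepancy(p1, p2):
    # D2 (Eq. 6.5): mean absolute difference between classifier outputs.
    return (p1 - p2).abs().mean()

def train_step(xs, ys, xt, G1, G2, D1, C1, C2, opt_G1, opt_G2, opt_D1, opt_C):
    # Step 1: update G2 only, minimizing the classifier discrepancy (Eq. 6.4).
    opt_G2.zero_grad()
    ft = G2(xt)
    l1 = discrepancy(F.softmax(C1(ft), 1), F.softmax(C2(ft), 1))
    l1.backward()
    opt_G2.step()

    # Step 2: adversarial update of D1 and G1 (Eq. 6.6). The paper uses a
    # GRL; here the equivalent alternating form: D1 learns to separate the
    # domains, then G1 updates so that f_s mimics the (frozen) G2 features.
    fs, ft = G1(xs), G2(xt).detach()
    d_loss = F.binary_cross_entropy(D1(fs.detach()), torch.zeros(len(xs), 1)) + \
             F.binary_cross_entropy(D1(ft), torch.ones(len(xt), 1))
    opt_D1.zero_grad(); d_loss.backward(); opt_D1.step()
    g_loss = F.binary_cross_entropy(D1(fs), torch.ones(len(xs), 1))  # fool D1
    opt_G1.zero_grad(); g_loss.backward(); opt_G1.step()

    # Step 3: update C1/C2 (Eq. 6.7, alpha/beta weights omitted): minimize
    # cross-entropy on source features while maximizing the discrepancy on
    # target features, so Step 1 has an adversary to play against.
    opt_C.zero_grad()
    fs, ft = G1(xs).detach(), G2(xt).detach()
    ce = F.cross_entropy(C1(fs), ys) + F.cross_entropy(C2(fs), ys)
    l3 = ce - discrepancy(F.softmax(C1(ft), 1), F.softmax(C2(ft), 1))
    l3.backward()
    opt_C.step()
```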
The three steps iterate not only until the classification results on $G_1$ have converged, but also until: 1. $D_1$ is fooled by $G_1$ and cannot discriminate which domain the data come from; 2. $G_2$ is no longer fooled by $C_1$ and $C_2$, and recognizes that the two classifiers are for the same task. We name this proposed neural-network-based process the Dual Adversarial Network, as shown in Fig. 6.2.

6.3 Experiments

In the experimental part, we conduct experiments on three tasks. The first is the traditional digit recognition task; the second is transfer learning between two types of remote sensing scenes (namely the satellite scene and the aerial scene), in order to explore the relationship between them; the third is the ground-to-aerial scene adaptation task, which is the most challenging. Below we first describe the datasets.

Method       MNIST→  SVHN→   Syn.num→  MNIST→  USPS→
             M-M     MNIST   SVHN      USPS    MNIST
Source       56.6    54.9    86.7      59.7    75.4
Target       93.6    99.4    92.2      97.5    99.9
DANN [31]    75.4    70.8    91.1      73.0    77.1
DSN [8]      83.2    82.7    91.2      -       -
UNIT [58]    -       90.5    -         93.5    95.9*
ADDA [97]    78.8    76.0    -         93.8    92.4
Ensem. [28]  -       93.3    -         92.4    88.1
I2I [71]     -       80.3    -         87.2    92.1
GenTA [82]   -       92.4    -         90.8    95.3
I.ADDA [1]   93.0    92.7    -         94.8    91.0
ZDDA [45]    94.8    -       -         -       -
MCD [80]     -       96.2    -         94.1    94.2
DuAN         95.6    97.1    91.8      94.9    95.3

Table 6.1: Performance comparisons of different deep adversarial transfer learning methods. Source and Target refer to training only on the source/target dataset.

6.3.1 Datasets

Digit recognition task: We first investigate five traditional digit benchmark tasks to evaluate our model. The first experiment deals with the labeled source MNIST dataset, containing one-channel grey images, and the unlabeled target MNIST-M (M-M) dataset with three (RGB) color channels [31]. The second experiment introduces the Street View House Numbers (SVHN) dataset [115][78], which has a large distribution gap with the MNIST dataset; SVHN contains house number signs collected by Google Street View. The third experiment introduces Synthetic numbers as the source domain data: the Synthetic numbers dataset (Syn. num) [31], with 500,000 images, simulates the SVHN dataset. Compared with SVHN, the Syn. num dataset has different positionings, orientations, and backgrounds. The fourth experiment is between MNIST and USPS, the two most famous one-channel digit image datasets. These experiments are evaluated in one direction only, as the reversed transfer learning would result in high accuracy with no challenge. Examples for this task can be found in Fig. 6.3.

Satellite-to-aerial scene adaptation: For this task, we collect 9 classes for transfer learning, including River, Parking lot, Overpass, Harbor, Forest, Building, Beach, Residential, and Agricultural. The datasets are mainly collected from the WHU-RS dataset, the UCMerced dataset, as well as data collected by ourselves online and through our collaborators. The data from the satellite view has much lower resolution and clarity compared with the data from the aerial view. The data are re-scaled to a resolution of 256×256. There are 53 images/class for the source domain and 100 images/class for the target domain, with 1,377 images in total. A visualized comparison of these two types of remote sensing data is shown on the left of Fig. 6.4.

Ground-to-aerial scene adaptation: For this task, we include 15 classes, as shown in Fig. 6.4. Each image is re-scaled to a resolution of 256×256.
Each class has 5,800 images (5,000 from the source domain and 800 from the target domain), and the datasets contain 87,000 images in total. We randomly choose 25,000 images from the source domain for training, and use the trained model to test on the validation data, which is 5% of the target domain data. For this task, the data from the ground view has a huge distribution gap compared with the data from the aerial view, as can be noted in the examples; this task is highly challenging. Moreover, the similarity between classes in the same view also makes the task difficult. For example, the features of the parking lot are similar to those of the harbor, and the runway looks similar to the bridge from the aerial view. Data examples for this task are shown on the right of Fig. 6.4.

6.3.2 Digit recognition

Setup

In this experiment, we evaluated the performance of the proposed DuAN model on five digit recognition tasks. For network training, we used the Adam optimizer with learning rate $2\times10^{-4}$ and no decay in all digit recognition experiments. The batch size is set to 64. For Eq. 6.1 and Eq. 6.7, $\alpha/\beta = 0.1$. This parameter setting is suitable for all scenarios. We select LeNet as the base network for this task. All comparison methods are trained until convergence (most are trained for 30 epochs). For the hardware, the CPU is an Intel Core i7-8700K, and the GPU is an NVIDIA GeForce GTX 1080 Ti. This hardware is also used for the satellite-to-aerial scene adaptation task.

Results

Table 6.1 reports our results together with the results obtained from previous studies. We directly compare our results with the results reported in previous papers to make the comparison fair. We use '-' to mark the datasets that have not been tested by a specific method, as different methods have been reported on different datasets.

The results show that our proposed method generally performs best across the transfer learning tasks. However, for the task MNIST → USPS, the method UNIT [58] performs a little better than ours. This difference arises because UNIT generates additional target domain images for classification; we therefore mark its result with '*'. It is also worth noting that the proposed unsupervised method can yield even better results than the model directly trained on target samples for the MNIST → MNIST-M task. The comparison with the task-specific MCD method verifies the effectiveness of the proposed dual adversarial process.

6.3.3 Satellite-to-aerial scene adaptation

Setup

For this task, we adopt ResNet-101 as our base network. We implemented every comparison method ourselves, including the first adversarial transfer learning work DANN [31], the recent state-of-the-art PADA [10] based on DANN, as well as two state-of-the-art task-specific methods [80] and [50]. The work in [50] is generally a modification of [80], but on our task it works no better than [80]; therefore, we choose DANN and MCD as our major comparison methods. We provide not only a detailed accuracy comparison for each method but also a visualized t-SNE comparison of the target data before (source only) and after adaptation by our method, as shown in Fig. 6.5. All methods have been trained for 100 epochs, as the testing accuracy of every method converges by this epoch number. The model trained at every ten epochs is tested directly on the target domain data without validation, as the dataset is not large, and the best performance is reported for comparison.

Results

As can be found in Table 6.2, the proposed DuAN achieves the best overall accuracy, followed by MCD, PADA, and SWD.
The accuracies of DANN and source-only are both around 30%. Also, we find that the Building class is the most difficult to classify, as it is easily confused with the Residential class. By comparing the accuracy with the source-only method, we find that the two types of remote sensing scenes can be aligned by transfer learning, which proves that information can be shared and exchanged between different types of remote sensing images. This can also be concluded from the t-SNE comparison in Fig. 6.5: although the target samples do not separate well in the non-adapted situation, they separate clearly after adaptation. This conclusion demonstrates the significance of the proposed satellite-to-aerial adaptation task, as information transfer between these two types of images can help with their classification.

6.3.4 Ground-to-aerial scene adaptation

Setup

For this task, due to the large number of training images, we run the experiments on our server. For the hardware, the CPU is an AMD Ryzen Threadripper 2990WX (2nd generation), and the GPUs are two NVIDIA RTX TITANs, with 128 GB of memory. We use ResNet-101 as the base network. We show the detailed accuracy comparison in Table 6.3. The number of training epochs is always set to 30, as all settings converge by this epoch number. For all comparison methods and the proposed method, as the target domain dataset is large, we use 5% randomly selected target domain data for validation and the rest for testing. The trained model parameters with the lowest validation loss are used for testing. We also provide a detailed visualized t-SNE comparison in Fig. 6.5.

Results

As noted in Table 6.3, the overall accuracy (OA) of the proposed method is 53.36%, much higher than that of the other methods, which are all below 50%. In the table, basketball court, baseball field, water park, parking lot, and parking space are abbreviated as basketball., baseball., water., parking.L, and parking.S, respectively. We want to offer two observations on this result. First, the baseball field class is indoor for the ground scene but outdoor for the aerial scene; therefore, these two classes have a larger domain gap than the other classes. All methods obtain their lowest classification accuracy on this class, although our method performs best. Second, the source domain data may be more discriminative than the target domain data for some classes. Representative classes are swimming pool and basketball field, for which aerial-view data are easily mistaken for water park and golf field. On such classes, the loss of the discriminative distribution of the target domain data during adaptation might even lead to better performance; this observation can explain our lower accuracy on these classes compared with other methods. For the other classes, our method almost always achieves the best performance. The t-SNE comparison between the adapted result and the source-only result proves the effectiveness of transfer learning.

Model training observations

We take the ground-to-aerial adaptation task as an example to demonstrate the advantage of the proposed method in terms of model training. Fig. 6.6 shows the changes in classification accuracy on the validation data at different epochs. The best accuracy of the proposed DuAN on the validation data is 53.36%, which appears at the 3rd epoch, while for MCD the best accuracy is 49.82%, which appears at the 10th epoch. We use the model parameters trained at these epochs for testing DuAN and MCD.
We want to mention two observations. First, from the perspective of classification convergence, due to our stepwise model training, the classification result of the proposed DuAN stops changing at the 5th epoch, while the result of MCD takes much longer to converge. Moreover, at the first epoch, DuAN already yields 44% accuracy. We need to point out that for almost all UDA methods, the accuracy is highest at the second or third epoch and then drops slightly; the same tendency is observed for MCD and DuAN. Second, from the perspective of adaptation convergence, the discrepancy between $C_1$ and $C_2$ in DuAN decreases much faster than in MCD. Since in DuAN each domain is assigned a specific task, the classifiers reach consistent results much faster than in MCD. This suggests that the adaptation process converges much faster in a stepwise manner, and we can obtain uniform task-specific classification results in a much shorter time.

Table 6.2: Accuracy (%) results for the satellite-to-aerial scene adaptation task with ResNet-101 as the base network.

Method     River  Parking lot  Overpass  Harbor  Forest  Building  Beach  Residential  Agricultural  Average
Source     53     0            0         4       44      14        0      22           52            21
DANN [31]  71     0            0         77      84      24        2      41           16            35
PADA [10]  75     94           83        84      50      21        83     80           69            71
SWD [50]   90     100          53        92      59      23        96     80           74            74
MCD [80]   92     100          62        100     58      22        99     83           77            77
DuAN       93     94           87        98      57      46        98     72           93            82

6.4 Conclusion

In this chapter, we propose a novel adversarial transfer learning model, named the Dual Adversarial Network (DuAN), motivated by the idea that the source and target domain data should not be treated in the same way in transfer learning. Different from previous methods, we propose a domain-specific strategy for the feature adaptation and classification tasks, in order to mitigate the loss of discriminative characteristics of the target domain data during the adaptation process. The model is optimized in a stepwise manner. We also propose a novel "Ground/Satellite-to-Aerial Scene Adaptation" task. This adaptation task addresses a highly challenging and practical scenario with a larger domain gap than traditional transfer learning tasks. Also, such an adaptation can help tackle the automatic annotation problem for remote sensing data.
The superior experimental results on both the traditional digit recognition task and the GSSA task prove the effectiveness of the proposed method.

Table 6.3: Accuracy (%) results for the ground-to-aerial scene adaptation task with ResNet-101 as the base network.

Method     Airplane  Baseball.  Basketball.  Beach   Bridge  Crosswalk  Forest  Golf
Source     0.25      0.38       6.62         0.00    27.25   49.88      70.88   0.00
DANN       35.38     1.00       37.50        0.00    0.25    49.25      0.00    5.62
PADA [10]  39.43     0.26       25.03        66.89   52.46   43.24      21.58   46.37
SWD [50]   50.04     0.27       5.03         80.72   77.82   0.00       94.67   82.12
MCD [80]   71.38     0.38       0.38         100.0   91.38   0.00       100.0   99.62
DuAN       93.25     1.62       7.12         99.38   83.00   42.25      99.88   98.75

Table 6.4: Accuracy (%) results for the ground-to-aerial scene adaptation task with ResNet-101 as the base network (continued).

Method     Harbor  Parking.L  Parking.S  Residential  Runway  Swimming  Water.  Average
Source     1.50    0.25       0.12       0.00         0.00    17.38     2.00    11.77
DANN       0.00    1.00       0.25       0.12         0.50    41.38     0.12    11.49
PADA [10]  3.43    28.34      21.57      13.44        2.94    4.66      1.25    24.73
SWD [50]   3.31    29.53      46.46      61.04        40.53   15.30     1.57    39.09
MCD [80]   0.75    45.12      44.50      83.50        40.62   71.27     1.75    45.73
DuAN       8.00    99.75      33.75      84.12        25.62   9.88      2.38    53.16

6.5 Conclusion for "How to Transfer" (Chapters 4 to 6)

In Chapters 4-6, we propose three novel transfer learning models, i.e., the Dual Space Transfer Learning (DSTL), Xnet, and the Dual Adversarial Network (DuAN), to better serve three specific tasks (i.e., transfer learning for hyperspectral images, transfer learning for digit recognition, and satellite/ground-to-aerial transfer learning).

Regarding the DSTL work, the challenge is to address the time- and labor-consuming HSI labeling concern, and the main idea is to transfer the knowledge of HSIs in the source domain to the target domain, to perform classification with no prior information. The proposed method consists of two major parts. The first is to transfer the data of both domains to a specific subspace, on which we can obtain initial classification results for the target HSIs by exploiting the data structure. The second is to optimize the initial results on the original target data space based on its structure, by applying the Markov Random Field (MRF) approach. As an unsupervised HSI classification method, the proposed DSTL is robust and effective, as supported by the experimental results.

Regarding the Xnet work, we propose a novel unsupervised deep transfer learning model. We name this model Xnet because of its "X" shape. Since adversarial learning is employed between two pairs of domains, the source domain and the target domain as well as the target domain and the common domain, we also refer to the proposed method as dual domain adversarial adaptation.

The DuAN work is motivated by the idea that the source and target domain data should serve separate purposes. Different from previous methods, we propose introducing a domain-specific strategy for feature adaptation and classification, in order to avoid the potential loss of discriminative characteristics of the target domain data in classical adaptation. The model is optimized in a stepwise manner. We also propose a novel "Ground/Satellite-to-Aerial Scene Adaptation" task. This adaptation task addresses a highly challenging, practical scenario with a larger domain gap than traditional transfer learning tasks. Also, the proposed adaptation can help solve the automatic annotation problem of remote sensing data.
The superior experimental results on the GSSA adaptation tasks support the effectiveness of the proposed DuAN method.

Figure 6.1: (Best viewed in color.) Illustration of the mechanism comparison between the classical adaptation approach and the proposed DuAN. (a) Classical domain adversarial network: the classifier cannot classify the target domain data well even though the two domains are aligned well, as such methods may fail to consider task-specific classifiers during adaptation. (b) Dual adversarial network: two individual task-specific classifiers, first trained on the source domain data, provide inconsistent classification results for the target domain data. Such discrepancy is minimized in an iterative way: 1. the source data feature mimics the target data feature; 2. the classifiers are updated based on the new source data distribution and provide new discrepancy; 3. the target data feature is updated to minimize this discrepancy. The target data finally becomes suitable for various task-specific classifiers.

Figure 6.2: The flowchart of the proposed DuAN. Two adversarial processes exist: one for the feature adaptation is realized by the source flow (orange), and the other for the classification task is realized by the target flow (purple). Flow here means the forward and backward propagation in the neural network. Steps 1-3 refer to the three iterative training steps; the components in the corresponding step are updated iteratively. "Ini" is the abbreviation for model initialization.

Figure 6.3: Examples from the traditional digit recognition datasets.

Figure 6.4: Left: Examples from the proposed satellite-to-aerial transfer learning datasets with 9 categories. Right: Examples from the proposed ground-to-aerial transfer learning datasets with 15 categories (except for classes in Fig. 1.3).

Figure 6.5: (a)-(b) t-SNE [69] visualization results of transfer learning methods for the satellite-to-aerial scene adaptation. (c)-(d) t-SNE [69] visualization results of transfer learning methods for the ground-to-aerial scene adaptation. After applying our adaptation method, the target samples are more discriminative.

Figure 6.6: The classification accuracies on validation data (ground-to-aerial average accuracy against training epochs, for the C1 and C2 classifiers of DuAN and MCD).

Chapter 7

Conclusion and Future Work

7.1 Conclusion

In this chapter, I first summarize the major contributions of my Ph.D. research. From the perspective of machine learning models, I develop transfer learning frameworks to address three key problems, i.e., what to transfer, where to transfer, and how to transfer. From the perspective of computer vision applications, the works proposed in this thesis focus on three applications, i.e., transfer learning between remote sensing images, transfer learning between regular RGB images, and transfer learning between remote sensing images and regular RGB images. The different proposed machine learning models are verified on different applications.

For the "what to transfer" problem, the main task is to find the best content to transfer.
For this task, we propose a novel method, referred to as the Deep mapping based heterogeneous Transfer learning model via querying Salient Examples (DTSE), for applications of transfer learning between hyperspectral images (HSIs, one type of remote sensing image). We verify the effectiveness of transfer learning by studying the HSI classification task. By transferring prior knowledge from the annotated source domain images to the unknown target domain, we observe that the classification accuracy of HSIs on the target domain is significantly improved.

For the "where to transfer" problem, the main task is to find the correspondence between different layers of the deep neural networks from the source domain and the target domain. For this task, we propose a novel method, referred to as Deep Transfer Learning by Exploring where to Transfer (DT-LET), for transfer learning between digit images (one type of regular RGB image). We verify the effectiveness of the proposed model by studying the recognition/classification task of handwritten digits. By transferring prior knowledge from annotated digit images to unknown digit datasets, we achieve an improvement in the recognition/classification accuracy of the target domain images.

For the "how to transfer" problem, the main task is to define different models to boost the performance of transfer learning. For this task, we propose three models to tackle three different computer vision tasks and to explore the correlations between different types of images.

The first proposed model is the DUal Space unsupervised structure preserving Transfer Learning (DSTL) model, which serves the first "how to transfer" task. This task is to explore the correlation between different remote sensing images, specifically HSIs, and to determine whether the information borrowed from one hyperspectral image dataset can benefit the classification of another hyperspectral image dataset. To address the time- and labor-consuming HSI labeling concern, our main idea is to transfer the knowledge of HSIs in the source domain to the target domain and thereby perform classification with no prior information.

The second proposed model is an adversarial method, the Xnet model, which serves the transfer not only between different regular RGB images (specifically digit images), but also between different types of remote sensing images (e.g., satellite-view images and aerial-view images). The data used for transfer learning differ in resolution, illumination, color, etc. We name the proposed model Xnet because of its "X" shape. This model adapts the features generated from the source domain to the classification task of the target domain by introducing task-specific attention learning. By introducing the attention mechanism, the gap between the transfer learning and classification/recognition tasks can be bridged. The features generated by the adaptation process are further adapted to classification/recognition tasks.

The third proposed model is the Dual Adversarial Network (DuAN), which serves the transfer not only between the same type of images collected under different acquisition conditions (e.g., transfer between different digit datasets, and transfer between satellite-view and aerial-view datasets), but also between different data types (e.g., transfer between ground-view and aerial-view datasets).
This task not only explores the effectiveness of transfer learning for remote sensing data (e.g., satellite-to-aerial), but also aims to solve the label-scarcity problem for the aerial scene (e.g., ground-to-aerial). The proposed model is named the Dual Adversarial Network (DuAN), motivated by the idea that the source and target domain data should serve separate purposes. Different from previous methods, the proposed model introduces a domain-specific strategy for feature adaptation and classification, to avoid the potential loss of discriminative characteristics of the target domain data in classical adaptation.

To sum up, this thesis proposes five deep transfer learning models to serve specific applications in both remote sensing and traditional computer vision tasks. The models are proposed by addressing three key problems in deep transfer learning, i.e., what to transfer, where to transfer, and how to transfer.

7.2 Future work

My future work will focus on two major directions. The first is few-shot transfer learning, and the second is multi-label transfer learning.

7.2.1 Few-shot transfer learning

For the few-shot transfer learning direction, our major motivation is that if we can obtain a few annotated samples from the target domain, the classification/segmentation/recognition accuracy on the target domain could be boosted substantially.

The design of a desirable transfer learning mechanism should strike a balance between classification accuracy and the number of annotated samples in the target domain. Although a conventional unsupervised transfer learning method can reach 90%+ accuracy for traditional digit classification tasks, it cannot always achieve satisfying classification/segmentation accuracy in practical scenarios. For example, for the task of ground-to-aerial view transfer learning, the classification accuracy on the target aerial images is around 50%. An intuitive step in practical scenarios is to add a few annotated samples from the target domain to achieve a potentially significant increase in classification accuracy. As in traditional computer vision research, we could employ few-shot learning to achieve this goal in transfer learning.

Few-shot learning is different from supervised learning, which generally applies 5%-10% annotated samples to supervise the classification process. For few-shot learning, the number of annotated samples ranges from 1 to 5 per class; it can therefore be considered weak supervision. However, it is worth noting that such weak supervision might be able to produce a significant performance improvement.

For ground-to-aerial transfer learning, which is a practically challenging application of transfer learning, the primary motivation is to apply transfer learning to solve the scarce-annotation problem for aerial images. With much easier access to aerial data nowadays, annotation becomes a big problem. However, our intuition is that getting 5 images per class annotated should not be problematic. If we can provide such weak supervision, the performance might be boosted by employing advanced transfer learning models.

During my Ph.D. study, we have already done a preliminary study. For the dataset introduced in Chapter 6, the experimental results show that, by adding five annotated samples per class for aerial images, the average classification accuracy increases from 53.36% to 75%.
7.2.2 Multi-label transfer learning

Unsupervised transfer learning has already shown excellent performance for single-label image transfer (e.g., transfer between the ImageNet and COCO datasets), but it is necessary to move beyond the single-label transfer learning task, since everyday images are inherently multi-label. Recently, empirical evidence has been presented showing that the performance of state-of-the-art classifiers on ImageNet is largely underestimated: many of the remaining errors are due to the fact that ImageNet's single-label annotation ignores the intrinsic multi-label nature of the images in the dataset.

Multi-label datasets (e.g., MS COCO, Open Images) contain more complex images that represent scenes with several objects. However, obtaining clean multi-label annotations is more difficult, and most images are single-labeled. Since multi-label data are quite likely to be wrongly or partially labeled, applying transfer learning to help with the automatic labeling of newly collected multi-label data is a challenging topic of great practical importance.

In this future work, we plan to propose multi-label transfer learning to reduce the annotation cost of multi-label data. We plan to study the following two scenarios: (1) transferring knowledge from annotated multi-label datasets to unannotated ones; and (2) transferring knowledge between images with partial labels, since currently many images are annotated with only one label while each image in fact contains more than one object/target. The first scenario is easier to study: we only transfer prior knowledge from label-rich datasets to label-scarce datasets, and the exploited information can directly help with the classification of the target domain data. The second scenario is more critical, since for most existing data there is only one label per image, or only some labels are known for each image. We can explore the potential of transferring information from rich single-labeled datasets to multi-labeled datasets. An example of such a scenario is illustrated in Fig. 7.1.

Figure 7.1: Dataset (a) only includes three partial categories with a single label "car". Dataset (b), with two partial categories, carries the label "person". By transferring from the two single-labeled datasets, we can obtain the final multi-labels for dataset (c).

However, multi-label transfer learning is a more practical and challenging task than single-label classification, since both the input images and the output label spaces are complex. We will first conduct experiments to verify our assumptions and intuitions; a sketch of one possible partial-label training loss is given below.
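One plausible way to train under the second (partial-label) scenario is a masked multi-label loss, in which only the observed labels of each image contribute to the objective. The sketch below is an assumption about how this could be implemented, not a method developed in this thesis.

# Sketch of a partial-label multi-label loss (hypothetical): each image
# carries a binary label vector plus a mask marking which labels are
# actually observed; unobserved labels are excluded from the loss.
import torch
import torch.nn.functional as F

def masked_bce_loss(logits, labels, observed_mask):
    # Per-label binary cross-entropy, averaged over observed entries only.
    per_label = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    return (per_label * observed_mask).sum() / observed_mask.sum().clamp(min=1)

logits = torch.randn(4, 5)               # 4 images, 5 candidate labels
labels = torch.zeros(4, 5)
labels[:, 0] = 1.0                       # e.g., only "car" is annotated
mask = torch.zeros(4, 5)
mask[:, 0] = 1.0                         # only that label is observed
print(masked_bce_loss(logits, labels, mask))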
7.2.3 Multi-label transfer learning for aerial scenes

This future work extends the previous section and focuses on the specific application of aerial scenes. Since aerial images have much larger fields of view, an aerial scene most of the time contains more than one object of interest.

For this future work, we plan to conduct the following three experiments: (1) transfer learning from single-label aerial images to multi-label aerial scenes; (2) transfer learning between multi-label aerial scenes; and (3) transfer learning from multi-label RGB datasets to multi-label aerial scenes. One expected major contribution of this future work will be the collected datasets, shown in Fig. 7.2 and Fig. 7.3, which we collected and annotated ourselves.

Figure 7.2: UCM multi-label aerial image dataset.

Figure 7.3: AID multi-label aerial image dataset.
Appendix A

Supporting Materials

Here we want to optimize the objective function in Eq. 2.11, which is not jointly convex in $\Theta^T = \{W^T, b^T\}$, $\Theta^S = \{W^S, b^S\}$, $V^S$, and $V^T$. To solve this problem, we adopt the Lagrangian multiplier method to update $V^S, V^T$ and the stochastic gradient descent method to update $\Theta^S, \Theta^T$. The objective function is split into two sub-problems as follows.

A.0.1 Updating $V^S, V^T$ with fixed $\Theta^S, \Theta^T$

In Eq. 2.11, the optimization of $V^S, V^T$ involves only the third term, and the optimization of each layer $V^{S(l)}, V^{T(l)}$ can be formulated as

$$\min_{V^{S(l)},V^{T(l)}} \; -\frac{V^{S(l)\top}\Sigma_{ST}V^{T(l)}}{\sqrt{V^{S(l)\top}\Sigma_{SS}V^{S(l)}}\,\sqrt{V^{T(l)\top}\Sigma_{TT}V^{T(l)}}} \tag{A.1}$$

Since $V^{S(l)\top}\Sigma_{SS}V^{S(l)} = 1$ and $V^{T(l)\top}\Sigma_{TT}V^{T(l)} = 1$, we have the Lagrangian

$$L(w_l, V^{S(l)}, V^{T(l)}) = -V^{S(l)\top}\Sigma_{ST}V^{T(l)} + \frac{w_l^S}{2}\left(V^{S(l)\top}\Sigma_{SS}V^{S(l)} - 1\right) + \frac{w_l^T}{2}\left(V^{T(l)\top}\Sigma_{TT}V^{T(l)} - 1\right) \tag{A.2}$$

Taking the partial derivatives of Eq. A.2 gives

$$\frac{\partial L}{\partial V^{S(l)}} = \Sigma_{ST}V^{T(l)} - w_l^S\,\Sigma_{SS}V^{S(l)} = 0, \qquad \frac{\partial L}{\partial V^{T(l)}} = \Sigma_{ST}^{\top}V^{S(l)} - w_l^T\,\Sigma_{TT}V^{T(l)} = 0 \tag{A.3}$$

After reduction, we further have

$$V^{T(l)} = \frac{\Sigma_{TT}^{-1}\Sigma_{ST}^{\top}V^{S(l)}}{w_l} \tag{A.4}$$

$$\Sigma_{ST}\Sigma_{TT}^{-1}\Sigma_{ST}^{\top}V^{S(l)} = w_l^2\,\Sigma_{SS}V^{S(l)} \tag{A.5}$$

with $w_l = w_l^S = w_l^T$. Hence $V^{S(l)}$ and $w_l$ can be solved from Eq. A.5 by generalized eigenvalue decomposition, and the corresponding $V^{T(l)}$ is then obtained from Eq. A.4, as illustrated numerically in the sketch below.
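Numerically, the per-layer update above reduces to a standard generalized eigenvalue problem. The following sketch assumes covariances estimated from mean-centered activation matrices, with a small ridge term added for numerical stability (our assumption); it solves Eq. A.5 for $V^{S(l)}$ and $w_l$, then recovers $V^{T(l)}$ via Eq. A.4.

# Numerical sketch of the per-layer CCA update in Eqs. (A.4)-(A.5): solve the
# generalized eigenproblem for V^{S(l)} and w_l, then recover V^{T(l)}.
# The covariance estimates and the ridge term are our assumptions.
import numpy as np
from scipy.linalg import eigh

def cca_directions(A_s, A_t, ridge=1e-4):
    # A_s, A_t: activation matrices (n samples x d features).
    n = A_s.shape[0]
    A_s = A_s - A_s.mean(0)
    A_t = A_t - A_t.mean(0)
    S_ss = A_s.T @ A_s / n + ridge * np.eye(A_s.shape[1])
    S_tt = A_t.T @ A_t / n + ridge * np.eye(A_t.shape[1])
    S_st = A_s.T @ A_t / n
    # Generalized eigenproblem: S_st S_tt^{-1} S_st^T v = w^2 S_ss v  (Eq. A.5)
    M = S_st @ np.linalg.solve(S_tt, S_st.T)
    w2, V = eigh(M, S_ss)                          # ascending eigenvalues
    v_s, w = V[:, -1], np.sqrt(max(w2[-1], 0.0))   # top canonical direction
    v_t = np.linalg.solve(S_tt, S_st.T @ v_s) / w  # Eq. (A.4)
    return v_s, v_t, w

v_s, v_t, w = cca_directions(np.random.randn(100, 8), np.random.randn(100, 8))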
A.0.2 Updating $\Theta^S, \Theta^T$ with fixed $V^S, V^T$

Since $\Theta^S$ and $\Theta^T$ are mutually independent and have the same form, we only demonstrate the solution of $\Theta^S$ on the source domain (the solution of $\Theta^T$ can be derived similarly):

$$\min_{\theta^S} \phi(\theta^S) = J^S(W^S, b^S) - \Gamma(V^S, V^T) \tag{A.6}$$

Here we apply the gradient descent method to adjust the parameters:

$$W^{S(l)} = W^{S(l)} - \mu^S\frac{\partial\phi}{\partial W^{S(l)}}, \qquad \frac{\partial\phi}{\partial W^{S(l)}} = \frac{\partial J^S(W^S,b^S)}{\partial W^{S(l)}} - \frac{\partial\Gamma(V^S,V^T)}{\partial W^{S(l)}} = \frac{\left(\alpha^{S(l+1)} - \beta^{S(l+1)} + \omega_l\gamma^{S(l+1)}\right)\times A^{S(l)}}{n_c} + \lambda^S W^{S(l)} \tag{A.7}$$

$$b^{S(l)} = b^{S(l)} - \mu^S\frac{\partial\phi}{\partial b^{S(l)}}, \qquad \frac{\partial\phi}{\partial b^{S(l)}} = \frac{\partial J^S(W^S,b^S)}{\partial b^{S(l)}} - \frac{\partial\Gamma(V^S,V^T)}{\partial b^{S(l)}} = \frac{\alpha^{S(l+1)} - \beta^{S(l+1)} + \omega_l\gamma^{S(l+1)}}{n_c} \tag{A.8}$$

in which

$$\alpha^{S(l)} = \begin{cases} -(C^S - A^{S(l)}) \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = n_S \\ W^{S(l)\top}\alpha^{S(l+1)} \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = 2,\dots,n_S - 1 \end{cases} \tag{A.9}$$

$$\beta^{S(l)} = \begin{cases} 0, & l = n_S \\ A^{T(l)}V^{T(l)}V^{S(l)\top} \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = 2,\dots,n_S - 1 \end{cases} \tag{A.10}$$

$$\gamma^{S(l)} = \begin{cases} 0, & l = n_S \\ A^{S(l)}V^{S(l)}V^{S(l)\top} \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = 2,\dots,n_S - 1 \end{cases} \tag{A.11}$$

The operator $\bullet$ denotes the element-wise (dot) product. The same optimization process applies to $\Theta^T$ on the target domain. After these two optimizations for each layer, the CCA on the top hidden layer is employed to fine-tune all parameters of the whole network through back-propagation. As we only exploit the correlation between the two domain networks, the objective function is defined as

$$\min_{\theta^S,\theta^T,V^S,V^T} J = -\Gamma(V^S, V^T) \tag{A.12}$$

The procedure for updating $\{V^S, V^T\}$ is the same as in Eq. A.1 but with different parameters. For updating $\Theta^S, \Theta^T$ in Eq. A.6, however, the settings become $\alpha^{S(l)} = 0$ and

$$\beta^{S(l)} = \begin{cases} A^{T(l)}V^{T(l)}V^{S(l)\top} \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = n_S \\ W^{S(l)}\beta^{S(l+1)} \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = 2,\dots,n_S - 1 \end{cases} \tag{A.13}$$

$$\gamma^{S(l)} = \begin{cases} A^{S(l)}V^{S(l)}V^{S(l)\top} \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = n_S \\ W^{S(l)}\gamma^{S(l+1)} \bullet A^{S(l)} \bullet (1 - A^{S(l)}), & l = 2,\dots,n_S - 1 \end{cases} \tag{A.14}$$

The optimization process is summarized in Alg. 5.

Algorithm 5 Deep Mapping Model Training
Input: $D^C = \{C_i^S, C_i^T\}_{i}^{n_C}$
Input: $\lambda^S = 1, \lambda^T = 1, \mu^S = 0.5, \mu^T = 0.5$
Output: $\Theta(W^S, b^S), \Theta(W^T, b^T), V^S, V^T$
1: function INITIALIZATION
2:   Initialize $\Theta(W^S,b^S), \Theta(W^T,b^T) \leftarrow$ RandomNum
3:   for $l = 1,2,\dots,n_S$ do
4:     $V^S \leftarrow \arg\min L(\omega_l, V^{S(l)})$
5:   end for
6:   for $l = 1,2,\dots,n_T$ do
7:     $V^T \leftarrow \arg\min L(\omega_l, V^{T(l)})$
8:   end for
9:   $\theta^S = \arg\min\phi(\theta^S)$, $\theta^T = \arg\min\phi(\theta^T)$
10: end function
11: function LASTLAYERFINETUNING
12:   repeat
13:     Set $\alpha^S, \alpha^T = 0$; update $\beta, \gamma$ by Eq. A.13 and Eq. A.14
14:     for $l = 1,2,\dots,n_S$ do
15:       $V^S \leftarrow \arg\min L(\omega_l, V^{S(l)})$
16:     end for
17:     for $l = 1,2,\dots,n_T$ do
18:       $V^T \leftarrow \arg\min L(\omega_l, V^{T(l)})$
19:     end for
20:     $\theta^S, \theta^T = \arg\min -\Gamma(V^S, V^T)$
21:   until Convergence
22: end function

When Alg. 5 converges, the domain-specific networks and the correlation coefficients between the domains are obtained for the deep mapping process. It is noteworthy that, to guarantee the performance of this process, the parameters are heuristically selected to yield the best results. A structural sketch of this alternating scheme is given below.
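To make the control flow of Alg. 5 explicit, the following runnable sketch alternates a direction update with a parameter update until the correlation proxy stops changing. The two update functions are deliberately simple placeholders standing in for Eq. A.1 and Eqs. A.7-A.8; only the alternation and the convergence test mirror the algorithm.

# Structural sketch of Alg. 5: alternate the closed-form direction update
# with a parameter update, stopping when the top-layer correlation proxy
# stops changing. Both update functions are placeholders, not the thesis
# implementation; only the control flow mirrors the algorithm.
import numpy as np

def update_directions(theta_s, theta_t):
    # Placeholder for the generalized-eigenvalue step (Section A.0.1).
    return theta_s / np.linalg.norm(theta_s), theta_t / np.linalg.norm(theta_t)

def update_parameters(theta, other, lr=0.2):
    # Placeholder gradient step pulling the parameter vectors together,
    # loosely mimicking the correlation-increasing update of Section A.0.2.
    return theta + lr * (other - theta)

theta_s = np.array([1.0, 0.0, 2.0])
theta_t = np.array([0.0, 1.0, 1.0])
prev, tol = np.inf, 1e-6
for it in range(1000):
    v_s, v_t = update_directions(theta_s, theta_t)   # lines 14-19 of Alg. 5
    theta_s = update_parameters(theta_s, theta_t)    # line 20 of Alg. 5
    theta_t = update_parameters(theta_t, theta_s)
    cur = float(theta_s @ theta_t)                   # correlation proxy
    if abs(prev - cur) < tol:                        # convergence check
        break
    prev = cur
print(f"converged after {it + 1} iterations")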
