An Ensemble Automatic Modulation Classification Model With Weight Pruning and Data Preprocessing

by

Xueting Yang

B. Eng, Xidian University, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF APPLIED SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

February 2020

© Xueting Yang, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

An Ensemble Automatic Modulation Classification Model With Weight Pruning and Data Preprocessing

submitted by Xueting Yang in partial fulfillment of the requirements for the degree of MASTER OF APPLIED SCIENCE in Electrical and Computer Engineering.

Examining Committee:

Victor C.M. Leung, Electrical and Computer Engineering, UBC, Vancouver
Supervisor

Julian Cheng, School of Engineering, UBC, Okanagan
Co-supervisor, Supervisory Committee Member

Jane Z. Wang, Electrical and Computer Engineering, UBC, Vancouver
Supervisory Committee Member

Abstract

Automatic Modulation Classification (AMC) detects the modulation type and order of the received signal using limited prior knowledge within a short observation interval. In this thesis, we aim to provide a computation-efficient and high-performance AMC model for resource-constrained mobile devices.

We use a public RadioML dataset and introduce three data pre-processing methods, namely noise reduction, normalization, and label smoothing, which are applied before training on the raw signals. Besides four common signal representations, we propose a new signal representation called a three-dimensional constellation image.

For each signal representation, we carefully design a Deep Learning (DL) model. In addition to the traditional Convolutional Neural Network (CNN), two new AMC model structures are proposed.
The attention module is integrated into the AMC model structure based on conventional Long Short-Term Memory (LSTM) networks. Another proposed AMC model structure connects CNN, LSTM, and densely connected neural networks with two additional connections.

After training the AMC models, we analyze the overall and per-class performance. We also study the computational complexity of trained AMC models in terms of memory consumption and detection efficiency. Overall, the results indicate that the proposed data pre-processing methods and the new AMC model structures can significantly improve the classification performance. To reduce the complexity of the proposed AMC models, we introduce weight pruning to remove unnecessary connections in DL models. After weight pruning, the proposed AMC models have negligible performance degradation.

To further improve the performance of AMC models, we also propose ensemble learning to train a second-level model based on multiple first-level AMC models. With three-fold cross-validation, the second-level model can train on the whole dataset and have an F1-score improvement of at least 10%. We also conduct weight pruning to reduce the unnecessary parameters of the ensemble learned model. Overall, after weight pruning, the ensemble learned AMC model achieves an F1-score of 0.965 when the signal-to-noise ratio is greater than 6 dB.

Lay Summary

Automatic Modulation Classification (AMC) is an important technology that finds applications in both civilian and military communication systems. AMC classifies the modulation types and orders of the received signals within a short observation time and typically requires less prior statistical information. This thesis aims to provide a high-performance and memory-efficient AMC model for mobile devices.

We propose two improved Deep Learning (DL) models based on the existing AMC models and three data pre-processing methods that are applied before training AMC models.
To further improve the classification accuracy of the proposed AMC models, we combine multiple AMC models into one model. To improve computational efficiency, we further remove the unnecessary connections in the model structure. Both performance analysis and experimental results validate the efficiency of the proposed three data pre-processing methods, two new DL model structures, and the combination of single models.

Preface

I hereby declare that I am the author of this thesis. The original and unpublished work outlined in this thesis was conducted in the Department of Electrical and Computer Engineering at The University of British Columbia, under the supervision of Professor Julian Cheng and Professor Victor C.M. Leung.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Symbols
Glossary
Acknowledgments
1 Introduction
  1.1 Background
  1.2 Automatic Modulation Classification (AMC) Methods
    1.2.1 Likelihood-based Method (LBM)
    1.2.2 Feature-based Method (FBM)
    1.2.3 Deep Learning-based Method (DLM)
  1.3 Contributions
  1.4 Outline of the Thesis
2 System Model and Machine Learning Basics for AMC
  2.1 Signal Model
    2.1.1 Transmitter
    2.1.2 Wireless Channel
    2.1.3 Receiver
  2.2 Problem Formulation
  2.3 Machine Learning in AMC
  2.4 Deep Learning (DL) in AMC
    2.4.1 Classic Neural Network (NN)
    2.4.2 Convolutional Neural Network (CNN)
    2.4.3 Prevent Overfitting in DL
    2.4.4 Loss function in DL
  2.5 Summary
3 Signal Representations and AMC Models
  3.1 Signal Representations
    3.1.1 Time-Domain IQ Signal
    3.1.2 Time-Domain AP Signal
    3.1.3 Frequency Spectrum Signal
    3.1.4 Image-Domain 3D-CI
    3.1.5 Statistical-Domain Manually Extracted Features
  3.2 Data Preprocessing
    3.2.1 Noise Reduction
    3.2.2 Normalization
    3.2.3 Label Smoothing
  3.3 Corresponding AMC Model for Each Signal Representation
    3.3.1 AP-Attention Model
    3.3.2 IQ-CLDNN Model
    3.3.3 Img-MobileNetV2 Model
    3.3.4 Features-CNN Model
    3.3.5 Fast Fourier Transform (FFT)-CNN Model Architecture
  3.4 Experiment Setup
    3.4.1 Dataset Split
    3.4.2 Implementation Details
  3.5 Summary
4 Performance of AMC Models
  4.1 Accuracy Rate
    4.1.1 Overall Accuracy Rate
    4.1.2 Per-class Accuracy Rate
  4.2 Confusion Matrix
  4.3 Performance Metrics for Hard and Soft Classifiers
    4.3.1 Basic Statistics
    4.3.2 Hard Classifiers: Balanced Accuracy and F1-score
    4.3.3 Micro and Macro Averaging
    4.3.4 Soft Classifiers: Receiver Operating Characteristics (ROC) and Precision-Recall Curve (PRC)
  4.4 Computational Complexity
  4.5 Summary
5 Weight Pruning and Stacking for a Better End-to-end AMC Model
  5.1 Weight Pruning
  5.2 Ensemble Learning
    5.2.1 Definition
    5.2.2 Stacking
  5.3 Summary
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
Bibliography
A Typical Layers in CNN Architecture
  A.1 Convolutional Layer
  A.2 Activation Layer
  A.3 Fully-connected or Dense Layer
B RadioML Dataset Generation Setup
  B.1 Dataset Simulation Model
  B.2 Dataset Parameters
C Statistical Features
  C.1 High-order Cumulants (HOC)
  C.2 High-order Moments (HOM)
  C.3 Other features
  C.4 Summary of Selected Statistical Features

List of Tables

Table 3.1 Long Short-Term Memory (LSTM) Parameters
Table 4.1 Micro and Macro-averaging F1-scores for AMC models
Table 4.2 Computational Complexity of AMC Models
Table 5.1 Computational Complexity of Pruned Stacking AMC Models
Table B.1 RadioML Dataset Parameters
Table C.1 List of Features Used in Proposed AMC Method

List of Figures

Figure 2.1 Simplified block diagram of signal model processing chain
Figure 3.1 30,000 QPSK constellation points at SNR = 2 dB (upper) and 14 dB (lower)
Figure 3.2 AP-Attention model schematic diagram for batch i
Figure 3.3 LSTM block diagram
Figure 3.4 Schematic diagram of the attention module
Figure 3.5 IQ-CLDNN model schematic diagram for batch i
Figure 4.1 Overall accuracy rate of AMC classifiers under various Signal-to-Noise Ratio (SNR) values
Figure 4.2 Accuracy rate of AP-Attention model
Figure 4.3 Accuracy rate of 8PSK for different signal sample lengths
Figure 4.4 Confusion matrices when SNR is 0 dB or 6 dB
Figure 4.5 Per-class F1-scores for AP-Attention and IQ-CLDNN
Figure 4.6 Micro-averaging ROC Curves for AMC models
Figure 4.7 Per-class ROC Curves for AP-Attention
Figure 4.8 Micro-averaging PRCs for AMC models
Figure 4.9 Per-class PRC for AP-Attention
Figure 5.1 Model sizes comparison after weight pruning
Figure 5.2 Micro-averaged F1-scores comparison after weight pruning
Figure 5.3 Stacking with K-fold Cross-Validation
Figure 5.4 Micro-averaged F1-scores for stacking learned models

Symbols

argmax  Points of the domain of the function where the function values are maximized
argmin  Points of the domain of the function where the function values are minimized
cum(·)  Cumulant function in probability theory
$\Im$  Imaginary part of a complex variable
$\Re$  Real part of a complex variable
$\mathbb{R}$  Set of all real numbers
M  Number of modulation schemes
N  Number of examples in the dataset
L  Sample length of received signals

Glossary

1D  One-dimensional
2D  Two-dimensional
3D-CI  Three-dimensional Constellation Image
AMC  Automatic Modulation Classification
AM  Amplitude Modulation
AP  Amplitude and Phase
AWGN  Additive White Gaussian Noise
BER  Bit Error Rate
BPSK  Binary Phase Shift Keying
CD  Constellation Diagram
CNN  Convolutional Neural Network
CPFSK  Continuous-phase Frequency-shift Keying
CR  Cognitive Radio
D/A  Digital-to-analog
DFT  Discrete Fourier Transform
DLM  Deep Learning-based Method
DL  Deep Learning
DNN  Densely-connected Neural Network
DSB  Double Side-band
EM  Expectation Maximization
FBM  Feature-based Method
FFT  Fast Fourier Transform
GFSK  Gaussian Frequency-shift Keying
HOC  High-order Cumulants
HOM  High-order Moments
HOS  High-order Statistics
I.I.D  Independent and Identically Distributed
IQ  In-phase and Quadrature
KDE  Kernel Density Estimation
KNN  K-Nearest Neighbors
LBM  Likelihood-based Method
LSTM  Long Short-Term Memory
MLE  Maximum Likelihood Estimation
MR  Modulation Recognition
NN  Neural Network
PAM  Pulse Amplitude Modulation
PAR  Peak to Average Ratio
PDF  Probability Density Function
PRC  Precision-Recall Curve
PRR  Peak to Root-mean-square Ratio
PSK  Phase Shift Keying
QAM  Quadrature Amplitude Modulation
QPSK  Quadrature Phase Shift Keying
RELU  Rectified Linear Unit
RNN  Recurrent Neural Network
ROC  Receiver Operating Characteristics
SGD  Stochastic Gradient Descent
SNR  Signal-to-Noise Ratio
SSB  Single Side-band Modulation
WBFM  Wide-band Frequency Modulation

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my supervisors Professor Victor C.M.
Leung and Professor Julian Cheng, for their great support and persistent encouragement during my master's program. It is a great pleasure to work under their supervision. Also, I greatly thank my committee members for the time and effort in reading my thesis and providing feedback.

I have been truly fortunate to have all the brilliant colleagues and friends in Vancouver Lab Kaiser 4090 and Okanagan Lab EME 2255. I am also thankful to my roommates Yubo Sun and Siqi An for always being a huge support in my life.

Thank you to my sweet, reliable and caring boyfriend Zhen Wang, for all the precious understanding, encouragement, and always keeping me company. Thanks for being my soulmate, and I love you with all my heart.

Finally, I owe my heartfelt gratitude to my beloved parents for giving me endless love and support. I am lucky and proud to be your daughter.

Chapter 1
Introduction

In this chapter, we first introduce the purpose of Automatic Modulation Classification (AMC). We then review the categories of AMC methods and their motivations, and state our contributions. The structure of the thesis is outlined at the end of this chapter.

1.1 Background

Radio spectrum is scarce. To use the radio spectrum efficiently, we need to monitor and understand the spectrum resource usage. Effective use of radio spectrum calls for developing advanced algorithms to share the available spectrum dynamically. To achieve this goal, we need to extract useful information from complex radio signals over a wide frequency range [1].

Cognitive Radio (CR) can mitigate the long-standing spectrum scarcity problem [2]. CR automatically detects available channels in the wireless spectrum to utilize unused spectrum.
In CR, the modulation type, an important element in determining the throughput, robustness, and overall implementation overhead, is automatically determined according to the external environment.

As an intermediate step between signal detection and demodulation, Modulation Recognition (MR) is the task of detecting the modulation type and modulation order of a received radio signal [3]. AMC performs the MR task autonomously, using limited prior knowledge within a short observation interval.

Adaptive modulation uses different modulation schemes, such as different orders of Quadrature Amplitude Modulation (QAM) and Phase Shift Keying (PSK), according to changing channel conditions. In adaptive modulation, pilot symbols are commonly used to help demodulate in fading environments [4]. Under a limited power budget, applying AMC to adaptive modulation without pilot tones can increase spectral efficiency without sacrificing the Bit Error Rate (BER) performance [5].

AMC has numerous civilian and military applications [1, 3, 6–9]. For civilian applications, AMC is essential for signal sensing for cooperative communication and spectrum interference monitoring. For military applications, AMC provides added advantages for signal interception, jamming and localization of a hostile signal in electronic warfare and surveillance.

1.2 AMC Methods

In general, MR can be formulated as a classification problem. Various MR approaches in traditional wireless networks can be divided into two broad categories: Likelihood-based Method (LBM) and Feature-based Method (FBM).¹
Recently, many researchers have also applied Deep Learning-based Method (DLM) to AMC, since DLM is simple to design, robust to model mismatch and less dependent on prior features [6, 7, 10].

¹ The LBM and FBM are also known as the decision-theoretic method and statistical pattern recognition method, respectively.

1.2.1 LBM

In LBM, the MR is presented as multiple composite hypothesis-testing problems. LBM builds a probabilistic model for the received signal with a proper decision criterion, where the modulation type having the largest likelihood value is the identified output. LBM assumes that the Probability Density Function (PDF) of the transmitted signal contains all information for classification. Even though LBM can achieve the highest recognition rate in the Bayesian sense for a given model, LBM has the following four disadvantages [6].

First, obtaining accurate prior PDF information of the transmitted signal is typically infeasible in most practical scenarios, e.g., non-cooperative communication.

Second, for LBM, it is challenging to obtain an exact closed-form solution for the decision function of this hypothesis-testing problem. Even when such a closed-form solution exists, for example in the PSK classification problem with unknown carrier phase [8], the high computation complexity makes such a classifier impractical [6].

Third, the classification performance of the LBM models deteriorates significantly in the presence of a model mismatch, that is, a discrepancy between the ideal system model of LBM and the true model caused by, e.g., unknown channel conditions and other receiver discrepancies.

Fourth, the practical implementation of LBM suffers from high computational complexity due to the computation of PDF over unknown channel conditions and the buffering requirement for a large number of samples.
To ease the computational complexity, three sub-optimal methods have been proposed based on their assumptions for the unknown parameters [9, 11, 12].

1.2.2 FBM

As a mapping function between features and multi-hypothesis, FBMs normally consist of three major stages: 1) pre-processing; 2) feature extraction; 3) classification.

Pre-processing estimates channel state information, eliminates noise, frequency and phase offset, or transforms inputs into proper forms for better equalization and easier feature extraction.

In the feature extraction stage, features of the instantaneous amplitude, phase, and frequency are extracted. They fall into the following five categories [13–21].

• Signal spectral based features [14, 16, 17].
• Wavelet, short-time Fourier transform-based features [17].
• High-order Statistics (HOS) based features [14–17, 19, 21]. HOS-based features include cumulants-based and moments-based features, which often have better anti-noise and anti-interference properties.
• Cyclo-stationary analysis-based features [14, 16, 17, 20].
• Graph-based cyclic-spectrum analysis features, such as the constellation diagram [18].

For the classification process, the existing classifiers range from the decision tree [16, 19], lightweight support vector machine [21], and K-Nearest Neighbors (KNN) [15] to artificial Neural Network (NN) [14, 20].

Compared with LBM, FBM can achieve reasonable probability of correct classification (Pc) with favorable computational complexity. Moreover, since FBM exploits training data to extract features, it is more robust to variations of systems, such as fading, path-loss and time shift.

However, it has been shown that FBM performance is stable only at relatively high Signal-to-Noise Ratio (SNR) in an Additive White Gaussian Noise (AWGN) channel, and is dependent on the number of fine-designed features [14–17].
The manual selection of features [19–21] is also tedious and makes it impossible to model all changes in time, location, velocity, and propagation conditions of the transmitter or the receiver.

Overall, all these aforementioned LBM and FBM exploit knowledge about the structure of different modulation schemes to formulate the classification rules for AMC. Both methods require intensive processing power, which hinders deployment on low-cost distributed sensors. Also, both methods are inflexible to adjust for various environments where we need to extract different features for MR.

1.2.3 DLM

The past decade has witnessed the rapid development of high-performance graphics-card processing power, improved Stochastic Gradient Descent (SGD) methods, and strong regularization techniques in the Deep Learning (DL) area. With hardware and software breakthroughs, DL has achieved a series of exciting achievements in natural language processing, computer vision and other pattern recognition tasks. This revolution has also sparked interest in extending DL to other domains, including optimization algorithms for wireless communication to achieve better end-to-end performance [22].

Many research works have been conducted to include DL in AMC [1, 3, 10, 18, 23, 24].

Meng et al. [23] specified an idea of end-to-end Convolutional Neural Network (CNN)-based AMC, which shows a performance superior to FBM and close to ideal LBM. Furthermore, Wang et al. [24] combined a CNN model based on In-phase and Quadrature (IQ) data and another CNN model based on Constellation Diagram (CD) to improve the poor classification of QAM16 and QAM64 in [23]. However, both works are limited to a relatively simple and small dataset, and as a result, their models are not suitable for practical applications and not comparable to other AMC models.

In [3], a public dataset with raw IQ time series radio signals for AMC is generated using GNU Radio,² and this dataset is named the RadioML dataset.
Experiments in [3] show that the proposed DLM based on CNNs is robust to noise and corruption.

² https://www.gnuradio.org/about/

Based on the public dataset [3], Kulin et al. [1] exploited three wireless signal representations with CNNs for signal classification problems. The results demonstrated the feasibility of DLM using correct signal representation. However, this work only considered CNN and Densely-connected Neural Network (DNN) for the model structure, and the computational complexity was not discussed [1]. To achieve real-time AMC for varying environmental conditions, the quantized Long Short-Term Memory (LSTM)-based model for Amplitude and Phase (AP) signals was proposed by Rajendran et al. [10], where the robustness of LSTM is established for variable sample lengths. Besides regular raw IQ signals or AP signals, Peng et al. [18] also demonstrated the feasibility of constellation diagrams in different DLMs.

To date, a comprehensive performance and complexity comparison of AMC models based on all possible signal representations is still missing. Whilst some research has been carried out on different model structures, e.g., CNN and LSTM, little attention has been paid to ensemble learning. It is also surprising that model pruning³ has not been investigated in any of the aforementioned papers.

Therefore, in this thesis, we investigate five different signal representations of the RadioML dataset [3] and propose a corresponding DL-based model for each signal form. Three data pre-processing methods are introduced to improve AMC performance. To validate the performance of the proposed AMC models, we also conduct extensive numerical experiments to compute and depict various performance metrics including computational complexity. Pruning is further introduced to achieve real-time AMC on edge devices having constrained computational resources. Besides pruning, we illustrate the remarkably improved performance of stacking learning.
Finally, pruned stacking models are proposed for improved performance and realistic complexity in practice.

³ Model pruning removes the less salient connections in NNs to reduce the number of non-zero parameters with little loss in final model quality.

1.3 Contributions

This thesis proposes real-time pruned stacking AMC models based on the RadioML dataset. We summarize the main contributions of this thesis as follows:

• We consider five signal representations for AMC: time-domain IQ signal, time-domain AP signal, frequency-domain Fast Fourier Transform (FFT) signal, image-domain Three-dimensional Constellation Image (3D-CI), and the statistical-domain manually extracted 23 features (denoted as "Features"). Noise reduction, normalization, and label smoothing are conducted for signals in various representations before feeding them into the AMC model.
• For the AP signal, we combine an attention module with LSTMs to weight the lower-level features extracted by the LSTM module. For the IQ signal, we add two additional multi-scale connections to capture information at different resolutions produced by the CNN, LSTM and DNN modules. The well-known MobileNet version 2 [25] is chosen for 3D-CI, and simple CNNs are exploited for FFT and Features to control the computation complexity.
• Detailed per-class and overall performance metrics are computed for each AMC model. Memory consumption and detection efficiency are also considered.
• Pruning AMC models with minimal performance degradation is introduced to save memory in practical deployment. The performance comparison of the pruned model with the non-pruned model is also included.
• Stacking with three-fold cross-validation is proposed to combine the strengths of previous single AMC models. We also prune the stacking learned models to ensure efficient use of storage memory.
Results show that the pruned stacking models can achieve superior performance while having a relatively low computation complexity, when compared with single AMC models and conventional AMC models.

1.4 Outline of the Thesis

The rest of the thesis is organized as follows. Chapter 2 presents the signal model, problem formulation, and fundamental information of DL in AMC. Chapter 3 introduces five signal representations and data pre-processing methods, and illustrates the AMC model structures and experiment setup. Chapter 4 summarizes the performance metrics and evaluates the performance of AMC models. In Chapter 5, pruning and stacking are conducted for improved end-to-end performance. Conclusions and future work are presented in Chapter 6.

Chapter 2
System Model and Machine Learning Basics for AMC

2.1 Signal Model

This section describes the system model and formulates the AMC problem. The processing pipeline for a wireless communication system with an AMC model is illustrated in Figure 2.1. A wireless signal model consists of a transmitter, a receiver and a channel model at the system level. The AMC module is an intermediate process that occurs between signal detection and demodulation at the receiver.

2.1.1 Transmitter

The transmitter transforms a stream of source information bits $b_k \in \{0, 1\}$ into the transmission signal $s(t)$. After coding and modulation, the message is mapped to a discrete waveform or signal denoted by $s_k$ via a pulse-shaping filter.
The Digital-to-analog (D/A) converter module transforms $s_k$ into the continuous baseband signal $s_b(t)$.

[Figure 2.1: Simplified block diagram of signal model processing chain. Labels in the diagram: Source Information, Modulation, D/A, Channel, noise, A/D, AMC, Demodulation, Information, bits, symbols, $b_k$, $s_k$, $s_b(t)$, $s(t)$, $h(t, \tau)$, $n(t)$, $r(t)$, $r_b(t)$, $r(k)$.]

The real-valued bandpass signal $s(t)$ having carrier frequency $f_c$ can be written as [1]

\[ s(t) = \Re\{s_b(t) e^{j 2\pi f_c t}\} = \Re\{s_b(t)\}\cos(2\pi f_c t) - \Im\{s_b(t)\}\sin(2\pi f_c t) \tag{2.1} \]

where $s_b(t) = \Re\{s_b(t)\} + j\,\Im\{s_b(t)\}$ is the baseband complex envelope of $s(t)$.

2.1.2 Wireless Channel

The channel effects are modelled as a linear time-varying bandpass channel impulse response $h(t, \tau)$. A general channel output $r(t)$ under $h(t, \tau)$ is

\[ r(t) = s(t) * h(t, \tau) + n(t) \tag{2.2} \]

where $n(t)$ is AWGN having mean zero and variance $\sigma_n^2$, and $*$ denotes the convolution operation.

Equation 2.2 is widely used in traditional expert features computation. However, the practical input/output relation is more complicated with a frequency-selective fading channel and imperfect receiver hardware. For more details, please refer to Section B.1.

2.1.3 Receiver

The relationship between $r(t)$ and its baseband complex envelope $r_b(t)$ is given by

\[ r(t) = \Re\{r_b(t) e^{j 2\pi f_c t}\} \tag{2.3} \]

\[ r_b(t) = \left(s_b(t) * h_b(t, \tau)\right) \frac{1}{2} e^{j(2\pi f_0 t + \phi(t))} + n(t) \tag{2.4} \]

where $f_0$ is the frequency offset between the transmitter local oscillator frequency $f_c$ and the receiver local oscillator frequency $f'_c$, $\phi(t)$ is the phase offset, and $h_b(t, \tau)$ is the baseband channel.

Let $r(k)$ denote the discrete-time observed signal at sampling index $k$ after the received signal is amplified, mixed, low-pass filtered, and passed through the analog-to-digital (A/D) converter module. After sampling $r_b(t)$ at time $t = k/f_s$, where $f_s$ is the sampling rate, $r(k)$ is given by

\[ r(k) = r_b(t)\big|_{t = k/f_s}, \quad -\infty < k < +\infty. \tag{2.5} \]

The input $r(k)$ to AMC is presented in the IQ complex form $I + jQ$ due to its flexibility and simplicity for mathematical operations and algorithm design. The expression of the $I + jQ$ form is given by

\[ r(k) = r_I(k) + j\, r_Q(k), \quad k = 0, \cdots, L-1 \tag{2.6} \]

where $L$ is the sample length that specifies the number of received signal samples.

2.2 Problem Formulation

In general, AMC can be treated as a multi-class classification problem. The objective of supervised AMC is to produce the probabilities $P(s(k) \in \Theta_m \mid r(k))$, where $\Theta_m$ represents the $m$th class in the modulation schemes pool $\Theta$, which is defined as

\[ \Theta = \{\Theta_m\}_{m=1}^{M} \tag{2.7} \]

where $M$ is the number of possible modulation schemes.

Let $\Pr(f_{AMC}(r(k)) = m' \mid H_m)$ denote the probability of detecting the $m$th modulation format of the transmitted signal as the $m'$th modulation format. Here, $f_{AMC}(\cdot)$ is the classification function of the AMC classifier, $f_{AMC}(r(k))$ denotes the modulation format estimated by the classifier according to the received signal $r(k)$, and $H_m$ is the hypothesis that the transmitted sequence $s(t)$ is generated from $\Theta_m$.

We suppose that the modulation types $\Theta_m$ are equiprobable. The typical AMC approaches maximize the average probability of correct classification $P_c$ in a short observation interval for a wide range of SNR values, and $P_c$ is defined as [6]

\[ P_c = \frac{1}{M} \sum_{m=1}^{M} \Pr\left(f_{AMC}(r(k)) = m \mid H_m\right) \tag{2.8} \]

2.3 Machine Learning in AMC

Machine learning trains a parametric data-driven model from historical data without using explicit instructions, relying on patterns and inference instead. For a training dataset $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$, $\mathbf{x}_i$ is the $i$-th received sequence and $y_i$ is the corresponding modulation scheme index. We assume the examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ are Independent and Identically Distributed (I.I.D). Each $y_i$ is generated by an unknown function $y_i = h(\mathbf{x}_i)$.
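
Before continuing, the average probability of correct classification in Equation 2.8 can be estimated from labelled pairs such as those in $D$. The following is a minimal Python sketch under stated assumptions, not the implementation used in this thesis: the function name, the zero-based class indices, and the toy labels are all hypothetical. It computes the empirical per-class rate of correct decisions and then applies the $1/M$ average.

```python
def average_correct_classification(true_labels, predicted_labels, num_classes):
    """Estimate P_c of Equation 2.8 from labelled examples.

    Labels are integers 0..num_classes-1 (zero-based here, an assumption
    of this sketch; the text indexes classes from 1 to M).
    """
    totals = [0] * num_classes   # number of examples observed under each H_m
    correct = [0] * num_classes  # number of those that were classified as m
    for y_true, y_pred in zip(true_labels, predicted_labels):
        totals[y_true] += 1
        if y_pred == y_true:
            correct[y_true] += 1
    if 0 in totals:
        raise ValueError("every class needs at least one example")
    # Empirical Pr(f_AMC(r) = m | H_m) for each class, then the 1/M average.
    per_class = [c / t for c, t in zip(correct, totals)]
    return sum(per_class) / num_classes


# Toy labels for M = 3 classes (made up for illustration only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(average_correct_classification(y_true, y_pred, 3))
```

In this toy case the per-class rates are 1/2, 1 and 1/2, so the estimate is 2/3. Because every class carries the same weight regardless of how many examples it has, the estimate matches the equiprobable assumption made for Equation 2.8.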
The goal of a machine learning algorithm is to discover a function $f$ that approximates the true function $h$.

The input signal matrix $\mathbf{X}$ and the output label vector $\mathbf{Y}$ are denoted by

$$ \mathbf{X} = [\mathbf{x}_1^T, \mathbf{x}_2^T, \cdots, \mathbf{x}_N^T]^T \in \mathbb{R}^{N \times L} \tag{2.9} $$

$$ \mathbf{Y} = [y_1, y_2, \ldots, y_N]^T \in \mathbb{R}^{N} \tag{2.10} $$

where $\mathbf{x}_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L}]^T \in \mathbb{R}^{L}$ for $i = 1, \ldots, N$ is the received signal or the feature vector at the $i$th observation, $L$ is the signal sample length, and $N$ is the number of samples.

For a continuous output variable $y_i \in \mathbb{R}$, $f$ is called a regressor; for a categorical output variable $y_i \in \{1, \ldots, M\}$, $f$ is described as a classifier. Therefore, our proposed model is called a modulation classifier.

For a new signal $\mathbf{x}_{new}$ in the test set, the modulation predictor is given by $\hat{y} = f(\mathbf{x}_{new}; \theta)$, where $\theta$ represents the AMC model parameters. The estimation of $\theta$ is an optimization problem regarding the training loss $J(\theta)$ averaged over all training examples. Hence, $\theta$ is computed as

$$ \hat{\theta} = \arg\min_{\theta} J(\theta), \qquad J(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(\mathbf{x}_i, y_i, \theta) \tag{2.11} $$

where $\mathcal{L}(\mathbf{x}_i, y_i, \theta)$ is the point-wise loss function of the true modulation scheme $y_i$ and the predicted modulation label $f(\mathbf{x}_i; \theta)$.

2.4 DL in AMC

2.4.1 Classic NN

Based on the model of human brain neurons, the classic NN projects the input signal sequence space into a linearly separable space by feeding the weighted sum of inputs into a non-linear activation function $f_{act}(\cdot)$.
Let the input and output vectors of the classic NN layer $l$ be $\mathbf{x}^{l-1}$ and $\mathbf{x}^{l}$; then $\mathbf{x}^{l}$ can be described as

$$ \mathbf{x}^{l} = f_{act}(\mathbf{W}^{l}\mathbf{x}^{l-1} + \mathbf{b}^{l}) \tag{2.12} $$

where the weight matrix $\mathbf{W}^{l}$ has the dimension $N^{l} \times N^{l-1}$, the bias $\mathbf{b}^{l}$ has the dimension $N^{l} \times 1$, the activation function $f_{act}$ is applied element-wise to the $N^{l}$ outputs, and the number of trainable parameters is $N^{l} \times N^{l-1} + N^{l}$.

2.4.2 CNN

CNN uses a cascade of multiple hidden layers with non-linear logistic functions to transform high-level pertinent information directly from the original data into a manageable reduced-dimension representation.

Derived from feed-forward NNs, CNN replaces general matrix multiplication with convolution. Owing to parameter sharing and local connection, CNN is more memory-efficient and invariant to various data transformations.

• Parameter sharing: Unlike Equation 2.12, CNN uses convolutions to compute the output neuron. Each hidden neuron in CNN has the same shared weight matrix and bias connected to its local receptive field. Compared with the fully-connected NN, CNN shares parameters between different neurons to reduce parameter storage and enforce translation invariance [7].

• Local connection: Classic NN connects every input neuron to every hidden unit, while CNN achieves sparse connectivity by connecting local receptive fields. The local receptive field slides across the entire input matrix with a certain movement step size known as the stride length.

CNN relies on back-propagation with SGD to extract low-level features from raw inputs and higher-level features from previous layers.

In general, there are three typical layers in CNN architectures: convolutional layer, activation layer, and fully-connected or dense layer.
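The two CNN properties listed above can be illustrated with a minimal sketch. It is not the layer implementation used in this thesis: the function name `conv1d`, the example signal, and the difference kernel are illustrative assumptions, and the code uses plain Python so that it stays self-contained. One shared kernel is slid over the input with a given stride, and the parameter count is compared with the $N^{l} \times N^{l-1} + N^{l}$ count of a dense layer that maps the same input to the same number of outputs.

```python
def conv1d(x, kernel, bias=0.0, stride=1):
    """'Valid' 1D convolution as commonly used in DL libraries (no kernel flip)."""
    k = len(kernel)
    # The same kernel and bias are reused at every position (parameter sharing);
    # each output only sees k neighbouring inputs (local connection).
    return [
        sum(w * x[start + j] for j, w in enumerate(kernel)) + bias
        for start in range(0, len(x) - k + 1, stride)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # illustrative input, not RadioML data
kernel = [1.0, 0.0, -1.0]                 # illustrative difference filter

out_stride1 = conv1d(signal, kernel, stride=1)
out_stride2 = conv1d(signal, kernel, stride=2)

# Trainable parameters: one shared kernel plus one bias, versus a dense layer
# mapping the same 6 inputs to the same 4 outputs (N^l * N^(l-1) + N^l).
conv_params = len(kernel) + 1
dense_params = len(out_stride1) * len(signal) + len(out_stride1)

print(out_stride1, out_stride2, conv_params, dense_params)
```

A larger stride shortens the output because the receptive field skips positions, while the number of convolutional parameters stays independent of the input length.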
Please refer to Appendix A for details.

2.4.3 Prevent Overfitting in DL

There are many modern techniques (such as pooling and dropout) that can be applied to prevent overfitting.

A pooling layer, or down-sampling layer, reduces the input dimensionality by computing the average value or the maximum value of a windowed input vector (average-pooling or max-pooling), which is defined as

$$ \mathbf{x}^{l+1} = f_{pool}(\mathbf{x}^{l}) \tag{2.13} $$

where $f_{pool}(\cdot)$ is the pooling function.

Based on the fact that the exact locations of the found features are not as important as their rough locations relative to other features, a pooling layer produces a condensed feature map by throwing away the exact position information. After pooling, $\mathbf{x}^{l+1}$ is more computationally efficient and more robust to small translations of $\mathbf{x}^{l}$.

Apart from pooling, dropout is also commonly applied to prevent overfitting. Dropout skips the weight updates of a subset of nodes and results in more independent nodes in DL models.

2.4.4 Loss function in DL

Back-propagation is the optimization process that updates the parameters effectively and iteratively, in order to solve problems such as Equation 2.11 in traditional machine learning. In AMC experiments, the Adam optimizer [26] is utilized. Under the Adam optimizer, the parameter $\theta_{n+1}$ is updated as

$$ \theta_{n+1} = \theta_n - \eta\,\frac{\partial \mathcal{L}(\mathbf{Y}, f_{AMC}(\mathbf{X}, \theta_n))}{\partial \theta_n} \tag{2.14} $$

where $\eta$ is the learning rate for the Adam optimizer and $\mathcal{L}(\cdot)$ is the chosen loss function.

In the multi-class classification problem, the cross-entropy loss function is commonly introduced to measure the deviation between the desired and actual output across an entire layer, and this loss function can be expressed as

$$ \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{M}\sum_{i=1}^{M}\big[\,y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\big] \tag{2.15} $$

where $y_i$ represents the true label, $\hat{y}_i = f_{AMC}(\mathbf{x}_i)$ is the predicted probability of the $i$th class by the model $f_{AMC}$, and $M$ is the number of modulation categories.

2.5 Summary

In this chapter, we first describe the wireless communication system model and formulate the AMC problem. In general, AMC is a multi-class classification problem.
The fundamentals of ML and DL for the AMC problem are discussed in Section 2.3 and Section 2.4. For this specific multi-class classification problem, we choose the cross-entropy function as the loss function.

Chapter 3
Signal Representations and AMC Models

Since the RadioML 2016.10a dataset has been widely used to evaluate AMC models, we choose it to compare fairly with the benchmark models without generating new datasets. The simulation setup used to obtain the data samples is presented in Appendix B.

In this chapter, the five forms of signal representation and three data pre-processing methods are reviewed for AMC. In Section 3.3, we correspondingly tailor the different types of DL-based AMC models for the five forms of signal representation. Finally, the dataset split and implementation details are introduced in Section 3.4.

The notations in this chapter are as follows:

• $\mathbf{y}_i$: The $i$th one-hot encoding label vector (see footnote 1).
• $\mathbf{x}_i$: The $i$th complex signal with $L$ data points.
• $\mathbf{x}_i^{I}$: The $i$th in-phase signal with $L$ data points. Each element of $\mathbf{x}_i^{I}$ is represented by $x_{i,j}^{I}$, where $j \in \{1, 2, \cdots, L\}$.
• $\mathbf{x}_i^{Q}$: The $i$th quadrature signal with $L$ data points. Each element of $\mathbf{x}_i^{Q}$ is represented by $x_{i,j}^{Q}$, where $j \in \{1, 2, \cdots, L\}$.
• $\mathbf{x}_i^{A}$: The $i$th amplitude vector of complex signal $\mathbf{x}_i$. Each element of $\mathbf{x}_i^{A}$ is represented by $x_{i,j}^{A}$, where $j \in \{1, 2, \cdots, L\}$.
• $\mathbf{x}_i^{P}$: The $i$th phase vector of complex signal $\mathbf{x}_i$. Each element of $\mathbf{x}_i^{P}$ is represented by $x_{i,j}^{P}$, where $j \in \{1, 2, \cdots, L\}$.
• $\mathbf{x}_i^{F}$: The $i$th frequency domain signal of $\mathbf{x}_i$. Each element of $\mathbf{x}_i^{F}$ is represented by $x_{i,k}^{F}$, where $k \in \{1, 2, \cdots, L\}$.

Footnote 1: In AMC, each signal is associated with categorical data, i.e., the corresponding modulation scheme. One-hot encoding converts the categorical data into numerical data. It produces a vector with length equal to the number of categories in the dataset. If an element belongs to the $i$th category, then all elements of the vector are assigned a value of 0 except for the $i$th element, which is assigned a value of 1 [27].

3.1 Signal Representations

3.1.1 Time-Domain IQ Signal

When the RadioML dataset is used, an IQ data sample consists of $N$ time-domain complex IQ signals. Denoting the $i$th $L$-dimensional in-phase term by $\mathbf{x}_i^{I}$ and the $i$th $L$-dimensional quadrature term by $\mathbf{x}_i^{Q}$, an IQ sample is given by

$$ D_{IQ} = \{(\mathbf{x}_i^{I}, \mathbf{x}_i^{Q}), \mathbf{y}_i\}_{i=1}^{N}. \tag{3.1} $$

3.1.2 Time-Domain AP Signal

When the $j$th term of the $i$th in-phase vector is $x_{i,j}^{I}$ and the $j$th term of the $i$th quadrature vector is $x_{i,j}^{Q}$, we respectively define the terms $x_{i,j}^{A}$ and $x_{i,j}^{P}$ as

$$ x_{i,j}^{A} = \mathrm{Amplitude}(x_{i,j}) = \sqrt{(x_{i,j}^{I})^2 + (x_{i,j}^{Q})^2} \tag{3.2} $$

and

$$ x_{i,j}^{P} = \mathrm{Phase}(x_{i,j}) = \arctan(x_{i,j}^{Q} / x_{i,j}^{I}). \tag{3.3} $$

Therefore, each IQ sample can be transformed to an AP sample in the polar coordinate as

$$ D_{AP} = \{(\mathbf{x}_i^{A}, \mathbf{x}_i^{P}), \mathbf{y}_i\}_{i=1}^{N}. \tag{3.4} $$

3.1.3 Frequency Spectrum Signal

We perform the one-dimensional (1D) $L$-point Discrete Fourier Transform (DFT) over the IQ sample using the efficient FFT algorithm. Computing the $L$-point DFT directly requires $O(L^2)$ arithmetic operations, which is time-consuming. The FFT algorithm exploits symmetries in the DFT computation to reduce the complexity to $O(L \log L)$.

Performing the FFT operation over the $i$th complex IQ signal, we obtain a complex vector $\mathbf{x}_i^{F}$. More specifically, the $k$th element of $\mathbf{x}_i^{F}$ is obtained as

$$ x_{i,k}^{F} = \sum_{p=1}^{L} x_{i,p}\, e^{-j\frac{2\pi}{L}kp}, \quad k = 1, \ldots, L. \tag{3.5} $$

Therefore, a frequency spectrum sample is given by

$$ D_{FFT} = \{(\Re\{\mathbf{x}_i^{F}\}, \Im\{\mathbf{x}_i^{F}\}), \mathbf{y}_i\}_{i=1}^{N}. \tag{3.6} $$

3.1.4 Image-Domain 3D-CI

In the constellation diagram, the horizontal x-axis is the in-phase term of complex IQ signals, and the vertical y-axis is the quadrature term of complex IQ signals. An $L$-dimensional complex IQ signal corresponds to $L$ points in the constellation diagram. More specifically, the $j$th point is represented as $(x_{i,j}^{I}, x_{i,j}^{Q})$, where $x_{i,j}^{I}$ is the $j$th in-phase term of the $i$th complex IQ signal and $x_{i,j}^{Q}$ is the $j$th quadrature term of the $i$th complex IQ signal.
Unless otherwise specified, we select a constellation diagram (CD) having a size of 0.015×0.015 for the RadioML dataset to avoid the overlapping of constellation points and to include as many signal points as possible.

The carrier phase shift from a reference phase is equal to the counterclockwise angle of the constellation point from the horizontal x-axis. The distance of a constellation point to the origin represents the signal amplitude. The distance between different points indicates the ability of a receiver to differentiate modulation schemes under additive noise. In practice, the CDs are usually a “cloud” of points surrounding each symbol position. Since the different “cloud” areas in the CD have different densities of sample points, we use the differences in density to make the disturbed signals more discernible. Therefore, we convert the traditional 2D-CD into the 3D-CI, where the third dimension is the signal density.

Density estimation techniques consist of mixture models and neighbor-based approaches, e.g., the non-parametric Kernel Density Estimation (KDE). We use KDE to reconstruct the probability density function (PDF) for the 3D-CI.

Assuming the observed signal $\{x_{i,j}\}_{j=1}^{L}$ is univariate and i.i.d., we want to estimate the underlying unknown PDF. Mathematically, the kernel density estimator at a point $z$ within $\{x_{i,j}\}_{j=1}^{L}$ is given by

$$ \hat{\rho}_h(z) = \frac{1}{Lh}\sum_{j=1}^{L} K\!\left(\frac{z - x_{i,j}}{h}\right) \tag{3.7} $$

where $K(\cdot)$ is the non-negative, smooth and symmetric kernel function controlled by the bandwidth parameter $h$.

We choose the common Gaussian kernel function, which is defined by

$$ K(x; h) \propto e^{-\frac{x^2}{2h^2}} \tag{3.8} $$

where the smoothing bandwidth $h$ controls the trade-off between the bias and variance of the estimator. More specifically, the smoothness of $\hat{\rho}_h(z)$ increases with $h$.
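As a minimal sketch of Equations 3.7 and 3.8, the univariate Gaussian KDE can be written in a few lines of plain Python. The function name `gaussian_kde`, the toy sample values, and the bandwidths are illustrative assumptions and are not taken from the thesis experiments. The kernel is normalized by $1/\sqrt{2\pi}$ so that the estimate integrates to one, which Equation 3.8 leaves implicit through the proportionality sign.

```python
import math


def gaussian_kde(samples, z, h):
    """Evaluate the Gaussian kernel density estimate at point z (Equation 3.7)."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-((z - x) ** 2) / (2.0 * h * h)) for x in samples
    )


# Toy 1D "cloud": three points clustered near -1 and a single point at +1.
samples = [-1.0, -0.9, -1.1, 1.0]

dense_region = gaussian_kde(samples, -1.0, h=0.2)
sparse_region = gaussian_kde(samples, 1.0, h=0.2)
empty_region = gaussian_kde(samples, 0.0, h=0.2)

# A larger bandwidth gives a smoother, flatter estimate.
smoothed_peak = gaussian_kde(samples, -1.0, h=1.0)

# Riemann-sum check that the estimate behaves like a PDF on [-3, 3].
step = 0.01
area = sum(gaussian_kde(samples, -3.0 + n * step, h=0.2) * step for n in range(601))

print(dense_region, sparse_region, empty_region, smoothed_peak, area)
```

The density value at each constellation point is what would supply the third (color) dimension of a 3D-CI; the 2D case follows the same pattern with a bivariate kernel.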
With the estimated PDF $\hat{\rho}_h(z)$, we can convert the CD into a colorful 3D-CI. Figure 3.1 illustrates the CD, 3D-CI, and 3D-CI with noise reduction of 30,000 QPSK samples when the values of SNR are 2 dB and 14 dB.

Figure 3.1: 30,000 QPSK constellation points at SNR = 2 dB (upper) and 14 dB (lower)

At high SNR, CD and 3D-CI already reveal enough statistical information for AMC, and the noise reduction brings little improvement. However, at low SNR, the CD is obscured by noise, and the 3D-CI with color density information can provide richer amplitude and phase information. In addition, noise reduction can concentrate the constellation points to build a clear “cloud”. The details of noise reduction will be discussed in Section 3.2.1.

3.1.5 Statistical-Domain Manually Extracted Features

We convert a complex IQ signal into a $K$-dimensional feature vector $\mathbf{f}_i$ based on the extensive feature set, which contains the statistical and instantaneous features [28]. Using the time-average features, the negative effects of background noise and phase rotation can be mitigated [29]. The High-order Moments (HOM) and High-order Cumulants (HOC) vary for different modulated signals and have good anti-noise performance. Therefore, we include the HOM and HOC as two features. Besides, kurtosis (K), skewness (S), Peak to Average Ratio (PAR) and Peak to Root-mean-square Ratio (PRR) are also included. Table C.1 in Appendix C lists the selected 23 statistical and instantaneous features [30–32].

3.2 Data Preprocessing

Before feeding the data samples to the AMC models, several pre-processing operations are performed, namely noise reduction, signal normalization and label smoothing.

3.2.1 Noise Reduction

Since the AWGN term $n(t)$ can compromise the performance of AMC models, we use the Gaussian low-pass filter $f_G(\cdot)$ to reduce the noise before any other data pre-processing method.
The filter $f_G(\cdot)$ attenuates high-frequency outliers and keeps the low-frequency details of the received signal. The impulse response of the one-dimensional Gaussian filter $f_G(\cdot)$ with standard deviation $\sigma$ is given by

$$ f_G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}. \tag{3.9} $$

Mathematically, the input signal is convolved with the Gaussian function $f_G(x)$. In theory, $f_G(x)$ is non-zero everywhere, which would require an infinitely large convolution kernel. In practice, $f_G(x)$ is effectively zero more than about four standard deviations from the mean, and therefore we truncate the convolution kernel at this point.

The effectiveness of Gaussian low-pass filtered signals can be found in Figure 3.1. Experiments have demonstrated that setting $\sigma = 0.8$ can remove most outliers and keep more constellation points in the 3D-CI signal representation.

3.2.2 Normalization

After passing the Gaussian filter, we normalize the training samples to make the automatic feature extraction more robust. The normalization applies a linear operation to the original data samples such that the normalized data samples have zero mean and unit variance. For example, the normalization operation on the $j$th point of the $i$th signal is given by

$$ x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}, \quad j \in \{1, 2, \cdots, L\} \tag{3.10} $$

where the $j$th element of the mean vector, denoted by $\mu_j$, is obtained as

$$ \mu_j = \frac{1}{N_{train}}\sum_{i=1}^{N_{train}} x_{i,j}, \quad j \in \{1, 2, \cdots, L\} \tag{3.11} $$

and the $j$th element of the standard deviation vector, denoted by $\sigma_j$, is obtained as

$$ \sigma_j = \sqrt{\frac{1}{N_{train}}\sum_{i=1}^{N_{train}} (x_{i,j} - \mu_j)^2}, \quad j \in \{1, 2, \cdots, L\}. \tag{3.12} $$

Here, $N_{train}$ is the number of training signals, and the values of $\mu$ and $\sigma$ are computed over the training set and fixed for the test set (see footnote 2).

3.2.3 Label Smoothing

In our experiments, the modulation scheme is represented by one-hot encoding and later modified by label smoothing. With label smoothing, the difference between the true label value and the wrong label value becomes a constant that depends on a smoothing parameter $\alpha$.
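The constant gap mentioned above can be checked numerically with a minimal sketch of the smoothing rule of Equation 3.13. The function name `smooth_label` is an illustrative assumption; the values $\alpha = 0.1$ and $M = 11$ follow the settings stated in this chapter.

```python
def smooth_label(one_hot, alpha=0.1):
    """Blend a one-hot label with the uniform distribution over M classes."""
    m = len(one_hot)
    return [y * (1.0 - alpha) + alpha / m for y in one_hot]


num_classes = 11      # modulation schemes in RadioML 2016.10a
true_index = 3        # arbitrary example class
one_hot = [1.0 if k == true_index else 0.0 for k in range(num_classes)]

smoothed = smooth_label(one_hot, alpha=0.1)

true_value = smoothed[true_index]   # 1 - alpha + alpha / M
wrong_value = smoothed[0]           # alpha / M
gap = true_value - wrong_value      # equals 1 - alpha for any M

print(true_value, wrong_value, gap, sum(smoothed))
```

The smoothed vector still sums to one, so it remains a valid target distribution for the cross-entropy loss.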
Moreover, the activation values of the penultimate layer of the model are close to the true class and equidistant to the wrong classes [33]. Hence, the model with label smoothing is automatically calibrated and less prone to overfitting. For the $m$th modulation, we smooth the traditional label vector $\mathbf{y}_m$ by

$$ \mathbf{y}'_m = \mathbf{y}_m(1 - \alpha) + \frac{\alpha}{M} \tag{3.13} $$

where the smoothing parameter $\alpha$ equals 0.1.

Footnote 2: This trick can accelerate the training process and prevent overfitting.

3.3 Corresponding AMC Model for Each Signal Representation

Based on the five forms of signal representation in Section 3.1, we correspondingly introduce five models by tailoring different types of neural networks:

• Attention-based model of AP signal (AP-Attention).
• Combined CNN, LSTM and DNN model with multi-scale additions of IQ signals (IQ-CLDNN).
• MobileNet version 2 model of 3D-CI images (Img-MobileNetV2).
• CNN-based model of statistical features (Features-CNN).
• CNN-based model of frequency spectrum signals (FFT-CNN).

Among the five models, we propose the AP-Attention and IQ-CLDNN models for the first time. Though the MobileNet model has achieved great success in the discipline of computer vision, its performance is unknown in AMC problems. Therefore, we tailor the MobileNet model for the proposed 3D-CI signals. We consider the Features-CNN and FFT-CNN as benchmarks [1, 17].

3.3.1 AP-Attention Model

The proposed AP-Attention model consists of an LSTM-based module to extract low-level features, an attention-based module to include importance weights, and a classification module to obtain the probabilities for different modulations.
The input of the AP-Attention model has a unified size of 128 × 2 for the AP signals. Figure 3.2 provides the schematic diagram of the AP-Attention model. The term “None” in the diagram represents the first dimension of a variable. This dimension is usually ignored when building a model and is defined during the prediction period.

Figure 3.2: AP-Attention model schematic diagram for batch i (LSTM module, attention module, classification module)

Table 3.1: LSTM Parameters

Variable                                  Definition
$h$                                       The number of hidden units
$d$                                       The number of input features
$\sigma_g$                                The sigmoid activation function
$\sigma_h$                                The tanh activation function
$\mathbf{x}_t \in \mathbb{R}^{d}$         The input vector to the LSTM unit
$\mathbf{f}_t \in \mathbb{R}^{h}$         The activation vector of forget gate
$\mathbf{i}_t \in \mathbb{R}^{h}$         The activation vector of input gate
$\mathbf{o}_t \in \mathbb{R}^{h}$         The activation vector of output gate
$\mathbf{h}_t \in \mathbb{R}^{h}$         The hidden state vector of the LSTM unit
$\mathbf{c}_t \in \mathbb{R}^{h}$         The current cell state vector
$\mathbf{W} \in \mathbb{R}^{h \times d}$  The weight matrix of input vector $\mathbf{x}_t$
$\mathbf{U} \in \mathbb{R}^{h \times h}$  The weight matrix of previous hidden state vector $\mathbf{h}_{t-1}$
$\mathbf{b} \in \mathbb{R}^{h}$           The trained bias vector

The LSTM module can alleviate the gradient vanishing problem by exploiting a gating mechanism with explicit memory releasing and updating. The variables of the LSTM module are listed in Table 3.1.

Typically, an LSTM block is composed of a cell state vector $\mathbf{c}_t$ to record values over arbitrary time intervals and three gates to regulate the information flow.
The functionalities of the three gates are [34]:

• Input gate with weight matrices $\mathbf{W}_i$, $\mathbf{U}_i$ and bias $\mathbf{b}_i$ to control the extent to which a new value flows into the cell state.
• Forget gate with weight matrices $\mathbf{W}_f$, $\mathbf{U}_f$ and bias $\mathbf{b}_f$ to control the extent to which a value is kept in or discarded from the cell state.
• Output gate with weight matrices $\mathbf{W}_o$, $\mathbf{U}_o$ and bias $\mathbf{b}_o$ to control the extent to which the cell value is used to compute the output activation of the LSTM unit.

Figure 3.3: LSTM block diagram (forget gate, input transform gate, input gate, output gate, output activation)

The mechanism of the LSTM module is described by the following equations using the current input $\mathbf{x}_t$ and the previous state $\mathbf{h}_{t-1}$ [35]:

• Gates

$$ \mathbf{i}_t = \sigma_g(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i) \tag{3.14} $$

$$ \mathbf{f}_t = \sigma_g(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f) \tag{3.15} $$

$$ \mathbf{o}_t = \sigma_g(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o) \tag{3.16} $$

• Input transform

$$ \mathbf{c}^{in}_t = \sigma_h(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c) \tag{3.17} $$

• State update

$$ \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{c}^{in}_t \tag{3.18} $$

$$ \mathbf{h}_t = \mathbf{o}_t \odot \sigma_h(\mathbf{c}_t) \tag{3.19} $$

where the initial values $\mathbf{c}_0$ and $\mathbf{h}_0$ are zero, the subscript $t$ represents the time step, and the operator $\odot$ denotes the element-wise product. Figure 3.3 presents the LSTM block diagram and illustrates the above equations. The input, output, and forget gate functions are the sigmoid function. The input transform and output activation functions are the tanh function.

In the LSTM module, we exploit two layers of LSTM with a hidden size $l_s$ of 128, along with a dropout layer with a ratio of 0.5, to encode the AP sequences. In the AP-Attention model, the input $\mathbf{x}_t$ of the LSTM module is denoted by $\mathbf{x}_i$, where $i \in \{1, \cdots, b_s\}$. Thus, the output of the LSTM $\mathbf{h}_i \in \mathbb{R}^{L \times l_s}$ is described as

$$ \mathbf{h}_i = f_{LSTM}(\mathbf{x}_i; \theta_{LSTM}) \tag{3.20} $$

where $L$ is the signal length and $\theta_{LSTM}$ denotes the parameters of the LSTM module.

The second module in the AP-Attention model is the attention module, whose architecture is illustrated in Figure 3.4. We explain Figure 3.4 through Equation 3.21 to Equation 3.25.

Figure 3.4: Schematic diagram of the attention module

We first generate the attention score [35]. Let $\mathbf{H}$ be the annotation matrix with the LSTM extracted vectors $[\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_{b_s}]^T$, where $b_s$ is the batch size and $\mathbf{h}_i \in \mathbb{R}^{L \times l_s}$ is the LSTM feature vector with sample length $L$ and hidden size $l_s$. We feed $\mathbf{h}_i$ into an additional feed-forward densely-connected network such that the attention score vector $\mathbf{u}_i \in \mathbb{R}^{L \times l_s}$ is obtained.

As a hidden representation of $\mathbf{h}_i$, the vector $\mathbf{u}_i$ is defined as

$$ \mathbf{u}_i = f_c(\mathbf{W}_c \cdot \mathbf{h}_i) \tag{3.21} $$

where $f_c$ represents the fully-connected network and $\mathbf{W}_c$ is the weight matrix, which is randomly initialized and jointly trained.

The attention score $\mathbf{s}_i \in \mathbb{R}^{L}$ is computed by

$$ \mathbf{s}_i = \mathbf{u}_i \cdot \mathbf{h}_t \tag{3.22} $$

where $\mathbf{h}_t \in \mathbb{R}^{l_s}$ is the last LSTM hidden state vector and is extracted from the matrix $\mathbf{H}$.

With the attention scores $\mathbf{s}_i$ and the softmax activation function $f_{softmax}(\cdot)$, the attention weights $\boldsymbol{\alpha}_i$ are given by $\boldsymbol{\alpha}_i = f_{softmax}(\mathbf{s}_i) \in \mathbb{R}^{L}$. The large values in $\boldsymbol{\alpha}_i$ force the network to focus on the corresponding part of the input $\mathbf{x}_i$.

After that, we compute the context vector $\mathbf{v}_i \in \mathbb{R}^{l_s}$, which is a weighted combination of the attention weights $\boldsymbol{\alpha}_i$ and $\mathbf{h}_i$ given by

$$ \mathbf{v}_i = \boldsymbol{\alpha}_i \cdot \mathbf{h}_i. \tag{3.23} $$

We then concatenate the context vector $\mathbf{v}_i$ and the last hidden state $\mathbf{h}_t$ into an attention output vector $\mathbf{v}^{a}_i \in \mathbb{R}^{l_s}$. Finally, the attention vector $\mathbf{a}_i \in \mathbb{R}^{l_a}$ with the attention output size $l_a$ is computed by feeding $\mathbf{v}^{a}_i$ into a DNN having 128 neurons,

$$ \mathbf{v}^{a}_i = [\mathbf{v}_i, \mathbf{h}_t] \tag{3.24} $$

$$ \mathbf{a}_i = f_a(\mathbf{W}_a \cdot \mathbf{v}^{a}_i) \tag{3.25} $$

where $f_a$ is the DNN layer with attention weight matrix $\mathbf{W}_a \in \mathbb{R}^{l_a \times l_s}$.

In summary, the attention module has full access to the input sequences. Although the weighted combination increases the computational burden, the attention module produces a more targeted and better-performing model.

The last module is the classification module. The classification module takes the attention layer output $\mathbf{a}_i$ as the input. We feed $\mathbf{a}_i$ into a DNN with $M$ units, where $M$ is the number of modulation schemes.
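Returning briefly to the attention module, the score, weight, and context computations of Equations 3.22 and 3.23 can be sketched for a single sequence as follows. This is a toy illustration in plain Python rather than the Keras layers used in the thesis: the names `softmax` and `attention` and the example numbers are assumptions, and the dense projection $f_c$ of Equation 3.21 is replaced by the identity so that $\mathbf{u}_i = \mathbf{h}_i$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(u, h, h_last):
    """u, h: L x ls nested lists; h_last: last hidden state of length ls."""
    # Equation 3.22: one score per time step (dot product with h_last).
    scores = [sum(a * b for a, b in zip(row, h_last)) for row in u]
    weights = softmax(scores)
    # Equation 3.23: context vector as the weighted sum of hidden states.
    context = [
        sum(w * row[d] for w, row in zip(weights, h))
        for d in range(len(h_last))
    ]
    return weights, context


# Toy sequence with L = 3 time steps and hidden size ls = 2.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
last_state = hidden[-1]

weights, context = attention(hidden, hidden, last_state)
print(weights, context)
```

The time step whose hidden state is most aligned with the last state receives the largest weight and therefore dominates the context vector.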
Then a softmax classifier is used to predict the modulation scheme $\hat{y}_i$. The process in the classification module is as follows:

$$ \mathbf{r}_i = \tanh(\mathbf{W}_c \mathbf{a}_i + \mathbf{b}_c) \tag{3.26} $$

$$ \hat{p}(\mathbf{y}_i \mid \Theta) = \mathrm{softmax}(\mathbf{W}_s \mathbf{r}_i + \mathbf{b}_s) \tag{3.27} $$

$$ \hat{y}_i = \arg\max_{\mathbf{y}_i} \hat{p}(\mathbf{y}_i \mid \Theta) \tag{3.28} $$

where $\mathbf{W}_c$ and $\mathbf{b}_c$ are DNN layer parameters, $\mathbf{W}_s$ and $\mathbf{b}_s$ are softmax layer parameters, and $\hat{p}(\mathbf{y}_i \mid \Theta)$ represents the probability vector over the modulation schemes.

3.3.2 IQ-CLDNN Model

Speech recognition performance can be improved by combining CNNs, LSTMs, and DNNs in a unified framework [36]. Thus, we propose a CLDNN model to combine CNN, LSTM, and DNN. We also introduce multi-scale additional connections to include more features. Figure 3.5 shows the structure of the IQ-CLDNN model.

Figure 3.5: IQ-CLDNN model schematic diagram for batch i

The first module is the CNN module. Passing the input signal to CNNs before LSTMs can help reduce the input variance. Thus the temporal modeling in the LSTM is processed on higher-level features extracted by the CNNs. Specifically, we choose two 1D-CNNs with 256 and 128 units. A convolutional kernel of size nine is chosen for each CNN.

After the CNN module, we pass the output to the LSTM module to model the sequence input temporally. We use two layers of LSTM with returned sequences and 128 cells.

A DNN module provides a mapping between hidden units and outputs, which helps the AMC model separate different modulation schemes. Therefore, we pass the output of the LSTM module to two DNNs with 64 and 11 hidden units. For activation functions, the first DNN layer uses ReLU and the second DNN layer uses softmax to output probability scores for each modulation scheme.

Compared with the CLDNN model in [37], we add two additional multi-scale connections to capture information at different resolutions. The additional connections are shown in Figure 3.5 by dashed streams.
The “Permute” block changes the input dimensions, and the “Concatenate” block concatenates the two inputs together.

The first additional connection forwards the long-term features from the CNN module and the original short-term input features to the LSTM module. The second connection directly passes the outputs from the LSTM module and the CNN module into the DNN module without extra layers. The two direct connections add only a negligible number of network parameters.

3.3.3 Img-MobileNetV2 Model

As described in Section 3.1.4, our inputs are batches of Gaussian filtered 3D-CIs with a size of (224, 224, 3). We convert the AMC problem into a multi-class classification problem. The popular MobileNetV2 [25] model was proposed for running deep networks efficiently on personal mobile devices. Hence, we choose the MobileNetV2 model for the real-time AMC problem under resource-constrained environments. The training parameters are provided in Section 3.4.

3.3.4 Features-CNN Model

Statistical features have been proven to be a robust input format for the AMC problem [17]. Based on Section 3.1.5, our input is in the form of $(b_s, 23, 2)$ with 23 manually selected features. To prevent overfitting, we use a simple model with four 1D-CNNs and two DNNs. A dropout layer with a dropout rate of 0.6 is applied after each CNN and DNN layer.

The numbers of CNN filters for the four 1D-CNN layers are 256, 128, 64, and 64. The classification module has a DNN of 128 neurons with the Rectified Linear Unit (ReLU) activation function and another DNN of 11 neurons with the softmax activation function.

3.3.5 FFT-CNN Model Architecture

Based on the definition in Section 3.1.3, the frequency spectrum data $D_{FFT}$ is in the form of $(b_s, 128, 2)$ with batch size $b_s$. The FFT-CNN model has three layers of one-dimensional (1D) CNN and two layers of DNN.

Specifically, the three 1D-CNNs have 256, 128, and 64 filters of the same size of nine. Each 1D-CNN is followed by a batch normalization layer and a dropout layer with a dropout rate of 0.4.
After the CNNs, the input feature maps are flattened to 1D. The DNN classification module has two DNN layers. The first DNN has 128 neurons and the second DNN has 11 neurons.

3.4 Experiment Setup

3.4.1 Dataset Split

In the RadioML dataset, there are a total of $N_{SNR} \times M \times L = 20 \times 11 \times 1000 = 220{,}000$ samples. We split the dataset into three parts: training set, validation set, and test set. We randomly select 50% of the samples for training with a batch size $b_s$ of 1024, and 25% of the samples for validation. After training, the performance of the AMC model is tested using the remaining 25% of the samples. All sets are uniformly distributed over the different SNR values.

3.4.2 Implementation Details

Benefiting from its user-friendliness, modularity and easy extensibility, we choose the open-source NN library “Keras” to develop our models (see footnote 3). With additional support for CNN, Recurrent Neural Network (RNN), and other commonly used layers, Keras allows non-ML researchers to focus on their model construction and training settings.

The AMC models are trained and validated on an Intel(R) Core i7-8700 CPU @ 3.20 GHz with 12 threads and 16 GB RAM.

As mentioned in Section 2.4.4, the categorical cross-entropy measures the probability error, and the Adam optimizer estimates the model parameters with a learning rate of 0.001. The training of AMC models stops when the validation loss does not improve for the last 20 epochs. The model having the lowest validation loss is selected for evaluation. All AMC models use the same training parameters unless specified explicitly.

In the Img-MobileNetV2 model, the SGD optimizer with an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of $10^{-4}$ is used for training. The total number of training epochs is 120. The learning rate is divided by 10 every 40 epochs.

3.5 Summary

In this chapter, we introduce five different signal representations for the AMC problem.
Before we feed raw signals of different signal representations into AMC models, the three data pre-processing methods of Section 3.2 are applied. In Section 3.3, we build a DL-based AMC model for each signal representation. Finally, the experiment setting, including dataset split and implementation details, is introduced in Section 3.4.

Footnote 3: https://keras.io/

Chapter 4
Performance of AMC Models

In this chapter, we perform extensive experiments to evaluate the performance of AMC models. The accuracy rate is introduced in Section 4.1, and the confusion matrix is introduced in Section 4.2. In Section 4.3, we regard AMC models as soft classifiers or hard classifiers. We also discuss the micro-averaging and macro-averaging algorithms for combining the per-class metrics in Section 4.3. We evaluate the computational complexity in Section 4.4.

4.1 Accuracy Rate

The accuracy rate $P_{acc}$, also known as the correct probability, assesses the ratio of correct AMC predictions. We evaluate the overall and per-class accuracy rates using different SNR values to gain insights on the effective SNR range of each AMC model. The overall accuracy rate is defined as the ratio of correct predictions over the total samples

$$ P_{acc} = \frac{N_c}{N_{test}} \tag{4.1} $$

where $N_c$ is the number of correct predictions, and $N_{test}$ is the number of samples in the test set.

For each modulation scheme, the per-class accuracy rate is defined as

$$ P^{m}_{acc} = \frac{N^{m}_{c}}{N^{m}_{test}} \tag{4.2} $$

where $N^{m}_{c}$ is the number of correct classifications for modulation scheme $\Theta_m$, and $N^{m}_{test}$ is the number of samples with modulation scheme $\Theta_m$ in the test set.

Figure 4.1: Overall accuracy rate of AMC classifiers under various SNR values.

Figure 4.2: Accuracy rate of AP-Attention model

4.1.1 Overall Accuracy Rate

Figure 4.1 depicts the overall accuracy rate $P_{acc}$ over various SNR values for the five AMC models in Section 3.3 (AP-Attention, IQ-CLDNN, Img-MobileNetV2, Features-CNN, and FFT-CNN) and previous AMC models (AP-LSTM [10] and IQ-CNN [3]).
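Equations 4.1 and 4.2 amount to simple counting over the test set, as the following minimal sketch shows. The function name `accuracy_rates` and the six toy labels are illustrative assumptions and do not correspond to any reported result.

```python
from collections import Counter


def accuracy_rates(y_true, y_pred):
    """Return the overall accuracy (Eq. 4.1) and per-class accuracy (Eq. 4.2)."""
    total = Counter(y_true)        # N_test^m for every modulation scheme
    correct = Counter()            # N_c^m for every modulation scheme
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            correct[truth] += 1
    overall = sum(correct.values()) / len(y_true)
    per_class = {label: correct[label] / n for label, n in total.items()}
    return overall, per_class


# Toy test set with three modulation schemes, two signals each.
y_true = ["QPSK", "QPSK", "8PSK", "8PSK", "WBFM", "WBFM"]
y_pred = ["QPSK", "8PSK", "8PSK", "8PSK", "WBFM", "AM-DSB"]

overall, per_class = accuracy_rates(y_true, y_pred)
print(overall, per_class)
```

Evaluating the same function on the subset of test signals at each SNR value would give accuracy-versus-SNR curves of the kind discussed in this section.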
We use the same signal representation and model structure as in [3, 10] for comparison. Seven AMC models are trained, validated, and tested on the same dataset. Compared with the results in [1, 3, 10, 17], noise reduction and label smoothing help the model discriminate modulation schemes better.

At high SNR, AP-Attention achieves a higher $P_{acc}$ than the other models. Moreover, AP-Attention, IQ-CLDNN, and Img-MobileNetV2 achieve a stable $P_{acc}$ over 86% for most modulation schemes in Figure 4.2. On the contrary, the AMC models based on statistical features and frequency-domain features require at least 6 dB to converge and have low converged $P_{acc}$ values, especially the FFT-CNN model with a final $P_{acc}$ less than 80%. Hence, we conclude that the AP signal representation has the best modulation discrimination ability, which verifies that the signal representation is a key factor for the AMC problem.

Compared with an 87.4% accuracy rate for the IQ-CNN model [3], our updated IQ-CNN with noise reduction and label smoothing has a roughly 1.3% improvement under the same model structure. Similarly, for SNR > 10 dB, AP-LSTM achieves a nearly 1.2% accuracy gain over the model based on partial high-SNR training data in [10]. This further verifies the efficiency of the proposed data pre-processing methods.

4.1.2 Per-class Accuracy Rate

We further investigate the per-class accuracy rate of AP-Attention in Figure 4.2 due to its highest overall $P_{acc}$. The curves are obtained by averaging various modulation schemes. The value in the brackets represents the $P^{m}_{acc}$ averaged over all possible SNR values from −20 dB to +18 dB with a step size of 2 dB.

Figure 4.3: Accuracy rate of 8PSK for different signal sample lengths

Figure 4.2 shows that AM-SSB is the most easily recognized modulation type with an almost 100% accuracy rate for all considered SNR values. For high SNR values (> 5 dB), almost all modulation schemes have $P_{acc}$ over 95%, except for WBFM.
It is also found that the accuracy rate of 8PSK decreases suddenly at low SNR from around −14 dB to −4 dB, because the noise limits the representation learning capability of the attention module and results in unstable performance.

Figure 4.3 illustrates that the performance degradation of 8PSK can be alleviated by increasing the dimension of signals. Based on the original RadioML dataset, three new datasets are generated with sample lengths of 256, 512, and 1024, respectively. We train the same AP-Attention model on the new datasets and plot their accuracy rates of 8PSK. The results show that the accuracy rate curve grows more steadily and converges faster when we increase the sample dimension $L$ of the training samples.

4.2 Confusion Matrix

Since the accuracy rate is unreliable for imbalanced datasets, we evaluate the confusion matrix of AMC models. A detailed overview of the per-class performance is also evaluated and visualized by a confusion matrix for the AMC problem. In an $M \times M$ confusion matrix, the $i$th row represents the signals with actual $\Theta_i$, while the $j$th column represents the signals predicted as $\Theta_j$. The $(i, j)$th entry represents the percentage of signals with $\Theta_i$ but predicted as $\Theta_j$. The summation over the $i$th row is equal to 1. The $m$th diagonal element is the correct classification probability for the $m$th modulation. Therefore, a good classifier will have a clear diagonal in its confusion matrix.

Figure 4.4: Confusion matrices when SNR is 0 dB or 6 dB. Panels: (a) AP-Attention at 0 dB, (b) AP-Attention at 6 dB, (c) IQ-CLDNN at 0 dB, (d) IQ-CLDNN at 6 dB, (e) Img-MobileNetV2 at 0 dB, (f) Img-MobileNetV2 at 6 dB, (g) Features-CNN at 0 dB, (h) Features-CNN at 6 dB, (i) FFT-CNN at 0 dB, (j) FFT-CNN at 6 dB

From Figure 4.4a and Figure 4.4b, the main source of error for AP-Attention is the misclassification between QAM16/QAM64 and AM-DSB/WBFM.
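The row-normalized confusion matrix described above can be built as in the following minimal sketch. The function name `confusion_matrix`, the class list, and the toy predictions are illustrative assumptions; they only mimic the kind of QAM16/QAM64 confusion discussed here and are not measured data.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Row i: actual class; column j: predicted class; each row sums to 1."""
    index = {label: k for k, label in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(y_true, y_pred):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        row_total = sum(row)
        # Guard against classes that never occur in y_true.
        matrix.append([c / row_total if row_total else 0.0 for c in row])
    return matrix


classes = ["QAM16", "QAM64", "WBFM"]
y_true = ["QAM16"] * 4 + ["QAM64"] * 4 + ["WBFM"] * 2
y_pred = [
    "QAM16", "QAM16", "QAM16", "QAM64",   # one QAM16 mistaken for QAM64
    "QAM16", "QAM16", "QAM64", "QAM64",   # two QAM64 mistaken for QAM16
    "WBFM", "WBFM",
]

matrix = confusion_matrix(y_true, y_pred, classes)
diagonal = [matrix[k][k] for k in range(len(classes))]
print(matrix, diagonal)
```

The diagonal holds the per-class correct classification probabilities, so a weak diagonal entry immediately points to the class pair that is being confused.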
IQ-CLDNN tends to misclassify more QAM64 as QAM16 when SNR = 0 dB, while AP-Attention discriminates QAM16/QAM64 better with a clear diagonal. This confusion arises because QAM16 is a subset of QAM64, which makes the two hard to differentiate. Separating AM-DSB and WBFM is another challenge for both AP-Attention and IQ-CLDNN.

Compared with the other models, Img-MobileNetV2 cannot classify most modulation schemes when the SNR is 0 dB, while its performance at 6 dB is better and comparable to that of AP-Attention and IQ-CLDNN. The performance of Features-CNN also degrades at low SNR values. Thus, in the low SNR regime, 3D-CI and statistical features are not a good choice. From Figure 4.4i and Figure 4.4j, compared with the other AMC algorithms, FFT-CNN has no benefits for QAM16, QAM64, QPSK, and WBFM.

4.3 Performance Metrics for Hard and Soft Classifiers

Accuracy rate and confusion matrix are relatively simple metrics. To obtain a balanced and overall performance description, we further evaluate the AMC models from two perspectives: as hard classifiers with a single cutoff and as soft classifiers with multiple cutoffs. The per-class metrics are combined by two averaging methods: micro-averaging and macro-averaging.

4.3.1 Basic Statistics

We introduce some fundamental statistics before defining the performance metrics for hard and soft classifiers. Over the test set, the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are computed first for each modulation class. We use them to derive per-class performance metrics: sensitivity, specificity, false positive rate (FPR), false negative rate (FNR), and precision.
For simplicity, we denote precision by P and recall by R.

• TP: Number of signals with $\Theta_m$ and predicted as positive.
• TN: Number of signals not with $\Theta_m$ and predicted as negative.
• FP: Number of signals not with $\Theta_m$ but predicted as positive.
• FN: Number of signals with $\Theta_m$ but predicted as negative.
• Sensitivity, recall, or true positive rate (TPR): Number of signals correctly predicted as positive over the total number of actual positive items
$$\text{Sensitivity} = R = \text{TPR} = \frac{TP}{TP + FN}. \tag{4.3}$$
• Specificity or true negative rate (TNR): Number of signals correctly predicted as negative over the total number of actual negative items
$$\text{Specificity} = \text{TNR} = \frac{TN}{TN + FP}. \tag{4.4}$$
• FPR: Number of signals wrongly predicted as positive over the total number of actual negative items
$$\text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP + TN}. \tag{4.5}$$
• FNR: Number of signals wrongly predicted as negative over the total number of actual positive items
$$\text{FNR} = 1 - \text{Sensitivity} = \frac{FN}{FN + TP}. \tag{4.6}$$
• Precision: Number of signals correctly predicted as positive over the total number of items predicted as positive
$$P = \frac{TP}{TP + FP}. \tag{4.7}$$

4.3.2 Hard Classifiers: Balanced Accuracy and F1-score

Balanced accuracy is a combination of sensitivity and specificity. Based on the definitions of sensitivity and specificity in Equation 4.3 and Equation 4.4, balanced accuracy is a holistic measure that considers all entries in the confusion matrix, and it is calculated by
$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}. \tag{4.8}$$
F1-score is the harmonic mean of precision and recall, and it is defined by
$$F1 = 2 \cdot \frac{R \cdot P}{R + P}. \tag{4.9}$$
However, from Equation 4.3 and Equation 4.7, F1-score should only be used when TN does not play a role, as TNs are not taken into account in F1-score.

We depict the per-class F1-scores for AP-Attention and IQ-CLDNN in Figure 4.5. The number in parentheses represents the F1-score of a given modulation class averaged over all possible SNR values. We discuss why balanced accuracy is unreliable in Section 4.3.3.

Figure 4.5a indicates that at high SNR, AP-Attention has excellent performance on most modulations except for AM-DSB and WBFM.
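The statistics in Equations 4.3 to 4.9 can be sketched as follows, treating one modulation class at a time as the positive class. The helper names, the guarded division (returning 0.0 for an empty denominator), and the toy labels are assumptions made for this illustration only.

```python
def one_vs_rest_counts(y_true, y_pred, m):
    """Count TP, TN, FP, FN when class m is treated as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == m and predicted == m:
            tp += 1
        elif actual == m:
            fn += 1
        elif predicted == m:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """Guarded division (a convention chosen here): 0.0 if the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(tp, tn, fp, fn):
    sensitivity = ratio(tp, tp + fn)   # Equation 4.3
    specificity = ratio(tn, tn + fp)   # Equation 4.4
    precision = ratio(tp, tp + fp)     # Equation 4.7
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": 1.0 - specificity,      # Equation 4.5
        "fnr": 1.0 - sensitivity,      # Equation 4.6
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,               # Equation 4.8
        "f1": ratio(2 * sensitivity * precision, sensitivity + precision),  # Equation 4.9
    }


# Toy labels (hypothetical): three classes, ten test signals.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1, 2, 2]
counts = one_vs_rest_counts(y_true, y_pred, m=0)
metrics = per_class_metrics(*counts)
```

For class 0 in this toy example, TN (5) is already larger than TP (3) with only three classes, which hints at why specificity grows as the number of classes increases.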
The F1-score performance of AM-SSB is far superior to that of the other modulations, with the highest overall F1-score of 0.802 and the fastest convergence speed. AM-DSB and WBFM can be easily classified at low SNR, but they show little improvement when the SNR is increased. QAM16 and QAM64 have similar per-class F1-scores.

Figure 4.5b shows the per-class F1-scores for IQ-CLDNN. Similarly, AM-SSB is still the most easily recognized modulation. Different from AP-Attention, QAM16 and QAM64 both show a slight performance degradation of around 8% and a larger performance gap between them at high SNR values. In the low SNR regime, all modulation schemes have lower F1-scores in IQ-CLDNN than in AP-Attention. This indicates that AP-Attention can extract robust and distinct features at lower SNR.

[Figure 4.5: Per-class F1-scores for AP-Attention and IQ-CLDNN. Panels: (a) F1-scores for AP-Attention, (b) F1-scores for IQ-CLDNN]

4.3.3 Micro and Macro Averaging

To obtain one metric that quantifies the overall performance of the AMC classifier, we combine the aforementioned per-class performance measures using micro or macro averaging algorithms. Since the AMC problem is a multi-class classification problem, each AMC model has a larger TN than in a binary classification problem (for the ith modulation type, correct predictions are TPs, while the other modulation types j ≠ i contribute TNs). Since the specificity becomes inflated, the balanced accuracy does not provide a good performance measurement for the multi-class classification problem. Therefore, the F1-score is used for the micro and macro averaging algorithms.

The micro-averaging algorithm considers each modulation type as a binary classification problem and assigns equal weight to each individual decision. The micro-averaging precision and recall are defined as
$$P_{micro} = \frac{\sum_{m=1}^{M} TP_m}{\sum_{m=1}^{M} (TP_m + FP_m)} \tag{4.10}$$
and
$$R_{micro} = \text{TPR}_{micro} = \frac{\sum_{m=1}^{M} TP_m}{\sum_{m=1}^{M} (TP_m + FN_m)} \tag{4.11}$$
where m is the index of the modulation scheme, and M represents the number of modulation schemes.

The micro-averaging F1-score is calculated by
$$F1_{micro} = 2 \cdot \frac{R_{micro} \cdot P_{micro}}{R_{micro} + P_{micro}}. \tag{4.12}$$

Since micro-averaging is unreliable for imbalanced class distributions, we also use the macro-averaging algorithm, which is better suited to imbalanced datasets. The macro-averaging algorithm assigns equal weight to each modulation scheme and averages over the M possible modulation schemes. The macro-averaging precision and recall are defined as
$$P_{macro} = \frac{1}{M} \sum_{m=1}^{M} \frac{TP_m}{TP_m + FP_m} = \frac{\sum_{m=1}^{M} P_m}{M} \tag{4.13}$$
and
$$R_{macro} = \text{TPR}_{macro} = \frac{1}{M} \sum_{m=1}^{M} \frac{TP_m}{TP_m + FN_m} = \frac{\sum_{m=1}^{M} R_m}{M}. \tag{4.14}$$
Similarly, the macro-averaging $F1_{macro}$ is defined by
$$F1_{macro} = 2 \cdot \frac{R_{macro} \cdot P_{macro}}{R_{macro} + P_{macro}}. \tag{4.15}$$

Table 4.1: Micro and Macro-averaging F1-scores for AMC models

AMC Model          F1_micro   F1_macro
AP-Attention       0.6974     0.6946
IQ-CLDNN           0.6800     0.6785
Img-MobileNetV2    0.5951     0.5844
Features-CNN       0.5488     0.5721
FFT-CNN            0.5602     0.5457

Table 4.1 provides the $F1_{micro}$ and $F1_{macro}$ computed on the same test set. AP-Attention obtains the largest $F1_{micro}$ and $F1_{macro}$. Note that, except for Features-CNN, the AMC models have a larger $F1_{micro}$ than $F1_{macro}$. This indicates that Features-CNN is more stable when the modulation scheme distribution is imbalanced, as $F1_{macro}$ is sensitive to the predictive performance for individual classes.

4.3.4 Soft Classifiers: Receiver Operating Characteristics (ROC) and Precision-Recall Curve (PRC)

The AMC models can be regarded as soft classifiers that produce predictions by applying a decision cutoff to the scores of each modulation scheme. The ROC curve and the PRC are plotted to visualize the classification performance of soft classifiers.

[Figure 4.6: Micro-averaging ROC Curves for AMC models]
[Figure 4.7: Per-class ROC Curves for AP-Attention]
[Figure 4.8: Micro-averaging PRCs for AMC models]
[Figure 4.9: Per-class PRC for AP-Attention]

The ROC curve is a probability curve with FPR as the x-axis and TPR as the y-axis. Each point of the ROC curve represents the TPR/FPR pair at a particular decision threshold.
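For reference, the averaging rules of Equations 4.10 to 4.15 can be sketched as below. Note that Equation 4.15 takes the harmonic mean of the macro-averaged precision and recall, rather than averaging per-class F1-scores. The function names and the per-class counts are hypothetical, and the sketch assumes every class has at least one true sample and one positive prediction.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_macro_f1(per_class_counts):
    """per_class_counts: list of (TP_m, FP_m, FN_m), one tuple per class."""
    num_classes = len(per_class_counts)
    sum_tp = sum(tp for tp, _, _ in per_class_counts)
    sum_fp = sum(fp for _, fp, _ in per_class_counts)
    sum_fn = sum(fn for _, _, fn in per_class_counts)

    # Micro-averaging pools the counts before dividing (Equations 4.10 to 4.12).
    p_micro = sum_tp / (sum_tp + sum_fp)
    r_micro = sum_tp / (sum_tp + sum_fn)

    # Macro-averaging averages the per-class ratios (Equations 4.13 to 4.15).
    p_macro = sum(tp / (tp + fp) for tp, fp, _ in per_class_counts) / num_classes
    r_macro = sum(tp / (tp + fn) for tp, _, fn in per_class_counts) / num_classes

    return f1(p_micro, r_micro), f1(p_macro, r_macro)


# Hypothetical counts for three classes, given as (TP, FP, FN).
counts = [(3, 1, 1), (3, 1, 1), (2, 0, 0)]
f1_micro, f1_macro = micro_macro_f1(counts)
```

In this toy case the small, perfectly classified third class lifts the macro average above the micro average, because macro-averaging weights every class equally regardless of its size.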
The point in the upper left corner (0, 1) of the ROC space yields the best prediction result, representing no FNs and no FPs.

The area under ROC (AUROC) measures the ability of the model to distinguish between classes by computing the 2D area under the ROC curve from point (0, 0) to (1, 1). In its probabilistic interpretation, AUROC is a decision-threshold-invariant measure, while F1-score is a threshold-sensitive measure based on hard (0 or 1) outputs. In our experiment, AUROC is computed by the trapezoidal rule [38] in Python.

As a useful measure for imbalanced datasets, PRC shows the trade-off between precision and recall for different decision thresholds. A larger area under PRC (AUPRC) represents high scores for both precision and recall. However, similar to F1-score, PRC does not consider TNs.

Compared with the macro-averaging ROC curves, the micro-averaging ROC curves are closer to the ideal point (0, 1) and have a slightly larger AUROC for all AMC models. This indicates that the AMC models have better overall performance (micro-averaging) than class-specific performance (macro-averaging). Otherwise, the micro-averaging ROC curves follow a similar growth trend and have comparable AUROC values. Thus, we only present the micro-averaging ROC curves for AMC models in Figure 4.6, unless specified explicitly. Per-class ROC curves of AP-Attention are also plotted in Figure 4.7.

From Figure 4.6, almost all algorithms converge to TPR = 1.0 at the same value of FPR ≈ 0.35, except for Features-CNN. For the per-class ROC curves of AP-Attention in Figure 4.7, AM-SSB, AM-DSB, and WBFM have the lowest FPR when reaching TPR = 1. In other words, classifying AM-SSB, AM-DSB, and WBFM is easier than classifying the other modulations. QAM16 and QAM64 have nearly coincident ROC curves and the smallest AUROCs because of their similar features at low SNR values.

Similarly, micro-averaging and per-class PRCs are illustrated in Figure 4.8 and Figure 4.9.
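The AUROC computation by the trapezoidal rule mentioned above can be sketched for a single one-vs-rest problem as follows. This is a minimal illustration in plain Python, not the code used in our experiments; the function names and toy scores are hypothetical, and the sketch assumes at least one positive and one negative sample.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low and return the
    (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Samples that share a score cross the threshold together.
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def trapezoidal_area(x, y):
    """Area under the piecewise-linear curve through the points (x[k], y[k])."""
    return sum((x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2 for k in range(1, len(x)))


# Toy scores for one class (hypothetical); label 1 marks the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
auroc = trapezoidal_area(fpr, tpr)
```

A micro-averaged curve can be obtained with the same routine by concatenating the one-vs-rest scores and labels of all modulation classes before calling `roc_points`.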
Iso-F1 curves, each consisting of points with the same F1-score, are also included in these two figures to show how close the PRCs are to different F1-scores. A good PRC is close to point (1, 1), and a good AUROC is close to 1.

Overall, micro-averaging AUPRCs are lower than micro-averaging AUROCs. From Figure 4.8, AP-Attention and IQ-CLDNN still show an AUPRC advantage of up to 0.16 over the other AMC models. AP-Attention and IQ-CLDNN reach a recall of roughly 46% without any FP predictions. Features-CNN and FFT-CNN start to predict FPs at a low recall rate of 29%.

For the per-class PRCs of the AP-Attention model in Figure 4.9, AM-SSB has the highest AUPRC. We also notice that at the first plotted PRC points, only the precision of AM-SSB is slightly lower than 100%. This indicates that at the first threshold, the AP-Attention model already makes FPs for AM-SSB. However, to reach a sensitivity of 100%, the precision of AM-SSB only drops to around 70%. Except for QAM16 and QAM64, the modulation classes have an AUPRC above 75% in AP-Attention. From the micro-averaging PRC, AP-Attention reaches a recall rate of roughly 48% without any FPs.

4.4 Computational Complexity

Computational complexity includes memory consumption and detection efficiency. The memory consumption can be measured by the number of trainable parameters (and the corresponding required storage size) and the peak memory usage. The detection efficiency is evaluated by the average inference time per input signal.

Since the model parameters of AMC models can be trained and stored in memory, the computational complexity of AMC models is mainly determined by the model structure (neural-network architecture and input size) and the feature extraction stage (signal representation transformation and data pre-processing). The modulation schemes and channel states have negligible influence on the computational complexity.
To make a fair comparison, we analyze the complexity metrics (the number of trainable parameters, storage size of parameters, peak memory usage, and average inference time) under the same hardware and software implementations.

As shown in Table 4.2, AP-Attention has the smallest number of trainable parameters and the smallest storage size. Therefore, AP-Attention is an attractive choice when both low computational complexity and high classification performance are required. We also observe that the proposed IQ-CLDNN model has the second-lowest inference time. The Img-MobileNetV2 model requires more memory and inference time than the other models because of its larger input size of (224, 224, 2) and complex model structure. The average inference time of Features-CNN is almost twice that of AP-Attention, since the computation of statistical features consumes more time. Using the pooling layer and a smaller number of CNN filters, FFT-CNN has the lowest inference time among the AMC models.

Table 4.2: Computational Complexity of AMC Models

AMC Model          Number of Trainable   Storage Size   Peak Memory     Average Inference Time
                   Parameters            (MB)           Usage (MiB)     per Example (ms)
AP-Attention       249,227               3.0            4,243.492       0.329
IQ-CLDNN           794,763               9.6            6,317.871       0.303
Img-MobileNetV2    2,237,963             16.1           7,155.203       17.045
Features-CNN       821,259               9.9            3,530.207       0.655
FFT-CNN            1,175,883             14.2           6,518.602       0.291

Owing to the hardware support for parallel processing and the software optimization of data flow, we expect that the inference time of DL-based models can be sharply reduced in implementation. In particular, graphics processing unit (GPU) assisted AMC models are expected to have lower computational complexity by processing different received sequences simultaneously.

4.5 Summary

This chapter analyzes the results of the five previously introduced AMC models. Different performance metrics are introduced to evaluate the AMC models.
Experiment results indicate that AP-Attention and IQ-CLDNN have superior classification performance over the other AMC models. Computational complexity, including memory consumption and detection efficiency, is discussed in Section 4.4. Complexity results show that AP-Attention has fewer trainable parameters and a short inference time, while Img-MobileNetV2 suffers from high complexity.

Chapter 5
Weight Pruning and Stacking for a Better End-to-end AMC Model

5.1 Weight Pruning

Real-time AMC for mobile devices with constrained computational resources requires more memory-efficient and power-efficient AMC models. Therefore, pruned NNs are introduced at the expense of a negligible loss in accuracy. Pruned NNs were proposed as early as the 1990s, based on the fact that many NN parameters are redundant and contribute little to the final output [39].

[Figure 5.1: Model sizes comparison after weight pruning]

We eliminate the unnecessary model parameters to improve the memory efficiency and power efficiency of the proposed AMC models. More specifically, the weight pruning operations set the low-magnitude model parameters to zero such that the required memory resource is reduced. After weight pruning, the models become sparse. Therefore, performing the compression operation (defined as recording the non-zero elements and skipping the computations related to the zeros) can reduce the latency. The detailed procedures of our proposed weight pruning operation are as follows. Starting from the models obtained in Chapter 4, training is performed for 150 epochs, and the pruning operation is conducted in the first 60 epochs. Since pruning too many model parameters in each iteration can severely degrade the performance, the pruning operation is performed iteratively. For example, we prune every 500 steps to give the AMC model more recovery time. We gradually increase the sparsity during training until the sparsity target of 70% is reached.
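The iterative magnitude pruning described above can be sketched as follows. The text only fixes the pruning interval (500 steps) and the final sparsity (70%); the linear ramp of the sparsity target, the parameter `end_step`, and both function names are assumptions made for this illustration and do not describe the exact schedule used in our experiments.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    num_to_prune = int(round(sparsity * len(weights)))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in by_magnitude[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity_at(step, end_step, final_sparsity=0.7, frequency=500):
    """Sparsity target at a training step: a linear ramp (assumed shape)
    that is only updated every `frequency` steps and saturates at end_step."""
    if step >= end_step:
        return final_sparsity
    last_update = (step // frequency) * frequency
    return final_sparsity * last_update / end_step


# Toy weight vector (hypothetical values).
weights = [0.05, -0.9, 0.3, -0.01, 0.7, 0.02, -0.4, 0.15, 0.6, -0.08]
pruned = prune_by_magnitude(weights, sparsity=0.7)
achieved_sparsity = pruned.count(0.0) / len(pruned)
```

Between two pruning updates the target stays constant, which is what gives the model time to recover: for instance, `sparsity_at(1700, 3000)` equals `sparsity_at(1500, 3000)`.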
Training stops, and the pruned model is obtained, when the validation accuracy has not improved for 30 consecutive epochs.

We compress the original models and the pruned models by a generic file compression algorithm, namely zip compression. As illustrated in Figure 5.1, bars with texture represent the models after zip compression. The value labeled on each bar is the model size before zip compression. Figure 5.1 shows that the pruned models occupy roughly 30% of the memory of the original models. After zip compression, we can save roughly 10% more storage space. From these results, the sizes of the pruned models do not exceed 5 MB. Moreover, pruned AP-Attention, pruned IQ-CLDNN, and pruned Features-CNN only occupy around 1 MB. Therefore, weight pruning greatly lightens the storage burden for end-to-end mobile devices.

We also compare the F1-scores of the original models in Chapter 4 and the pruned models. The micro-averaged F1-scores of the pruned models can be found in Figure 5.2. The values in the brackets are the overall averaged F1-scores. We observe that the pruning operation results in minor performance degradation. Although the degradation in F1-scores is more obvious in the low SNR regime, the overall decrease does not exceed 0.08. Therefore, the pruning operation trades a small loss in F1-score for the memory reduction.

[Figure 5.2: Micro-averaged F1-scores comparison after weight pruning]

5.2 Ensemble Learning

Chapter 4 shows that the classification accuracy of the different models, in descending order, is AP-Attention, IQ-CLDNN, Img-MobileNetV2, Features-CNN, and FFT-CNN. Using parallel processing, we can process multiple signal sequences simultaneously. Therefore, the classification accuracy is more important than the computational complexity in practice. To improve the classification accuracy, ensemble learning is introduced to integrate the different AMC models in Chapter 4.

5.2.1 Definition

In ensemble learning, multiple first-level models (a.k.a.
weak learners) are combined and trained to solve the same problem. Two categories of combination methods can be used for ensemble learning, namely homogeneous and heterogeneous methods. The homogeneous methods combine the same models trained in different ways, and the heterogeneous methods combine different models, that is, models based on different learning algorithms.

5.2.2 Stacking

Since the proposed AMC models are all heterogeneous models based on different learning algorithms, and our ensemble target is to obtain a more accurate modulation classification, a heterogeneous method (i.e., the stacking method) is used. Rather than choosing a single model, the stacking method integrates the different first-level models into a second-level model (a.k.a. meta model) and trains the second-level model based on the outputs of the first-level models [40].

We denote the original training dataset by D, with N individual signal samples $\{x_i, y_i\}_{i=1}^{N}$, and the number of first-level models by $N_1$. At the beginning, the general procedure of stacking learns the first-level models $f^1 = \{f^1_1, f^1_2, \ldots, f^1_{N_1}\}$ based on D. Then, stacking trains a second-level model $f^2$ based on the predictions of the first-level models $f^1$. For sample $(x_i, y_i)$, the corresponding item in the new dataset is $(\{f^1_1(x_i), f^1_2(x_i), \ldots, f^1_{N_1}(x_i)\}, y_i)$. After training the second-level model $f^2$, the prediction for an unseen sequence $x$ is computed by $f^2(f^1_1(x), f^1_2(x), \ldots, f^1_{N_1}(x))$.

We use combinations of the AMC models introduced in Chapter 4 to generate the first-level models $f^1$. To maintain low complexity, a three-layer CNN is used as the second-level model $f^2$. Using the proposed five AMC models, we can obtain $C_5^2 + C_5^3 + C_5^4 + C_5^5 = 26$ candidate combinations of first-level models.
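As a quick check of this count, the candidate combinations can be enumerated directly. The short sketch below uses only the Python standard library; the variable names are chosen for illustration.

```python
from itertools import combinations
from math import comb

models = ["AP-Attention", "IQ-CLDNN", "Img-MobileNetV2", "Features-CNN", "FFT-CNN"]

# All subsets that contain at least two of the five AMC models.
candidate_sets = [
    subset
    for size in range(2, len(models) + 1)
    for subset in combinations(models, size)
]

# The same count from the binomial coefficients C(5,2) + C(5,3) + C(5,4) + C(5,5).
count_by_formula = sum(comb(5, size) for size in range(2, 6))
```

Both routes give 10 + 10 + 5 + 1 = 26 combinations.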
Based on the performance of the five AMC models discussed in Chapter 4, we consider five reasonable combinations that balance feature diversity against computational complexity:

• AP-Attention, IQ-CLDNN (AP-IQ)
• AP-Attention, IQ-CLDNN, Img-MobileNetV2 (AP-IQ-Img)
• AP-Attention, IQ-CLDNN, Features-CNN (AP-IQ-Features)
• AP-Attention, IQ-CLDNN, Img-MobileNetV2, Features-CNN (AP-IQ-Img-Features)
• AP-Attention, IQ-CLDNN, Img-MobileNetV2, Features-CNN, FFT-CNN (AP-IQ-Img-Features-FFT).

When the first-level models use the same dataset as the second-level model, overfitting to the dataset occurs. A heuristic method is to split the dataset into two subsets, one for the first-level models and one for the second-level model. However, the data-splitting method has an obvious drawback, i.e., only a fraction of the data samples is used for training the model in each level. Therefore, K-fold cross-validation is used.

Using K-fold cross-validation, we partition D into K disjoint subsets such that the second-level model $f^2$ can be trained using all samples in dataset D. The procedure of stacking with K-fold cross-validation is illustrated in Figure 5.3. We train $f^1$ on K − 1 folds and make predictions on the remaining fold to avoid overfitting. After repeating this K times, all predictions from $f^1$ make up the training dataset for $f^2$. After training $f^2$, we re-train $f^1$ on the whole dataset D. Therefore, the final stacking model F is obtained by applying the second-level model $f^2$ to the re-trained first-level models $f^1$, which is defined by
$$F(\cdot) = f^2\left(f^1_1(\cdot), f^1_2(\cdot), \ldots, f^1_{N_1}(\cdot)\right). \tag{5.1}$$

[Figure 5.3: Stacking with K-fold Cross-Validation]

In our experiments, we choose three-fold cross-validation to train the stacking model. The second-level model $f^2$ consists of three CNN layers with 128, 128, and 64 hidden neurons. Padding and dropout with a rate of 0.4 are used. We depict the F1-score performance of the five stacking learned models in Figure 5.4.
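The K-fold construction of the second-level training set described above can be sketched as follows. The two toy learners stand in for the AMC models, the strided fold split is an assumption, and all function names are hypothetical; the second-level CNN itself is not reproduced here.

```python
def k_fold_indices(n, k):
    """Partition range(n) into k disjoint folds (strided split, an assumption)."""
    return [list(range(start, n, k)) for start in range(k)]


def out_of_fold_features(fit_fns, xs, ys, k=3):
    """Build the training set for f^2 from out-of-fold first-level outputs."""
    n = len(xs)
    meta_x = [[None] * len(fit_fns) for _ in range(n)]
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        train_x = [xs[i] for i in range(n) if i not in held]
        train_y = [ys[i] for i in range(n) if i not in held]
        for j, fit in enumerate(fit_fns):
            predict = fit(train_x, train_y)    # train f^1_j on K - 1 folds
            for i in held_out:
                meta_x[i][j] = predict(xs[i])  # predict on the remaining fold
    return meta_x, list(ys)


# Two toy first-level learners (hypothetical stand-ins for the AMC models).
def fit_mean_threshold(train_x, train_y):
    mean0 = sum(x for x, y in zip(train_x, train_y) if y == 0) / train_y.count(0)
    mean1 = sum(x for x, y in zip(train_x, train_y) if y == 1) / train_y.count(1)
    threshold = (mean0 + mean1) / 2
    return lambda x: 1 if x > threshold else 0


def fit_class_prior(train_x, train_y):
    prior = train_y.count(1) / len(train_y)
    return lambda x: prior


xs = [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2, 0.15]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 0]
learners = [fit_mean_threshold, fit_class_prior]
meta_x, meta_y = out_of_fold_features(learners, xs, ys, k=3)

# Once f^2 has been trained on (meta_x, meta_y), the first-level
# models are re-trained on the whole dataset D, as in Equation 5.1.
final_first_level = [fit(xs, ys) for fit in learners]
```

Every sample receives its first-level outputs from models that never saw it during training, which is what lets $f^2$ train on all of D without inheriting the first-level overfitting.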
The values in the brackets are the overall micro-averaged F1-scores. Compared with the other stacking models and the AMC models in Figure 5.2, we find that the AP-IQ model has the highest overall micro-averaged F1-score. This observation is reasonable because the AP-Attention and IQ-CLDNN models show superior performance over the other models. When highly accurate features are combined, the second-level model $f^2$ extracts more discriminative representations. Besides, the AP-IQ model outperforms the single AP-Attention model by roughly 2% in F1-score, which validates the effectiveness of stacking.

[Figure 5.4: Micro-averaged F1-scores for stacking learned models]

The predictions of Features-CNN and FFT-CNN limit the representation learning capability of the second-level model $f^2$. Therefore, the stacking models that include them obtain lower F1-scores than the other stacking models. This observation suggests that a weighted dataset could give more weight to the first-level models $f^1$ with better performance and less weight to the other models.

Following Section 4.4 and Section 5.1, extensive pruning experiments with a target of 70% sparsity are also conducted to control the computational complexity of the stacking AMC models. Similar to the single AMC models, the pruning operation leads to a slight degradation in F1-scores. Although the deterioration is more severe in the low SNR regime, overall the effects can be neglected.

Table 5.1 compares the complexity of the stacking models after the same pruning operations as in Section 5.1. After the pruning operations, the stacking classifiers that include Img-MobileNetV2 have a remarkably higher inference time.
The AP-IQ model has a storage size of less than 5 MB and an average inference time of less than 1 ms. Therefore, the pruned AP-IQ is the most suitable AMC classifier for end-to-end devices due to its superior F1-score performance and lower computational complexity.

Table 5.1: Computational Complexity of Pruned Stacking AMC Models

AMC Model                 Number of Trainable   Storage Size   Peak Memory     Average Inference Time
                          Parameters            (MB)           Usage (MiB)     per Example (ms)
AP-IQ                     331,656               4.15           6,806.595       0.723
AP-IQ-Img                 977,142               7.91           7,382.542       17.325
AP-IQ-Features            633,930               7.52           6,917.368       1.378
AP-IQ-Img-Features        1,177,869             12.19          7,520.321       18.308
AP-IQ-Img-Features-FFT    1,551,467             15.88          7,614.509       18.711

5.3 Summary

In this chapter, we introduce weight pruning and stacking learning to further improve the classification performance of AMC models. Experiments show that weight pruning can reduce the AMC model complexity with negligible performance degradation. Stacking learned AMC models perform better than single AMC models. Overall, the pruned stacking learned AP-IQ model has the best classification performance and an acceptably low computational complexity.

Chapter 6
Conclusions and Future Work

In this chapter, we conclude and highlight the contributions of this thesis in Section 6.1. Then, we present some ideas for future work in Section 6.2.

6.1 Conclusions

This thesis aims at providing a memory-efficient and high-performance AMC model for resource-constrained devices. Our contributions can be summarized as follows:

• We introduced noise reduction, signal normalization, and label smoothing before training AMC models. To conduct a thorough comparison, we investigated all possible signal representations and designed a DL-based model for each signal representation. For example, we used the KDE algorithm to convert the well-known CD into 3D-CI.
We proposed the attention module in AP-Attention and connected CNN, LSTM, and DNN with multi-scale connections in IQ-CLDNN.
• We conducted extensive simulations to evaluate the model performance based on comprehensive metrics, namely F1-score, ROC, and PRC. Numerical results illustrate that the overall performance of the AMC models in descending order was: AP-Attention, IQ-CLDNN, Img-MobileNetV2, Features-CNN, and FFT-CNN. Moreover, the complexity experiments validated the computational efficiency of AP-Attention and IQ-CLDNN.

Therefore, we conclude that the data pre-processing methods used and the proposed AMC models (AP-Attention, IQ-CLDNN, Img-MobileNetV2, Features-CNN, and FFT-CNN) can improve the AMC performance over the benchmark models (AP-LSTM and IQ-CNN). Besides, our work also compared the impacts of the pruning operation and the stacking operation on the AMC models. The investigation of pruned models showed that weight pruning greatly reduced the model storage size while bringing only negligible performance degradation. Furthermore, the stacking experiments confirmed that pruned AP-IQ achieved the highest F1-scores while keeping a low computational complexity.

6.2 Future Work

Several research directions can be considered in the future:

• Recently, O’Shea et al. [7] proposed the second version of the RadioML dataset, which includes both synthetic simulated channel effects and over-the-air recordings of 24 modulation schemes. We will train and test our models on this dataset in the future to check the performance of our proposed models on this more complicated dataset.
• In this thesis, we chose weight pruning to remove unnecessary NN connections. Further efforts to reduce the model size might explore the quantization method. Quantization converts the model weights to eight-bit precision. We should also convert the final pruned model into a suitable format to run on specific back-ends.
For example, TensorFlow Lite (https://www.tensorflow.org/lite) is a format for deploying DL models on mobile devices.
• In Section 5.2, we assign equal weights to the predictions of first-level base models when building the training dataset for the meta-model. However, the results suggest the possibility for further research about a weighted dataset with more weights to good predictions and fewer weights for other predictions.
• Recent research has shown that DNNs are highly vulnerable to adversarial attacks. Sadeghi and Larsson [41] validated this phenomenon in radio modulation classification tasks. Further studies need to be carried out to determine whether adversarial attacks will affect our proposed AMC models.
• All our proposed models belong to supervised learning. The model performance relies heavily on the quality of the training dataset. More research should be undertaken to explore how semi-supervised learning performs in the AMC problem.

Bibliography

[1] M. Kulin, T. Kazaz, I. Moerman, and E. De Poorter, “End-to-end learning from spectrum data: A deep learning approach for wireless signal identification in spectrum monitoring applications,” IEEE Access, vol. 6, pp. 18484–18501, 2018.
[2] S. Haykin, “Cognitive radio: brain-empowered wireless communications,” IEEE Journal on Selected Areas in Communications, vol. 23, no. 2, pp. 201–220, Feb 2005.
[3] T. J. O’Shea and J. Corgan, “Convolutional radio modulation recognition networks,” CoRR, vol. abs/1602.04105, 2016. [Online]. Available: http://arxiv.org/abs/1602.04105
[4] I. Abou-Faycal, M. Medard, and U. Madhow, “Binary adaptive coded pilot symbol assisted modulation over Rayleigh fading channels without feedback,” IEEE Transactions on Communications, vol. 53, no. 6, pp. 1036–1046, June 2005.
[5] J. Sun, G. Wang, Z. Lin, S. G. Razul, and X.
Lai, “Automatic modulation classification of cochannel signals using deep learning,” 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), pp. 1–5, 2018.
[6] O. A. Dobre, A. Abdi, Y. Bar-Ness, and W. Su, “Survey of automatic modulation classification techniques: classical approaches and new trends,” IET Communications, vol. 1, no. 2, pp. 137–156, April 2007.
[7] T. J. O’Shea, T. Roy, and T. C. Clancy, “Over-the-air deep learning based radio signal classification,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 168–179, Feb 2018.
[8] C.-Y. Huan and A. Polydoros, “Likelihood methods for MPSK modulation classification,” IEEE Transactions on Communications, vol. 43, no. 2/3/4, pp. 1493–1504, Feb 1995.
[9] W. Wei and J. M. Mendel, “Maximum-likelihood classification for digital amplitude-phase modulations,” IEEE Transactions on Communications, vol. 48, no. 2, pp. 189–193, Feb 2000.
[10] S. Rajendran, W. Meert, D. Giustiniano, V. Lenders, and S. Pollin, “Deep learning models for wireless signal classification with distributed low-cost spectrum sensors,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 3, pp. 433–445, Sep. 2018.
[11] A. E. El-Mahdy and N. M. Namazi, “Classification of multiple M-ary frequency-shift keying signals over a Rayleigh fading channel,” IEEE Transactions on Communications, vol. 50, no. 6, pp. 967–974, June 2002.
[12] P. Panagiotou, A. Anastasopoulos, and A. Polydoros, “Likelihood ratio tests for modulation classification,” in MILCOM 2000 Proceedings. 21st Century Military Communications. Architectures and Technologies for Information Superiority (Cat. No.00CH37155), vol. 2, Oct 2000, pp. 670–674.
[13] Y. Zeng, M. Zhang, F. Han, Y. Gong, and J. Zhang, “Spectrum analysis and convolutional neural network for automatic modulation recognition,” IEEE Wireless Communications Letters, vol. 8, no. 3, pp.
929–932, June 2019.
[14] M. L. D. Wong and A. K. Nandi, “Automatic digital modulation recognition using artificial neural network and genetic algorithm,” Signal Process., vol. 84, no. 2, pp. 351–365, Feb. 2004. [Online]. Available: http://dx.doi.org/10.1016/j.sigpro.2003.10.019
[15] M. W. Aslam, Z. Zhu, and A. K. Nandi, “Automatic modulation classification using combination of genetic programming and KNN,” IEEE Transactions on Wireless Communications, vol. 11, no. 8, pp. 2742–2750, August 2012.
[16] Huang Fu-qing, Huang Fu-qing, Zhong Zhi-ming, Xu Yi-tao, and Ren Guo-chun, “Modulation recognition of symbol shaped digital signals,” in 2008 International Conference on Communications, Circuits and Systems, May 2008, pp. 328–332.
[17] J. H. Lee, J. Kim, B. Kim, D. Yoon, and J. W. Choi, “Robust automatic modulation classification technique for fading channels via deep neural network,” Entropy, vol. 19, no. 9, 2017. [Online]. Available: https://www.mdpi.com/1099-4300/19/9/454
[18] S. Peng, H. Jiang, H. Wang, H. Alwageed, Y. Zhou, M. M. Sebdani, and Y. Yao, “Modulation classification based on signal constellation diagrams and deep learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 3, pp. 718–727, March 2019.
[19] A. Swami and B. M. Sadler, “Hierarchical digital modulation classification using cumulants,” IEEE Transactions on Communications, vol. 48, no. 3, pp. 416–429, March 2000.
[20] J. J. Popoola and R. van Olst, “Effect of training algorithms on performance of a developed automatic modulation classification using artificial neural network,” in 2013 Africon, Sep. 2013, pp. 1–6.
[21] Han Gang, Li Jiandong, and Lu Donghua, “Study of modulation recognition based on HOCs and SVM,” in 2004 IEEE 59th Vehicular Technology Conference. VTC 2004-Spring (IEEE Cat. No.04CH37514), vol. 2, May 2004, pp. 898–902.
[22] T. O’Shea and J.
Hoydis, “An introduction to deep learning for the physicallayer,” IEEE Transactions on Cognitive Communications and Networking,vol. 3, no. 4, pp. 563–575, Dec 2017. → page 5[23] F. Meng, P. Chen, L. Wu, and X. Wang, “Automatic modulationclassification: A deep learning enabled approach,” IEEE Transactions onVehicular Technology, vol. 67, no. 11, pp. 10 760–10 772, Nov 2018. →page 5[24] Y. Wang, M. Liu, J. Yang, and G. Gui, “Data-driven deep learning forautomatic modulation recognition in cognitive radios,” IEEE Transactionson Vehicular Technology, vol. 68, no. 4, pp. 4074–4077, April 2019. →page 567[25] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Invertedresiduals and linear bottlenecks: Mobile networks for classification,detection and segmentation,” CoRR, vol. abs/1801.04381, 2018. [Online].Available: http://arxiv.org/abs/1801.04381 → pages 7, 33[26] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”arXiv e-prints, p. arXiv:1412.6980, Dec 2014. → page 15[27] C. M. Bishop, Pattern Recognition and Machine Learning (InformationScience and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. →page 17[28] E. Azzouz and A. Nandi, “Automatic modulation recognition,” Journal ofthe Franklin Institute, vol. 334, no. 2, pp. 241–273, 1997. → pages 21, 76[29] A. Swami and B. M. Sadler, “Hierarchical digital modulation classificationusing cumulants,” IEEE Transactions on Communications, vol. 48, no. 3,pp. 416–429, March 2000. → page 22[30] A. K. Nandi and E. E. Azzouz, “Algorithms for automatic modulationrecognition of communication signals,” IEEE Transactions onCommunications, vol. 46, no. 4, pp. 431–436, April 1998. → pages 22, 76[31] E. Azzouz and A. Nandi, “Automatic identification of digital modulationtypes,” Signal Processing, vol. 47, no. 1, pp. 55 – 69, 1995. → page 76[32] M. W. Aslam, Z. Zhu, and A. K. 
Nandi, “Automatic modulationclassification using combination of genetic programming and knn,” IEEETransactions on Wireless Communications, vol. 11, no. 8, pp. 2742–2750,August 2012. → page 22[33] R. Mu¨ller, S. Kornblith, and G. Hinton, “When does label smoothing help?”2019. → page 23[34] M. Zhang, Y. Zeng, Z. Han, and Y. Gong, “Automatic modulationrecognition using deep learning architectures,” in 2018 IEEE 19thInternational Workshop on Signal Processing Advances in WirelessCommunications (SPAWC), June 2018, pp. 1–5. → page 26[35] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, “Attention-basedbidirectional long short-term memory networks for relation classification,”in ACL, 2016. → pages 27, 2868[36] T. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, longshort-term memory, fully connected deep neural networks,” 04 2015, pp.4580–4584. → page 31[37] X. Liu, D. Yang, and A. E. Gamal, “Deep neural network architectures formodulation classification,” in 2017 51st Asilomar Conference on Signals,Systems, and Computers, Oct 2017, pp. 915–919. → page 33[38] T. Fawcett, “Introduction to roc analysis,” Pattern Recognition Letters,vol. 27, pp. 861–874, 06 2006. → page 50[39] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” inAdvances in Neural Information Processing Systems 2, D. S. Touretzky, Ed.Morgan-Kaufmann, 1990, pp. 598–605. [Online]. Available:http://papers.nips.cc/paper/250-optimal-brain-damage.pdf → page 54[40] C. C. Aggarwal, Data Classification: Algorithms and Applications, 1st ed.Chapman & Hall/CRC, 2014. → page 57[41] M. Sadeghi and E. G. Larsson, “Adversarial attacks on deep-learning basedradio signal classification,” CoRR, vol. abs/1808.07713, 2018. [Online].Available: http://arxiv.org/abs/1808.07713 → page 64[42] T. O’Shea and N. West, “Radio machine learning dataset generation withgnu radio,” Proceedings of the GNU Radio Conference, vol. 1, no. 
1, 2016.

Appendix A
Typical Layers in CNN Architecture

A.1 Convolutional Layer

Each convolutional layer convolves the input feature maps with fixed-length filters and then cascades the resulting feature maps together to form the output of the layer. We begin with the standard One-dimensional (1D) convolutional layer, which convolves in a single direction over vectors.

For example, if the input feature map of the first 1D convolutional layer is $x^{l-1} \in \mathbb{R}^{L \times N^{l-1}}$, where $L = 128$ points is the sample length and $N^{l-1} = 2$ is the number of input feature maps, or dimensions, then the 1D convolutional layer $l$ with $N^l$ output feature maps has $N^{l-1} \times N^l$ kernels of size $k^l$ and $N^l$ biases.

For a given kernel $i \in \{1, \dots, N^l\}$ in layer $l$, the input of the 1D convolutional layer is the output of the previous layer, denoted $x_j^{l-1}$ with $j \in \{1, \dots, N^{l-1}\}$. The $i$th output vector is then

$$x_i^l = f_{\mathrm{act}}\left(\sum_{j=1}^{N^{l-1}} x_j^{l-1} * k_{i,j}^l + b_i^l\right) \tag{A.1}$$

where $k_{i,j}^l$ is the kernel of size $k^l$ for the $i$th output vector and the $j$th input vector, $b_i^l$ is the bias term, and $*$ is the convolution operation. For kernels of size $k^l$, the total number of trainable parameters in convolutional layer $l$ is $N^l \times N^{l-1} \times k^l + N^l$.

A.2 Activation Layer

After a classic NN or convolutional layer, a non-linear activation function such as ReLU is applied to mitigate the vanishing gradient problem. ReLU can be described as

$$f_{\mathrm{ReLU}}(x_i) = \max(0, x_i). \tag{A.2}$$

Activation functions are non-linear, classically sigmoidal in shape, so the resulting non-linear system can extract more complex features than a high-order linear system.

Another common activation function, softmax, is often used to produce the probability score associated with the $i$th class. In AMC, the final modulation type is the modulation class with the highest probability score.
The $i$th element of the softmax output is

$$f_{\mathrm{softmax}}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{M} e^{x_j}} \tag{A.3}$$

where $x_i$ is the $i$th pre-activation output and $M$ is the number of modulation classes.

A.3 Fully-connected or Dense Layer

After the stacked convolutional layers and a flatten layer that reshapes a 2D array into a 1D vector, a fully-connected layer is used to extract higher-level information from the flattened output.

A fully-connected layer, or dense layer, has the same architecture as the classic neural networks in Equation 2.12, where neurons have full connections to all activations from the previous layer. The dense layer is usually used as the last layer in a classification problem, with softmax as the activation function, to output the normalized likelihood vector.

Appendix B
RadioML Dataset Generation Setup

B.1 Dataset Simulation Model

Modulated with real voice and text data, the samples in this dataset are generated with 11 different modulation schemes and 20 different SNR levels from −20 dB to +18 dB with a step of 2 dB.

For the digital modulations, Binary Phase Shift Keying (BPSK), Quadrature Phase Shift Keying (QPSK), 8PSK, QAM16, QAM64, Pulse Amplitude Modulation (PAM)4, Gaussian Frequency-shift Keying (GFSK), and Continuous-phase Frequency-shift Keying (CPFSK), the entire Gutenberg works of Shakespeare in ASCII are used, with a whitening block randomizer to ensure equiprobable symbols and bits. Moreover, the gr-mapper OOT module and an interpolating finite impulse response root-raised cosine pulse shaping filter with an excess bandwidth of 0.35 are used to achieve the desired samples per symbol rate [3].

For the analog modulations, Wide-band Frequency Modulation (WBFM), Amplitude Modulation (AM)-Single Side-band Modulation (SSB), and AM-Double Side-band (DSB), a continuous acoustic speech signal with some interludes and off times is used, implemented with GNU Radio hierarchical blocks.

The generated signals then pass through a number of realistic channel imperfections and intersymbol interference.
Primary amplitude, phase, Doppler, and delay impairments introduced in the wireless channel consist of [42]:

• Thermal noise: Due to the resistive components in the physical device, such as the receiver antenna, thermal noise may be modelled as AWGN, $n \sim \mathcal{N}(0, \sigma^2)$, which sets a specific noise power level corresponding to the desired SNR.
• Frequency offset: The frequency offset is caused by the slightly different local oscillator frequencies at the transmitter, $f_c$, and the receiver, $f'_c$, and by the motion of emitters, reflectors, and/or receivers.
• Phase noise: Oscillator drift and the unknown phase delay of various propagation media cause the angle of the signal to drift around its intended instantaneous phase $2\pi f_c t$.
• Sample rate offset: Different sample rates at the receiver and transmitter, and time dilation, are simulated by a fractional interpolator stepping along at a rate of $1 + \epsilon$ input samples per output sample, where $\epsilon$ is close to zero and follows a clipped random walk process.
• Multipath fading or frequency selective fading: Implemented in GNU Radio by a sum of sinusoids with random phase offsets for Rician and Rayleigh fading, multipath fading usually occurs when signals reflect off reflectors such as buildings and vehicles.
• Delay spread: Non-impulsive delay spread is caused by the propagation of delayed multipath reflection, diffraction, and diffusion.

B.2 Dataset Parameters

The total dataset is split by a short-time rectangular windowing process, similar to the one used in speech recognition to slice continuous acoustic voice signals [42]. After segmentation, each signal example is normalized to an average transmit power of 0 dB in a 128×2 vector with IQ components. Each example is approximately 128 µs long and contains between 8 and 16 symbols.

For dataset storage, the numpy and cPickle Python packages are used to store it as a pickle file with complex 32-bit floating point samples. The detailed specifications and generation parameters are listed in Table B.1. As a modulation characteristic, the samples per symbol parameter in Table B.1 specifies the number of samples representing each modulated symbol.

Table B.1: RadioML Dataset Parameters
Parameter | Value
Samples per symbol | 4
Sample length | 128
Sampling frequency | 200 kHz
Sampling rate offset standard deviation | 0.01 Hz
Excess bandwidth for root-raised cosine pulse shaping filter | 0.35
Maximum sampling rate offset | 50 Hz
Carrier frequency offset standard deviation | 0.01 Hz
Maximum carrier frequency offset | 500 Hz
Number of sinusoids | 8
Maximum Doppler frequency | 1
Fading model | Rician
Rician K-factor | 4
Fractional sample delays for the power delay profile | [0, 0.9, 1.7]
Number of samples per modulation scheme at a specific SNR | 1000
Magnitude corresponding to each delay time | [1, 0.8, 0.3]
Filter length to interpolate the power delay profile | 8
Standard deviation of the AWGN process | $10^{-\mathrm{SNR}/10}$
Number of training samples | 82500
Number of validation samples | 41250
Number of test samples | 41250

Appendix C
Statistical Features

C.1 HOC

The HOC $C_{pq}$ is defined by

$$C_{pq} = \mathrm{cum}(x, \dots, x, x^*, \dots, x^*) \tag{C.1}$$

where $\mathrm{cum}(\cdot)$ denotes the cumulant function, $x$ is repeated $p - q$ times, and the conjugated version $x^*$ is repeated $q$ times. To remove the effect of the signal scale on cumulants, $C_{pq}$ is typically raised to the power $2/p$.

By Equation C.1, the second-order cumulant $C_{21}$ is given by $\mathrm{cum}(x, x^*)$, the fourth-order cumulant $C_{42}$ is $\mathrm{cum}(x, x, x^*, x^*)$, and the sixth-order cumulant $C_{62}$ is $\mathrm{cum}(x, x, x, x, x^*, x^*)$.

The joint cumulant function is defined as [17]:

$$\mathrm{cum}(x_1, \dots, x_N) = \sum_{A} (|A| - 1)!\,(-1)^{|A| - 1} \prod_{B \in A} E\Big(\prod_{i \in B} x_i\Big) \tag{C.2}$$

where $A$ runs over all partitions of the set $\{1, \dots, N\}$, and $B$ runs through the list of all blocks of the partition $A$.
A simple example, for zero-mean random variables, is $\mathrm{cum}(\alpha, \beta, \gamma, \nu) = E[\alpha\beta\gamma\nu] - E[\alpha\beta]E[\gamma\nu] - E[\alpha\gamma]E[\beta\nu] - E[\alpha\nu]E[\beta\gamma]$.

C.2 HOM

Given $N$ samples $x(i)$, the HOC $C_{pq}$ can be obtained from the HOM $M_{pq}$, where the empirical moment $M_{pq}$ associated with the stationary process $x(i)$ is computed as

$$M_{pq} = E\left[x(i)^{p-q}\,(x(i)^*)^{q}\right], \quad 0 \le q \le p \tag{C.3}$$

where $p$ and $q$ are integers, and the superscript $*$ denotes the complex conjugate.

C.3 Other features

Another effective feature type extracted for AMC here is instantaneous features [28, 30, 31]. We introduce three instantaneous features: $\gamma_{\max}$, kurtosis $K$, and skewness $S$. $\gamma_{\max}$ is the maximum value of the power spectral density of the normalized signal samples.

$K$ measures whether the PDF of $x(i)$ is heavy-tailed or light-tailed relative to a normal distribution. In other words, $K$ checks for the presence of outliers in the data distribution. A high $K$ indicates heavy tails or outliers in the data, while a low $K$ indicates light tails or a lack of outliers.

$S$ measures the lack of symmetry in the data distribution. The $S$ of a symmetrical distribution is zero. A negative $S$ indicates that $x(i)$ is skewed left, which means the left tail of the distribution is longer than the right tail. A positive $S$ indicates that $x(i)$ is skewed right, in which case the mean and median are typically greater than the mode.

Besides, we also include four variances. $\sigma^2_{aa}$ is the variance of the absolute value of the normalized instantaneous amplitude $|x_{cn}(i)|$. $\sigma^2_{v}$ represents the variance of the absolute value of the normalized signal phase.

For $\sigma^2_{dp}$, the variance of the direct instantaneous phase $\phi_{NL}(i)$, and $\sigma^2_{ap}$, the variance of the non-linear component of $\phi_{NL}(i)$, there is a threshold $x_t = 1$ below which the estimation of the instantaneous phase $\phi_{NL}(i)$ is sensitive to noise.

Table C.1: List of Features Used in Proposed AMC Method
Feature | Definition
f1: $C_{20}$ | $M_{20}$
f2: $C_{21}$ | $M_{21}$
f3: $C_{40}^{1/2}$ | $M_{40} - 3M_{20}^2$
f4: $C_{41}^{1/2}$ | $M_{40} - 3M_{20}M_{21}$
f5: $C_{42}^{1/2}$ | $M_{42} - |M_{20}|^2 - 2M_{21}^2$
f6: $C_{44}^{1/2}$ | $M_{44} - M_{40}^2 - 18M_{22}^2 - 54M_{20}^4 - 144M_{11}^4 - 432M_{20}^2M_{11}^2 + 12M_{40}M_{20}^2 + 192M_{31}M_{11}M_{20} + 144M_{22}M_{11}^2 + 72M_{22}M_{20}^2$
f7: $C_{60}^{1/3}$ | $M_{60} - 15M_{20}M_{40} + 30M_{20}^3$
f8: $C_{61}^{1/3}$ | $M_{61} - 5M_{21}M_{40} - 10M_{20}M_{41} + 30M_{20}^3M_{21}$
f9: $C_{62}^{1/3}$ | $M_{62} - 6M_{20}M_{42} - 8M_{21}M_{41} - M_{22}M_{40} + 6M_{20}^2M_{22} + 24M_{21}^2M_{20}$
f10: $C_{63}^{1/3}$ | $M_{63} - 9M_{21}M_{42} + 12M_{21}^3 - 3M_{20}M_{43} - 3M_{22}M_{41} + 18M_{20}M_{21}M_{22}$
f11: $C_{80}^{1/4}$ | $M_{80} - 28M_{60}M_{20} - 35M_{40}^2 + 420M_{40}M_{20}^2 - 630M_{20}^4$
f12: $C_{84}^{1/4}$ | $M_{84} - 16C_{63}C_{21} - |C_{40}|^2 - 18C_{42}^2 - 72C_{42}C_{21}^2 - 24C_{21}^4$
f13: $\gamma_{\max}$ | $\max |\mathrm{DFT}(x(\cdot))|^2 / N$
f14: $\sigma^2_{aa}$ | $E[x_{cn}^2(i)] - E[|x_{cn}(i)|]^2$
f15: $\sigma^2_{v}$ | $E[x_{v}^2(i)] - E[|x_{v}(i)|]^2$
f16: $\sigma^2_{dp}$ | $E_{x(i) > x_t}[\phi_{NL}^2(i)] - E_{x(i) > x_t}[\phi_{NL}(i)]^2$
f17: $\sigma^2_{ap}$ | $E_{x(i) > x_t}[\phi_{NL}^2(i)] - E_{x(i) > x_t}[|\phi_{NL}(i)|]^2$
f18: $v_{20}$ | $M_{42} / M_{21}^2$
f19: $\beta$ | $\sum_{i=1}^{N} x_I^2(i) \,/\, \sum_{i=1}^{N} x_Q^2(i)$
f20: $K$ | $E[x_{cn}^4(i)] / E^2[x_{cn}^2(i)]$
f21: $S$ | $E[x_{cn}^3(i)] / E^{3/2}[x_{cn}^2(i)]$
f22: PAR | $\max(|x(\cdot)|) / E[|x(i)|]$
f23: PRR | $\max(|x(\cdot)|^2) / E[|x(i)|^2]$

C.4 Summary of Selected Statistical Features

Some of the variables in Table C.1 are given by

$$x(i) = x_I(i) + j\,x_Q(i) \tag{C.4}$$

$$x_{cn}(i) = \frac{|x(i)|}{E[x(i)]} - 1 \tag{C.5}$$

$$x_{v}(i) = \sqrt{\frac{|x(i)|}{E^2\big[\,|x(i)| - E[x(i)]\,\big]}} - 1 \tag{C.6}$$

$$\phi_{NL}(i) = \mathrm{phase}(x(i)) - E[\mathrm{phase}(x(i))] \tag{C.7}$$

In Table C.1, $E[\cdot]$ denotes the mathematical expectation; $\mathrm{DFT}(\cdot)$ denotes the discrete Fourier transform; $N$ is the number of received symbols; and $x(\cdot)$ represents all samples of the received signal $x$.
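The moment-to-cumulant relations in Table C.1 can be checked numerically. The following is a minimal sketch in plain Python, not the implementation used in the thesis: the function names `moment` and `low_order_cumulants` and the two test constellations are hypothetical, the samples are assumed to be zero-mean, and only Equation C.3 and the definitions behind f1, f2, f3, and f5 (before the fractional power is applied) are covered.

```python
import math


def moment(samples, p, q):
    """Empirical mixed moment M_pq = E[x^(p-q) * conj(x)^q] (Equation C.3)."""
    total = sum(s ** (p - q) * s.conjugate() ** q for s in samples)
    return total / len(samples)


def low_order_cumulants(samples):
    """Second- and fourth-order cumulants built from moments (Table C.1).

    Assumes zero-mean samples; the scale normalization (power 2/p) is
    deliberately left out so the raw cumulant values stay visible.
    """
    m20 = moment(samples, 2, 0)
    m21 = moment(samples, 2, 1)
    m40 = moment(samples, 4, 0)
    m42 = moment(samples, 4, 2)
    return {
        "C20": m20,
        "C21": m21,
        "C40": m40 - 3 * m20 ** 2,
        "C42": m42 - abs(m20) ** 2 - 2 * m21 ** 2,
    }


# Ideal, noise-free, unit-power constellations used only as a sanity check.
BPSK = [1 + 0j, -1 + 0j]
QPSK = [complex(a, b) / math.sqrt(2) for a in (1, -1) for b in (1, -1)]

if __name__ == "__main__":
    for name, symbols in (("BPSK", BPSK), ("QPSK", QPSK)):
        values = low_order_cumulants(symbols)
        print(name, {key: round(val.real, 6) for key, val in values.items()})
```

For these ideal constellations the fourth-order cumulants separate the two schemes: BPSK gives C40 = C42 = −2, while QPSK gives C42 = −1 with C20 = 0. On real received signals the moments would be estimated from noisy samples, so the values only approximate these targets.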
An ensemble automatic modulation classification model with weight pruning and data preprocessing Yang, Xueting 2020
Item Metadata
Title | An ensemble automatic modulation classification model with weight pruning and data preprocessing |
Creator | Yang, Xueting |
Publisher | University of British Columbia |
Date Issued | 2020 |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2020-02-12 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0388609 |
URI | http://hdl.handle.net/2429/73517 |
Degree | Master of Applied Science - MASc |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2020-05 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |