UBC Theses and Dissertations


Video-based cardiac physiological measurements using joint blind source separation approaches Qi, Huan 2015

Video-based Cardiac Physiological Measurements Using Joint Blind Source Separation Approaches

by

Huan Qi
B. Eng., Zhejiang University, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Applied Science
in
THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

July 2015

© Huan Qi, 2015

Abstract

Non-contact measurements of human cardiopulmonary physiological parameters based on photoplethysmography (PPG) can lead to efficient and comfortable medical assessment. It has been shown that human facial blood volume variation during the cardiac cycle can be indirectly captured by regular Red-Green-Blue (RGB) cameras. However, few attempts have been made to incorporate data from different facial sub-regions to improve remote measurement performance. In this thesis, we propose a novel framework for non-contact video-based human heart rate (HR) measurement that explores correlations among facial sub-regions via joint blind source separation (J-BSS). In an experiment involving video data collected from 16 subjects, we compare the non-contact HR measurement results obtained from a commercial digital camera to results from a Health Canada and Food and Drug Administration (FDA) licensed contact blood volume pulse (BVP) sensor. We further test our framework on a large public database, which provides subjects' left-thumb plethysmograph signals as ground truth. Experimental results show that the proposed framework outperforms state-of-the-art independent component analysis (ICA)-based methodologies.

Driver physiological monitoring in the vehicle is of great importance for providing a comfortable driving environment and preventing road accidents. Contact sensors can be placed on the driver's body to measure various physiological parameters; however, such sensors may cause discomfort or distraction. The development of non-contact techniques can provide a promising solution.
In this thesis, we employ our proposed non-contact video-based HR measurement framework to monitor the driver's heart rate and perform heart rate variability analysis using a simple consumer-level webcam. Real-world road-driving experiments demonstrate that the proposed non-contact framework is promising even in the presence of unstable illumination variation and head movement.

Preface

This thesis is based on the following work:

• Huan Qi, Z. Jane Wang and Chunyan Miao, "Non-contact Driver Cardiac Physiological Monitoring Using Video Data", accepted for the Third IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), 2015.

The research was jointly initiated by Dr. Z. Jane Wang and the thesis author, and the majority of the research, including literature survey, model design, algorithm implementation, experimental data collection, data analysis and paper writing, was conducted by the author of this thesis, with valuable suggestions from Dr. Z. Jane Wang. Dr. Zhenyu Guo and Dr. Xun Chen also helped on the methods part in Chapter 2. Dr. Xun Chen, Mr. Liang Zou and Mr. Yiming Zhang helped greatly with data collection for the road-driving experiment in Chapter 3.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Background of Cardiac Physiological Measurements
  1.3 Background of Non-contact Physiological Measurements
  1.4 Research Objectives and Methodology
2 Method
  2.1 Introduction
  2.2 Facial Landmark Localization
  2.3 Joint Blind Source Separation
    2.3.1 Problem Formulation
    2.3.2 IVA and M-CCA
  2.4 Identify BVP Signal
  2.5 Connectivity Multiset Canonical Correlation Analysis
  2.6 Summary
3 Experiments
  3.1 Introduction
  3.2 EXP1: Laboratory Experiment
  3.3 EXP2: Public Database Experiment
  3.4 EXP3: Road-Driving Experiment
  3.5 Discussion
    3.5.1 HR Estimation Using Side Profile
    3.5.2 Dynamic HR Estimation
    3.5.3 Performance Analysis
4 Conclusion and Future Work
  4.1 Conclusion and Contribution
  4.2 Future Work
Bibliography

List of Tables

Table 1.1  Popular heart rate monitoring techniques
Table 3.1  A summary of experiments
Table 3.2  Performance on EXP1 using different non-contact HR measurement methods
Table 3.3  Basic information of the DEAP database, such as subject statistics and physiological parameters
Table 3.4  Performance on EXP2 using different non-contact HR measurement methods (with δ = 0.5)
Table 3.5  Performance on EXP2 using different non-contact HR measurement methods (with adaptive δ-correlation)
Table 3.6  Acceptance rate using different non-contact methods

List of Figures

Figure 1.1  Popular heart rate monitoring devices
Figure 1.2  A segment of ECG waveforms
Figure 1.3  A segment of BVP waveforms of a subject from the DEAP database [25]
Figure 2.1  Robust facial landmark localization in different viewpoints, generated by a pre-trained model in [4]
Figure 2.2  Facial sub-region division. (a) Division vertexes distribution. (b)-(d) Facial landmark localization in different viewpoints. (e)-(g) Areas covered by four facial sub-regions
Figure 2.3  Facial landmark localization and sub-region division pattern under three different experimental settings. From top to bottom: the setting of our self-collected laboratory experiment, the setting of the DEAP affective computing database [25], and the setting of our self-collected road-driving experiment
Figure 2.4  Overview of the proposed video-based (non-contact) HR measurement method using facial landmark localization and J-BSS techniques. First, subjects' faces are divided into several sub-regions according to coordinates of facial landmarks. Then color channel data from each sub-region are collected into temporal signals and fed to J-BSS algorithms.
The obtained source sets are clustered after certain detrending and filtering operations. Finally, we recover the BVP signal and conduct HR estimation and HRV analysis
Figure 2.5  One example of recovered SCVs using the M-CCA method. They are computed from datasets of four facial sub-regions, and each has three color channels and three underlying sources to recover
Figure 2.6  Results in Fig. 2.5 clustered by Normalized Cut [38]. The largest cluster has four elements, and their frequency spectra all contain peaks near 1 Hz, which is close to the human resting HR. The arrow indicates the largest peak among all spectra, which belongs to the BVP signal estimates. Here the cluster number is set to 8
Figure 2.7  Scatter plot of the δ test on the DEAP affective computing database. The line shows the linear regression model fit using test data. Here Fs denotes the sampling rate of the BVP signal after interpolation
Figure 2.8  The top row is an interpolated BVP signal before peak detection. The remaining two figures show different detection performances. With a fixed δ, several small local peaks are also incorporated, as labeled by blue arrows in the middle row. Using adaptive δ-correction that incorporates frequency knowledge of the input BVP signal, false detections are removed and almost all large local peaks corresponding to heart beats are successfully detected
Figure 2.9  Absolute error of non-contact HR measurement from 5 independent trials by altering the CDM pattern
Figure 2.10  Proposed learning-based C-MCCA based on an M3L model [18]. Given multi-set color channel signals, we train the model using extracted feature x and label set y. The trained model can predict the optimal label set y′ (i.e.
CDM) given any input feature x′ extracted from new multi-set color channel signals. The predicted CDM is then used for subsequent heart rate measurement and HRV analysis
Figure 3.1  Illustration of the system setup. The pulse oximeter was gently clamped on the subject's fingertip. A webcam was programmed to take pictures of the pulse oximeter's OLED screen every second. A consumer-level digital camera recorded the subject with the support of a tripod. All drawing materials in the upper figure are from the Internet. The lower figure shows a subject being recorded in one trial
Figure 3.2  (a) The webcam is focused on the OLED screen of the pulse oximeter and programmed to take pictures each second. (b) An example picture taken by the webcam in (a). (c) Smoothed 60-second samples of five subjects' oximeter readings. The average HR is shown in parentheses
Figure 3.3  Scatter plots of three non-contact methods: (a) ICA by [35], (b) ours using IVA, (c) ours using M-CCA
Figure 3.4  Bar plot of experimental results in EXP1
Figure 3.5  A participant's frontal face video during the experiment. Electrodes, wires, and tapes occlude parts of the facial regions
Figure 3.6  Error distribution of the proposed C-MCCA and the ICA-based method [35] without adaptive δ-correlation
Figure 3.7  Error distribution of three non-contact methods with and without adaptive δ-correlation
Figure 3.8  Scatter plots comparing HRgt with HRnc between (a) adaptive δ-correlation and fixed δ-correlation, (b) adaptive ICA and adaptive C-MCCA
Figure 3.9  Road-driving experiment (EXP3) setup. In the left figure, we show that the webcam is placed behind the wheel and a laptop is used to monitor the video recording. In the right figure,
In the right figure,a zoom-in picture is provided. . . . . . . . . . . . . . . . . . 47Figure 3.10 HRV analysis examples. The top row is from the laboratorysetting. The bottom row is from the real road driving experi-ment. Six measures in time and frequency domains are com-puted based on IBI series. LS-Periodogram and LS-Spectrogramare also given. . . . . . . . . . . . . . . . . . . . . . . . . . . 48xFigure 3.11 (a)-(d) Division pattern for profiles. (e) Part of recovered BVPsignal with detected peaks. (f) Readings from pulse oximeterand HR estimates using M-CCA and IVA. . . . . . . . . . . . 49Figure 3.12 Black dash line reflects one subject’s HR variation during therecording with sampling rate 1Hz. The red line is the HRestimate based on a slide window of past 10 seconds and a95% overlap. Both curves were smoothed by moving averagemethod with span 20. . . . . . . . . . . . . . . . . . . . . . . 50xiGlossaryECG ElectrocardiogramHRM Heart Rate MonitoringPPG PhotoplethysmographyHR Heart RateHRV Heart Rate VariabilityBVP Blood Volume PulseIBI Inter-beat IntervalSDNN Standard Deviation of the IBI SeriesRMSSD Root Mean Square of Successive Differences of the IBI SeriesLF Low FrequencyHF High FrequencyPSD Power Spectral DensityLS Lomb-ScargleHRVAS HRV Analysis SoftwareRGB Red-Green-BlueSCV Source Component VectorxiiJ-BSS Joint Blind Source SeparationCCA Canonical Correlation AnalysisM-CCA Multiset Canonical Correlation AnalysisBSCM Between-set Source Correlation MaximizationESCM Eigenvalue-maximization of Source Correlation MatrixRGCCA Regularized Generalized Canonical Correlation AnalysisIVA Independent Vector AnalysisFFT Fast Fourier TransformC-MCCA Connectivity Multiset Canonical Correlation AnalysisCDM Connectivity Design MatrixSSQCOR Sum of Squared CorrelationM3L Max-margin Multi-label ClassificationEEG ElectroencephalographyEOG ElectrooculographyxiiiAcknowledgmentsI want to express my great appreciation to my supervisor, Dr. Z. 
Jane Wang, for her persistent support, constant encouragement and profound insight in the research area throughout my master's study. I would like to thank Dr. Chunyan Miao from Nanyang Technological University for financial support and research guidance during my visit to Singapore. Many thanks go to Dr. Zhenyu Guo and Dr. Xun Chen for their research advice.

I would like to thank all my dear friends and labmates for their help and feedback. Special thanks go to Liang Zou and Yiming Zhang for their friendship and lunch companionship since the day we began to study together at UBC.

I would like to thank all committee members of my master's exam for their valuable time and suggestions.

Last but not least, I owe my deepest gratitude to my parents in China, Mr. Xiaodong Qi and Mrs. Juyan Wang, for their endless love and support. They are the spiritual idols in all aspects of my life.

Chapter 1
Introduction

1.1 Motivation

Various human physiological parameters provide direct or indirect evidence of human health state. Measurement of these physiological parameters, which are often interdependent, has always been one of the most fundamental problems in modern medicine. Among the numerous parameters, cardiovascular parameters are of great research interest, including heart rate (HR), heart rate variability (HRV), blood pressure, and respiratory rate. Large-scale clinical studies show that surveillance and prevention of certain cardiovascular diseases require regular medical assessment of HR and HRV [15], which is also known as heart rate monitoring (HRM). The history of HRM partially reflects the development of medical technologies. In traditional Chinese medicine, which dates back more than 2,000 years, therapists diagnose illness based on patients' wrist pulse patterns. For centuries, HRM was carried out by placing an ear on the patient's chest. The invention of the stethoscope by the French physician René Laennec nearly 200 years ago was a milestone in HRM [2].
It provides instant and clear heart beat feedback to therapists in a non-invasive fashion. Later, in 1903, the Dutch physiologist and Nobel laureate Willem Einthoven invented the first practical electrocardiogram (ECG). With the ECG technique, it became possible to observe and record the entire cardiac electrical activity of the heart beat cycle.

Since the invention of the ECG, great efforts have been made to develop convenient and comfortable HRM devices, as shown in Fig. 1.1. Some of the most popular HRM techniques are listed in Table 1.1.

Figure 1.1: Popular heart rate monitoring devices.

Table 1.1: Popular heart rate monitoring techniques

Device                        | Target Signal             | Accessory
------------------------------|---------------------------|---------------------
Electrocardiogram             | Heart electrical activity | Adhesive electrodes
Holter monitor                | Heart electrical activity | Adhesive electrodes
Chest strap                   | Heart electrical activity | Transmission module
Doppler fetal monitor         | Electronic audio          | Ultrasound couplant
Finger pulse oximeter         | Blood volume pulse        | None
Watch-like heart rate sensor  | Blood volume pulse        | None
Non-contact video technique   | Blood volume pulse        | None

A Holter monitor is a portable medical device for continuously monitoring various cardiovascular electrical activities for more than 24 hours. Electrodes need to be attached to the human body together with the monitor itself, and no intensive exercise is allowed during monitoring. A commercial-level chest strap is designed to measure HR in situations such as racing, hiking, and various sports exercises. A segment of the chest strap is made of multi-layered textile with good conductivity. A small and light processor is attached to measure heart electrical activity and transmit signals to other devices such as smartphones and computers. A Doppler fetal monitor uses the Doppler effect to generate an electronic audio simulation of the fetal heart beat. To enhance the simulation, ultrasound couplant (usually liquid) is often used to facilitate the transmission of ultrasonic energy from the transducer into the target.
A finger pulse oximeter is a non-invasive device designed for convenient HRM, which can provide an accurate instant HR within just a few seconds. It is based on the photoplethysmography (PPG) effect generated by the cardiac cycle. A PPG sensor is clipped on a thin part of the human body such as a fingertip. It measures the change in tissue optical absorbance, which is known as the blood volume pulse (BVP) signal. No other accessory is required. Another type of PPG-based device is the popular watch-like heart rate monitor, such as the Apple Watch. All aforementioned HRM devices can be classified as contact techniques since they require physical contact between the electrodes or sensors and the human body. Placement and removal of these attachments can cause discomfort, stress and even epidermal stripping [1].

With the advances in imaging sensors and computer vision technologies, many vision algorithms have been successfully adapted and applied to biomedical engineering applications [21]. Video-based physiological measurement was born in this exciting trend. Without requiring any physical contact, video-based physiological measurement techniques allow remote detection of human blood volume pulse signals (and thus heart rate measurements) using designated imaging sensors or even low-cost webcams. This technique can potentially bring HRM to the next level of comfort and convenience. Non-contact heart rate measurement benefits from the integration of computer vision and biomedical signal processing, both of which have witnessed important advances in recent years. For instance, state-of-the-art face tracking algorithms are more robust to varied backgrounds, occlusion, illumination change, and intensive head motion. Advanced multi-set analysis of biomedical signals can reveal deeper correlations across multiple datasets. Investigating the interaction between these two areas has been receiving increasing research attention.
It is expected that such interaction will achieve more accurate and robust non-contact physiological measurement [40].

Studying the robustness of non-contact HRM is also of great importance to facilitate the utilization of non-contact cardiac physiological measurements in real-world applications. Convenient measurement of HR holds great potential for both clinical diagnosis and daily healthcare. In a clinical setting, non-contact methods work without attaching any medical electrode or sensor to the subject. Some such methods have been clinically tested, for example in vital signs monitoring during haemodialysis [40], the neonatal intensive care unit [1], quantification of limb movement in epileptic seizure [29], and dynamic tissue phantom evaluation [43]. Family healthcare can also benefit greatly from non-contact detection techniques, especially with the rapid dissemination of smartphones [37]. For instance, commercial apps such as Cardiio (Cardiio, Inc., San Francisco, CA, USA) and Vital Signs Camera [33] (Philips, Inc., Amsterdam, Netherlands) enable users to measure their heart rates from continuous recordings of their faces taken by the phones' front cameras. Another potential application is driver medical assistance in the automobile environment, which is considered one of the most promising ways to effectively prevent accidents and augment intelligence in transportation systems [14, 47]. Such an assistance system should incorporate reliable measurements of the driver's vital signals in order to depict his/her driving condition.

In this thesis, with the intention of enhancing the accuracy and robustness of existing non-contact techniques, we plan to develop a non-contact video-based heart rate monitoring framework based on the combination of advanced computer vision and multi-set data analysis methods.
Moreover, we attempt to test the proposed framework under different indoor and outdoor environments.

1.2 Background of Cardiac Physiological Measurements

As a prognostic factor and potential therapeutic target, HR has been verified in large epidemiological studies to be an independent predictor of cardiovascular and all-cause mortality for people with or without diagnosed cardiovascular disease [15]. Currently, ECG devices are widely used in HRM due to their high reliability. ECG records the electrical activity of the heart over a certain period of time using multiple electrodes attached to a patient's body. During each heart beat cycle, depolarization of the heart muscle results in tiny electrical variations on the skin, which can be detected by ECG electrodes. Variations from multiple electrodes are processed to form heart beat waveforms, as shown in Fig. 1.2.

Figure 1.2: A segment of ECG waveforms.

Figure 1.3: A segment of BVP waveforms of a subject from the DEAP database [25].

Normally, each individual heart beat is represented on the ECG as a PQRST complex. Different parts of the PQRST complex are related to different sub-processes of the cardiac cycle. One common method to estimate HR is to use the mean R-R interval:

    HR = 60 / T_RR,    (1.1)

where T_RR denotes the average time interval, in seconds, between adjacent R peaks given a segment of heart beat waveforms. Since the physiological interpretation of the PQRST complex is beyond the scope of this thesis, interested readers are referred to [20] for more information.

PPG is an indirect but effective technique to measure cardiovascular BVP. During a cardiac cycle, variations of tissue blood volume in certain human body segments modulate the transmission or reflection of visible light at these segments. PPG sensors are used to capture such variations under a dedicated light source, and the heart rate can be estimated correspondingly by measuring the time intervals between consecutive peaks of the signal [40]. As shown in Fig.
1.3, the BVP signal contains less information about the cardiac cycle than the ECG signal does. However, the BVP signal is sufficient to estimate HR using a method similar to that for ECG:

    HR = 60 / IBI,    (1.2)

where IBI denotes the average inter-beat interval (IBI), in seconds, between adjacent heart beat peaks given a segment of BVP waveforms.

Besides the heart rate, another important cardiac physiological parameter is HRV, the physiological phenomenon of variation in the time interval between heartbeats. Once the heart beat signal is obtained, the sequence of IBIs computed from every pair of adjacent peaks can be extracted for HRV analysis, which is, to some extent, more informative than the heart rate alone. HRV analysis usually focuses on time- and frequency-domain measures of the IBI series. Various time-domain measures have been proposed, such as the standard deviation of the IBI series (SDNN), the root mean square of successive differences of the IBI series (RMSSD), the number of successive differences that are greater than x milliseconds (NNx, often x = 50), and the percentage of total intervals that successively differ by more than x milliseconds (pNNx, often x = 50).

Clinical pathological studies of HRV reveal that the low-frequency (LF) and high-frequency (HF) oscillations of the IBI series are of great research interest. It is believed that LF is associated with sympathetic and parasympathetic activity, while HF is associated with respiratory sinus arrhythmia. The nominal frequency ranges of LF and HF are 0.04-0.15 Hz and 0.15-0.4 Hz, respectively. Powers within certain frequency bands are useful for quantitative description. For example, the LF power measures the amount of power within [0.04 Hz, 0.15 Hz], calculated by integrating the power spectral density (PSD) over the frequency band. The HF power can be calculated in a similar way over the range [0.15 Hz, 0.4 Hz]. The ratio of the LF power to the HF power (i.e., LF/HF) also provides insight into the sympathetic-parasympathetic balance.
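To make these definitions concrete, the following minimal sketch computes Eq. (1.2) and the time-domain HRV measures (SDNN, RMSSD, NN50, pNN50) from an IBI series. The function name and the sample IBI values are illustrative assumptions, not data or code from the thesis.

```python
import statistics

def hrv_time_domain(ibi_ms):
    """Time-domain HRV measures from an IBI series given in milliseconds."""
    mean_ibi = sum(ibi_ms) / len(ibi_ms)
    hr = 60000.0 / mean_ibi                    # Eq. (1.2) with IBI in ms
    sdnn = statistics.stdev(ibi_ms)            # SDNN: std. dev. of the IBI series
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    rmssd = (sum(d * d for d in diffs) / len(diffs)) ** 0.5  # RMSSD
    nn50 = sum(1 for d in diffs if abs(d) > 50)              # NN50 (x = 50 ms)
    pnn50 = 100.0 * nn50 / len(diffs)                        # pNN50, in percent
    return {"HR": hr, "SDNN": sdnn, "RMSSD": rmssd,
            "NN50": nn50, "pNN50": pnn50}

# Hypothetical resting-subject IBIs; mean IBI is 1000 ms, so HR = 60.0 bpm.
ibi = [990, 1010, 1000, 980, 1060, 1000, 940, 1020]
m = hrv_time_domain(ibi)
```

Note that Eq. (1.2) expects the IBI in seconds; the sketch works in milliseconds, the unit in which the NN50/pNN50 threshold is defined, and converts with the factor 60000.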
A popular way to estimate the PSD is the Lomb-Scargle periodogram method (LS-Periodogram), which does not require the data to be uniformly sampled. If we segment the IBI series temporally and generate the LS-Periodogram with respect to time, the resulting plot is a spectrogram. Currently there are many open-source toolboxes, such as HRVAS [36], for HRV analysis in the time and frequency domains. The physiological interpretation of the time- and frequency-domain measures in HRV analysis is beyond the scope of this thesis. Interested readers are referred to [7] for more information.

1.3 Background of Non-contact Physiological Measurements

Many efforts have been made to provide non-contact HR measurements. Some works used dedicated sensors such as Doppler wave sensors [5, 17, 44] and thermal imaging sensors [16]. The study in [45] showed, for the first time, that BVP signals can be remotely acquired from the human face using consumer-level digital cameras in ambient light. Poh et al. [34] presented an independent component analysis (ICA) framework to measure HR using a low-cost webcam in ambient light. Later, the authors extended their previous work with measurements of the respiratory rate and the low- and high-frequency components of HRV [35]. To overcome the frequency resolution limitation of traditional red-green-blue (RGB) sensors, McDuff et al. [30] presented a modified five-band digital camera with cyan and orange frequency bands added to the original red, green and blue color channels. Experimental results showed that such modifications improve the performance of physiological measurements of HR and HRV. Real-time measurement using the continuous wavelet transform can be achieved despite the existence of light and motion artifacts [8]. In [6], an interesting motion-based approach was presented to recover the heart beat signal. It showed that the influx of blood during a cardiac cycle causes detectable head motion, in accordance with Newton's third law of motion.
The frequency components of such motion can be extracted using facial feature tracking and principal component analysis. A recent publication [26] reported results under more challenging conditions involving subject motion and illumination variations. The proposed normalized least mean square adaptive filtering method was tested on a difficult public database and achieved state-of-the-art performance when compared to other methods.

Based on the aforementioned literature of recent years, it can be concluded that the accuracy of non-contact measurements is highly susceptible to the video recording environment. For example, in [30], facial video recording was conducted in a well-controlled laboratory environment with stable indoor illumination and little head motion artifact. In that case, the performance of non-contact measurements is almost as good as that of the contact PPG sensor (the ground truth). In [26], a much more challenging experimental environment impaired the performance of non-contact methods to a large extent. This suggests that there is still much room for improvement in video-based non-contact physiological measurements.

1.4 Research Objectives and Methodology

The technical objective of this thesis is to investigate possible solutions for enhancing the performance of video-based non-contact HR measurements under challenging experimental environments, in order to facilitate the utilization of non-contact physiological measurements in real-world applications. It is worth noting that almost all previous non-contact techniques extract the face color channel data by averaging over the entire facial region, without considering potential variations among different facial sub-regions¹. Therefore we plan to investigate how such variations among different facial sub-regions might contribute to non-contact HR measurement.
We also plan to conduct three types of experiments to evaluate the performance of non-contact HR measurements:

i. An experiment in a well-controlled laboratory environment using self-collected video and physiological data.
ii. An experiment in a more challenging laboratory environment using video and physiological data from a public affective computing database.
iii. An experiment in a difficult road-driving setting using self-collected video and physiological data.

¹ A facial 'sub-region' is a region containing only one part of a human face.

In all three experiments, we attempt to compare the proposed non-contact HR measurement framework with traditional HRM devices. In the road-driving experiment, we also plan to compare HRV time and frequency measures.

In order to achieve the above objective, we propose a non-contact video-based human heart rate measurement framework that exploits data correlations among specified facial sub-regions via advanced facial landmark localization and joint blind source separation approaches. The two main components of the proposed framework, facial landmark localization and joint blind source separation, are summarized as follows:

• Facial Landmark Localization
Collecting facial color channel data is the first step of most non-contact methods. It is desirable that this data collection procedure be robust to potential head movement and illumination variation. In this thesis, we employ an advanced real-time facial landmark localization algorithm to collect facial color channel data. Specifically, based on detected facial landmark coordinates, we design a division pattern that divides the facial region into four sub-regions and collects data from each for subsequent analysis.
Experimental results verify the robustness of this algorithm under different settings, including the road-driving condition.

• Joint Blind Source Separation
To extract the BVP from color channel data of different facial sub-regions, we propose to use joint blind source separation methods, including multi-set canonical correlation analysis (M-CCA) and independent vector analysis (IVA). A BVP extraction pipeline is designed to ensure accurate heart rate measurement using techniques such as frequency analysis, signal detrending, and correlation clustering via normalized cut. We also develop a BVP peak detection method using an adaptive sliding window size. We observe that HR measurement performance varies as the connectivity among different sub-regions is altered. Based on this observation, we propose to use a max-margin multi-label classification method to learn the interaction among data from different sub-regions in order to further enhance the performance of our framework. A learning-based connectivity multi-set canonical correlation analysis (C-MCCA) algorithm is proposed. Experimental results demonstrate that the proposed non-contact framework outperforms the current state-of-the-art method.

The remainder of the thesis is organized as follows. In Chapter 2, we elaborate on the employment of facial landmark localization and joint blind source separation approaches. The proposed non-contact measurement framework is described in detail, including its pipeline and several proposed methods. Chapter 3 contains the setting descriptions and results of three types of experiments to evaluate the performance of the proposed framework. Conclusions are given in Chapter 4, along with a discussion of future work.

Chapter 2
Method

2.1 Introduction

Two major concerns that determine the performance of a non-contact HR measurement method are (i) data collection and (ii) BVP signal recovery, especially in challenging environments where illumination variation and motion artifacts are involved.
To tackle the first concern, we employ a real-time facial landmark localization algorithm to track subjects' facial regions. Based on the coordinates of certain facial landmarks such as the eyes, nose, and mouth, we divide the facial region into four non-overlapping sub-regions and extract four corresponding sets of color channel signals. Compared with traditional methods, which mainly focus on the entire facial region, this sub-region data collection method allows us to investigate the interaction among different sub-regions and whether it contributes to improved HR measurement accuracy. J-BSS methods can handle this multi-set analysis problem through correlation maximization. In this thesis, we introduce a framework for non-contact HR measurement that uses landmark localization to designate facial sub-regions and J-BSS to recover BVP signals. We elaborate on the details in the following sections.

2.2 Facial Landmark Localization

In this section, we first briefly introduce the importance of facial data collection using landmark localization methods, and then focus on a recently proposed vision algorithm and on how to divide facial regions using a designated division pattern.

For each frame of the video signal, we divide the face into M = 4 sub-regions and extract color channel data from each of them. There are two important issues in dividing facial sub-regions in video signals. First, a face should be accurately detected in each frame of the recorded video. Second, it should be guaranteed that the locations of corresponding facial sub-regions remain approximately the same over all frames. The second issue is a fundamental requirement, since we intend to study correlations among different sub-regions.
Studying such correlations only makes sense if the color channel data in each sub-region are extracted from the same physical part of the face, no matter how the absolute coordinates of the face change from one frame to another due to possible head movements.

Facial landmark localization is a natural approach to address these requirements. Specifically, we employ the landmark localization method proposed in [4] to divide sub-regions based on the coordinates of detected facial landmarks (e.g., eyebrows, pupils, nose bridge, and lips). In [4], an efficient facial landmark localization algorithm is introduced that achieves real-time 2D face and eye landmark detection and tracking. By incrementally updating a discriminative facial deformable model, the algorithm achieves state-of-the-art performance for face alignment in static images and face tracking in videos. In this thesis, we use a pre-trained model provided by the authors of [4], which allows detection and tracking of 49 facial landmarks. As shown in Fig. 2.1, the landmark localization method works across different viewpoints and returns the (x, y) coordinates of the detected facial landmarks (marked as green dots) in each frame of each subject's video. Based on the coordinates of these landmarks, we specify a division pattern that uses certain landmarks and their geometric connection lines as vertices and edges to constitute M = 4 polygons, or regions of interest, as facial sub-regions. In Fig. 2.2(a), filled dots denote facial landmarks while black squares denote division vertices; {ri} denote the coordinates of selected landmarks and {Vi} those of division vertices.

Figure 2.1: Robust facial landmark localization in different viewpoints, generated by a pre-trained model from [4].

Figure 2.2: Facial sub-region division. (a) Division vertex distribution. (b)-(d) Facial landmark localization in different viewpoints. (e)-(g) Areas covered by the four facial sub-regions.

{V1, V2, V3, V4, V6} are corresponding facial landmarks.
The remaining vertices are specified as follows:

V5 = r1 + (3/4)(r1 − r2),    V7 = r4 + (3/4)(r4 − r3)
V11 = r7 + (3/4)(r7 − r5),   V12 = r8 + (3/4)(r8 − r6)
V9 = (1/2)(r5 + r6),   V8 = (1/2)(V5 + V11),   V10 = (1/2)(V7 + V12)

This division pattern was tested in different experimental environments. In Fig. 2.3, we show that it guarantees robust data collection in the different facial sub-regions under three different settings: a well-controlled laboratory environment (the first row in Fig. 2.3), the challenging recording environment of the DEAP affective computing public database [25] (the second to fourth rows in Fig. 2.3), and a real-world road-driving setting where illumination and motion artifacts are constantly present (the last row in Fig. 2.3).

Most facial sub-regions are robustly captured, as shown in Fig. 2.2(e)-(g). Of the eight division vertices, two are facial landmarks; the rest are geometric coordinates constructed from facial landmarks. For each sub-region, the color channel data are computed by averaging over all pixels within the sub-region. By aligning the data temporally, we acquire four facial sub-region datasets, each containing a multi-dimensional color channel signal for the corresponding sub-region during the recording.

2.3 Joint Blind Source Separation

After acquiring multiple datasets from the facial sub-regions, we apply joint blind source separation (J-BSS) methods based on the assumption that these datasets share common underlying sources, which come from blood volume variations during cardiac cycles.

2.3.1 Problem Formulation

First we describe the mathematical formulation and notation of J-BSS problems in general. Given M datasets, let X[m] ∈ R^{V×N} denote the m-th dataset, where V is the number of variables and N is the number of samples (in our case V = 3 for

Figure 2.3: Facial landmark localization and sub-region division pattern under three different experimental settings.
From top to bottom: the setting of our self-collected laboratory experiment, the setting of the DEAP affective computing database [25], and the setting of our self-collected road-driving experiment.

the RGB color channels and N is the number of video frames). X[m] can be written in terms of its column vectors x[m]_n (1 ≤ n ≤ N) as:

X[m] = [x[m]_1, · · · , x[m]_N]  for 1 ≤ m ≤ M    (2.1)

where x[m]_n ∈ R^{V×1} is the n-th observation. It is assumed that each dataset is a linear mixture of L underlying independent sources:

x[m] = A[m] s[m]  for 1 ≤ m ≤ M    (2.2)

where A[m] ∈ R^{V×L} is the mixing matrix to be determined and s[m] ∈ R^{L×1} is the random source vector, which can be further written as s[m] = [s[m]_1, · · · , s[m]_L]^T for 1 ≤ m ≤ M, where the superscript T denotes transposition.

The concept of a source component vector (SCV) in J-BSS is based on the assumption that common underlying sources are shared among multiple datasets [23]. Formally, let s_l = [s[1]_l, · · · , s[M]_l]^T denote the l-th SCV, for 1 ≤ l ≤ L. An SCV s_l is a random vector uncorrelated with all other SCVs, while the components within it are mutually correlated. The general goal of J-BSS is to find the SCVs by estimating the mixing matrices A[m] (or their inverses W[m]) and the corresponding source vector estimates y[m] = W[m] x[m]. The estimate of s_l is then y_l = [y[1]_l, · · · , y[M]_l]^T, where y[m]_l is the estimate of the l-th component in dataset m:

y[m]_l = (w[m]_l)^T x[m]    (2.3)

where the demixing vector w[m]_l is the l-th column of W[m].

2.3.2 IVA and M-CCA

Several approaches to J-BSS have been developed in recent years based on different statistical assumptions. One direction focuses on extensions of the ICA idea, with methods such as group ICA [9], parallel ICA [28], IC-PLS [11], and independent vector analysis (IVA) [3, 23].
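Before turning to the specific algorithms, the generative model of (2.1)-(2.3) and the SCV assumption can be made concrete with synthetic data (a minimal numpy sketch; all sizes, waveforms, and noise levels are illustrative and not taken from the thesis experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
M, V, L, N = 4, 3, 3, 500   # datasets (sub-regions), channels, sources, samples

# A latent pulse waveform shared across datasets: the first source of every
# dataset is a noisy copy of it, so together these copies form one SCV whose
# components are correlated across datasets.
t = np.arange(N) / 50.0
pulse = np.sin(2 * np.pi * 1.1 * t)             # ~66 bpm cardiac component
sources = []
for m in range(M):
    s = rng.standard_normal((L, N))             # dataset-specific sources
    s[0] = pulse + 0.1 * rng.standard_normal(N)
    sources.append(s)

# Each dataset is observed through its own unknown mixing matrix A[m],
# i.e. x[m] = A[m] s[m] as in (2.2).
A = [rng.standard_normal((V, L)) for _ in range(M)]
X = [A[m] @ sources[m] for m in range(M)]       # M observed datasets, each V x N

# The components of the first SCV are strongly correlated across datasets.
r = np.corrcoef(sources[0][0], sources[1][0])[0, 1]
print(X[0].shape, r > 0.9)
```

A J-BSS algorithm such as IVA or M-CCA would be handed only the observed X[m] and asked to recover the demixing matrices W[m], up to the usual scaling and permutation ambiguities.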
Among these methods, IVA is a natural extension of ICA from one to multiple datasets, ensuring that the extracted sources are independent within each dataset and, at the same time, well correlated across the datasets. IVA is designed to minimize the mutual information I_IVA among the estimated SCVs [3]:

I_IVA ≜ I[y_1; · · · ; y_L]
      = Σ_{l=1..L} H[y_l] − H[y_1, · · · , y_L]
      = Σ_{l=1..L} H[y_l] − H[W[1]x[1], · · · , W[M]x[M]]
      = Σ_{l=1..L} ( Σ_{m=1..M} H[y[m]_l] − I[y_l] ) − Σ_{m=1..M} log |det(W[m])| − C_1    (2.4)

where H(·) denotes the entropy of a random variable (or vector) and C_1 is the constant term H[x[1], · · · , x[M]]. The derivation shows that minimizing I_IVA is equivalent to minimizing the entropy of all components y[m]_l while maximizing the mutual information within each estimated SCV y_l, for l = 1, · · · , L. We use IVA-G [3] for the implementation of IVA; it exploits second-order statistical information across multiple datasets by assuming that each SCV follows a multivariate Gaussian distribution.

Another popular method, multiset canonical correlation analysis (M-CCA), is an extension based on the concept of canonical correlation analysis (CCA). The goal of CCA is to find two linear transformation vectors a, b for two random vectors x1, x2 such that the correlation between y1 = a^T x1 and y2 = b^T x2 is maximized. The random variables y1 and y2 are called the first pair of canonical variates (CVs). The following pairs of CVs are obtained iteratively by maximizing the correlation under the constraint that they are statistically uncorrelated with the previous ones. M-CCA extends this correlation maximization idea by optimizing an objective function of the correlation matrix of the CVs from multiple random vectors, so as to achieve maximum overall correlation [22]. It is shown in [27] that J-BSS can be achieved by two different yet interrelated approaches: between-set source correlation maximization (BSCM) and eigenvalue maximization of the source correlation matrix (ESCM), respectively.
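As a concrete reference point for this correlation-maximization idea, the first pair of canonical variates of classical two-set CCA can be computed in closed form via whitening and an SVD (a numpy sketch for illustration only; the thesis itself uses the multiset methods discussed here, not this two-set routine):

```python
import numpy as np

def cca_first_pair(x1, x2):
    """First pair of canonical variates for two (variables x samples) arrays."""
    x1 = x1 - x1.mean(axis=1, keepdims=True)
    x2 = x2 - x2.mean(axis=1, keepdims=True)
    # Whiten each dataset via its SVD; the whitened data v1t, v2t have
    # orthonormal rows, so the singular values of v1t @ v2t.T are exactly
    # the canonical correlations.
    u1, s1, v1t = np.linalg.svd(x1, full_matrices=False)
    u2, s2, v2t = np.linalg.svd(x2, full_matrices=False)
    u, s, vt = np.linalg.svd(v1t @ v2t.T)
    a = u1 @ (u[:, 0] / s1)       # transformation vectors mapped back
    b = u2 @ (vt[0] / s2)         # to the original input space
    return a, b, s[0]

# Toy check: a component shared by both 3-channel signals is found with a
# high canonical correlation driven by that shared component.
rng = np.random.default_rng(1)
shared = rng.standard_normal(400)
x1 = rng.standard_normal((3, 400)); x1[0] += 3 * shared
x2 = rng.standard_normal((3, 400)); x2[1] += 3 * shared
a, b, rho = cca_first_pair(x1, x2)
print(round(rho, 2))
```

M-CCA generalizes exactly this quantity: instead of one correlation between two projections, it optimizes an objective built from the whole correlation matrix of the CVs across all M datasets.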
In the BSCM approach, the group of sources with the largest between-set correlation values is first extracted from the datasets by optimizing objective functions with respect to correlation magnitudes, i.e., measures of overall correlation. The ESCM approach, on the other hand, focuses on the maximum eigenvalue λ_max(R^(l)) of the M×M source correlation matrix R^(l) for the l-th SCV.

Figure 2.4: Overview of the proposed video-based (non-contact) HR measurement method using facial landmark localization and J-BSS techniques. First, subjects' faces are divided into several sub-regions according to the coordinates of facial landmarks. Then color channel data from each sub-region are collected into temporal signals and fed to J-BSS algorithms. The obtained source sets are clustered after certain detrending and filtering operations. Finally, we recover the BVP signal and conduct HR estimation and HRV analysis.

2.4 Identifying the BVP Signal

The major steps of the proposed framework are shown in Fig. 2.4. In step 1, uncompressed videos of each subject are analyzed frame by frame using facial landmark localization to obtain the sub-region division, as discussed in Section 2.2. We calculate the average RGB channel values of the pixels within each sub-region to generate a feature X(k) for frame k of a recorded video:

X(k) = [x[1]_k, · · · , x[M]_k]^T    (2.5)

where x[m]_k = [x^{mR}_k, x^{mG}_k, x^{mB}_k] contains the average RGB color channel values of sub-region m in frame k, and M is the number of sub-regions (datasets).

Figure 2.5: One example of recovered SCVs using the M-CCA method. They are computed from the datasets of four facial sub-regions, each with three color channels and three underlying sources to recover.

In step 2, by combining the features from all frames, we obtain a feature sequence [X(1), X(2), · · · , X(N)] for each subject, where N is the number of frames in the video. The feature sequence of sub-region m is referred to as dataset X[m], where X[m] ∈ R^{3×N}.
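Steps 1 and 2 can be sketched as follows (illustrative numpy code: the boolean masks are assumed to come from the landmark-based polygon division of Section 2.2, and the tiny constant-valued frames here are synthetic stand-ins for real video frames):

```python
import numpy as np

def frame_feature(frame, masks):
    """Average RGB over each sub-region of one H x W x 3 frame.

    Returns an M x 3 array: row m holds the [R, G, B] means over sub-region m,
    i.e. the x[m]_k of Eq. (2.5). `masks` is a list of M boolean H x W arrays.
    """
    return np.stack([frame[mask].mean(axis=0) for mask in masks])

def build_datasets(frames, masks):
    """Stack per-frame features into M datasets X[m], each of shape 3 x N."""
    feats = np.stack([frame_feature(f, masks) for f in frames])  # N x M x 3
    return [feats[:, m, :].T for m in range(feats.shape[1])]

# Tiny synthetic example: two 4 x 4 "frames" and two sub-region masks.
frames = [np.full((4, 4, 3), v, dtype=float) for v in (10.0, 20.0)]
masks = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
masks[0][:2], masks[1][2:] = True, True
X = build_datasets(frames, masks)
print(len(X), X[0].shape)   # 2 datasets, each of shape (3, 2)
```

With real data, `frames` would hold the N video frames and `masks` the M = 4 polygon masks recomputed per frame from the tracked landmarks.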
For m = 1, · · · , M, we have:

[X(1), · · · , X(N)] = [X[1]; · · · ; X[M]]    (2.6)

that is, the M datasets stacked vertically. All datasets are detrended using a technique that has been widely used in HRV analysis to remove slow linear or more complex trends from signals [41]; similar performance is achieved for smoothness parameters between 1500 and 2000. The detrended signals are further normalized to zero mean and unit variance. They are then fed into the J-BSS algorithms to recover source signals, under the assumption that the number of sources equals the number of observations for each dataset (step 3 of Fig. 2.4). One example of recovered SCVs is shown in Fig. 2.5. The resulting source signals are band-pass filtered to [0.5 Hz, 2.5 Hz], corresponding to 30 bpm and 150 bpm as the lower and upper bounds of human HR measurement (step 4 of Fig. 2.4).

In [34], Poh et al. empirically selected the second source as the BVP signal. This can be problematic in practice, since there is no guarantee on the order of the sources recovered by ICA. The source selection method in [30, 35] is simple yet effective: all sources are band-pass filtered and transformed by the normalized Fast Fourier Transform (FFT), and the one with the largest peak in the frequency domain is selected as the BVP signal. In the J-BSS framework, however, the number of recovered sources can be large, and the differences among them are sometimes difficult to tell even in the frequency domain. We instead consider the basic assumption of J-BSS that SCVs are uncorrelated with all other SCVs while the components within each SCV are correlated. It is reasonable to assume that the mutually correlated sources in the source sets are likely to be estimates of BVP signals, since the human heartbeat is an underlying source for all facial sub-region datasets.

Therefore, in step 5 of Fig. 2.4, we propose performing spectral clustering on the similarity matrix of all source variables, which is calculated from the normalized cross-correlations among all recovered source signals.
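This correlation-clustering selection step can be sketched in a dependency-free way. In the sketch below, connected components of a thresholded |correlation| graph stand in for the Normalized Cut algorithm actually employed (so the cluster count is implicit rather than fixed), and all thresholds are illustrative:

```python
import numpy as np

def select_bvp(sources, fs, thresh=0.5, band=(0.5, 2.5)):
    """Pick a BVP estimate from K recovered source signals (K x N array).

    Sources are grouped into connected components of the thresholded
    |correlation| graph; the member of the largest group with the strongest
    spectral peak inside `band` (in Hz) is returned.
    """
    K = sources.shape[0]
    adj = np.abs(np.corrcoef(sources)) > thresh    # similarity graph
    labels = np.full(K, -1)
    for seed in range(K):                          # flood-fill components
        if labels[seed] < 0:
            stack, labels[seed] = [seed], seed
            while stack:
                i = stack.pop()
                for j in np.flatnonzero(adj[i] & (labels < 0)):
                    labels[j] = seed
                    stack.append(j)
    comps, counts = np.unique(labels, return_counts=True)
    idx = np.flatnonzero(labels == comps[counts.argmax()])
    freqs = np.fft.rfftfreq(sources.shape[1], d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    spectra = np.abs(np.fft.rfft(sources[idx], axis=1))[:, in_band]
    return idx[spectra.max(axis=1).argmax()]

# Toy check: 4 noisy copies of a 1.2 Hz pulse among 8 unrelated noise sources.
rng = np.random.default_rng(2)
t = np.arange(600) / 50.0
srcs = rng.standard_normal((12, 600))
for i in range(4):
    srcs[i] = np.sin(2 * np.pi * 1.2 * t) + 0.2 * rng.standard_normal(600)
print(select_bvp(srcs, fs=50.0))   # index of one of the four pulse copies
```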
The cluster number is deliberately set slightly smaller than the total number of recovered sources, so that most resulting clusters have only one component. The cluster with the largest number of components is then selected; according to our assumption, it contains several BVP signal candidates. In our case, 12 source signals were assigned to 8 clusters, and the largest cluster usually contained 3 or 4 BVP signal candidates. In step 6, we run the same frequency-domain source selection approach discussed above to determine the best BVP signal. We employ the Normalized Cut algorithm [38] to perform the spectral clustering. One example of the source selection process is shown in Fig. 2.6: most clusters have only one or two components, while the largest cluster contains four.

Figure 2.6: Results in Fig. 2.5 clustered by Normalized Cut [38]. The largest cluster has four elements, and their frequency spectra all contain peaks near 1 Hz, which is close to the human resting HR. The arrow indicates the largest peak among all spectra, which belongs to the BVP signal estimate. Here the cluster number is set to 8.

The resulting BVP signal reflects the subject's heartbeat process during the recording and can be used to estimate HR. The signal is first interpolated to increase its sampling frequency to Fs = 256 Hz. A peak detection algorithm is then applied to find peaks separated by at least δ sampling points, which we call δ-correction. Here δ is a positive integer designed to ignore smaller peaks occurring in close proximity to a large one, a common situation in BVP signals. For instance, if there is a large local peak at index x, then all smaller peaks in the range (x − δ, x + δ) are ignored.

In [30], different values of δ were tested and the one giving the best peak detection performance was chosen by visual verification. However, it is reasonable to select the value of δ adaptively, because the IBI of BVP signals varies over time.
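The δ-correction rule, with δ chosen adaptively from the signal's dominant frequency, can be sketched as follows (numpy only; A_FIT and B_FIT are illustrative placeholder coefficients standing in for the regression fit described in the text, and the synthetic waveform is not real BVP data):

```python
import numpy as np

# Placeholder coefficients for the linear model delta* = a*f + b; the actual
# values come from regressing on the (f, delta*) pairs of the delta test.
A_FIT, B_FIT = -60.0, 160.0

def adaptive_delta(freq_hz):
    """Predict the minimum peak separation delta (samples, at Fs = 256 Hz)."""
    return max(1, int(A_FIT * freq_hz + B_FIT))

def detect_peaks(x, delta):
    """delta-correction: keep local maxima with no larger sample within +/-delta."""
    peaks = []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            lo, hi = max(0, i - delta), min(len(x), i + delta + 1)
            if x[i] >= x[lo:hi].max():
                peaks.append(i)
    return np.array(peaks)

# Usage: a 10 s synthetic "BVP" at Fs = 256 Hz with a 1.2 Hz dominant
# frequency plus a small ripple that creates spurious local maxima.
fs = 256.0
t = np.arange(int(10 * fs)) / fs
bvp = np.sin(2 * np.pi * 1.2 * t) + 0.05 * np.sin(2 * np.pi * 5.0 * t)
peaks = detect_peaks(bvp, adaptive_delta(1.2))
ibi = np.diff(peaks) / fs                    # inter-beat intervals in seconds
print(len(peaks), round(60.0 / ibi.mean()))  # 12 peaks -> about 72 bpm
```

The final line shows the HR estimate computed from the average IBI, i.e., HR = 60 / mean(IBI) in beats per minute.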
The peak detection algorithm should adjust its step, in this case δ, according to the frequency of the BVP signal. Intuitively, a BVP segment with a frequency of 1.5 Hz (i.e., 90 beats per minute) should have more peaks than one with a frequency of 0.9 Hz. Accordingly, a smaller value of δ should be used to detect the peaks of a signal with a higher frequency. To verify this point, we ran a test on the DEAP affective computing database [25]. Given BVP signals of different frequencies, we manually chose different values of δ to estimate HR and selected the δ that yielded the most accurate HR measurement. A scatter plot of the results is shown in Fig. 2.7. A clear linear correlation exists between the frequency f of the input signal and the optimal value δ*. We fit a linear regression model to these (f, δ*) pairs and use it to predict the optimal δ for an input BVP signal with frequency f. We call this adaptive δ-correction. Fig. 2.8(c) shows the performance of our adaptive δ-correction peak detection method.

Figure 2.7: Scatter plot of the δ test on the DEAP affective computing database. The line shows the linear regression model fit to the test data. Here Fs denotes the sampling rate of the BVP signal after interpolation.

After localizing the peaks in a BVP signal, we compute the average IBI from the coordinates of the detected peaks and estimate HR using Equation 1.2.

Figure 2.8: The top row is an interpolated BVP signal before peak detection. The remaining two rows show different detection performances. With a fixed δ, several small local peaks are also incorporated, as labeled by blue arrows in the middle row.
Using adaptive δ-correction, which incorporates frequency knowledge of the input BVP signal, false detections are removed and almost all large local peaks corresponding to heartbeats are successfully detected.

2.5 Connectivity Multiset Canonical Correlation Analysis

In the M-CCA setting, all datasets are incorporated symmetrically when calculating the overall correlation. However, asymmetry may arise in certain applications. For instance, sampling areas, illumination angles, or even reflectivity likely differ among the facial sub-regions, which may cause inaccuracy if they are treated equally. Is there an optimal correlation combination among the datasets that gives the best measurement performance? The previously discussed J-BSS methods, IVA and M-CCA, cannot explore this question. Hence, we propose a learning-based J-BSS approach for non-contact HR measurement based on M-CCA, named connectivity multiset canonical correlation analysis (C-MCCA), to explore the existence of such an optimal combination.

Inspired by [27, 42], we combine the correlation maximization methodology with the flexibility of connectivity designation and modify the method to achieve J-BSS. The proposed method is termed C-MCCA since we introduce connectivity among datasets. We note that J-BSS via M-CCA involves all datasets without bias when optimizing its correlation objective functions, even though some datasets may be less useful than others. Since no prior knowledge about the connectivity among facial sub-regions is directly available, we design a data-driven method that uses a training set to learn a potentially non-linear mapping from multiset facial color channel signals (input) to an optimal connectivity pattern (output).

In [42], a method named regularized generalized canonical correlation analysis (RGCCA) was proposed for multiset data analysis.
Unlike J-BSS via M-CCA, which involves a multi-stage deflationary correlation maximization scheme, RGCCA finds a group of linear mixtures that maximizes the overall correlation over multiple datasets using partial least squares path modeling algorithms [12]. In that framework, a design matrix C = {c_{m,n}} is introduced based on prior knowledge about the between-set correlations of the datasets: c_{m,n} = 1 if datasets m and n are connected, and 0 otherwise.

Similarly, we introduce a binary connectivity design matrix (CDM) C = {c_{m,n}} into our non-contact HR measurement method. For any two different facial sub-regions m and n, where m, n = 1, · · · , M, we can either incorporate the data correlation between m and n by setting c_{m,n} = 1, or discard it by setting c_{m,n} = 0. In general, a CDM describes whether the color channel data of any two facial sub-regions are jointly analyzed (connected) when maximizing the overall correlation. The major difference between our method and RGCCA is that ours is a multi-stage method requiring no prior knowledge of the connectivity of the datasets. To explore whether an optimal CDM for non-contact HR measurement exists, we have to consider all possible CDMs. Theoretically, there are 2^{M(M−1)/2} combinations for M datasets, which yields 64 possible CDMs in our case (M = 4).

Like M-CCA, C-MCCA consists of L stages, where L is the number of source variables to be extracted. In the l-th stage, the following optimization problem is solved:

max_{w[m]_l, w[n]_l}  Σ_{m,n=1; m≠n}^{M}  c_{m,n} |r[m,n]_l|²
s.t.  w[m]_l ⊥ {w[m]_1, · · · , w[m]_{l−1}}  except for l = 1,    (2.7)

where r[m,n]_l ≜ corr[(w[m]_l)^T x[m], (w[n]_l)^T x[n]] denotes the correlation, for m, n = 1, · · · , M. The orthogonality constraint requires each newly obtained demixing vector to be uncorrelated with the previous ones. We use the sum of squared correlations (SSQCOR), one of the five objective functions presented in [22].
SSQCOR is shown in [27] to be the most robust of these. After identifying the demixing vectors, we recover the SCVs using (2.3). Clearly, M-CCA is the special case of C-MCCA whose CDM is 1_{M×M}, i.e., the M×M all-ones matrix.

C-MCCA works similarly to the aforementioned J-BSS methods, so it fits easily into our non-contact HR measurement framework. The only difference is that a CDM must be designated a priori before the multi-set signals are fed into C-MCCA. Intuitively, we expect that given the multi-set signals (input), an optimal CDM (output) can be determined that best recovers the latent BVP signal. To this end, we propose to use a max-margin multi-label (M3L) classification method [18] to learn a non-linear mapping between the input and output.

As formulated in [18], the objective of multi-label classification is to predict a set of relevant binary labels for a given input, in contrast to multi-class classification, where one predicts the single most probable label. Formally, the task is to learn a mapping f from a point x to a set of labels y ∈ Y, where Y denotes the set of all possible binary labels with |Y| = L. We assume N training samples of the form (x_i, y_i) ∈ R^D × {±1}^L, with y_{il} = +1 if label l has been assigned to sample i and −1 otherwise. One way to formulate this as a max-margin problem is to define a loss function Δ between the ground-truth and predicted labels and minimize it over the training set subject to regularization, giving the following primal:

min_f  (1/2) ‖f‖² + C Σ_{i=1}^{N} ξ_i
s.t.  f(x_i, y_i) ≥ f(x_i, y) + Δ(y_i, y) − ξ_i,  ∀ i, y ∈ {±1}^L \ {y_i}    (2.8)

with a new point x being assigned y* = argmax_y f(x, y). In [18], an efficient M3L approach is proposed that reduces the computational complexity even when dense pairwise label correlations are incorporated.
The implementation details are beyond the scope of this thesis; please refer to [18] for more information.

To demonstrate how the selection of the CDM influences non-contact HR measurement, we use trials from the DEAP affective computing database as examples. In this database, each trial contains one subject's facial video recording and blood volume pulse ground truth from a contact PPG sensor. First, we use the facial landmark localization algorithm to extract 4 sets of facial color channel signals from the 4 facial sub-regions. These four datasets serve as the input to our C-MCCA method. Since we have no prior knowledge of how to select the optimal CDM, we try all of them one by one by altering the entries {c_{m,n}} of the following CDM:

[ 1     c1,2  c1,3  c1,4 ]
[ c2,1  1     c2,3  c2,4 ]
[ c3,1  c3,2  1     c3,4 ]
[ c4,1  c4,2  c4,3  1    ]    (2.9)

where c_{m,n} = c_{n,m} ∈ {0, 1} for m, n = 1, · · · , 4. There are 64 possible CDMs in total.

Figure 2.9: Absolute error of non-contact HR measurement over 5 independent trials as the CDM pattern is altered.

Performance varies among CDMs because they alter the connectivity pattern among the facial sub-regions. Fig. 2.9 shows how different CDMs influence the accuracy of non-contact HR measurement on five independent trials. We measure performance in terms of the absolute error ε = |HR_gt − HR_nc|, where 'gt' refers to the ground truth acquired by the contact PPG sensor and 'nc' to the non-contact measurement using C-MCCA. The results in Fig. 2.9 have been sorted. Clearly, under certain CDMs, C-MCCA extracts BVP signals that yield better non-contact HR measurement.

To collect training samples, we first define a label set Y. As shown in Equation 2.9, a six-dimensional vector c determines a CDM:

c = {c_{2,1}, c_{3,1}, c_{4,1}, c_{3,2}, c_{4,2}, c_{4,3}}    (2.10)

Each element of c represents the connectivity status between two facial sub-regions. For instance, if c_{2,1} = 1, sub-regions #1 and #2 are connected, and their correlation is included when solving Equation 2.7.
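The CDM bookkeeping of Equations 2.9 and 2.10 is easy to make concrete (an illustrative sketch: the pair ordering follows Equation 2.10, and `ssqcor_kept` only evaluates which squared correlations a CDM retains, not the full C-MCCA optimization):

```python
import numpy as np
from itertools import product

M = 4
# (m, n) index pairs (0-based) in the order c = {c21, c31, c41, c32, c42, c43}.
PAIRS = [(1, 0), (2, 0), (3, 0), (2, 1), (3, 1), (3, 2)]

def cdm_from_vector(c):
    """Build the symmetric M x M connectivity design matrix from a 6-vector."""
    C = np.eye(M)
    for (m, n), bit in zip(PAIRS, c):
        C[m, n] = C[n, m] = bit
    return C

def ssqcor_kept(C, R):
    """Sum of squared between-set correlations retained by the CDM (cf. Eq. 2.7)."""
    off_diag = ~np.eye(M, dtype=bool)
    return float(np.sum(C[off_diag] * R[off_diag] ** 2))

all_cdms = [cdm_from_vector(c) for c in product((0, 1), repeat=6)]
print(len(all_cdms))   # 2^6 = 64 candidate CDMs, as stated in the text
```

The all-ones CDM, which is the last matrix enumerated here, reduces C-MCCA to plain M-CCA.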
Therefore, we define a label set of size six: Y = {y1, y2, y3, y4, y5, y6}. Each label y_i ∈ {+1, −1} corresponds to one element of c, in the exact order of Equation 2.10, for i = 1, · · · , 6. The correspondence between y_i and c_{m,n} is:

y_i = +1 if c_{m,n} = 1;  y_i = −1 if c_{m,n} = 0    (2.11)

Given an input x, the learnt model predicts its label set y ⊂ Y = {y1, y2, y3, y4, y5, y6}. In our non-contact HR measurement framework, we define the training pair (x, y) of a single trial as follows:

x = [Corr_{2,1}, Corr_{3,1}, Corr_{4,1}, Corr_{3,2}, Corr_{4,2}, Corr_{4,3}]^T    (2.12)

where Corr_{m,n} is the Pearson correlation coefficient between g_m and g_n, and g_m denotes the extracted green channel signal of facial sub-region m in the trial. The green channel has been widely used in non-contact research because, owing to its optical properties, it provides the strongest PPG signal.

To determine the label set y, we define a threshold parameter T. As shown by the black dashed line in Fig. 2.9, T bounds the absolute error of one trial using C-MCCA. We assume that the CDMs yielding an absolute error lower than T contain representative features for determining the mapping from x to the optimal CDM. Using c of Equation 2.10 to represent a CDM, let {c}_T denote the set of CDMs that yield an absolute error lower than T. For a single trial, we compute the expectation of this set, denoted c̄_T, by averaging over all elements of {c}_T. We define the label set y as:

y = c̄_T ⊙ λ    (2.13)

where λ ∈ [0, 1] is another threshold parameter and ⊙ denotes an element-wise thresholding operator: for a vector a and scalar A, a ⊙ A returns a binary vector of the same size as a, with

(a ⊙ A)_i = 1 if a_i > A;  (a ⊙ A)_i = 0 if a_i ≤ A    (2.14)

Given a single trial, we can now determine a training sample pair (x, y) ∈ R^6 × B^6 by choosing threshold parameters T ∈ R^+ and λ ∈ [0, 1]. Clearly, T controls the accuracy and λ controls the sparsity of y.

Figure 2.10: Proposed learning-based C-MCCA based on an M3L model [18]. Given multi-set color channel signals, we train the model using the extracted feature x and label set y. The trained model can predict the optimal label set y′ (i.e., CDM) given any input feature x′ extracted from new multi-set color channel signals. The predicted CDM is then used for subsequent heart rate measurement and HRV analysis.

After collecting a sufficient number of training samples, we can train an M3L model as in Equation 2.8 and predict the optimal CDM for a new input. Fig. 2.10 depicts the general idea of our M3L training and testing procedure. Using the predicted CDM, we then perform C-MCCA to recover the BVP signal for HR estimation. Further discussion and experimental results for the proposed learning-based C-MCCA are presented in Chapter 3. For ease of reading, we summarize the proposed non-contact framework as pseudo-code in Algorithm 1, using C-MCCA as an example in the J-BSS step.

Algorithm 1: Video-based HR Measurement via C-MCCA
Input: video frame sequences
Output: IBI sequence, HR estimation
1:  procedure TrainM3LModel
2:    Collect N training samples.
3:    Train an M3L model M by solving Equation 2.8.
4:  end procedure
5:  procedure LandmarkLocalization
6:    for m = 1 → M do
7:      for k = 1 → K do
8:        if a face is detected in frame k then
9:          Divide it into M sub-regions using [4].
10:         x[m]_k ← rgb(m)
11:       end if
12:     end for
13:     X[m] ← [x[m]_1, · · · , x[m]_K]
14:   end for
15: end procedure
16: procedure JointBlindSourceSeparation
17:   Compute x for the current trial using Equation 2.12.
18:   Predict the optimal CDM: C* ← f(M, x)
19:   Solve {W[m]} ← arg max_w (SSQCOR) in Equation 2.7, given C*.
20:   for m = 1 → M do
21:     Y[m] ← W[m] X[m]
22:   end for
23: end procedure
24: procedure HeartRateEstimation
25:   Compute the similarity matrix S ← {Y[m]} × {Y[m]}
26:   Obtain {y_L} as the largest cluster via NormCut(S).
27:   y* ← arg max_{y, f} |FFT({y_L})|
28:   HR, IBI ← AdaptivePeakDetection(y*)
29: end procedure

2.6 Summary

We propose a non-contact HR measurement framework to accurately and robustly recover human BVP signals from multiple color channel signals of
different facial sub-regions by exploiting their data correlation interactions. An advanced real-time facial landmark localization algorithm is used to track facial regions. A facial division pattern is designed using the coordinates of certain facial landmarks to form four facial sub-regions. In each sub-region, color channel data are collected in the form of temporal signals. Facial video data recorded under different experimental settings are used to test performance. The results demonstrate that the combination of facial landmark localization and facial region division is suitable for data collection even when intensive illumination variation and motion artifacts are involved; it is much more reliable than previously used face tracking methods for non-contact HR measurement.

We use joint blind source separation methods (M-CCA and IVA) to extract latent BVP signals from the multiset color channel signals. An adaptive δ-correction peak detection method is proposed, fitting a linear regression model between the BVP signal frequency and the optimal δ. Such an adaptive method is expected to yield better peak detection performance; experimental results are presented in the next chapter. Finally, a learning-based connectivity multiset canonical correlation analysis (C-MCCA) algorithm is proposed to investigate how color channel data from different facial sub-regions influence the performance of non-contact HR measurement. A max-margin multi-label classification model maps multiset signals (input) to an optimal connectivity design matrix (CDM), based on training samples collected from the DEAP affective computing database. One limitation of the proposed C-MCCA is its requirement for training data.
More discussion of the training and testing stages of C-MCCA is given in the next chapter.

Chapter 3: Experiments

3.1 Introduction

We evaluate our non-contact HR measurement framework in three different experimental settings: (i) a well-controlled laboratory environment with stable illumination and few motion artifacts (EXP1); (ii) a challenging laboratory environment involving illumination variations and head motion (EXP2); and (iii) a real-world road-driving situation with strong illumination variation and head movement (EXP3). A summary of the experimental data sources and the HRM devices used for evaluation is presented in Table 3.1. In all three settings, traditional HRM devices provide the heart rate ground truth. In the following sections, we describe these experiments in detail, including their settings, the subjects involved, and the number of trials, and then report the experimental results of the proposed non-contact HR measurement framework on each. All data processing procedures are implemented on a desktop (Intel Core i7 @ 3.20 GHz) using MATLAB.

i. Experiment under a well-controlled laboratory environment using self-collected video and physiological data.
ii. Experiment under a challenging laboratory environment using video and physiological data from a public affective computing database.
iii. Experiment under a difficult road-driving setting using self-collected video and physiological data.

Table 3.1: A summary of experiments

        Non-contact data source                   HRM device involved
EXP1    Self-collected facial video recording     Commercial finger oximeter
EXP2    DEAP affective computing database [25]    Clinical PPG sensor
EXP3    Self-collected facial video recording     Commercial chest strap

3.2 EXP1: Laboratory Experiment

We first carried out a self-collected non-contact HR measurement experiment to study the performance of the proposed method.
All recordings were conducted in an indoor laboratory environment with stable ambient light, which was a mixture of sunlight from nearby windows and fluorescent lamps on the ceiling. Subjects were asked to sit on a chair, wearing a pulse oximeter on their left index fingers. Meanwhile, a consumer-level digital camera (Sony Corp., NEX-5R, Tokyo, Japan) was used to take the video recordings at a distance of 1.5 meters. Video data were stored on a laptop for offline processing. An annotated illustration of the system setup is shown in Fig. 3.1. Sixteen subjects of both genders (5 females), various ages (22-40, avg. 27.9 y.o.), and multiple skin colors (East Asian, Semu, Caucasian) participated in the experiments. Six subjects were wearing glasses. All subjects were in good health, and one subject was pregnant. We recorded a 60-second video for each subject in 1920×1080 resolution (raw data) at 50 frames per second (fps). Before recording, subjects were asked to manually count their carotid pulse for exactly one minute and report the result. Then they were asked to wear a pulse oximeter and sit without intentional head movement for 60 seconds; video recording and pulse oximeter measurement began simultaneously after that. During the 60-second recording, subjects were asked to keep the body still and face the camera. Any mild facial expression was acceptable; one subject smiled occasionally, and it turned out to have no salient impact on the performance of our method. Since finger motion may impact the accuracy of pulse oximeter measurement, subjects were asked to keep their fingers as steady as possible during video recording. When recording was over, subjects were asked to manually count their carotid pulse again for exactly one minute and report the result. All acquired data were then analyzed by the proposed method to recover BVP signals and thereby estimate HR.

Figure 3.1: Illustration of the system setup. The pulse oximeter was lightly clamped on the subject's fingertip. A webcam was programmed to take pictures of the pulse oximeter's OLED screen every second. A consumer-level digital camera recorded the subject with the support of a tripod. All drawing materials in the upper figure are from the Internet. The lower figure shows a subject being recorded in one trial.

To evaluate the performance of our method, we compare the non-contact HR measurements to the readings of a pulse oximeter (Choice Electronic Technology Co., MD300C2, Beijing, China), which is officially licensed by Health Canada and the FDA. Such a device is widely used in family healthcare due to its low price and portability. A probe with a dedicated LED light source is lightly clamped on the fingertip to measure pulsatile variations in the light transmitted through tissue [40]. The variations are extracted as BVP signals to estimate HR in terms of beats per minute (bpm).

The pulse oximeter updates its reading every second and displays it on an OLED screen. By collecting its readings during the 60-second recording, we can visualize the HR variations of each subject. Since the oximeter has no data transmission module, we designed a data collection approach using an external webcam (Microsoft Corp., Xbox Live Vision, Redmond, WA, USA) connected to the laptop, as shown in Fig. 3.2(a). Subjects were asked to place their fingers so that the webcam focused on the OLED screen of the pulse oximeter, as shown in Fig. 3.2(b). We programmed the webcam to take a picture every second, synchronizing with the oximeter. Once video recording started, the webcam was automatically and simultaneously activated so that the video signals were temporally aligned with the contact measurements. Fig. 3.2(c) shows five example curves of the pulse oximeter's readings. We collected video signals from 16 subjects and recovered BVP signals using our J-BSS methods.
These signals were then used to estimate HR. All 16 videos were recorded in 1920×1080 resolution at 50 fps, with 60 seconds of length for testing. For comparison, we re-implemented the ICA-based non-contact HR measurement algorithm in [34, 35]. The experimental results of all methods on EXP1 are shown in Table 3.2. For our framework, we tested both the IVA and the M-CCA algorithm in the J-BSS step and report the results here.

Figure 3.2: (a) Webcam focused on the OLED screen of the pulse oximeter and programmed to take pictures every second. (b) An example picture taken by the webcam in (a). (c) Smoothed 60-second samples of five subjects' oximeter readings. The average HR is shown in parentheses.

Several statistical measures, used in previous research work, are employed for evaluation. First, the ground truth HRgt is acquired from the contact finger pulse oximeter by averaging the HR readings over the video recording period. For a single trial, the trial error is computed as HRerr = HRnc − HRgt, where HRnc denotes the non-contact HR measurement result. The first two measures are the mean error and the corresponding standard deviation of the trial error sequence with N = 16 elements, denoted as Me and SDe respectively. The root mean squared error of the trial error sequence, RMSEe, is also used:

RMSE_e = \sqrt{\frac{1}{N}\sum_{n=1}^{N} |HR_{err}(n)|^2}   (3.1)

The mean error rate of the trial sequence, Mr, is defined as:

M_r = \frac{1}{N}\sum_{n=1}^{N} \frac{|HR_{err}(n)|}{HR_{gt}(n)} \times 100\%   (3.2)

The Pearson correlation coefficient between the ground truth sequence and the non-contact measurement sequence is also computed, denoted as r. The symbol ∗ in Table 3.2 indicates that the correlation coefficient satisfies p < 0.01, i.e., there is a statistically significant correlation between the ground truth and the non-contact method. Scatter plots of all results are shown in Fig. 3.3. Fig. 3.3(a) used ICA based on the joint approximate diagonalization of eigenmatrices (JADE) algorithm [10], the same one used in [35].
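The statistical measures above can be computed directly from the two HR sequences. A minimal sketch (the variable names are ours; here Me is taken as the mean of the absolute trial errors, matching the use of |HRerr| in the other measures):

```python
import numpy as np

def evaluate_hr(hr_nc, hr_gt):
    """Me, SDe, RMSEe (Eq. 3.1), Mr (Eq. 3.2), and Pearson's r between
    non-contact HR estimates and ground-truth HR, one value per trial."""
    hr_nc = np.asarray(hr_nc, float)
    hr_gt = np.asarray(hr_gt, float)
    err = hr_nc - hr_gt                          # HRerr per trial
    me = np.abs(err).mean()
    sde = np.abs(err).std(ddof=1)
    rmse = np.sqrt(np.mean(np.abs(err) ** 2))    # Eq. (3.1)
    mr = np.mean(np.abs(err) / hr_gt) * 100.0    # Eq. (3.2), in percent
    r = np.corrcoef(hr_nc, hr_gt)[0, 1]
    return me, sde, rmse, mr, r
```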
Table 3.2: Performance on EXP1 using different non-contact HR measurement methods.

Method         Me (SDe) (bpm)     RMSEe (bpm)   Mr (%)   r
Poh [35]       1.6195 (1.0644)    1.9196        2.4853   0.9575*
Ours (IVA)     1.3922 (1.1741)    1.7974        2.1449   0.9510*
Ours (M-CCA)   0.9514 (0.7865)    1.2186        1.4519   0.9797*

Figure 3.3: Scatter plots of three non-contact methods: (a) ICA by [35]; (b) ours using IVA; (c) ours using M-CCA.

The facial data collection method is based on the Viola-Jones algorithm [46]. In Fig. 3.3(b)-(c), the same facial sub-region datasets were processed using different J-BSS methods: (b) IVA, (c) M-CCA. The proposed learning-based C-MCCA algorithm is not used in this experiment for two reasons. (i) C-MCCA requires a certain amount of training samples, usually at least 30% of the number of trials. In EXP1, only 16 trials were conducted, one per subject; such a small training set is insufficient to learn a meaningful model that would actually improve the performance. (ii) As mentioned above, EXP1 has a well-controlled experimental setting, where indoor illumination is much more stable and head movement artifacts are relatively small compared to EXP2 and EXP3. In this case, all non-contact methods listed in Table 3.2 yield good performance with respect to all statistical measures. With a mean error of less than 2 bpm and a mean error rate of less than 3%, it is safe to say that the performance of a non-contact measurement method is similar to that of a contact PPG sensor. It is also noted that the J-BSS-based methods outperform the ICA-based method with respect to mean error, RMSE, and mean error rate. In Fig. 3.4, a bar plot of the EXP1 results is illustrated with respect to trial errors. Here we use subject IDs to index trials. It can be seen that M-CCA manages to keep all trial errors within 2 bpm, while ICA does not.

Figure 3.4: Bar plot of experimental results in EXP1.
A discussion of why the J-BSS-based methods show more stable performance than the ICA-based method is given at the end of this chapter.

3.3 EXP2: Public Database Experiment

In EXP2, we test the proposed framework using the IVA, M-CCA, and C-MCCA methods on the DEAP affective computing database [25]. DEAP is a public multi-modal database for the analysis of human affective states in terms of levels of arousal, valence, like/dislike, dominance, and familiarity. It provides recordings of electroencephalography (EEG) and other peripheral physiological signals from 32 participants under designated multimedia emotional stimuli (music videos). Basic information about the DEAP database is listed in Table 3.3; for more details, please refer to [25]. For 22 of the 32 participants, who consented to the publication of their audio-visual recordings, frontal face video recordings are publicly available. We test our proposed framework on these videos. Among all available peripheral physiological signals, we use BVP (channel 39) as the ground truth for evaluation.

Table 3.3: Basic information of the DEAP database, such as subject statistics and physiological parameters

Participant information
  Number of participants                       32
  Number of participants with visual consent   22 (11 females)
  Number of trials                             ~40/participant
  Trial duration                               60 seconds/trial
Physiological parameters (downsampled to 128 Hz)
  Electroencephalography                       Channels 1-32
  Electrooculography                           Channels 33-34
  Electromyography                             Channels 35-36
  Galvanic skin response                       Channel 37
  Respiration belt feedback                    Channel 38
  Blood volume pulse                           Channel 39

In DEAP, frontal videos were recorded in DV PAL format using a Sony DCR-HC27E camcorder. They were then segmented and transcoded to 50 fps de-interlaced videos using the h264 codec. The 22 participants are of both genders (11 females) and various ages (19-37, avg. 26.5 y.o.). Ten of them wore glasses during the recording. Prior to the experiment, EEG and peripheral physiological sensors were placed and the signals checked.
The plethysmograph sensor was attached to the left thumb. All participants' frontal faces were occluded by electrooculography (EOG) sensors to varying degrees, as shown in Fig. 3.5. During the experiment, participants were asked to sit on a chair and watch different music videos. For each trial, a 60-second frontal face video was recorded in 720×576 resolution at 50 fps, along with the other physiological signals. Each participant had 40 trials, for a total of 874 trials (test videos) in this experiment. (Due to technical issues, i.e., the tape running out, participants #3, #5, and #14 have only 39 face videos each, and participant #11 has only 37.) We use the BVP signals captured by the plethysmograph sensor as the ground truth HRgt. In DEAP's pre-processed data release, the raw BVP signal (channel 39) was downsampled to 128 Hz and segmented appropriately to temporally align with the face video recordings.

Figure 3.5: A participant's frontal face video during the experiment. Electrodes, wires, and tapes occlude parts of the facial regions.

In Section 3.1, we describe EXP2 as an experiment with a challenging laboratory environment, for the following reasons. (i) Parts of the facial regions are occluded by electrodes, wires, and tapes, which lowers the performance of almost all facial tracking algorithms, regardless of whether the algorithm tracks the entire facial region or designated facial landmarks. This may increase the rate of false detections or frame drops. (ii) As illustrated in Fig. 2.3 and Fig. 3.5, the frontal facial videos of DEAP were recorded in a relatively dark environment compared to EXP1. Participants sat in front of a color monitor that was playing music videos. The illumination variations on participants' facial regions caused by the color monitor were large enough to be visible to the naked eye. Such variation acts as an artifact in the color channel data collection step and can interfere with the accurate extraction of facial BVP signals [26].
(iii) In EXP2, participants could move their heads naturally without being asked to keep still, which results in motion artifacts in the collected color channel signals.

To evaluate our non-contact HR measurement framework, we first apply the facial landmark localization algorithm to all EXP2 trials. For the aforementioned reasons, false detections and frame drops appeared in some trials. We therefore pick the 782 video trials (out of the total of 874) that do not have any frame drops. For each 60-second video trial, we extract the last 40 seconds and measure the non-contact HR using Equation 1.2. The corresponding BVP signals are used as ground truth. The experimental results of EXP2 are shown in Table 3.5. The statistical measures are computed in the same way as in EXP1.

Table 3.4: Performance on EXP2 using different non-contact HR measurement methods (with δ = 0.5)

Method     Me (SDe) (bpm)     RMSEe (bpm)   Mr (%)    r         Divergence
Poh [35]   7.9177 (6.3074)    10.1203       10.6647   0.4299*   12
IVA        6.7792 (5.1651)    8.5206        9.3684    0.4413*   0
M-CCA      6.7434 (5.2486)    8.5432        9.3505    0.4084*   0

Table 3.5: Performance on EXP2 using different non-contact HR measurement methods (with adaptive δ-correlation)

Method     Me (SDe) (bpm)     RMSEe (bpm)   Mr (%)    r         Divergence
Poh [35]   5.7089 (5.1334)    7.6752        8.5844    0.5530*   12
IVA        4.9227 (4.7683)    6.8513        7.0197    0.5964*   0
M-CCA      4.7001 (4.6633)    6.6189        6.8171    0.6358*   0
C-MCCA     3.6554 (3.4169)    5.0017        5.1712    0.7423*   0

For C-MCCA, we randomly select 200 (out of 782) trials to compose the training set using the methods described in Section 2.5. Two threshold parameters are set as T = 3 and λ = 0.53, both determined by cross validation. Compared to the results of EXP1, the performance of all non-contact methods drops significantly, which verifies that EXP2 has a much more challenging experimental setting than EXP1.
Our proposed non-contact framework, whether using the IVA, M-CCA, or learning-based C-MCCA method, outperforms the ICA-based method with respect to all statistical measures. In Fig. 3.6, the error distributions of the proposed C-MCCA and the ICA-based method are compared; in general, it is clear that C-MCCA outperforms ICA. It is worth noting that the ICA-based method does not converge in 12 (out of 782) trials, while our proposed methods converge in all trials. (Participant #20 has frame drops in every one of his trials, about 20 frames per trial on average. Although this is not a big problem given the 50 fps frame rate, we removed all of his trials anyway.)

In Section 2.4, we propose an adaptive peak detection method, namely adaptive δ-correlation. Unlike previous non-contact methods [30, 35], which fix δ, the proposed method uses a simple linear regression model to predict the δ value from the frequency information of the recovered BVP signal. To verify its performance, we test EXP2 with two δ-correlation strategies: (i) fixed δ = 0.5, and (ii) adaptive δ-correlation. For the first strategy, δ = 0.5 yields the best performance among the fixed values. The top three rows of Table 3.4 and Table 3.5 show the experimental results for (i) and (ii) respectively. Adaptive δ-correlation yields a performance improvement of more than 27% for all methods, which demonstrates its effectiveness. We further observe that the adaptive method helps decrease measurement bias and generally moves the error mean back toward zero. In Fig. 3.7, adaptive and non-adaptive δ-correlation are compared for three different non-contact methods; with adaptive δ-correlation, the mean of the error distribution tends to move toward zero, and its variance is also smaller. The correlation coefficient r in Table 3.4 and Table 3.5 also indicates that the proposed non-contact framework performs better. In Fig.
3.8(a), it is shown that with adaptive δ-correlation, a non-contact method such as [35] is better correlated with the ground truth than with fixed δ. In Fig. 3.8(b), we further compare our learning-based C-MCCA with the ICA-based method. Over the wide range from 50 bpm to 90 bpm, C-MCCA gives good HR estimates in most cases. Outliers that fall far from the perfect correlation line (on which the ground truth is perfectly fitted) exist for all non-contact methods. Most of them result from participants' intensive head motion during video recording, such as unintentionally rotating the head from left to right at high speed.

Figure 3.6: Error distribution of the proposed C-MCCA and the ICA-based method [35] without adaptive δ-correlation.

In [26, 34], the authors claim that for application scenarios such as detecting vital signs in an emergency situation, HR measurement with an error of less than Θ = 5 bpm is likely to be acceptable. Here we show how the aforementioned non-contact methods perform under different Θ values. The measure we use is the acceptance rate Γ:

\Gamma = \frac{\text{Number of trials s.t. } |HR_{err}| < \Theta}{\text{Number of all trials}} \times 100\%   (3.3)

where |·| denotes the absolute value. In Table 3.6, we compare the acceptance rates of the different non-contact methods as Θ changes.
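Equation (3.3) reduces to a thresholded count over the trial errors; a minimal sketch, with names of our own choosing:

```python
def acceptance_rate(hr_err, theta):
    """Percentage of trials whose absolute HR error is below theta bpm,
    as in Eq. (3.3)."""
    errors = list(hr_err)
    accepted = sum(1 for e in errors if abs(e) < theta)
    return 100.0 * accepted / len(errors)
```

Sweeping theta over 5, 3, 2, and 1 bpm reproduces the column layout of Table 3.6.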
C-MCCA manages to keep the HR error under 5 bpm for more than 70% of the trials.

Table 3.6: Acceptance rate using different non-contact methods

Method                  Θ = 5 (%)   Θ = 3 (%)   Θ = 2 (%)   Θ = 1 (%)
Poh [35]   δ = 0.5      42.08       27.27       20.78       11.69
           Adaptive δ   56.49       38.57       28.80       16.62
IVA        δ = 0.5      44.76       29.41       21.61       10.87
           Adaptive δ   63.04       44.25       34.40       19.44
M-CCA      δ = 0.5      47.57       30.31       21.36       11.51
           Adaptive δ   64.32       48.34       38.87       21.87
C-MCCA     Adaptive δ   72.79       53.53       43.64       25.80

Figure 3.7: Error distribution of three non-contact methods with and without adaptive δ-correlation.

Figure 3.8: Scatter plots comparing HRgt with HRnc for (a) adaptive δ-correlation versus fixed δ-correlation and (b) adaptive ICA versus adaptive C-MCCA.

3.4 EXP3: Road-Driving Experiment

Official statistics show that driver fatigue, distraction, and paroxysm account for a considerable proportion of fatal road accidents every year. Driver medical assistance in the automobile environment is considered one of the most promising ways to prevent accidents and add intelligence to transportation systems [19, 47]. Such assistance should incorporate reliable measurements of the driver's vital signs in order to depict his or her driving condition. In [19], physiological sensors were distributed in a test vehicle as well as on the driver's body. Driving experiments indicated that physiological signals such as skin conductivity can provide a metric of driver stress level and a measure of how different road and traffic conditions affect drivers. In [14], driver fatigue and distraction were combined into one superclass named driver inattention.
That review reports that physiological signals such as EEG, ECG, EOG, surface electromyogram, and PPG have been used jointly to measure driver inattention level and to detect abnormal driving conditions such as arrhythmia, hypovigilance, and drowsiness.

Among these physiological parameters, cardio-related parameters are of particular interest, the most fundamental being HR and HRV. Paroxysm of cardiovascular disease is usually accompanied by early heart failure, in which the heart's ability to pump blood through the body declines. As a result, detectable parameter variations occur in the autonomic nervous system, leading to reduced vagal-cardiac activity and increased sympathetic activity [13]. Studies show that these vital signs are highly associated with HR and HRV [24]. In [32], experiments under laboratory conditions were carried out to detect the early onset of driver fatigue using HRV frequency-domain measures, and an artificial neural network system was built to classify between fatigued and alert conditions. In [31], highway driving experiments revealed the high reliability of HR and HRV measures in distinguishing between single-task driving and low/high cognitive-workload driving. It is therefore reasonable to believe that a driver medical assistance system constantly monitoring the driver's vital signs, especially cardiac parameters, can provide physiological evidence reflecting his or her driving condition.

In EXP3, we test our proposed non-contact HR measurement framework in a real-world road driving setting. As shown in Fig. 3.9, a commercial-level webcam is installed behind the wheel and a laptop is used to monitor the video recording; a zoom-in picture is also provided. Two examples of HRV analysis are shown in Fig. 3.10.

Figure 3.9: Road driving experiment (EXP3) setup. The left figure shows the webcam placed behind the wheel and the laptop used to monitor the video recording. The right figure provides a zoom-in picture.
The top row shows results from the laboratory experiment and the bottom row from the real road driving experiment. First, the subjects' facial BVP signals were extracted with the real-time facial landmark detector and the C-MCCA method. The locations of the heartbeat peaks were then determined in order to acquire the IBI series for further analysis. Finally, we calculated different measures to analyze the IBI series: both time- and frequency-domain measures, along with the corresponding LS-Periodogram and a spectrogram based on the Lomb-Scargle method (LS-Spectrogram). The time-domain measures include the mean HR during the recording and the SDNN and RMSSD of the IBI series. In the frequency domain, we use the LS-Periodogram to estimate the PSD and compute the LF power, the HF power, and their ratio LF/HF. Our non-contact framework provides a promising solution for driver physiological monitoring. A more advanced method that accounts for illumination variation may further improve the performance, but this is beyond the scope of this thesis.

Figure 3.10: HRV analysis examples. The top row is from the laboratory setting, the bottom row from the real road driving experiment. Six measures in the time and frequency domains are computed based on the IBI series. The LS-Periodogram and LS-Spectrogram are also given.

Figure 3.11: (a)-(d) Division patterns for profiles. (e) Part of a recovered BVP signal with detected peaks. (f) Readings from the pulse oximeter and HR estimates using M-CCA and IVA.

3.5 Discussion

3.5.1 HR Estimation Using Side Profile

The idea of facial sub-region division can lead to some interesting applications. With landmark localization techniques such as [48], we can estimate HR using the subject's profile alone. Fig. 3.11(a)-(d) shows two new patterns for acquiring facial sub-region datasets, corresponding to the left and right side profiles. Fig. 3.11(e) displays the BVP signal acquired from one subject's left profile. The proposed method can achieve desirable HR performance, as shown in Fig.
3.11(f).

3.5.2 Dynamic HR Estimation

In EXP1, we take the entire 60-second video signal into account. In Fig. 3.12, we show one subject's smoothed HR curve according to the readings of the pulse oximeter (black dashed line). In order to capture HR variation during recordings, we introduce a sliding window of size τw with a 95% overlap: the HR at time t is estimated using the video signal during the time period (t − τw, t). The red line in Fig. 3.12 was computed with τw = 10 s. We can see that, compared with the results of the pulse oximeter, our approach generally reflects the subject's HR variation, with a slight bias towards higher HR.

Figure 3.12: The black dashed line reflects one subject's HR variation during the recording with a sampling rate of 1 Hz. The red line is the HR estimate based on a sliding window over the past 10 seconds with a 95% overlap. Both curves were smoothed by the moving average method with a span of 20.

3.5.3 Performance Analysis

One common step of all non-contact approaches is facial data collection. Traditional ICA-based frameworks [34, 35] used a fast face detector [46] to localize the entire face, sacrificing accuracy for speed: a considerable portion of non-PPG instances (e.g., hair, recording background) is included and uniformly averaged to obtain the color channel data. Our approach, on the other hand, discards almost all unrelated instances and focuses only on facial regions. Our division patterns, as shown in Fig. 2.2, avoid the mouth (useless if open or half-open), the forehead (useless if there are bangs), and the chin (useless if there is a beard).

Besides the improved facial data collection method, the introduction of J-BSS also makes non-contact HR measurement more robust than ICA-based approaches. Many practical issues, such as ambient light changes or shadows caused by facial expression variation, may cause fluctuations in local color channel values and thus affect the accuracy of BVP signal recovery. We cannot solve this problem by designing a better facial data collection method.
However, with J-BSS we can reduce such negative impacts. Given the color channel data of different facial sub-regions, J-BSS methods attempt to recover the underlying source set for every sub-region dataset. An important assumption in our proposed method is that the BVP signal is the source shared among all datasets. The results of spectral clustering support this assumption: the largest cluster actually contains the candidate BVP signals recovered from different sub-regions. The signal with the strongest frequency component is then selected among all candidate BVP signals. Therefore, even if all sub-region datasets are contaminated, to various degrees, by local fluctuations, we can still recover the one with the minimal impact. In summary, landmark-based facial data collection and J-BSS-based BVP signal extraction together contribute to the better performance of the proposed method compared to ICA-based approaches.

Chapter 4

Conclusion and Future Work

4.1 Conclusion and Contribution

In this thesis, we proposed a novel framework for non-contact HR measurement that exploits correlations between different facial sub-regions to enhance the robustness of the measurements when illumination variation and head motion are involved. We tested the proposed framework in three experimental settings: (i) a well-controlled laboratory environment, (ii) a more challenging laboratory environment, and (iii) a real-world road driving environment. The results show that the proposed non-contact framework can be a promising solution for both clinical diagnosis and family healthcare.

In Chapter 2, we presented the proposed non-contact HR measurement framework step by step. Starting from facial data collection, we proposed using an advanced facial landmark localization algorithm that provides real-time tracking even when the video frame rate is as high as 50 fps. The algorithm returns the physical coordinates of 49 facial landmarks.
We designed a specific facial division pattern based on these coordinates in order to divide the entire facial region into four sub-regions. Compared with previously used face detectors, this algorithm is robust to various intensive head motions, as demonstrated in the subsequent experiments. Based on this facial division pattern, we collected four sets of facial color channel signals and fed them into a joint blind source separation (J-BSS) system to extract the latent facial BVP signals. Two recently proposed J-BSS algorithms, IVA and M-CCA, were each used in our framework. Using post-processing methods such as signal detrending, temporal filtering, and spectral clustering, we recovered BVP signals from these sub-regions. We further proposed an adaptive peak detection method, named adaptive δ-correlation, that uses the frequency information of the BVP signal to better detect heart interbeats. Based on M-CCA, we proposed a learning-based J-BSS algorithm, connectivity multiset canonical correlation analysis (C-MCCA), in order to improve the HR measurement performance. A max-margin multi-label (M3L) classification algorithm is used to learn the optimal connectivity design matrix (CDM) from a certain amount of training samples.

In Chapter 3, we designed three types of experiments to test the proposed non-contact HR measurement framework. In a well-controlled experimental setting (EXP1), 16 subjects were included. The experimental results show that all non-contact methods yield good performance, with our proposed framework working slightly better than the ICA-based method. In a much more challenging setting (EXP2), we tested our framework on the public DEAP affective computing database. Random illumination variation and head motion are present in all video recordings, and general performance degradation is observed for all tested non-contact methods. We compared these methods when using a fixed δ-correlation and the proposed adaptive δ-correlation respectively.
The results showed that the adaptive method can improve performance by more than 27% for all tested methods. We randomly collected a training set of 200 (out of 782) trials and tested on the rest. The proposed C-MCCA method yields the best performance with respect to the tested statistical measures, such as the mean error, RMSE, and mean error rate. The scatter plots of the correlations between the non-contact HR measurements and the contact BVP (ground truth), together with the acceptance rate analysis in Section 3.3, further demonstrate the effectiveness of the proposed adaptive δ-correlation and C-MCCA method. Generally, the experimental results indicate high consistency between traditional contact PPG sensors and the proposed non-contact methods. In a real-world road driving setting (EXP3), we tested the proposed HR measurement framework by installing a commercial webcam behind the wheel. The experimental results showed that such a non-contact driver monitoring method can be a promising solution for both HR measurement and HRV analysis in the time and frequency domains. We also illustrated that the proposed method can work well when given video signals of subjects' facial profiles only.

4.2 Future Work

Since most previously proposed non-contact HR measurement methods use the entire facial region to collect signals, in this thesis we extended the idea by dividing the facial region into multiple sub-regions, which is verified to be more effective for extracting facial BVP signals for HR measurement. However, this is just one way to enhance the robustness of non-contact techniques, and several limitations remain. The most important limitation of non-contact measurement is still related to illumination variation and head motion artifacts. In this thesis, attempts have been made to improve the performance of non-contact HR measurement under a challenging laboratory setting.
In more severe settings, such as the road driving experiment (EXP3), non-contact HR measurement cannot perform as well as it does in the laboratory, due to the intensive illumination variation and head motion artifacts. Our efforts in EXP3 show some promising results, and more work will be done in the future to further improve robustness. Moreover, at the end of Chapter 3, we propose measuring HR using only the facial profile; it is natural to extend the current method into a full-viewpoint measurement framework.

In this thesis, we focus on video-based non-contact physiological measurement, specifically selecting heart rate and heart rate variability as research targets. In the future, more physiological parameters will be investigated. Non-contact techniques can be a promising solution for both clinical diagnosis and family healthcare. How to achieve non-contact measurement in a way that is both accurate and robust is a promising research direction that we intend to follow in the future.

Bibliography

[1] L. A. Aarts, V. Jeanne, J. P. Cleary, C. Lieber, J. S. Nelson, S. Bambang Oetomo, and W. Verkruysse. Non-contact heart rate monitoring utilizing camera photoplethysmography in the neonatal intensive care unit: a pilot study. Early Hum. Dev., 89(12):943–948, 2013. → pages 3, 4

[2] J. Achten and A. E. Jeukendrup. Heart rate monitoring. Sports Medicine, 33(7):517–538, 2003. → pages 1

[3] M. Anderson, T. Adali, and X.-L. Li. Joint blind source separation with multivariate Gaussian model: algorithms and performance analysis. IEEE Trans. Signal Process., 60(4):1672–1683, 2012. → pages 16, 17

[4] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental face alignment in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1859–1866. IEEE, 2014. → pages viii, 12, 13, 30

[5] S. Bakhtiari, T. W. Elmer, N. M. Cox, N. Gopalsami, A. C. Raptis, S. Liao, I. Mikhelson, and A. V. Sahakian. Compact millimeter-wave sensor for remote monitoring of vital signs.
IEEE Trans. Instrum. Meas., 61(3):830–841, 2012. → pages 7[6] G. Balakrishnan, F. Durand, and J. Guttag. Detecting pulse from headmotions in video. In Computer Vision and Pattern Recognition (CVPR),2013 IEEE Conference on, pages 3430–3437, June 2013. → pages 7[7] G. G. Berntson, J. T. Bigger, D. L. Eckberg, P. Grossman, P. G. Kaufmann,M. Malik, H. N. Nagaraja, S. W. Porges, J. P. Saul, P. H. Stone, et al. Heartrate variability: origins, methods, and interpretive caveats.Psychophysiology, 34(6):623–648, 1997. → pages 7[8] F. Bousefsaf, C. Maaoui, and A. Pruski. Continuous wavelet filtering onwebcam photoplethysmographic signals to remotely assess the instantaneous55heart rate. Biomedical Signal Processing and Control, 8(6):568–574, 2013.→ pages 7[9] V. D. Calhoun, J. Liu, and T. Adalı. A review of group ica for fmri data andica for joint inference of imaging, genetic, and erp data. Neuroimage, 45(1):S163–S172, 2009. → pages 16[10] J.-F. Cardoso. High-order contrasts for independent component analysis.Neural computation, 11(1):157–192, 1999. → pages 36[11] X. Chen, C. He, Z. J. Wang, and M. J. McKeown. An ic-pls framework forgroup corticomuscular coupling analysis. IEEE Trans. Biomed. Eng., 60(7):2022–2033, 2013. → pages 16[12] W. W. Chin. The partial least squares approach to structural equationmodeling. Modern Method for Business Research, 295(2):295–336, 1998.→ pages 24[13] M. M. J. De Jong and D. C. Randall. Heart rate variability analysis in theassessment of autonomic function in heart failure. Journal ofCardiovascular Nursing, 20(3):186–195, 2005. → pages 46[14] Y. Dong, Z. Hu, K. Uchimura, and N. Murayama. Driver inattentionmonitoring system for intelligent vehicles: A review. IntelligentTransportation Systems, IEEE Transactions on, 12(2):596–614, 2011. →pages 4, 46[15] K. Fox, J. S. Borer, A. J. Camm, N. Danchin, R. Ferrari, J. L. L. Sendon,P. G. Steg, J.-C. Tardif, L. Tavazzi, and M. Tendera. Resting heart rate incardiovascular disease. J. Am. Coll. 
Cardiol., 50(9):823–830, 2007. →pages 1, 4[16] M. Garbey, N. Sun, A. Merla, and I. Pavlidis. Contact-free measurement ofcardiac pulse based on the analysis of thermal imagery. IEEE Trans.Biomed. Eng., 54(8):1418–1426, 2007. → pages 7[17] E. Greneker. Radar sensing of heartbeat and respiration at a distance withapplications of the technology. Proc. Conf. RADAR, 1997. → pages 7[18] B. Hariharan, L. Zelnik-Manor, M. Varma, and S. Vishwanathan. Largescale max-margin multi-label classification with priors. In Proceedings ofthe 27th International Conference on Machine Learning (ICML-10), pages423–430, 2010. → pages ix, 25, 26, 2956[19] J. A. Healey and R. W. Picard. Detecting stress during real-world drivingtasks using physiological sensors. Intelligent Transportation Systems, IEEETransactions on, 6(2):156–166, 2005. → pages 46[20] J. W. Hurst. Naming of the waves in the ecg, with a brief account of theirgenesis. Circulation, 98(18):1937–1942, 1998. → pages 5[21] X. Jiang, M. Dawood, F. Gigengack, B. Risse, S. Schmid, D. Tenbrinck, andK. Scha¨fers. Biomedical imaging: A computer vision perspective. InComputer Analysis of Images and Patterns, pages 1–19. Springer, 2013. →pages 3[22] J. R. Kettenring. Canonical analysis of several sets of variables. Biometrika,58(3):433–451, 1971. → pages 17, 25[23] T. Kim, T. Eltoft, and T.-W. Lee. Independent vector analysis: An extensionof ica to multivariate components. In Proc. Independent ComponentAnalysis and Blind Signal Separation. Springer, 2006. → pages 16[24] R. E. Kleiger, J. P. Miller, J. T. Bigger Jr, and A. J. Moss. Decreased heartrate variability and its association with increased mortality after acutemyocardial infarction. The American journal of cardiology, 59(4):256–262,1987. → pages 46[25] S. Koelstra, C. Mu¨hl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi,T. Pun, A. Nijholt, and I. Patras. Deap: A database for emotion analysis;using physiological signals. 
Affective Computing, IEEE Transactions on, 3(1):18–31, 2012. → pages viii, 5, 14, 15, 22, 33, 38[26] X. Li, J. Chen, G. Zhao, and M. Pietikainen. Remote heart rate measurementfrom face videos under realistic situations. In Computer Vision and PatternRecognition (CVPR), 2014 IEEE Conference on, pages 4264–4271. IEEE,2014. → pages 7, 8, 40, 42[27] Y.-O. Li, T. Adali, W. Wang, and V. D. Calhoun. Joint blind sourceseparation by multiset canonical correlation analysis. IEEE Trans. SignalProcess., 57(10):3918–3929, 2009. → pages 17, 24, 25[28] J. Liu, G. Pearlson, A. Windemuth, G. Ruano, N. I. Perrone-Bizzozero, andV. Calhoun. Combining fmri and snp data to investigate connectionsbetween brain function and genetics using parallel ica. Hum. Brain Mapp.,30(1):241–255, 2009. → pages 1657[29] H. Lu, Y. Pan, B. Mandal, H.-L. Eng, C. Guan, and D. W. Chan. Quantifyinglimb movements in epileptic seizures through color-based video analysis.IEEE Trans. Biomed. Eng., 60(2):461–469, 2013. → pages 4[30] D. McDuff, S. Gontarek, and R. W. Picard. Improvements in remotecardio-pulmonary measurement using a five band digital camera. IEEETrans. Biomed. Eng., 61(10):2593–2601, 2014. → pages 7, 8, 20, 21, 42[31] B. Mehler, B. Reimer, and Y. Wang. A comparison of heart rate and heartrate variability indices in distinguishing single task driving and driving undersecondary cognitive workload. In Proc Driving Symposium on HumanFactors in Driver Assessment, Training & Vehicle Design, pages 590–597,2011. → pages 46[32] M. Patel, S. Lal, D. Kavanagh, and P. Rossiter. Applying neural networkanalysis on heart rate variability data to assess driver fatigue. Expert Systemswith Applications, 38(6):7235–7242, 2011. → pages 46[33] Philips. Vital signs camera, 2011. → pages 4[34] M.-Z. Poh, D. J. McDuff, and R. W. Picard. Non-contact, automated cardiacpulse measurements using video imaging and blind source separation. Opt.Express, 18(10):10762–10774, 2010. → pages 7, 20, 35, 42, 50[35] M.-Z. Poh, D. J. 
McDuff, and R. W. Picard. Advancements in noncontact,multiparameter physiological measurements using a webcam. IEEE Trans.Biomed. Eng., 58(1):7–11, 2011. → pages x, 7, 20, 35, 36, 37, 41, 42, 43, 50[36] J. Rashmur. Design, evaluation, and application of heart rate variabilityanalysis software (hrvas). 2010. → pages 7[37] C. Scully, J. Lee, J. Meyer, A. M. Gorbach, D. Granquist-Fraser,Y. Mendelson, and K. H. Chon. Physiological parameter monitoring fromoptical recordings with a mobile phone. IEEE Trans. Biomed. Eng., 59(2):303–306, 2012. → pages 4[38] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans.Pattern Anal. Mach. Intell., 22(8):888–905, 2000. → pages ix, 20, 21[39] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic. A multimodal databasefor affect recognition and implicit tagging. Affective Computing, IEEETransactions on, 3(1):42–55, 2012. → pages58[40] L. Tarassenko, M. Villarroel, A. Guazzi, J. Jorge, D. Clifton, and C. Pugh.Non-contact video-based vital sign monitoring using ambient light andauto-regressive models. Physiol. Meas., 35(5):807–831, 2014. → pages 3, 4,6, 35[41] M. P. Tarvainen, P. O. Ranta-aho, and P. A. Karjalainen. An advanceddetrending method with application to hrv analysis. IEEE Trans. Biomed.Eng., 49(2):172–175, 2002. → pages 19[42] A. Tenenhaus and M. Tenenhaus. Regularized generalized canonicalcorrelation analysis. Psychometrika, 76(2):257–284, 2011. → pages 24[43] J. E. Thatcher, K. D. Plant, D. R. King, K. L. Block, W. Fan, and J. M.DiMaio. Dynamic tissue phantoms and their use in assessment of anoninvasive optical plethysmography imaging device. In Proc. SPIE SensingTechnology+ Applications, 2014. → pages 4[44] S. S. Ulyanov and V. V. Tuchin. Pulse-wave monitoring by means of focusedlaser beams scattered by skin surface and membranes. In OE/LASE’93:Optics, Electro-Optics, & Laser Applications in Science& Engineering.International Society for Optics and Photonics, 1993. → pages 7[45] W. Verkruysse, L. O. 
Svaasand, and J. S. Nelson. Remote plethysmographicimaging using ambient light. Opt. Express, 16(26):21434–21445, 2008. →pages 7[46] P. Viola and M. Jones. Rapid object detection using a boosted cascade ofsimple features. In Computer Vision and Pattern Recognition (CVPR), 2001IEEE Conference on, pages I511–I518, December 2001. → pages 37, 50[47] T. Wartzek, B. Eilebrecht, J. Lem, H.-J. Lindner, S. Leonhardt, andM. Walter. Ecg on the road: robust and unobtrusive estimation of heart rate.Biomedical Engineering, IEEE Transactions on, 58(11):3112–3120, 2011.→ pages 4, 46[48] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmarklocalization in the wild. In Computer Vision and Pattern Recognition(CVPR), 2012 IEEE Conference on, pages 2879–2886, June 2012. → pages4959
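For reference, the statistical measures used in our evaluation (mean error, RMSE and mean error rate) can be computed from paired HR estimates as in the following sketch. This is an illustrative implementation, not the thesis code; the function name `hr_error_metrics` and the exact conventions (absolute rather than signed mean error, relative error normalized by the ground-truth HR) are assumptions, since the text above does not fix these definitions.

```python
import numpy as np

def hr_error_metrics(hr_est, hr_ref):
    """Compare non-contact HR estimates against contact ground truth (in BPM).

    Returns (mean_error, rmse, mean_error_rate): mean absolute error,
    root-mean-square error, and mean relative error (our assumed definitions).
    """
    hr_est = np.asarray(hr_est, dtype=float)
    hr_ref = np.asarray(hr_ref, dtype=float)
    err = hr_est - hr_ref
    mean_error = float(np.mean(np.abs(err)))            # BPM
    rmse = float(np.sqrt(np.mean(err ** 2)))            # BPM
    mean_error_rate = float(np.mean(np.abs(err) / hr_ref))  # dimensionless
    return mean_error, rmse, mean_error_rate
```

For example, estimates of 72 and 80 BPM against ground truth of 70 and 82 BPM give a mean error and RMSE of 2 BPM each, and a mean error rate of about 2.6%.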

