Neural Network Satellite Retrievals of Nocturnal Stratocumulus Cloud Properties

by

Marc Rautenhaus

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

The Faculty of Graduate Studies

(Atmospheric Science)

The University of British Columbia

October 2007

© Marc Rautenhaus 2007

Abstract

I investigate the feasibility of retrieving cloud top droplet effective radius, optical thickness and cloud top temperature of nocturnal marine stratocumulus clouds by inverting infrared satellite measurements using an artificial neural network. For my study, I use the information contained in the three infrared channels centred at 3.7, 11.0 and 12.0 µm of the Moderate Resolution Imaging Spectroradiometer (MODIS) on-board NASA's Terra satellite, as well as sea surface temperature. A database of simulated top-of-atmosphere brightness temperatures of a range of cloud parameters is computed using a correlated-k parameterisation which I have embedded in the radiative transfer package libRadtran. The database is used to train feed-forward neural networks of different architecture to perform the inversion of the satellite measurements for the cloud properties. I investigate the application of Bayesian methods to estimate the retrieval uncertainties, and analyse the Jacobian of the networks in order to gain information about the functional dependence of the retrieved parameters on the inputs. A high variability in the Jacobian indicates that the nocturnal retrieval problem is ill-posed. My experiments show that because the problem is ill-conditioned, it is very difficult to find a network that approximates the database of simulated brightness temperatures well. Sea surface temperature proves to be a necessary input.
I compare the retrievals of a selected network architecture with in-situ cloud measurements taken during the second Dynamics and Chemistry of Marine Stratocumulus experiment. The results show general agreement between retrievals and in-situ observations, although no collocated comparisons are possible because of a time lag of five hours between the two measurements. I establish that the uncertainty estimates are prone to numerical problems and their results are questionable. I show that the Jacobian is a valuable tool in evaluating the retrieval networks.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Algorithms
List of Abbreviations
Acknowledgements

1 Introduction
  1.1 Clouds and Climate
  1.2 Marine Stratocumulus Clouds and the Aerosol Indirect Effect
    1.2.1 Bulk Properties
    1.2.2 Aerosols
  1.3 Radiative Transfer Terminology
  1.4 Motivation
  1.5 Overview and Theory of Existing Retrieval Algorithms
    1.5.1 Daytime Methods
    1.5.2 Nighttime Methods
    1.5.3 Artificial Neural Networks
  1.6 Objectives

2 Radiative Transfer and Development of the Forward Model
  2.1 DYCOMS-II Data
  2.2 Scattering and Droplet Spectra
    2.2.1 Radiative Transfer Equation
    2.2.2 Droplet Size Distribution and its Phase Function
  2.3 Adiabatic Cloud Model
  2.4 Choice of Channels and Correlated-k
  2.5 The Overlying Atmosphere
  2.6 Forward Model
    2.6.1 Radiative Transfer Model: libRadtran
    2.6.2 Design of the Forward Model

3 Nonlinear Regression with Neural Networks and Uncertainty Estimation
  3.1 The Nonlinear Regression Problem
    3.1.1 Error Functions and Overfitting
    3.1.2 Controlling Model Complexity and Network Training
    3.1.3 Describing Uncertainty
  3.2 Distribution of the Neural Network Weights
    3.2.1 Derivation
    3.2.2 The Meaning of the Hyperparameters
    3.2.3 Gaussian Approximation to the Posterior Distribution
  3.3 Output Uncertainties
  3.4 Hyperparameter Re-Estimation
    3.4.1 Evidence Procedure for A_r
  3.5 The Jacobian Matrix
  3.6 Implementation Issues
    3.6.1 Normalisation of Input and Output Variables
    3.6.2 Regularisation of the Hessian
  3.7 Usefulness of the Aires et al. Method

4 Retrieval, Results and Evaluation
  4.1 Retrieval Setup
    4.1.1 The Test Scene
    4.1.2 Lookup Table Setup
  4.2 Cloud Mask
    4.2.1 MODIS Cloud Mask and Irretrievable Pixels
    4.2.2 Possible Failure Mechanisms
  4.3 Network Training
    4.3.1 Network Architecture
    4.3.2 Failure of the Aires et al. Hyperparameter Re-Estimation
    4.3.3 Input Preprocessing
  4.4 Network Architecture: Inputs and Hidden Neurons
    4.4.1 Brightness Temperature Differences
    4.4.2 Three Input Networks
    4.4.3 Four Input Networks
  4.5 Retrieval Evaluation, Jacobian and Uncertainty
    4.5.1 Comparison with In-Situ Data
    4.5.2 Jacobian
    4.5.3 Uncertainty
  4.6 Further Developments

5 Summary

Bibliography

Appendices

A Implementation of the Aires et al. Method in Matlab
  A.1 Modified Netlab Functions
  A.2 Main Loop: mlatrain
  A.3 Gradient Computation with Backpropagation: mlagrad
  A.4 The Hessian with the Pearlmutter R{·}-Algorithm: mlahess, mlahdotv
  A.5 The Jacobian and its Distribution: mlajacob, mlajacobuncertainty
  A.6 Normalisation of Input and Output Variables
  A.7 Implementation Difficulties: Regularisation of the Hessian and Numerical Symmetry

List of Tables

2.1 Spectral intervals of the four MODIS channels implemented in the forward model and the atmospheric constituents whose absorptivity is accounted for
3.1 Jacobian matrices and their uncertainty (one standard deviation) of the two networks trained to model the example function given by (3.9) and (3.10)
4.1 Point estimate of the Jacobian of a network containing 30 neurons in the hidden layer and using BT(3.7), BT(11), BT(12) and T_sfc as inputs
4.2 The same as Table 4.1, but for a network containing 15 neurons in the hidden layer and using BT(3.7), BT(11), BTD(11-12) and T_sfc as inputs
4.3 The same as Table 4.1, but for a network containing 30 neurons in the hidden layer and using BT(3.7), BT(11), BTD(11-12) and T_sfc as inputs
4.4 Average Jacobian of the July 11, 2001, scene and its variability (15N network)
4.5 The same as in Table 4.4, but normalised by the standard deviation of the inputs
A.1 List of NETLAB functions that were adapted or created in order to accommodate the matrix hyperparameters A_in and A_r

List of Figures

1.1 The role of clouds in the climate system
1.2 Cloud feedback processes represent a large uncertainty in climate modelling
1.3 The annual cycle of cloud amount, lower tropospheric stability and shortwave, longwave and net cloud forcings from a two-year Earth Radiation Budget Experiment climatology
1.4 Example of in-situ measurements of microphysical cloud properties taken during the First International Satellite Cloud Climatology Programme Regional Experiment in precipitating marine stratocumulus clouds off the coast of California
1.5 Typical ranges of mean and effective particle diameter, cloud liquid water content and particle number concentration in low-level stratiform clouds
1.6 Dependence of cloud radiative forcing on macroscopic and microscopic cloud properties
1.7 Different methods to observe clouds
1.8 Dependence of radiative effects on microphysical properties of a cloud
1.9 Physical basis for the remote sensing of nocturnal marine stratocumulus clouds
1.10 Schematic diagram of a feed-forward network with two layers of adaptive weights
2.1 Processes that influence radiation as it passes through the atmosphere
2.2 In-situ measured spectra of cloud droplet sizes from the Jul 11th and Jul 13th flights, together with best-fit gamma distributions
2.3 Phase functions of cloud distributions following a gamma distribution with a shape parameter of 26
2.4 Vertical profile of cloud properties from the idealised adiabatic cloud model, together with in-situ measurements from the July 11 DYCOMS-II flight
2.5 Measurements of k value versus height in marine stratocumulus clouds
2.6 The same as Figure 2.4, but with a subadiabatic liquid water content of 75% of the adiabatic value
2.7 Dependence of atmospheric transmittance on wavelength for a typical midlatitude summer atmosphere
2.8 Illustration of the fundamental idea of correlated-k
2.9 Atmospheric soundings from San Diego from July 11, 2001, and July 13, 2001
2.10 Vertical brightness temperature profiles through a thin cloud and a thick cloud
2.11 Dependence of brightness temperature (BT) and BT difference on varying effective radius and optical thickness for a cloud top temperature of 285 K and a cloud top pressure of 900 hPa
2.12 The same as in Figure 2.11, but for different cloud top pressures
3.1 Illustration of overfitting I
3.2 Illustration of overfitting II
3.3 Distribution of the first weight of the neural network used for the example given by equations (3.9) and (3.10) after a short training run and after a longer training run
3.4 Section through the functional surface of the example given by (3.9) and (3.10) with network prediction and error bars
3.5 Scatterplots of the estimated error (one standard deviation) vs. the actual error for both output variables of the example given in (3.9) and (3.10) for the long training run
3.6 The same as Figure 3.5, but for the short training run
3.7 Development of the covariance matrices C_0 and C_in and the hyperparameters A_in and A_r during the training run shown in Figure 3.8
3.8 Development of the weight values and the mean-square-error of the training dataset and of a test dataset without added noise of the example given by (3.9) and (3.10) during a training run with 10 hyperparameter re-estimation iterations
3.9 Scatterplot of the network predictions for both outputs of the example given in (3.9) and (3.10) vs. the target data
3.10 Demonstration of the input dependence of the error estimation given in (3.43)
3.11 Same as Figure 3.10, but with a training dataset consisting of 10,000 datapoints
3.12 The same example as in Figure 3.4, but the network has only been trained with data in an interval
3.13 Limitations of the error estimation - a simple two-layer perceptron is not able to model a mapping that contains ambiguities
3.14 The same as Figures 3.5 and 3.6, but using non-regularised Hessian
4.1 Brightness temperature difference (BTD) between channel 20 (3.7 µm) and channel 31 (11 µm) and channel 31 brightness temperature at 6:25 UTC on July 11th, 2001
4.2 k-value vs. height as inferred from the DYCOMS-II in-situ measurements of July 11 and July 13, 2001
4.3 Histograms of ν_gam and k-value as inferred from the DYCOMS-II in-situ measurements of July 11, 2001
4.4 MODIS cloud mask product for the July 11, 2001, scene and cloud mask derived from the BT/BTD range in the lookup table
4.5 LUT and MODIS cloud mask filtered scene brightness temperatures
4.6 The same as Figure 4.5, but for the brightness temperature difference between channels 29 and 31 (8.5 and 11 µm)
4.7 11 µm versus 12 µm radiance plots for the observations along the line shown in Figure 4.4
4.8 BTD(3.7-11) and BTD(11-12) signals of the pixels between points A and B as shown in Figure 4.4
4.9 Subset of the July 11 LUT. In addition to the cloud top pressure (939.5 hPa), the cloud top temperature is also held constant at 285 K
4.10 Training performance of a three input network employing the BT(3.7), BT(11) and BTD(11-12) signals, compared to a four input network additionally making use of surface temperature
4.11 Retrievals of cloud top temperature and visible cloud optical thickness of the three input network from which the scatter plots on the left side of Figure 4.10 were produced
4.12 Mean-square-error of the validation dataset for the four input ANN (BT(3.7), BT(11), BTD(11-12), T_sfc) in dependence of the number of hidden neurons
4.13 Sensitivity of 15N predictions if cloud top pressure (fixed in the LUT) is varied by ±5 hPa
4.14 The same as Figure 4.13, but for the 30N network
4.15 Histograms of the 15N Jacobian of the LUT subset plotted in Figure 4.9
4.16 Retrieved effective radii for the test scene and comparison of retrieved (15N network) and aircraft measured values
4.17 The same as Figure 4.16, but for cloud top temperature
4.18 The same as Figure 4.16, but for cloud optical thickness
4.19 Histograms of the 15N Jacobian of the July 11, 2001, scene
4.20 Spatial distribution of the sensitivity of the effective radius retrieval to changes in BT(3.7) and BTD(11-12) on July 11, 2001 (15N network)
4.21 The same as Figure 4.20, but for the sensitivity of cloud top temperature to changes in BT(3.7) and BT(11)
4.22 Spatial distribution of the uncertainty in the effective radius and optical thickness retrievals, as computed from the neural uncertainty term G^T H^-1 G

List of Algorithms

3.1 Bayesian Neural Network Training with Matrix Hyperparameter Re-Estimation

List of Abbreviations

15N    network with four inputs and 15 hidden neurons used in Chapter 4
30N    network with four inputs and 30 hidden neurons used in Chapter 4
ABL    atmospheric boundary layer
ACE-2    Second Aerosol Characterisation Experiment
AIE    aerosol indirect effect
AMSR-E    Advanced Microwave Scanning Radiometer for the Earth Observing System
ANN    artificial neural network
AR4    IPCC 4th Assessment Report
AS    adiabatic stratified cloud model
AVHRR    Advanced Very High Resolution Radiometer
BT    brightness temperature
BT(11)    11 µm brightness temperature
BT(12)    12 µm brightness temperature
BT(3.7)    3.7 µm brightness temperature
BT(8.5)    8.5 µm brightness temperature
BTD    brightness temperature difference
BTD(11-12)    BTD between the 11 µm and 12 µm channels
BTD(3.7-11)    BTD between the 3.7 µm and 11 µm channels
BTD(8.5-11)    BTD between the 8.5 µm and 11 µm channels
CCN    cloud condensation nuclei
CERES    Clouds and the Earth's Radiant Energy System
CFMIP    Cloud Feedback Model Intercomparison Project
CRF    cloud radiative forcing
DISORT    Discrete Ordinate Radiative Transfer
DYCOMS-II    Second Dynamics and Chemistry of Marine Stratocumulus field experiment
EPIC    East Pacific Investigation of Climate
ERBE    Earth Radiation Budget Experiment
FIRE    First ISCCP Regional Experiment
GCM    general circulation model
IPA    independent pixel approximation
IPCC    Intergovernmental Panel on Climate Change
ISCCP    International Satellite Cloud Climatology Programme
LUT    lookup table
LWC    liquid water content
LWP    liquid water path
MLP    multilayer perceptron
MODIS    Moderate Resolution Imaging Spectroradiometer
MSE    mean square error
NASA    National Aeronautics and Space Administration
OA    overlying atmosphere
PCA    principal component analysis
PDF    probability density function
PDT    Pacific Daylight Time
RF02    DYCOMS-II research flight II
RF03    DYCOMS-II research flight III
RTE    radiative transfer equation
Sc    stratocumulus
SHDOM    Spherical Harmonic Discrete Ordinate Method
SST    sea surface temperature
TOA    top of atmosphere
TRMM    Tropical Rainfall Measuring Mission
UTC    Coordinated Universal Time
VU    vertically uniform cloud model

Acknowledgements

I gratefully acknowledge the support of all of the people who made this thesis possible. First, I would like to thank Phil for giving me the opportunity to spend two years at the University of British Columbia, including the participation in the European Geosciences Union conference in Vienna. I am very grateful for his supervision, the numerous discussions, his patience, guidance and for his tremendous time commitment, especially while I was preparing my poster presentation for the conference and while I was writing up my thesis. I am grateful to my committee members, William and especially Douw for his valuable advice. I thank Christian for supporting discussions when I needed them. I appreciate the help of Uli and Dr. Mayer with the radiative transfer model libRadtran. I am enormously thankful to my parents for their continuing belief in me during my entire time at university and for accepting all of my decisions. Without their support my stay in Vancouver would not have been possible. I thank my family and my friends for support and for bringing joy into my life.
Most of all, however, I thank Riki from whom I received all the love and support an individual person could have given during the past two years.

Chapter 1

Introduction

Of paramount importance to a comprehensive understanding of the Earth's climate and its response to anthropogenic and natural variability is a knowledge, on a global sense, of cloud properties that may be achieved through remote sensing and retrieval algorithms.
- King et al. (1997, Page 2)

Shallow boundary layer clouds play a critical role in the exchange of energy and water in our climate system. They do not contribute as much to the greenhouse effect as do upper level clouds, but due to their high albedo,¹ they reflect a large portion of the incoming shortwave radiation back to space. The radiative forcing that these clouds exert on the atmosphere is modulated by their macrophysical, microphysical, and optical properties, and the way they interact with the climate system has been the subject of intense research activity in the last decades (Stephens, 2005, and references therein). However, despite this extensive research work, processes that are involved in cloud-atmosphere interaction and their representation in general circulation models remain some of the primary uncertainties in global climate modelling (Bony et al., 2006; Ringer et al., 2006; Williams and Tselioudis, 2007). Accurate observations of clouds are a key component to solving the problems in this field.

In this thesis I address the retrieval of marine stratocumulus cloud properties from satellite measurements, specifically during nighttime. These marine boundary layer clouds are common and cover large parts of the oceans (Klein and Hartmann, 1993; Norris and Leovy, 1994). Their horizontal homogeneity and usually low liquid water path put them amongst the simplest to treat.
Nevertheless, in a recent retrieval comparison exercise, Turner et al. (2007) pointed out that such thin liquid water clouds are surprisingly difficult to observe. This is particularly true for nocturnal observations, where no operational retrieval data are available to date.

This introductory chapter will cover the fundamentals of the clouds and climate research field, present information on marine stratocumulus clouds, and review relevant previous research work in the field of satellite remote sensing. I conclude with an outline of the objectives for this work which will be addressed in the following chapters.

¹ A formal definition of the albedo will be given in section 1.3.

Figure 1.1: Clouds play a complex role in the radiative energy balance of our planet. Reflecting incoming shortwave radiation from space and trapping outgoing longwave radiation from the earth's surface and atmosphere, they impact the climate with both cooling and warming effects. The magnitude of these radiative effects is dependent on macroscopic and microscopic cloud properties, giving rise to complex interactions within the climate system (from Phillips and Barry, 2002, courtesy of NASA Marshall Space Flight Center and Science@NASA).

1.1 Clouds and Climate

Cloud radiative forcing (CRF) is defined as the difference between the upwelling radiative flux at the top of the atmosphere and what it would be if the clouds were absent. Figure 1.1 illustrates the most important interactions between clouds and radiation. The largest contribution to the globally averaged longwave CRF is made by high clouds in the upper levels of the troposphere. They are cold and thus emit less radiation to space than the earth's surface and atmosphere under clear skies.
Clouds that are optically thick, on the other hand, have the largest albedo, which makes them the largest contributor to the shortwave CRF.

In the global annual average, clouds exert a net cooling effect on the climate system in its present state. The average top-of-atmosphere (TOA) shortwave CRF is about -50 W/m² (the negative sign indicates cooling), while the TOA longwave CRF accounts for a gain of approximately 30 W/m², giving a net cloud forcing of -20 W/m² (Wielicki et al., 1995). These values are large compared to the heating that would result if the current CO2 concentration were doubled, which is estimated to be on the order of 4 W/m² (Wielicki et al., 1995).

Because of their small scale variability in time and space, clouds are difficult to simulate in numerical global climate and weather forecasting models. The processes governing cloud formation and dissipation occur on spatial scales smaller than a general circulation model grid cell, so that the clouds and their interaction with the atmosphere have to be parameterised. Since the phase change from the gaseous to the liquid phase produces a nonlinear transition from transparent vapour to opaque water droplets, parameterisations require detailed knowledge of the processes involved so as to capture the major effects of the governing physics. An additional complication is that knowledge of grid cell mean cloud properties does not uniquely determine the CRF because of the nonlinear relationship between cloud properties and albedo (Harshvardhan, 1982).

Especially of interest to the climate science community is the radiative feedback between clouds and the evolving ocean and atmosphere. How do clouds react to a changing climate? Does a warmer atmosphere induce more shallow clouds, which in turn would cool the climate?
Given the magnitude of the net CRF, small changes in cloud cover could already have a significant effect on the earth's energy balance. For instance, Hartmann et al. (1992), using one year of global satellite data, found the sensitivity of the radiation balance to low clouds to be on the order of -0.6 W/m² per percent fractional cloud cover in the annual average.

Cloud feedback processes depend on so many factors that they have long been identified as being the largest source of uncertainty in climate change predictions; as Bony et al. (2006) note, with a larger uncertainty than any other feedback. Despite intense research work in this area (see, for instance, the review articles of Stephens (2005) and Bony et al. (2006)), recent comparisons among climate models still show a large variability of the predicted change in global mean CRF if forcing factors such as a doubling of the atmospheric CO2 content or an increase in sea surface temperature (SST) are prescribed (Ringer et al., 2006; Williams and Tselioudis, 2007). Figure 1.2 reproduces the results of a modelling study conducted by Ringer et al. (2006). The change in CRF, as computed by the current generation of climate models, varies not only in magnitude but also in sign - it is currently unclear whether clouds are changing so as to amplify global warming or to counteract it.

Williams and Tselioudis (2007) investigated how different cloud regimes contribute to the cloud feedback uncertainty. For the six general circulation models (GCMs) they compared, they found that differences in the radiative response of frontal clouds in the mid-latitudes and of stratocumulus clouds in low-latitude regions cause the largest proportion of the variance in the global cloud response. Bony and Dufresne (2005) also found low-latitude marine boundary layer clouds to be a major factor in the disagreement of GCMs in cloud feedback predictions. Williams and Tselioudis (2007) note that the uncertainty is already present in model simulations of current climate, before the introduction of additional uncertainty from future climate forcing. This indicates that there is currently no consensus on how to represent shallow boundary layer cloud processes in climate models.

[Figure 1.2 panels: "Cloud feedback parameter: +/- 2K" (left) and "Cloud feedback parameter: 2 x CO2, AR4 Models" (right), with longwave and shortwave components shown per model.]

Figure 1.2: Cloud feedback processes represent a large uncertainty in climate modelling. Macroscopic and microscopic cloud properties cannot be resolved explicitly in the large grid cells of general circulation models, and different parameterisations lead to a large range of different feedbacks - from global mean cooling to warming of atmospheric temperature. Shown on the left is the global mean change in cloud radiative forcing (ΔCRF) in ten climate models participating in the Cloud Feedback Model Intercomparison Project (CFMIP), normalised by the radiative imbalance G resulting from a perturbation of the sea surface temperature in the simulations by ±2 K (top: total cloud feedback; bottom: longwave and shortwave components separated). On the right the forcing G is given by doubling the CO2 content of the atmosphere, and the models used are the ones submitted to the IPCC 4th Assessment Report (AR4). ΔCRF/G = 1 means that the radiative forcing associated with cloud feedback is the same as the original direct forcing G, i.e. the clouds double the forcing exerted by G. Note the large differences in cloud feedback amongst the models (reprinted from Ringer et al., 2006, © 2006, with permission from the American Geophysical Union).

[Figure 1.3 panels: monthly values from January to December on the horizontal axis.]

Figure 1.3: The annual cycle of (left) cloud amount and lower tropospheric stability (expressed as the potential temperature difference θ(700 hPa) - θ(sea level pressure)) and (right) shortwave, longwave and net cloud forcings from a two-year ERBE (Earth Radiation Budget Experiment) climatology in the region 20°-30°N and 120°-130°W off the coast of California (reprinted from Klein and Hartmann, 1993, © 1993, with permission from the American Meteorological Society).

1.2 Marine Stratocumulus Clouds and the Aerosol Indirect Effect

1.2.1 Bulk Properties

The retrieval algorithm developed in this thesis focuses on the observation of marine stratocumulus (Sc) clouds. These clouds are so important to the climate problem because of their persistence and their high albedo (typically ~0.6-0.8 compared to the underlying sea surface value of typically ~0.05 for water (Driedonks and Duynkerke, 1989)). This causes a strong shortwave CRF especially in the low-latitude regions, where the incident solar radiation is very high. For example, off the coast of California, average cloud amount reaches more than 65% during the summer months, accounting for a net CRF of up to -70 W/m² (Klein and Hartmann, 1993, their Figure 5b,d is reproduced in Figure 1.3).
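To make the albedo contrast quoted above more tangible, the following sketch estimates the instantaneous shortwave effect of an overcast stratocumulus deck over dark ocean. It is a deliberately crude illustration and not a method used in this thesis: the function names, the incident flux of 400 W/m², and the neglect of atmospheric absorption, partial cloud cover and diurnal averaging are all assumptions made for the example. Only the albedo values (0.6 to 0.8 for the cloud, 0.05 for the sea surface) and the sign convention (negative means cooling) are taken from the text.

```python
def reflected_shortwave(incident_flux, albedo):
    """Upwelling shortwave flux (W/m^2) for a given incident flux and albedo."""
    if not 0.0 <= albedo <= 1.0:
        raise ValueError("albedo must lie between 0 and 1")
    return albedo * incident_flux


def shortwave_crf(incident_flux, cloud_albedo, clear_albedo=0.05):
    """Shortwave cloud radiative forcing in W/m^2 (negative = cooling).

    Taken here as clear-sky minus cloudy-sky upwelling flux, so that a
    cloud that reflects more than the surface yields a negative forcing.
    """
    clear_up = reflected_shortwave(incident_flux, clear_albedo)
    cloudy_up = reflected_shortwave(incident_flux, cloud_albedo)
    return clear_up - cloudy_up


# Hypothetical incident flux; not a value from the thesis.
INCIDENT_FLUX = 400.0

for cloud_albedo in (0.6, 0.8):
    crf = shortwave_crf(INCIDENT_FLUX, cloud_albedo)
    print(f"cloud albedo {cloud_albedo:.1f}: shortwave CRF = {crf:.0f} W/m^2")
```

Under these assumptions the instantaneous forcing comes out at a few hundred W/m², much larger in magnitude than the seasonal-mean net value of up to -70 W/m² cited above. Among other reasons, the observed figure averages over day and night and over partly cloudy conditions, and it includes the compensating longwave term.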
Marine Sc particularly favour the eastern subtropical oceans, where cold surface waters from the upwelling ocean currents produce boundary layer air temperatures that are cool compared to the warm, subsiding air aloft. The resulting strong temperature inversion caps the boundary layer and inhibits deep convection. Strong radiative cooling at cloud top helps to maintain the shallow convection and thus the cloud layer in the absence of strong sensible heat flux from the ocean surface (Klein and Hartmann, 1993). Recently, it has been recognised that drizzle also plays an important role in the dynamics of the cloud layer. However, how exactly precipitation processes interact with the circulation is still subject to ongoing research (e.g. Ackerman et al., 2004; Stevens et al., 2005; Savic-Jovcic and Stevens, 2007).

These complex physical processes, together with the clouds' small geometrical thickness (typically a few hundred metres) and the sharp capping inversion, make the representation of marine Sc in climate models particularly difficult (Bretherton et al., 2004). They are also still a problem for weather forecasting models, as shown by a recent comparison of numerical weather prediction results (Stevens et al., 2007) with in-situ data taken during the second Dynamics and Chemistry of Marine Stratocumulus (DYCOMS-II) field experiment (Stevens et al., 2003a).

Several field experiments have been conducted during which in-situ and remotely sensed data of marine Sc were taken in order to better comprehend the mechanisms involved in sustaining and dissipating this type of cloud. Figures 1.4 and 1.5 give examples of typical vertical cloud profiles observed in marine Sc.
The data in Figure 1.4 were taken during FIRE, the First ISCCP (International Satellite Cloud Climatology Programme) Regional Experiment (Albrecht et al., 1988), in precipitating Sc off the coast of California. Cloud liquid water content and average droplet radius typically follow their adiabatic values,² with liquid water content often becoming subadiabatic at the top of the clouds by entrainment of dry air from aloft or by removal of droplets by precipitation (drizzle). The mean droplet concentration typically stays approximately constant with height, and the geometrical thickness of the clouds observed in this example is less than 300 m.

Figure 1.5 summarises the results compiled by Miles et al. (2000) from in-situ data reported in the literature. The magnitude of average cloud top droplet radii ranges from about 4 to about 12 µm.³ The effective radius, an area-weighted mean radius of the cloud droplets which is important in radiative transfer applications (see section 1.3 and Chapter 2), is 5 to 15 µm, a little larger than the mean droplet radius. Mean droplet number concentrations range from less than 20 cm⁻³ to about 200 cm⁻³ in marine clouds. As a comparison, number concentrations typical of continental clouds are also shown. Due to the higher aerosol content of continental air masses, whose aerosols act as cloud condensation nuclei (CCN), significantly higher concentrations can be observed over land.

² Adiabatic refers to a well mixed cloud layer in which entropy and total water content (i.e. water vapour plus liquid water) are constant with height.
³ Note that Miles et al. (2000) use diameter to describe droplet sizes in their work, whereas I will use radius in this thesis.
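The statement that the effective radius is slightly larger than the mean droplet radius can be checked with a small numerical sketch. The code below uses the common definition of the effective radius as the ratio of the third to the second moment of the droplet size distribution, and applies it to a gamma-shaped spectrum. The shape and scale parameters, the bin grid and all function names are hypothetical choices for illustration; they are not the values fitted to the in-situ data in this thesis.

```python
import math


def gamma_spectrum(radii_um, shape, scale_um):
    """Unnormalised gamma size distribution n(r) ~ r**(shape-1) * exp(-r/scale)."""
    return [r ** (shape - 1.0) * math.exp(-r / scale_um) for r in radii_um]


def mean_radius(radii_um, counts):
    """Number-weighted mean radius of a binned droplet spectrum."""
    return sum(n * r for r, n in zip(radii_um, counts)) / sum(counts)


def effective_radius(radii_um, counts):
    """Area-weighted mean radius: third moment divided by second moment."""
    third = sum(n * r ** 3 for r, n in zip(radii_um, counts))
    second = sum(n * r ** 2 for r, n in zip(radii_um, counts))
    return third / second


# Hypothetical spectrum on equally spaced bins from 0.05 to 60 micrometres.
radii = [0.05 * i for i in range(1, 1201)]
counts = gamma_spectrum(radii, shape=10.0, scale_um=0.8)

r_mean = mean_radius(radii, counts)
r_eff = effective_radius(radii, counts)
print(f"mean radius      = {r_mean:.2f} um")
print(f"effective radius = {r_eff:.2f} um")
```

For a gamma distribution in this parameterisation the mean radius is shape × scale and the effective radius is (shape + 2) × scale, so the assumed parameters give about 8 µm and 9.6 µm respectively. Because the r³ weighting emphasises large droplets, the effective radius always exceeds the mean radius for a spectrum of finite width, and the gap shrinks as the spectrum narrows (larger shape parameter).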
[Figure 1.4 image: two panels (left and right, one per flight); axes show liquid water content (g m⁻³), number concentration (cm⁻³) and volume-mean radius (µm).]

Figure 1.4: Example of in-situ measurements of microphysical cloud properties taken during the First ISCCP (International Satellite Cloud Climatology Programme) Regional Experiment (FIRE; Albrecht et al., 1988) in precipitating marine stratocumulus clouds off the coast of California. Shown are data from two aircraft flights (left and right): in-situ measured liquid water content ($q_l$, left side of each panel, solid line), adiabatic liquid water content (left side, straight dashed line), in-situ measured and adiabatic volume-mean radius ($r_{vol}$, right side, solid and dashed, respectively), and in-situ measured number concentration ($N$, right side, points). 0 marks cloud-top and cloud-base (reprinted from Austin et al., 1995, © 1995, with permission from the American Meteorological Society).

[Figure 1.5 image: profiles of droplet diameter, liquid water content and number concentration versus normalised cloud height.]

Figure 1.5: Miles et al. (2000) collected observations of low-level stratiform clouds from previously published studies in order to compare the results found by different authors. The figures shown here illustrate typical ranges of mean and effective particle diameter ($D_m$ and $D_e$), cloud liquid water content (LWC) and particle number concentration ($N_t$). On the left, profiles of both continental and marine clouds are plotted together versus the normalised cloud height ($h/h_t = 0$ marks cloud base, $h/h_t = 1$ cloud top), whereas in the plots of the number concentration on the right they are separated. Marine clouds typically have much smaller number concentrations than continental clouds.
This is due to fewer aerosols acting as cloud condensation nuclei (CCN) over the oceans. Furthermore, $N_t$ can be regarded as approximately constant with height within a cloud. Note that the figures on the left use diameter to describe cloud droplet size, whereas in this thesis radius is used (reprinted from Miles et al., 2000, © 2000, with permission from the American Meteorological Society).

Due to their occurrence in strong subsidence regions, marine Sc cloud tops are usually found at low altitudes of less than 1000 m (e.g. Driedonks and Duynkerke, 1989). Liquid water paths (LWP) are found to be on the order of up to a few hundred g/m², often well below 100 g/m² (e.g. Bretherton et al., 2004; Xu et al., 2005). A distinctive feature of marine Sc decks is the occurrence of a pronounced diurnal cycle in cloud amount and LWP. For instance, during the East Pacific Investigation of Climate (EPIC), Bretherton et al. (2004) observed a rise of the inversion height (and thus cloud top height) of roughly 200 m each night from an early afternoon minimum value. The cloud base showed little diurnal cycle, so that the clouds were thinnest during the afternoon. Furthermore, cloud cover was almost 100% during nighttime, but during the afternoon clouds were often broken. Such diurnal cycles in cloud amount were also found by Rozendaal et al. (1995), who used low cloud fraction data inferred from infrared satellite channels in the ISCCP dataset. Wood et al. (2002) analysed the LWP derived from microwave measurements from the Tropical Rainfall Measuring Mission (TRMM) satellite. They found a strong diurnal cycle in LWP of subtropical low cloud regions, which, with an early morning peak, was in phase with the cloud amount cycle of Rozendaal et al.
(1995). Similar results were found by Blaskovic et al. (1991).

The diurnal cycle of cloud thinning during the day and growth at night is mainly caused by solar forcing. During the day, the shortwave radiation that is absorbed in the cloud layer largely offsets the longwave radiative cooling from cloud top. This inhibits the turbulence in the layer, and the resulting less well-mixed structure is less efficient both in transporting moisture upwards from the surface and in maintaining the cloud top by entrainment. Consequently, subsidence can advect the cloud top downwards, while it is also dried out. Typically, the cloud base is also lifted during the day. This leads to a commensurately thinner cloud (Bretherton et al., 2004; Stevens et al., 2007). Daily amplitudes of the variations reach 5-10% for cloud amount (Rozendaal et al., 1995) and 15-35% for LWP (Wood et al., 2002), so that a correct representation of the diurnal cycle in GCMs is critical to the accurate representation of low clouds in large scale models.

1.2.2 Aerosols

Intimately linked to low boundary layer clouds are the aerosol indirect effects (AIEs), important for the estimation of the climate impact of anthropogenic aerosol emissions. Twomey (1974, 1977) proposed that an increased aerosol concentration shifts the droplet size distribution towards smaller values by providing more cloud condensation nuclei. If cloud liquid water content (LWC) is unchanged, more and commensurately smaller droplets are formed, which increase the albedo of the cloud (first AIE or Twomey effect). A second key hypothesis was introduced by Albrecht (1989). He suggested that smaller cloud droplets suppress the formation of precipitation in the cloud.⁴
This reduced cloud water sink would moisten the atmospheric boundary layer (ABL), leading to a lower cloud base and a thicker cloud that could live longer (second AIE or cloud lifetime effect). The effect on the radiation budget is similar to the Twomey effect - increasing cloud thickness increases the albedo, and clouds that persist over longer time spans also have an increased albedo if we consider the time average.

Both effects are important in the context of anthropogenic emissions. For instance, an idealised climate sensitivity study conducted by Hu and Stamnes (2000) investigated the sensitivity of the CRF to microphysical properties of clouds. As an illustration, Figure 1.6 shows the dependence of shortwave CRF on average cloud droplet radius for three different cloud thicknesses. If the average droplet size was decreased by just one micrometre, the effect on the radiation budget could already be significant. The figure also shows that thin clouds (given with a LWP of 50 g/m²), such as marine Sc, exhibit the largest radiative sensitivity to microphysical changes.

However, the interactions between aerosols and clouds are complicated. Precipitation, including drizzle that does not reach the ground, occurs and can influence the aerosol concentration via scavenging, which in turn influences the cloud microphysics and cloud fraction (Stevens et al., 2005; Sharon et al., 2006). Also, precipitation is known to stabilise the cloud layer (by cooling the subcloud layer and increasing static stability), so less precipitation caused by the second AIE would increase turbulence and thus entrainment of warm and dry air from aloft - counteracting the cloud thickening process (Wood, 2007).
In total, the aerosol indirect effects make the problem of accurately representing boundary layer cloud processes in GCMs even more complicated, adding significantly to the already existing cloud feedback uncertainty (Lohmann and Feichter, 2005). In fact, whether higher aerosol concentrations actually increase the area-averaged albedo of a cloud deck and thus its CRF is discussed controversially in the literature. Recently, studies suggested that processes might exist that could cancel the indirect effect completely, leaving no effect of aerosols on the net radiative fluxes (Twohy et al., 2005; Wood, 2007; Xue et al., 2007).

An intriguing manifestation of the AIE are ship tracks (Hobbs et al., 2000; Durkee et al., 2000b,a). They occur when elevated aerosol levels in the area of a ship plume lead to an enhanced reflectivity of the cloud layer above the ship. The resulting "ship tracks" can often be observed on satellite imagery. Indeed, Durkee et al. (2000a) found that ships that emit more aerosols on average produce ship tracks that are brighter, wider, and longer-lived than ships with lower emissions. Here, I will use the phenomenon of ship tracks to check the physical consistency of the retrieval method (Chapter 4).

⁴ In ice-free clouds, large droplets are needed to form precipitation (via the collision/coalescence process). If the droplet size distribution is shifted towards smaller values, the propensity of the cloud to form precipitation decreases.

[Figure 1.6 image: two panels (top and bottom) plotted against average droplet radius (µm), with curves for LWP = 5, 50 and 500 g m⁻².]

Figure 1.6: Example of the dependence of cloud radiative forcing on macroscopic and microscopic cloud properties: the dependence of shortwave CRF at the tropopause (global annual average, top) on average cloud droplet radius for three different cloud thicknesses, and its change if the average droplet radius is decreased by 1 µm (bottom) in an idealised climate sensitivity study. To put the numbers into context, Wielicki et al. (1995) estimate the globally averaged shortwave CRF to be on the order of -50 W m⁻²; an instantaneous doubling of CO₂ would result in a forcing of roughly 4 W m⁻² (reprinted from Hu and Stamnes, 2000, © 2000, with permission from Blackwell Publishing).

1.3 Radiative Transfer Terminology

Before I discuss the motivation for this work and give an overview of the relevant existing literature, it is useful to review some basic radiative transfer terminology. General references in which more detailed information can be found are, for instance, Petty (2006) or Bohren and Clothiaux (2006).

The flux density, or irradiance (sometimes abbreviated as flux), is a measure of the total energy per unit time and unit area transported by electromagnetic radiation through a flat surface:

$$F = \frac{\text{energy (Joules)}}{\text{time (seconds)} \times \text{area (m}^2\text{)}} \qquad (1.1)$$

It is measured in W m⁻². The radiance, or intensity, measures the directional energy transport:

$$I = \frac{\text{energy (Joules)}}{\text{time (seconds)} \times \text{area (m}^2\text{)} \times \text{field of view (sr)}} \qquad (1.2)$$

It is measured in W m⁻² sr⁻¹. The direction is given by solid angle with units steradian (sr). A steradian is a "square radian"; solid angle is to "regular" angle as area is to length (Petty, 2006). An integration of solid angle over one hemisphere yields 2π, over an entire sphere 4π. Both flux and intensity can be expressed in monochromatic form, i.e.
per unit wavelength, λ (or, alternatively, wave number, 1/λ):

$$I_\lambda = \frac{I}{\text{wavelength } (\mu\text{m})} \qquad (1.3)$$

with units W m⁻² sr⁻¹ µm⁻¹.

When electromagnetic radiation is incident on a medium, part of it is absorbed, part reflected, and part transmitted. The absorptivity (also absorptance) $a$, reflectivity (also reflectance) $r$ and transmissivity (also transmittance) $t$ of a medium describe the corresponding fractions of the incident radiation and in general depend on wavelength and direction of the incident radiation. The shortwave reflectivity of a surface is also referred to as its (shortwave) albedo. Obviously, all three quantities range from zero to one, and

$$a_\lambda(\theta,\phi) + r_\lambda(\theta,\phi) + t_\lambda(\theta,\phi) = 1. \qquad (1.4)$$

The transmissivity is described by Beer's Law:

$$t_\lambda = \frac{I_\lambda(s)}{I_{\lambda,0}} = \exp(-\beta_{\lambda,e}\, s), \qquad (1.5)$$

where $s$ denotes distance along the direction of propagation, $I_{\lambda,0}$ the intensity at position $s = 0$, $I_\lambda(s)$ the intensity after distance $s$, and $\beta_{\lambda,e}$ the extinction coefficient (m⁻¹). It describes the rate of energy attenuation per unit distance ($1/\beta_{\lambda,e}$ determines the distance for energy to be attenuated to $e^{-1}$ of its original value).

Since all of the following equations correspond to the monochromatic case, I will drop the subscript λ for convenience. In equations that describe wavelength-integrated cases this will be explicitly stated.

Following Beer's Law, in a direct beam with no sources of radiation, the intensity $I$ falls off exponentially with distance:

$$I(s) = I_0 \exp(-\beta_e s). \qquad (1.6)$$

In this equation, the extinction coefficient describes the effects of two mechanisms for extinction; radiation can be either absorbed by a medium, or be scattered out of its original direction of propagation.
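As a brief numerical illustration of Beer's Law, the following Python sketch evaluates the transmissivity along a path and the e-folding distance $1/\beta_e$. The function names and the value chosen for the extinction coefficient are assumptions made purely for illustration; they are not taken from the thesis.

```python
import math


def transmissivity(beta_e: float, s: float) -> float:
    """Beer's Law: fraction of the initial intensity left after a path of length s.

    beta_e -- extinction coefficient in 1/m
    s      -- path length in m
    """
    return math.exp(-beta_e * s)


def e_folding_distance(beta_e: float) -> float:
    """Distance over which the intensity is attenuated to 1/e of its initial value."""
    return 1.0 / beta_e


# Hypothetical extinction coefficient (illustrative value only).
beta_e = 0.05  # 1/m
i_0 = 100.0    # initial intensity, arbitrary units

for s in (0.0, 10.0, 20.0, 100.0):
    print(f"s = {s:6.1f} m   I(s) = {i_0 * transmissivity(beta_e, s):8.3f}")

print(f"e-folding distance: {e_folding_distance(beta_e):.1f} m")
```

The intensity after one e-folding distance is $I_0/e$, and the attenuation over two consecutive path segments multiplies, which is why the exponent depends only on the product of extinction coefficient and path length.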
The extinction coefficient is the sum of an absorption coefficient $\beta_a$ and a scattering coefficient $\beta_s$:

$$\beta_e = \beta_a + \beta_s. \qquad (1.7)$$

The single scatter albedo $\omega$ characterises the relative importance of scattering in the total extinction:

$$\omega = \frac{\beta_s}{\beta_e} = \frac{\beta_s}{\beta_a + \beta_s}. \qquad (1.8)$$

In a purely absorbing medium, $\omega$ would be zero, whereas in a purely scattering one it would be unity.

The dimensionless optical thickness of a cloud between $z_1$ and $z_2$ (in an atmosphere in which $z$ represents the vertical coordinate) indicates the opacity of the cloud for a given wavelength. It is defined as the path-integrated extinction coefficient and hence expresses how much extinction occurs over the geometrical thickness of the cloud:

$$\tau(z_1, z_2) = \int_{z_1}^{z_2} \beta_e(z)\, dz. \qquad (1.9)$$

As a rule of thumb, Bohren et al. (1995) found that at an approximate visible optical thickness of 10 one can no longer see the sun through a cloud.

The water droplets found in a cloud are generally distributed over a range of sizes. The droplet size distribution is described by

$$n(r)\,dr = \text{number of droplets per unit volume with radii between } r \text{ and } r + dr, \qquad (1.10)$$

and it is important for determining the effective radius $r_{eff}$, which is defined as

$$r_{eff} = \frac{\int n(r)\, r^3\, dr}{\int n(r)\, r^2\, dr}. \qquad (1.11)$$

Two cloud parcels with identical liquid water content and $r_{eff}$ have the same total droplet surface area, independent of the actual droplet size distribution.

The absorption and scattering coefficients $\beta_a$ and $\beta_s$ can be written in terms of the droplet size (and wavelength) dependent absorption and scattering cross sections $\sigma_a(r)$ and $\sigma_s(r)$ (m²):

$$\beta_a = \int \sigma_a(r)\, n(r)\, dr \qquad (1.12)$$

$$\beta_s = \int \sigma_s(r)\, n(r)\, dr \qquad (1.13)$$

$\sigma_a(r)$ and $\sigma_s(r)$ hence describe the absorption or scattering per particle. Water droplets larger than about 5 µm have a scattering cross section of $\sigma_s = 2\pi r^2$ and an absorption cross section of $\sigma_a \approx 0$ at visible wavelengths.
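To illustrate the definitions of the effective radius and the scattering coefficient, the sketch below evaluates both for a droplet population given in discrete size bins, so that the integrals become sums. The two-bin population and all function names are hypothetical and chosen only for illustration; the scattering cross section $2\pi r^2$ is the large-droplet, visible-wavelength value quoted above.

```python
import math


def effective_radius(radii_m, number_conc_m3):
    """Discrete effective radius: sum(N_i r_i^3) / sum(N_i r_i^2)."""
    third_moment = sum(n * r**3 for r, n in zip(radii_m, number_conc_m3))
    second_moment = sum(n * r**2 for r, n in zip(radii_m, number_conc_m3))
    return third_moment / second_moment


def mean_radius(radii_m, number_conc_m3):
    """Number-weighted mean droplet radius."""
    total = sum(number_conc_m3)
    return sum(n * r for r, n in zip(radii_m, number_conc_m3)) / total


def scattering_coefficient(radii_m, number_conc_m3):
    """Discrete scattering coefficient (1/m), assuming sigma_s = 2 pi r^2."""
    return sum(2.0 * math.pi * r**2 * n for r, n in zip(radii_m, number_conc_m3))


# Hypothetical two-bin population: 80 cm^-3 at 6 um and 20 cm^-3 at 12 um.
radii = [6e-6, 12e-6]   # m
conc = [80e6, 20e6]     # m^-3 (1 cm^-3 = 1e6 m^-3)

r_eff = effective_radius(radii, conc)
r_mean = mean_radius(radii, conc)
beta_s = scattering_coefficient(radii, conc)

print(f"mean radius      : {r_mean * 1e6:.2f} um")
print(f"effective radius : {r_eff * 1e6:.2f} um")
print(f"beta_s           : {beta_s:.5f} 1/m")
```

For this population the effective radius exceeds the number-weighted mean radius, because the larger droplets dominate the third and second moments of the distribution.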
It can be shown (Petty, 2006) that for such droplets the visible optical thickness is related to the effective radius by

$$\tau_{vis} \approx \frac{3\, LWP}{2\, \rho_l\, r_{eff}}, \qquad (1.14)$$

where $\rho_l$ is the density of liquid water and the liquid water path (g m⁻²) is given by

$$LWP = \int_{z_1}^{z_2} q_l\, dz = \int_{z_1}^{z_2}\!\int \frac{4\pi}{3}\, \rho_l\, r^3\, n(r)\, dr\, dz, \qquad (1.15)$$

with liquid water content $q_l$ (g m⁻³). Although (1.14) does not apply at absorbing wavelengths where $\sigma_a$ and $\sigma_s$ vary with $r$ and λ, it is still the case that cloud reflectivity and emissivity (defined below) can be written as a unique function of $r_{eff}$, $\tau_{vis}$, temperature and λ for all wavelengths.

Finally, the intensity emitted by a medium is given by the Planck function and the emissivity $\varepsilon$:

$$I_\lambda = \varepsilon_\lambda B_\lambda(T), \qquad (1.16)$$

where $T$ is the temperature of the emitting material. The emissivity describes how efficiently the medium emits, and after Kirchhoff's Law, the emissivity of a material is equal to its absorptivity:

$$\varepsilon_\lambda(\theta,\phi) = a_\lambda(\theta,\phi). \qquad (1.17)$$

The Planck function, dependent on the temperature of a medium, is given by

$$B_\lambda(T) = \frac{2hc^2}{\lambda^5 \left( \exp\left( \frac{hc}{k_B \lambda T} \right) - 1 \right)}, \qquad (1.18)$$

where $h = 6.626 \times 10^{-34}$ J s is Planck's constant, $c = 2.998 \times 10^{8}$ m s⁻¹ is the speed of light, and $k_B = 1.381 \times 10^{-23}$ J/K is Boltzmann's constant. It gives the emission of a black body, an object that absorbs all radiation that falls onto it ($a_\lambda = 1$) and consequently also emits the maximum possible radiation that a material with a given temperature can emit ($\varepsilon_\lambda = 1$).

The brightness temperature $T_B$ is often used as a substitute for describing the measured intensity in remote sensing. It is the temperature a black body must have in order to emit the observed radiance at a given wavelength:

$$I_\lambda = B_\lambda(T_B). \qquad (1.19)$$
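The relation between radiance and brightness temperature can be made concrete with a short Python sketch that evaluates the Planck function and inverts it analytically for $T_B$. The function names, the example temperature of 285 K and the emissivity of 0.9 are assumptions chosen for illustration; only the physical constants and the three channel wavelengths are taken from the text.

```python
import math

H = 6.626e-34    # Planck's constant (J s)
C = 2.998e8      # speed of light (m/s)
K_B = 1.381e-23  # Boltzmann's constant (J/K)


def planck(wavelength_m: float, temperature_k: float) -> float:
    """Black-body spectral radiance B_lambda(T) in W m^-2 sr^-1 m^-1."""
    prefactor = 2.0 * H * C**2 / wavelength_m**5
    exponent = H * C / (K_B * wavelength_m * temperature_k)
    return prefactor / math.expm1(exponent)


def brightness_temperature(wavelength_m: float, radiance: float) -> float:
    """Temperature (K) a black body needs in order to emit the given radiance."""
    prefactor = 2.0 * H * C**2 / wavelength_m**5
    return H * C / (K_B * wavelength_m * math.log1p(prefactor / radiance))


# Hypothetical emitter: 285 K with a wavelength-independent emissivity of 0.9.
temperature = 285.0
emissivity = 0.9

for wavelength_um in (3.7, 11.0, 12.0):
    lam = wavelength_um * 1e-6
    radiance = emissivity * planck(lam, temperature)
    t_b = brightness_temperature(lam, radiance)
    print(f"{wavelength_um:5.1f} um: T_B = {t_b:7.2f} K")
```

Because the emitter in this sketch is not a black body, the brightness temperature comes out below the physical temperature, and by a different amount in each channel even though the emissivity is the same.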
1.4 Motivation

Data that can shed light on the regional variations in cloud microphysical and optical properties, their diurnal cycle and the interaction of clouds, precipitation and aerosols are particularly important for further progress in the accurate representation of stratocumulus clouds in climate and weather models (Stevens et al., 2003a). Figure 1.7 summarises the options the scientific community has to observe clouds.

[Figure 1.7 image: schematic of cloud observation methods, including ground based remote sensing.]

Figure 1.7: Observations of cloud microphysical and optical properties, their diurnal cycle and the interaction of clouds, precipitation and aerosols are important for further progress in the accurate representation of stratocumulus clouds in climate and weather models. Different methods can be employed; active (radar, lidar) and passive (radiometers) remote sensing can be used from the ground. Aircraft or helicopter-based instrument sondes are able to perform in-situ measurements and directly measure particle size, number and the standard meteorological parameters. However, for marine clouds, both ground based remote sensing (from ships) and in-situ measurements are cost and labour intensive. Satellites provide an ideal means to remotely sense data over large areas and long time spans. Most satellite-based approaches use passive multispectral radiometer data, but data from active satellite instruments will also be available.

In-situ instruments carried by aircraft can very accurately measure drop size distribution and liquid water content, as well as the standard meteorological variables. While field campaigns such as DYCOMS-II yield valuable data, the aircraft flights are expensive, and it is not practical to use in-situ sampling to conduct
the long-term observations that are needed to understand how the climatology of marine Sc responds to environmental change. Remote sensing from surface and satellite based instruments can provide a long-term data record, with surface instruments typically being able to achieve a higher temporal resolution but with a limited spatial field of view. Also, it is difficult to deploy surface remote sensing instruments for long time spans over sea, leaving satellites to be the major data source for large-area and climatic studies of marine Sc.

Remote sensing instruments infer cloud properties from radiation that is reflected, transmitted or emitted by the cloud. While active instruments such as radar, sodar or lidar emit electromagnetic radiation and infer the property of interest from the backscatter signal, passive sensors record the radiation that originates from natural sources. Mathematical models of the radiative processes in the cloud and atmosphere are then employed to compute the intensity arriving at the sensor (i.e. for satellites, at the top of the atmosphere) as a function (referred to below as the forward model) of the relevant cloud parameters:

$$\mathbf{I}^{TOA} = f(r_{eff}, \tau, T_{cloud}, T_{surface}, \text{overlying atmosphere}, \ldots), \qquad (1.20)$$

where the intensity has been written as a vector in order to account for observations at several wavelengths. Other parameters, indicated by "..." in the above equation, could include subadiabaticity, cloud inhomogeneity or partially filled pixels. The retrieval is then defined as the inversion of the function $f$,

$$\boldsymbol{\varphi} = f^{-1}(\mathbf{I}^{TOA}, T_{surface}, \text{overlying atmosphere}, \ldots), \qquad (1.21)$$

where the vector $\boldsymbol{\varphi} = (r_{eff}, \tau, T_{cloud}, \ldots)$ contains the parameters to be determined.

Most satellites currently in orbit carry passive sensors, due to the high power consumption of active technology.
For instance, the Moderate Resolution Imaging Spectroradiometer (MODIS) instruments, on-board the National Aeronautics and Space Administration's (NASA) near-polar orbiting Terra and Aqua satellites, measure in the visible and infrared wavelength range and provide four images daily for a given location on earth. While an operational cloud product including droplet size and optical thickness (a measure of how much radiation can pass through the cloud) is routinely retrieved from the daytime measurements (King et al., 1997; Platnick et al., 2003),⁵ no such retrievals are available for nighttime images.⁶ ⁷ However, such retrievals would be useful for studies of marine Sc on the diurnal timescale.

⁵ Available from http://ladsweb.nascom.nasa.gov/data/.
⁶ This is also true for other instrument data, such as from the Advanced Very High Resolution Radiometer (AVHRR).

One reason that no operational retrievals of nocturnal cloud properties are available is that solar radiation carries several orders of magnitude more energy than radiation emitted by terrestrial sources. The sun hence represents a source of very easily detectable photons, and the absence of emission of terrestrial sources in the visible wavelength range means that cloud temperature is not a confounding variable. The higher intensity allows for smaller pixel sizes, too. For instance, while near and thermal infrared signals are recorded at a 1-km resolution by the MODIS sensors, the visible channels have a 250 m resolution. Furthermore, most information on cloud microphysical and optical properties is carried by wavelengths in the visible and near infrared. While the reflected sunlight at visible channels is highly
sensitive to cloud optical thickness, near infrared channels provide information about cloud particle size (e.g. King et al., 1997).

⁷ In 2006, NASA launched the CloudSat and CALIPSO satellites within the so-called A-Train - a series of satellites equipped with special cloud remote sensing instrumentation, flying in formation with the Aqua satellite in order to provide near-simultaneous measurements. The satellites are equipped with an active radar and a lidar instrument, respectively, which operate independently of the available sunlight. However, the lidar instrument will mainly focus on upper level clouds (i.e. cirrus), and the radar on vertical profiles of LWC and precipitation; droplet size, for instance, will not be part of the retrieval product (Stephens et al., 2002; Vaughan et al., 2004).

However, a similar, although weaker, dependence can also be found at non-visible wavelengths. Radiative transfer calculations by Baum et al. (1994) showed that a combination of the three near infrared and thermal channels of the Advanced Very High Resolution Radiometer (AVHRR) could be used to infer cloud droplet size, optical thickness and cloud temperature. Attempts to construct a retrieval scheme from such measurements (Minnis et al., 1995; Heck et al., 1999; Perez et al., 2000; Gonzalez et al., 2002; Perez et al., 2002) were partly successful, but suffered from high computational cost and included only simple sensitivity analyses in order to estimate the uncertainty on the retrievals. In particular, the computational time needed to retrieve an individual pixel in an image was on the order of seconds,⁸ which, again, is not suited to building a climatology based on a large number of satellite images. Also, ambiguities in the inverse mapping were a problem (i.e.
different droplet size and optical thickness combinations can lead to the same set of emitted radiances), further complicating the uncertainty estimation and restricting the quantitative usefulness of the retrievals (Perez et al., 2000; Gonzalez et al., 2002).

Towards these ends, a retrieval algorithm both fast enough to process large datasets and able to provide reliable error estimates on the retrieved values could prove quite useful, and its non-existence motivates this study.

⁸ On a 933 MHz Pentium III machine; P.H. Austin, 2005, pers. comm.

1.5 Overview and Theory of Existing Retrieval Algorithms

1.5.1 Daytime Methods

Retrieval techniques to extract boundary layer liquid water cloud properties from satellite measurements have been developed since the 1980s (Arking and Childs, 1985; Twomey and Cocks, 1989; Nakajima and King, 1990; Nakajima et al., 1991; Platnick and Twomey, 1994; Han et al., 1994; Platnick and Valero, 1995; Nakajima and Nakajima, 1995; Kawamoto et al., 2001), although interest in inferring cloud cover characteristics from satellite images can be traced back to Arking (1964). Since reflected solar radiation is observed by the sensors, and the reflectivity mainly depends on cloud optical thickness and effective radius, these two parameters are commonly retrieved.

The underlying principle is illustrated in Figure 1.8. Liquid water clouds are composed of spherical water droplets whose scattering properties can be described by Mie theory (Mie, 1908). Figure 1.8a shows the dependence of the single scatter albedo $\omega$ (1.8) on wavelength for three different droplet sizes. In the visible wavelength range, $\omega$ is very close to unity for all particle sizes, and no radiation is absorbed in the cloud. The reflectance of a cloud layer, i.e.
its albedo, is determined by the amount of radiation that is scattered back into space, i.e. by the number of photons that leave the cloud at its top after various scattering events. Since water droplets tend to scatter visible radiation mainly in the forward direction,⁹ it takes strong scattering and a large number of scattering events to produce a high albedo. If all extinction is due to scattering, an optically thicker cloud will have a larger albedo. Since $\omega$ depends only weakly on particle size in the visible wavelength range, the optical thickness of a cloud can well be inferred from the measured reflectance (Figure 1.8b, the curve for $\omega = 1$).

Figure 1.8a shows that for wavelengths in the near and thermal infrared, the single scatter albedo becomes dependent on particle size. Furthermore, for sufficiently low $\omega$ and sufficiently thick clouds, the cloud albedo becomes independent of $\tau$ (illustrated in Figure 1.8b for $\omega = 0.9$), leaving the major dependence on the effective droplet radius $r_{eff}$. For thin clouds, the retrieved optical thickness from the visible wavelengths can be used to eliminate the optical depth dependence. Typically, near infrared channels centred at 2.1 µm (Nakajima and King, 1990) or 3.7 µm (Nakajima and Nakajima, 1995) are used to retrieve particle size information.

To infer the values of cloud optical depth and effective radius from the satellite measurements, a radiative transfer model is employed to compute a database of reflected intensities, the lookup table (LUT). Radiances that would be expected at the satellite sensor are computed using simplified models of the cloud layer and the atmosphere. For instance, the operational MODIS retrieval (King et al.
, 1997), based on the work of Nakajima and King (1990), uses a simple model of horizontally and vertically uniform clouds whose only parameters are $r_{eff}$ and $\tau$. For the atmosphere, absorption by water vapour or trace gases (mainly infrared) or scattering by aerosols (visible) has to be considered. Also, emittance of cloud droplets in the near infrared bands plays a role in the observed intensities. It is often removed by utilising further thermal bands for estimates of cloud temperature. The components of cloud, atmosphere, and radiative transfer model together are called the forward model.

Since the forward model is too complex to be inverted analytically, it is run for a variety of cloud parameters to build a LUT. The determination of the cloud properties is then done by "looking up" the measured intensities in the database - the difference between observed and computed radiances is represented by a cost function (also referred to as error surface)

$$\text{Error} = \left( \mathbf{I}_{TOA}^{observed} - \mathbf{I}_{TOA}^{computed} \right)^2, \qquad (1.22)$$

which is minimised numerically.

⁹ More details will be discussed in Chapter 2.

[Figure 1.8 image: (a) single scatter albedo versus wavelength λ (µm); (b) curves for ω = 1, 0.999, 0.99 and 0.9 versus cloud optical thickness.]

Figure 1.8: Example of how radiative effects are dependent on microphysical properties of a cloud. (a) Dependence of the single scatter albedo $\omega$ (the fraction of radiation extinguished along a ray of radiation that is due to scattering effects) on wavelength for three different cloud droplet sizes (5, 10 and 20 µm). (b) Idealised albedo, (c) transmittance and (d) absorptance of a plane parallel layer cloud in dependence on cloud optical thickness $\tau$ for different values of $\omega$. Especially note the different $\omega$ for the satellite channels measuring at 3.7, 11 and 12 µm in (a). The resulting differing albedo, transmittance and absorptance can be exploited for remote sensing of droplet size and optical depth (reprinted from Petty, 2006, © 2006, with permission from Sundog Publishing).

The above mentioned studies all utilise these fundamental techniques, differing in the wavelength of the employed channels or the minimisation algorithm used. Also, forward models were refined in later publications, for instance the treatment of water vapour (Kawamoto et al., 2001), or the cloud model, which now is often assumed to be adiabatic rather than vertically uniform (Brenguier et al., 2000).

An important assumption in most retrieval schemes is the plane-parallel approximation, which treats the cloud layer within a pixel as horizontally homogeneous. While this might seem to be an oversimplification for many realistic clouds, it is often a good approximation for stratiform clouds at scales of about 1 km (the resolution of the MODIS channels) and considerably simplifies the radiative transfer calculations (Petty, 2006). The cloud parameters are then retrieved on a pixel-by-pixel basis, implying that the observed radiances are determined entirely by the properties of the clouds within the pixel of interest (independent pixel approximation, IPA). Hence, care has to be taken to apply the retrieval scheme only to pixels that are fully covered by cloud - a requirement that can be problematic for regions of broken cloud. Unfortunately, precipitation produces inhomogeneities at scales of about 1 km (Stevens et al., 2005; Sharon et al., 2006).
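The lookup-table retrieval described earlier in this section can be sketched in a few lines of Python. The sketch assumes NumPy is available. The two-channel forward model used here is a purely hypothetical stand-in with no physical meaning; it only mimics one channel that saturates with optical thickness and a second that also decreases with droplet size. Grid ranges, function names and coefficients are likewise assumptions. The sketch shows the mechanics only: tabulate the forward model on a parameter grid, then pick the grid entry that minimises the squared difference between observed and computed values.

```python
import numpy as np


def toy_forward_model(r_eff, tau):
    """Hypothetical two-channel 'forward model' (NOT a radiative transfer model).

    Channel 1 grows and saturates with optical thickness tau; channel 2
    additionally decreases with effective radius r_eff (in micrometres).
    """
    r_eff = np.asarray(r_eff, dtype=float)
    tau = np.asarray(tau, dtype=float)
    channel_1 = tau / (tau + 8.0)
    channel_2 = channel_1 * np.exp(-0.05 * r_eff)
    return np.stack([channel_1, channel_2], axis=-1)


# Build the lookup table on a regular (r_eff, tau) grid (illustrative ranges).
r_grid = np.linspace(4.0, 20.0, 33)    # micrometres, step 0.5
tau_grid = np.linspace(1.0, 50.0, 99)  # dimensionless, step 0.5
r_mesh, tau_mesh = np.meshgrid(r_grid, tau_grid, indexing="ij")
lut = toy_forward_model(r_mesh, tau_mesh)  # shape (33, 99, 2)


def retrieve(observed):
    """Return the grid values of (r_eff, tau) with the smallest squared error."""
    cost = np.sum((lut - np.asarray(observed, dtype=float)) ** 2, axis=-1)
    i, j = np.unravel_index(np.argmin(cost), cost.shape)
    return float(r_grid[i]), float(tau_grid[j]), float(cost[i, j])


# Simulate an observation for known parameters and retrieve them again.
observation = toy_forward_model(10.0, 12.0)
r_ret, tau_ret, min_cost = retrieve(observation)
print(f"retrieved r_eff = {r_ret:.1f} um, tau = {tau_ret:.1f}, cost = {min_cost:.2e}")
```

An exhaustive search over the table is used here for transparency; operational schemes typically interpolate within the table or use an iterative minimiser, and a real cost surface may have several near-equal minima.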
This poses obvious problems for remote sensing studies of the aerosol indirect effect; research showed that a considerable error is introduced into the retrieval if inhomogeneities exist in the cloud layer, especially in the presence of broken clouds on the subpixel scale (Faure et al., 2001a,b,c, 2002; Kato et al., 2006; Iwabuchi, 2007).

1.5.2 Nighttime Methods

At night, the dependence of the radiative fields on cloud optical thickness and effective radius is slightly more complicated. Since radiation in the visible wavelengths is not available,[10] the retrieval has to rely entirely on near and thermal infrared signals, resulting in the lack of the relatively independent optical thickness information in the visible range. Research work in this area dates back to Hunt (1973), whose work showed a difference in the sensitivity of cloud emissivity to changes in particle size between several wavelengths in the near and thermal infrared. Most follow-up studies in the 1970s and 1980s, however, were focused on cirrus clouds rather than shallow liquid water clouds (Minnis et al., 1995, and references therein).

[10] Although new highly sensitive radiometers are being developed that will be able to measure visible light reflected by the clouds from the moon (Lee et al., 2006).

D'Entremont (1986) used the 3.7, 11 and 12 µm channels of the AVHRR instrument to determine fractional cloud amount and cloud top height for low and mid-level clouds, and Lin and Coakley (1993) and Luo et al. (1994) developed a method based on the 11 and 12 µm emissivities in order to determine fractional cloud cover and an effective particle radius for single layered clouds. Baum et al.
(1994) eventually found that the combination of the AVHRR channels at 3.7, 11 and 12 µm may be used for a simultaneous retrieval of cloud temperature, optical thickness and effective radius. At these wavelengths, the emitted radiation of a cloud depends on both effective radius and optical thickness, and additionally also on temperature. Figure 1.8a shows the strong dependence of $\omega$ on particle size for wavelengths larger than about 2 µm. Since, according to Kirchhoff's Law, the emissivity of a droplet is equal to its absorptivity, the value of $1-\omega$ gives an idea of how cloud emission as well as absorption depend on $r_{eff}$.[11] The cloud top radiance at a certain wavelength then becomes a complex function of cloud temperature (determining the black body emission), particle size, the thickness of the cloud (determining how many scattering or absorption/emission events occur), and, for clouds thin enough to allow for transmission, surface temperature (assuming a surface emissivity of 1). Of course, cloud inhomogeneities add further complexity. Despite this complexity, the differences in the dependence of cloud droplet scattering, absorption and emission properties on wavelength can be exploited for the retrieval.

While the objective of Baum et al. (1994) was to detect multilevel cloud situations, Minnis et al. (1995) and Heck et al. (1999) report on a nighttime retrieval algorithm within the Clouds and the Earth's Radiant Energy System (CERES) programme. They use a very simple radiative transfer approach, parametrising the forward model by an efficiently computable function that incorporates cloud optical depth and the emissivity dependence on effective radius, and minimise the sum-of-squares error between the model and observations. Perez et al.
(2000) also include surface temperature information, retrieved from clear sky pixels in the vicinity of the clouds, in their retrieval method to determine effective radius and cloud top temperature of nocturnal marine stratocumulus clouds. Their forward model consists of a vertically uniform, plane-parallel cloud over a sea surface. Emission and absorption effects of the atmosphere above the cloud are neglected. Following Baum et al. (1994), Perez et al. (2000) employ brightness temperature differences (BTD) between the satellite channels to express the varying behaviour of the cloud radiative properties with wavelength, and extensively study the behaviour of the forward model when $\tau$, $r_{eff}$, and temperature

[11] Indeed, although making the retrieval of particle size possible at nighttime, this dependence becomes a problem in the removal of the thermal contribution to the near infrared channels during daytime.

[Figure 1.9: two panels plotted against T4 (K); the left panel is labelled with optical depth, the right panel carries the curve labels 282 K, 285 K, 286 K and 287 K.]

Figure 1.9: Physical basis for the remote sensing methods used by Heck et al. (1999), Perez et al. (2000), Baum et al. (2003) and Cerdena et al. (2007): plotted is the brightness temperature difference (BTD) between AVHRR channels 3 and 4 (3.7 and 11.0 µm), as measured at the top of the atmosphere, versus the brightness temperature for channel 4. The curves are based on radiative transfer computations for horizontally and vertically homogeneous clouds with varying particle radius and optical thickness (left) and temperature (right).
By creating such curves with a radiative transfer model and comparing them to the brightness temperatures measured by the satellite, it is possible to infer information about the cloud properties that caused the satellite measurements (reprinted from Perez et al., 2000, © 2000, with permission from Elsevier).

are varied.[12] Figure 1.9 shows the BTD between the 3.7 µm and 11 µm channels of the AVHRR instrument (channels 3 and 4, respectively), henceforth abbreviated as BTD(3.7-11), versus the 11 µm brightness temperature (BT) for vertically uniform clouds with (a) a fixed cloud top temperature of 285 K and several effective radii and optical depths, and (b) a fixed radius of 8 µm and varying cloud temperature. The curves span an area of possible solutions of the forward model, and if some parameters can be fixed the remaining ones can readily be inferred from these diagrams. Similar diagrams exist for the BTD between 11 and 12 µm (BTD(11-12) in the following). Yet two issues complicate the retrieval.

[12] The intensities in the cost function (1.22) now become brightness temperatures and brightness temperature differences.

One problem for nocturnal retrievals is the saturation of cloud emission with optical depth. For thin clouds a number of photons from the surface and all levels within the cloud layer are able to reach the satellite without absorption. For thicker clouds absorption will effectively remove all photons from lower levels in the cloud layer before they reach cloud top - leaving the cloud top radiance to be governed only by the upper cloud (of course, the absorption is wavelength dependent). Figure 1.8d illustrates the phenomenon. For an $\omega$ of 0.9, the emissivity of the cloud, equal to its absorptance, quickly reaches an asymptotic limit.
A convergence for larger optical depths is also visible in the BTD diagrams of Figure 1.9. However, since most marine Sc have optical thicknesses of less than 10 (Xu et al., 2005; Rossow and Schiffer, 1999), the saturation problem does not pose a significant constraint.

Also problematic are ambiguities in the forward model. As noted by Perez et al. (2000), some sets of measured brightness temperatures can be caused by several clouds having differing optical properties. This, of course, makes the unique retrieval of the cloud parameters impossible. Perez et al. (2000) find that the effect is larger on BTD(3.7-11), but also occurs for BTD(11-12), especially for effective radii in the range of 4-7 µm.

The minimisation technique used by Perez et al. (2000) is similar to the one used by Nakajima and King (1990) and King et al. (1997), who first retrieve optical depth from the visible wavelengths in order to transform the optimisation with respect to $r_{eff}$ into a one-dimensional problem. Noting the complexity of the hypersurface of the cost function (with equally deep minima that reflect the ambiguities), Perez et al. (2000) also split the retrieval into two parts. First, cloud top temperature is recovered from the BTD(11-12) signal. Fixing the atmospheric profile using the inferred surface and cloud top temperatures, they then run the forward model again for a range of effective radii. This produces a one-dimensional minimisation problem with respect to $r_{eff}$, which is easier to solve for multiple solutions. The retrieval result then consists of a cloud top temperature, together with a list of possible effective radii. Perez et al.
(2000) compare the retrieval results to time-averaged in-situ measurements performed on the Canary Islands, and find that satellite and in-situ observed $r_{eff}$ agree within 1.25 µm. Gonzalez et al. (2002) report on a modified version of the Perez et al. (2000) method. They eliminate the ambiguities due to effective radius by building a LUT in which, for situations where multiple solutions exist, they only store the median radius. The discarded solutions are used to compute a confidence interval around the central value, which they find to be as large as 7 µm for large $r_{eff}$ in thin clouds. The cost function then possesses a unique global minimum, which they find with a genetic algorithm in order to avoid multiple local minima. In addition to cloud temperature and effective radius, they also retrieve optical depth values. Perez et al. (2002) adapt the retrieval to channels from the MODIS instrument. Instead of using BTs at 3.7, 11 and 12 µm, they employ the 3.7, 3.9, 8.5 and 11 µm channels (plus surface temperature). Furthermore, atmospheric contributions to the top-of-atmosphere (TOA) BTs are included this time, assuming a known atmospheric profile from a different source. The error surface is minimised with a scatter search algorithm. Baum et al. (2003) also extend the Baum et al. (1994) work to the MODIS instrument. Their objective is again to recognise situations in which high level cirrus overlies low boundary layer clouds. Including the 8.5 µm channel in their computations, they note that the BTD(8.5-11) (BTD between 8.5 and 11 µm) shows a strong dependence on cloud top temperature but not on particle radius.
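The look-up step shared by the LUT-based methods described above can be illustrated with a minimal Python sketch. The table entries, the function names `cost` and `lookup`, and the unweighted sum-of-squares cost are all assumptions made for illustration; they are not values or code from the cited studies.

```python
# Hypothetical LUT: (r_eff [um], tau, T_cloud [K]) -> simulated BTs [K]
# at 3.7, 11 and 12 um. All numbers are invented for illustration only.
LUT = [
    ((6.0, 2.0, 285.0), (287.9, 286.1, 285.6)),
    ((10.0, 2.0, 285.0), (287.2, 286.0, 285.4)),
    ((10.0, 8.0, 285.0), (285.9, 285.1, 284.9)),
    ((10.0, 8.0, 282.0), (283.0, 282.2, 282.0)),
]


def cost(observed, computed):
    """Sum of squared differences between observed and computed BTs."""
    return sum((o - c) ** 2 for o, c in zip(observed, computed))


def lookup(observed, lut=LUT):
    """Return (cost, cloud parameters) for every LUT entry, best match first."""
    scored = [(cost(observed, bts), params) for params, bts in lut]
    return sorted(scored, key=lambda item: item[0])


# Assumed observation, close to the third LUT entry.
ranked = lookup((285.8, 285.2, 284.9))
best_cost, best_params = ranked[0]
```

Returning the full ranking rather than only the minimum makes it easy to inspect whether several entries have nearly the same cost, which is how the ambiguities discussed above would show up in such a table.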
Apart from the expensive minimisation techniques employed in the outlined nighttime retrieval methods, the procedures also suffer from the problem of computing a meaningful uncertainty estimate. Perez et al. (2000), as well as Gonzalez et al. (2002), perform a sensitivity analysis of their retrievals with respect to errors in the observed brightness temperatures. They find that a variation of ±2 K in the observed BTs of cloudy pixels leads to variations in $r_{eff}$ of more than 3 µm, while variations of ±0.5 K in the clear sky BTs lead to errors in $r_{eff}$ of less than 0.5 µm (Perez et al., 2000). Gonzalez et al. (2002) note that the sensitivities of retrieved cloud parameters to input BTs are largest in the case of thin clouds. They also discuss the effect of the ambiguities in effective radius and the error in the presence of broken clouds on the subpixel scale (neglecting nonlinear effects). However, these sensitivity analyses cannot provide more than an idea of the general magnitude of the uncertainty. They also do not include the effects of assumptions made in the forward model or the accuracy of the numerical inversion. Indeed, Pincus et al. (1995), for the case of optical depth retrievals, report on the difficulty of obtaining a good uncertainty estimate. In order to obtain an accurate error estimate for an individual retrieval or for an average over a population of retrievals, additional inversions with perturbed input variables would have to be performed, which increases the computational cost.

1.5.3 Artificial Neural Networks

An interesting development to tackle the high computational cost of the inverse procedure is the use of artificial neural networks (ANNs). The theory of ANNs as a tool for statistical data modelling has been developed since the 1950s (e.g.
Rosenblatt, 1958), but they were not used much in the atmospheric sciences until the 1990s (e.g. McCann, 1992; Hsieh and Tang, 1998). General references to ANNs are the books by Bishop (1995, 2006). Inspired by the biological archetype of the human brain, an ANN imitates the concept of interconnected nodes, the neurons, that can "fire" in order to pass on information if certain input conditions are met. In particular, a neuron in an ANN computes the sum over all of its input values and returns the value of an activation function of that sum. This activation function can be an arbitrary function, but usually a sigmoidal function is used, which approaches 1 once a certain threshold is exceeded and 0 otherwise. The connections between the individual neurons are weighted, and by adjusting these weights, the ANN can "learn" a certain behaviour. Artificial neural networks have been widely used for both classification and regression problems (Bishop, 1995).

Figure 1.10: Schematic diagram of a feed-forward network with two layers of adaptive weights $w_{ji}^{(1)}$ and $w_{kj}^{(2)}$. The input neurons on the left are labelled with $x_i$. The information propagates in a forward direction through the hidden neurons $z_j$ to the output neurons $y_k$. In addition to the input and hidden neurons, there are two bias parameters with a fixed input (activation) of $x_0 = 1$ and $z_0 = 1$, respectively.

There are many different types of ANNs, but the important one for this study and the works cited here is the multilayer perceptron (MLP).[13] Its architecture is schematically illustrated in Figure 1.10. Its neurons are grouped into several layers, one input layer, one output layer, and one to several hidden layers.
[14] Each neuron in the input layer corresponds to an input variable, and each neuron in the output layer to an output. The number of hidden layers and neurons in them is variable and has to be determined individually for each case. In order to learn a behaviour, the network has to be trained with a training dataset that includes input values as well as their corresponding target values. The weights are then modified until, for all input values, the outputs of the ANN equal the target values to a sufficient accuracy.

[13] I will not give a more detailed description of multilayer perceptrons in this thesis; a thorough introduction to many aspects concerning neural networks can be found in Bishop (1995), a shorter review in Bishop (1994).

[14] The numbering of the layers can be confusing. A "two-layer-perceptron" refers to a network with two layers of adaptive weights, that is only one layer of hidden neurons. "Two hidden layers", however, refer to two layers of hidden neurons and hence three layers of adaptive weights.

An important property of such ANNs is that, given a large enough number of hidden neurons and weights, they are able to approximate any arbitrary smooth function (e.g. Bishop, 1995). This makes them ideally suited for nonlinear regression problems in which little is known about the shape of the cost function. Furthermore, once the training process is completed, the computation of the output values includes only a few summations and evaluations of the activation functions, which makes a trained ANN very fast. Of course, there are also disadvantages, as it can be difficult to determine the number of hidden neurons or to find a good set of weights (Bishop, 1995; more specific problems arising in the atmospheric sciences are outlined by Hsieh and Tang, 1998).
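The statement that a trained network only needs a few summations and activation-function evaluations can be made concrete with a minimal Python sketch of the forward pass of a two-layer perceptron. The function name `mlp_forward`, the choice of tanh hidden units with linear outputs, and the example weights are assumptions for illustration, not the architecture or code used in this thesis.

```python
import math


def mlp_forward(x, w1, w2):
    """Forward pass of a perceptron with two layers of adaptive weights.

    w1[j] holds the weights of hidden neuron j; w1[j][0] multiplies the
    fixed bias input x0 = 1. w2[k] holds the weights of output neuron k;
    w2[k][0] multiplies the fixed bias activation z0 = 1.
    Hidden neurons use tanh (a sigmoidal function), outputs are linear.
    """
    x_ext = [1.0] + list(x)  # prepend bias input x0 = 1
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x_ext))) for row in w1]
    z_ext = [1.0] + z  # prepend bias activation z0 = 1
    return [sum(w * zj for w, zj in zip(row, z_ext)) for row in w2]


# Assumed toy weights: two inputs, two hidden neurons, one output.
w1 = [[0.0, 1.0, -1.0],
      [0.5, 0.0, 0.0]]
w2 = [[0.1, 2.0, 0.0]]
y = mlp_forward([0.3, 0.3], w1, w2)
```

For the example input the first hidden neuron receives a weighted sum of zero, so its tanh activation is zero and the output reduces to the output bias weight of 0.1.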
In the context of atmospheric satellite remote sensing, some work has been done that employs ANNs. For instance, Krasnopolsky et al. (1995) and Krasnopolsky et al. (2000) used ANNs to retrieve surface wind speeds over the ocean from microwave measurements. Aires et al. (2001) determined surface temperature, water vapour content and cloud liquid water path, also from microwave observations, and Aires et al. (2002) retrieved atmospheric and surface temperature from infrared measurements. Faure et al. (2001c) applied ANNs to daytime retrievals of cloud parameters. In their work, they investigated the feasibility of simultaneously retrieving cloud optical thickness, effective radius, a relative cloud inhomogeneity and fractional cloud cover of inhomogeneous clouds from observations at 0.6, 1.6, 2.1 and 3.7 µm. They used a three-dimensional Monte Carlo radiative transfer model to account for nonlinear effects due to the cloud inhomogeneities to build a database of 3000 clouds. An ANN with two hidden layers, 10 neurons in each layer, was trained from this database in order to model the inverse function. Faure et al. (2001c) concluded that a retrieval of inhomogeneous clouds using an ANN is feasible. The study was extended by Cornet et al. (2004) to combine daytime observations available at different resolutions. In order to account for the increased complexity of the problem, they employed a combination of several ANNs and applied the method to MODIS data (Cornet et al., 2005). However, they did not compare their results to in-situ observations. Schüller et al. (2003, 2005) also used ANNs for daytime retrievals of droplet number concentration, geometrical thickness and liquid water path of shallow convective clouds, but they did not give many details on the architecture they used.
Motivated by the high computational efficiency of an ANN-approximated inverse function, Cerdena et al. (2004) continued the work of Perez et al. (2000) and Gonzalez et al. (2002), becoming the first to apply ANNs to nocturnal cloud property retrievals. In their study, they empirically explored several network architectures with differing numbers of hidden layers and neurons, and found that two layers with 100 neurons in the first and 20 neurons in the second layer yielded the best results. The ANNs were trained using a database of 20,000 data points, with an additional 10,000 independent points for validation. Cerdena et al. (2004) compare two forward models, one employing the vertically uniform cloud model, and the other an adiabatic profile with vertically increasing liquid water content (more details in Chapter 2). For the same data as analysed by Perez et al. (2000) and Gonzalez et al. (2002), the ANN-retrieved effective radii agreed within 2 µm with the in-situ observed values, with the adiabatic cloud model yielding slightly improved results. Cerdena et al. (2007) extended the retrieval method to the daytime case and also further investigated the network architecture. Utilising genetic algorithms, they were able to improve the network design to a network containing 20 neurons in the first and 5 neurons in the second hidden layer, yielding similar results to their first, more expensive architecture. They also presented a sensitivity analysis, similar to the one conducted by Perez et al. (2000), by perturbing the input brightness temperatures by 0.5 K and noting the effect on the retrieved parameters for thin ($\tau < 2$) as well as thick ($\tau > 8$) clouds. For thin clouds, such perturbations can lead to errors of more than 4 µm in $r_{eff}$, 4 K in cloud temperature and 0.6 in $\tau$.
For thick clouds, errors are smaller in $r_{eff}$ and $T$, but larger for $\tau$. However, in the Cerdena et al. (2007) work, the problem of estimating an accurate error interval for given inputs remains. Furthermore, they do not discuss the effect of an uncertainty in the neural network fit on the outputs (i.e. the uncertainty of having found the best set of weights during the training process), which can also be significant (Aires et al., 2004a).

Indeed, analytical methods exist both to compute the Jacobian (i.e. the sensitivity) of a network for a given input and to compute the output uncertainty due to the network fit. MacKay (1992a,b, 1995) developed a Bayesian framework for neural network training that, besides making the training process more stable, allows for an estimation of the uncertainties in the network predictions that are due to the network fit and to noise in the training data. Aires (2004) and Aires et al. (2004a,b) generalised this concept to the multidimensional case and demonstrated its application in the context of their previous microwave remote sensing problem (Aires et al., 2001).

The Jacobian is very important for the validation of the network fit. Since for the inverse problem, we do not know the shape of the function to model, we can only verify the ANN by comparing its predictions to independent test data or by checking the physical consistency of the Jacobian. This means that the output variables should be dependent on the expected input variables. For instance, during nighttime, we expect the effective radius output to be sensitive mainly to the BTD(3.7-11) signal (cf. Figure 1.9), while cloud top temperature should depend mainly on the thermal signals at 11 or 12 µm.
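As a rough illustration of what such a Jacobian contains, the following Python sketch estimates it by central finite differences for an arbitrary retrieval function. This is a numerical stand-in, not the analytical network Jacobian referred to above; `numerical_jacobian`, `toy_retrieval` and its coefficients are hypothetical and chosen only so that the expected sensitivities are easy to read off.

```python
def numerical_jacobian(f, x, h=1e-4):
    """Central-difference estimate of J[k][i] = d f_k / d x_i at the point x."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        y_plus, y_minus = f(x_plus), f(x_minus)
        for k in range(n_out):
            jac[k][i] = (y_plus[k] - y_minus[k]) / (2.0 * h)
    return jac


def toy_retrieval(bt):
    """Invented stand-in for a trained network: BTs [K] -> [r_eff, T_cloud]."""
    bt37, bt11, bt12 = bt
    r_eff = 8.0 + 1.5 * (bt37 - bt11)  # responds only to BTD(3.7-11)
    t_cloud = 0.5 * (bt11 + bt12)      # responds only to the thermal channels
    return [r_eff, t_cloud]


J = numerical_jacobian(toy_retrieval, [287.0, 285.0, 284.5])
# Rows: outputs (r_eff, T_cloud); columns: inputs (BT3.7, BT11, BT12).
```

In this toy case the first row has equal and opposite entries for the 3.7 and 11 µm inputs and a zero for 12 µm, which is the kind of pattern one would look for when judging whether a retrieval responds to the physically expected signals.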
This consistency check can even be done in a quantitative way by numerically estimating sensitivities at certain points in the LUT and comparing them to the ANN predicted values.

However, inverse problems are often ill-conditioned. That means that there exist several differing functional mappings that approximate the training data equally well, but do not necessarily represent the true function that generated the data - the training data are simply not precise enough to unambiguously specify the correct function. If a bad mapping is afterwards applied to input data that were not part of the training dataset, the prediction may be far from the actual value. The network is then said to generalise badly (Bishop, 1995). While this problem is difficult to avoid, additional care should be exercised when using ANNs for ill-conditioned problems. Aires et al. (2004b) describe an approach to estimate the variability of the Jacobian due to the uncertainty in the network fit. This variability is a good indicator of how unstable the training process was and thus how ill-conditioned the problem is.

1.6 Objectives

A neural network can potentially provide a fast retrieval algorithm for determining nocturnal cloud properties that might be able to overcome the performance problem encountered in classical retrieval approaches. Furthermore, the application of the methods developed by MacKay and Aires et al. could be very beneficial for understanding the "black-box nature" of ANNs and for improving the retrieval quality. I will extend the promising Cerdena et al. (2007) results to a different network architecture, a different satellite sensor and different in-situ data. My focus lies on the uncertainty in the retrieval, and what we can learn from its sensitivities. In particular, the objectives for this thesis can be outlined as follows.
The goal is to set up a retrieval method capable of determining cloud effective radius, cloud top temperature and cloud optical thickness from nocturnal measurements of the MODIS instrument. In-situ data for evaluation purposes are available from the DYCOMS-II field campaign, and the method should be able to provide error bars on the results. The individual steps include:

• The Aires (2004) and Aires et al. (2004a,b) method will be implemented and it will be investigated whether it is suitable for the given problem. Apart from obtaining an uncertainty estimate of the network fit, the hope is that ambiguities in the training database will cause a larger uncertainty in the retrieval.

• A forward model will be developed that is capable of computing top-of-atmosphere brightness temperatures for the relevant near and thermal infrared MODIS channels for varying cloud parameters. A LUT (LUT will be used synonymously for database from here on) has to be computed using this model.

• The implemented methods will be applied to the retrieval problem. ANNs have to be trained from the computed LUT, different network architectures have to be explored and the results have to be evaluated. For this thesis, the goal is to retrieve one scene from the DYCOMS-II field campaign (so that several parameters including the atmospheric profile can be prescribed in order to simplify the problem) and to evaluate the results using the in-situ data.

• Finally, the uncertainties and sensitivities of the retrieval will be analysed and their usefulness investigated.

This thesis is meant to build a basis for further research.
If the retrieval of the test case proves successful, future work can be performed on evaluating further scenes, with an eventual goal of creating a general method that could be used "operationally" for any arbitrary scene.

The remainder of this document is structured as follows. In Chapter 2, the radiative transfer model will be described. The employed cloud model will be discussed, as well as radiative transfer techniques for computing the TOA brightness temperatures. A short section will be devoted to the effects of the overlying atmosphere. In Chapter 3 I discuss neural network techniques. The Aires method will be derived, and its implementation will be applied to simple test cases. The behaviour of the method for these cases will be analysed and its applicability to the remote sensing problem discussed. The training of ANNs from the LUT and the retrieval of the test scene is the topic of Chapter 4. Issues concerning the retrieval setup will be discussed, and the results of sensitivity and uncertainty estimates are presented. A discussion of the usefulness of the estimated variables is given, and the thesis concludes with a summary of the work in Chapter 5.

Chapter 2

Radiative Transfer and Development of the Forward Model

In this chapter I will describe the design of the forward model that is used for the case study. The scene of July 11, 2001, was chosen for the retrieval, and the in-situ aircraft measurements used in this chapter are taken from DYCOMS-II research flight II (RF02), which took place on that day. The DYCOMS-II campaign will briefly be introduced in section 2.1. Next, I will introduce the radiative transfer equation (RTE) that mathematically describes the propagation of radiation through the atmosphere and discuss how it can be solved.
For the radiative transfer calculations, a cloud model and a droplet size distribution are needed. Using the RF02 data, I will demonstrate the adequacy of the adiabatic approximation for the cloud model and show that the size distribution can be described by a modified gamma distribution (sections 2.2 and 2.3). The question of which MODIS channels are best suited for the retrieval is discussed in section 2.4. The radiances measured by the sensor at these channels are always average radiances over a wavelength interval, since the instrument cannot measure at individual wavelengths. Hence, the forward model must be able to compute such interval-averaged intensities. I will introduce the correlated-k approximation as an efficient way to calculate gaseous absorption over these intervals. The forward model also requires specification of absorption and emission by gaseous atmospheric constituents including water vapour and carbon dioxide above the cloud. This is discussed in section 2.5. The radiative transfer package libRadtran (Mayer and Kylling, 2005) is employed to compute cloud top radiances. In section 2.6, the setup of the forward model including libRadtran will be described, and I conclude the chapter by demonstrating that the forward model is able to reproduce the BTD relationships of Baum et al. (1994) and Perez et al. (2000).

2.1 DYCOMS-II Data

The DYCOMS-II field campaign took place from July 7, 2001, to July 28, 2001. During nine nights, research flights collected extensive in-situ datasets in the nocturnal Sc cloud layer over the east Pacific ocean off the coast of California, approximately 350-400 km west southwest of San Diego. The campaign and the available data are described by Stevens et al. (2003a,b).
While the major objective of the campaign was to perform measurements to advance understanding of entrainment and drizzle processes, the collected data are also very useful for this remote sensing study. During seven nights, circles with an approximate diameter of 60 km (30 min) were flown at several heights in the boundary layer (subcloud and cloud layer). Additionally, frequent vertical profiles were taken. Data useful for this study include the measurements taken during vertical profiling and in horizontally advected circles flown just below cloud top and above cloud base. Besides the standard meteorological measurements of temperature, humidity, pressure and wind speed, data of liquid water content, droplet concentration and droplet sizes were taken (a complete list can be found in Stevens et al. (2003b)). The ground speed of the airplane was about 100 m s⁻¹ and measurements relevant for this work were taken every second. To ensure comparability with the MODIS data, I have averaged the DYCOMS-II data over 10 s intervals in order to yield measurements on the 1 km scale. The measured droplet size distributions allowed for the computation of effective radius and mean radius.

For this thesis, the flights of July 11 and July 13 (RF02 and RF03 in the DYCOMS-II literature) were chosen. Satellite images of both scenes contain a number of ship tracks that provide contrasting droplet sizes which can be used to check the physical consistency of the retrieval. While the July 11 case will serve as the retrieval example in Chapter 4, data from both nights are used in this chapter for evaluating the cloud model. The actual satellite images from the MODIS sensor will be introduced in Chapter 4.
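The averaging of 1 Hz aircraft data to the 1 km scale can be sketched in a few lines of Python. The function name `block_average`, the synthetic liquid water content series, and the choice to drop an incomplete trailing interval are assumptions for illustration; the thesis does not specify how the interval boundaries were handled.

```python
def block_average(samples, block=10):
    """Average consecutive groups of `block` samples.

    An incomplete group at the end of the series is dropped (an assumption).
    """
    n_blocks = len(samples) // block
    return [sum(samples[i * block:(i + 1) * block]) / block
            for i in range(n_blocks)]


ground_speed = 100.0  # m/s, approximate aircraft ground speed
window = 10           # s, averaging interval at 1 Hz sampling
footprint_km = ground_speed * window / 1000.0  # distance covered per average

# Synthetic 1 Hz series (25 samples), values invented for illustration.
lwc = [0.1 * (i % 10) for i in range(25)]
averaged = block_average(lwc, block=window)
```

With a ground speed of about 100 m/s, each 10 s average covers roughly 1 km of flight track, which is what makes the averaged values comparable to a single MODIS pixel.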
2.2 Scattering and Droplet Spectra

2.2.1 Radiative Transfer Equation

Figure 2.1 illustrates the processes that influence radiation propagating along a line-of-sight as it passes through the atmosphere. As discussed in section 1.3, energy can be lost by absorption and scattering out of the direction of propagation, while emission and scattering into the beam represent sources of energy. The change in intensity across an infinitesimal volume hence is the sum of a sink term describing extinction by absorption and scattering, and source terms representing emission and scattering [15]:

$$dI = dI_{ext} + dI_{emit} + dI_{scat}. \tag{2.1}$$

[15] Note that I am still omitting the subscript $\lambda$. All equations given here correspond to the monochromatic case.

Figure 2.1: Processes that influence radiation as it passes through the atmosphere.

The extinction term is described by Beer's Law (1.6):

$$dI_{ext} = -\beta_e I\,ds, \tag{2.2}$$

where $ds$ represents an infinitesimal path length along the direction of propagation. Emission is given by the emissivity of the medium times the Planck function (1.16):

$$dI_{emit} = \beta_a B(T)\,ds. \tag{2.3}$$

Here, Kirchhoff's Law (1.17) has been used to express the emissivity in terms of the absorption coefficient. The scattering source term, however, is more complicated. The scattering phase function $p(\Omega', \Omega)$ expresses the idea that radiation from any direction $\Omega'$ passing through an infinitesimal volume can contribute scattered radiation to our direction of interest $\Omega$. Furthermore, $dI_{scat}$ must be proportional to the scattering coefficient $\beta_s$, which describes how much scattering occurs:

$$dI_{scat} = \frac{\beta_s}{4\pi} \int_{4\pi} p(\Omega', \Omega)\, I(\Omega')\, d\Omega'\, ds. \tag{2.4}$$

The scattering phase function can be interpreted as a probability density. $p(\Omega', \Omega)$ gives the probability that a photon from direction $\Omega'$ is scattered into direction $\Omega$. Hence, $p(\Omega', \Omega) I(\Omega')$ describes the gain in intensity in direction $\Omega$ due to scattered radiation from direction $\Omega'$. In order to compute the total energy gain due to scattered radiation, contributions from all directions are summed in the integral, where the factor of $1/4\pi$ arises from the spherical integration and ensures that the integral over the phase function is one.

Putting the extinction and source terms into (2.1), the radiative transfer equation becomes

$$dI = -\beta_e I\,ds + \beta_a B\,ds + \frac{\beta_s}{4\pi} \int_{4\pi} p(\Omega', \Omega)\, I(\Omega')\, d\Omega'\, ds. \tag{2.5}$$

In this form, it has a general three-dimensional character, i.e. the direction vectors $\Omega$ can be expressed in any coordinate system, and the infinitesimal path length $ds$ can be in any direction.

The most general approach to solve the radiative transfer problem numerically is the Monte Carlo method (e.g. Bohren and Clothiaux, 2006). It allows for arbitrary three-dimensional scenes by simulating the propagation of a large number of photons through the medium. The path of each photon is traced from its original source (e.g. the sun) until it leaves the defined scene (e.g. at the top of the atmosphere). While this method is very flexible and allows for inhomogeneities in the cloud layer, it is computationally very expensive and currently not suited to compute the TOA radiances for a large number of clouds, as needed for a LUT.

For the reasons discussed in Chapter 1, the plane parallel approximation is a good assumption for large parts of marine Sc on the scale of the MODIS pixels (1 km). In order to keep the forward problem as simple as possible and at the same time computationally tractable, I decided to construct the LUT for plane parallel clouds.
In fact, most analytic solutions and approximations to the radiative transfer equation have been developed for the plane parallel case (Petty, 2006).

In a plane parallel atmosphere, all relevant parameters (i.e. $\beta_a$, $\beta_s$, $T$, $r_{eff}$, cf. section 1.3) are only dependent on height. Since the important aspect for radiation is how much absorbing, emitting and scattering atmosphere it must traverse, it is convenient to express the vertical coordinate in terms of the optical properties of the atmosphere rather than in geometrical units. The extinction optical depth is defined as the optical thickness of the atmosphere between the top of the atmosphere and a level at height $z$ [16]:

$$\tau(z) = \int_z^{\infty} \beta_e(z')\,dz'. \tag{2.6}$$

[16] Some authors use optical depth synonymously with optical thickness when referring to the cloud optical thickness; in order to avoid confusion I will use optical depth for the vertical coordinate in the RTE and optical thickness for the cloud property.

At the top of the atmosphere, $\tau = 0$. Furthermore, directions are usually expressed in terms of zenith angle $\theta$ (measured from directly overhead; $\theta = 0$ is overhead, and $\theta = \pi/2$ is the horizon), and azimuth angle $\phi$ (measured counterclockwise from a reference point on the horizon). The infinitesimal path length $ds$ in (2.5) can now be expressed as

$$ds = \frac{dz}{\mu}, \tag{2.7}$$

where $\mu = \cos\theta$. From (2.6) it follows that $d\tau = -\beta_e\,dz = -\beta_e \mu\,ds$. Dividing (2.5) by this new height increment, the radiative transfer equation becomes

$$\mu \frac{dI(\mu, \phi)}{d\tau} = I(\mu, \phi) - (1 - \tilde{\omega})B - \frac{\tilde{\omega}}{4\pi} \int_0^{2\pi} \int_{-1}^{1} p(\mu, \phi; \mu', \phi')\, I(\mu', \phi')\, d\mu'\, d\phi', \tag{2.8}$$

where $\tilde{\omega} = \beta_s / \beta_e$ denotes the single scattering albedo, so that $1 - \tilde{\omega} = \beta_a / \beta_e$.

The problem of computing the TOA radiances "seen" by the satellite sensor now becomes the problem of solving this RTE. A common simplification in order to solve the RTE is to divide the zenith angle into discrete intervals, so-called streams.
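As a minimal sketch of the change of vertical coordinate, the following Python snippet integrates an extinction coefficient profile from the top level downward to obtain the optical depth of (2.6), and evaluates the slant path of (2.7). The function names, the trapezoidal rule, and the assumption that there is no extinction above the highest grid level are illustrative choices made here, not part of the forward model described in this chapter.

```python
import math


def optical_depth_profile(heights, beta_e):
    """Extinction optical depth tau(z), measured downward from the top level.

    heights : ascending heights in m; the last entry stands in for the top of
              the atmosphere (assumption: no extinction above it).
    beta_e  : extinction coefficient in m^-1 at each height.
    Returns a list with tau at every height, with tau = 0 at the top.
    """
    tau = [0.0] * len(heights)
    # Integrate beta_e downward from the top with the trapezoidal rule.
    for i in range(len(heights) - 2, -1, -1):
        dz = heights[i + 1] - heights[i]
        tau[i] = tau[i + 1] + 0.5 * (beta_e[i] + beta_e[i + 1]) * dz
    return tau


def slant_path(dz, theta):
    """Path length ds through a layer of thickness dz: ds = dz / cos(theta)."""
    return dz / math.cos(theta)


# Hypothetical homogeneous layer: 1000 m thick, beta_e = 0.02 m^-1.
heights = [100.0 * i for i in range(11)]
beta_e = [0.02] * len(heights)
tau = optical_depth_profile(heights, beta_e)

print(f"tau at the top:  {tau[-1]:.2f}")
print(f"tau at the base: {tau[0]:.2f}")
print(f"slant path through 100 m at 60 degrees: {slant_path(100.0, math.pi / 3):.1f} m")
```

For a homogeneous layer the result is simply the extinction coefficient times the layer thickness, which makes the sketch easy to check by hand.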
The simplest of these methods is the two-stream method, which divides the intensity field into only two directions: upwelling and downwelling. Hence, it assumes that the intensity is approximately constant in each hemisphere. The multi-stream code DISORT (Stamnes et al., 1988, 2000) generalises this concept to an arbitrarily large number of discrete angles, so that it becomes possible to accurately compute radiances for pixels that are not directly underneath the satellite. The code is well documented and already part of the radiative transfer package libRadtran. It will be used for the forward model computations in this thesis.

2.2.2 Droplet Size Distribution and its Phase Function

A big simplification for liquid water clouds is that the water droplets can be treated as spherical droplets, which makes the computation of the phase function with Mie theory (Mie, 1908) possible. In contrast, the non-spherical character of ice particles makes the computation of the phase function for ice clouds much more difficult (Mayer and Kylling, 2005). Mie theory employs Maxwell's equations to derive a three-dimensional wave equation for electromagnetic radiation, which is solved for boundary conditions at the surface of a sphere (Petty, 2006).

For spherical particles, the phase function only depends on the angle $\Theta$ between the original direction $\Omega'$ and the scattered direction $\Omega$ of a photon. Since $\cos\Theta = \Omega' \cdot \Omega$, it is common to write the phase function as $p(\cos\Theta)$.

For remote sensing applications it is of interest to model the mean scattering effects of the entire cloud droplet size distribution. A common assumption is to model this distribution as a modified gamma distribution. It is given by (Miles et al., 2000)

$$n(r) = \frac{N}{\Gamma(\nu_{gam})\, r_n} \left(\frac{r}{r_n}\right)^{\nu_{gam} - 1} \exp\left(-\frac{r}{r_n}\right), \tag{2.9}$$

where $\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt$ is the gamma function (e.g. Weisstein, 2007), $N$ the total droplet concentration, $\nu_{gam}$ the shape parameter, and $r_n = r_{eff} / (\nu_{gam} + 2)$ a scaling parameter. Location and width of the distribution are given by $r_{eff}$ and $\nu_{gam}$. The larger $\nu_{gam}$ becomes, the narrower the distribution will be.

In many studies that employ Mie theory for computing the scattering properties of the cloud layer, the shape parameter is not discussed at all (for instance, in Perez et al. (2000) or Cerdeña et al. (2007)). It is common, however, to assume $\nu_{gam}$ to be constant (Miles et al., 2000). For instance, the precalculated Mie tables for the AVHRR sensor, as available on the libRadtran homepage [17], use a value of 7.

[17] http://www.libradtran.org/

Arduini et al. (2005) investigated the sensitivity of daytime retrieved continental cloud properties to droplet size distribution width. They found that the shape of the size distribution can have a significant effect on retrieved properties; therefore, the shape parameter must be representative of the cloud regime to be observed. Miles et al. (2000) compared the droplet size distributions of both marine and continental stratocumulus clouds that were reported from different measurement campaigns in the literature. From more than 40 observations, they found a mean shape parameter of 8.6 for marine clouds; however, the observed values varied from less than 3 to over 40.

In order to avoid errors arising from an unrepresentative size distribution, I computed the best fit gamma distribution to the DYCOMS-II measured droplet size spectra. The shape parameter was determined from equations (3a) and (3b) in Miles et al. (2000):

$$\nu_{gam} = 2 \left(\frac{r_{eff}}{r_{vol}} - 1\right)^{-1}. \tag{2.10}$$

Here, $r_{eff}$ is the effective radius as computed from (1.11), and $r_{vol}$ is the volume mean radius. It is computed from the measured spectra via (cf. section 1.3)

$$r_{vol} = \left(\frac{\int n(r)\, r^3\, dr}{\int n(r)\, dr}\right)^{1/3}. \tag{2.11}$$

Figure 2.2a shows a size distribution measured at cloud top during the July 11 flight.
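To make the role of the two parameters concrete, the sketch below evaluates a modified gamma distribution on a radius grid and recovers the effective radius numerically as the ratio of the third to the second moment. The functional form used in the code (the standard modified gamma form with scaling parameter $r_n = r_{eff} / (\nu_{gam} + 2)$), the function names, the grid and the droplet concentration of 100 cm$^{-3}$ are assumptions made for illustration; the effective radius and shape parameter are the cloud top values quoted for the July 11 spectrum.

```python
import math


def modified_gamma(r, n_total, r_eff, nu):
    """Number density n(r) of a modified gamma distribution.

    r and r_eff share one length unit; nu is the shape parameter.
    Evaluated in log space so that large shape parameters do not overflow.
    """
    r_n = r_eff / (nu + 2.0)  # scaling parameter
    x = r / r_n
    log_n = (math.log(n_total) - math.lgamma(nu) - math.log(r_n)
             + (nu - 1.0) * math.log(x) - x)
    return math.exp(log_n)


def effective_radius(radii, n_r):
    """r_eff = <r^3> / <r^2> on a uniform radius grid (grid spacing cancels)."""
    third = sum(n * r ** 3 for r, n in zip(radii, n_r))
    second = sum(n * r ** 2 for r, n in zip(radii, n_r))
    return third / second


dr = 0.01                                   # grid spacing in micrometres
radii = [dr * (i + 1) for i in range(4000)]  # 0.01 to 40 micrometres
n_r = [modified_gamma(r, n_total=100.0, r_eff=11.6, nu=33.5) for r in radii]

n_numeric = sum(n_r) * dr
r_eff_numeric = effective_radius(radii, n_r)
print(f"integrated droplet concentration: {n_numeric:.3f}")
print(f"numerically recovered r_eff:      {r_eff_numeric:.3f}")
```

Repeating the calculation with a smaller shape parameter (for example the value of 7 used in the precalculated AVHRR tables) gives a visibly broader distribution for the same effective radius.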
The effective radius of the distribution in Figure 2.2a is 11.6 μm, and the best fit gamma distribution has a shape parameter of 33.5, significantly narrower than the average value of 8.6 found by Miles et al. (2000). Indeed, the average shape parameter encountered on this day was 26, the distribution of which is also plotted in the figure (see section 4.1 for more details).

Not all measured distributions were well represented by a modified gamma distribution. In fact, Miles et al. (2000) note that Sc droplet size distributions are frequently bimodal. An example of such a distribution, measured close to cloud base on July 13, is shown in Figure 2.2b. Here, the best fit distribution computed from (2.10) has a shape parameter of 10.4.

Since $\nu_{gam}$ is not a variable parameter in my retrieval algorithm, it has to be set to a fixed value as representative of the scene as possible. For the specific test case discussed in Chapter 4 (and also for the remaining DYCOMS-II cases), it is possible to analyse the in-situ data prior to building the LUT. Hence, a representative mean value can be found. For future applications of the method that require independence of the LUT of the scene to be retrieved, sensitivity tests should be conducted in order to investigate the impact of a fixed gamma distribution shape parameter on the retrieved values.

Figure 2.3a shows the phase functions of droplet size distributions with effective radii of 2, 11.5 and 25 μm and a shape parameter of 26 for radiation at $\lambda = 3.7$ μm. In Figure 2.3b, the phase function for the 11.5 μm distribution (as in Figure 2.2a) is reproduced in a polar plot (note the logarithmic scale in both figures). The functions were computed using Mie code included in K. F. Evans' SHDOM (Spherical Harmonic Discrete Ordinate Method, another radiative transfer model) distribution (Evans, 1998).
A distinct feature worth mentioning is the strong forward scattering peak. It becomes more pronounced with increasing $r_{eff}$, which means that for this cloud a large amount of the scattered photons will be scattered into the forward direction (this is also true at visible wavelengths, the reason why sunlight penetrates even thick clouds). Also, more photons are scattered into the backwards direction than to the sides. The small peak at about $\cos\Theta = -0.8$ becomes more pronounced in the visible wavelength range, where it is seen as the rainbow.

[Figure 2.2. Panels: (a) in-situ spectrum of 2001-07-11 09:52:49 UTC, (b) in-situ spectrum of 2001-07-13 08:21:16 UTC; horizontal axis: droplet radius [μm].]

Figure 2.2: In-situ measured spectra of cloud droplet sizes from the Jul 11th and Jul 13th flights, together with best-fit gamma distributions for (a) a unimodal distribution and (b) a bimodal distribution. Solid black lines mark the in-situ data, dashed lines are the best fit distributions. The average shape parameter on both days was approximately 26; this distribution is shown by the dotted line.

Figure 2.3: (a) Phase functions of cloud distributions following a gamma distribution with a shape parameter of 26. Shown are functions for effective radii of 2, 11.5 and 25 μm. $\Theta$ represents the scattering angle, $\cos(\Theta) = 1$ is forward scattering. (b) Polar view of the phase function for 11.5 μm, also with a logarithmic scale.

2.3 Adiabatic Cloud Model

In order to estimate the effect of the vertical stratification in the cloud layer on forward radiative transfer calculations, Brenguier et al.
(2000) compared the vertically uniform (VU) cloud model (employed by Minnis et al. (1995); Heck et al. (1999); Perez et al. (2000); Gonzalez et al. (2002) or Perez et al. (2002)) to an adiabatic stratified (AS) one. They found that the adiabatic stratified model in general is the more accurate choice. First, in-situ data analysed by Brenguier et al. (2000) of marine and continental Sc taken during the Second Aerosol Characterisation Experiment (ACE-2) showed a close-to-adiabatic stratification. Second, by comparing the radiative properties of clouds represented by a vertically uniform model and by the adiabatic stratified one, they found that in order to achieve radiative equivalence, the effective radius of the VU model had to be 80% - 100% of the cloud top $r_{eff}$ of the AS model, the actual value depending strongly on cloud geometrical thickness and droplet concentration.

These results mean that the effective radius retrieved by a scheme employing a VU model is not necessarily representative of the actual cloud top $r_{eff}$, since the VU model provides no way to retrieve cloud geometrical thickness or droplet concentration. By employing the AS model, on the other hand, it is possible to retrieve droplet concentration via the adiabatic relationships between the thermodynamic variables. This could be especially useful for monitoring of the aerosol indirect effect.

Indeed, the vertical profiles measured during the DYCOMS-II campaign closely resemble adiabatic profiles. Figure 2.4 shows in-situ data taken during RF02 (July 11), together with the vertical profile of an idealised adiabatic cloud. When cloud droplet number concentration, humidity, temperature and pressure are prescribed at a reference level, it is straightforward to compute an adiabatic cloud profile using the thermodynamic equations. As thermodynamic variables I use the moist static energy $h_m$ (e.g.
Wallace and Hobbs, 2006) and the total specific humidity $q_T$ (i.e. water vapour plus liquid water). In this thesis, "adiabatic profile" refers to a profile in which $h_m$ and $q_T$ are constant. If $c_p$ denotes the specific heat of dry air at constant pressure (J K$^{-1}$ kg$^{-1}$), $g$ the earth's gravitational acceleration (m s$^{-2}$), $L_v$ the latent heat of evaporation (J kg$^{-1}$) and $q_v$ the specific humidity (g kg$^{-1}$) (Curry and Webster, 1999; Wallace and Hobbs, 2006), $h_m$ is given by

$$h_m = c_p T + g z + L_v q_v. \tag{2.12}$$

[Figure 2.4. Panel axes: droplet concentration [cm$^{-3}$], LWC [g/m$^3$], temperature [K], particle radius [μm].]

Figure 2.4: Vertical profile of cloud properties from the idealised adiabatic cloud model, together with in-situ measurements from the July 11 DYCOMS-II flight. The adiabatic cloud has a cloud top effective radius of 10.7 μm, a cloud top temperature of 284.8 K and a droplet concentration of $N = 110$ cm$^{-3}$. Volume mean radius is converted to effective radius with $k = 0.8$. Note that in order to fit the observed liquid water content the droplet concentration is overestimated in the adiabatic model (compare to Figure 2.6).

Following the findings of Miles et al. (2000), the total droplet concentration (cm$^{-3}$)

$$N = \int n(r)\,dr \tag{2.13}$$

is also assumed to be constant in my model.

The above equations result in linear temperature profiles following

$$T(z) = T_{sfc} - \Gamma z, \tag{2.14}$$

where $\Gamma$ represents the dry adiabatic lapse rate $\Gamma = \Gamma_d \approx 10$ K km$^{-1}$ below cloud base and the saturated adiabatic lapse rate $\Gamma = \Gamma_s$ within the cloud. $\Gamma_s$ varies with temperature and pressure; for marine Sc a value of $\Gamma_s \approx 6$ K km$^{-1}$ is representative (Curry and Webster, 1999).
Below cloud base, no condensation occurs, so that the specific humidity remains constant with height:

$$q_v(z) = \text{const.} \tag{2.15}$$

Within the cloud, $q_v$ follows the saturation specific humidity $q_s$:

$$q_v(z) = q_s(T, p), \tag{2.16}$$

where $q_s$ is dependent on both temperature and pressure and can be obtained from the Clausius-Clapeyron equation (see Curry and Webster, 1999, for further details).

The liquid water content can be determined by subtracting the saturation water vapour content (the saturation specific humidity expressed in g m$^{-3}$) from the total water content (given by the specific humidity below cloud base). $q_l$ also follows an almost linear profile within the cloud:

$$q_l(h) = C_w h, \tag{2.17}$$

where $h$ denotes the height above cloud base, and the moist adiabatic condensate coefficient $C_w$ varies from about 1 to $2.5 \times 10^{-3}$ g m$^{-3}$ m$^{-1}$, depending slightly on temperature (Brenguier et al., 2000).

In my model, cloud top and surface pressure, cloud top temperature, cloud top liquid water content and droplet concentration are prescribed. From the latter two, the volume mean radius $r_{vol}$ is computed, given the relationship

$$q_l = \frac{4}{3} \pi \rho_l N r_{vol}^3, \tag{2.18}$$

where $\rho_l$ is the density of liquid water. In order to compute the effective radius from the volume mean radius or vice versa, an assumption about the droplet size distribution has to be made. If the distribution is modelled as a modified gamma distribution as discussed in the previous section, (2.10) can be used for the conversion. More common in the literature, however, is to employ the k-value parameterisation suggested by Martin et al. (1994).

Martin et al. (1994) analysed in-situ measurements from marine and continental Sc of several field campaigns. In their data, the clouds were relatively homogeneous and entrainment played a minor role.
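The two relationships that drive the in-cloud profile can be sketched as follows. The snippet assumes a linear liquid water content profile $q_l(h) = C_w h$ and a volume mean radius obtained from $q_l = \frac{4}{3}\pi\rho_l N r_{vol}^3$ with constant droplet concentration; the function names, the condensate coefficient of $2 \times 10^{-3}$ g m$^{-3}$ m$^{-1}$ and the 300 m cloud depth are hypothetical values chosen for illustration, while $N = 110$ cm$^{-3}$ is the value quoted for the idealised cloud in Figure 2.4.

```python
import math

RHO_L = 1.0e6  # density of liquid water in g m^-3


def adiabatic_lwc(h, c_w=2.0e-3):
    """Liquid water content (g m^-3) at height h (m) above cloud base."""
    return c_w * h


def volume_mean_radius(q_l, n_drops_cm3):
    """Volume mean radius (m) from LWC (g m^-3) and concentration (cm^-3)."""
    n_m3 = n_drops_cm3 * 1.0e6  # convert cm^-3 to m^-3
    return (3.0 * q_l / (4.0 * math.pi * RHO_L * n_m3)) ** (1.0 / 3.0)


n_drops = 110.0  # cm^-3, constant with height in this model
for h in (37.5, 150.0, 300.0):
    q_l = adiabatic_lwc(h)
    r_vol_um = volume_mean_radius(q_l, n_drops) * 1.0e6
    print(f"h = {h:6.1f} m  LWC = {q_l:.3f} g m^-3  r_vol = {r_vol_um:.2f} um")

r_top_um = volume_mean_radius(adiabatic_lwc(300.0), n_drops) * 1.0e6
r_low_um = volume_mean_radius(adiabatic_lwc(37.5), n_drops) * 1.0e6
```

Because the droplet concentration is constant, the radius grows with the cube root of the height above cloud base, so an eightfold increase in height doubles the volume mean radius.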
For these cases, they proposed to parameterise the effective radius in terms of the volume mean radius and a dimensionless constant $k$ following

$$r_{eff}^3 = k^{-1} r_{vol}^3. \tag{2.19}$$

In maritime airmasses, Martin et al. (1994) found an average value of $k = 0.80 \pm 0.07$ (one standard deviation). Measured values of Pawlowska and Brenguier (2000) during ACE-2 confirm the estimates made by Martin et al. (1994); however, they also show a strong variability of $k$ with height, with values at cloud base ranging from as low as 0.4 up to 1, while values at cloud top were more constant around 0.8 to 1 (Figure 2.5).

Combining (2.10) and (2.19), the relationship between k-value and $\nu_{gam}$ becomes

$$\nu_{gam} = 2 \left(k^{-1/3} - 1\right)^{-1}. \tag{2.20}$$

[Figure 2.5. Panels: 25 June 1997 and 26 June 1997.]

Figure 2.5: Measurements of k value versus height in marine stratocumulus clouds. The k value varies widely from values as low as 0.4 to almost 1, with a larger variability at cloud base (reprinted from Pawlowska and Brenguier, 2000, © 2000, with permission from Blackwell Publishing).

A distinct problem in Figure 2.4 is that the droplet concentration is overestimated by about 30 cm$^{-3}$ in order to fit the observed liquid water path and droplet sizes. Also, the slope of the liquid water content line seems slightly overestimated. This problem is due to subadiabatic regions in the cloud. For instance, as discussed in section 1.2, entrainment of dry air from above the inversion often causes subadiabaticity. The cloud column can hence be adiabatic in the lower part and subadiabatic in its upper part. Brenguier et al. (2000) emphasise that the AS model still is an idealised representation of the actual cloud. They note that the adiabaticity should be taken as a maximum reference for the actual cloud microphysics at all levels. However, they also point out that the variety of possible profiles is too large for a simple parameterisation, and hence suggest the AS model as a simple and relatively accurate compromise solution. For comparison, Figure 2.6 shows the same data as Figure 2.4, but this time with a subadiabatic profile using 75% of the adiabatic values. This time, both droplet concentration and liquid water content are well-fit.

[Figure 2.6. Panel axes: droplet concentration [cm$^{-3}$], LWC [g/m$^3$], spec. humidity [g/kg], temperature [K], particle radius [μm].]

Figure 2.6: The same as Figure 2.4, but with a subadiabatic liquid water content of 75% of the adiabatic value.

In their investigation of radiative equivalence between the VU and AS models, Brenguier et al. (2000) found that a major difference between the two models is the dependence of optical thickness on cloud geometrical thickness. Equation (1.9) introduced the optical thickness as the integrated extinction coefficient between cloud base $z_{bot}$ and cloud top $z_{top}$:

$$\tau = \int_{z_{bot}}^{z_{top}} \beta_e\,dz. \tag{2.21}$$

If the extinction cross section $\sigma_e = \sigma_a + \sigma_s$ (cf. 1.12 and 1.13) is written in terms of the extinction efficiency $Q_{ext} = \sigma_e / (\pi r^2)$, the extinction coefficient of a droplet size distribution is given by

$$\beta_e = \int n(r) \left[Q_{ext}(r)\, \pi r^2\right] dr. \tag{2.22}$$

The extinction efficiency can be interpreted as how effectively a particle attenuates radiation with respect to its size, and is a complex function of the size parameter $x$, and hence of particle size and wavelength.

In order to compute the optical thickness in the AS model, it is necessary to express (2.21) in terms of the available thermodynamic variables. By combining (2.11), (2.13) and (2.18), we can write the liquid water content as a function of the particle size distribution:

$$q_l = \frac{4\pi}{3} \rho_l \int n(r)\, r^3\, dr. \tag{2.23}$$

This enables us to express $\tau$ in terms of the droplet radius and $q_l$:

$$\tau = \int_{z_{bot}}^{z_{top}} \frac{\int n(r) \left[Q_{ext}(r)\, \pi r^2\right] dr}{q_l(z)}\, q_l(z)\, dz = \int_{z_{bot}}^{z_{top}} \frac{\int n(r) \left[Q_{ext}(r)\, \pi r^2\right] dr}{\int n(r) \left[\rho_l \frac{4\pi}{3} r^3\right] dr}\, q_l(z)\, dz. \tag{2.24}$$

For cloud droplet sizes in the visible wavelength range, the extinction efficiency $Q_{ext}$ is approximately constant at a value of two (e.g. Petty, 2006, Figures 12.4 - 12.6), so that it can be pulled out of (2.24). The integrals can then be written as the effective radius:

$$\tau = \int_{z_{bot}}^{z_{top}} \frac{3\, Q_{ext}\, q_l(z)}{4 \rho_l\, r_{eff}(z)}\, dz \tag{2.25}$$

(e.g. Nakajima et al., 1991), which reduces to (1.14) for $Q_{ext} = 2$ and constant $r_{eff}$. For a VU model, the integral over height can be approximated by a multiplication of the integrand by cloud geometrical thickness $\Delta H$, whereas in the AS case it has to be explicitly computed.

In the infrared wavelength range, however, $Q_{ext}$ and thus the optical thickness of the cloud become a function of both cloud geometrical thickness and particle size distribution (Petty, 2006). Thus, there no longer is a simple relationship between $\tau$, LWP and $r_{eff}$. For simplicity, I will use the visible optical thickness in my forward model. This has the advantage that the optical thickness retrieved by my algorithm can be compared with the preceding and subsequent operational daytime MODIS retrievals.

2.4 Choice of Channels and Correlated-k

The MODIS instrument offers a total of 36 channels in the visible and infrared spectrum (Barnes et al., 1998), 16 of which cover the infrared spectrum from 3.7 to 14.1 μm.
Not all of these channels are suited for remote sensing of cloud properties. As discussed in Chapter 1, the radiative properties of the cloud droplets in the infrared strongly depend on particle size. Hence, the employed channels should maximise these differences. However, it is also desirable for the satellite sensor to measure photons arriving directly from the cloud. This means that the impact of the atmosphere above the cloud on the radiation arriving at the satellite should be as small as possible.

Figure 2.7 shows how the transmittance of a typical midlatitude summer atmosphere varies with wavelength. The bottom panel displays the total transmittance, while in the panels above, the effects of individual atmospheric components are shown. The four channels at approximately 3.7 μm (MODIS number 20), 8.5 μm (29), 11 μm (31) and 12 μm (32), as used in previous studies (cf. Chapter 1), are highlighted. They are located at wavelengths at which the total transmittance is large. Other MODIS channels are located, for instance, at approximately 4.5 or 9.7 μm. While these channels provide data for other applications, the atmosphere at these wavelengths is too opaque for remote sensing of cloud properties. Hence, channels 20, 29, 31 and 32 are indeed best suited for the remote sensing of cloud properties at night.

The second panel from the bottom shows that most of the absorption that occurs above cloud top is due to water vapour. It should thus be considered in the retrieval method. I will discuss in section 2.5 how the effect of water vapour can be accounted for.

The forward model must account for the band width of the chosen channels in the radiative transfer model. This requirement arises from the fact that the MODIS channels are not monochromatic, but cover finite wavelength intervals.
Hence, the "monochromatic" radiances measured by the instrument are spectral-mean radiances. In order to compute spectral-mean radiances with the radiative transfer model, the wavelength dependent absorption of the different molecular species listed in Figure 2.7 has to be taken into consideration. For an exact solution, an integration over many individually computed monochromatic radiances has to be performed, which can be computationally expensive. The correlated-k procedure (e.g. Fu and Liou, 1992; Kratz, 1995) provides a way to reduce the computational cost by several orders of magnitude while maintaining a high accuracy and is used in this thesis.

[Figure 2.7. Zenith atmospheric transmittance; horizontal axis: wavelength [μm].]

Figure 2.7: Dependence of atmospheric transmittance on wavelength for a typical midlatitude summer atmosphere. Highlighted in grey are the four MODIS channels relevant in this study (reprinted from Petty, 2006, © 2006, with permission from Sundog Publishing).

The concept of correlated-k can be explained by considering the simplified problem of computing the spectral-mean transmission of an atmospheric layer, due to only molecular absorption [18]. When the absorption coefficient of an atmospheric constituent is plotted against wavelength, the absorption spectrum in general is very complex. For example, Figure 2.8a shows a spectrum of CO2 at a pressure of 507 hPa. The distinct peaks in the spectrum are called absorption lines. They can be explained by quantum theory (e.g. Petty, 2006), and their height and width are dependent on both temperature and pressure.
In the spectral interval of a satellite channel, there can be thousands of such lines, especially if the effects of different molecular species are overlapping.

First consider a single absorbing species. Let the spectral interval width of a channel be $\Delta\lambda = \lambda_2 - \lambda_1$, and the mass path of the absorbing constituent be denoted by $u$. The mass path is the total mass of the species in a column, measured in kg m$^{-2}$. It is then convenient to introduce the mass absorption coefficient [19]

$$k_\lambda = \frac{\beta_{a,\lambda}}{\rho}, \tag{2.26}$$

where $\rho$ is the density of the constituent. The spectral-mean transmittance follows from the integration of Beer's Law over wavelength:

$$t_{\Delta\lambda}(u, p, T) = \frac{1}{\Delta\lambda} \int_{\lambda_1}^{\lambda_2} \exp\left[-k_\lambda(p, T)\, u\right] d\lambda. \tag{2.27}$$

Here, $p$ denotes pressure and $T$ temperature.

[18] The layer concept is relevant in as far as that DISORT and other RTE solvers divide the atmosphere into discrete layers.

[19] Do not confuse the mass absorption coefficient "k" used in correlated-k with the k-value used in the conversion of $r_{vol}$ to $r_{eff}$.

Figure 2.8: Illustration of the fundamental idea of correlated-k. (a) Absorption spectrum (k-coefficients) of CO2 at a pressure of 507 hPa. (b) Inverse cumulative probability function $k(g)$, that represents the reordered k-coefficients from (a). For a numerical integration, the function in (b) requires far fewer quadrature points than the function in (a) (reprinted from Mlawer et al., 1997, © 1997, with permission from the American Geophysical Union).

The fundamental idea of correlated-k stems from the fact that in order to approximate the integral in (2.27) numerically, it is not important in which order the individual lines are arranged in the spectrum, as long as the area underneath the curve stays the same. Figure 2.8 illustrates this idea. If the complex
spectrum of the absorption coefficient in Figure 2.8a is integrated numerically, a large number of quadrature points is needed in order to obtain a reasonable accuracy [20]. However, by reordering the absorption lines in ascending order (Figure 2.8b), we get a function that is much easier to integrate, yet covers the same area.

[20] Calculations of all individual lines are called line-by-line calculations.

The reordering of the lines can be done by computing the inverse cumulative probability distribution $k(g)$ of the absorption coefficients. The cumulative probability function $g(k) = \int_0^k p(k')\,dk'$ is the integral of the probabilities $p(k')$ for all mass absorption coefficients $k'$ between 0 and $k$, so that its inverse $k(g)$ gives the upper bound of the $(g \times 100)$% smallest ks. For instance, 60% of all k-values in the spectral interval of the channel are smaller than $k(g = 0.6)$. Equation (2.27) can then be written as

$$t_{\Delta\lambda}(u, p, T) = \int_0^1 \exp\left[-k_g(p, T)\, u\right] dg. \tag{2.28}$$

For the numerical integration, the integral is approximated by a finite sum:

$$t_{\Delta\lambda}(u, p, T) \approx \sum_{i=1}^{N} w_i \exp\left[-k_i(p, T)\, u\right], \tag{2.29}$$

with quadrature weights $w_i$ and quadrature points $g_i$. For a function shaped as in Figure 2.8b, far fewer quadrature points are needed than for one shaped as in Figure 2.8a, and it is hence computationally more efficient.

The method of reordering the absorption coefficients is called the k-distribution method. Since the ks are simply reordered and no approximation is made, it is exact. However, in a typical atmospheric layer pressure and temperature will vary with height, and since the ks are dependent on $p$ and $T$, (2.28) is only exact for vertically homogeneous layers (homogeneous mass paths).
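The reordering argument can be illustrated with a small numerical experiment. The sketch below builds a purely synthetic absorption spectrum (a few Lorentzian lines on a weak continuum; line positions, strengths and widths are arbitrary and hypothetical), computes the spectral-mean transmittance by brute-force averaging, and compares it with a sum over a handful of equally weighted midpoint samples of the sorted coefficients $k(g)$. The equal-weight midpoint rule is an illustrative choice made here and is not the optimised quadrature of the Kratz routines.

```python
import math


def synthetic_spectrum(n_points=2000):
    """Hypothetical absorption spectrum: Lorentzian lines on a weak continuum."""
    centres = [0.13, 0.29, 0.41, 0.58, 0.77, 0.90]  # arbitrary line positions
    strengths = [4.0, 1.0, 8.0, 2.0, 6.0, 0.5]      # arbitrary line strengths
    half_width = 0.004
    k = []
    for i in range(n_points):
        x = (i + 0.5) / n_points                    # normalised wavelength
        lines = sum(s * half_width ** 2 / ((x - c) ** 2 + half_width ** 2)
                    for c, s in zip(centres, strengths))
        k.append(0.01 + lines)
    return k


def mean_transmittance(k_values, u):
    """Spectral-mean transmittance by averaging over every spectral point."""
    return sum(math.exp(-k * u) for k in k_values) / len(k_values)


def k_distribution_transmittance(k_values, u, n_quad):
    """Approximate the g-space integral with n_quad midpoint samples of k(g)."""
    k_sorted = sorted(k_values)                     # k(g) on a uniform g grid
    n = len(k_sorted)
    total = 0.0
    for j in range(n_quad):
        g_mid = (j + 0.5) / n_quad                  # quadrature point in g
        total += math.exp(-k_sorted[int(g_mid * n)] * u) / n_quad
    return total


k_values = synthetic_spectrum()
u = 1.0  # hypothetical mass path in units matching k

t_lbl = mean_transmittance(k_values, u)
t_sorted = mean_transmittance(sorted(k_values), u)
t_coarse = k_distribution_transmittance(k_values, u, n_quad=20)

print(f"all points, original order: {t_lbl:.6f}")
print(f"all points, sorted order:   {t_sorted:.6f}")
print(f"20 points in g space:       {t_coarse:.6f}")
```

Sorting leaves the full-resolution result unchanged, which is the sense in which the k-distribution method is exact for a homogeneous path. The gain comes from the smooth, monotonic shape of $k(g)$, which a small number of quadrature points can follow.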
Nevertheless, the same equation is used to approximate the transmissivity of inhomogeneous mass paths. The assumption that is made in the correlated-k method is that for any $p$, $T$ found in the atmosphere, any particular absorption coefficient $k(p,T)$ will always have the same cumulative probability. Thus, the absorption coefficients at different pressure and temperature levels are correlated with each other. In other words, the shape of $k(g)$ is assumed to be constant, but the magnitudes of the ks are scaled with $p$ and $T$ (e.g. Petty, 2006).

Frequently the absorption within a spectral interval arises from overlapping lines of two or more molecular species. The standard procedure in this situation (Kratz, 1995) is to assume the spectral features of the species to be uncorrelated. The mean transmittance for a combination of two constituents with mass paths $u_1$ and $u_2$ can then be expressed as a double summation:

\[ t_{\Delta\lambda}(u_1,u_2,p,T) \approx \sum_{i=1}^{N} \sum_{j=1}^{M} \left\{ w_i \exp\left[-k_i(p,T)\,u_1\right] \right\} \left\{ w_j \exp\left[-k_j(p,T)\,u_2\right] \right\}. \tag{2.30} \]

Kratz (1995) computed optimised k-coefficients for the five AVHRR channels. He later applied the same technique to several of the MODIS channels (Kratz, 2001). Table 2.1 lists the MODIS channels centred at 3.7, 8.5, 11 and 12 µm, together with the spectral intervals taken into consideration in the correlated-k routines, and the atmospheric constituents that contribute to the absorption in each channel. The number of k-coefficients required for the integration is as low as one to five, depending on the gas.

Table 2.1: Spectral intervals (as considered in the Kratz (2001) correlated-k routines) of the four MODIS channels implemented in the forward model, the atmospheric constituents whose absorptivity is accounted for, and the number of k-coefficients required for the spectral integration of each species.
channel   centre wavelength (µm)   wavelength interval (µm)   considered gases (no. of ks)
20        3.748                    3.656 - 3.839              H2O (4), CO2 (1), CH4 (1)
29        8.553                    8.333 - 8.772              H2O (5), O3 (1), CH4 (1), N2O (2)
31        11.010                   10.526 - 11.494            H2O (5), CO2 (1)
32        11.920                   11.494 - 12.346            H2O (5), CO2 (1)

The Kratz (2001) MODIS routines were not part of libRadtran at the time the work on this thesis was conducted. I thus implemented the corresponding functions into the model in order to compute the spectral-mean intensities for the desired channels.

2.5 The Overlying Atmosphere

Figure 2.7 showed that most of the absorption above cloud top is due to water vapour. As discussed in Chapter 1, marine Sc often appear under conditions of strong subsidence in subtropical regions. This generally implies a low water vapour content of the free atmosphere. Nevertheless, it should be investigated how large the impact of the existing water vapour is and how it can be accounted for in the forward model.

The way the overlying atmosphere (referred to below as OA) is treated in the existing literature varies, but most studies (Perez et al., 2000; Gonzalez et al., 2002; Cerdena et al., 2007) ignore absorption above cloud top, arguing that the subsiding air is mostly dry and that most of the atmospheric water vapour is contained in the boundary layer air, so that its effects are already contained in the observed clear sky brightness temperatures (which are used to fix parameters of the cloud model, cf. section 1.5.2).[21] The other possibility (Perez et al., 2002) is to compute the bulk absorption effects from a given water vapour path observed from an independent source (e.g. microwave soundings or reanalyses of weather forecast models).
In order to estimate the impact of the OA for my retrieval scene, I computed absorption and emission effects from radiosonde water vapour and temperature profiles measured by the closest operational meteorological sounding station in San Diego.[22] Two soundings were available daily; the nighttime soundings of July 11 (RF02) and July 13 (RF03) are plotted in Figure 2.9. The moisture profiles (middle panels) in fact show a very dry atmosphere above the subsidence inversion, especially for July 11. Using adapted versions of the Kratz (2001) correlated-k routines, I computed emissivity and transmissivity for each layer defined in the sounding data (left and middle panels), as well as the profile of the upwelling radiances (right panels). Since the scattering effects of gas molecules are negligible for infrared wavelengths (Petty, 2006), scattering was not considered. Trace gas concentrations of 350 ppm for CO2, 1.75 ppm for CH4 and 0.31 ppm for N2O were assumed for the computations, and the O3 profile was taken from the US standard atmosphere compiled by Anderson et al. (1986).

[21] This, of course, assumes that the boundary layer water vapour path of the utilised cloud free pixels equals that of the observed cloudy pixels.
[22] The data are freely available at http://weather.uwyo.edu/upperair/sounding.html.

Figure 2.9: Atmospheric soundings from San Diego from (top) July 11, 2001, and (bottom) July 13, 2001. Shown are [left] temperature profile (thick) and corresponding emissivity of the atmosphere in the wavenumber ranges of the four MODIS channels 20, 29, 31 and 32; [middle] humidity (thick) and transmissivity for the channels; and [right] radiance profiles. No scattering has been considered for the computation of the radiances. Channels are marked as solid: channel 20, dashed: channel 29, dash-dotted: channel 31, dotted: channel 32. Channel 20 radiances and emissivities have been multiplied by $10^2$ for better display (bottom scale; top scale for the remaining channels). See text for more details.

For all channels, the transmissivity of the dry atmosphere encountered on July 11 is close to 1 km$^{-1}$ at all layers. Channel 29 is influenced most by the OA. On July 13, there is a layer of increased humidity between 800 hPa and 600 hPa, where the layer transmissivity varies between 0.9 and 1 km$^{-1}$ for channels 20, 31 and 32, and between 0.8 and 1 km$^{-1}$ for channel 29. However, the attenuation of radiation by absorption is partly countered by emission, so that the total differences between the radiances at the inversion and the top of the atmosphere are small. On both days, the intensities changed by less than 2% for channel 20, less than 3.5% for channel 29 and less than 0.5% for channels 31 and 32.

While these effects do not contribute much to the observed TOA radiances, they represent a potential and, if known, unnecessary source of uncertainty in the retrieval. Hence, I decided to account for atmospheric transmission and emission above cloud top in the forward model.
The simplest, but also most computer time consuming approach to account for the OA in the forward model is to feed the entire sounding profile into the radiative transfer model. However, it is undesirable to make the LUT dependent on a specific atmosphere. Also, at most locations over the ocean, no radiosonde soundings are available in the vicinity of the retrieval scene. Rather, it is often possible to infer estimates of total LWP from an independent source as mentioned above. Alternatively, it might be feasible to add a total transmissivity $t^*$ and average emitted intensity $B^*$ of the OA as retrievable parameters to the algorithm. In order to keep my method flexible, I decided to account for OA effects by adding these two parameters to the forward model. Using this approach, the cloud top radiances can be computed independently from the OA, the impact of which can be added afterwards. In my case, $t^*$ and $B^*$ can easily be computed from the available atmospheric soundings.

In the absence of scattering, the RTE (2.8) can be written as Schwarzschild's equation:

\[ I^\uparrow(\infty) = I^\uparrow(0)\,t^* + \int_0^\infty B(z)\,W^\uparrow(z)\,dz, \tag{2.31} \]

where $I^\uparrow(\infty)$ denotes the upwelling intensity at the top of the atmosphere, $I^\uparrow(0)$ the upwelling intensity at cloud top (the bottom of the OA), $t^* = t(0,\infty)$, and $W^\uparrow(z)$ the weighting function, defined by

\[ W^\uparrow(z) = \frac{dt(z,\infty)}{dz} = \frac{\beta_a(z)}{\mu}\,t(z,\infty). \tag{2.32} \]

By introducing the weighted average Planck function (Petty, 2006)

\[ B^* = \frac{1}{1 - t^*} \int_0^\infty B(z)\,W^\uparrow(z)\,dz \tag{2.33} \]

and computing the total transmittance as the product of all individual layer transmissivities, the TOA intensity can be conveniently written as a function of cloud top radiance, $t^*$ and $B^*$:

\[ I^\uparrow(\infty) = I^\uparrow(0)\,t^* + B^*\,(1 - t^*). \tag{2.34} \]

In the above equations, the OA is effectively treated as a single homogeneous layer with transmittance $t^*$ and emitted intensity $B^*$.
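As a minimal sketch of this single-layer treatment, the Python fragment below propagates a cloud top intensity through a few isothermal, non-scattering layers and then collapses the result into the two parameters of (2.34). All numerical values and function names are hypothetical and chosen for illustration only; the sketch assumes that the layers are ordered from cloud top upwards and that the total transmittance is smaller than one.

```python
def total_transmittance(layer_transmittances):
    """t*: product of the individual layer transmittances."""
    t_star = 1.0
    for t in layer_transmittances:
        t_star *= t
    return t_star


def toa_intensity_layers(i_cloud_top, layer_transmittances, layer_planck):
    """Propagate the cloud top intensity upwards layer by layer (no scattering)."""
    intensity = i_cloud_top
    for t, b in zip(layer_transmittances, layer_planck):
        intensity = intensity * t + b * (1.0 - t)   # attenuation plus emission
    return intensity


def effective_emission(i_cloud_top, layer_transmittances, layer_planck):
    """B*: emitted intensity of the equivalent single layer, solved from Eq. (2.34).

    Requires t* < 1, otherwise the layer neither absorbs nor emits.
    """
    t_star = total_transmittance(layer_transmittances)
    i_toa = toa_intensity_layers(i_cloud_top, layer_transmittances, layer_planck)
    return (i_toa - i_cloud_top * t_star) / (1.0 - t_star)


# Hypothetical values for one channel (arbitrary but consistent intensity units)
layer_t = [0.97, 0.92, 0.99]       # layer transmittances, cloud top upwards
layer_b = [0.050, 0.046, 0.041]    # Planck intensities at the layer temperatures
i_ct = 0.055                       # upwelling intensity at cloud top

t_star = total_transmittance(layer_t)
b_star = effective_emission(i_ct, layer_t, layer_b)
i_toa = i_ct * t_star + b_star * (1.0 - t_star)     # Eq. (2.34)
print(t_star, b_star, i_toa)
```

Because the emission term does not depend on the cloud top intensity, the same pair of values can be reused for every cloud in the LUT, which is the reason the OA can be added after the cloud top radiances have been computed.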
Unfortunately, both parameters are wavelength dependent, so that they have to be computed for each channel. While this poses no constraint if the atmospheric profiles are known in advance, it adds additional parameters to the retrieval if they were to be inferred from the satellite observations. However, even in this case, it seems feasible to map $t^*$ and $B^*$ of one channel to all other channels employed in the algorithm. This idea is left for future work.

2.6 Forward Model

2.6.1 Radiative Transfer Model: libRadtran

The radiative transfer package libRadtran, described by Mayer and Kylling (2005), contains a number of tools for radiative transfer calculations in the earth's atmosphere. Particularly important for this thesis is its ability to read in arbitrary atmospheric and cloud profiles, as well as precomputed phase functions, and to solve the RTE using DISORT.[23] The package comes with the Kratz (1995) AVHRR correlated-k routines implemented, but, as mentioned above, I had to add the corresponding code for the MODIS channels (Kratz, 2001).

Clouds are included in the radiative transfer calculations by specifying vertical profiles of height, liquid water content and effective radius. Phase functions are read in from tables produced by the Mie code included in SHDOM (Evans, 1998), the same that was used to produce the examples in Figure 2.3. In Mie theory, the solution to the electromagnetic wave equation (cf. section 2.2) is expressed as a finite series of Legendre polynomials (e.g. Petty, 2006), so that the phase functions are listed in the Mie tables in terms of Legendre coefficients. These coefficients are directly used by DISORT for the numerical solution of (2.8). The atmosphere is divided into discrete layers, for each of which the radiative transfer problem is solved.
Absorption coefficients and single scatter albedos are computed from the Kratz (2001) routines, while the phase function is assumed to be constant over the spectral interval of the channel. DISORT is called once for each k-coefficient, and the resulting intensities are summed using the weights $w_i$ in (2.29) to compute the channel integrated intensities.

[23] The complete software package, however, is capable of much more; see Mayer and Kylling (2005) for details.

2.6.2 Design of the Forward Model

I implemented a system based on Python[24] scripts that automatically computes a number of adiabatic cloud profiles and calls libRadtran in order to generate a LUT. The cloud profiles can be computed from either randomly chosen cloud properties drawn from a given interval or from discrete values. The forward model requires the four variable input parameters effective radius $r_{eff}$, total droplet number concentration $N$, cloud top pressure $p_{ct}$ and cloud top temperature $T_{ct}$ (either intervals or discrete values), as well as the fixed inputs surface pressure $p_{sfc}$, gamma distribution shape parameter $\nu_{gam}$ for computing the Mie properties of the cloud droplets, k-value for the conversion of $r_{vol}$ to $r_{eff}$,[25] and the number of datapoints that should be computed (if discrete values of the first four parameters are given, this input is not applicable). The system is able to run in parallel mode, so that the generation of large LUTs can be performed in reasonable time. For instance, the generation of 96,000 datapoints requires about 8 hours on a cluster of 32 2-GHz processors.

After the input parameters are read in, liquid water content $q_l$ is computed from $r_{eff}$ and $N$ using (2.18) and (2.19).
The specific humidity $q_v$ is obtained from (2.16) by using $T_{ct}$ and $p_{ct}$. Since libRadtran uses height as a vertical coordinate, the hydrostatic equation (e.g. Curry and Webster, 1999) is used to relate pressure and height in the cloud model, so that $T_{sfc}$ can be computed from (2.12). The atmospheric and cloud variables are then integrated using a defined step size (e.g. 0.2 hPa) from the surface to cloud top. Both cloud ($z$, $r_{eff}$, $q_l$) and atmospheric ($z$, $p$, $T$, specific humidity $q_v$ and trace gases) profiles are passed to libRadtran, where for this study, the same trace gas concentrations were used as for the overlying atmosphere in section 2.5. libRadtran is then run to compute the cloud top radiances. The cloud visible optical thickness is computed using (2.25), with $Q_{ext} = 2$.

If desired, the effects of the OA can be accounted for using the procedure described in section 2.5. Precomputed $t^*$ and $B^*$ as well as radiosonde soundings can be used. As a last step, the TOA intensities are converted to brightness temperatures using the inverse of (1.18). Since the Planck function also varies over the spectral intervals of the channels, the channel-mean BT can either be approximated at the channel-centre wavelength, or a correction formula suggested by van Delst (2005) can be used in order to account for the polychromaticity of the channels.

[24] The free programming language Python can be obtained at http://www.python.org/.
[25] $\nu_{gam}$ and $k$ are held separate in order to facilitate sensitivity studies and to avoid time consuming re-computations of the Mie properties for small changes of $k$.

For all computations, the satellite is assumed to be directly above the cloud (i.e. TOA BTs are computed for nadir view); no satellite zenith angle is considered.
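The last step, the conversion of an intensity to a brightness temperature at the channel-centre wavelength, can be sketched as follows. The sketch assumes that (1.18) is the monochromatic Planck function in its wavelength form; it does not include the polychromatic correction mentioned above, and the function names are hypothetical.

```python
import math

H = 6.62607015e-34    # Planck constant [J s]
C = 2.99792458e8      # speed of light [m s^-1]
KB = 1.380649e-23     # Boltzmann constant [J K^-1]


def planck(wavelength, temperature):
    """Monochromatic Planck function B_lambda(T) [W m^-2 sr^-1 m^-1]."""
    a = 2.0 * H * C ** 2 / wavelength ** 5
    return a / math.expm1(H * C / (wavelength * KB * temperature))


def brightness_temperature(wavelength, intensity):
    """Invert the Planck function for temperature at a single wavelength."""
    a = 2.0 * H * C ** 2 / wavelength ** 5
    return H * C / (wavelength * KB * math.log1p(a / intensity))


# Round trip at the centre wavelength of MODIS channel 31 (11.010 um, Table 2.1)
wavelength = 11.010e-6                    # [m]
intensity = planck(wavelength, 285.0)     # intensity of a 285 K black body
bt = brightness_temperature(wavelength, intensity)
print(bt)
```

In the forward model the input to the inversion would be the channel integrated TOA intensity rather than a black body value; the round trip merely confirms that the two functions are inverses of each other.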
Figure 2.10 shows two vertical profiles of channel 20 (3.7 µm), 31 (11 µm) and 32 (12 µm) brightness temperatures for an optically thin ($\tau \approx 3.8$) and an optically thick ($\tau \approx 43$) cloud. The plots illustrate how the attenuation of radiation within the cloud layer is simulated by the forward model. As discussed in Chapter 1, the surface signal still influences the cloud top radiances for the thin cloud case, while for the other case, the cloud emits as a black body in channels 31 and 32. The point at which the BTs of all three channels coincide with the actual temperature is clearly visible. At this point, all photons propagating upwards through the cloud that entered the cloud at its bottom are extinguished, and the cloud emits as a black body with its actual temperature $T$. The intensities observed at cloud top are only determined by emission in this upper part of the cloud (cf. section 1.5.2).

Figure 2.11 shows the BTD(3.7-11) and BTD(11-12) signals, each with the 11 µm BT as reference. As in Perez et al. (2000), cloud top temperature is fixed at 285 K. Since in my forward model, cloud top pressure is a fixed parameter as well (in the given case at 900 hPa), the surface temperatures as computed from the adiabatic lapse rates depend on the cloud geometrical thickness and hence differ for the individual clouds. Thus, the curves do not converge in a single point on the right side, as they do in Figure 1.9. My model is able to reproduce the relationships found by Baum et al. (1994) and Perez et al. (2000). As expected and discussed in Chapter 1, the changes in BT with changing optical thickness become smaller for larger $\tau$. In contrast to the VU cloud model, the maximum optical thickness in the AS model depends on $r_{eff}$. Consequently, clouds with small effective radii cannot become as optically thick as in the Baum et al.
(1994) and Perez et al. (2000) studies, where both parameters were independent (cf. Figure 1.9). As noted in Chapter 1, if $T_{ct}$ and $p_{ct}$ are known, $r_{eff}$ and $\tau$ could in principle already be retrieved manually from the plots in Figure 2.11.

The characteristic of my AS model that surface temperature is obtained from the adiabatic lapse rates is particularly pronounced in the case of constant $r_{eff}$, $N$ and $T$ but varying $p_{ct}$. As shown in Figure 2.12, changes in cloud top pressure can cause large differences in the BTD signals. In Chapter 4, I will fix cloud top pressure in the LUT in order to simplify the retrieval problem. It will hence be important to evaluate the sensitivity of the retrieval to changes in cloud top pressure (or more accurately, to the pressure difference between cloud top and sea surface) in order to estimate the impact of slightly varying cloud top heights in the scene on the retrieval accuracy. I will come back to this topic in section 4.4.

Figure 2.10: Vertical brightness temperature profiles through a thin cloud (left) and a thick cloud (right). Shown are the brightness temperatures of MODIS channel 20 (3.7 µm; thin solid line), channel 31 (11 µm; thin dashed line) and channel 32 (12 µm; thin dash-dotted line), as well as temperature ($T$) and potential temperature ($\theta$). For the thick cloud, the brightness temperatures of channels 31 and 32 are similar to the temperature of the cloud (i.e. the cloud emits as a black body in this wavelength range), whereas for the thin cloud, the surface temperature still influences the signal.
Figure 2.11: Droplet size and cloud optical thickness influence radiation at the different MODIS channels to a different extent. Shown is the dependence of brightness temperature (BT) and BT difference (BTD) on varying effective radius and optical thickness for a cloud top temperature of 285 K and a cloud top pressure of 900 hPa. The horizontal axis of both panels is the brightness temperature of channel 31 (11 µm) [K]. Symbols from left to right: optical thickness of 0.5, 1, 1.5, 2, 3, 5, 8. Note that clouds with a small effective radius cannot become as thick as clouds with a larger $r_{eff}$ in the adiabatic cloud model (compare to the vertically uniform model of Perez et al. (2000)).

Figure 2.12: The same as in Figure 2.11, but for different cloud top pressures (solid: 940 hPa, dashed: 920 hPa, dotted: 900 hPa). In the adiabatic model, a different cloud top pressure with a fixed cloud top temperature results in a different surface temperature. Hence, the left convergence points in the figures (surface temperature) are warmer for higher cloud tops.

Chapter 3
Nonlinear Regression with Neural Networks and Uncertainty Estimation

The forward model I developed in the previous chapter maps given cloud properties to satellite observations computed by the radiative transfer model. From a set of such forward computations, the unknown inverse function has to be inferred in order to determine the cloud parameters from the observed satellite data. This poses a nonlinear multiple regression problem, which, as proposed in Chapter 1, I will tackle with artificial neural networks and the Bayesian framework suggested by MacKay (1992a,b) and Aires (2004); Aires et al. (2004a,b).
My objective for this chapter is to describe this approach and to discuss its applicability to my remote sensing problem. For simplicity, the original MacKay (1992a,b) publications, as well as the book by Bishop (1995) and the review paper by MacKay (1995), only discuss networks with one output variable and hence only one output uncertainty. For multidimensional mappings, MacKay (1995) suggests the use of full covariance matrices to describe the output uncertainties, but does not elaborate on this idea. Aires (2004); Aires et al. (2004a,b) point out the importance of uncertainty estimates for multidimensional mappings that arise in atmospheric inverse problems, and expand the original MacKay (1992a,b) method to networks having multiple input and output variables.

In this chapter, I will first formulate the regression problem in a general context (section 3.1). This introduction (mainly based on the book by Bishop (1995)) is useful in order to understand the problems that arise in neural network training for which the Bayesian framework provides some remedy. I will then derive the theoretical foundation of how the uncertainty in the network prediction and the Jacobian can be estimated (sections 3.2 - 3.5). The derivation will be illustrated by an application of the method to a simple problem, which will demonstrate important benefits but also highlight problems. I implemented the Aires method by extending the NETLAB toolbox by Nabney (2002), a set of routines implemented in Matlab.[26] Some information about the implementation is given in section 3.6; in section 3.7 I discuss the usefulness of the method for my remote sensing problem.

3.1 The Nonlinear Regression Problem

Consider the problem of nonlinear regression on noisy data, for which MacKay and Aires et al.
derive their methods. The data consist of a dataset $S = \{X, D\}$ with input data $X = \{\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^N\}$ and target (or observed) data $D = \{\mathbf{t}^1, \mathbf{t}^2, \ldots, \mathbf{t}^N\}$. $\mathbf{x}^n$ and $\mathbf{t}^n$ denote vectors that contain all input and target variables, respectively. We assume the inputs $X$ to be exact, while the target data $D$ are noisy. Strictly speaking, for the inverse problem, we deal with the opposite case: the input data (the satellite measurements) are noisy, whereas the target data (the inputs to the forward model) are known exactly at the time the dataset $S$ is generated. I will come back to this issue shortly.

3.1.1 Error Functions and Overfitting

Once the dataset $S$ has been observed, we wish to gain information about the function or relation which generated the targets from the inputs. Since the target data are noisy, they are composed of two parts: the underlying generator, representing the physical relationship we wish to capture, and the noise. Let $\mathcal{M}$ denote a statistical model for $S$. $\mathcal{M}$ will be a neural network shortly, but first consider a polynomial of order $K$:

\[ y(x) = w_0 + w_1 x + \ldots + w_K x^K = \sum_{j=0}^{K} w_j x^j. \tag{3.1} \]

This model maps an input variable $x$ to an output variable $y$, the prediction. The exact form of the model function depends on the order of the polynomial and the values of the parameters $w_j$. Hence, following Bishop (1995), I write $y = y(x; \mathbf{w})$ (where all parameters $w_j$ are grouped into one parameter vector $\mathbf{w}$).

A standard way to find the best possible model architecture for a given $S$ is to consider the errors between the model predictions $y^n$ and the desired values, the targets $t^n$, for the $N$ data pairs in $S$. In order to obtain a good fit, the differences $y^n - t^n$ should be as small as possible, which can be achieved by minimising the sum-of-squares error (cf. the cost function (1.22))

\[ E = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x^n; \mathbf{w}) - t^n \right\}^2. \tag{3.2} \]

[26] http://www.mathworks.com/

Figure 3.1: Left: a dataset $S$, from which we would like to infer information about the underlying generator, i.e. the function from which the values were generated. Right: the sine function from which the data were produced. However, if the shape of the generating function is not known, it is difficult to judge different models based only on their sum-of-squares error (compare Figure 3.2).

However, will the model with the smallest error $E$ best represent the underlying generator of the data? In fact, this often is not the case. The phenomenon is known as overfitting in the literature, and describes the problem of fitting the noise rather than the generating function. Its major effect is that models that fit the training data $S$ too well do not generalise well. In this case the model, being presented with an input value which is not part of the training dataset, predicts a value far from the desired one.

Figures 3.1 and 3.2 illustrate the problem. In the left panel of Figure 3.1, eleven datapoints are plotted, which are generated by adding random noise to the sine function depicted in the right panel. Figure 3.2 shows two polynomials fitted to the data, in the left panel a cubic polynomial, in the right panel a polynomial of order ten. Since the underlying generator of the data is known, we easily judge that the cubic polynomial is a better representation of the original sine function. The sum-of-squares error, however, is much smaller for the higher order polynomial, since it perfectly fits the training data. If we knew nothing about the original function, how could we judge which model is best?[27]

The same problem arises if neural networks are used instead of polynomials.
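The polynomial example can be reproduced with a few lines of Python. The sketch below is an illustration under assumptions of my own choosing: eleven equally spaced inputs on [0, 1], one period of a sine as generator, Gaussian noise with a standard deviation of 0.2 and a fixed random seed; the helper names are hypothetical. The cubic is fitted by least squares via the normal equations, while the polynomial of order ten is evaluated in its Lagrange form, since it passes exactly through all eleven points.

```python
import math
import random


def fit_polynomial(xs, ts, degree):
    """Least-squares coefficients w_0..w_degree of Eq. (3.1) via the normal equations."""
    n = degree + 1
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(t * x ** r for x, t in zip(xs, ts)) for r in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        w[r] = (b[r] - sum(a[r][c] * w[c] for c in range(r + 1, n))) / a[r][r]
    return w


def evaluate(w, x):
    """Evaluate the polynomial of Eq. (3.1) with coefficients w at x."""
    return sum(wj * x ** j for j, wj in enumerate(w))


def interpolate(xs, ts, x):
    """Lagrange form of the polynomial of order N-1 through all N datapoints."""
    total = 0.0
    for j, (xj, tj) in enumerate(zip(xs, ts)):
        basis = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        total += tj * basis
    return total


def sum_of_squares(predictions, targets):
    """Sum-of-squares error, Eq. (3.2)."""
    return 0.5 * sum((y - t) ** 2 for y, t in zip(predictions, targets))


random.seed(3)
xs = [i / 10.0 for i in range(11)]                                   # training inputs
ts = [math.sin(2.0 * math.pi * x) + random.gauss(0.0, 0.2) for x in xs]
xv = [(i + 0.5) / 10.0 for i in range(10)]                           # unseen inputs
tv = [math.sin(2.0 * math.pi * x) for x in xv]                       # noise-free generator

w3 = fit_polynomial(xs, ts, 3)
e_train_3 = sum_of_squares([evaluate(w3, x) for x in xs], ts)
e_train_10 = sum_of_squares([interpolate(xs, ts, x) for x in xs], ts)
e_new_3 = sum_of_squares([evaluate(w3, x) for x in xv], tv)
e_new_10 = sum_of_squares([interpolate(xs, ts, x) for x in xv], tv)
print(e_train_3, e_train_10, e_new_3, e_new_10)
```

The training error of the order-ten polynomial vanishes by construction, so it always beats the cubic on the training data; how the two compare on the unseen inputs depends on the noise realisation and is what the printed values are meant to show.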
For simplicity of the derivation and implementation, I restrict myself to a two-layer perceptron (i.e. one layer of hidden units) as used by Aires et al. As discussed by, for instance, Bishop (1995), this network is already capable of approximating arbitrary functions.

[27] Concluding that less complex models (i.e. lower degree) are better is also misleading; consider, for example, a linear polynomial for the given case.

Figure 3.2: Left: a cubic polynomial (solid line) fitted to the data from Figure 3.1, together with the generating sine function (dashed). Right: a polynomial of order ten, also plotted with the generating sine function. It is easily seen by eye that the cubic polynomial represents the sine function better; however, the sum-of-squares error is smaller for the right polynomial. This phenomenon is known as overfitting.

Mathematically, an ANN such as the one depicted in Figure 1.10 computes the output values from the following equations. The inputs $x_i$ are weighted by the parameters of the first layer, then summed for each hidden neuron and transformed by an activation function $g(\cdot)$ to give the hidden values

\[ z_j = g\left( \sum_{i=0}^{d} w_{ji}\,x_i \right). \tag{3.3} \]

The same equation is applied to the hidden values, which yields the output values

\[ y_k = \tilde{g}\left( \sum_{j=0}^{M} w_{kj}\,z_j \right), \tag{3.4} \]

where $\tilde{g}(\cdot)$ denotes the activation of the output units. In the architecture considered here, $g(\cdot)$ is a sigmoidal function, whereas $\tilde{g}(\cdot)$ is a simple linear function (necessary to obtain output values other than zero and one). Combining (3.3) and (3.4) gives the complete network function

\[ y_k = \tilde{g}\left( \sum_{j=0}^{M} w_{kj}\, g\left( \sum_{i=0}^{d} w_{ji}\,x_i \right) \right). \tag{3.5} \]

As for the polynomial in (3.1), the goal is to find a set of weights $\mathbf{w}$ so that (3.5) becomes the best possible representation of the generator of the dataset $S$.
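The network function (3.5) is compact enough to be written out directly. The following Python sketch assumes tanh as the sigmoidal hidden activation and the identity as the linear output activation; the function name, the weight layout (bias first in every row, multiplying a constant input of one) and the example weights are hypothetical and serve only to illustrate the structure of the mapping.

```python
import math


def forward(x, w_hidden, w_output):
    """Evaluate Eq. (3.5) for one input vector x.

    w_hidden[j] holds the first-layer weights of hidden unit j and
    w_output[k] the second-layer weights of output k. In both cases the
    first entry is the bias, i.e. the weight of the constant x_0 = z_0 = 1.
    """
    x_ext = [1.0] + list(x)
    # Eq. (3.3): weighted sum of the inputs, passed through the sigmoidal g
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x_ext))) for row in w_hidden]
    z_ext = [1.0] + z
    # Eq. (3.4): weighted sum of the hidden values, linear output activation
    return [sum(w * zj for w, zj in zip(row, z_ext)) for row in w_output]


# Hypothetical network: two inputs, three hidden units, two outputs
w_hidden = [[0.1, 0.5, -0.3],
            [0.0, -1.2, 0.8],
            [-0.4, 0.3, 0.3]]
w_output = [[0.2, 1.0, -0.5, 0.7],
            [-0.1, 0.4, 0.4, -0.9]]

y = forward([0.5, -1.0], w_hidden, w_output)
print(y)
```

With this layout a network with d inputs, M hidden units and c outputs has (d + 1) M + (M + 1) c weights, which shows how quickly the dimensionality of the weight space grows with the number of hidden units.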
Also, as in the polynomial example, a trade-off between model complexity and a small error between predicted outputs and target data has to be found in order to find a model that generalises the data well. The problem of finding a model that is neither too complex nor too simple is known as Occam's razor (after William of Occam, 1285-1349; see Bishop (1995)). Assessing the representativeness of the model and controlling its complexity is an important part of assessing the uncertainty inherent in the predictions.

3.1.2 Controlling Model Complexity and Network Training

Finding the optimal set of weights in (3.5) involves finding the minimum of an error surface in weight space, such as defined by (3.2). Obviously, the number of weights in an ANN architecture rapidly increases with the number of hidden units. Hence, the error surface is of a high dimensionality and typically has local minima. The neural network error function is, similarly to the cost function (1.22) between computed and observed radiances in the retrieval methods described in section 1.5, too complex to be minimised analytically. Instead, several numerical algorithms exist to perform the minimisation. Overviews of commonly used algorithms and their functionality are given by, for instance, Bishop (1995) and Press et al. (2007). In this work, I used the scaled conjugate gradients algorithm, as implemented by Nabney (2002) in the NETLAB toolbox.

Obviously, both the number of adaptive parameters (i.e. weights) in a network architecture and the type of the error function determine the complexity of the error surface.
Common problems arise in specifying the error surface in such a way that the global minimum does not correspond to a state of overfitting, and in actually finding this global minimum: numerical minimisation algorithms often get stuck in local minima, depending on the initial point where the search for the global minimum started.

Bishop (1995) points out two principal approaches to controlling the complexity of the model. The first, called structural stabilisation, involves controlling the number of adaptive parameters in the network. The second, regularisation, adds a penalty term to the error function (3.2) to counteract overfitting by avoiding strongly fluctuating mappings such as the example in the right panel of Figure 3.2.

The simplest approach to network structure optimisation (and, as Bishop (1995) points out, still the most widely adopted approach in practice) is to perform an exhaustive search through a restricted class of architectures (i.e. varying number of hidden units). Obviously, such a search requires a large computational effort. Nevertheless, due to the lack of easy-to-use alternatives (cf. Bishop, 1995, Chapter 9), I will follow this approach as well.

The optimisation of model complexity is performed with respect to a given training dataset. The simplest approach to comparing the generalisation performance of different network architectures is to evaluate the error function for an independent validation dataset. A common approach is to split the original dataset $S$ into two parts, one used for training and one for validation (typically two-thirds are used for training and one-third for validation). After the models have been compared, the one with the smallest validation error can be selected.
As for controlling the complexity of the model by regularisation, I will show in sections 3.2 and 3.3 that the Bayesian approach provides a natural framework for both regularising the training process and estimating the uncertainty of the network predictions.

3.1.3 Describing Uncertainty

The idea of the Bayesian approach is that instead of finding the single weights vector $w^*$ that minimises an error function $E$ and represents the most probable solution to the regression problem, we compute an entire distribution of weights. By examining this distribution, we can see how "certain" it is that the training process has found the generator of the data; if the distribution is wide, it is uncertain, if it is narrow, it is certain. Since the distribution of the weights is inferred from the training data, it is usually written as

$$p(w|D). \tag{3.6}$$

In order to simplify the notation and to follow the notation of Bishop, the dataset is only denoted by $D$. Implicitly, however, the distribution depends on the entire dataset, including the input values (written as $p(w|D,X)$).²⁸

²⁸ In section 3.2 I will show how so-called prior information about the weights distribution can be inserted into the training process that leads to the weights distribution (3.6). Research showed that weight values close to zero lead to smoother network mappings than large weights (Bishop, 1995, Section 10.1.2). This can be achieved by favouring small weights by using a special prior, a process known as regularisation in the literature. Regularisation is an important tool to control the complexity of the model. However, it does not solve the problem of how many weights and hidden nodes the architecture should contain. In the evidence framework (MacKay, 1992a; Bishop, 1995, Section 10.6), the Bayesian approach provides the tools for comparing different architectures (e.g. different numbers of weights) of networks based on the training data. This topic, however, is outside the scope of my thesis and will not be discussed further.

The certainty in the network outputs is also restricted by the noise on the data. If we were, for example, to model the long-term generator of hourly-averaged windspeed given a dataset S composed of minute-by-minute measurements, there would be intrinsic noise caused by turbulence on the measurements that would not be part of the generator. Obviously, apart from describing this noise with a probability distribution, we cannot make statements about exactly where the value of the generator lies within a noise interval. The noise distribution

$$p(t|x,w) \tag{3.7}$$

gives the probability of observing a specific target $t$, given an input $x$ and the most likely generator function represented by a network with weights $w$.

To come back to the issue that the target data in the forward model are not noisy, this problem in fact turns out to be a flaw in the application of the method to the retrieval problem. However, from the studies of Perez et al. (2000) and Gonzalez et al. (2002), we expect the forward model to contain ambiguities. One can reason that these ambiguities could be interpreted as noise on the target data. Furthermore, the noise in the input data (the satellite observations) can be accounted for with the network Jacobian. This issue will be further discussed in the remainder of this thesis.

Given the two sources of uncertainty described by the weights distribution and the noise distribution, the total uncertainty of the network predictions can be calculated from

$$p(t|x,D) = \int p(t|x,w)\,p(w|D)\,dw, \tag{3.8}$$

where the noise distribution has been integrated over all likely generators $w$.
This distribution describes the probability that a target $t$ will be observed if an input vector $x$ is given and the general behaviour of the target data is specified by the training dataset $D$.

When I first studied the derivation of the network outputs distribution (3.8), it appeared confusing to me that the uncertainty in the network outputs, denoted by $y$, is expressed as a distribution of the targets $t$. However, since we have no information about the generator of the data except the dataset S, it is best to make predictions about new data that would likely be observed for a new input. The network prediction $y$ describes the mean of the posterior target distribution, and hence the location of the most probable data value to be observed. The shape and width of $p(t|x,w)$, however, describe the certainty that a data value we would observe in reality would actually be the most probable value - hence, the distribution width can be interpreted directly as error bars on the network prediction $y$. The weights distribution $p(w|D)$ additionally describes our degree of belief in the ability of the network $w$ to model the generator, thereby further widening the distribution of possibly observed data values.

The network Jacobian represents another important tool to analyse the model. It is defined as the derivative of the network outputs with respect to its inputs, $\partial y_k / \partial x_i$. For instance, the Jacobian provides a mechanism to estimate the impact of errors associated with the inputs on the output errors. Furthermore, the Jacobian provides a good tool to answer questions concerning the complexity and adequacy of the mapping - to what extent does the output change when the input is changed? Is this the behaviour we expect the solution to have?
Is the model sensitive to the inputs we expect it to be sensitive to, given our knowledge of the problem? This information can make the ANN much more transparent than a "black-box network" would be.

Using the weights distribution, it is also possible to estimate a Jacobian distribution. As mentioned in Chapter 1, such an estimate of the variability in the Jacobian is a useful indicator of how ill-conditioned the regression problem in question is. Aires et al. (2004b) point out that inverse problems are often ill-posed; similar output statistics of the network can be obtained by a variety of different network parameters (i.e. weights). A possible reason is redundant information in the input variables, caused, for instance, by correlated inputs. Imagine, for example, a network with one output and two highly correlated inputs. The output might physically be dependent on the magnitude of one of the two inputs as well as their difference. In this case, the training algorithm cannot decide which input carries the information, and separate training runs could yield different results in terms of the ANN weights. The Jacobian would be fundamentally different between the networks. In one case the output variable could be mainly sensitive to the first input and to a lesser extent to the second, while in the other case it could get most of its information from the second input and represent the dependence on the input difference with an opposite-sign dependence on the other input. The output statistics for the training data might be similarly good for all networks, but the ability to generalise to new inputs could vary considerably.
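The ambiguity caused by correlated inputs can be made concrete with a deliberately simple stand-in. In the sketch below, two linear mappings replace the two trained networks (an assumption made purely for brevity; for a linear map the Jacobian is just the weight vector), and the weight vectors `w_a` and `w_b` are hypothetical values chosen by hand, not the result of any training run.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two highly correlated inputs: x2 equals x1 plus a small perturbation.
x1 = rng.uniform(0.0, 1.0, size=500)
x2 = x1 + rng.normal(0.0, 0.01, size=x1.size)
X = np.column_stack([x1, x2])

# Two hypothetical "trained" linear mappings y = X @ w.
w_a = np.array([1.0, 0.0])    # sensitive to the first input only
w_b = np.array([-1.0, 2.0])   # mostly the second input, opposite sign on the first

y_a = X @ w_a
y_b = X @ w_b

# The outputs are nearly indistinguishable on this dataset ...
rms_difference = float(np.sqrt(np.mean((y_a - y_b) ** 2)))

# ... although the Jacobians dy/dx (here simply the weights) differ strongly.
jacobian_difference = float(np.linalg.norm(w_a - w_b))

print(rms_difference, jacobian_difference)
```

On inputs that follow the same correlation the two mappings agree closely, but for a new input with, say, $x_2$ far from $x_1$ they would disagree, which is the generalisation problem described above.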
In particular, if we know little about the problem and are interested in interpreting the network Jacobians as actual physical Jacobians (so that we can learn on which inputs the output depends), such a high variability in the Jacobian is problematic and we should be careful about assuming that the network represented by the most probable weights vector models the real generator. If, however, the variability in the Jacobian is small, we might expect that a "good" mapping has been found.

In the following sections, I will show how the weights distribution, output uncertainties and variability in the Jacobian can be obtained. To illustrate some important aspects, I will use a simple example function - a mapping from two input to two output variables, given by

$$y_1(x_1, x_2) = 4x_1 + \sin(2\pi x_2) \tag{3.9}$$

$$y_2(x_1, x_2) = \cos(2\pi x_1) + x_2. \tag{3.10}$$

A two-layer network with six hidden units has been trained from a dataset generated from these equations by adding normally distributed noise with a standard deviation of 0.1 to output 1 and 0.05 to output 2. I will show examples that originate from two training runs - a long run and a short run - resulting in different posterior weights distributions and consequently different output uncertainties and Jacobian variabilities.

3.2 Distribution of the Neural Network Weights

In this section I will present and in places elaborate on the Aires (2004) derivation of the posterior network weights distribution, $p(w|D)$, and demonstrate how neural network training can be formulated in the Bayesian framework.
3.2.1 Derivation

Using Bayes' theorem, $p(w|D)$ can be written as

$$p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}. \tag{3.11}$$

The density $p(D|w)$ is called the likelihood of the data and indicates how likely the observation of the dataset $D$ is in the light of a given model (here represented by a set of weights). $p(w)$ is called the weights prior. This density represents all information about the weights that is available before the data is observed. Finally, the denominator $p(D)$ is a normalisation factor to ensure that the integral over $p(w|D)$ is one:²⁹

$$p(D) = \int p(D|w)\,p(w)\,dw. \tag{3.12}$$

²⁹ Note that again, the explicit dependence on the input data $X$ has been omitted in the notation.

In order to find the mapping that best represents the generator of the data, the maximum of the posterior distribution $p(w|D)$ has to be found (following Aires, I will write $w^*$ for this maximum). Therefore, expressions for $p(D|w)$ and $p(w)$ are needed.

The first assumption that is made is that the target data are composed of a smooth function $h$ and an additive noise component $\epsilon$ (Bishop, 1995, Chapter 6):

$$t = h(x) + \epsilon. \tag{3.13}$$

The noise $\epsilon$ is assumed to follow a Gaussian distribution, which I write as a multivariate Gaussian with zero mean (zero mean because it is the noise around the generator value):

$$p(\epsilon) = \frac{1}{(2\pi)^{c/2}\,|C_{in}|^{1/2}} \exp\left(-\frac{1}{2}\,\epsilon^T C_{in}^{-1}\,\epsilon\right). \tag{3.14}$$

$C_{in}$ denotes the covariance matrix of this intrinsic noise on the data; $c$ denotes the number of network outputs (cmp. Figure 1.10). The generator function $h(x)$ is the function that we would like to model with the neural network, therefore I replace $h(x) = y(x;w)$.
Combining (3.13) and (3.14) yields the distribution of the target variables:

$$p(t|x,w) = \frac{1}{(2\pi)^{c/2}\,|C_{in}|^{1/2}} \exp\left(-\frac{1}{2}\,(t - y(x;w))^T C_{in}^{-1}\,(t - y(x;w))\right). \tag{3.15}$$

Assuming that all datapoints of the dataset are drawn independently from this distribution, the probability density of the entire dataset becomes

$$p(D|w) = \prod_{n=1}^{N} p(t^n|x^n,w) \tag{3.16}$$

$$= \frac{1}{(2\pi)^{Nc/2}\,|C_{in}|^{N/2}} \exp\left(-\frac{1}{2}\sum_{n=1}^{N} (\epsilon^n)^T A_{in}\,\epsilon^n\right), \tag{3.17}$$

where I have used $\epsilon^n = t^n - y^n$ and replaced $A_{in} = C_{in}^{-1}$. The matrix $A_{in}$ is called a hyperparameter in the context of Bayesian learning, since it is a parameter that controls the distribution of other parameters (the network weights). I will explain its function in the next paragraph. In analogy to the conventional maximum likelihood approach (Bishop, 1995, Chapter 6), the sum in the exponential is called the data error function

$$E_D(w) = \frac{1}{2}\sum_{n=1}^{N} (\epsilon^n)^T A_{in}\,\epsilon^n. \tag{3.18}$$

In order to find an expression for the weights prior $p(w)$, Aires (2004) assumes that the weights follow a Gaussian distribution as well:

$$p(w) = \frac{1}{Z_W} \exp\left(-\frac{1}{2}\,w^T A_r\,w\right) = \frac{1}{Z_W} \exp\left(-E_W(w)\right). \tag{3.19}$$

For convenience and following Aires (2004), I have grouped all normalisation factors into a single constant $Z_W$. The parameter $A_r = C_r^{-1}$ is the inverse of the covariance matrix $C_r$ of the weights and the second hyperparameter that occurs in the context of Bayesian neural network learning. Analogous to the data error function, the weights error function is defined as

$$E_W(w) = \frac{1}{2}\,w^T A_r\,w. \tag{3.20}$$

In the form given by (3.19), the prior distribution has zero mean, thereby expressing that the weights are expected to be centred around zero. The use of such a prior weights distribution regularises the training process (cf.
subsection 3.1.2); smooth network mappings, which usually generalise better than strongly fluctuating functions, can be achieved by favouring small weights (Bishop, 1995, Section 10.1.2). If the weights are large, then $E_W$ will be large and $p(w)$ will be small; for small weights $p(w)$ will be large. Hence, by setting $A_r$ correspondingly, we can prefer small weights over large ones. Bishop points out that priors other than Gaussian can be considered as well. In this work, however, I will only consider the Aires (2004) approach.

Given the expressions for $p(D|w)$ and $p(w)$, the posterior distribution of the weights (3.11) becomes

$$p(w|D) = \frac{1}{Z} \exp\left(-\frac{1}{2}\sum_{n=1}^{N} (\epsilon^n)^T A_{in}\,\epsilon^n\right) \exp\left(-\frac{1}{2}\,w^T A_r\,w\right) = \frac{1}{Z} \exp\left(-E_D - E_W\right), \tag{3.21}$$

where I have again used the shorthand notation $Z$ for the normalisation factors. This expression could be maximised; however, since many standard algorithms exist to minimise a function rather than to maximise it (Bishop, 1995, Chapter 7), it is more convenient to minimise the negative logarithm of (3.21). The logarithm removes the exponential, and because of its monotonicity the location of the minimum remains unchanged. Since the normalisation constant $Z$ also does not affect the position of the minimum, we are left with minimising the total error function³⁰

$$E(w) = E_D(w) + E_W(w) = \frac{1}{2}\sum_{n=1}^{N} (\epsilon^n)^T A_{in}\,\epsilon^n + \frac{1}{2}\,w^T A_r\,w. \tag{3.22}$$

Conventional (i.e. non-Bayesian) network learning can be regarded as a special case of this Bayesian framework; if we have no information about the weights prior and assume it to be a uniform distribution ($p(w) = \text{const.}$), then $E_W = 0$.
If we furthermore assume independent output variables (all off-diagonal elements of $A_{in}$ are zero) which have the same variance $\sigma^2$ (all diagonal elements of $A_{in}$ set to $1/\sigma^2$), then $1/\sigma^2$ becomes a constant factor in (3.22) and can be omitted, leaving $E$ to be a sum-of-squares error function as in the example of polynomial curve fitting:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c} \left(y_k(x^n;w) - t_k^n\right)^2. \tag{3.23}$$

A problem with maximising the posterior weights distribution is that the hyperparameters $A_{in}$ and $A_r$ are usually not known beforehand. I will discuss how they can be estimated in section 3.4.

³⁰ This is analogous to the principle of maximum likelihood; Bishop (1995, Chapter 2).

3.2.2 The Meaning of the Hyperparameters

MacKay (1992a, 1995) and Bishop (1995) introduce the Bayesian framework with only scalars $\beta$ and $\alpha$ as hyperparameters instead of the matrices $A_{in}$ and $A_r$. For simplicity, they only use one output, thereby reducing $A_{in}$ to a single element $\beta$. The weights are assumed to be independent, making $A_r$ a diagonal matrix. In the simplest form, the variance of all weights is the same ($1/\alpha$).³¹ The error function (3.22) then becomes

$$E(w) = \frac{\beta}{2}\sum_{n=1}^{N} \left\{y(x^n;w) - t^n\right\}^2 + \frac{\alpha}{2}\sum_{i=1}^{W} w_i^2. \tag{3.24}$$

If the intrinsic noise on the target data is small, then $\beta$ will be large, and small deviations of the network predictions from the target will result in a large "penalty". If the noise is large, $\beta$ will be small and larger differences will be tolerated.

Setting the ratio $\alpha/\beta$ becomes important with respect to the size of the training dataset. While the first term in (3.24) grows with increasing numbers $N$ of datapoints in the dataset, the second term does not. Hence, the ratio of both hyperparameters controls the importance of the weights prior and the size of the dataset from which it will become insignificant.
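The different scaling of the two terms in (3.24) with the dataset size can be checked numerically. The sketch below is a minimal illustration, assuming a linear model $y = Xw$ in place of the network and arbitrary hypothetical values for the weights and hyperparameters; `total_error` is a made-up helper name.

```python
import numpy as np


def total_error(w, X, t, alpha, beta):
    """Return the data term and the weights term of the scalar error (3.24).

    A linear model y = X @ w stands in for the network y(x; w); this is an
    assumption made only to keep the sketch short.
    """
    data_term = 0.5 * beta * float(np.sum((X @ w - t) ** 2))
    prior_term = 0.5 * alpha * float(np.sum(w ** 2))
    return data_term, prior_term


rng = np.random.default_rng(2)
w = np.array([0.5, -1.0, 2.0])
alpha, beta = 1.0, 100.0          # beta = 1 / variance for a noise std of 0.1

for n_points in (10, 100, 1000):
    X = rng.normal(size=(n_points, w.size))
    t = X @ w + rng.normal(0.0, 0.1, size=n_points)   # targets with intrinsic noise
    w_trial = w + 0.05                                  # slightly wrong weights
    data_term, prior_term = total_error(w_trial, X, t, alpha, beta)
    print(n_points, round(data_term, 2), round(prior_term, 2))
```

The printed data term grows roughly in proportion to the number of datapoints, while the weights term stays fixed, so the prior matters less and less as the dataset grows.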
The matrix hyperparameters $A_{in}$ and $A_r$ used by Aires (2004) generalise this concept to interdependent outputs and weights, respectively. Using $A_{in}$ instead of $\beta$ allows for more outputs and also incorporates the correlation structure of the errors in the individual variables. (Remember that $C_{in} = A_{in}^{-1}$ does not represent the covariance matrix of the outputs, but of the errors in the outputs (the noise).) This way, mapping errors for variables with small noise are penalised more strongly than those for targets with larger intrinsic noise.

In order to understand $A_r$, it is important to keep in mind that the prior $p(w)$ is a Gaussian distribution with zero mean. This means that, similar to the scalar hyperparameter case, we expect the weights to be small. The major difference to the scalar case is that we assign different inverse variances (diagonal elements of $A_r$ if the weights are independent) to different weights, hence controlling individually how much the weights are penalised for being large. This may be useful since the magnitudes of weights in different layers of a network can have fundamentally different ranges (MacKay, 1995; Aires, 2004).

³¹ MacKay (1995, Section 3.2) notes that the weights of a two-layer perceptron will usually fall into three or more distinct classes, depending on the structure of the inputs. For a good regularisation performance, he suggests the use of different hyperparameters $\alpha$ for these different classes.
3.2.3 Gaussian Approximation to the Posterior Distribution

Although (3.21) is an exact equation for a given noise model and prior, it is useful to approximate the posterior with a Gaussian distribution in order to make it analytically tractable when used in integrals such as (3.8) (Bishop, 1995, Section 10.1.7). This can be obtained by performing a second-order Taylor expansion of the total error function (3.22) around its minimum $w^*$ (i.e. the maximum of $p(w|D)$):

$$E(w) = E(w^*) + b^T \Delta w + \frac{1}{2}\,\Delta w^T H\,\Delta w, \tag{3.25}$$

where $\Delta w = w - w^*$. $b$ denotes the gradient of $E$ at $w^*$,

$$b = \nabla E(w)\big|_{w=w^*} = 0, \tag{3.26}$$

which vanishes because $w^*$ marks the minimum of $E$. $H$ is the Hessian matrix of $E$ (second derivative) with respect to the weights, and it will play an important role in the remaining parts of this thesis. Looking at (3.22), we can see that $H$ is composed of two parts, the data Hessian $H_D$ and the weights hyperparameter $A_r$:³²

$$H = \nabla\nabla E(w)\big|_{w=w^*} \tag{3.27}$$

$$= \nabla\nabla E_D(w)\big|_{w=w^*} + A_r = H_D + A_r. \tag{3.28}$$

Using the approximated error function (3.25), (3.21) becomes

$$p(w|D) = \frac{1}{Z} \exp\left(-E(w^*) - \frac{1}{2}\,\Delta w^T H\,\Delta w\right), \tag{3.29}$$

and, including the constant $\exp(-E(w^*))$ in the normalisation factor $Z$,

$$p(w|D) = \frac{1}{Z} \exp\left(-\frac{1}{2}\,\Delta w^T H\,\Delta w\right). \tag{3.30}$$

The posterior weights distribution thus becomes a Gaussian with mean $w^*$ and covariance matrix $H^{-1}$. The information contained in this covariance matrix can be immediately used to give error bars on the most probable weights vector $w^*$, an example of which is shown in Figure 3.3.

³² Note that if we use no prior information about the weights (i.e. $p(w) = \text{const.}$), $E = E_D$ and $H = H_D$. This case corresponds to ANN training without regularisation.
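To illustrate how error bars on the weights follow from $H^{-1}$, the following sketch uses a linear single-output model with scalar hyperparameters, for which $H_D = \beta X^T X$ and $A_r = \alpha I$ and the quadratic expansion (3.25) is exact. These are simplifying assumptions of the sketch, and `weight_posterior` is a hypothetical helper, not a NETLAB function.

```python
import numpy as np

rng = np.random.default_rng(3)


def weight_posterior(X, t, alpha, beta):
    """Most probable weights w* and covariance H^{-1} for a linear model y = X @ w.

    Assumptions of this sketch: one output and scalar hyperparameters, so that
    H_D = beta * X^T X and A_r = alpha * I.
    """
    n_weights = X.shape[1]
    H = beta * X.T @ X + alpha * np.eye(n_weights)   # H = H_D + A_r, cf. (3.28)
    w_star = np.linalg.solve(H, beta * X.T @ t)      # minimum of the total error
    covariance = np.linalg.inv(H)                    # covariance of the Gaussian (3.30)
    return w_star, covariance


w_true = np.array([1.0, -2.0])
for n_points in (5, 500):
    X = rng.normal(size=(n_points, 2))
    t = X @ w_true + rng.normal(0.0, 0.1, size=n_points)
    w_star, cov = weight_posterior(X, t, alpha=1.0, beta=100.0)
    error_bars = np.sqrt(np.diag(cov))               # one standard deviation per weight
    print(n_points, w_star, error_bars)
```

With more datapoints the data Hessian grows and the error bars on the weights shrink, which mirrors the narrowing of the weight histogram in Figure 3.3.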
Figure 3.3: Distribution of the first weight $w_{11}$ of the neural network used for the example given by equations (3.9) and (3.10) after a short training run (grey histogram, the network has not gained much certainty about the weight value yet) and after a longer training run (black histogram, the distribution has narrowed considerably). (Horizontal axis: weight value.)

3.3 Output Uncertainties

In the previous section, both terms under the integral in (3.8) - the noise model of the target variables (3.15) and the Gaussian approximation to the posterior weights distribution (3.30) - have been derived. Using these results, the derivation of the distribution of the network outputs is straightforward (Aires et al., 2004a):

$$p(t|x,D) = \int p(t|x,w)\,p(w|D)\,dw \tag{3.31}$$

$$\propto \int \exp\left(-\frac{1}{2}\,(t-y)^T A_{in}\,(t-y)\right) \exp\left(-\frac{1}{2}\,\Delta w^T H\,\Delta w\right) dw. \tag{3.32}$$

All normalisation factors have been omitted in the notation in (3.32); instead, the $\propto$ sign has been used to indicate the missing normalisation. This integral can be evaluated by assuming that the posterior distribution of the weights, (3.30), is narrow enough to approximate the network function $y(x;w)$ by its linear expansion around the optimal weights value $w^*$:

$$y(x;w) = y(x;w^*) + G^T \Delta w, \tag{3.33}$$

where the $W \times c$ matrix $G$ represents the gradient of the network function ($c$ is the number of network outputs):

$$G = \nabla_w\, y(x;w)\big|_{w=w^*}. \tag{3.34}$$

Hence, (3.32) becomes

$$p(t|x,D) \propto \int \exp\left(-\frac{1}{2}\left(t - y(x;w^*) - G^T\Delta w\right)^T A_{in} \left(t - y(x;w^*) - G^T\Delta w\right)\right) \exp\left(-\frac{1}{2}\,\Delta w^T H\,\Delta w\right) dw. \tag{3.35}$$

Writing $\epsilon^* = t - y(x;w^*)$ for simplicity, expanding the product and rearranging yields

$$p(t|x,D) \propto \exp\left(-\frac{1}{2}\,\epsilon^{*T} A_{in}\,\epsilon^*\right) \int \exp\left(\epsilon^{*T} A_{in}\,(G^T\Delta w) - \frac{1}{2}\,(G^T\Delta w)^T A_{in}\,(G^T\Delta w) - \frac{1}{2}\,\Delta w^T H\,\Delta w\right) dw, \tag{3.36}$$

where the first factor is independent of $w$ and has hence been pulled out of the integral. Using the matrix identity $(AB)^T = B^T A^T$ and further rearranging leads to

$$p(t|x,D) \propto \exp\left(-\frac{1}{2}\,\epsilon^{*T} A_{in}\,\epsilon^*\right) \int \exp\left(\left(\epsilon^{*T} A_{in}\,G^T\right)\Delta w - \frac{1}{2}\,\Delta w^T \left(G A_{in} G^T + H\right) \Delta w\right) dw. \tag{3.37}$$

Bishop (1995, Appendix B) shows that Gaussian integrals with a linear term evaluate to

$$\int \exp\left(L^T w - \frac{1}{2}\,w^T A\,w\right) dw = (2\pi)^{W/2}\,|A|^{-1/2} \exp\left(\frac{1}{2}\,L^T A^{-1} L\right). \tag{3.38}$$

Setting $L^T = \epsilon^{*T} A_{in}\,G^T$ and $A = G A_{in} G^T + H$, the integral in (3.37) becomes

$$\int \exp(\ldots)\,dw = (2\pi)^{W/2}\,\left|G A_{in} G^T + H\right|^{-1/2} \exp\left(\frac{1}{2}\left(\epsilon^{*T} A_{in}\,G^T\right)\left(G A_{in} G^T + H\right)^{-1}\left(\epsilon^{*T} A_{in}\,G^T\right)^T\right). \tag{3.39}$$

Omitting the constant factor and rearranging (3.37) again gives

$$p(t|x,D) \propto \exp\left(-\frac{1}{2}\,\epsilon^{*T} A_{in}\,\epsilon^*\right) \exp\left(\frac{1}{2}\,\epsilon^{*T} A_{in}\,G^T \left(G A_{in} G^T + H\right)^{-1} G A_{in}\,\epsilon^*\right) \tag{3.40}$$

and further

$$p(t|x,D) \propto \exp\left(-\frac{1}{2}\,(t - y(x;w^*))^T \left[A_{in} - A_{in}\,G^T \left(G A_{in} G^T + H\right)^{-1} G A_{in}\right] (t - y(x;w^*))\right). \tag{3.41}$$

Thus, the target variables follow a multivariate Gaussian distribution with mean $y(x;w^*)$ and covariance matrix

$$C_0 = \left[A_{in} - A_{in}\,G^T \left(G A_{in} G^T + H\right)^{-1} G A_{in}\right]^{-1}. \tag{3.42}$$

The expression for the covariance matrix can be simplified by matrix manipulation (the rearrangement amounts to the matrix inversion identity) to give

$$C_0 = C_{in} + G^T H^{-1} G, \tag{3.43}$$

where $C_{in} = A_{in}^{-1}$ has been used. Equation (3.43) is the main result of this derivation (Aires et al., 2004a).
It shows that the uncertainty in the neural network predictions is composed of two parts: the intrinsic noise contained in the training data, represented by its covariance matrix $C_{in}$, and a term $G^T H^{-1} G$ that represents the impact of the uncertainty of the posterior weights distribution on the predictions - a result that is expected from (3.8). Unless we have situation dependent (= input dependent) information about intrinsic noise on the data, $C_{in}$ will be constant. The neural prediction term, however, is situation dependent through the gradient $G$ (the Hessian is not dependent on the input data).

We can determine the error bars on a network prediction by taking the standard deviation from the covariance matrix $C_0$. Figures 3.4, 3.5 and 3.6 illustrate the results that can be obtained for the example defined at the end of section 3.1.

Williams et al. (1995) show that the weights uncertainty term $G^T H^{-1} G$ is approximately proportional to the inverse training data density,³³ and note that consequently in high-data-density regions, the contribution of this term will become insignificant compared to the noise term $C_{in}$ - a result that I will discuss in section 3.7.

³³ They can prove the result for generalised linear regression models (models of the form $y(x) = \sum_j w_j \phi_j(x)$, where $\phi_j$ are basis functions), and note that empirical studies provide evidence that the result also holds for multi-layer networks - especially for networks with linear output activations trained with a least-squares error function (as in our case), which "is effectively a generalised linear regression model with adaptive basis functions" (Williams et al., 1995, Section 5).
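The equivalence of the long form (3.42) and the compact form (3.43) of the output covariance can be verified numerically. The sketch below uses randomly generated stand-ins for the Hessian $H$ and the gradient $G$ (hypothetical values, not derived from any trained network) and the noise levels of the example function; it only checks the matrix algebra and shows how error bars would be read off $C_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_weights, n_outputs = 6, 2          # W and c; sizes chosen arbitrarily


def random_spd(n):
    """Random symmetric positive definite matrix (illustrative stand-in)."""
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)


C_in = np.diag([0.1 ** 2, 0.05 ** 2])         # intrinsic noise covariance
A_in = np.linalg.inv(C_in)
H = random_spd(n_weights)                     # stand-in for the Hessian at w*
G = rng.normal(size=(n_weights, n_outputs))   # stand-in for the W x c gradient

# Covariance as it falls out of the integration, (3.42).
inner = np.linalg.inv(G @ A_in @ G.T + H)
C0_long = np.linalg.inv(A_in - A_in @ G.T @ inner @ G @ A_in)

# Simplified form, (3.43): intrinsic noise plus weights uncertainty term.
C0_short = C_in + G.T @ np.linalg.inv(H) @ G

error_bars = np.sqrt(np.diag(C0_short))       # one standard deviation per output
print(np.allclose(C0_long, C0_short), error_bars)
```

Because $G^T H^{-1} G$ is positive semi-definite, the error bars obtained this way can never be smaller than the intrinsic noise level alone.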
3.4 Hyperparameter Re-Estimation

With the results from the previous sections, we have to know the values of the hyperparameters $A_{in}$ and $A_r$ in advance. However, using the complete structure of $A_r$ in the training process requires a good knowledge of the network mapping, which usually is not available a priori. Similarly, information about the noise distribution in the target data will also not be available in most cases. Therefore, how can the hyperparameters be estimated?

Aires et al. (2004a) suggest the following - omit both hyperparameters in an initial learning stage, then estimate them from the trained network and re-train the network with the new values.

In the case of $A_{in}$, Aires et al. suggest use of (3.43). The approximation they make is to assume the intrinsic noise to be constant throughout the training data. Then an average covariance matrix $C_{in}$ can be computed from the training dataset $D$ by determining the covariance matrix of the output errors $C_0$ from the $(\epsilon^*)^n$ of the training datapoints, and by using an average of the neural prediction term $G^T H^{-1} G$ over the training dataset. From these two matrices, an average $C_{in}$ can be computed, and $A_{in}$ is obtained by inversion:

$$C_{in} = \langle C_0 \rangle - \langle G^T H^{-1} G \rangle, \tag{3.44}$$

where the averages have been denoted by $\langle \cdot \rangle$.

Unfortunately, Aires et al. do not give details on how to re-estimate $A_r$. Their papers suggest use of the posterior weights distribution of one training run as the prior distribution in the consecutive run. This, however, would destroy the intended regularisation mechanism - the prior (3.19) is intentionally chosen to be a Gaussian with zero mean in order to keep the weight values small. The posterior (3.30), in contrast, is a Gaussian with mean $w^* \neq 0$.
Thus, using this distribution as a prior for the consecutive training run would pull the weights towards the already found minimum, not encourage them to be small. This also becomes clear when considering the following:³⁴ In order to accommodate the mean $w^*$, (3.20) would have to be changed to

$$E_W(w)^{(i)} = \frac{1}{2}\left(w - (w^*)^{(i-1)}\right)^T A_r^{(i)} \left(w - (w^*)^{(i-1)}\right), \tag{3.45}$$

where $(i)$ refers to the $i$-th re-estimation iteration. Since $\nabla\nabla E_W(w)\big|_{w=(w^*)^{(i)}} = A_r^{(i)}$ does not change when (3.45) is used instead of (3.20), $A_r^{(i)}$ is re-estimated following

$$A_r^{(i)} = H^{(i-1)} = H_D^{(i-1)} + A_r^{(i-1)}. \tag{3.46}$$

Since $H$, $H_D$ and $A_r$ are the inverses of covariance matrices, which are positive definite (Aires, 2004), they are also positive definite. This means that at least the diagonal elements of these matrices are positive (Weisstein, 2007). Hence, the diagonal elements of $A_r$ would become larger and larger with each training iteration. This makes sense; larger diagonal elements of $A_r$ approximately³⁵ mean a smaller variance of the weights, hence the weights distribution is more sharply peaked.

³⁴ Although MacKay (1992b) points out that using weight decays with non-zero means would just correspond to using a different model - whose performance could be compared to a zero-mean regulariser within the evidence framework (MacKay, 1992a).

Figure 3.4: Section through the functional surface of the example given by (3.9) and (3.10) at $x_2 = 0.3$. Plotted are datapoints from the training dataset (grey), the network prediction (solid line) and error bars (dotted lines). Panel (a) shows the error prediction after the long training run (the narrow weight distribution in Figure 3.3), while panel (b) displays the same prediction for the network trained only a short time (the broad distribution in Figure 3.3). The broader weights distribution is noticeably reflected in larger error bars.
Panels: (a) Long training run. (b) Short training run. (Horizontal axis: input x.)

Figure 3.5: Scatterplots of the estimated error (one standard deviation) vs. the actual error for both output variables of the example given in (3.9) and (3.10), for the long training run shown in Figure 3.8 (also see Figure 3.4a). The narrow weights distribution (cmp. Figure 3.3) causes the network term in (3.43) to be much smaller than the intrinsic noise $C_{in}$. Consequently, the estimated error is almost identical to the noise on the data, which is correctly estimated to have a standard deviation of 0.1 and 0.05, respectively. Hence, the predicted error does not show significant input dependence; it can only be stated that the true error will be smaller than the estimated error in at least 68.2% of all cases. In fact, for the given example, 68.5% of the true errors are smaller than the predicted ones for output variable 1 (68.0% for variable 2).
Panels: (a) Estimated error output 1. (b) Estimated error output 2.

Figure 3.6: The same as Figure 3.5, but for the short training run leading to the broad weights distribution in Figure 3.3 and the larger error bars in Figure 3.4b. Note that this time, the network dependent term in (3.43) is much larger, due to the broader weights distribution, resulting in a more input dependent error (in panel (a) 67.5% of the predicted errors are smaller than the actual error; in panel (b) 75.4%).
Panels: (a) Estimated error output 1. (b) Estimated error output 2.
If the optimisation algorithm has found a minimum $w^*$ in a first training run, and in a second training run is "told" that this minimum is very likely (even if it was not a very good minimum), then it will further increase the certainty in that minimum by decreasing the variance. Furthermore, by increasing the diagonal elements in $A_r$, the importance of the data is decreased, until they eventually are insignificant compared to the weights prior. This, however, will only increase the chance of remaining stuck in a local minimum found in the first training run.

However, since we are using a zero-mean regulariser, we wish to get a new estimation of how strongly the weights should be pulled towards zero, which eventually should lead to a compromise between a mapping that fits the data well and a guard against overfitting. A method to re-estimate an $A_r$ for a zero-mean Gaussian is needed, something that Aires et al. do not derive.

The approach I use is to adopt the re-estimation technique suggested by Aires et al. (2004a) for $A_{in}$, but to use the so-called evidence procedure (MacKay, 1992a) in order to re-estimate a modified version of $A_r$. However, in order to use the evidence procedure in the form derived by MacKay, $A_r$ can only contain diagonal elements. Hence, correlations between the weights will be ignored below.

3.4.1 Evidence Procedure for $A_r$

The evidence procedure has been derived by MacKay (1992a) in order to re-estimate the scalar hyperparameters $\alpha$ and $\beta$ from the data during the network learning stage. Nabney (2002) shows how the approach can be generalised to multiple $\alpha$ for different groups of weights, up to an individual $\alpha$ for every weight - which corresponds to having a diagonal $A_r$ in (3.20).
When in addition to the network weights the values of the hyperparameters are inferred from the data, the result can be expressed by the joint probability distribution $p(w, \alpha, \beta|D)$. The correct Bayesian treatment (Bishop, 1995, Section 10.4) to get the posterior distribution of the weights $p(w|D)$ from this joint distribution would be to integrate over all possible values of $\alpha$ and $\beta$:

$$p(w|D) = \iint p(w, \alpha, \beta|D)\, d\alpha\, d\beta \qquad (3.47)$$
$$\phantom{p(w|D)} = \iint p(w|\alpha, \beta, D)\, p(\alpha, \beta|D)\, d\alpha\, d\beta. \qquad (3.48)$$

The evidence procedure, however, makes the approximation that the density $p(\alpha, \beta|D)$ is sharply peaked around the most probable values $\alpha_{MP}$ and $\beta_{MP}$, reducing (3.48) to

$$p(w|D) \simeq p(w|\alpha_{MP}, \beta_{MP}, D) \underbrace{\iint p(\alpha, \beta|D)\, d\alpha\, d\beta}_{=1} = p(w|\alpha_{MP}, \beta_{MP}, D). \qquad (3.49)$$

In order to find $\alpha_{MP}$ and $\beta_{MP}$, the posterior distribution

$$p(\alpha, \beta|D) = \frac{p(D|\alpha, \beta)\, p(\alpha, \beta)}{p(D)} \qquad (3.50)$$

has to be maximised. The prior $p(\alpha, \beta)$ is assumed to be uniform in the evidence procedure, so that it does not affect the maximum of $p(\alpha, \beta|D)$. [36] The normalisation factor $p(D)$ (the integral of the numerator over $\alpha$ and $\beta$) also does not affect the maximum, hence only $p(D|\alpha, \beta)$, known as the evidence of the hyperparameters, has to be maximised. This term can be written as an integral of the data likelihood over all possible weights $w$:

$$p(D|\alpha, \beta) = \int p(D|w, \alpha, \beta)\, p(w|\alpha, \beta)\, dw. \qquad (3.51)$$

[35] Exactly, if no covariances exist.
[36] Such a prior is said to be an improper prior, since it does not have a finite integral and cannot be normalised.

Since the hyperparameters are given, the factors $p(D|w, \alpha, \beta)$ and $p(w|\alpha, \beta)$ of the integrand are given by (3.17) and (3.19) (exchange the matrix hyperparameters with the scalar ones), leading to

$$p(D|\alpha, \beta) = \frac{1}{Z} \int \exp\left(-E_{\alpha\beta}(w)\right) dw, \qquad (3.52)$$

where (3.24) has been used because of the scalar hyperparameters. The Gaussian integral in (3.52) is
given by

$$\int \exp\left(-E_{\alpha\beta}(w)\right) dw = \exp\left(-E_{\alpha\beta}(w^*)\right) (2\pi)^{W/2}\, |H|^{-1/2} \qquad (3.53)$$

(see Bishop (1995, Appendix B) for the evaluation of Gaussian integrals). Together with terms from the normalisation factor $Z$, the logarithm of the evidence (3.52) is then given by

$$\log p(D|\alpha, \beta) = -\frac{\alpha}{2} \sum_{i=1}^{W} (w_i^*)^2 - \frac{\beta}{2} \sum_{n=1}^{N} \left\{ y(x_n; w^*) - t_n \right\}^2 - \frac{1}{2} \log|H| + \frac{W}{2} \log\alpha + \frac{N}{2} \log\beta - \frac{N}{2} \log(2\pi). \qquad (3.54)$$

In order to optimise this log evidence with respect to $\alpha$ (the corresponding approach for $\beta$ will not be considered here), the partial derivative $\frac{\partial}{\partial\alpha} \left( \log p(D|\alpha, \beta) \right)$ has to be computed. Bishop (1995, p. 410) shows that

$$\frac{d}{d\alpha} \log|H| = \mathrm{tr}\left(H^{-1}\right), \qquad (3.55)$$

where the approximation has been made that the eigenvalues of the data Hessian $H_D$ do not depend on $\alpha$. (I have skipped the eigenvalue step here, see Bishop (1995, Section 10.4) for further details.) Hence,

$$\frac{\partial}{\partial\alpha} \left( \log p(D|\alpha, \beta) \right) = -\frac{1}{2} \sum_{i=1}^{W} (w_i^*)^2 - \frac{1}{2} \mathrm{tr}\left(H^{-1}\right) + \frac{W}{2\alpha}. \qquad (3.56)$$

Equating (3.56) to zero yields the expression

$$\alpha = \frac{W - \alpha\, \mathrm{tr}\left(H^{-1}\right)}{\sum_{i=1}^{W} (w_i^*)^2}. \qquad (3.57)$$

If groups of weights are assigned different hyperparameters, this equation can be adjusted correspondingly, down to an individual $\alpha$ for each weight:

$$\alpha_i = \frac{1 - \alpha_i \left(H^{-1}\right)_{ii}}{(w_i^*)^2}. \qquad (3.58)$$

In order to optimise $\alpha$, the ANN training is first started with an initial (random) value of $\alpha$. Once the training algorithm has found a minimum, $\alpha$ is re-estimated from (3.58):

$$\alpha_i^{new} = \frac{1 - \alpha_i^{old} \left(H^{-1}\right)_{ii}}{(w_i^*)^2}. \qquad (3.59)$$

Of course, the optimisation of the hyperparameters by iterative re-estimation (valid for both re-estimating $A_{in}$ with the Aires et al. (2004a) approach and $A_r$ with the evidence procedure) is computationally expensive, since the network has to be re-trained several times, ideally until the hyperparameters stabilise. Figure 3.7 illustrates how they develop during the long training run of the example used in this chapter.
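The re-estimation step (3.59) can be illustrated with a short Python sketch. This is not the NETLAB/MATLAB code used for this work; the function name `reestimate_alpha`, the toy numbers, and the assumption that the full Hessian is the data Hessian plus a diagonal matrix holding the current $\alpha_i$ are all chosen for illustration only.

```python
import numpy as np

def reestimate_alpha(w_star, data_hessian, alpha_old):
    """One evidence-procedure update of per-weight hyperparameters.

    Evaluates alpha_i_new = (1 - alpha_i_old * [H^-1]_ii) / w_i^2, where the
    full Hessian is assumed to be H = H_D + diag(alpha_old).
    """
    w_star = np.asarray(w_star, dtype=float)
    alpha_old = np.asarray(alpha_old, dtype=float)
    full_hessian = np.asarray(data_hessian, dtype=float) + np.diag(alpha_old)
    inv_diag = np.diag(np.linalg.inv(full_hessian))
    # Numerator of (3.58): close to 1 if the data constrain the weight well,
    # close to 0 if the weight is mainly constrained by the prior.
    numerator = 1.0 - alpha_old * inv_diag
    return numerator / w_star**2

# Toy numbers (hypothetical): two weights and a diagonal data Hessian.
w_star = np.array([2.0, 0.5])
data_hessian = np.diag([9.0, 1.0])
alpha = np.array([1.0, 1.0])
alpha_new = reestimate_alpha(w_star, data_hessian, alpha)
print(alpha_new)
```

For these toy values the first weight is large and well determined by the data, so its penalty drops to about 0.225, whereas the small, weakly determined second weight receives a stronger penalty of about 2.0. In a full training run this update would alternate with re-training the network until the values stabilise.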
3.5 The Jacobian Matrix

The Jacobian matrix is not a part of the Bayesian framework described in the previous sections. It represents the first derivative of the network outputs $y$ with respect to the inputs $x$ and is defined as

$$J_{lk} = \frac{\partial y_l}{\partial x_k}. \qquad (3.60)$$

Aires et al. (2004b) suggest that the variability in the Jacobian can be determined by computing a distribution of Jacobians from the posterior weights distribution given by (3.30). Their approach, and the one adopted in this study, is to use Monte Carlo techniques to sample from the weights distribution, then to construct a Jacobian distribution from these samples. Aires et al. use $R = 1000$ samples from (3.30), compute the Jacobian for each weights sample, and compute mean and variance from all these Jacobians. This, of course, assumes a Gaussian distribution for the Jacobian as well, although the histogram of the Jacobians itself could be used as a PDF (probability density function) representation. The Jacobian, as defined in (3.60), is input dependent, hence its distribution can be computed for individual inputs. [37]

For the purpose of identifying ill-posed problems, Aires et al. (2004b) propose to compute an average Jacobian over the training dataset and to interpret its variability. This approach applied to the example given by (3.9) and (3.10) yields the results listed in Table 3.1. Indeed, the wider weights distribution results in a larger variability in the Jacobian. If a problem has been identified as ill-posed, they suggest a principal component decomposition of the input and output data. They apply this approach to a remote sensing problem with many correlated inputs and use it to reduce the number of inputs and to exploit the decorrelated structure of the principal components.
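The Monte Carlo estimate of the Jacobian variability described in this section can be sketched in Python as follows. The sketch assumes a two-layer perceptron with tanh hidden units and linear outputs; the layer sizes, the helper names (`unpack`, `forward`, `jacobian`, `jacobian_statistics`) and the isotropic weights covariance are made up for illustration. In the setting of this chapter, the covariance would be the inverse Hessian of the posterior weights distribution (3.30).

```python
import numpy as np

N_IN, N_HID, N_OUT = 2, 3, 2          # hypothetical layer sizes
N_WEIGHTS = N_HID * N_IN + N_HID + N_OUT * N_HID + N_OUT

def unpack(w):
    """Split a flat weight vector into the layers of a two-layer tanh MLP."""
    a = N_HID * N_IN
    b = a + N_HID
    c = b + N_OUT * N_HID
    return (w[:a].reshape(N_HID, N_IN), w[a:b],
            w[b:c].reshape(N_OUT, N_HID), w[c:])

def forward(w, x):
    """Network output y(x; w) with tanh hidden units and linear outputs."""
    W1, b1, W2, b2 = unpack(w)
    return W2 @ np.tanh(W1 @ x + b1) + b2

def jacobian(w, x):
    """J[l, k] = d y_l / d x_k, obtained analytically via the chain rule."""
    W1, b1, W2, _ = unpack(w)
    hidden = np.tanh(W1 @ x + b1)
    return W2 @ ((1.0 - hidden**2)[:, None] * W1)

def jacobian_statistics(w_star, weight_cov, x, n_samples=1000, seed=0):
    """Mean and standard deviation of J over weights drawn from N(w*, weight_cov)."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(w_star, weight_cov, size=n_samples)
    jacobians = np.array([jacobian(w, x) for w in samples])
    return jacobians.mean(axis=0), jacobians.std(axis=0)

# Usage with made-up numbers: random "trained" weights, isotropic covariance.
rng = np.random.default_rng(1)
w_star = rng.normal(size=N_WEIGHTS)
x = np.array([0.3, -0.7])
mean_J, std_J = jacobian_statistics(w_star, 0.01 * np.eye(N_WEIGHTS), x)
```

Averaging `mean_J` and `std_J` over all inputs of a dataset would give the kind of summary shown in Table 3.1. Keeping the individual sampled Jacobians instead of only their mean and standard deviation would allow the histogram to be used directly, without the Gaussian assumption.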
If the number of inputs is small from the beginning, however, it is usually not desirable to decrease their number further. Bishop (1995, Section 8.2) discusses a similar approach, which decorrelates the inputs and is known as whitening in the literature. It will become important in chapter 4.

[37] A property that will prove useful in chapter 4, where the Jacobian provides information about whether a network is modelling the "right" function.

[Figure 3.7: three panels. (a) Elements of $C_0$ and $C_{in}$ vs. iteration; (b) development of $A_{in}$ (values before each iteration); (c) development of $A_r$ (values before each iteration).]
Figure 3.7: Development of the covariance matrices $C_0$ and $C_{in}$ (panel (a)) and the hyperparameters $A_{in}$ (panel (b)) and $A_r$ (panel (c)) during the training run shown in Figure 3.8. After a few re-estimation iterations, the estimated intrinsic noise stabilises at the correct variance values of 0.01 and 0.0025, respectively, leading to a stabilisation in the data hyperparameter $A_{in}$. Notably, the uncertainty in the neural prediction, given by the difference between $C_0$ and $C_{in}$, quickly becomes smaller. Most elements of $A_r$ stay in the range between 0 and 100; however, some weights are more strongly penalised for getting large. After approximately eight iterations, $A_r$ has stabilised as well.

Table 3.1: Jacobian matrices and their uncertainty (one standard deviation) of the two networks trained to model the example function given by (3.9) and (3.10).
Left: the Jacobian of the network with the narrow weights distribution (long training run) shown in Figure 3.3; right: the Jacobian of the network with the broad weights distribution (short training run) in Figure 3.3. Note how the width of the weights distribution is reflected in the uncertainty of the Jacobian, and how the briefly trained network models a significantly different dependence of output 1 on input 2, a dependence that is also the most uncertain one in the long trained network and indicates a problematic input.

    Element      Long training run (left)    Short training run (right)
    dy1/dx1       4.00 ± 0.02                 3.83 ± 0.50
    dy1/dx2      -0.02 ± 0.43                -1.52 ± 0.37
    dy2/dx1      -0.04 ± 0.01                 0.01 ± 0.87
    dy2/dx2       1.00 ± 0.05                 0.77 ± 0.34

3.6 Implementation Issues

The NETLAB toolbox by Nabney (2002) is implemented in MATLAB and provides many functions for neural networks. The toolbox comes with the ability to handle two-layer MLPs, with a Bayesian module that implements the scalar hyperparameter approach of MacKay (1992b). Since NETLAB is very well documented and already provides many functions needed for the Aires et al. algorithm, I chose it as a basis for my implementation.

Algorithm 3.1 below summarises the algorithm discussed in the previous sections in a schematic outline. The initial hyperparameters $A_{in}^{(0)}$ and $A_r^{(0)}$ will usually be set to the identity matrix $I$ and $\alpha_0 I$, respectively, where $\alpha_0$ is some constant that controls the degree of regularisation in the first training run. If no regularisation is desired, $A_r^{(0)}$ can be set to zero. The loop in lines 2 to 9 trains the network and re-estimates the hyperparameters. It is repeated until some termination criterion has been reached, which can be either a stabilisation of the hyperparameters or simply a maximum number of iterations.
In line 3, the error surface of equation (3.22) is minimised with one of the optimisation algorithms provided by NETLAB, for instance, conjugate gradients. Afterwards all input data is propagated through the network, and the covariance matrix $C_0$ is computed from the errors of the predictions compared to the targets. In the following step, the gradient of the outputs with respect to the weights $G$ and the Hessian $H$ are computed, which together with $C_0$ are used to estimate $C_{in}$ and $A_{in}$. Finally, $A_r$ is re-estimated with the evidence procedure described in section 3.4. While details about the implementation are given in Appendix A, a few important issues are mentioned in subsections 3.6.1 and 3.6.2.

Algorithm 3.1: Bayesian Neural Network Training with Matrix Hyperparameter Re-Estimation.
  Input: Training dataset $S = \{X, D\}$, initial hyperparameters $A_{in}^{(0)}$, $A_r^{(0)}$.
  Result: Maximum of posterior weights distribution $w^*$, hyperparameters $A_{in}$ and $A_r$, covariance matrix of intrinsic noise $C_{in}$ and covariance matrix of posterior weights distribution $H^{-1}$.
  1  $A_{in} \leftarrow A_{in}^{(0)}$; $A_r \leftarrow A_r^{(0)}$;
  2  repeat
  3      Train network (minimise error function (3.22)) with $A_{in}$ and $A_r$ in order to find $w^*$;
  4      Estimate covariance matrix $C_0 = \mathrm{covariance}(t(x) - y(x; w^*))$ from target data and network predictions;
  5      Compute gradient $G = \nabla_w y(x; w)|_{w=w^*}$ (3.34) and Hessian $H = \nabla\nabla E(w)|_{w=w^*}$ (3.27);
  6      Compute the approximate covariance matrix of the intrinsic noise $C_{in} = C_0 - \langle G^T H^{-1} G \rangle$ (3.44);
  7      Set $A_{in} = C_{in}^{-1}$;
  8      Re-estimate $A_r$ with the evidence procedure, (3.59);
  9  until hyperparameters have stabilised or maximum number of iterations has been reached;

[Figure 3.8: two panels. (a) Development of training and test MSE vs. iteration; (b) development of ANN weights vs. iteration.]
Figure 3.8: (a) Development of the mean-square-error (MSE) of the training dataset and of a test dataset without added noise of the example given by (3.9) and (3.10) during a training run with 10 hyperparameter re-estimation iterations (corresponding to the narrow weight distribution in Figure 3.3). The stabilisation of the test MSE close to zero after iteration 4 is a sign that no overfitting is taking place; in the case of overfitting, the training MSE would further decrease while the test MSE would increase (the network prediction would not model the true function anymore). (b) Development of the weight values during the training. After iteration 6 the values hardly change anymore. The regularisation effect of $A_r$ causes all weights to stay of order unity.

[Figure 3.9: four scatterplot panels, (a) to (d); panels (c) and (d) are titled "Test dataset after training".]
Figure 3.9: Scatterplot of the network predictions for both outputs of the example given in (3.9) and (3.10) vs. the target data. Panels (a) and (b) show plots of the training data (with noise), panels (c) and (d) show plots of the test data (no noise).
Whereas panels (a) and (b) reflect the intrinsic noise in the training data, panels (c) and (d) illustrate the good generalisation performance of the network; if overfitting had occurred during the training process, predictions of the test data would be unlikely to match the targets.

3.6.1 Normalisation of Input and Output Variables

As Bishop (1995, Section 8.2) points out, rescaling input and output variables is useful if different variables have typical values that differ significantly. In atmospheric remote sensing, for instance, particle size and temperature would have very different value ranges. In such cases, pre-processing can have a significant effect on the generalisation performance of the network; after normalising input and output variables to be of order unity, it is expected that the weights will also be of order unity and hence be small (Bishop, 1995, Chapter 8). A simple method to achieve similar values of order unity for all variables is to apply a linear rescaling by subtracting the mean of the variable and normalising by its standard deviation:

$$\tilde{v}_i^n = \frac{v_i^n - \bar{v}_i}{\sigma_i}, \qquad (3.61)$$

where $v$ can be either input or output variable. Whitening, the more sophisticated linear rescaling method mentioned in section 3.5, decorrelates the variables in addition to normalising them. With whitening, the rescaled variables $\tilde{v}$ are computed using

$$\tilde{v}^n = \Lambda^{-1/2} U^T (v^n - \bar{v}) \qquad (3.62)$$

(Bishop, 1995, Section 8.2). Here, $\Lambda$ denotes a diagonal matrix containing the eigenvalues of the covariance matrix $\Sigma$ of the variables $v_i$, and $U$ contains the eigenvectors of $\Sigma$.

3.6.2 Regularisation of the Hessian

While working with the implementation of the Aires et al. algorithm, I encountered a severe difficulty, which has already been discussed by Aires (2004): the positive definite character of the Hessian.
As the inverse of the covariance matrix of a Gaussian distribution, the Hessian has to be positive definite. This means that all its eigenvalues have to be strictly positive, and that $v^T H v > 0$ for any non-zero vector $v$ (Bishop, 2006). Aires (2004) points out that for the local quadratic approximation (3.30) to be valid, the optimal weights vector $w^*$ must be at a real minimum of the error surface, otherwise the positive definite character is not guaranteed. This is obviously a problem, since every training algorithm can only approximate this minimum. Furthermore, as Aires states, the possibly large size of the Hessian ($W \times W$, with $W$ being the total number of weights in the network) has the consequence that its estimation needs to be done with a large enough dataset, otherwise the eigenvalues could become very small or even negative, also violating the positive definiteness of $H$.

A solution to this problem suggested by Aires (2004) is to add a diagonal regularisation [38] matrix $\lambda I$ to the Hessian, where $\lambda$ is a small scalar and $I$ is the identity matrix. If $\lambda$ is large enough, this approach will result in a positive definite matrix. However, the addition will also result in a bias in quantities estimated from the regularised matrix (Rigdon, 1997). Using a combination of criteria to measure condition number and positive definiteness, Aires (2004) obtains $\lambda = 12$ for his example. My own experiments, however, showed that $\lambda$ needs to assume values larger than 10,000 in certain cases, so that the original character of the Hessian is considerably altered.
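The diagonal shift can be made concrete with a small Python sketch. The rule used here for choosing $\lambda$ (the smallest shift that lifts all eigenvalues above a tolerance `margin`) is a simplification assumed for illustration and is not the combination of criteria used by Aires (2004); the function name `shift_regularise` is likewise hypothetical.

```python
import numpy as np

def shift_regularise(hessian, margin=1e-6):
    """Return (H + lam * I, lam) with the smallest lam >= 0 such that all
    eigenvalues of the result are at least `margin` (up to rounding)."""
    sym = 0.5 * (hessian + hessian.T)          # enforce symmetry before eigvalsh
    smallest = np.linalg.eigvalsh(sym)[0]      # eigenvalues come in ascending order
    lam = max(0.0, margin - smallest)
    return sym + lam * np.eye(sym.shape[0]), lam

# Toy Hessian with eigenvalues -1 and 3, i.e. not positive definite.
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])
H_reg, lam = shift_regularise(H)
print(lam, np.linalg.eigvalsh(H_reg))
```

Because the shift is applied to every eigenvalue, the well-determined directions of the Hessian are altered as well. This is the bias mentioned above, and it grows with the size of the most negative eigenvalue that has to be compensated.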
Another possibility, mentioned in passing by Nabney (2002), is to use an eigenvalue decomposition, which, for the given case, is closely related to the truncated singular value decomposition (e.g. Hansen, 1994). If the matrix $H$ is not positive definite, then one or more of its eigenvalues will be zero or negative. Nabney (2002, Section 9.4.2) decomposes the data Hessian $H_D$ into its eigenvalues and eigenvectors, sets all negative eigenvalues to zero and reconstructs the matrix:

$$H_D = V \Lambda V^T, \qquad (3.63)$$

where $V$ contains the eigenvectors of $H_D$ and $\Lambda$ the modified eigenvalues. If the matrix $H_D$ is at least positive semi-definite (i.e. eigenvalues can also be zero) and Hermitian (which all real symmetric matrices are), the eigenvalue decomposition is equivalent to the singular value decomposition and the eigenvalues coincide with the singular values (Abdi, 2007, Section 2). Since the reconstructed $H_D$ still has zero eigenvalues, Nabney further adds the weights hyperparameter $A_r$ in order to reconstruct the full Hessian $H$. Since $A_r$ is a diagonal matrix in Nabney's case, his procedure is a combination of an eigenvalue decomposition and the method proposed by Aires.

[38] Do not confuse the meaning of regularisation used here with the meaning of the word used for the weights term in the error function. Regularisation for neural network training denotes adding a weight penalty term to the error function in order to encourage small network weights. Regularisation of the Hessian here describes methods to make the Hessian (and its inverse) positive definite. Note that the term regularisation of matrices is also often used in the context of matrix inversion or the solution of linear systems; in this case a matrix is singular or close to singular and has to be regularised in order to make it invertible with the available numerical precision.

3.7 Usefulness of the Aires et al. Method

The figures given in this chapter so far show that the Aires et al. method yields the expected results for the example given by equations (3.9) and (3.10); the hyperparameters stabilise as expected after several re-estimation iterations (Figure 3.7), the network converges (Figures 3.8 and 3.9), and the network with the broader weights distribution (Figure 3.3) produces larger error bars (Figure 3.4). However, in many examples, I found the performance of the error estimation to be unsatisfactory, and several problems and open questions require further investigation. Figures 3.10 to 3.14 illustrate some of the problems and failures I encountered while testing my implementation.

[Figure 3.10: network prediction with error bars vs. input.]
Figure 3.10: Demonstration of the input dependence of the error estimation given in (3.43). Shown is the one-dimensional generator function $y(x) = \sin(3\pi x) + x^2$ (dashed line), from which 100 datapoints were created by adding Gaussian noise with a standard deviation of 0.1 in the intervals [0.15..0.4] and [0.9..1.0]. The network prediction (solid line) is shown together with error bars at one standard deviation. Note the larger error bars in the regions where no training data existed, reflecting the inverse dependence of the estimated error on the training data density. (Example generated with the original NETLAB implementation with scalar hyperparameters.)

A serious problem is the actual size of the error bars. As mentioned in sections 3.3 and 3.4, Aires et al. assume the intrinsic noise on the data to be constant throughout the dataset, and Williams et al.
(1995) show that the uncertainty due to the neural network weights is approximately proportional to the inverse data density. Especially in the examples given by Bishop (1995, Figure 10.9) and Nabney (2002, Figure 9.4), the error bars increase significantly in regions where no training data is given. However, I found that for many other functions such good results seem to be difficult to obtain. Furthermore, Williams et al. (1995) found that the neural uncertainty contribution of a network trained from a very dense dataset will become insignificant compared to the noise term. For instance, Figure 3.10 shows a simple example of a network trained from a dataset including 100 datapoints in two distinct intervals. As expected, the error bars in the middle section, where no training data was given, are larger than in the sections where data was available. Consequently, the true generator function is still within the error interval of the network prediction. However, if the same problem is repeated with 10,000 datapoints (Figure 3.11), then the network uncertainty term suddenly becomes very small, including the interval where no data were present. In this case, the true function is outside of the error bars of the prediction, and we are confronted with the counter-intuitive result that more observations lead to a worse result.

[Figure 3.11: network prediction with error bars vs. input.]
Figure 3.11: Same as Figure 3.10, but with a training dataset consisting of 10,000 datapoints. This example demonstrates how the estimated error bars depend on factors such as the size of the training dataset, an effect that is not desired in the given case.

[Figure 3.12: network prediction with error bars vs. input $x_1$.]
Figure 3.12: The same example as in Figure 3.4, but the network has only been trained with data in the interval $x_1$ = [0.15..0.4] (plotted in grey, however, are the entire training data). Again, the error bars diverge where no training data was present; however, the divergence is much too weak to indicate the actual error. This is a problem that I encountered in many cases and which emphasises how difficult it can be to interpret the error bars.

[Figure 3.13: network prediction with error bars vs. input.]
Figure 3.13: Limitations of the error estimation. A simple two-layer perceptron is not able to model a mapping that contains ambiguities, such as the mapping shown here (dashed line). We would hope that at least the error bars could reflect the problematic areas by becoming larger in the ambiguous regions; however, this is not the case. Furthermore, the lower data density in the non-ambiguous regions is counter-productive. (One-dimensional example generated with the original NETLAB implementation with scalar hyperparameters.)

Figure 3.14: The errors in Figures 3.5 and 3.6 were predicted using a Hessian regularised with the eigenvalue decomposition described in section 3.6. Shown here is the error predicted by the same network as in Figure 3.6, but using the non-regularised Hessian. Note the striking difference between computations performed with a regularised and a non-regularised Hessian.

Figure 3.12 shows a variation of the two-dimensional example used in the previous sections. It displays the same section through the functional surface as Figure 3.4, but this time the network has been trained only with data in a narrow interval.
As expected, the error bars diverge in the part where no training data was present; however, they are much too small to indicate the true error relative to the actual function.

Another problem is the presence of ambiguities in the training data. A simple two-layer network architecture is not able to model such ambiguities (Bishop, 1995), but as I pointed out in Chapter 1, the hope is that the error bars reflect these regions through a larger uncertainty. However, I was not able to achieve this result. On the contrary, regions where ambiguities exist actually have a higher data density than regions where only one functional value is present, leading to the possibility that the error bars are even smaller in the ambiguous parts (Figure 3.13).

This investigation also raises several questions concerning the regularisation of the Hessian: how large is the influence of the regularisation on the training algorithm and on the error bars? Which is the better regularisation method, and how much information is destroyed by performing the regularisation? Figure 3.14 shows the errors from Figure 3.6, but computed using the non-regularised Hessian; the neural uncertainty term almost vanishes. In other examples, I also encountered negative errors due to a non positive definite inverse Hessian. The problem is made worse by the fact that the Hessian is estimated from a large dataset. Since it represents the second derivative of the error with respect to the weights, estimating it from different datasets should not significantly change the Hessian. Usually it will be estimated from the training dataset. However, if the size of the dataset is changed, or the Hessian is estimated from only a part of it, the negative eigenvalues can slightly change, resulting in a different regularisation and hence different error bars. Also unclear is the effect of the regularised Hessian on the Jacobian variability estimates.
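To make the object of these questions concrete, the eigenvalue-based regularisation of section 3.6.2 can be sketched in Python. This is an illustrative re-implementation under stated assumptions, not the NETLAB code: the function name `clip_regularise` and the toy matrices are hypothetical, and $A_r$ is assumed to be diagonal.

```python
import numpy as np

def clip_regularise(data_hessian, a_r_diag):
    """Set negative eigenvalues of the data Hessian H_D to zero, rebuild it as
    V Lambda V^T as in (3.63), and add the diagonal weights hyperparameter A_r."""
    sym = 0.5 * (data_hessian + data_hessian.T)
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    rebuilt = (eigvecs * clipped) @ eigvecs.T   # scales column j of V by clipped[j]
    return rebuilt + np.diag(a_r_diag)

# Toy data Hessian with eigenvalues -1 and 3 and a small diagonal A_r.
H_D = np.array([[1.0, 2.0],
                [2.0, 1.0]])
H = clip_regularise(H_D, np.array([0.1, 0.1]))
print(np.linalg.eigvalsh(H))
```

In this toy case the direction belonging to the negative eigenvalue ends up with curvature 0.1, which comes from $A_r$ alone. The variance assigned to that direction in $H^{-1}$ is therefore set entirely by the prior, which illustrates why small changes in which eigenvalues are negative can change the resulting error bars.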
The numerical accuracy of the implementation also becomes a problem in light of the fact that the method has been devised for noisy target data. Hence, if the intrinsic noise variance of the training data becomes small and the network prediction is good enough so that the covariance matrix of the errors $C_0$ has very small elements, the average neural uncertainty term $G^T H^{-1} G$ has to be even smaller in order to re-estimate $C_{in}$ with the help of (3.44). Given the regularisation issue, however, I often encountered the case that the neural uncertainty term was larger than $C_0$, leading to negative variances in $C_{in}$, which obviously does not make sense. As a simple workaround, the $G^T H^{-1} G$ term can be omitted in such cases, re-estimating $C_{in}$ by setting it to the $C_0$ values; however, such an approach is likely to generate further uncertainties and problems. Unfortunately, the noise on the target data in my inverse problem is small, and consequently, I was confronted with this problem during the application of the method to the actual retrieval problem. I will come back to this topic in chapter 4.

For future investigations, it would be interesting to follow the suggestions of MacKay (1995) and compare the performance of the training algorithm using a full diagonal $A_r$ with an individual hyperparameter for each weight to its performance when only a few scalar hyperparameters for distinct groups of weights are used. For some test runs I observed large fluctuations of some elements of $A_r$, and it might be worth testing whether suppressing such fluctuations can influence the training performance.

In conclusion, it seems to require a lot of skill and experience with neural networks in order to train a network well and to interpret the results of the error estimation correctly.
Given the dependence of the magnitude of the error bars on the training data density and the inability of the method to recognise ambiguities, the usefulness of the uncertainty estimates for the retrieval problem is limited. This is especially true when considering the numerical problems in the implementation. Nevertheless, the obtained uncertainties will give a general idea of the certainty in the neural network fit. The Jacobian and its variability, on the other hand, provide powerful tools to analyse a network. Hence, at least this part of the uncertainty estimation framework should provide useful results for the retrieval ANNs.

Chapter 4

Retrieval, Results and Evaluation

In the previous chapters, I presented the theory and design of the forward model and the neural network techniques to be used in the retrieval algorithm. In this chapter, I will report on the application of these methods to the actual retrieval.

The satellite images of the July 11, 2001, scene as well as the corresponding LUT setup will be described in section 4.1. Not all pixels observed by the satellite were overcast. In section 4.2, I discuss the operational MODIS cloud mask product and report on problems I encountered with a number of pixels classified as overcast, but exhibiting BTD values outside the expected range. The design of the neural network is the topic of sections 4.3 and 4.4. I also discuss how the Jacobian can be used to analyse the retrieval performance of a given network architecture, present the sensitivity of the retrievals to cloud top pressure (cf. section 2.6) and check on their physical plausibility. Good results were obtained with a network architecture containing 15 neurons in the hidden layer that used the brightness temperatures of channels 20 (3.7 µm) and 31 (11 µm), the BTD of channel 31 to 32 (12 µm) and surface temperature as inputs.
This network's results will be evaluated in section 4.5 by comparison with the DYCOMS-II in-situ data and an analysis of its Jacobian. I conclude this chapter with a discussion of my findings in section 4.6.

4.1 Retrieval Setup

4.1.1 The Test Scene

The satellite image of the test scene was taken by the MODIS instrument aboard NASA's Terra satellite at 6:25 UTC on July 11th, 2001, corresponding to a local time of 11:25pm PDT (Pacific Daylight Time). [39] Figure 4.1 shows the BTD(3.7-11), as well as the 11 µm brightness temperature (hereafter BT(11)) images. The black circles in the BTD(3.7-11) image mark the flight track during this night. The satellite overpass did not coincide with the in-situ measurements, which were taken approximately five hours after the MODIS images were recorded. A mean wind speed of about 8 m s$^{-1}$ from the north-west (310°N) was observed during the flight (Stevens et al., 2003b), so that the clouds within the dashed rectangle shown in Figure 4.1 (top) have likely been advected into the flight area. They will serve for comparison with the in-situ data.

[39] In 2001, only the Terra satellite was operational. Hence, no image from Aqua, which would have been approximately six hours later, was available.

The scene contains two ship tracks, discernible in the BTD(3.7-11), but not in the BT(11) image. This can be explained by noting the much smaller single scatter albedo in the thermal infrared in Figure 1.8. This means that the thermal cloud top radiances are mainly a function of cloud emission, and the impact of the scattering term in (2.8) becomes small. Since the LWP of the cloud stays constant across the ship tracks, the emission does not change.
As noted in Chapter 2, the ship tracks will be used to check the physical consistency of the retrieval.

Figure 4.1: (Top) Brightness temperature difference (BTD) between channel 20 (3.7 µm) and channel 31 (11 µm) at 6:25 UTC on July 11th, 2001. The ship tracks have a smaller BTD than their environment, as expected from Figure 2.11. The black circles mark the flight track during this night. The in-situ measurements were taken approximately five hours after the satellite overpass, so that the clouds within the dashed rectangle have likely been advected into the flight area and will serve for comparison. (Bottom) The same for channel 31 brightness temperature. Note that the ship tracks are not discernible in this channel.

4.1.2 Lookup Table Setup

As discussed in Chapter 2, the forward model requires the input of four variable (r_eff, N, p_ct, T_ct) and three fixed (p_sfc, ν_gam, k-value) parameters. These parameters were obtained from the in-situ measurements. During RF02, cloud top effective radii ranged from approximately 8 to 16 µm and droplet number concentrations varied between 25 and 115 cm^-3. Cloud top temperature in the flight area varied only slightly between 284 and 285 K, and a fairly constant cloud top pressure of 939.5 hPa was observed (not shown). In order to account for a potentially larger variability in the entire satellite scene, the LUT was generated with effective radii ranging from 3 to 23 µm, cloud top temperatures from 280 to 288.5 K, and droplet concentrations from 20 to 200 cm^-3. Cloud top pressure was fixed at the observed value of 939.5 hPa, as was the surface pressure at 1016.8 hPa.

In order to find representative values for the k-value and ν_gam (cf. section 2.3), I created statistics of the droplet size distributions that were encountered during the flight. Figure 4.2 shows scatterplots of k-value (obtained from equation (2.19)) versus height. As noted in section 2.1, the data obtained during RF02 cover only cloud top and bottom. The k-values that are found cover ranges similar to those found by Pawlowska and Brenguier (2000) during ACE-2 (Figure 2.5). A similar picture was obtained from the RF03 data (right panel). Figure 4.3 displays histograms of both k-value and ν_gam inferred from the RF02 measurements. Since the radiative transfer through the entire cloud is simulated, I decided to take k as the average of cloud top and bottom values. From the distributions shown, I chose a k of 0.8. This corresponds roughly to ν_gam ≈ 26, which I used to compute the Mie tables. Note that these values are considerably larger than the average values found by Miles et al. (2000).

Figure 4.2: k-value vs. height as inferred from the DYCOMS-II in-situ measurements of (left) July 11 and (right) July 13, 2001. [Panel titles: RF02, 07/11/2001 06:24:40-15:52:35; RF03, 07/13/2001 06:18:23-15:46:03. Horizontal axis: k value.]

Figure 4.3: Histograms of (left) ν_gam and (right) k-value as inferred from the DYCOMS-II in-situ measurements of July 11, 2001. The black histograms represent cloud top values, the white ones cloud base values. [Horizontal axes: gamma distribution shape parameter; k value.]

The optical properties of the overlying atmosphere were computed from the San Diego sounding displayed in Figure 2.9. The sounding was recorded at 12:00 UTC, also about five hours after the satellite overpass.
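The parameter ranges quoted in this section can be written down as a small sampling routine. The following is only a sketch: the ranges and fixed values are taken from the text, but drawing the samples uniformly at random, the function name `sample_lut_inputs` and the dictionary keys are assumptions made for illustration, and the call to the forward model itself is omitted.

```python
import numpy as np

# Variable parameter ranges quoted in the text for the July 11 LUT.
RANGES = {
    "r_eff_um": (3.0, 23.0),    # cloud top effective radius [micrometres]
    "T_ct_K": (280.0, 288.5),   # cloud top temperature [K]
    "N_cm3": (20.0, 200.0),     # droplet number concentration [cm^-3]
}
# Parameters held fixed for the whole LUT, as quoted in the text.
FIXED = {"p_ct_hPa": 939.5, "p_sfc_hPa": 1016.8, "k": 0.8}


def sample_lut_inputs(n_samples, seed=0):
    """Draw parameter sets uniformly within RANGES (assumed sampling scheme)."""
    rng = np.random.default_rng(seed)
    samples = {
        name: rng.uniform(lo, hi, n_samples) for name, (lo, hi) in RANGES.items()
    }
    for name, value in FIXED.items():
        samples[name] = np.full(n_samples, value)
    return samples


lut_inputs = sample_lut_inputs(1000)
```

Each sampled parameter set would then be passed to the forward model to obtain the corresponding brightness temperatures.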
4.2 Cloud Mask

In order to confirm the accuracy of the forward computations, it is important to test how the computed brightness temperatures compare to the observations. If the forward model represents a reasonable approximation to the actual clouds, the observed BT/BTD values should lie well inside the ranges defined by the LUT. However, if BT/BTD values outside the defined ranges are encountered, the corresponding pixels have to be flagged as "irretrievable", since the inverse function is not defined for such cases [40].

It is likely that some pixels in the scene will contain clear sky or broken clouds. Multiple scattering from cloud sides and inhomogeneous mixtures of cloud and clear sky mean that I expect some pixels in these broken regimes to exhibit radiances not reproducible with the plane parallel forward model. A scatterplot of the forward model computations, overlain with the MODIS observations, indeed showed a large number of outliers (not shown). Hence, the operational MODIS cloud mask product (Ackerman et al., 1998) was employed to filter the satellite image for fully cloud covered pixels.

4.2.1 MODIS Cloud Mask and Irretrievable Pixels

A simple way to discriminate cloudy from clear sky pixels, and the approach used in the studies of Perez et al. (2000), Gonzalez et al. (2002) and Cerdena et al. (2007), is the spatial coherence method proposed by Coakley and Bretherton (1982). The algorithm uses information from neighbouring pixels to assess the spatial homogeneity in the observations. By computing mean BT and standard deviation from small clusters of pixels (e.g. 2-by-2), homogeneous (i.e. low standard deviation) areas with cold temperatures are classified as cloudy and homogeneous areas with warm temperatures as clear.
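The idea behind the spatial coherence method can be illustrated with a few lines of NumPy. This is a minimal sketch of the classification step as described above, not the algorithm of Coakley and Bretherton (1982) in full; the function name and the two thresholds (`std_max`, `bt_cloud_max`) are hypothetical values chosen only for the example.

```python
import numpy as np


def spatial_coherence_mask(bt, block=2, std_max=0.5, bt_cloud_max=288.0):
    """Classify block x block clusters of a brightness temperature image.

    Returns one label per cluster: 1 = cloudy (homogeneous and cold),
    0 = clear (homogeneous and warm), -1 = inhomogeneous (left unclassified).
    The thresholds are illustrative assumptions.
    """
    ny, nx = bt.shape
    ny, nx = ny - ny % block, nx - nx % block  # crop to whole blocks
    blocks = bt[:ny, :nx].reshape(ny // block, block, nx // block, block)
    mean = blocks.mean(axis=(1, 3))
    std = blocks.std(axis=(1, 3))

    labels = np.full(mean.shape, -1, dtype=int)
    homogeneous = std <= std_max
    labels[homogeneous & (mean <= bt_cloud_max)] = 1
    labels[homogeneous & (mean > bt_cloud_max)] = 0
    return labels


# Synthetic 4x4 scene [K]: cold block, warm block, mixed block, cold block.
bt_scene = np.array([
    [285.0, 285.1, 292.0, 292.1],
    [285.2, 285.0, 292.0, 291.9],
    [285.0, 292.0, 285.0, 285.1],
    [292.0, 285.0, 285.2, 285.0],
])
labels = spatial_coherence_mask(bt_scene)
```

In this synthetic scene the mixed block is left unclassified, which corresponds to the broken or partly cloudy pixels that cause problems later in this section.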
Since Coakley and Bretherton (1982), several other methods have been proposed to recognise cloudy pixels in satellite images (Ackerman et al., 1998, and references therein). The operational MODIS cloud mask product combines several of these methods into one product, selecting amongst a variety of tests optimised for different underlying surfaces (i.e. water, different land surfaces, ice) and employing several of the available visible (daytime) and infrared (day and night) channels. A description can be found in Ackerman et al. (1998).

The top panel of Figure 4.4 shows a map of the cloud mask for the July 11 scene. The image pixels are classified into four categories: cloudy, uncertain clear, probably clear, and confident clear. Application of the cloud mask to the scene significantly reduced the number of points outside the LUT-defined BT/BTD range. However, after I eliminated all pixels that were not classified as cloudy, the brightness temperature difference diagrams still contained many outliers (Figure 4.5).

Footnote 40: Note that the neural network will still retrieve some values for such undefined inputs, so that it is important to remove the corresponding pixels from the scene in order to avoid unphysical retrievals.

Figure 4.4: (Top) MODIS cloud mask product for the scene. All black pixels are classified as cloudy, the orange (grey in black and white) pixels are uncertain clear, and the white pixels are probably clear. (Bottom) Cloud mask derived from the BT/BTD range in the LUT (all observations that are outside the LUT-defined BT/BTD region in Figure 4.5 are marked as dark red). Note that the MODIS cloud mask has "grown"; are the additional pixels broken clouds?

As noted in section 1.5, Baum et al. (2003) also employed the 8.5 µm signal (MODIS channel 29) in their work. Unfortunately, the observed brightness temperatures for channel 29 were entirely outside of the range computed by my forward model (Figure 4.6). Currently, I can only conclude that there is an error in the forward model, and consequently I will drop channel 29 from the retrieval scheme. It would, however, be desirable to investigate the cause of the failure and to include the 8.5 µm information in future retrieval designs.

Yet what causes the large number of outliers in the brightness temperatures of the remaining channels 20 (3.7 µm), 31 (11 µm) and 32 (12 µm)? The bottom panel of Figure 4.4 shows the scene with all pixels outside the LUT-defined BT/BTD range removed. The cloud mask basically grows, which could be an indication of more broken or otherwise inhomogeneous clouds at the edges of the clear areas.

4.2.2 Possible Failure Mechanisms

There are several possibilities for the failure of the forward model to match the observed brightness temperatures. Besides inhomogeneous clouds, it is possible that other assumptions in the forward model do not match the real clouds accurately enough. For instance, the anomalous pixels could contain subadiabatic clouds or clouds with a significantly different cloud top height. Effective radii, droplet number concentrations and temperatures outside the LUT ranges seem unlikely. The majority of the "irretrievable" pixels in Figure 4.5 exhibit both a larger BTD(3.7-11) and BTD(11-12) signal than defined by the LUT, while the BT(11) signal is well inside the computed range.
Warmer or colder cloud temperatures would cause the location of the failing points on the BT(11) axis to fall outside the computed range. Larger r_eff than defined would cause a larger BTD(3.7-11) signal, but at the same time a smaller BTD(11-12) observation (cf. Figure 2.11). Since droplet concentration is not a direct input into libRadtran (cf. section 2.6), changing droplet concentrations would influence r_eff and τ, the latter of which would merely displace the BT/BTD points within the defined ranges (cf. Figure 2.11). Another possibility is that the structure of the overlying atmosphere changed with time and location, so that the sounding recorded at San Diego is not representative for the entire scene. This could also include the presence of thin high level clouds (cirrus).

In order to investigate the problem of the irretrievable pixels, I analysed the pixels along a test line in an area that included clouds within the LUT-defined BT/BTD range, pixels classified as cloudy by the MODIS cloud mask but outside the defined BT/BTD range, and pixels classified as clear sky by the MODIS cloud mask. The line is shown in the bottom panel of Figure 4.4. The capital letters A and C mark pixels within the LUT-defined BT/BTD range, and B marks a clear sky pixel.

Figure 4.5: LUT (grey) and MODIS cloud mask filtered scene brightness temperatures. Many pixels are still outside of the BT range contained in the LUT. These pixels are not defined if a retrieval network is trained with the LUT data, hence they have to be flagged as "irretrievable".

Figure 4.6: The same as Figure 4.5, but for the brightness temperature difference between channels 29 and 31 (8.5 and 11 µm). Channel 29 BTs are 1-5 K cooler than the MODIS measurements. Channel 29 hence could not be used for the retrievals. [Horizontal axis: brightness temperature ch. 31 (11 µm) [K].]

I first tested for broken clouds. Luo et al. (1994) showed that broken clouds exhibit a characteristic pattern when the observed 11 µm radiances (not brightness temperatures) are plotted against the 12 µm radiances. In particular, if a given area contains clear sky pixels, broken clouds and fully overcast pixels, the entire set of observations resembles a continuous curve in the diagram between cloudy and clear pixels. Figure 4.7 shows an 11 versus 12 µm radiance plot for the sections from A to B (left panel) and from B to C (right panel). The section between B and C clearly contains broken clouds. This area already contains many pixels classified as "uncertain clear" by the MODIS cloud mask product, so that I conclude that inhomogeneities on the sub-pixel scale are likely to be responsible for the failures.

The other section between A and B, however, does not show the characteristic broken cloud signature in the radiance plot. Yet a large number of the pixels along this line are irretrievable. In order to extend the analysis, I examined the observed BTD(3.7-11) and BTD(11-12) signals (Figure 4.8). Similar to the radiance plot in Figure 4.7, the observed values cluster around the cloudy and clear foot. The elevated BTD(3.7-11) signal places almost all points outside of the LUT-defined BT/BTD range (with point A just at the edge); the pixels clustering around the cloudy BTD(11-12) signal, however, are all well inside the computed range. A higher cloud top seems to be an unlikely cause of such a behaviour.
As noted in Chapter 2, the differences in the BTD signals due to varying cloud top pressure in the forward model are mainly due to the change in sea surface temperature produced by the new adiabatic profile (cf. Figure 2.12). However, the existence of a much warmer T_sfc between A and B compared to B and C seems unlikely. Also, both BTD(3.7-11) and BTD(11-12) would be impacted by such a change [41]. Other causes for an increased p_ct could be equally thick, but elevated clouds, or geometrically thicker clouds. The first possibility would not change the optical properties of the cloud, and the second mechanism would merely lead to increased τ and r_eff.

Figure 4.7: 11 µm versus 12 µm radiance plots after Luo et al. (1994) for the observations along the line shown in Figure 4.4. Left panel (a): the section from A to B; right panel (b): from B to C. The solid lines connect the observed radiances at A and B (left) and B and C (right), respectively, and are only shown as references to the point clouds. The observations between points B and C follow a typical signature of broken clouds. [Horizontal axes: radiance ch. 31 (11 µm).]

Figure 4.8: BTD(3.7-11) and BTD(11-12) signals of the pixels between points A and B as shown in Figure 4.4. As in the radiance plot in Figure 4.7, the observed values cluster around the cloudy and clear foot. It is possible that this behaviour is caused by thin overlying cirrus clouds. [Horizontal axes: brightness temperature ch. 31 (11 µm) [K].]

Baum et al. (1994) and Baum et al. (2003) showed that overlying cirrus clouds lead to an increase in both the BTD(3.7-11) and BTD(11-12) signals. They also showed that thin cirrus has a relatively small impact on the BTD(11-12) signal, making thin cirrus a likely cause of the lookup table failure along AB. However, further investigations would have to be conducted to give this presumption more confidence. Also, possible effects of subadiabatic clouds should be examined in the future. For this thesis, I will continue to work with those pixels that fall within the LUT-defined BT/BTD range.

4.3 Network Training

4.3.1 Network Architecture

As described in Chapter 3, I restricted myself to the two-layer perceptron design. Cerdena et al. (2007) found that a three-layer architecture exhibited a better generalisation performance in their study. However, since a two-layer network should already be able to model arbitrary functions (cf. Chapter 3), the generalisation of the software to three-layer architectures was not a priority of my work. With respect to the inputs, decisions to be made included whether to provide the satellite observations as radiances or as brightness temperatures to the network, the inclusion of sea surface temperature as input, and how the inputs should be preprocessed.

Another difficult problem was selecting a good number of hidden neurons. Aires (2004) used a two-layer perceptron with 30 neurons in the hidden layer in their example of microwave remote sensing, and as noted in Chapter 1, Cerdena et al. (2007) employed 20 neurons in the first and five neurons in the second layer. Motivated by these values, the number of hidden neurons in my architectures will be of the same order of magnitude.
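To make the architecture and the Jacobian used throughout this chapter concrete, the following sketch evaluates a two-layer perceptron and its analytic Jacobian, and checks the latter against central finite differences. It is an illustration only: the tanh hidden units with linear outputs are an assumption (a common choice for regression networks), the weights are random rather than trained, and the function names are hypothetical.

```python
import numpy as np


def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer perceptron: tanh hidden units, linear outputs (assumed)."""
    hidden = np.tanh(W1 @ x + b1)
    return W2 @ hidden + b2


def mlp_jacobian(x, W1, b1, W2, b2):
    """Analytic Jacobian d(output_i)/d(input_j) at the input vector x."""
    hidden = np.tanh(W1 @ x + b1)
    # d tanh(a)/da = 1 - tanh(a)^2 scales each row of the first-layer weights
    return W2 @ ((1.0 - hidden**2)[:, None] * W1)


rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 30, 3  # four inputs, 30 hidden neurons, three outputs
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)

x = rng.normal(size=n_in)
J = mlp_jacobian(x, W1, b1, W2, b2)  # shape (n_out, n_in)

# Central finite differences as an independent check of the analytic result
eps = 1e-6
J_fd = np.column_stack([
    (mlp_forward(x + eps * e, W1, b1, W2, b2)
     - mlp_forward(x - eps * e, W1, b1, W2, b2)) / (2.0 * eps)
    for e in np.eye(n_in)
])
```

Each row of `J` holds the sensitivities of one retrieved quantity to the inputs, which is the form in which the Jacobian is tabulated later in this chapter.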
It did not seem feasible within the timeframe of this Master's thesis to systematically train, analyse and compare a large number of different network architectures in order to find the best possible one. Instead, after some preliminary try-outs, I selected some architectures that I will present in more detail in this chapter. These networks yielded the most interesting results. The methods applied to analyse these networks, however, can readily be used for any other network architectures to be employed in future work.

Footnote 41: The data in Figure 2.12, computed with a similar cloud top temperature as encountered in the July 11 scene, also showed that a relatively large increase in p_ct is necessary in order to increase the BTD(3.7-11) signal (40 hPa between the solid and dotted curve in Figure 2.12). Also, the resulting relative change in the BTD(11-12) signal is stronger than in the BTD(3.7-11) signal.

The authors of all works presented in Chapter 1 that investigated the nocturnal retrieval case employed satellite observations in the form of brightness temperatures, while the daytime retrieval techniques in general use radiances. In fact, my tests showed that networks using radiances as inputs were unable to approximate the inverse function (possibly because channel differences provide the most information about cloud optical properties; I did not investigate the use of radiance differences as inputs). Therefore, and in order to stay consistent with the published literature, I chose to use brightness temperatures instead of radiances as inputs (this also facilitates the analysis of the networks in sections 4.4 and 4.5). Unfortunately, the training process generally was unstable, as I will discuss in section 4.4.
Training of networks of a given architecture led to very different results when the weights were initialised differently at the beginning of the training process, a property that I attribute to the inverse problem being ill-posed (cf. Chapter 3). This is also reflected in a large variability in the Jacobian (see the following section). Consequently, my goal for this thesis is not to find the best possible architecture, but to show that the method is working in principle and to demonstrate the use of the Jacobian and other tools to compare architectures and to evaluate the performance of a given network.

4.3.2 Failure of the Aires et al. Hyperparameter Re-Estimation

The estimation of the matrix hyperparameter A_in also often failed during network training. As discussed in section 3.7, the neural uncertainty term G^T H^-1 G in general was larger than the error covariance matrix C_0, leading to negative variances in C_in. The problem was encountered with both regularisation methods described in section 3.6. Aires (2004) pointed out that the estimation of the Hessian H from the training dataset has to be done with a large enough number of datapoints in order to avoid numerical problems. However, increasing the number of samples in the LUT from initially 30,000 (20,000 for training and 10,000 for validation) to 96,000 (64,000 and 32,000, respectively) did not improve the situation (for comparison, Aires (2004) used 15,000 samples for training and 5,000 for testing, and Cerdena et al. (2007) 20,000 and 10,000, respectively).

The true cause of the failures currently remains unclear. Further research is needed to clarify whether the modification of the Hessian due to the regularisation is the important factor, or whether the noise level on the target data in the LUT is too small to be treated with the Aires et al. method.
As noted in section 3.1, there is no noise on the target data except for the expected ambiguities. However, as I did not investigate the actual magnitude of such ambiguities, I cannot rule out the possibility that they are small.

Due to these problems I eventually decided to train the networks with the original scalar hyperparameter approach implemented in NETLAB (cf. section 3.2). Using the LUT containing 96,000 datapoints and the Nabney (2002) eigenvalue regularisation (cf. section 3.6), I was able to estimate a C_in from the trained network in some cases, although often this strategy failed as well. However, the estimation of the Hessian and the network gradient were always possible, so that the variability of the Jacobian as well as the neural uncertainty term G^T H^-1 G could be computed.

4.3.3 Input Preprocessing

As discussed in Chapter 3 and expected from the findings of Aires et al. (2004b), correlations among the input variables were a problem. The observed brightness temperatures of the three employed channels were all correlated amongst each other with (linear) correlation coefficients larger than 0.8. This led to widely varying Jacobians, and it was practically impossible to obtain a network fit that approximated the lookup table well. The input data were thus decorrelated and normalised with the whitening procedure described in Chapter 3, leading to better results.

4.4 Network Architecture: Inputs and Hidden Neurons

4.4.1 Brightness Temperature Differences

Despite the input preprocessing, it appeared to be difficult for the ANNs to infer the brightness temperature differences from the BT inputs. Especially the close correspondence between the 11 and 12 µm channels seemed to have a negative effect on the stability of the training process.
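As a reminder of what the decorrelation and normalisation of the inputs in section 4.3.3 involves, here is a minimal sketch of a whitening transform based on the eigendecomposition of the input covariance matrix. It assumes a standard PCA whitening and is not necessarily identical to the procedure of Chapter 3; the function names and the synthetic brightness temperatures are hypothetical.

```python
import numpy as np


def fit_whitening(X):
    """Estimate the mean and a whitening matrix from the rows of X."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    # Project onto the eigenvectors and scale every component to unit variance
    W = eigvec / np.sqrt(eigval)
    return mean, W


def whiten(X, mean, W):
    return (X - mean) @ W


# Synthetic, strongly correlated "brightness temperatures" of three channels [K]
rng = np.random.default_rng(0)
base = rng.normal(285.0, 2.0, size=5000)
X = np.column_stack([base + rng.normal(0.0, 0.3, size=5000) for _ in range(3)])

mean, W = fit_whitening(X)
Z = whiten(X, mean, W)
corr_before = np.corrcoef(X, rowvar=False)
cov_after = np.cov(Z, rowvar=False)
```

After the transform the inputs have zero mean and an identity covariance matrix, so no input is a near-copy of another from the point of view of the network.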
In fact, I was not able to produce a reasonable fit to the LUT with networks that used BT(11) and BT(12) (12 µm BT) as inputs.

To analyse the problem, I computed the Jacobians of individual points in the LUT in order to compare the sensitivities to expected values. Figure 4.9 shows the BT/BTD diagrams of a subset of the July 11 LUT. In addition to the cloud top pressure, cloud top temperature is also held constant at 285 K. The effective radius varies in steps of two microns from 4 µm to 12 µm, as in Figure 2.11.

Figure 4.9: Subset of the July 11 LUT. In addition to the cloud top pressure (939.5 hPa), the cloud top temperature is also held constant at 285 K. Compare visually obtained estimates of the Jacobian for r_eff ≈ 8 µm and BT(11) ≈ 288 K to the computed values in Tables 4.1, 4.2 and 4.3. The symbols correspond to the optical thicknesses defined in Figure 2.11. [Horizontal axis: brightness temperature ch. 31 (11 µm) [K].]

Since all variables are connected with each other in a nonlinear way, it is difficult to assess Jacobian values visually from the plots in Figure 4.9. Of course, it is possible to compute finite difference derivatives from the LUT. However, finding the required datapoints for a sensitivity estimation constitutes a multidimensional optimisation problem itself. To compute the derivative of an output with respect to a given input, at least two points have to be found for which the other three inputs are constant. Even with 96,000 datapoints, this was not possible to a good enough accuracy (solving this problem, for instance with larger LUTs and interpolation, could constitute an important part of future research in this area).
I thus restricted myself to order-of-magnitude estimates from Figure 4.9 where possible. For instance, for an effective radius of 8 µm, if BT(11) is held constant at 288 K, ∂r_eff/∂BT(3.7) ≈ 2.5 µm/K. Similarly, ∂r_eff/∂BT(12) ≈ 18 µm/K. Cloud optical thickness decreases slightly with increasing BT(3.7), and increases with increasing BT(12) (∂τ/∂BT(3.7) ≈ -0.5 K^-1 and ∂τ/∂BT(12) is positive). If all cloud parameters are held constant, an increase in surface temperature will lead to a similar increase in cloud top temperature (cf. Figure 2.9). Since the cloud in question is thin (τ ≈ 1.3), the surface temperature signal is expected to be "visible" through the cloud, so that ∂T_ct/∂T_sfc ≈ 1. Likewise, T_ct should depend on BT(11) and BT(12); for thicker clouds, these two inputs almost entirely determine cloud top temperature (cf. section 4.1 and Figure 2.9).

Table 4.1 shows the Jacobian of a network trained with the entire LUT, estimated at the discussed coordinates of r_eff ≈ 8 µm and BT(11) ≈ 288 K. The ANN contained 30 neurons in the hidden layer, and in addition to BT(3.7), BT(11) and BT(12) also used surface temperature as an input (the T_sfc input will be discussed shortly). The variability was obtained by computing the Jacobians of 10,000 samples from the weights distribution, as described in section 3.5.

Table 4.1: Point estimate of the Jacobian of a network containing 30 neurons in the hidden layer and using BT(3.7), BT(11), BT(12) and T_sfc as inputs. Note the large variability of the Jacobian, indicating an ill-posed problem. Furthermore, the dependences on BT(11) and BT(12) are large and of opposite sign.

                ∂r_eff [µm K^-1]    ∂T_ct [K/K]      ∂τ [K^-1]
  ∂BT(3.7)        0.22 ± 4.72      -0.07 ± 1.58    -1.10 ± 9.83
  ∂BT(11)       -20.16 ± 20.92     -0.78 ± 7.38    -4.83 ± 34.53
  ∂BT(12)        19.11 ± 21.67      0.75 ± 8.30     5.56 ± 37.12
  ∂T_sfc          1.18 ± 3.37       1.09 ± 1.27     0.47 ± 6.05

Two features are particularly noticeable. First, there is a large variability in the computed values (i.e. an uncertain network fit), especially in the dependences on BT(11) and BT(12). Second, the dependences on these two channels are large and of opposite sign. Since BT(11) and BT(12) are so highly correlated, this behaviour models the dependence on the difference between the two channels (since BT(11) and BT(12) will always change by approximately the same value). However, the large values make the retrieval very sensitive to noise in the inputs, and there is no reason why they could not be much smaller or of the same sign but slightly different magnitude.

Large sensitivities of opposite sign for at least one output could be observed for all ANNs trained with BT(11) and BT(12) inputs (although the magnitude of the dependences and variability varied, depending on the minimum that was found in the weights error surface), and the performance of the networks was very sensitive to the initialisation of the weights and the number of hidden neurons. In an attempt to make the training process more stable, I replaced the BT(12) input with the BTD(11-12) signal, which indeed improved the training performance. It is worth noting that a similar replacement of the BT(3.7) input by the BTD(3.7-11) signal did not lead to comparable improvements.

The empirical Jacobian reported by Cerdena et al. (2007) (cf. Chapter 1) shows the same behaviour of opposite signs, but with smaller magnitudes (although they compute a scene-averaged Jacobian, whose values should not be compared with the point estimate given in Table 4.1). Their network seems to perform well (although no results are given in the paper); hence I conclude that in general it is possible for the ANN to model the correct dependences from the "raw" BT inputs, but that using the BTD(11-12) input improves the stability of the training process.

4.4.2 Three Input Networks

A different question was whether surface temperature, as used in all previous studies, is a necessary input. Perez et al. (2000) argued that T_sfc is necessary in order to compute the upwelling cloud base radiation, and Cerdena et al. (2007) added that the effects of water vapour in the atmosphere can be accounted for by using clear sky BTs of all input channels. I prescribe the optical properties of the overlying atmosphere, and the effects of subcloud water vapour are connected with the cloud parameters through the adiabatic model. However, it is true that three variables are not enough to uniquely specify an adiabatic cloud in my forward model. If cloud top pressure is held constant, four more parameters are needed to compute a profile (for instance, cloud top T, r_eff, liquid water content and either surface pressure or temperature). Nevertheless, since surface temperature does not belong to the direct satellite measurements, it was worth testing the behaviour of the networks if no T_sfc input is used.

The retrieval attempts using three inputs were not successful. Figure 4.10 shows scatter plots of the target variables in the LUT (cloud top r_eff, T_ct and τ) versus ANN predictions, for both a three input network and a four input network that made use of T_sfc.
Such scatter plots provide a good way of visualising whether the network is able to approximate the target data in the training or validation database. If the network predictions are perfect, the plots will take the shape of a straight line. The wider the scatter around this line, the worse is the fit. Obviously, intrinsic noise on the target data also creates scatter. The plots in Figure 4.10 lead to the conclusion that the three input network is not able to produce good predictions of r_eff and T_ct; only τ is predicted reasonably well. The ANN that produced the displayed results had 30 neurons in the hidden layer, but varying this number did not improve the performance.

Nevertheless, in order to rule out that the wide scatter is caused by ambiguities in the LUT, with the four input network merely overfitting the data, I applied the three input ANN to the actual scene. The retrieval results of cloud top temperature and cloud optical thickness are shown in Figure 4.11. While the optical thickness retrieval shows the expected signature of optically thicker clouds along the ship tracks, the tracks are also discernible in the temperature retrieval, exhibiting a higher temperature than their environment. This is not the expected physical behaviour: the aerosol particles contained in the ship exhaust should not influence the cloud temperature. Furthermore, many of the retrieved droplet sizes were negative (not shown).

A curious feature of all trained networks is that they are able to retrieve optically very thick clouds from the LUT. Due to the saturation effect discussed in section 1.5, I had expected that target optical thicknesses above a certain threshold would be retrieved as the threshold value.
It is currently unclear whether there actually is enough information in the input data to infer the large optical thicknesses or whether the networks are overfitting in this case. Since the scene of July 11, 2001, does not contain clouds with (retrieved) τ larger than about 6, no anomalies can be found in the retrieval (cf. section 4.5). However, this behaviour should be further investigated in the future.

4.4.3 Four Input Networks

The scatter plots of the four input network in Figure 4.10 showed that including T_sfc as an input significantly improves the ability of the network to fit the lookup table well. In contrast to Cerdeña et al. (2007), I used the actual surface temperature as a single input, not the clear sky brightness temperatures of all three employed channels. Using T_sfc as an input at first seems to be a significant restriction on the usefulness of the retrieval, since clear sky pixels are not always available in the near vicinity of the clouds whose properties are to be retrieved. However, sea surface temperature retrievals are also possible from microwave imagers such as the Advanced Microwave Scanning Radiometer for the Earth Observing System (AMSR-E; Kawanishi et al., 2003) on-board the Aqua satellite (which also carries the second MODIS instrument). Such retrievals are largely independent of cloud cover, since clouds are semitransparent in the microwave. For this study, I used the T_sfc measurements taken during the DYCOMS-II research flight. On July 11, 2001, the surface temperature was approximately 19°C.

Figure 4.10: Training performance of a three input network (left column) employing the BT(3.7), BT(11) and BTD(11-12) signals, compared to a four input network (right column) additionally making use of surface temperature. The scatter plots show the target data, as contained in the validation database, plotted against the corresponding network predictions. It is clearly visible that three inputs are not enough to approximate the inverse function.

Figure 4.11: Retrievals of cloud top temperature (top) and visible cloud optical thickness (bottom) of the three input network from which the scatter plots on the left side of Figure 4.10 were produced. The ship tracks are discernible in the T_ct retrieval, which is not the expected physical behaviour.

Figure 4.12: Mean-square-error of the validation dataset for the four input ANN (BT(3.7), BT(11), BTD(11-12), T_sfc) as a function of the number of hidden neurons. The networks were trained with four hyperparameter re-estimation cycles, each with 1500 optimisation steps.

The next open question was how many hidden neurons are needed to approximate the inverse function well without adding too many degrees of freedom to the training process. As mentioned above, the work by Aires (2004) and Cerdeña et al. (2007) suggested that a number on the order of 30 neurons should be sufficient. I thus trained a number of ANNs with different numbers of hidden neurons ranging from five to 100, and compared the MSE of the validation data. Figure 4.12 shows the decrease of the error with an increasing number of hidden neurons.
However, the curve is deceptive: the conclusion that a network with 100 hidden neurons performs better in the retrieval than a network with 15 hidden neurons proved wrong. In fact, of all networks that were trained, the most physically plausible results were obtained from the one containing 15 neurons in the hidden layer, although its MSE was higher than that of other networks. As noted in section 3.1, models of a higher complexity (i.e. a larger number of hidden neurons) are more susceptible to overfitting. However, the Bayesian regularisation in the training process should effectively prevent overfitting to noise in the target data (cf. section 3.2), and as noted above, there seems to be little noise in the LUT. Instead, I assume that the described behaviour, too, has to be attributed to the ill-conditioning of the inverse problem: if a network contains more hidden neurons, there exist more possibilities to map the inputs to the target data. More possible mappings also mean a larger number of possible dependences, hence it becomes more likely that a physically incorrect dependence is modelled. It is thus especially important in this case to find a network with an appropriate degree of complexity (cf. section 3.1).

In order to investigate the problem, I selected two networks with 15 and 30 neurons in the hidden layer, referred to as 15N and 30N hereafter, and analysed some of their properties.[42] Although 30N produced a better fit to the LUT (lower MSE), the retrieval of the July 11 scene yielded many unphysical results (for instance, negative optical thicknesses). Tables 4.2 and 4.3 show point estimates of the Jacobian at r_eff ≈ 8 µm and BT(11) ≈ 288 K, as in Table 4.1, for 15N and 30N.
The most striking difference is the much larger variability in the 30N Jacobian, indicating that a less well-defined minimum in the weights error surface has been found. This increases the probability of an incorrect mapping. Furthermore, while the sensitivities of r_eff to the four inputs are similar for both networks (and at least ∂r_eff/∂BT(3.7) is in the expected range), those of T_ct and τ are different. For instance, as expected, 15N retrieves a good part of the T_ct output from the BT(11) signal. 30N, on the other hand, infers cloud top temperature almost exclusively from the surface temperature input (since the variation in BTD(11-12) is so small).

Table 4.2: The same as Table 4.1, but for a network containing 15 neurons in the hidden layer and using BT(3.7), BT(11), BTD(11-12) and T_sfc as inputs (referred to as 15N in the text).

                 ∂r_eff/∂· [µm/K]    ∂T_ct/∂· [K/K]    ∂τ/∂· [K^-1]
∂BT(3.7)          2.04 ± 1.03        -0.42 ± 0.75      -0.78 ± 1.26
∂BT(11)          -3.36 ± 1.22         0.45 ± 1.34       0.37 ± 1.50
∂BTD(11-12)      -6.47 ± 3.30        -3.32 ± 2.34      -5.31 ± 4.17
∂T_sfc            1.35 ± 0.36         1.04 ± 0.61       0.49 ± 0.41

Table 4.3: The same as Table 4.1, but for a network containing 30 neurons in the hidden layer and using BT(3.7), BT(11), BTD(11-12) and T_sfc as inputs (referred to as 30N in the text).

                 ∂r_eff/∂· [µm/K]    ∂T_ct/∂· [K/K]    ∂τ/∂· [K^-1]
∂BT(3.7)          1.93 ± 4.32        -0.15 ± 1.46      -2.05 ± 10.33
∂BT(11)          -3.19 ± 5.90         0.07 ± 2.53       1.80 ± 16.08
∂BTD(11-12)      -7.11 ± 15.01       -1.20 ± 7.21     -11.96 ± 38.32
∂T_sfc            1.34 ± 2.60         1.08 ± 1.20       0.50 ± 7.71

Since, as noted above, I do not have further independent estimates available for comparison, I cannot judge how reasonable the remaining sensitivities are.

[42] The networks discussed here were trained with three hyperparameter re-estimation cycles with 1500 optimisation steps each, unlike the networks that were used to produce Figure 4.12.

In order to gain more insight into the response of the two networks, I analysed the sensitivity of the retrieval to changes in cloud top pressure (or to the pressure difference between cloud top and sea surface, cf. section 2.6). This is important because of the sensitivity of the computed brightness temperature differences to changes in cloud top pressure in the forward model, as discussed in section 2.6. Since p_ct is fixed in the LUT, it is necessary to know about the retrieval response to changes in this parameter. Of course, in the forward model, changes in p_ct will cause changes in the other cloud properties through the adiabatic assumption. It is hence quite possible that for a cloud with a different p_ct encountered in the scene, a cloud of similar thickness, particle size and temperature, but different cloud top pressure, is contained in the LUT and the desired parameters can be retrieved correctly. Nevertheless, a sensitivity analysis can provide further insight.

This sensitivity of the retrieval to changes in p_ct cannot be determined with the Jacobian, since cloud top pressure is not an input variable. I thus repeated the forward model runs that produced the subset of the training LUT plotted in Figure 4.9, but this time the cloud top pressure was changed by ±5 hPa. Figures 4.13 and 4.14 show scatter plots of the forward propagation of these datasets through 15N and 30N, in both cases compared to the results obtained with the original LUT, fixed at a cloud top pressure of 939.5 hPa.
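The perturbation experiment described above can be sketched in a few lines. The example below is only a minimal illustration: `toy_forward` and `toy_retrieve` are hypothetical linear stand-ins for the forward model and a trained network (they are neither libRadtran nor the 15N or 30N networks), and all array shapes are assumptions. It shows the procedure itself: rerun the forward model with the fixed parameter shifted, propagate the result through the unchanged retrieval, and compare with the unperturbed output.

```python
import numpy as np

def perturbation_response(retrieve, forward, states, p_ref, dp):
    """Change in the retrieved outputs when the fixed parameter is shifted by dp."""
    baseline = retrieve(forward(states, p_ref))
    perturbed = retrieve(forward(states, p_ref + dp))
    return perturbed - baseline

# Hypothetical linear stand-ins, chosen only so that the example runs.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # maps 3 cloud parameters to 4 network inputs
b = rng.normal(size=4)        # response of the inputs to cloud top pressure
W = rng.normal(size=(4, 3))   # maps 4 inputs back to 3 retrieved parameters
P_REF = 939.5                 # cloud top pressure fixed in the LUT [hPa]

def toy_forward(states, p_ct):
    # Simulated inputs for the given cloud states and cloud top pressure.
    return states @ A + 0.01 * (p_ct - P_REF) * b

def toy_retrieve(inputs):
    # Stand-in for the forward propagation through a trained network.
    return inputs @ W

states = rng.uniform(size=(100, 3))
for dp in (+5.0, -5.0):
    delta = perturbation_response(toy_retrieve, toy_forward, states, P_REF, dp)
    print(f"dp = {dp:+.1f} hPa, largest output change: {np.abs(delta).max(axis=0)}")
```

Because the retrieval is treated as a black box, the same procedure applies to any parameter that is fixed in the lookup table and therefore does not appear in the network Jacobian.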
While the sensitivity of effective radius and cloud top temperature is relatively small and of similar magnitude for both networks (±2 µm for r_eff and ±0.5 K for T_ct), the retrieved optical thickness of the 30N network changes drastically with changes in cloud top pressure (up to ±4), in contrast to 15N, where the τ retrieval is much less sensitive. Indeed, for higher cloud tops (decreased p_ct), 30N infers negative optical thicknesses for thin clouds. This behaviour might be a possible mechanism for the unphysical 30N retrievals.

Of course, the point estimates of the Jacobian presented in Tables 4.1, 4.2 and 4.3 can only be used to evaluate the sensitivity of the retrieval in the selected area of the LUT. Histograms of the Jacobian, on the other hand, can provide information on the range in which the sensitivity varies over the entire LUT or a given scene. Figure 4.15 shows such histograms of the 15N Jacobian over the database from which Figure 4.9 was created. As expected from Figure 4.9, the sensitivities vary significantly, with the larger magnitudes likely corresponding to the thick and thin clouds. A possible application of the Jacobian histograms could be to identify pixels for which the retrieving network exhibits an unreasonably large sensitivity. The retrieval could thus be further constrained to pixels for which the sensitivity is in the expected range.

Figure 4.13: Sensitivity of 15N predictions if cloud top pressure (fixed in the LUT) is varied by ±5 hPa. In the right column, predictions of the unperturbed subset of the training LUT plotted in Figure 4.9 are shown.
The left column shows the same network-predicted cloud parameters from LUT subsets in which p_ct was perturbed by +5 hPa (grey) and -5 hPa (black). See text for more details.

Figure 4.14: The same as Figure 4.13, but for the 30N network.

Figure 4.15: Histograms of the 15N Jacobian of the LUT subset plotted in Figure 4.9. Average values are indicated by the black lines.

4.5 Retrieval Evaluation, Jacobian and Uncertainty

4.5.1 Comparison with In-Situ Data

As discussed in the previous section, the 15N network yielded the best retrieval performance. In this section, I will analyse the retrieval by comparing the results to the RF02 in-situ data (cf. section 2.1) and by checking the physical plausibility. Figure 4.16 shows a map of the retrieved r_eff values, along with a histogram of the cloud top measurements from RF02 and a histogram of the retrieved values in the area.
Since the in-situ measurements were taken about five hours after the satellite overpass, the measured values have been advected and are plotted over the clouds that likely have been observed from the aircraft. By comparing against advected data I assumed that the clouds in question did not change over the five hours. Also, I assumed that the wind speed and direction at cloud level were constant at the values measured from the aircraft at the time the in-situ measurements were taken. While the cloud layer is both extensive and horizontally homogeneous (so that changes in wind speed and direction still advect clouds with similar characteristics), changes due to the diurnal cycle and precipitation are possible. The cloud top was likely lower earlier in the diurnal cycle when the MODIS image was taken (cf. section 1.2); however, the cloud top pressure employed in the LUT was taken at the aircraft altitude just below cloud top, thereby offsetting this change to some extent. Precipitation, on the other hand, was significant during RF02 (Stevens et al., 2003a, Figure 6); hence, it represents a possible source of uncertainty.

As noted in section 2.1, the DYCOMS-II data were averaged over 10 s intervals in order to yield measurements on the 1 km scale of the MODIS pixels. Due to the time lag between satellite and in-situ observations and the uncertainty in the advection, no collocated observations were possible. I hence chose distributions of the retrieved parameters within the box shown in Figure 4.16 and in-situ measured values from the cloud top flight tracks of RF02 as the means for comparison. Unfortunately, the selected area contains a large number of "irretrievable" pixels.

The range of the retrieved effective radii corresponds well to the range of the in-situ measured values, both ranging from about 8 µm to about 16 µm (Figure 4.16b,c).
The shape of the distribution, however, does not agree as well. Whereas the retrieved effective radius peaks at both about 9.5 and 13 µm, the aircraft measured values are more uniformly distributed, with several small peaks and one larger peak located at 11 µm. A possible cause for these discrepancies are changes in the structure of the cloud layer between the satellite and in-situ observations as described above. Also, pixels in the vicinity of "irretrievable" pixels might still be influenced by the mechanisms that cause the anomalous pixels, e.g. overlying cirrus. Furthermore, sea surface temperature was assumed to be constant in the scene. As I will show at the end of this section, the uncertainty in the neural inversion (G^T H^{-1} G, cf. section 3.3) is on the order of 2 µm in the selected area; however, as discussed below, the usefulness of this value is questionable.

Nevertheless, the retrieved values look physically plausible. The ship tracks are clearly discernible, with particle radii about 2 µm smaller than in the environment of the tracks, which is in the expected range (Schreier et al., 2006). The radii retrieved for the remaining scene also look reasonable for marine stratocumulus (values ranging from 5 to 15 µm, some structure present, but no abrupt changes).

Similar results are obtained for cloud top temperature, shown in Figure 4.17. As expected, the ship tracks are not discernible in the temperature retrieval. Instead, the cloud deck has temperatures varying only slightly between values of about 284 K and 286 K, with smooth transitions and no abrupt changes. The range of the retrieved temperature agrees well with the in-situ measurements, and for this variable the shape of the distribution also agrees well.
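The advection of the in-situ positions that underlies this comparison can be illustrated with a short sketch. It assumes a constant horizontal wind and a small displacement on a spherical Earth; the function name `advect_back`, the wind components and the coordinates are hypothetical values chosen for illustration and are not the RF02 measurements. A position observed some hours after the overpass is shifted against the wind to estimate where the same air mass was located at the time of the satellite image.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0

def advect_back(lat_deg, lon_deg, u_ms, v_ms, lag_s):
    """Shift a position against a constant wind (u eastward, v northward)."""
    lat = np.radians(lat_deg)
    dlat = -v_ms * lag_s / EARTH_RADIUS_M                  # radians of latitude
    dlon = -u_ms * lag_s / (EARTH_RADIUS_M * np.cos(lat))  # radians of longitude
    return lat_deg + np.degrees(dlat), lon_deg + np.degrees(dlon)

# Hypothetical example: a 10 m/s southerly wind and a five hour time lag.
lat0, lon0 = advect_back(lat_deg=31.0, lon_deg=-121.0,
                         u_ms=0.0, v_ms=10.0, lag_s=5 * 3600)
print(lat0, lon0)
```

With these example numbers the displacement is 180 km, which illustrates why even a modest error in the assumed wind translates into a position uncertainty of many 1 km pixels.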
The optical thickness of the clouds can only be judged by physical plausibility, since values of τ cannot be inferred from the in-situ measurements along the horizontal flight legs. As Figure 4.18 shows, values of τ vary from less than 1 to about 5, all reasonable values for marine Sc. The ship tracks are thicker by about one to two compared to their environment, and the majority of the clouds are thin (τ < 2).

4.5.2 Jacobian

In section 4.3, I introduced the application of point estimates of the Jacobian to compare the ANN sensitivities with independent estimates. However, the information in the Jacobian can also be used in different ways to analyse the retrieval network.

As noted in section 1.5, Aires et al. (2004b) investigated whether a PCA (principal component analysis) preprocessing of the input data can reduce the variability in the Jacobian. To obtain a variability representative of the entire retrieval scene, they computed an average Jacobian and its variability from all pixels contained in the scene. Furthermore, they normalised this mean Jacobian by the standard deviations of the input data (over the retrieval scene) to gain information about the relative importance of the individual inputs. They argue that the normalised Jacobian can be used to refine the inversion procedure by identifying inputs that do not contribute significantly to the outputs (which could consequently be eliminated).

Figure 4.16: Retrieved effective radii for the test scene (a) and comparison of retrieved (15N network) (b) and aircraft measured (c) values. The ship tracks are discernible with smaller droplet sizes than their environment. In-situ measurements of two cloud top flights have been advected and overlain on the retrieved data. Due to the uncertainty in the advection and the large number of irretrievable pixels in the flight area, the histogram of in-situ data is compared to a histogram of the surrounding retrieved values.

Figure 4.17: The same as Figure 4.16, but for cloud top temperature.

Figure 4.18: The same as Figure 4.16, but for cloud optical thickness. Note that for the optical thickness no in-situ measurements were available.

Table 4.4 shows the average Jacobian of the July 11, 2001, scene. The values are similar to those found for the point estimate in Table 4.2. On average, the dependence of r_eff on BT(11) is smaller than in Table 4.2, ∂τ/∂BT(11) is larger, and ∂r_eff/∂T_sfc and ∂τ/∂T_sfc are very small. Cloud top temperature is similarly dependent on both T_sfc and BT(11).

Table 4.4: Average Jacobian of the July 11, 2001, scene and its variability (15N network).

                 ∂r_eff/∂· [µm/K]    ∂T_ct/∂· [K/K]    ∂τ/∂· [K^-1]
∂BT(3.7)          1.47 ± 1.32        -0.22 ± 0.56      -0.94 ± 1.53
∂BT(11)          -1.23 ± 1.37         0.56 ± 0.80       1.07 ± 1.69
∂BTD(11-12)      -7.75 ± 6.11        -2.39 ± 2.32      -5.24 ± 7.86
∂T_sfc           -0.16 ± 0.54         0.62 ± 0.43      -0.04 ± 0.82

The normalised mean Jacobian is listed in Table 4.5. The standard deviations of BT(3.7), BT(11) and BTD(11-12) were obtained from the satellite data, while that of T_sfc was estimated from the RF02 measurements. The normalisation leads to some interesting results. Effective radius is indeed mainly determined by the BT(3.7) signal, while its dependence on BTD(11-12) is put into perspective. Cloud top temperature is still equally dependent on both BT(11) and T_sfc, which also contribute most to this output (BTD(11-12) is less significant). T_sfc contributes very little to both r_eff and τ, so that this input is likely needed mainly for a correct T_ct retrieval (cf. Figure 4.11).

Table 4.5: The same as in Table 4.4, but normalised by the standard deviation of the inputs. This allows for a better judgement of the importance of the individual inputs. The standard deviation of the surface temperature input has been estimated from the DYCOMS-II measurements.

                 ∂r_eff/∂· [µm]    ∂T_ct/∂· [K]    ∂τ/∂· [1]
∂BT(3.7)          2.36             -0.35           -1.51
∂BT(11)          -1.82              0.91            1.74
∂BTD(11-12)      -1.34             -0.42           -0.92
∂T_sfc           -0.20              0.78           -0.04

When inferring information about the importance of individual inputs from an averaged Jacobian, it is important that the average indeed is representative of the scene. To verify the representativeness of the Jacobian given in Tables 4.4 and 4.5, I computed histograms of the individual sensitivities of all pixels in the scene, as was done for the LUT subset in Figure 4.15. The histograms for the July 11 scene are displayed in Figure 4.19. The average Jacobian values correspond well with the most frequently occurring sensitivities, so that the relative Jacobian in Table 4.5 indeed gives a good idea of the information content of the individual inputs.

The last useful representation of the Jacobian that shall be discussed in this thesis is its spatial distribution, as shown in Figures 4.20 and 4.21 for effective radius and cloud top temperature, respectively. The spatial distributions of ∂r_eff/∂BT(3.7) and ∂r_eff/∂BTD(11-12) in Figure 4.20 show a coherent picture.
The absolute magnitudes of both sensitivities decrease across the ship tracks, which is expected from Figure 4.9 due to the increased space between the curves for clouds with effective radii of approximately 10 µm and optical thicknesses of about 3. Similarly, the sensitivities for the thin clouds in the area that served for comparison with the in-situ data exhibit a much larger absolute magnitude, consistent with the converging lines in Figure 4.9 for thin clouds.

A curious feature is that the negative dependence of T_ct on BT(3.7) decreases in magnitude with increasing τ (Figure 4.21, cf. Figure 4.18). At τ ≈ 3.5, the dependence becomes positive. At the same time, the positive ∂T_ct/∂BT(11) decreases with increasing τ. This sensitivity is largest for thin clouds. This behaviour can likely be attributed to the transition between thin clouds that influence the surface-emitted radiation very little and thick clouds that emit approximately as black bodies in the thermal infrared.

Figure 4.19: Histograms of the 15N Jacobian of the July 11, 2001, scene. The average values as listed in Table 4.4 are highlighted by the black lines.

Figure 4.20: Spatial distribution of the sensitivity of the effective radius retrieval to changes in BT(3.7) (top panel) and BTD(11-12) (bottom panel) on July 11, 2001 (15N network).

Figure 4.21: The same as Figure 4.20, but for the sensitivity of cloud top temperature to changes in BT(3.7) (top panel) and BT(11) (bottom panel).

Since p_ct was held constant in the LUT, the dependence of T_ct on T_sfc should be approximately 1 for very thin clouds (τ < 1): the cloud layer has little impact on the radiances, and due to the linear temperature decrease with height, changes in T_sfc lead to immediate changes in T_ct. The sea surface also approximately emits as a black body in the infrared (cf. the subcloud layer in Figure 2.10), thus the dependence of T_ct on BT(11) is also expected to be close to 1. Indeed, the sensitivity of T_ct to T_sfc is also largest for thin clouds (not shown).

If the clouds have reached an optical thickness large enough to emit approximately as black bodies at 11 µm (e.g. τ ≈ 43 in the right panel of Figure 2.10), ∂T_ct/∂BT(11) is again expected to be about 1. In between, however, BT(11) is influenced through absorption in the cloud. Independently of T_ct, the optically thicker the cloud, the smaller BT(11), as the cold cloud top becomes more opaque (cf. the left panel of Figure 2.10, with τ ≈ 4). Similar effects are expected to impact the BT(3.7) signal.

While I cannot verify this argumentation from the BT/BTD diagrams computed from my LUT (cloud top temperature is fixed in all plots), the corresponding plot by Perez et al. (2000) in Figure 1.9b confirms the dependence of T_ct on BT(11) for thick clouds (note that the surface temperature is constant in this plot).
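The Jacobian quantities used throughout this section (per-pixel Jacobian, scene average, variability, and the average scaled by the input standard deviations) can be sketched for a generic two-layer network. The code below is an illustration under stated assumptions rather than the networks of this thesis: it assumes a tanh hidden layer with linear outputs, uses random weights and a synthetic "scene", and the names `mlp_forward` and `mlp_jacobian` are hypothetical. Only the layer sizes (4 inputs, 15 hidden neurons, 3 outputs) mirror the 15N configuration.

```python
import numpy as np

def mlp_forward(X, W1, b1, W2, b2):
    """Two-layer network: tanh hidden layer, linear output layer."""
    return np.tanh(X @ W1.T + b1) @ W2.T + b2

def mlp_jacobian(X, W1, b1, W2):
    """Per-sample Jacobian d(output)/d(input), shape (n_samples, n_out, n_in)."""
    H = np.tanh(X @ W1.T + b1)                    # hidden activations
    # dy_o/dx_i = sum_h W2[o, h] * (1 - H[n, h]^2) * W1[h, i]
    return np.einsum("oh,nh,hi->noi", W2, 1.0 - H**2, W1)

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 4, 15, 3                  # sizes only; weights are random
W1 = rng.normal(scale=0.5, size=(n_hidden, n_in))
b1 = rng.normal(scale=0.5, size=n_hidden)
W2 = rng.normal(scale=0.5, size=(n_out, n_hidden))
b2 = rng.normal(scale=0.5, size=n_out)
X = rng.normal(size=(200, n_in))                  # synthetic "scene" of 200 pixels

J = mlp_jacobian(X, W1, b1, W2)
J_mean = J.mean(axis=0)                           # scene-averaged Jacobian
J_std = J.std(axis=0)                             # its variability over the scene
J_norm = J_mean * X.std(axis=0)                   # scaled by input standard deviations
print(J_mean.shape, J_std.shape, J_norm.shape)
```

In this sketch the rows of `J_mean` correspond to outputs and the columns to inputs, i.e. the transpose of the layout used in Tables 4.4 and 4.5. Histograms or maps of the individual entries `J[:, o, i]` then give the distributions discussed above.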
The analysis of the T_ct sensitivities shows how valuable the Jacobian is in interpreting the behaviour of the retrieval ANN. Since the creation of diagrams similar to Figure 4.9 but for varying T_ct is straightforward, I recommend a more detailed analysis of the spatial distribution of the Jacobian in the future.

4.5.3 Uncertainty

Although the estimation of the intrinsic noise covariance matrix C_in failed during the training process, the neural uncertainty term G^T H^{-1} G could still be computed and provides an estimate of the uncertainty in the retrieval due to the weights distribution. Figure 4.22 shows maps of the uncertainty (standard deviation computed from the variance in the diagonal elements of the matrix) for effective radius and cloud optical thickness. The uncertainty in r_eff (τ) ranges from less than 1 µm (1) in the area of the ship tracks to more than 2 µm (4) in the upper left corner of the scene.

However, based on my findings in Chapter 3, I question the usefulness of these uncertainty estimates. The larger uncertainty in the upper left corner of the scene is correlated with neither of the retrievals: there are no significantly larger or smaller effective radii, warmer or colder, or optically thicker or thinner clouds in this area than in the remainder of the scene. As discussed in section 3.7, ambiguities in the LUT cannot be recognised by the neural network. Hence, the only reason for the increased uncertainty would be a lower data density in the LUT for the type of clouds occurring in the area in question. Possibly the setup of my forward model caused the cloud type in the upper left corner of the scene to be more sparsely represented in the LUT than the other clouds in the scene. Further research is needed to clarify the meaning of the uncertainties.
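How a standard deviation follows from the diagonal of G^T H^{-1} G can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for Figure 4.22: `G` is taken to hold the gradients of the network outputs with respect to the weights (shape n_weights × n_outputs) and `H` the Hessian of the error function with respect to the weights, and both are filled with synthetic values here. The condition number of `H` is printed because an ill-conditioned Hessian is one way in which such estimates can become numerically unreliable.

```python
import numpy as np

def neural_uncertainty_std(G, H):
    """Standard deviations sqrt(diag(G^T H^-1 G)) for one input sample.

    G: (n_weights, n_outputs) gradients of the outputs w.r.t. the weights.
    H: (n_weights, n_weights) symmetric positive definite Hessian.
    """
    cov = G.T @ np.linalg.solve(H, G)   # avoids forming H^-1 explicitly
    return np.sqrt(np.diag(cov))

# Synthetic stand-ins; a real Hessian would come from the trained network.
rng = np.random.default_rng(2)
n_w, n_out = 20, 3
B = rng.normal(size=(n_w, n_w))
H = B @ B.T + n_w * np.eye(n_w)         # symmetric positive definite by construction
G = rng.normal(size=(n_w, n_out))

std = neural_uncertainty_std(G, H)
print("standard deviations:", std)
print("condition number of H:", np.linalg.cond(H))
```

Evaluating this for every pixel, with `G` recomputed from that pixel's inputs, would give maps of the kind described in this section.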
4.6 Further Developments

In this chapter, I analysed the retrieval results of the 15N network because it yielded the most plausible retrieval performance. As noted, it did not produce the smallest MSE. Given the instabilities in the training process, it is likely that the same network architecture, trained from a different weights initialisation, would not perform as well. I hence attribute the relatively good results to "luck" rather than to a well functioning method. It is likely that a network containing a larger number of hidden units could also approximate the inverse function correctly. Possibly such a network would produce a smaller MSE and exhibit an improved retrieval performance. However, given that the inverse problem is very ill-posed, more hidden neurons imply that it becomes more difficult to find the correct mapping.

Due to the various problems I encountered, I was not able to give definitive answers to the questions that arose in this thesis. Instead, the work that I presented should be taken as a first step towards finding a good and reliable network architecture suited for the given inverse problem. I demonstrated the high potential of neural networks for application in remote sensing, and explored several techniques that can be used to construct a stable retrieval scheme. However, my work also showed the need for further investigations. Given the results presented in this chapter, I propose, in roughly descending order of importance, to continue work in the following areas:

Stabilisation of the Jacobian

In order to make the training process less dependent on the network architecture and the initialisation of the weight values, it is necessary to decrease the variability in the Jacobian and make the problem less ill-conditioned. The PCA approach suggested by Aires et al.
(2004b) is not useful for this problem, since it involves reducing the number of inputs, which is feasible only if a large number of inputs are involved. Krasnopolsky (2007) notes alternative methods that aim at obtaining a Jacobian with a low uncertainty for data assimilation purposes, such as training a separate ANN to represent the Jacobian. However, while such methods might be able to provide a good estimate of the physical Jacobian, they say nothing about the Jacobian of the retrieval network. In his work, Krasnopolsky (2007) suggests an approach based on ensembles of ANNs. Several networks of the same architecture are trained using differently perturbed initial conditions for the ANN weights, so that each network will likely exhibit different weight values after the training. The average of all networks is then used to predict the outputs and estimate the Jacobians.

Of course, such an approach implies increased computational effort for the network training. However, since the number of inputs cannot be reduced in my retrieval problem, the method could lead to improvements.

Figure 4.22: Spatial distribution of the uncertainty in the effective radius (top panel) and optical thickness (bottom panel) retrievals, as computed from the neural uncertainty term G^T H^{-1} G.

Optimisation of the Network Architecture

Once the Jacobian is more stable, the network architecture can be optimised. This can continue to be done by training a restricted class of architectures and comparing both their validation MSE and retrieval performance. For comparison, it might be worthwhile to test whether three-layer networks, such as employed by Cerdeña et al.
(2007), perform better than the two-layer architectures used in this thesis. An alternative for finding the optimal number of hidden units would be to adopt the genetic algorithm approach proposed by Cerdeña et al. (2007). Also, Bishop (1995, Chapters 9 and 10) discusses further model selection tools. In my work, I restricted the number of training cycles during network training due to the high computational cost. However, if a stabilised Jacobian allows for a more structured exploration of different network architectures, it should be ensured that the training process is long enough to find the global minimum in the weights error surface.

Retrieval Evaluation

The retrieval evaluation can be extended and improved in several ways. First, it would be desirable to compute accurate point estimates of the physical Jacobian from the LUT, as discussed in section 4.4. This would significantly increase the usefulness of ANN-estimated point Jacobians. Next, the comparison of retrieved and in-situ observed values by means of histograms is not very satisfactory. More work should be invested into applying the retrieval to other scenes, possibly ones in which satellite overpass and in-situ measurements are closer together. The DYCOMS-II campaign provides further scenes, as does the EPIC campaign (Bretherton et al., 2004). In addition to the in-situ data, the ANN retrievals could also be compared to preceding and subsequent daytime retrievals from independent sources (e.g. the operational MODIS product). Also, it would be useful to know precisely what caused the irretrievable pixels. If computed correctly in the forward model, the 8.5 µm channel could help to identify high clouds (Baum et al.
, 2003), and the operational MODIS algorithms also provide information about cloud height (Ackerman et al., 1998). For comparing the obtained retrieval results with those of other nighttime retrieval schemes (e.g. Cerdeña et al., 2007) and for estimating the true impact of the cloud layer on the upwelling thermal flux, it would be desirable to implement the computation of the 11 µm optical thickness in the forward model.

Sensitivities to Assumptions in the Forward Model

It is also important to explore the sensitivities of the retrieval to the assumptions made in the forward model. A sensitivity study with respect to the $k$-value and $\nu_{gam}$ (cf. sections 2.2 and 2.3) is difficult, since in order to determine the precise effects, a new LUT would have to be computed and a new network trained for each new value in the current setup. However, it would be possible to compute the changes in the computed TOA BTs due to changing $k$ and $\nu_{gam}$, and to estimate the impact on the retrieval via the Jacobian. The same is valid for the sensitivity of the retrieval to subadiabatic clouds, and the impact on TOA BTs caused by broken cloud regimes should also be investigated. In my forward model, I assumed that the satellite is directly above the cloud (cf. section 2.6). In many cases, this might not be a good assumption (in fact, I did not check whether it was fulfilled in the July 11 case). Hence, if a sensitivity analysis shows a large impact of a slanted radiation path on the BTs observed by the satellite, the satellite zenith angle should be considered in the forward model.

Overlying Atmosphere

In order to make the retrieval independent of a prescribed overlying atmosphere, the optical properties of the OA could be introduced as variable parameters t* and B*, as suggested in section 2.5.
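
As an aside on the sensitivity estimate proposed above for the forward model assumptions: propagating a small brightness temperature change through the retrieval Jacobian is a first-order Taylor estimate, $\Delta y \approx J \, \Delta x$. The following is a minimal sketch of that idea only. The function `toy_retrieval`, the input values and the perturbation `d_bt` are hypothetical stand-ins chosen for illustration; they are not the trained retrieval network or values from the LUT, and the Jacobian is obtained here by central differences rather than from network weights.

```python
import math

def toy_retrieval(bt):
    """Hypothetical smooth map from three brightness temperatures [K]
    to (effective radius [um], optical thickness). This is NOT the
    retrieval network of the thesis, only a stand-in for illustration."""
    t37, t11, t12 = bt
    r_eff = 10.0 + 2.0 * math.tanh(0.5 * (t37 - t11))
    tau = 8.0 + 3.0 * math.tanh(0.8 * (t11 - t12))
    return [r_eff, tau]

def jacobian(f, x, h=1e-4):
    """Central-difference Jacobian with J[i][j] = d f_i / d x_j."""
    n_out = len(f(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(n_out):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return J

def first_order_shift(J, dx):
    """First-order estimate of the output change: dy = J dx."""
    return [sum(Jij * dxj for Jij, dxj in zip(row, dx)) for row in J]

# Hypothetical operating point and hypothetical BT change, e.g. the
# difference between two forward-model runs with different assumptions.
bt = [285.0, 284.0, 283.5]
d_bt = [0.05, -0.02, 0.01]

J = jacobian(toy_retrieval, bt)
approx = first_order_shift(J, d_bt)

# Exact change of the toy retrieval, for comparison with the estimate.
perturbed = toy_retrieval([x + d for x, d in zip(bt, d_bt)])
exact = [p - b for p, b in zip(perturbed, toy_retrieval(bt))]

print("first-order estimate:", approx)
print("exact change:        ", exact)
```

The estimate is only as good as the linearisation: for larger perturbations, or where the Jacobian itself is highly variable, the first-order value can deviate noticeably from the true change.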

Given the ease of including additional inputs into the ANN architecture, an alternative could be to obtain independent estimates of the water vapour path in the atmosphere (cf. section 2.5) and use them as an additional input. The network could then automatically infer the bulk absorption effects. Also, it could be investigated how much information is gained if the brightness temperatures of the clear sky pixels are obtained for all channels, as done by Perez et al. (2000) and Cerdeña et al. (2007).

Uncertainties and Ambiguities

It would be interesting to check whether the type of cloud in the upper left region of the scene is actually underrepresented in the LUT, as discussed in section 4.5. This information could help to refine the forward model so that the data density is similar for all types of clouds.

It is unsatisfying that the Aires et al. (2004a) method is not able to recognise ambiguous situations. Another analysis of the LUT should be performed to estimate the actual magnitude of the occurring ambiguities and to determine the regimes in which they occur (for instance, thin clouds vs. thick clouds). In order to incorporate the uncertainty due to ambiguities into the retrieval, I propose the following. First, if we consider the ambiguities as being a part of the intrinsic noise, we could discretise the input space and compute localised noise matrices $C'_{in}$ (where ' indicates the localisation). By assuming that the network mapping is correct, the covariance matrix of the errors $C'_0$ in the selected input area could be used to approximate $C'_{in}$ (thereby either ignoring the neural uncertainty term or, if stable enough, considering localised versions of $G^T H^{-1} G$ as well). An alternative could be to employ a different network architecture known as mixture models (Bishop, 1995, Chapter 6).
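
To illustrate why a mixture output can matter for ambiguous retrievals, the following minimal sketch compares a two-component Gaussian mixture with the single Gaussian that has the same mean and variance. All numbers are hypothetical (two equally likely effective radii of 6 and 14 µm for the same input); they are not taken from the LUT, and the sketch is not a mixture density network, only the output distribution such a network might produce.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, stds):
    """Density of a Gaussian mixture: sum_k w_k N(x | m_k, s_k)."""
    return sum(w * normal_pdf(x, m, s)
               for w, m, s in zip(weights, means, stds))

def moment_matched_gaussian(weights, means, stds):
    """Mean and standard deviation of the single Gaussian with the
    same first two moments as the mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (s * s + (m - mean) ** 2)
              for w, m, s in zip(weights, means, stds))
    return mean, math.sqrt(var)

# Hypothetical ambiguous case: two cloud states fit the same input.
weights = [0.5, 0.5]
means = [6.0, 14.0]   # effective radius [um], illustrative values only
stds = [1.0, 1.0]

g_mean, g_std = moment_matched_gaussian(weights, means, stds)
density_at_mode = mixture_pdf(means[0], weights, means, stds)
density_at_gauss_mean = mixture_pdf(g_mean, weights, means, stds)

print("single-Gaussian mean and std:", g_mean, g_std)
print("mixture density at a mode:   ", density_at_mode)
print("mixture density at that mean:", density_at_gauss_mean)
```

In this constructed case the single Gaussian places its most likely value midway between the two modes, where the mixture assigns almost no probability. A Gaussian error bar would therefore be wide but would still not reveal that two distinct solutions exist.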

Mixture models are able to compute multimodal output distributions; they are therefore not restricted by the Gaussian assumption, as the Aires et al. (2004a) method is (cf. section 3.3). While this could represent an approach with a high potential to overcome some of the difficulties encountered in my work, the Bayesian methods applied in this thesis could not be directly applied to mixture models. Hence, this approach would require an increased effort.

A comprehensive uncertainty estimate requires the combination of the errors of all contributing sources. If the above problems are solved, such an estimate can be obtained by combining network uncertainty, ambiguities, uncertainties due to assumptions in the forward model and instrument noise of the MODIS instrument, which can be converted to output error with the Jacobian.

Aires et al. Implementation

Finally, Chapter 3 raised many questions concerning the implementation and application of the Aires et al. (2004a) method. Whether we can remove the unsatisfying need to regularise the Hessian and whether the problems with the hyperparameter re-estimation can be solved are issues that should also be addressed in the future.

Chapter 5
Summary

In this thesis I have investigated the feasibility of retrieving cloud top effective radius, cloud optical thickness and cloud top temperature of nocturnal marine stratocumulus clouds by inverting infrared satellite measurements using an artificial neural network.

In Chapter 1, I described the scientific context of my work. Marine Sc play a critical role in the exchange of energy and water in our climate system. However, processes that are involved in cloud-atmosphere interaction and their representation in general circulation models remain some of the primary uncertainties in global climate modelling.
Observational data that can shed light on regional variations in cloud microphysical and optical properties, their diurnal cycle and the interaction of clouds, precipitation and aerosols is important for further progress in the accurate representation of marine Sc in climate and weather models.

A few studies have investigated the nocturnal retrieval problem and shown that it is possible to infer cloud properties from the information contained in the infrared channels centred at 3.7, 8.5, 11.0 and 12.0 µm. However, problems with the standard optimisation approach to these retrievals include high computational cost and a lack of consistent error estimates. The aim of this study was to use neural networks to design a retrieval method that is both fast and able to give such uncertainty estimates. The fundamental idea behind this approach is to approximate the inverse of the forward function, which describes the dependence of top-of-atmosphere radiances on cloud parameters, with a neural network architecture that is capable of solving nonlinear regression problems. Cerdeña et al. (2007) were the first to use neural networks to invert nocturnal measurements of the AVHRR instrument. The objective of this study was to extend their approach by investigating the applicability of methods proposed by Aires (2004) and Aires et al. (2004a,b), which allow for the estimation of uncertainties and sensitivities, to the problem, to apply the retrieval scheme to nocturnal measurements of the higher resolution MODIS instrument, and to compare the retrieved values to in-situ measurements obtained during the DYCOMS-II field campaign off the coast of California. My focus lay on the uncertainty in the retrieval, and on what we can learn from retrieval sensitivities. My initial
work included the construction of a forward model capable of computing top-of-atmosphere brightness temperatures for varying cloud parameters and the implementation of the Aires et al. method.

The topic of Chapter 2 was the theory and implementation of the forward model. The scene of July 11, 2001, was chosen for the retrieval, and the in-situ aircraft measurements obtained on that day were used for the development of the forward model and the evaluation of the retrieval performance. The scene contains a number of ship tracks that provide contrasting droplet sizes, which were used to check the physical consistency of the retrieval. The clouds were modelled as adiabatic plane parallel cloud layers in the forward model and the droplet size distribution was described by a modified gamma distribution. Comparisons of idealised adiabatic profiles and size distributions with in-situ measured data showed good agreement. The radiative transfer model libRadtran (Mayer and Kylling, 2005), which incorporates the multiple scattering code DISORT (Stamnes et al., 1988) to solve the radiative transfer equation, was used to compute cloud top radiances. To account for the spectral intervals of the MODIS instrument channels, I implemented correlated-k code developed by Kratz (2001) into libRadtran. Gaseous absorption and emission above cloud top were accounted for by incorporating average transmission and emission properties of the atmosphere obtained from radiosonde soundings. The forward model proved capable of reproducing relationships between cloud top brightness temperature differences and cloud parameters that were previously described by Baum et al. (1994) and Perez et al. (2000).

In Chapter 3, I further explored theory and implementation of the method proposed by Aires (2004) and Aires et al. (2004a,b).
The method ideally provides estimates of the uncertainty arising from both the neural network fit to the inverse function and the intrinsic noise inherent in the data, as well as the variability in the Jacobian that is due to the network fit. The Jacobian, describing the sensitivities of the outputs with respect to the inputs, is particularly important for analysing the dependences of the network fit in order to ensure that the network models the "correct" function. Its variability indicates how ill-conditioned the regression problem is, a common difficulty with inverse problems that makes it hard to find a good approximation to the inverse function.

Unfortunately, I encountered several difficulties with the Aires et al. method that limited its usefulness for the retrieval problem. Initial tests with simple examples showed questionable results, with uncertainty intervals often not including the true value. For instance, the method was not able to recognise ambiguities that were expected to occur in the lookup table. Furthermore, numerical problems involved in the estimation of the Hessian matrix of the network in some cases led to a failure in the estimation of the uncertainty due to the intrinsic noise in the data and questionable results for the uncertainty due to the network fit. The estimation of the Jacobian and its variability, however, proved promising.

In Chapter 4, I applied the methods described in Chapters 2 and 3 to the retrieval scene. A lookup table consisting of 96,000 cloud profiles with varying effective radius, droplet number concentration and temperature was computed. Parameter ranges were obtained from the in-situ measurements, and, for simplicity, cloud top and surface pressure were held constant at the observed values.
An average droplet size distribution width representative of the night was determined from the measurements. The brightness temperatures and cloud parameters in the lookup table were used to train different architectures of neural networks. I explored several configurations with different inputs and numbers of hidden units in the network and compared their results based on training error, sensitivities and physical plausibility. I restricted myself to network architectures containing one layer of hidden neurons.

The major results of the retrieval investigations can be summarised as follows:

• Comparison of the computed brightness temperatures to those observed by MODIS showed that a large number of observations were outside the computed range. Many of these "irretrievable pixels" could be attributed to broken clouds and clear sky pixels; for the remaining pixels I found indications of overlying cirrus clouds.

• Unfortunately, the computations of the 8.5 µm brightness temperatures did not agree with the observations. An error in the forward model seemed to be the likely cause; hence, the corresponding channel could not be used in the retrieval.

• The uncertainty estimation of Aires et al. (2004a) failed for all networks trained from the lookup table. It is possible that there was only little intrinsic noise in the lookup table, which could have amplified the numerical problems noted above. While I was able to obtain uncertainty estimates due to the network fit with a regularised Hessian, I question the usefulness of the obtained values.

• The brightness temperatures at 3.7, 11 and 12 µm alone were not sufficient for retrieving effective radius, cloud optical thickness and cloud temperature. All networks employing only these inputs showed unphysical results in at least one output variable.

• The Jacobian and its variability proved to be a valuable tool for analysing the networks.
Point estimates of the network Jacobian were compared with estimations of the physical Jacobian obtained from the lookup table; the average Jacobian of the satellite scene gave information about the average information content of the network inputs and the ill-conditioning of the problem; and maps of the Jacobian showed the spatial distribution of the sensitivities and were interpreted for physical plausibility.

• Obtaining a good network fit to the inverse function proved to be a highly ill-conditioned problem. This led to a very unstable learning process, and the high variability in the Jacobian made it difficult to find a network modelling the expected dependences.

• Almost all network architectures exhibited unphysical behaviour in at least one output variable. This included networks employing sea surface temperature as an additional input to the satellite observations and networks employing brightness temperature differences instead of brightness temperatures as inputs.

• A network that predicted reasonable results employed the brightness temperatures at 3.7 and 11 µm, the brightness temperature difference between 11 and 12 µm and sea surface temperature as inputs and included 15 units in the hidden layer. Satellite image and in-situ measurements were recorded with a five hour difference; a comparison of histograms of advected cloud properties with histograms of the in-situ measurements showed good agreement of cloud top temperature, and ranges of observed and retrieved effective radii also agreed well. Retrievals of all three retrieved parameters were physically plausible, as was the Jacobian.

I have demonstrated that it is feasible to use artificial neural networks for the retrieval of nocturnal marine stratocumulus properties. My results are promising in so far as, despite the difficulties I encountered, a good agreement between retrieved and observed data was obtained.
In particular the Jacobian proved to be a very valuable tool that has not been employed in the investigations of other authors. However, further refinements and analysis of the method are required. I discussed open questions and suggested areas for continuing work in the conclusions of Chapter 4.

In conclusion, I believe that it is essential to reduce the ill-conditioning of the inverse problem in order to achieve a more stable training process and improved retrieval results. My work provides a foundation for future work, but further research is required to provide a reliable, stable method that can provide accurate uncertainty estimates.

Bibliography

Abdi, H., 2007: Encyclopedia of Measurement and Statistics, chapter Singular Value Decomposition (SVD) and Generalized Singular Value Decomposition (GSVD). Sage, Thousand Oaks (CA).
Ackerman, A. S., M. P. Kirkpatrick, D. E. Stevens and O. B. Toon, 2004: The impact of humidity above stratiform clouds on indirect aerosol climate forcing. Nature, 432(7020), 1014-1017.
Ackerman, S. A., K. I. Strabala, P. W. Menzel, R. A. Frey, C. C. Moeller and L. E. Gumley, 1998: Discriminating clear sky from clouds with MODIS. Journal of Geophysical Research, 103(D24), 32141-32158.
Aires, F., 2004: Neural network uncertainty assessments using Bayesian statistics with application to remote sensing: 1. Network weights. Journal of Geophysical Research, 109, D10303.
Aires, F., A. Chedin, N. A. Scott and W. B. Rossow, 2002: A regularized neural net approach for retrieval of atmospheric and surface temperatures with the IASI instrument. Journal of Applied Meteorology, 41, 144-159.
Aires, F., C. Prigent and W. Rossow, 2004a: Neural network uncertainty assessments using Bayesian statistics with application to remote sensing: 2. Output errors.
Journal of Geophysical Research, 109, D10304.
Aires, F., C. Prigent and W. Rossow, 2004b: Neural network uncertainty assessments using Bayesian statistics with application to remote sensing: 3. Network Jacobians. Journal of Geophysical Research, 109, D10305.
Aires, F., C. Prigent, W. B. Rossow and M. Rothstein, 2001: A new neural network approach including first guess for retrieval of atmospheric water vapor, cloud liquid water path, surface temperature, and emissivities over land from satellite microwave observations. Journal of Geophysical Research, 106, 14887-14908.
Albrecht, B., 1989: Aerosols, cloud microphysics, and fractional cloudiness. Science, 245, 1227-1230.
Albrecht, B. A., D. A. Randall and S. Nicholls, 1988: Observations of marine stratocumulus clouds during FIRE. Bulletin of the American Meteorological Society, 69(6), 618-626.
Anderson, G. P., S. A. Clough, F. X. Kneizys, J. H. Chetwynd and E. P. Shettle, 1986: AFGL atmospheric constituent profiles (0-120km). AFGL-TR-86-0110, Air Force Geophys. Lab, Hanscom Air Force Base, Bedford, Mass.
Arduini, R. F., P. Minnis, Smith, J. K. Ayers, M. M. Khaiyer and P. Heck, 2005: Sensitivity of satellite-retrieved cloud properties to the effective variance of cloud droplet size distribution, in Fifteenth Atmospheric Radiation Measurement (ARM) Science Team Meeting, Daytona Beach, FL (US), 03/14/2005-03/18/2005.
Arking, A. and J. D. Childs, 1985: Retrieval of cloud cover parameters from multispectral satellite images. Journal of Applied Meteorology, 24, 322-334.
Austin, P. H., Y. Wang, R. Pincus and V. Kujala, 1995: Precipitation in stratocumulus clouds: Observational and modeling results. Journal of Atmospheric Sciences, 52, 2329-2352.
Barnes, W. L., T. S. Pagano and V. V.
Salomonson, 1998: Prelaunch characteristics of the moderate resolution imaging spectroradiometer (MODIS) on EOS-AM1. Geoscience and Remote Sensing, IEEE Transactions on, 36(4), 1088-1100.
Baum, B. A., R. F. Arduini, B. A. Wielicki, P. Minnis and S. C. Tsay, 1994: Multilevel cloud retrieval using multispectral HIRS and AVHRR data: Nighttime oceanic analysis. Journal of Geophysical Research, 99, 5499-5514.
Baum, B. A., R. A. Frey, G. G. Mace, M. K. Harkey and P. Yang, 2003: Nighttime multilayered cloud detection using MODIS and ARM data. Journal of Applied Meteorology, 42, 905-919.
Bishop, C. M., 1994: Neural networks and their applications. Review of Scientific Instruments, 65, 1803-1832.
Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Oxford Univ. Press.
Bishop, C. M., 2006: Pattern Recognition and Machine Learning. Springer.
Blaskovic, M., R. Davies and J. B. Snider, 1991: Diurnal variation of marine stratocumulus over San Nicolas island during July 1987. Monthly Weather Review, 119(6), 1469-1478.
Bohren, C. F. and E. Clothiaux, 2006: Fundamentals of Atmospheric Radiation: An Introduction with 400 Problems. Wiley-VCH.
Bohren, C. F., J. R. Linskens and M. E. Churma, 1995: At what optical thickness does a cloud completely obscure the sun? Journal of Atmospheric Sciences, 52, 1257-1259.
Bony, S., R. Colman, V. M. Kattsov, R. P. Allan, C. S. Bretherton, J. L. Dufresne, A. Hall, S. Hallegatte, M. M. Holland, V. Ingram, D. A. Randall, B. J. Soden, G. Tselioudis and M. J. Webb, 2006: How well do we understand and evaluate climate change feedback processes? Journal of Climate, 19(15), 3445-3482.
Bony, S. and J. L.
Dufresne, 2005: Marine boundary layer clouds at the heart of tropical cloud feedback uncertainties in climate models. Geophysical Research Letters, 32, 20806+.
Brenguier, J. L., H. Pawlowska, L. Schuller, R. Preusker, J. Fischer and Y. Fouquart, 2000: Radiative properties of boundary layer clouds: Droplet effective radius versus number concentration. Journal of Atmospheric Sciences, 57, 803-821.
Bretherton, C. S., T. Uttal, C. W. Fairall, S. E. Yuter, R. A. Weller, D. Baumgardner, K. Comstock, R. Wood and G. B. Raga, 2004: The EPIC 2001 stratocumulus study. Bulletin of the American Meteorological Society, 85, 967-977.
Cerdeña, A., A. Gonzalez and J. C. Perez, 2007: Remote sensing of water cloud parameters using neural networks. Journal of Atmospheric and Oceanic Technology, 24(1), 52-63.
Cerdeña, A., J. C. Perez and A. Gonzalez, 2004: Cloud properties retrieval using neural networks, in K. P. Schafer, A. Comeron, M. R. Carleer, R. H. Picard and N. I. Sifakis, editors, Remote Sensing of Clouds and the Atmosphere IX, volume 5571 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference, pp. 11-19.
Coakley, J. A. and F. P. Bretherton, 1982: Cloud cover from high-resolution scanner data: Detecting and allowing for partially filled fields of view. Journal of Geophysical Research, 87, 4917-4932.
Cornet, C., J. C. Buriez, J. Riedi, H. Isaka and B. Guillemet, 2005: Case study of inhomogeneous cloud parameter retrieval from MODIS data. Geophysical Research Letters, 32, 13807+.
Cornet, C., H. Isaka, B. Guillemet and F. Szczap, 2004: Neural network retrieval of cloud parameters of inhomogeneous clouds from multispectral and multiscale radiance data: Feasibility study. Journal of Geophysical Research, 109, 12203+.
Curry, J. A. and P. J. Webster, 1999: Thermodynamics of Atmospheres and Oceans (International Geophysics). Academic Press.
D'Entremont, R. P., 1986: Low- and midlevel cloud analysis using nighttime multispectral imagery. Journal of Applied Meteorology, 25, 1853-1869.
Driedonks, A. G. M. and P. G. Duynkerke, 1989: Current problems in the stratocumulus-topped atmospheric boundary layer. Boundary-Layer Meteorology, 46, 275-303.
Durkee, P. A., R. E. Chartier, A. Brown, E. J. Trehubenko, S. D. Rogerson, C. Skupniewicz, K. E. Nielsen, S. Platnick and M. D. King, 2000a: Composite ship track characteristics. Journal of Atmospheric Sciences, 57, 2542-2553.
Durkee, P. A., K. J. Noone and R. T. Bluth, 2000b: The Monterey area ship track experiment. Journal of Atmospheric Sciences, 57(16), 2523-2541.
Evans, K. F., 1998: The spherical harmonics discrete ordinate method for three-dimensional atmospheric radiative transfer. Journal of Atmospheric Sciences, 55, 429-446.
Faure, T., H. Isaka and B. Guillemet, 2001a: Mapping neural network computation of high-resolution radiant fluxes of inhomogeneous clouds. Journal of Geophysical Research, 106, 14961-14974.
Faure, T., H. Isaka and B. Guillemet, 2001b: Neural network analysis of the radiative interaction between neighboring pixels in inhomogeneous clouds. Journal of Geophysical Research, 106, 14465-14484.
Faure, T., H. Isaka and B. Guillemet, 2001c: Neural network retrieval of cloud parameters of inhomogeneous and fractional clouds - feasibility study. Remote Sensing of Environment, 77(2), 123-138.
Faure, T., H. Isaka and B. Guillemet, 2002: Neural network retrieval of cloud parameters from high-resolution multispectral radiometric data - a feasibility study. Remote Sensing of Environment, 80(2), 285-296.
Fu, Q. and K. N. Liou, 1992: On the correlated k-distribution method for radiative transfer in nonhomogeneous atmospheres. Journal of Atmospheric Sciences, 49, 2139-2156.
Gonzalez, A., J. C. Perez, F. Herrera, F. Rosa, M. A. Wetzel, R. D. Borys and D. H. Lowenthal, 2002: Stratocumulus properties retrieval method from NOAA-AVHRR data based on the discretization of cloud parameters. International Journal of Remote Sensing, 23(4), 627-645.
Han, Q., W. B. Rossow and A. A. Lacis, 1994: Near-global survey of effective droplet radii in liquid water clouds using ISCCP data. Journal of Climate, 7, 465-497.
Hansen, P., 1994: Regularization tools: A Matlab package for analysis and solution of discrete ill-posed problems. Numerical Algorithms, 6(1), 1-35.
Harshvardhan, 1982: The effect of brokenness on cloud-climate sensitivity. Journal of Atmospheric Sciences, 39(8), 1853-1861.
Hartmann, D. L., M. E. Ockert-Bell and M. L. Michelsen, 1992: The effect of cloud type on earth's energy balance: Global analysis. Journal of Climate, 5, 1281-1304.
Heck, P. W., W. L. Smith, P. Minnis and D. F. Young, 1999: Multispectral retrieval of nighttime cloud properties for CERES, ARM, and FIRE, in Proceedings of ALPS 99 Symposium, Meribel, France.
Hobbs, P. V., T. J. Garrett, R. J. Ferek, S. R. Strader, D. A. Hegg, G. M. Frick, W. A. Hoppel, R. F. Gasparovic, L. M. Russell, D. W. Johnson, C. O'Dowd, P. A. Durkee, K. E. Nielsen and G. Innis, 2000: Emissions from ships with respect to their effects on clouds. Journal of Atmospheric Sciences, 57, 2570-2590.
Hsieh, W. W. and B. Tang, 1998: Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79, 1855-1870.
Hu, Y. and K. Stamnes, 2000: Climate sensitivity to cloud optical properties. Tellus B, 52(1), 81-93.
Hunt, G. E.
, 1973: Radiative properties of terrestrial clouds at visible and infra-red thermal window wavelengths. Quarterly Journal of the Royal Meteorological Society, 99, 346-369.
Iwabuchi, H., 2007: Retrieval of cloud optical thickness and effective radius using multispectral remote sensing and accounting for 3D effects, in A. A. Kokhanovsky, editor, Light Scattering Reviews 2, pp. 97-124. Springer.
Kato, S., L. M. Hinkelman and A. Cheng, 2006: Estimate of satellite-derived cloud optical thickness and effective radius errors and their effect on computed domain-averaged irradiances. Journal of Geophysical Research, 111, 17201+.
Kawamoto, K., T. Nakajima and T. Y. Nakajima, 2001: A global determination of cloud microphysics with AVHRR remote sensing. Journal of Climate, 14, 2054-2068.
Kawanishi, T., T. Sezai, Y. Ito, K. Imaoka, T. Takeshima, Y. Ishido, A. Shibata, M. Miura, H. Inahata and R. W. Spencer, 2003: The Advanced Microwave Scanning Radiometer for the Earth Observing System (AMSR-E), NASDA's contribution to the EOS for global energy and water cycle studies. Geoscience and Remote Sensing, IEEE Transactions on, 41(2), 184-194.
King, M., S. Tsay, S. Platnick, M. Wang and K. Liou, 1997: Cloud retrieval algorithms for MODIS: optical thickness, effective particle radius, and thermodynamic phase. MODIS Algorithm Theoretical Basis Document No. ATBD-MOD-05, NASA.
Klein, S. A. and D. L. Hartmann, 1993: The seasonal cycle of low stratiform clouds. Journal of Climate, 6, 1587-1606.
Krasnopolsky, V. M., 2007: Reducing uncertainties in neural network Jacobians and improving accuracy of neural network emulations with NN ensemble approaches. Neural Networks, 20(4), 454-461.
Krasnopolsky, V. M., L. C. Breaker and W. H.
Gemmill, 1995: A neural network as a nonlinear transfer function model for retrieving surface wind speeds from the special sensor microwave imager. Journal of Geophysical Research, 100, 11033-11046.
Krasnopolsky, V. M., W. H. Gemmill and L. C. Breaker, 2000: A neural network multiparameter algorithm for SSM/I ocean retrievals - comparisons and validations. Remote Sensing of Environment, 73(2), 133-142.
Kratz, D. P., 1995: The correlated k-distribution technique as applied to the AVHRR channels. Journal of Quantitative Spectroscopy and Radiative Transfer, 53, 501-517.
Kratz, D. P., 2001: MODIS correlated k-distributions. http://asd-www.larc.nasa.gov/~kratz/modis.html, accessed April 9th, 2007.
Lee, T. E., S. D. Miller, F. J. Turk, C. Schueler, R. Julian, S. Deyo, P. Dills and S. Wang, 2006: The NPOESS VIIRS day/night visible sensor. Bulletin of the American Meteorological Society, 87, 191-199.
Lin, X. and J. A. Coakley, 1993: Retrieval of properties for semitransparent clouds from multispectral infrared imagery data. Journal of Geophysical Research, 98, 18501-18514.
Lohmann, U. and J. Feichter, 2005: Global indirect aerosol effects: a review. Atmospheric Chemistry & Physics, 5, 715-737.
Luo, G., X. Lin and J. A. Coakley, 1994: 11-µm emissivities and droplet radii for marine stratocumulus. Journal of Geophysical Research, 99, 3685-3698.
MacKay, D. J. C., 1992a: Bayesian interpolation. Neural Computation, 4(3), 415-447.
MacKay, D. J. C., 1992b: A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448-472.
MacKay, D. J. C., 1995: Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469-505.
Martin, G. M., D. W. Johnson and A.
Spice, 1994: The measurement and parameterization of effective radius of droplets in warm stratocumulus clouds. Journal of Atmospheric Sciences, 51, 1823-1842.
Mayer, B. and A. Kylling, 2005: Technical note: The libRadtran software package for radiative transfer calculations - description and examples of use. Atmospheric Chemistry & Physics, 5, 1855-1877.
McCann, D. W., 1992: A neural network short-term forecast of significant thunderstorms. Weather and Forecasting, 7(3), 525-534.
Mie, G., 1908: Beiträge zur Optik trüber Medien, speziell kolloidaler Metallösungen. Annalen der Physik, 330, 377-445.
Miles, N. L., J. Verlinde and E. E. Clothiaux, 2000: Cloud droplet size distributions in low-level stratiform clouds. Journal of Atmospheric Sciences, 57, 295-311.
Minnis, P., D. P. Kratz, J. A. Coakley, M. D. King, D. Garber, P. Heck, S. Mayor, D. F. Young and R. Arduini, 1995: Cloud Optical Property Retrieval (Subsystem 4-3). volume 3, pp. 135-176. NASA RP 1376.
Mlawer, E. J., S. J. Taubman, P. D. Brown, M. J. Iacono and S. A. Clough, 1997: Radiative transfer for inhomogeneous atmospheres: RRTM, a validated correlated-k model for the longwave. Journal of Geophysical Research, 102, 16663-16682.
Nabney, I. T., 2002: NETLAB - Algorithms for Pattern Recognition. Springer.
Nakajima, T. and M. D. King, 1990: Determination of the optical thickness and effective particle radius of clouds from reflected solar radiation measurements. Part I: Theory. Journal of Atmospheric Sciences, 47(15), 1878-1893.
Nakajima, T., M. D. King, J. D. Spinhirne and L. F. Radke, 1991: Determination of the optical thickness and effective particle radius of clouds from reflected solar radiation measurements. Part II: Marine stratocumulus observations.
Journal of Atmospheric Sciences, 48, 728-751.
Nakajima, T. Y. and T. Nakajima, 1995: Wide-area determination of cloud microphysical properties from NOAA AVHRR measurements for FIRE and ASTEX regions. Journal of Atmospheric Sciences, 52(23), 4043-4059.
Norris, J. R. and C. B. Leovy, 1994: Interannual variability in stratiform cloudiness and sea surface temperature. Journal of Climate, 7, 1915-1925.
Pawlowska, H. and J. Brenguier, 2000: Microphysical properties of stratocumulus clouds during ACE-2. Tellus B, 52, 868+.
Pearlmutter, B. A., 1994: Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147-160.
Perez, J. C., P. H. Austin and A. Gonzalez, 2002: Retrieval of boundary layer cloud properties using infrared satellite data during the DYCOMS-II field experiment, in Proceedings 15th Symposium Boundary Layer and Turbulence, Wageningen, Netherlands.
Perez, J. C., F. Herrera, F. Rosa, A. Gonzalez, M. A. Wetzel, R. D. Borys and D. H. Lowenthal, 2000: Retrieval of marine stratus cloud droplet size from NOAA-AVHRR nighttime imagery. Remote Sensing of Environment, 73(1), 31-45.
Petty, G. W., 2006: A First Course in Atmospheric Radiation (2nd Ed.). Sundog Publishing.
Phillips, T. and P. L. Barry, 2002: Clouds in the greenhouse, http://science.nasa.gov/headlines/y2002/22apr-ceres.htm, accessed June 30th, 2007.
Pincus, R., M. Szczodrak, J. Gu and P. H. Austin, 1995: Uncertainty in cloud optical depth estimates made from satellite radiance measurements. Journal of Climate, 8, 1453-1462.
Platnick, S., M. D. King, S. A. Ackerman, W. P. Menzel, B. A. Baum, J. C. Riedi and R. A. Frey, 2003: The MODIS cloud products: algorithms and examples from Terra.
Geoscience and Remote Sensing, IEEE Transactions on, 41(2), 459-473.
Platnick, S. and S. Twomey, 1994: Determining the susceptibility of cloud albedo to changes in droplet concentration with the Advanced Very High Resolution Radiometer. Journal of Applied Meteorology, 33, 334-347.
Platnick, S. and F. P. J. Valero, 1995: A validation of a satellite cloud retrieval during ASTEX. Journal of Atmospheric Sciences, 52, 2985-3001.
Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, 2007: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press.
Rigdon, E. E., 1997: Not positive definite matrices - causes and cures, http://www2.gsu.edu/~mkteer/npdmatri.html, accessed June 18th, 2007.
Ringer, M. A., B. J. McAvaney, N. Andronova, L. E. Buja, M. Esch, W. J. Ingram, B. Li, J. Quaas, E. Roeckner, C. A. Senior, B. J. Soden, E. M. Volodin, M. J. Webb and K. D. Williams, 2006: Global mean cloud feedbacks in idealized climate change experiments. Geophysical Research Letters, 33, 7718+.
Rosenblatt, F., 1958: The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.
Rossow, W. B. and R. A. Schiffer, 1999: Advances in understanding clouds from ISCCP. Bulletin of the American Meteorological Society, 80, 2261-2288.
Rozendaal, M. A., C. B. Leovy and S. A. Klein, 1995: An observational study of diurnal variations of marine stratiform cloud. Journal of Climate, 8, 1795-1809.
Savic-Jovcic, V. and B. Stevens, 2007: The structure and mesoscale organization of precipitating stratocumulus. Journal of Atmospheric Sciences, in press.
Schreier, M., A. A. Kokhanovsky, V. Eyring, L. Bugliaro, H. Mannstein, B. Mayer, H. Bovensmann and J. P.
Burrows, 2006: Impact of ship emissions on the microphysical, optical and radiative properties of marine stratus: a case study. Atmospheric Chemistry & Physics, 6, 4925-4942.
Schüller, L., R. Bennartz, J. Fischer and J. L. Brenguier, 2005: An algorithm for the retrieval of droplet number concentration and geometrical thickness of stratiform marine boundary layer clouds applied to MODIS radiometric observations. Journal of Applied Meteorology, 44, 28-38.
Schüller, L., J. L. Brenguier and H. Pawlowska, 2003: Retrieval of microphysical, geometrical, and radiative properties of marine stratocumulus from remote sensing. Journal of Geophysical Research, 108, 5-1.
Sharon, T. M., B. A. Albrecht, H. H. Jonsson, P. Minnis, M. M. Khaiyer, T. M. van Reken, J. Seinfeld and R. Flagan, 2006: Aerosol and cloud microphysical characteristics of rifts and gradients in maritime stratocumulus clouds. Journal of Atmospheric Sciences, 63, 983-997.
Stamnes, K., S. C. Tsay, K. Jayaweera and W. Wiscombe, 1988: Numerically stable algorithm for discrete-ordinate-method radiative transfer in multiple scattering and emitting layered media. Applied Optics, 27, 2502-2509.
Stamnes, K., S. C. Tsay, W. Wiscombe and I. Laszlo, 2000: DISORT, a general-purpose Fortran program for discrete-ordinate-method radiative transfer in scattering and emitting layered media: Documentation of methodology. Technical report, Dept. of Physics and Engineering Physics, Stevens Institute of Technology, Hoboken, NJ 07030.
Stephens, G. L., 2005: Cloud feedbacks in the climate system: A critical review. Journal of Climate, 18, 237-273.
Stephens, G. L., D. G. Vane, R. J. Boain, G. G. Mace, K. Sassen, Z. Wang, A. J. Illingworth, E. J. O'Connor, W. B. Rossow, S. L. Durden, S. D. Miller, R. T.
Austin, A. Benedetti, C. Mitrescu and The CloudSat Science Team, 2002: The Cloudsat mission and the A-Train. Bulletin of the American Meteorological Society, 83, 1771-1790.
Stevens, B., A. Beljaars, S. Bordoni, C. Holloway, M. Köhler, S. Krueger, V. Savic-Jovcic and Y. Zhang, 2007: On the structure of the lower troposphere in the summertime stratocumulus regime of the northeast Pacific. Monthly Weather Review, 135(3), 985-1005.
Stevens, B., D. H. Lenschow, G. Vali, H. Gerber, A. Bandy, B. Blomquist, J. L. Brenguier, C. S. Bretherton, F. Burnet, T. Campos, S. Chai, I. Faloona, D. Friesen, S. Haimov, K. Laursen, D. K. Lilly, S. M. Loehrer, S. P. Malinowski, B. Morley, M. D. Petters, D. C. Rogers, L. Russell, V. Savic-Jovcic, J. R. Snider, D. Straub, M. J. Szumowski, H. Takagi, D. C. Thornton, M. Tschudi, C. Twohy, M. Wetzel and M. C. van Zanten, 2003a: Dynamics and Chemistry of Marine Stratocumulus - DYCOMS-II. Bulletin of the American Meteorological Society, 84(5), 579-593.
Stevens, B., D. H. Lenschow, G. Vali, H. Gerber, A. Bandy, B. Blomquist, J. L. Brenguier, C. S. Bretherton, F. Burnet, T. Campos, S. Chai, I. Faloona, D. Friesen, S. Haimov, K. Laursen, D. K. Lilly, S. M. Loehrer, S. P. Malinowski, B. Morley, M. D. Petters, D. C. Rogers, L. Russell, V. Savic-Jovcic, J. R. Snider, D. Straub, M. J. Szumowski, H. Takagi, D. C. Thornton, M. Tschudi, C. Twohy, M. Wetzel and M. C. van Zanten, 2003b: Supplement to Dynamics and Chemistry of Marine Stratocumulus - DYCOMS-II (flight summaries). Bulletin of the American Meteorological Society, 84(5), S12-S25.
Stevens, B., G. Vali, K. Comstock, R. Wood, M. C. van Zanten, P. H. Austin, C. S. Bretherton and D. H.
Lenschow, 2005: Pockets of open cells and drizzle in marine stratocumulus. Bulletin of the American Meteorological Society, 86, 51-57.
Turner, D. D., A. M. Vogelmann, R. T. Austin, J. C. Barnard, K. Cady-Pereira, J. C. Chiu, S. A. Clough, C. Flynn, M. M. Khaiyer, J. Liljegren, K. Johnson, B. Lin, C. Long, A. Marshak, S. Y. Matrosov, S. A. McFarlane, M. Miller, Q. Min, P. Minnis, W. O'Hirok, Z. Wang and W. Wiscombe, 2007: Thin liquid water clouds: Their importance and our challenge. Bulletin of the American Meteorological Society, 88(2), 177-190.
Twohy, C. H., M. D. Petters, J. R. Snider, B. Stevens, W. Tahnk, M. Wetzel, L. Russell and F. Burnet, 2005: Evaluation of the aerosol indirect effect in marine stratocumulus clouds: Droplet number, size, liquid water path, and radiative impact. Journal of Geophysical Research, 110, 8203+.
Twomey, S., 1974: Pollution and the planetary albedo. Atmospheric Environment, 8(12), 1251-1256.
Twomey, S., 1977: The influence of pollution on the shortwave albedo of clouds. Journal of Atmospheric Sciences, 34, 1149-1154.
Twomey, S. and T. Cocks, 1989: Remote sensing of cloud parameters from spectral reflectance in the near-infrared. Beiträge zur Physik der Atmosphäre, 62, 172-179.
van Delst, P., 2005: Sensor SpcCoeff data, http://cimss.ssec.wisc.edu/~paulv/, accessed September 6th, 2007.
Vaughan, M. A., S. A. Young, D. M. Winker, K. A. Powell, A. H. Omar, Z. Liu, Y. Hu and C. A. Hostetler, 2004: Fully automated analysis of space-based lidar data: an overview of the CALIPSO retrieval algorithms and data products, in U. N.
Singh, editor, Laser Radar Techniques for Atmospheric Sensing, volume 5575 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference, pp. 16-30.
Wallace, J. M. and P. V. Hobbs, 2006: Atmospheric Science, Volume 92, Second Edition: An Introductory Survey (International Geophysics). Academic Press.
Weisstein, E. W., 2007: MathWorld - a Wolfram web resource, http://mathworld.wolfram.com, accessed June 5th, 2007.
Wielicki, B. A., E. F. Harrison, R. D. Cess, M. D. King and D. A. Randall, 1995: Mission to Planet Earth: Role of clouds and radiation in climate. Bulletin of the American Meteorological Society, 76(11), 2125-2154.
Williams, C. K. I., C. Qazaz, C. M. Bishop and H. Zhu, 1995: On the relationship between Bayesian error bars and the input data density, in Artificial Neural Networks, 1995., Fourth International Conference on, pp. 160-165.
Williams, K. and G. Tselioudis, 2007: GCM intercomparison of global cloud regimes: present-day evaluation and climate change response. Climate Dynamics, 29(2), 231-250.
Wood, R., 2007: Cancellation of aerosol indirect effects in marine stratocumulus through cloud thinning. Journal of Atmospheric Sciences, 64(7), 2657-2669.
Wood, R., C. S. Bretherton and D. L. Hartmann, 2002: Diurnal cycle of liquid water path over the subtropical and tropical oceans. Geophysical Research Letters, 29, 7-1.
Xu, K. M., T. Wong, B. A. Wielicki, L. Parker and Z. A. Eitzen, 2005: Statistical analyses of satellite cloud object data from CERES, part I: Methodology and preliminary results of the 1998 El Niño/2000 La Niña. Journal of Climate, 18, 2497-2514.
Xue, H., G. Feingold and B. Stevens, 2007: The role of precipitating cells in organizing shallow cumulus convection.
Journal of Atmospheric Sciences, in press.

Appendix A. Implementation of the Aires et al. Method in Matlab

A.1 Modified Netlab Functions

Table A.1 lists all NETLAB functions that were modified for this study. About half of the functions implementing MLPs in NETLAB had to be modified in order to accommodate the matrix hyperparameters $A_{in}$ and $A_r$, and some additional functions were implemented as well. However, only the most important changes and implementation details will be discussed in this appendix.

Table A.1: List of NETLAB functions that were adapted or created in order to accommodate the matrix hyperparameters $A_{in}$ and $A_r$ (cf. Nabney, 2002, Table 5.1).

original function | new function | function | changes
mlp | mla | create two-layer MLP | scalar hyperparameters have been replaced by the matrix ones
mlpbkp | mlabkp | backpropagate error gradient through network | none
mlpderiv | mladeriv | evaluate derivatives of network outputs with respect to weights | none
mlperr | mlaerr | evaluate error function | implementation of (3.22) instead of (3.24)
mlpfwd | mlafwd | forward propagation | none
mlpgrad | mlagrad | evaluate error gradient | full backpropagation with matrix hyperparameters
mlphdotv | mlahdotv | evaluate the product of the Hessian with a vector | $\mathcal{R}\{\cdot\}$-algorithm has been adapted to the new error function
mlphess | mlahess | evaluate the Hessian matrix | use full covariance matrix $A_r$ instead of $\alpha$
mlppak | mlapak | combine weights and biases into one parameter vector | none
mlpunpak | mlaunpak | separate parameter vector into weight and bias matrices | none
- | mlatrain | implementation of Algorithm 3.1 | -
- | mlarescale | normalise input or target variables | -
- | mlareverserescale | undo the rescaling of mlarescale | -
- | mlajacob | evaluate the Jacobian matrix | -
- | mlajacobcheck | verify the Jacobian matrix with finite differences | -
- | mlajacobuncertainty | compute the uncertainty in the Jacobian matrix | -

A.2 Main Loop: mlatrain

The implementation of Algorithm 3.1 in mlatrain is straightforward. Following the NETLAB design, the network is stored in a structure net (cf. Nabney, 2002). It is trained using the generic NETLAB function netopt:

    net = netopt(net, options, x, t, 'scg');

The covariance matrix $C_0$ is computed from the network predictions (evaluated with mlafwd) and the target variables by using the MATLAB function cov:

    y = mlafwd(net, x);
    epsilon = y - t;
    C0 = cov(epsilon);

Next, $G$ and $H$ are computed using mladeriv and mlahess, and $\langle G^T H^{-1} G \rangle$ is evaluated:

    G = mladeriv(net, x);
    [hess, hdata] = mlahess(net, x, t);
    invhess = inv(hess);
    GHGavg = zeros(net.nout);
    for n = 1:ndata
        Gn = squeeze(G(n, :, :));
        GHG = Gn' * invhess * Gn;
        GHGavg = GHGavg + GHG;
    end
    GHGavg = GHGavg / ndata;

Finally, the hyperparameters are re-estimated following equations (3.44) and (3.59):

    Cin = C0 - GHGavg;
    net.Ain = inv(Cin);
    w = netpak(net);
    newalpha = (ones(net.nwts, 1) - diag(net.Ar) .* diag(invhess)) ./ w'.^2;
    net.Ar = diag(newalpha);

A.3 Gradient Computation with Backpropagation: mlagrad

The derivative of the error function with respect to the weights is needed by the optimisation algorithm and is computed in mlagrad. The function is based on the method of error back-propagation (Bishop, 1995, Section 4.8). In short, the total error function is decomposed into error terms of the individual patterns of the training dataset, so that

\[ \frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E^n}{\partial w_{ji}}. \tag{A.1} \]

The error $E^n$ depends on the weight $w_{ji}$ through the summed input $a_j = \sum_i w_{ji} z_i$ to unit $j$, hence we can write

\[ \frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E^n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i, \tag{A.2} \]

where the notation

\[ \delta_j \equiv \frac{\partial E^n}{\partial a_j} \tag{A.3} \]

has been introduced and $a_j = \sum_i w_{ji} z_i$ has been differentiated to give $\partial a_j / \partial w_{ji} = z_i$. Since the inputs $z_i$ to a particular unit are known, the $\delta$s have to be computed. For the output units,

\[ \delta_k = g'(a_k) \frac{\partial E^n}{\partial y_k} \tag{A.4} \]

(since $y_k = g(a_k)$), whereas Bishop (1995) shows that for the hidden units

\[ \delta_j = g'(a_j) \sum_k w_{kj} \delta_k. \tag{A.5} \]

Since the error derivatives of the hidden units depend on the $\delta$s of the output units, the algorithm is called back-propagation. The derivative of the output activation function in our case is $g'(a_k) = 1$ for the output units, since $g$ is a linear function. For the sum-of-squares error function (3.23) implemented in NETLAB,

\[ E^n = \frac{1}{2} \sum_k (y_k^n - t_k^n)^2, \tag{A.6} \]

the output $\delta$s evaluate to

\[ \delta_k = \frac{\partial E^n}{\partial y_k} = \frac{1}{2} \cdot 2 \, (y_k^n - t_k^n) = y_k^n - t_k^n. \tag{A.7} \]

However, for the Aires error function (3.22),

\[ E^n = \frac{1}{2} (y^n - t^n)^T \cdot A_{in} \cdot (y^n - t^n) + \frac{1}{2N} w^T A_r w \tag{A.8} \]

(where the weights term has been divided by $N$ to express its contribution to the error of a single pattern $n$), the derivative with respect to the weights becomes

\[ \frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E_D^n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} + \frac{1}{N} \frac{\partial E_W}{\partial w_{ji}}. \tag{A.9} \]

The weights term can be evaluated directly to give

\[ \frac{\partial E_W}{\partial w_p} = \frac{\partial}{\partial w_p} \left\{ \frac{1}{2} \sum_{q,s} w_q (A_r)_{qs} w_s \right\} = \sum_s (A_r)_{ps} w_s, \tag{A.10} \]

where all weights are considered to be a vector with a single index. The data term is computed with back-propagation, leading to the output $\delta$s

\[ \delta_k = \frac{\partial E_D^n}{\partial y_k} = \frac{\partial}{\partial y_k} \left\{ \frac{1}{2} (y^n - t^n)^T A_{in} (y^n - t^n) \right\} \tag{A.11} \]
\[ \phantom{\delta_k} = \sum_l (A_{in})_{kl} (y_l^n - t_l^n). \tag{A.12} \]

Here, the property that $A_{in}$ is symmetric has been used.

The above equations can be implemented in a few lines in MATLAB:

    delout = (y - t) * net.Ain;
    gdata = mlabkp(net, x, z, delout);
    w = mlapak(net);
    gprior = w * net.Ar;
    g = gdata + gprior;

First, all $\delta_k$ are computed with (A.12). Then the data term of the error gradient is computed with back-propagation (mlabkp) and the weights term (A.10) is added.

A.4 The Hessian with the Pearlmutter $\mathcal{R}\{\cdot\}$-Algorithm: mlahess, mlahdotv

Nabney (2002) uses an algorithm derived by Pearlmutter (1994) to compute the Hessian matrix of the network. The idea is to use an efficient algorithm to compute the product $v^T H$ of a vector $v$ with the Hessian, and to evaluate the full Hessian by using a sequence of unit vectors that each pick out one column of $H$. Pearlmutter uses the notation $\mathcal{R}\{\cdot\}$ for the operator $v^T \nabla$, so that

\[ v^T H = v^T \nabla (\nabla E) = \mathcal{R}\{\nabla E\}. \tag{A.13} \]

His derivation, a summary of which can be found in Bishop (1995, Section 4.10.7), leads to two expressions for the derivative of the error function with respect to the first and second layer weights, respectively:

\[ \mathcal{R}\left\{ \frac{\partial E}{\partial w_{ji}} \right\} = x_i \, \mathcal{R}\{\delta_j\}, \tag{A.14} \]
\[ \mathcal{R}\left\{ \frac{\partial E}{\partial w_{kj}} \right\} = \mathcal{R}\{\delta_k\} z_j + \delta_k \mathcal{R}\{z_j\}, \tag{A.15} \]

where the $\delta$s are the standard back-propagation expressions given by (A.4) and (A.5). As was the case for mlagrad, the NETLAB implementation had to be adapted to the new $\delta_k$. Using (A.12),

\[ \mathcal{R}\{\delta_k\} = \mathcal{R}\Big\{ \sum_l (A_{in})_{kl} (y_l^n - t_l^n) \Big\} \tag{A.16} \]
\[ \phantom{\mathcal{R}\{\delta_k\}} = \mathcal{R}\Big\{ \sum_l (A_{in})_{kl} y_l^n \Big\} - \underbrace{\mathcal{R}\Big\{ \sum_l (A_{in})_{kl} t_l^n \Big\}}_{=0} \tag{A.17} \]
\[ \phantom{\mathcal{R}\{\delta_k\}} = \sum_l \underbrace{\mathcal{R}\{(A_{in})_{kl}\}}_{=0} \, y_l^n + \sum_l (A_{in})_{kl} \mathcal{R}\{y_l^n\} \tag{A.18} \]
\[ \phantom{\mathcal{R}\{\delta_k\}} = \sum_l (A_{in})_{kl} \mathcal{R}\{y_l^n\}. \tag{A.19} \]

Together with the results

\[ \mathcal{R}\{a_j\} = \sum_i v_{ji} x_i, \tag{A.20} \]
\[ \mathcal{R}\{z_j\} = g'(a_j) \mathcal{R}\{a_j\}, \tag{A.21} \]
\[ \mathcal{R}\{y_k\} = \sum_j w_{kj} \mathcal{R}\{z_j\} + \sum_j v_{kj} z_j, \tag{A.22} \]
\[ \mathcal{R}\{\delta_j\} = g''(a_j) \mathcal{R}\{a_j\} \sum_k w_{kj} \delta_k + g'(a_j) \sum_k v_{kj} \delta_k + g'(a_j) \sum_k w_{kj} \mathcal{R}\{\delta_k\} \tag{A.23} \]

(see Bishop, 1995, Section 4.10.7; $v_{ji}$ denotes the element of $v$ that corresponds to the weight $w_{ji}$), the algorithm is implemented in mlahdotv. Again, all $\delta_k$s in (A.19) can be evaluated in one vector:

    rdel = ry * net.Ain;

In mlahess, the data Hessian part of the total Hessian given by (3.27) is evaluated using mlahdotv, then the weights part $A_r$ is added:

    for v = eye(net.nwts);
        hdata(find(v), :) = mlahdotv(net, x, t, v);
    end
    h = hdata + net.Ar;

A.5 The Jacobian and its Distribution: mlajacob, mlajacobuncertainty

As Nabney (2002, Chapter 5) notes, the Jacobian matrix for a two layer MLP can be computed with

\[ \frac{\partial y_k}{\partial x_i} = \sum_{j=1}^{M} w_{kj} \, (1 - z_j^2) \, w_{ji}, \tag{A.24} \]

which is based on a tanh hidden unit activation function. The implementation of (A.24) in mlajacob is straightforward. Following Nabney's radial basis functions implementation (Nabney, 2002, Section 6.4.2), I used the shortcut $\Psi_{ji} = (1 - z_j^2) \, w_{ji}$, leading to

    [y, z, a] = mlafwd(net, x);
    for n = 1:ndata
        Psi = (ones(net.nin, 1) * (1 - z(n, :).^2)) .* net.w1;
        jac(n, :, :) = Psi * net.w2;
    end

Note that the function returns a three dimensional array which contains the Jacobian for each individual input pattern.

The Aires et al. (2004b) technique to estimate the variability in the Jacobian (see section 3.5) is implemented in mlajacobuncertainty. Samples from the weights distribution are generated with the MATLAB STATISTICS TOOLBOX function mvnrnd, which returns random samples from a multivariate Gaussian distribution (here, according to (3.30), the most probable weights vector $w^*$ is used as the mean of the distribution and the inverse Hessian $H^{-1}$ is the covariance matrix):

    invH = inv(H);
    wmp = mlapak(net);
    wsamples = mvnrnd(wmp, invH, nsamples);

For each of these weights samples, the Jacobian is computed and averaged over all input patterns $x^n$:

    mjac = zeros(nsamples, net.nin, net.nout);
    for i = 1:nsamples
        neti = mlaunpak(net, wsamples(i, :));
        jac = mlajacob(neti, x);
        mjac(i, :, :) = squeeze(mean(jac));
    end

Eventually, the mean and standard deviation of all mean Jacobians are computed and returned:

    meanjac = squeeze(mean(mjac));
    stdjac = squeeze(std(mjac));

A.6 Normalisation of Input and Output Variables

The functions mlarescale and mlareverserescale implement the normalisation methods mentioned in section 3.6.1. mlarescale implements the simple linear rescaling (3.61):

    nrm.mean = mean(D);
    nrm.norm = std(D);
    RD = D - (ones(ndata, 1) * nrm.mean);
    RD = RD ./ (ones(ndata, 1) * nrm.norm);

All variables are passed as a matrix D and processed all at once.
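As a quick sanity check of the simple linear rescaling, the following sketch standardises a small made-up data matrix and then undoes the transformation. The field names nrm.mean and nrm.norm follow the excerpt; the inverse step is only an assumption about what mlareverserescale does, since its code is not listed, and the data values are arbitrary.

```matlab
% Minimal sketch of the simple linear rescaling and a possible inverse.
% The inverse is an assumption, not the listed mlareverserescale code.
ndata = 5;
D = [10 + 2 * randn(ndata, 1), 280 + 5 * randn(ndata, 1)];  % two made-up variables

% Forward: subtract the column means, divide by the column standard deviations.
nrm.mean = mean(D);
nrm.norm = std(D);
RD = (D - ones(ndata, 1) * nrm.mean) ./ (ones(ndata, 1) * nrm.norm);

% Inverse: multiply by the standard deviations and add the means back.
Dback = RD .* (ones(ndata, 1) * nrm.norm) + ones(ndata, 1) * nrm.mean;

% Largest round-trip error; it should be at the level of rounding errors.
maxErr = max(max(abs(Dback - D)));
fprintf('maximum round-trip error: %g\n', maxErr);
```

After the forward step every column of RD has zero mean and unit standard deviation, which is a convenient property to test before the rescaled data are used for training.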
Whitening, the more sophisticated linear rescaling method (3.62), is also implemented in mlarescale:

    nrm.mean = mean(D);
    nrm.sig = cov(D);
    [evU, evL] = eig(nrm.sig);
    nrm.evU = evU;
    nrm.evL = evL;
    RD = D - (ones(ndata, 1) * nrm.mean);
    RD = (evL^(-.5) * evU' * RD')';

The corresponding code to reverse both these rescaling methods is implemented in mlareverserescale.

A.7 Implementation Difficulties: Regularisation of the Hessian and Numerical Symmetry

While working with the implementation of the Aires et al. algorithm, I encountered two difficulties. One of them, the positive definite character of the Hessian, has been discussed in section 3.6.2. The Nabney (2002) method is implemented at the corresponding places in the mla* routines:

    [hess, hdata] = mlahess(net, x, t);
    [evec, evl] = eig(hdata);
    evl = evl .* (evl > 0);
    hdata = evec * evl * evec';
    hess = hdata + net.Ar;

Another problem I encountered while working with my implementation is that the Hessian matrices that are computed (and consequently their inverses) are not completely numerically symmetric, as would be expected. This means that due to numerical inaccuracies in MATLAB, the elements $h_{ij}$ of the Hessian follow $h_{ij} = h_{ji} + \epsilon$, with $\epsilon$ being a small perturbation. Such small errors can amplify considerably when the matrices are multiplied with other matrices. Furthermore, the function mvnrnd used in mlajacobuncertainty expects a numerically symmetric covariance matrix. However, the issue is easily resolved by copying one half of the matrix into the other half where needed.
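The symmetrisation described above can be sketched in a few lines. This is an illustrative example only: the matrix H is made up, and copying the upper triangle into the lower one is an assumption about which half is kept; the snippet is not taken from the mla* routines.

```matlab
% Build a matrix that is symmetric only up to a small perturbation,
% imitating h_ij = h_ji + epsilon for a computed Hessian (made-up data).
n = 4;
B = randn(n);
H = B' * B + 1e-12 * randn(n);

% Copy the upper triangle (including the diagonal) into the lower triangle.
Hsym = triu(H) + triu(H, 1)';

% Largest asymmetry before and after the correction.
asymBefore = max(max(abs(H - H')));
asymAfter = max(max(abs(Hsym - Hsym')));
fprintf('asymmetry before: %g, after: %g\n', asymBefore, asymAfter);
```

Because each off-diagonal element is copied rather than averaged, the result is exactly symmetric in floating point, which is the property a routine such as mvnrnd checks for in its covariance argument.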
Neural network satellite retrievals of nocturnal stratocumulus cloud properties Rautenhaus, Marc 2007
Item Metadata
Title | Neural network satellite retrievals of nocturnal stratocumulus cloud properties |
Creator | Rautenhaus, Marc |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | I investigate the feasibility of retrieving cloud top droplet effective radius, optical thickness and cloud top temperature of nocturnal marine stratocumulus clouds by inverting infrared satellite measurements using an artificial neural network. For my study, I use the information contained in the three infrared channels centred at 3.7, 11.0 and 12.0 μm of the Moderate Resolution Imaging Spectroradiometer (MODIS) on-board NASA's Terra satellite, as well as sea surface temperature. A database of simulated top-of-atmosphere brightness temperatures of a range of cloud parameters is computed using a correlated-k parameterisation which I have embedded in the radiative transfer package libRadtran. The database is used to train feed-forward neural networks of different architecture to perform the inversion of the satellite measurements for the cloud properties. I investigate the application of Bayesian methods to estimate the retrieval uncertainties, and analyse the Jacobian of the networks in order to gain information about the functional dependence of the retrieved parameters on the inputs. A high variability in the Jacobian indicates that the nocturnal retrieval problem is ill-posed. My experiments show that because the problem is ill-conditioned, it is very difficult to find a network that approximates the database of simulated brightness temperatures well. Sea surface temperature proves to be a necessary input. I compare the retrievals of a selected network architecture with in-situ cloud measurements taken during the second Dynamics and Chemistry of Marine Stratocumulus experiment. The results show general agreement between retrievals and in-situ observations, although no collocated comparisons are possible because of a time lag of five hours between both measurements. I establish that the uncertainty estimates are prone to numerical problems and their results are questionable.
I show that the Jacobian is a valuable tool in evaluating the retrieval networks. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2011-03-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0052641 |
URI | http://hdl.handle.net/2429/32189 |
Degree | Master of Science - MSc |
Program | Atmospheric Science |
Affiliation | Science, Faculty of Earth, Ocean and Atmospheric Sciences, Department of |
Degree Grantor | University of British Columbia |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |