gives an idea of how cloud emission as well as absorption depend on r_eff.¹¹ The cloud top radiance at a certain wavelength then becomes a complex function of cloud temperature (determining the black body emission), particle size, the thickness of the cloud (determining how many scattering or absorption/emission events occur), and, for clouds thin enough to allow for transmission, surface temperature (assuming a surface emissivity of 1). Of course, cloud inhomogeneities add further complexity. Despite this complexity, the differences in the wavelength dependence of cloud droplet scattering, absorption and emission properties can be exploited for the retrieval. While the objective of Baum et al. (1994) was to detect multilevel cloud situations, Minnis et al. (1995) and Heck et al. (1999) report on a nighttime retrieval algorithm within the Clouds and the Earth's Radiant Energy System (CERES) programme. They use a very simple radiative transfer approach, parametrising the forward model by an efficiently computable function that incorporates cloud optical depth and the emissivity dependence on effective radius, and minimise the sum-of-squares error between the model and observations. Perez et al. (2000) also include surface temperature information, retrieved from clear sky pixels in the vicinity of the clouds, in their retrieval method to determine effective radius and cloud top temperature of nocturnal marine stratocumulus clouds. Their forward model consists of a vertically uniform, plane-parallel cloud over a sea surface. Emission and absorption effects of the atmosphere above the cloud are neglected. Following Baum et al. (1994), Perez et al.
(2000) employ brightness temperature differences (BTD) between the satellite channels to express the varying behaviour of the cloud radiative properties with wavelength, and extensively study the behaviour of the forward model when τ, r_eff, and temperature are varied.¹²

¹¹ Indeed, although making the retrieval of particle size possible at nighttime, this dependence becomes a problem in the removal of the thermal contribution to the near infrared channels during daytime.

Figure 1.9: Physical basis for the remote sensing methods used by Heck et al. (1999), Perez et al. (2000), Baum et al. (2003) and Cerdena et al. (2007): plotted is the brightness temperature difference (BTD) between AVHRR channels 3 and 4 (3.7 and 11.0 μm), as measured at the top of the atmosphere, versus the brightness temperature for channel 4. The curves are based on radiative transfer computations for horizontally and vertically homogeneous clouds with varying particle radius and optical thickness (left) and temperature (right). By creating such curves with a radiative transfer model and comparing them to the brightness temperatures measured by the satellite, it is possible to infer information about the cloud properties that caused the satellite measurements (reprinted from Perez et al., 2000, © 2000, with permission from Elsevier).

Figure 1.9 shows the BTD between the 3.7 μm and 11 μm channels of the AVHRR instrument (channels 3 and 4, respectively), henceforth abbreviated BTD(3.7-11), versus the 11 μm brightness temperature (BT) for vertically uniform clouds with (a) a fixed cloud top temperature of 285 K and several effective radii and optical depths, and (b) a fixed radius of 8 μm and varying cloud temperature.
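The brightness temperature underlying these diagrams is the temperature at which a blackbody would emit the observed radiance. A minimal monochromatic sketch of the conversion is given below; treating each channel as monochromatic at its centre wavelength is a simplifying assumption, not the full band-averaged computation used in practice:

```python
import numpy as np

H = 6.626e-34   # Planck constant [J s]
C = 2.998e8     # speed of light [m/s]
KB = 1.381e-23  # Boltzmann constant [J/K]

def planck_radiance(wavelength_m, temp_k):
    """Monochromatic Planck radiance B_lambda(T) [W m^-2 sr^-1 m^-1]."""
    a = 2.0 * H * C**2 / wavelength_m**5
    b = H * C / (wavelength_m * KB * temp_k)
    return a / (np.exp(b) - 1.0)

def brightness_temperature(wavelength_m, radiance):
    """Invert the Planck function: temperature of a blackbody emitting `radiance`."""
    a = 2.0 * H * C**2 / wavelength_m**5
    return H * C / (wavelength_m * KB * np.log(a / radiance + 1.0))

# BTD(3.7-11) for blackbody radiances at one temperature is ~0 by construction;
# real clouds deviate because emissivity differs between the two wavelengths.
rad37 = planck_radiance(3.7e-6, 285.0)
rad11 = planck_radiance(11.0e-6, 285.0)
btd = brightness_temperature(3.7e-6, rad37) - brightness_temperature(11.0e-6, rad11)
```

The retrieval signal is precisely the departure of this BTD from zero when the cloud is not a perfect blackbody at both wavelengths.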
The curves span an area of possible solutions of the forward model, and if some parameters can be fixed the remaining ones can readily be inferred from these diagrams. Similar diagrams exist for the BTD between 11 and 12 μm (BTD(11-12) in the following). Yet two issues complicate the retrieval. One problem for nocturnal retrievals is the saturation of cloud emission with optical depth. For thin clouds a number of photons from the surface and all levels within the cloud layer are able to reach the satellite without absorption. For thicker clouds absorption will effectively remove all photons from lower levels in the cloud layer before they reach cloud top, leaving the cloud top radiance to be governed only by the upper cloud (of course, the absorption is wavelength dependent).

¹² The intensities in the cost function (1.22) now become brightness temperatures and brightness temperature differences.

Figure 1.8d illustrates the phenomenon. For an ω̃ of 0.9, the emissivity of the cloud, equal to its absorptance, quickly reaches an asymptotic limit. A convergence for larger optical depths is also visible in the BTD diagrams of Figure 1.9. However, since most marine Sc have optical thicknesses of less than 10 (Xu et al., 2005; Rossow and Schiffer, 1999), the saturation problem does not pose a significant constraint. Also problematic are ambiguities in the forward model. As noted by Perez et al. (2000), some sets of measured brightness temperatures can be caused by several clouds having differing optical properties. This, of course, makes the unique retrieval of the cloud parameters impossible. Perez et al. (2000) find that the effect is larger for BTD(3.7-11), but also occurs for BTD(11-12), especially for effective radii in the range of 4-7 μm. The minimisation technique used by Perez et al. (2000) is similar to the one used by Nakajima and King (1990) and King et al.
(1997), who first retrieve optical depth from the visible wavelengths in order to transform the optimisation with respect to r_eff into a one-dimensional problem. Noting the complexity of the hypersurface of the cost function (with equally deep minima that correspond to the ambiguities), Perez et al. (2000) also split the retrieval into two parts. First, cloud top temperature is recovered from the BTD(11-12) signal. Fixing the atmospheric profile using the inferred surface and cloud top temperatures, they then run the forward model again for a range of effective radii. This produces a one-dimensional minimisation problem with respect to r_eff, which is easier to solve for multiple solutions. The retrieval result then consists of a cloud top temperature, together with a list of possible effective radii. Perez et al. (2000) compare the retrieval results to time-averaged in-situ measurements performed on the Canary Islands, and find that satellite and in-situ observed r_eff agree within 1.25 μm. Gonzalez et al. (2002) report on a modified version of Perez et al. (2000). They eliminate the ambiguities due to effective radius by building a LUT in which, for situations where multiple solutions exist, they only store the median radius. The discarded solutions are used to compute a confidence interval around the central value, which they find to be as large as 7 μm for large r_eff in thin clouds. The cost function then possesses a unique global minimum, which they find with a genetic algorithm in order to avoid multiple local minima. In addition to cloud temperature and effective radius, they also retrieve optical depth values. Perez et al. (2002) adapt the retrieval to channels from the MODIS instrument. Instead of using BTs at 3.7, 11 and 12 μm, they employ the 3.7, 3.9, 8.5 and 11 μm channels (plus surface temperature).
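The attraction of the two-step strategy is that a one-dimensional cost function can be scanned exhaustively, so that all candidate solutions are found rather than a single local minimum. A sketch of such a scan follows; the quartic cost is purely illustrative, chosen to have two equally deep minima like the ambiguities described above, and does not come from a real forward model:

```python
import numpy as np

def local_minima(radii, cost):
    """Return all radii where the sampled cost curve has a local minimum."""
    hits = []
    for i in range(1, len(cost) - 1):
        if cost[i] < cost[i - 1] and cost[i] < cost[i + 1]:
            hits.append(radii[i])
    return hits

# Illustrative double-well cost with minima near r_eff = 5 and 12 micron,
# mimicking two clouds that produce the same brightness temperatures.
radii = np.linspace(2.0, 15.0, 131)
cost = ((radii - 5.0) * (radii - 12.0)) ** 2
candidates = local_minima(radii, cost)
print(candidates)  # both candidate radii are reported, not just one
```

A gradient-based minimiser started from one point would return only one of the two radii; the exhaustive scan recovers the full list of possible effective radii, which is exactly the output format of the Perez et al. (2000) retrieval.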
Furthermore, atmospheric contributions to the top-of-atmosphere (TOA) BTs are included this time, assuming a known atmospheric profile from a different source. The error surface is minimised with a scatter search algorithm. Baum et al. (2003) also extend the Baum et al. (1994) work to the MODIS instrument. Their objective is again to recognise situations in which high level cirrus overlies low boundary layer clouds. Including the 8.5 μm channel in their computations, they note that BTD(8.5-11) (the BTD between 8.5 and 11 μm) shows a strong dependence on cloud top temperature but not on particle radius. Apart from the expensive minimisation techniques employed in the outlined nighttime retrieval methods, the procedures also suffer from the problem of computing a meaningful uncertainty estimate. Perez et al. (2000), as well as Gonzalez et al. (2002), perform a sensitivity analysis of their retrievals with respect to errors in the observed brightness temperatures. They find that a variation of ±2 K in the observed BTs of cloudy pixels leads to variations in r_eff of more than 3 μm, while variations of ±0.5 K in the clear sky BTs lead to errors in r_eff of less than 0.5 μm (Perez et al., 2000). Gonzalez et al. (2002) note that the sensitivities of retrieved cloud parameters to input BTs are largest in the case of thin clouds. They also discuss the effect of the ambiguities in effective radius and the error in the presence of broken clouds on the subpixel scale (neglecting nonlinear effects). However, these sensitivity analyses cannot provide more than an idea of the general magnitude of the uncertainty. They also do not include the effects of assumptions made in the forward model or the accuracy of the numerical inversion. Indeed, Pincus et al. (1995), for the case of optical depth retrievals, report on the difficulty of obtaining a good uncertainty estimate.
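The generic shape of such a perturbation analysis can be sketched independently of any particular retrieval: perturb the input BTs, rerun the inversion, and record the spread of the outputs. The linear toy retrieval below is a stand-in for the real inversion, and its coefficients are hypothetical:

```python
import random

def toy_retrieval(bt37, bt11):
    """Hypothetical linear stand-in for a real inversion: maps two
    brightness temperatures to an effective radius in micron."""
    return 8.0 + 1.5 * ((bt11 - bt37) - 2.0)

random.seed(0)
base = toy_retrieval(283.0, 285.0)

# Perturb each BT by up to +/- 2 K (uniform) and record the retrieval spread.
spread = []
for _ in range(1000):
    r = toy_retrieval(283.0 + random.uniform(-2, 2),
                      285.0 + random.uniform(-2, 2))
    spread.append(abs(r - base))
print(max(spread))  # worst-case r_eff error over the sampled perturbations
```

As the text notes, this kind of analysis only brackets the general magnitude of the error: every perturbed sample requires a full rerun of the inversion, and the forward-model assumptions themselves are never perturbed.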
In order to obtain an accurate error estimate for an individual retrieval or for an average over a population of retrievals, additional inversions with perturbed input variables would have to be performed, which increases the computational cost.

1.5.3 Artificial Neural Networks

An interesting development to tackle the high computational cost of the inverse procedure is the use of artificial neural networks (ANNs). The theory of ANNs as a tool for statistical data modelling has been developed since the 1950s (e.g. Rosenblatt, 1958), but they were not used much in the atmospheric sciences until the 1990s (e.g. McCann, 1992; Hsieh and Tang, 1998). General references to ANNs are the books by Bishop (1995, 2006). Inspired by the biological archetype of the human brain, an ANN imitates the concept of interconnected nodes, the neurons, that can "fire" in order to pass on information if certain input conditions are met.

Figure 1.10: Schematic diagram of a feed-forward network with two layers of adaptive weights w^(1) and w^(2). The input neurons on the left are labelled x_i. The information propagates in a forward direction through the hidden neurons z_j to the output neurons y_k. In addition to the input and hidden neurons, there are two bias parameters with a fixed input (activation) of x_0 = 1 and z_0 = 1, respectively.

In particular, a neuron in an ANN computes the sum over all of its input values and returns the value of an activation function of that sum. This activation function can be an arbitrary function, but usually a sigmoidal function is used that returns 1 if a certain threshold is reached and 0 otherwise. The connections between the individual neurons are weighted, and by adjusting these weights, the ANN can "learn" a certain behaviour.
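A single neuron as just described is only a few lines of code; the sketch below uses the hard-threshold activation mentioned above (the weights and bias values are arbitrary examples):

```python
def neuron(inputs, weights, bias, threshold=0.0):
    """One neuron: weighted sum of inputs plus bias, passed through a
    hard-threshold activation (1 if the threshold is reached, else 0)."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s >= threshold else 0

# 0.8*1.0 - 0.4*0.5 - 0.3 = 0.3 >= 0, so the neuron "fires".
fired = neuron([1.0, 0.5], [0.8, -0.4], bias=-0.3)
print(fired)  # 1
```

In practice the hard threshold is replaced by a smooth sigmoid so that the network is differentiable and the weights can be adjusted by gradient-based training.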
Artificial neural networks have been widely used for both classification and regression problems (Bishop, 1995). There are many different types of ANNs, but the important one for this study and the works cited here is the multilayer perceptron (MLP).¹³ Its architecture is schematically illustrated in Figure 1.10. Its neurons are grouped into several layers: one input layer, one output layer, and one to several hidden layers.¹⁴ Each neuron in the input layer corresponds to an input variable, and each neuron in the output layer to an output. The number of hidden layers and neurons in them is variable and has to be determined individually for each case. In order to learn a behaviour, the network has to be trained with a training dataset that includes input values as well as their corresponding target values. The weights are then modified until, for all input values, the outputs of the ANN equal the target values to a sufficient accuracy. An important property of such ANNs is that, given a large enough number of hidden neurons and weights, they are able to approximate any arbitrary smooth function (e.g. Bishop, 1995). This makes them ideally suited for nonlinear regression problems in which little is known about the shape of the cost function.

¹³ I will not give a more detailed description of multilayer perceptrons in this thesis; a thorough introduction to many aspects concerning neural networks can be found in Bishop (1995), a shorter review in Bishop (1994).

¹⁴ The numbering of the layers can be confusing. A "two-layer perceptron" refers to a network with two layers of adaptive weights, that is, only one layer of hidden neurons. "Two hidden layers", however, refers to two layers of hidden neurons and hence three layers of adaptive weights.
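The forward pass of the two-weight-layer network of Figure 1.10 can be written out directly. This is a minimal sketch with a logistic sigmoid in the hidden layer and linear outputs; the layer sizes, the random (untrained) weights, and the input scaling are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def mlp_forward(x, w1, b1, w2, b2):
    """Feed-forward pass: hidden z_j = sigma(W1 x + b1), output y_k = W2 z + b2.
    The bias vectors b1, b2 play the role of the x_0 = 1, z_0 = 1 units."""
    z = 1.0 / (1.0 + np.exp(-(w1 @ x + b1)))  # sigmoidal hidden activations
    return w2 @ z + b2                         # linear output layer

# Arbitrary sizes: 3 inputs (e.g. brightness temperatures), 10 hidden
# neurons, 2 outputs (e.g. r_eff and cloud top temperature).
w1, b1 = rng.normal(size=(10, 3)), rng.normal(size=10)
w2, b2 = rng.normal(size=(2, 10)), rng.normal(size=2)
y = mlp_forward(np.array([283.0, 285.0, 284.0]) / 300.0, w1, b1, w2, b2)
print(y.shape)  # (2,)
```

Training then consists of adjusting w1, b1, w2, b2 until this forward pass reproduces the target values of the training dataset; the evaluation itself remains just two matrix products and one sigmoid, which is the source of the speed advantage discussed next.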
Furthermore, once the training process is completed, the computation of the output values involves only a few summations and evaluations of the activation functions, which makes a trained ANN very fast. Of course, there are also disadvantages, as it can be difficult to determine the number of hidden neurons or to find a good set of weights (Bishop, 1995; more specific problems arising in the atmospheric sciences are outlined by Hsieh and Tang (1998)). In the context of atmospheric satellite remote sensing, some work has been done that employs ANNs. For instance, Krasnopolsky et al. (1995) and Krasnopolsky et al. (2000) used ANNs to retrieve surface wind speeds over the ocean from microwave measurements. Aires et al. (2001) determined surface temperature, water vapour content and cloud liquid water path, also from microwave observations, and Aires et al. (2002) retrieved atmospheric and surface temperature from infrared measurements. Faure et al. (2001c) applied ANNs to daytime retrievals of cloud parameters. In their work, they investigated the feasibility of simultaneously retrieving cloud optical thickness, effective radius, a relative cloud inhomogeneity and fractional cloud cover of inhomogeneous clouds from observations at 0.6, 1.6, 2.1 and 3.7 μm. They used a three-dimensional Monte Carlo radiative transfer model, which accounts for nonlinear effects due to the cloud inhomogeneities, to build a database of 3000 clouds. An ANN with two hidden layers, 10 neurons in each layer, was trained from this database in order to model the inverse function. Faure et al. (2001c) concluded that a retrieval of inhomogeneous clouds using an ANN is feasible. The study was extended by Cornet et al. (2004) to combine daytime observations available at different resolutions. In order to account for the increased complexity of the problem, they employed a combination of several ANNs and applied the method to MODIS data (Cornet et al.
, 2005). However, they did not compare their results to in-situ observations. Schüller et al. (2003, 2005) also used ANNs for daytime retrievals of droplet number concentration, geometrical thickness and liquid water path of shallow convective clouds, but they did not give many details on the architecture they used. Motivated by the high computational efficiency of an ANN-approximated inverse function, Cerdena et al. (2004) continued the work of Perez et al. (2000) and Gonzalez et al. (2002), becoming the first to apply ANNs to nocturnal cloud property retrievals. In their study, they empirically explored several network architectures with differing numbers of hidden layers and neurons, and found that two layers with 100 neurons in the first and 20 neurons in the second layer yielded the best results. The ANNs were trained using a database of 20,000 data points, with an additional 10,000 independent points for validation. Cerdena et al. (2004) compare two forward models, one employing the vertically uniform cloud model, and the other an adiabatic profile with vertically increasing liquid water content (more details in Chapter 2). Analysing the same data as Perez et al. (2000) and Gonzalez et al. (2002), the ANN-retrieved effective radii agreed within 2 μm with the in-situ observed values, with the adiabatic cloud model yielding slightly improved results. Cerdena et al. (2007) extended the retrieval method to the daytime case and also further investigated the network architecture. Utilising genetic algorithms, they were able to improve the network design to a network containing 20 neurons in the first and 5 neurons in the second hidden layer, yielding similar results to their first, more expensive architecture. They also presented a sensitivity analysis, similar to the one conducted by Perez et al.
(2000), by perturbing the input brightness temperatures by 0.5 K and noting the effect on the retrieved parameters for thin (τ < 2) as well as thick (τ > 8) clouds. For thin clouds, such perturbations can lead to errors of over 4 μm in r_eff, 4 K in cloud temperature (T), and 0.6 in τ. For thick clouds, errors are smaller in r_eff and T, but larger for τ. However, in the Cerdena et al. (2007) work, the problem of estimating an accurate error interval for given inputs remains. Furthermore, they do not discuss the effect of an uncertainty in the neural network fit on the outputs (i.e. the uncertainty of having found the best set of weights during the training process), which can also be significant (Aires et al., 2004a). Indeed, analytical methods exist both to compute the Jacobian (i.e. the sensitivity) of a network for a given input and to compute the output uncertainty due to the network fit. MacKay (1992a,b, 1995) developed a Bayesian framework for neural network training that, besides making the training process more stable, allows for an estimation of the uncertainties in the network predictions that are due to the network fit and to noise in the training data. Aires (2004) and Aires et al. (2004a,b) generalised this concept to the multidimensional case and demonstrated its application in the context of their previous microwave remote sensing problem (Aires et al., 2001). The Jacobian is very important for the validation of the network fit. Since for the inverse problem we do not know the shape of the function to model, we can only verify the ANN by comparing its predictions to independent test data or by checking the physical consistency of the Jacobian. This means that the output variables should depend on the expected input variables. For instance, during nighttime, we expect the effective radius output to be sensitive mainly to the BTD(3.7-11) signal (cf.
Figure 1.9), while cloud top temperature should depend mainly on the thermal signals at 11 or 12 μm. This can even be done in a quantitative way by numerically estimating sensitivities at certain points in the LUT and comparing them to the ANN-predicted values. However, inverse problems are often ill-conditioned. That means that there exist several differing functional mappings that approximate the training data equally well, but do not necessarily represent the true function that generated the data: the training data are simply not precise enough to unambiguously specify the correct function. If such a bad mapping is afterwards applied to input data that were not part of the training dataset, the prediction may be far from the actual value. The network is then said to generalise badly (Bishop, 1995). While this problem is difficult to avoid, additional care should be exercised when using ANNs for ill-conditioned problems. Aires et al. (2004b) describe an approach to estimate the variability of the Jacobian due to the uncertainty in the network fit. This variability is a good indicator of how unstable the training process was and thus how ill-conditioned the problem is.

1.6 Objectives

A neural network can potentially provide a fast retrieval algorithm for determining nocturnal cloud properties that might be able to overcome the performance problem encountered in classical retrieval approaches. Furthermore, the application of the methods developed by MacKay and Aires et al. could be very beneficial for understanding the "black-box nature" of ANNs and for improving the retrieval quality. I will extend the promising Cerdena et al. (2007) results to a different network architecture, a different satellite sensor and different in-situ data. My focus lies on the uncertainty in the retrieval, and what we can learn from its sensitivities.
In particular, the objectives for this thesis can be outlined as follows. The goal is to set up a retrieval method capable of determining cloud effective radius, cloud top temperature and cloud optical thickness from nocturnal measurements of the MODIS instrument. In-situ data for evaluation purposes are available from the DYCOMS-II field campaign, and the method should be able to provide error bars on the results. The individual steps include:

• The Aires (2004) and Aires et al. (2004a,b) method will be implemented, and it will be investigated whether it is suitable for the given problem. Apart from obtaining an uncertainty estimate of the network fit, the hope is that ambiguities in the training database will cause a larger uncertainty in the retrieval.

• A forward model will be developed that is capable of computing top-of-atmosphere brightness temperatures for the relevant near and thermal infrared MODIS channels for varying cloud parameters. A LUT (LUT will be used synonymously with database from here on) has to be computed using this model.

• The implemented methods will be applied to the retrieval problem. ANNs have to be trained from the computed LUT, different network architectures have to be explored and the results have to be evaluated. For this thesis, the goal is to retrieve one scene from the DYCOMS-II field campaign (so that several parameters including the atmospheric profile can be prescribed in order to simplify the problem) and to evaluate the results using the in-situ data.

• Finally, the uncertainties and sensitivities of the retrieval will be analysed and their usefulness investigated.

This thesis is meant to build a basis for further research.
If the retrieval of the test case proves successful, future work can be performed on evaluating further scenes, with an eventual goal of creating a general method that could be used "operationally" for any arbitrary scene. The remainder of this document is structured as follows. In Chapter 2, the radiative transfer model will be described. The employed cloud model will be discussed, as well as radiative transfer techniques for computing the TOA brightness temperatures. A short section will be devoted to the effects of the overlying atmosphere. In Chapter 3, I discuss neural network techniques. The Aires method will be derived, and its implementation will be applied to simple test cases. The behaviour of the method for these cases will be analysed and its applicability to the remote sensing problem discussed. The training of ANNs from the LUT and the retrieval of the test scene is the topic of Chapter 4. Issues concerning the retrieval setup will be discussed, and the results of sensitivity and uncertainty estimates are presented. A discussion of the usefulness of the estimated variables is given, and the thesis concludes with a summary of the work in Chapter 5.

Chapter 2

Radiative Transfer and Development of the Forward Model

In this chapter I will describe the design of the forward model that is used for the case study. The scene of July 11, 2001, was chosen for the retrieval, and the in-situ aircraft measurements used in this chapter are taken from DYCOMS-II research flight II (RF02), which took place on that day. The DYCOMS-II campaign will briefly be introduced in section 2.1. Next, I will introduce the radiative transfer equation (RTE) that mathematically describes the propagation of radiation through the atmosphere and discuss how it can be solved. For the radiative transfer calculations, a cloud model and droplet size distribution are needed.
Using the RF02 data, I will demonstrate the adequacy of the adiabatic approximation for the cloud model and show that the size distribution can be described by a modified gamma distribution (sections 2.2 and 2.3). The question of which MODIS channels are best suited for the retrieval is discussed in section 2.4. The radiances measured by the sensor at these channels are always average radiances over a wavelength interval, since the instrument cannot measure at individual wavelengths. Hence, the forward model must be able to compute such interval-averaged intensities. I will introduce the correlated-k approximation as an efficient way to calculate gaseous absorption over these intervals. The forward model also requires specification of absorption and emission by gaseous atmospheric constituents above the cloud, including water vapour and carbon dioxide. This is discussed in section 2.5. The radiative transfer package libRadtran (Mayer and Kylling, 2005) is employed to compute cloud top radiances. In section 2.6, the setup of the forward model including libRadtran will be described, and I conclude the chapter by demonstrating that the forward model is able to reproduce the BTD relationships of Baum et al. (1994) and Perez et al. (2000).

2.1 DYCOMS-II Data

The DYCOMS-II field campaign took place from July 7, 2001, to July 28, 2001. During nine nights, research flights collected extensive in-situ datasets in the nocturnal Sc cloud layer over the east Pacific Ocean off the coast of California, approximately 350-400 km west-southwest of San Diego. The campaign and the available data are described by Stevens et al. (2003a,b). While the major objective of the campaign was to perform measurements to advance understanding of entrainment and drizzle processes, the collected data are also very useful for this remote sensing study.
During seven nights, circles with an approximate diameter of 60 km (about 30 min of flight) were flown at several heights in the boundary layer (subcloud and cloud layer). Additionally, frequent vertical profiles were taken. Data useful for this study include the measurements taken during vertical profiling and in horizontally advected circles flown just below cloud top and above cloud base. Besides the standard meteorological measurements of temperature, humidity, pressure and wind speed, data on liquid water content, droplet concentration and droplet sizes were taken (a complete list can be found in Stevens et al. (2003b)). The ground speed of the airplane was about 100 m s^-1 and measurements relevant for this work were taken every second. To ensure comparability with the MODIS data, I have averaged the DYCOMS-II data over 10 s intervals in order to yield measurements on the 1 km scale. The measured droplet size distributions allowed for the computation of effective radius and mean radius. For this thesis, the flights of July 11 and July 13 (RF02 and RF03 in the DYCOMS-II literature) were chosen. Satellite images of both scenes contain a number of ship tracks that provide contrasting droplet sizes which can be used to check the physical consistency of the retrieval. While the July 11 case will serve as the retrieval example in Chapter 4, data from both nights are used in this chapter for evaluating the cloud model. The actual satellite images from the MODIS sensor will be introduced in Chapter 4.

2.2 Scattering and Droplet Spectra

2.2.1 Radiative Transfer Equation

Figure 2.1 illustrates the processes that influence radiation propagating along a line-of-sight as it passes through the atmosphere. As discussed in section 1.3, energy can be lost by absorption and scattering out of the direction of propagation, while emission and scattering into the beam represent sources of energy.
Figure 2.1: Processes that influence radiation as it passes through the atmosphere: absorption, emission, and scattering into and out of the beam.

The change in intensity across an infinitesimal volume hence is the sum of a sink term describing extinction by absorption and scattering, and source terms representing emission and scattering¹⁵:

dI = dI_ext + dI_emit + dI_scat.   (2.1)

¹⁵ Note that I am still omitting the subscript λ. All equations given here correspond to the monochromatic case.

The extinction term is described by Beer's Law (1.6):

dI_ext = −β_e I ds,   (2.2)

where ds represents an infinitesimal path length along the direction of propagation. Emission is given by the emissivity of the medium times the Planck function (1.16):

dI_emit = β_a B(T) ds.   (2.3)

Here, Kirchhoff's Law (1.17) has been used to express the emissivity in terms of the absorption coefficient. The scattering source term, however, is more complicated. The scattering phase function p(Ω′, Ω) expresses the idea that radiation from any direction Ω′ passing through an infinitesimal volume can contribute scattered radiation to our direction of interest Ω. Furthermore, dI_scat must be proportional to the scattering coefficient β_s, which describes how much scattering occurs:

dI_scat = (β_s / 4π) ∫_4π p(Ω′, Ω) I(Ω′) dΩ′ ds.   (2.4)

The scattering phase function can be interpreted as a probability density: p(Ω′, Ω) gives the probability that a photon from direction Ω′ is scattered into direction Ω. Hence, p(Ω′, Ω) I(Ω′) describes the gain in
In order to compute the tota l energy gain due to scattered radiat ion, contributions from al l directions are summed in the integral, where the factor of 4w arises from the spherical integration to ensure that the integral over the phase function is one. Pu t t ing the ext inct ion and source terms into (2.1), the radiat ive transfer equation becomes dl = -peIds + paBds + \u2014 f p(Cl', Cl)I{Cl')dCl'ds. (2.5) In this form, it has a general three-dimensional character, i.e. the direct ion vectors Cl can be expressed in any coordinate system, and the infinitesimal path length ds can be in any direct ion. The most general approach to solve the radiat ive transfer problem numerical ly is the Monte Carlo method (e.g. Bohren and Clo th iaux, 2006). It allows for arbi t rary three-dimensional scenes by simulat ing the propagation of a large number of photons through the medium. The path of each photons is traced from its original source (e.g. the sun) unt i l it leaves the defined scene (e.g. at the top of the atmosphere). Wh i le this method is very flexible and allows for inhomogeneities in the cloud layer, i t is computat ional ly very expensive and currently not suited to compute the T O A radiances for a large number of clouds, as needed for a L U T . For the reasons discussed in Chapter 1, the plane paral lel approximat ion is a good assumption for large parts of marine Sc on the scale of the M O D I S pixels (1 km). In order to keep the forward problem as simple as possible and at the same t ime computat ional ly tractable, I decided to construct the L U T for plane paral lel clouds. In fact, most analyt ic solutions and approximations to the radiat ive transfer equation have been developed for the plane parallel case (Petty, 2006). In a plane paral lel atmosphere, al l relevant parameters (i.e. pa, (3S, T, reff, cf. section 1.3) are only dependent on height. 
Since the important aspect for radiation is how much absorbing, emitting and scattering atmosphere it must traverse, it is convenient to express the vertical coordinate in terms of the optical properties of the atmosphere rather than in geometrical units. The extinction optical depth is defined as the optical thickness of the atmosphere between the top of the atmosphere and a level at height z:¹⁶

    τ(z) = ∫_z^∞ β_e(z′) dz′.    (2.6)

¹⁶ Some authors use optical depth synonymously with optical thickness when referring to the cloud optical thickness; in order to avoid confusion I will use optical depth for the vertical coordinate in the RTE and optical thickness for the cloud property.

At the top of the atmosphere, τ = 0. Furthermore, directions are usually expressed in terms of zenith angle θ (measured from directly overhead; θ = 0 is overhead, and θ = π/2 is the horizon) and azimuth angle φ (measured counterclockwise from a reference point on the horizon). The infinitesimal path length ds in (2.5) can now be expressed as ds = dz/μ, where μ = cos θ. From (2.6) it follows that dτ = −β_e dz = −β_e μ ds. Dividing (2.5) by this new height increment, the radiative transfer equation becomes

    μ dI(μ, φ)/dτ = I(μ, φ) − (β_a/β_e) B(T) − (β_s / 4πβ_e) ∫_{4π} p(Ω′, Ω) I(Ω′) dΩ′.    (2.7)

The problem of computing the TOA radiances "seen" by the satellite sensor now becomes the problem of solving this RTE. A common simplification in order to solve the RTE is to divide the zenith angle into discrete intervals, so-called streams. The simplest of these methods is the two-stream method, which divides the intensity field into only two directions, upwelling and downwelling. Hence, it assumes that the intensity is approximately constant in each hemisphere. The multi-stream code DISORT (Stamnes et al.
, 1988, 2000) generalises this concept to an arbitrarily large number of discrete angles, so that it becomes possible to accurately compute radiances for pixels that are not directly underneath the satellite. The code is well documented and already part of the radiative transfer package libRadtran. It will be used for the forward model computations in this thesis.

2.2.2 Droplet Size Distribution and its Phase Function

A big simplification for liquid water clouds is that the water droplets can be treated as spherical droplets, which makes the computation of the phase function with Mie theory (Mie, 1908) possible. In contrast, the non-spherical character of ice particles makes the computation of the phase function for ice clouds much more difficult (Mayer and Kylling, 2005). Mie theory employs Maxwell's equations to derive a three-dimensional wave equation for electromagnetic radiation, which is solved for boundary conditions at the surface of a sphere (Petty, 2006). For spherical particles, the phase function only depends on the angle Θ between the original direction Ω′ and the scattered direction Ω of a photon. Since cos Θ = Ω′ · Ω, it is common to write the phase function as p(cos Θ).
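Computing a true Mie phase function requires evaluating the Mie series; as an illustrative stand-in, the sketch below uses the analytic Henyey-Greenstein phase function, which is commonly used to mimic the forward-peaked scattering of cloud droplets. The asymmetry parameter g = 0.85 is a typical assumed value for water clouds, not a Mie result from this thesis. The sketch also checks the normalisation property that the phase function averages to one over the sphere.

```python
import numpy as np

# Henyey-Greenstein phase function as an analytic stand-in for a Mie phase
# function of a spherical droplet (g is the asymmetry parameter; 0.85 is an
# assumed, typical value for cloud droplets).
def henyey_greenstein(cos_theta, g=0.85):
    return (1.0 - g**2) / (1.0 + g**2 - 2.0 * g * cos_theta) ** 1.5

# Normalisation check: (1/4pi) * integral of p over the sphere equals 1,
# which for an azimuthally symmetric p reduces to (1/2) * int_{-1}^{1} p dmu.
mu = np.linspace(-1.0, 1.0, 200_001)
p = henyey_greenstein(mu)
norm = 0.5 * np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(mu))   # trapezoidal rule
```

The strong peak of p at cos Θ = 1 reproduces the forward-scattering dominance that real Mie phase functions of cloud droplets exhibit.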

\u2022 (3-1) 3 = 1 This model maps an input variable x to an output variable y, the prediction. The exact form of the model function depends on the order of the polynomial and the values of the parameters Wj. Hence, following Bishop (1995), I wri te y = y(x;w) (where al l parameters wj are grouped into one parameter vector w). A standard way to find the best possible model architecture for a given S is to consider the errors between the model predictions yn and the desired values, the targets, tn for the N data pairs in S. In order to obtain a good fit, the differences yn \u2014 tn should be as smal l as possible, which can be achieved by minimis ing the sum-of-squares error (cf. the cost function (1.22)) E=\\j2{y(xn-w)-n2. (3.2) Z n = l 26http:\/\/www.mathworks.com\/ 61 Chapter 3. Nonlinear Regression with Neural Networks and Uncertainty Estimation ' 0 0.5 1 0 0.5 1 F i g u r e 3.1: Left: a dataset S, from which we would like to infer informat ion about the underlying generator, i.e. the function from which the values were generated. Right : the sine function from which the data were produced. However, i f the shape of the generating function is not known, it is difficult to judge different models based only on their sum-of-squares error (compare Figure 3.2). However, wi l l the model wi th the smallest error E best represent the under ly ing generator of the data? In fact, this often is not the case. The phenomenon is known as overfitting in the l i terature, and describes the problem of f i t t ing the noise rather than the generating function. Its major effect is that models that fit the training data S too well do not generalise well. In this case the model , being presented wi th an input value which is not part of the training dataset, predicts a value far f rom the desired one. Figures 3.1 and 3.2 i l lustrate the problem. 
In the left panel of Figure 3.1, eleven data points are plotted, which were generated by adding random noise to the sine function depicted in the right panel. Figure 3.2 shows two polynomials fitted to the data: in the left panel a cubic polynomial, in the right panel a polynomial of order ten. Since the underlying generator of the data is known, we easily judge that the cubic polynomial is a better representation of the original sine function. The sum-of-squares error, however, is much smaller for the higher-order polynomial, since it perfectly fits the training data. If we knew nothing about the original function, how could we judge which model is best?²⁷

²⁷ Concluding that less complex models (i.e. lower degree) are better is also misleading; consider, for example, a linear polynomial for the given case.

Figure 3.2: Left: a cubic polynomial (solid line) fitted to the data from Figure 3.1, together with the generating sine function (dashed). Right: a polynomial of order ten, also plotted with the generating sine function. It is easily seen by eye that the cubic polynomial represents the sine function better; however, the sum-of-squares error is smaller for the right polynomial. This phenomenon is known as overfitting.

The same problem arises if neural networks are used instead of polynomials. For simplicity of the derivation and implementation, I restrict myself to a two-layer perceptron (i.e. one layer of hidden units) as used by Aires et al. As discussed by, for instance, Bishop (1995), this network is already capable of approximating arbitrary functions. Mathematically, an ANN such as the one depicted in Figure 1.10 computes the output values from the following equations. The inputs x_i are weighted by the parameters of the first layer, then summed
for each hidden neuron and transformed by an activation function.

Under the Gaussian noise model, the likelihood of the data takes the form

    p(D|w) = (1/Z_D) exp( −(1/2) Σ_{n=1}^{N} e_nᵀ · A_in · e_n ),    (3.17)

where I have used e_n = t_n − y_n and replaced A_in = C_in⁻¹. The matrix A_in is called a hyperparameter in the context of Bayesian learning, since it is a parameter that controls the distribution of other parameters (the network weights). I will explain its function in the next paragraph. In analogy to the conventional maximum likelihood approach (Bishop, 1995, Chapter 6), the sum in the exponential is called the data error function

    E_D(w) = (1/2) Σ_{n=1}^{N} e_nᵀ · A_in · e_n.    (3.18)

In order to find an expression for the weights prior p(w), Aires (2004) assumes that the weights follow a Gaussian distribution as well:

    p(w) = (1/Z_W) exp( −(1/2) wᵀ · A_r · w ) = (1/Z_W) exp( −E_W(w) ).    (3.19)

For convenience and following Aires (2004), I have grouped all normalisation factors into a single constant Z_W. The parameter A_r = C_r⁻¹ is the inverse of the covariance matrix C_r of the weights and the second hyperparameter that occurs in the context of Bayesian neural network learning. Analogous to the data error function, the weights error function is defined as

    E_W(w) = (1/2) wᵀ · A_r · w.    (3.20)

In the form given by (3.19), the prior distribution has zero mean, thereby expressing that the weights are expected to be centred around zero. The use of such a prior weights distribution regularises the training process (cf. subsection 3.1.2); smooth network mappings, which usually generalise better than strongly fluctuating functions, can be achieved by favouring small weights (Bishop, 1995, Section 10.1.2). If the weights are large, then E_W will be large and p(w) will be small; for small weights p(w) will be large. Hence, by setting A_r correspondingly, we can prefer small weights over large ones.
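The two error terms can be written down in a few lines. The sketch below evaluates (3.18) and (3.20) for random placeholder errors, an assumed noise covariance C_in, and an isotropic A_r; none of these values come from the thesis.

```python
import numpy as np

# Sketch of the Bayesian error terms (3.18) and (3.20). The errors e_n, the
# noise covariance C_in and the hyperparameter A_r are placeholder values.
rng = np.random.default_rng(3)
N, c, W = 5, 2, 4                      # samples, outputs, number of weights

errors = rng.standard_normal((N, c))   # e_n = t_n - y_n
C_in = np.array([[0.04, 0.01],
                 [0.01, 0.09]])        # assumed noise covariance
A_in = np.linalg.inv(C_in)             # A_in = C_in^-1
w = rng.standard_normal(W)
A_r = 0.5 * np.eye(W)                  # isotropic weights prior (assumed)

E_D = 0.5 * sum(e @ A_in @ e for e in errors)   # data error, eq. (3.18)
E_W = 0.5 * w @ A_r @ w                         # weights error, eq. (3.20)
E = E_D + E_W                                   # total error, cf. (3.22)
```

Both terms are quadratic forms with positive definite matrices, so each is non-negative: large errors or large weights are always penalised, never rewarded.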
Bishop points out that priors other than Gaussian can be considered as well. In this work, however, I will only consider the Aires (2004) approach. Given the expressions for p(D|w) and p(w), the posterior distribution of the weights (3.11) becomes

    p(w|D) = (1/Z) exp( −(1/2) Σ_{n=1}^{N} e_nᵀ · A_in · e_n ) exp( −(1/2) wᵀ · A_r · w ) = (1/Z) exp( −E_D − E_W ),    (3.21)

where I have again used the shorthand notation Z for the normalisation factors. This expression could be maximised; however, since many standard algorithms exist to minimise a function rather than to maximise it (Bishop, 1995, Chapter 7), it is more convenient to minimise the negative logarithm of (3.21). The logarithm removes the exponential, and because of its monotonicity the location of the minimum remains unchanged. Since the normalisation constant Z also does not affect the position of the minimum, we are left with minimising the total error function³⁰

    E(w) = E_D(w) + E_W(w) = (1/2) Σ_{n=1}^{N} e_nᵀ · A_in · e_n + (1/2) wᵀ · A_r · w.    (3.22)

Conventional (i.e. non-Bayesian) network learning can be regarded as a special case of this Bayesian framework; if we have no information about the weights prior and assume it to be a uniform distribution (p(w) = const.), then E_W = 0. If we furthermore assume independent output variables (all off-diagonal elements of A_in are zero) which have the same variance (all diagonal elements of A_in set to σ⁻²), then σ⁻² becomes a constant factor in (3.22) and can be omitted, leaving E to be a sum-of-squares error function as in the example of polynomial curve fitting:

    E(w) = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{c} ( y_k(x_n; w) − t_k^n )².    (3.23)

A problem with maximising the posterior weights distribution is that the hyperparameters A_in and A_r are usually not known beforehand.

³⁰ This is analogous to the principle of maximum likelihood; Bishop (1995, Chapter 2).
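The reduction to the sum-of-squares error can be verified numerically. The sketch below uses random placeholder errors and an assumed noise variance; with a uniform prior (E_W = 0) and an isotropic A_in, the total error (3.22) equals the plain sum-of-squares error (3.23) up to the constant factor 1/σ².

```python
import numpy as np

# Numerical check of the special case above: with E_W = 0 and
# A_in = sigma^-2 * I, the total error (3.22) reduces to the sum-of-squares
# error (3.23) times the constant sigma^-2. All values are illustrative.
rng = np.random.default_rng(4)
N, c = 6, 3
e = rng.standard_normal((N, c))     # errors e_n = y(x_n; w) - t_n
sigma2 = 0.25
A_in = np.eye(c) / sigma2           # isotropic inverse noise covariance

E_bayes = 0.5 * sum(en @ A_in @ en for en in e)   # E_D with E_W = 0
E_sse = 0.5 * np.sum(e ** 2)                      # eq. (3.23)
```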
I will discuss how they can be estimated in section 3.4.

3.2.2 The Meaning of the Hyperparameters

MacKay (1992a, 1995) and Bishop (1995) introduce the Bayesian framework with only scalars β and α as hyperparameters instead of the matrices A_in and A_r. For simplicity, they only use one output, thereby reducing A_in to a single element β. The weights are assumed to be independent, making A_r a diagonal matrix. In the simplest form, the variance of all weights is the same (1/α).³¹ The error function (3.22) then becomes

    E(w) = (β/2) Σ_{n=1}^{N} ( y(x_n; w) − t_n )² + (α/2) Σ_j w_j².    (3.24)

If the intrinsic noise on the target data is small, then β will be large, and small deviations of the network predictions from the target will result in a large "penalty". If the noise is large, β will be small and larger differences will be tolerated. Setting the ratio α/β becomes important under the aspect of the size of the training dataset. While the first term in (3.24) grows with an increasing number N of data points in the dataset, the second term does not. Hence, the ratio of both hyperparameters controls the importance of the weights prior and the dataset size from which it will become insignificant.

The matrix hyperparameters A_in and A_r used by Aires (2004) generalise this concept to interdependent weights and outputs, respectively. Using A_in instead of β allows for more outputs and also incorporates the correlation structure of the errors in the individual variables. (Remember that C_in = A_in⁻¹ does not represent the covariance matrix of the outputs, but of the errors in the outputs, i.e. the noise.) This way, mapping errors for variables with small noise are penalised more strongly than those for targets with larger intrinsic noise. In order to understand A_r, it is important to keep in mind that the prior p(w) is a Gaussian distribution with zero mean. This means that, similar to the scalar hyperparameter case, we expect the weights to be small.
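The role of the ratio α/β can be sketched with the simplest model that is linear in a single weight, y = w x, for which the minimiser of (3.24) has a closed form (this is ordinary ridge regression, used here purely as an illustration; all numbers are invented).

```python
import numpy as np

# Sketch of the alpha/beta trade-off in (3.24) for the one-weight linear
# model y = w * x. The prior term pulls w towards zero, but its influence
# fades as N grows while alpha/beta stays fixed. Numbers are invented.
rng = np.random.default_rng(5)
alpha_over_beta = 10.0
w_true = 2.0

def ridge_weight(N):
    x = rng.random(N)
    t = w_true * x + 0.05 * rng.standard_normal(N)
    # Closed-form minimiser of (beta/2)*sum (w*x_n - t_n)^2 + (alpha/2)*w^2
    return np.sum(x * t) / (np.sum(x ** 2) + alpha_over_beta)

w_small_N = ridge_weight(10)       # prior dominates: strong shrinkage
w_large_N = ridge_weight(10_000)   # data dominate: prior nearly irrelevant
```

With ten data points the prior shrinks the estimate far below the true value; with ten thousand points the data term swamps the fixed prior term, exactly the N-dependence described in the text.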
The major difference to the scalar case is that we assign different inverse variances (diagonal elements of A_r if the weights are independent) to different weights, hence controlling individually how much the weights are penalised for being large. This may be useful since the magnitudes of weights in different layers of a network can have fundamentally different ranges (MacKay, 1995; Aires, 2004).

³¹ MacKay (1995, Section 3.2) notes that the weights of a two-layer perceptron will usually fall into three or more distinct classes, depending on the structure of the inputs. For a good regularisation performance, he suggests the use of different hyperparameters α for these different classes.

3.2.3 Gaussian Approximation to the Posterior Distribution

Although (3.21) is an exact equation for a given noise model and prior, it is useful to approximate the posterior with a Gaussian distribution in order to make it analytically tractable when used in integrals such as (3.8) (Bishop, 1995, Section 10.1.7). This can be obtained by performing a second-order Taylor expansion of the total error function (3.22) around its minimum w* (i.e. the maximum of p(w|D)):

    E(w) = E(w*) + bᵀ · Δw + (1/2) Δwᵀ · H · Δw,    (3.25)

where Δw = w − w*. b denotes the gradient of E at w*,

    b = ∇E(w)|_{w=w*} = 0,    (3.26)

which vanishes because w* marks the minimum of E. H is the Hessian matrix of E (second derivative) with respect to the weights, and it will play an important role in the remaining parts of this thesis. Looking at (3.22), we can see that H is composed of two parts, the data Hessian H_D and the weights hyperparameter A_r:³²

    H = ∇∇E(w)|_{w=w*} = ∇∇E_D(w)|_{w=w*} + A_r    (3.27)
      = H_D + A_r.    (3.28)
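The decomposition H = H_D + A_r can be checked numerically for a model that is linear in its weights, where the data Hessian is exact. The per-sample Jacobians and hyperparameter values below are placeholders, not quantities from the thesis.

```python
import numpy as np

# Sketch of H = H_D + A_r, cf. (3.28), for a weight-linear model
# y_n = Phi_n @ w, for which the data Hessian is exactly
# H_D = sum_n Phi_n^T A_in Phi_n. All matrices are placeholder values.
rng = np.random.default_rng(6)
N, c, W = 8, 2, 4                          # samples, outputs, weights

Phi = rng.standard_normal((N, c, W))       # per-sample Jacobians dy_n/dw
A_in = np.diag([25.0, 4.0])                # inverse noise covariance (assumed)
A_r = 0.1 * np.eye(W)                      # weights hyperparameter (assumed)

H_D = sum(P.T @ A_in @ P for P in Phi)
H = H_D + A_r                              # full Hessian
eigvals = np.linalg.eigvalsh(H)            # must all be positive for a
                                           # valid Gaussian approximation
```

Adding A_r guarantees positive eigenvalues even when H_D alone is singular, which is one practical benefit of the prior term for the Gaussian approximation that follows.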
Using the approximated error function (3.25), (3.21) becomes

    p(w|D) = (1/Z) exp( −E(w*) − (1/2) Δwᵀ · H · Δw ),    (3.29)

and, including the constant exp(−E(w*)) in the normalisation factor Z,

    p(w|D) = (1/Z) exp( −(1/2) Δwᵀ · H · Δw ).    (3.30)

The posterior weights distribution thus becomes a Gaussian with mean w* and covariance matrix H⁻¹. The information contained in this covariance matrix can be immediately used to give error bars on the most probable weights vector w*, an example of which is shown in Figure 3.3.

³² Note that if we use no prior information about the weights (i.e. p(w) = const.), E = E_D and H = H_D. This case corresponds to ANN training without regularisation.

Figure 3.3: Distribution of the first weight of the neural network used for the example given by equations (3.9) and (3.10) after a short training run (grey histogram; the network has not gained much certainty about the weight value yet) and after a longer training run (black histogram; the distribution has narrowed considerably).

3.3 Output Uncertainties

In the previous section, both terms under the integral in (3.8), the noise model of the target variables (3.15) and the Gaussian approximation to the posterior weights distribution (3.30), have been derived. Using these results, the derivation of the distribution of the network outputs is straightforward (Aires et al., 2004a):

    p(t|x, D) = ∫ p(t|x, w) p(w|D) dw    (3.31)
              ∝ ∫ exp( −(1/2) (t − y)ᵀ · A_in · (t − y) ) · exp( −(1/2) Δwᵀ · H · Δw ) dw.    (3.32)

All normalisation factors have been omitted in the notation in (3.32); instead, the ∝ sign has been used to indicate the missing normalisation.
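Before this integral is evaluated, note how the Gaussian posterior (3.30) is already useful on its own: error bars on the most probable weights, as visualised in Figure 3.3, follow from the diagonal of H⁻¹. The sketch below uses an arbitrary symmetric positive definite matrix in place of a real trained network's Hessian.

```python
import numpy as np

# Error bars on the most probable weights from the posterior covariance
# H^-1, cf. (3.30). H here is an arbitrary SPD placeholder, not the Hessian
# of a trained network.
rng = np.random.default_rng(7)
W = 4
M = rng.standard_normal((W, W))
H = M @ M.T + 4.0 * np.eye(W)          # placeholder Hessian, SPD by construction

cov_w = np.linalg.inv(H)               # posterior covariance of the weights
sigma_w = np.sqrt(np.diag(cov_w))      # one-standard-deviation error bars
```

A sharply curved error surface (large Hessian eigenvalues) gives small σ values, matching the narrowing of the histograms in Figure 3.3 as training progresses.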
This integral can be evaluated by assuming that the posterior distribution of the weights, (3.30), is narrow enough to approximate the network function y(x; w) by its linear expansion around the optimal weights value w*:

    y(x; w) = y(x; w*) + Gᵀ · Δw,    (3.33)

where the W × c matrix G represents the gradient of the network function (c is the number of network outputs):

    G = ∇y(x; w)|_{w=w*}.    (3.34)

Hence, (3.32) becomes

    p(t|x, D) ∝ ∫ exp( −(1/2) (t − y(x; w*) − GᵀΔw)ᵀ · A_in · (t − y(x; w*) − GᵀΔw) ) · exp( −(1/2) Δwᵀ · H · Δw ) dw.    (3.35)

Writing e* = t − y(x; w*) for simplicity, expanding the product and rearranging yields

    p(t|x, D) ∝ exp( −(1/2) e*ᵀ · A_in · e* ) · ∫ exp( e*ᵀ · A_in · (GᵀΔw) − (1/2) (GᵀΔw)ᵀ · A_in · (GᵀΔw) − (1/2) Δwᵀ · H · Δw ) dw,    (3.36)

where the first factor is independent of w and has hence been pulled out of the integral. Using the matrix identity (AB)ᵀ = BᵀAᵀ and further rearranging leads to

    p(t|x, D) ∝ exp( −(1/2) e*ᵀ · A_in · e* ) · ∫ exp( (e*ᵀ · A_in · Gᵀ) · Δw − (1/2) Δwᵀ · (G · A_in · Gᵀ + H) · Δw ) dw.    (3.37)

Bishop (1995, Appendix B) shows that Gaussian integrals with a linear term evaluate to

    ∫ exp( Lᵀw − (1/2) wᵀ · A · w ) dw = (2π)^{W/2} |A|^{−1/2} exp( (1/2) Lᵀ · A⁻¹ · L ).    (3.38)

Setting Lᵀ = e*ᵀ · A_in · Gᵀ and A = G · A_in · Gᵀ + H, the integral in (3.37) becomes

    ∫ exp(...) dw = (2π)^{W/2} |G · A_in · Gᵀ + H|^{−1/2} exp( (1/2) e*ᵀ · A_in · Gᵀ · (G · A_in · Gᵀ + H)⁻¹ · G · A_in · e* ).    (3.39)
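The Gaussian integral identity (3.38) can be checked numerically in one dimension, where it reads ∫ exp(Lw − (1/2)Aw²) dw = √(2π/A) exp(L²/(2A)). The scalar values of L and A below are arbitrary test choices.

```python
import numpy as np

# One-dimensional numerical check of the Gaussian integral identity (3.38):
# int exp(L*w - (1/2)*A*w^2) dw = sqrt(2*pi/A) * exp(L^2 / (2*A)).
# L and A are arbitrary scalar test values.
L, A = 0.7, 2.0
w = np.linspace(-12.0, 12.0, 2_000_001)
f = np.exp(L * w - 0.5 * A * w ** 2)
numeric = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(w))   # trapezoidal rule
analytic = np.sqrt(2.0 * np.pi / A) * np.exp(L ** 2 / (2.0 * A))
```

The truncation at |w| = 12 is harmless because the integrand decays as a Gaussian; the quadrature and the closed form agree to high precision.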
Omitting the constant factor and rearranging (3.37) again gives

    p(t|x, D) ∝ exp( −(1/2) e*ᵀ · A_in · e* ) · exp( (1/2) e*ᵀ · A_in · Gᵀ · (G · A_in · Gᵀ + H)⁻¹ · G · A_in · e* )    (3.40)

and further

    p(t|x, D) ∝ exp( −(1/2) (t − y(x; w*))ᵀ · [ A_in − A_in · Gᵀ · (G · A_in · Gᵀ + H)⁻¹ · G · A_in ] · (t − y(x; w*)) ).    (3.41)

Thus, the target variables follow a multivariate Gaussian distribution with mean y(x; w*) and covariance matrix

    C_0 = [ A_in − A_in · Gᵀ · (G · A_in · Gᵀ + H)⁻¹ · G · A_in ]⁻¹.    (3.42)

The expression for the covariance matrix can be simplified by means of the matrix inversion lemma to give

    C_0 = C_in + Gᵀ · H⁻¹ · G,    (3.43)

where C_in = A_in⁻¹ has been used. Equation (3.43) is the main result of this derivation (Aires et al., 2004a). It shows that the uncertainty in the neural network predictions is composed of two parts: the intrinsic noise contained in the training data, represented by its covariance matrix C_in, and a term Gᵀ · H⁻¹ · G that represents the impact of the uncertainty of the posterior weights distribution on the predictions, a result that is expected from (3.8). Unless we have situation-dependent (= input-dependent) information about the intrinsic noise on the data, C_in will be constant. The neural prediction term, however, is situation dependent through the gradient G (the Hessian is not dependent on the input data). We can determine the error bars on a network prediction by taking the standard deviation from the covariance matrix C_0. Figures 3.4, 3.5 and 3.6 illustrate the results that can be obtained for the example defined at the end of section 3.1. Williams et al.
(1995) show that the weights uncertainty term Gᵀ · H⁻¹ · G is approximately proportional to the inverse training data density³³, and note that consequently in high-

³³ They can prove the result for generalised linear regression models (models of the form y(x) = Σ_j w_j φ_j(x), where
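The matrix-identity step from (3.42) to (3.43) can be verified numerically. The sketch below builds random symmetric positive definite placeholders for A_in and H and an arbitrary gradient matrix G, then checks that the two expressions for the output covariance are inverses of one another.

```python
import numpy as np

# Numerical check of the simplification (3.42) -> (3.43): by the matrix
# inversion lemma, the inverse of
#   A_in - A_in G^T (G A_in G^T + H)^-1 G A_in
# equals C_in + G^T H^-1 G with C_in = A_in^-1. All matrices are random
# placeholders (W weights, c outputs).
rng = np.random.default_rng(8)
W, c = 6, 3
G = rng.standard_normal((W, c))            # stand-in for the network gradient
M1 = rng.standard_normal((c, c))
M2 = rng.standard_normal((W, W))
A_in = M1 @ M1.T + np.eye(c)               # inverse noise covariance (SPD)
H = M2 @ M2.T + np.eye(W)                  # Hessian (SPD)

inv_C0 = A_in - A_in @ G.T @ np.linalg.inv(G @ A_in @ G.T + H) @ G @ A_in  # (3.42)
C0 = np.linalg.inv(A_in) + G.T @ np.linalg.inv(H) @ G                      # (3.43)
```

The product inv_C0 @ C0 comes out as the identity to numerical precision, confirming that (3.43) is indeed the inverse of the bracketed matrix in (3.42).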