LOCAL PARAMETRIC POISSON MODELS FOR FISHERIES DATA by IRENE MEI LING YEE B.SC, THE UNIVERSITY OF BRITISH COLUMBIA, 1986 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES Department of Statistics We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA September, 1988 © IRENE MEI LING YEE, 1988 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of S t.O.'fc i S"t j OS The University of British Columbia Vancouver, Canada Date S e p t . , 13 8 ? DE-6 (2/88) ABSTRACT Poisson process is a common model for count data. However, a global Poisson model is inadequate for sparse data such as the marked salmon recovery data that have huge extraneous variations and noise. An empirical Bayes model, which enables information to be aggregated to overcome the lack of information from data in individual cells, is thus developed to handle these data. The method fi t s a local parametric Poisson model to describe the variation at each sampling period and incorporates this approach with a conventional local smoothing technique to remove noise. Finally, the overdispersion relative to the Poisson model is modelled by mixing these locally smoothed, Poisson models in an appropriate way. This method is then applied to the marked salmon data to obtain the overall patterns and the corresponding credibility intervals for the underlying trend in the data. i i TABLE OF CONTENTS ABSTRACT i i TABLE OF CONTENTS i i i LIST OF TABLES V LIST OF FIGURES v i ACKNOWLEDGEMENTS . . . . . . . . . X 1. INTRODUCTION 1 2. BACKGROUND REVIEW 4 2.1 The Benchmark Data Set 4 2.2 Statistical Techniques 6 2.2.1 Negative binomial and mixed Poisson regression ... 6 2.2.2 Using SABL to decompose time series data 8 2.2.3 Time series analysis of a contagious process . . . . 9 2.2.4 Smoothing techniques 11 1. Estimating smooth functions by the local scoring algorithm 11 i i . Bayesian nonparametric smoothing method for local regular process 14 2.2.5 Empirical Bayes(EB) and Hierarchical Bayes(HB) analyses 17 i . Introduction 17 i i . Empirical Bayes(EB) analysis 18 i i i . Hierarchical Bayes(HB) analysis 19 i i i iv. Comparison of EB and HB 21 3. DATA EXPLORATION 22 3.1 introduction 22 3.2 Benchmark Release and Recovery Data Sets 23 3.2.1 Missing values in the two benchmark data subsets . 23 i . Release data 23 11. Recovery data 24 3.2.2 The structure of observed recoveries 24 3.3 Other Related Information and Data Sets 27 4. EMPIRICAL BAYES APPROACH FOR MODEL COUNT DATA 30 4.1 Introduction 30 4.2 Local Parametric Poisson Models with Smoothing Techniques 31 4.3 Inference on the Parameters of Interest 39 5. APPLICATION 45 5.1 Introduction 45 5.2 Problems Encountered when Modelling the Salmon Recovery Data 46 5.2.1 Missing values 46 5.2.2 The edge effect 48 5.3 Fitting the Proposed Models to the Selected Data Sets . . 48 5.4 Conclusion 56 REFERENCES 58 APPENDIX. The Mark Recovery Program (MRP) Database 60 iv L I S T OF T A B L E S TABLE 3.1 The tag codes found in the benchmark data subset 64 3.2 Summary of data fields for the benchmark release data subset . . 65 3.3 Summary l i s t of chinook data fields for the benchmark rollup recovery subset . . . . 66 3.4 Table for computing "Period" from statistical week (MMW) 67 3.5 List of catch region codes, and names 68 3.6 A summary of catch regions with observed recoveries . . . 69 3.7 A sample l i s t of coho release replicates that are classified according to size 70 v L I S T OF FIGURES FIGURE 3.1a Size of chlnook release for tag codes from the benchmark release data subset 71 3.1b Size of coho release for tag codes from the benchmark release data subset 71 3.2 Chinook observed recoveries over the recovery period considered, (tag code: 021827 brood year: 1979 recovery year: 1981 to 1984) 72 3.2a Commercial and sport observed recoveries 72 3.2b Commercial observed recoveries 72 3.2c Sport observed recoveries 72 3.3 Coho observed recoveries over the recovery period considered, (tag code: 081842 brood year: 1979 recovery year: 1981 to 1982) 73 3.3a Commercial observed recoveries 73 3.3b Sport observed recoveries 73 3.4a Plot of cumulative sum of Chinook commercial observed recoveries over time, (tag code: 021827) 74 3.4b Plot of cumulative sum of coho commercial observed recoveries over time, (tag code: 081842) . . . . 74 vl FIGURE 3.5 Pl o t s of chinook commercial observed recoveries over the sampling period, (tag code: 021827 brood year: 1979) . . 75 3.6 Pl o t s of coho commercial observed recoveries over the sampling period, (tag code: 081842 brood year: 1979) . . 76 5.1a Zeta(t) for coho. (Hatchery: Quinsam brood year: 1979 s i z e at release: medium) 77 5.1b Transformed zeta(t) (power =0.25) 77 5.1c Trend for the zeta(t) 77 5.2a Zeta(t) for coho. (Hatchery: Capilano brood year: 1980 s i z e at release: medium) 78 5.2b Trend for the zeta(t) 78 5.3a Zeta(t) for chinook tag code 021827 79 5.3b Trend for the zeta(t) 79 5.4 The trends of zeta's for Quinsam coho from d i f f e r e n t brood years 80 5.5 The trends of zeta's for Capilano coho from d i f f e r e n t brood years 81 5.6 The trends of zeta's for the three chinook tag codes . .82 5.7 The estimated recovery i n t e n s i t y of coho for each of the 4 catch regions. (Hatchery: Quinsam s i z e at release: large) 83 v i i FIGURE 5.8 The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Quinsam size at release: medium) 84 5.9 The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Quinsam size at release: small) . . . . 85 5.10 The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Capilano size at release: large) 86 5.11 The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Capilano size at release: medium) 87 5.12 The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: capilano size at release: small) 88 5.13 The estimated recovery intensity of chinook for each of the 3 trolling regions 89 5.14 Estimated recovery intensities of coho and the corresponding 95% credibility intervals. (Hatchery: Quinsam brood year: 1979 size at release: medium) 90 v i i i FIGURE 5.15 Estimated recovery intensities o£ coho and the corresponding 95% credibility intervals. (Hatchery: Capilano brood year: 1980 size at release: medium) 91 5.16 Estimated recovery intensities of Chinook and the corresponding 95% credibility intervals. (tag code: 021827 brood year: 1979) 92 ix ACKNOWLEDGEMENTS I would like to thank Dr. James Zidek for his guidance, assistance and encouragement in producing this thesis. I would like to express my gratitude to Dr. Mohan Delampady for his helpful comments and careful reading on this work. I am indebted to Dr. Jon Schnute from the Pacific Biological Station for providing the data sets and offering advice related to these data. Also I am grateful to technical staff members at the Station, in particular, Brian Kuhn for producing ancillary data sets when their importance emerged in the course of the analysis. I am very grateful for the comments and suggestions from Dr. Keith Knight and Dr. Jian Liu. I would also like to thank my parents and my brothers for their support and encouragement throughout the years. The financial support of the Natural Science and Engineering Research Council of Canada is gratefully acknowledged. x 1. I N T R O D U C T I O N This thesis develops an empirical Bayes model for marked salmon data collected over time. The method, which employs a hierarchical prior distribution, is used because the data are sparse and the empirical Bayes approach enables information to be aggregated to overcome the lack of information from data in individual cells. The novelty of our approach lies in our use of locally parametric Poisson models and smoothing techniques to obtain estimates of underlying trend in the tagged salmon data. Over years, data on the return of tagged salmon are collected and prepared for the Mark Recovery Program(MRP) database. This database, which is described in detail in the Appendix, consists of the release data on tagged and untagged salmon, the data on individual marked salmon observed when returning from the ocean for spawning, and data on the sampling periods for each of the fishing regions. Since the database contains a vast amount of information, only selected sample data sets, such as the benchmark data sets, are analyzed here. Other data sets are also formatted like the benchmark data because this benchmark is well documented. With these data, various questions can be posed and investigated. One topic of Interest is the relationship between the size of 1 smolts at release and their return rate as measured by observed marked salmon counts. Another is the comparison of marked salmon counts from different brood years and fishing regions. In this study, only the marked recoveries are examined. In addition, only two species of salmon, chinook and coho, are considered. In tackling the two problems of interest described above, we first develop a model for the observed fish counts. The Poisson model is a conventional choice for count data. However, i t will not be adequate for *noisy' data with large sampling variation. Our solution to this problem adopts a local Poisson model to describe the variation at each sampling period. Noise is removed by local smoothing. Finally, the overdispersion relative to the Poisson model is modelled by mixing these locally smoothed, Poisson models in an appropriate way. In Chapter 2, a brief description of the benchmark data sets is given. In addition, some relevant recent studies are summarized for completeness and later comparison or use. We discuss modelling Poisson processes with overdispersion, time series techniques for evaluating long-term trend effects, models for handling contagious or self-inhibiting processes, a local smoothing procedure for obtaining nonlinear regression estimates, and a Bayesian nonparametric smoothing method for modelling locally regular processes. Finally, the empirical Bayes method with hierarchical priors, the basis of this 2 thesis, is reviewed. To obtain insight for further investigation, the data are carefully examined in Chapter 3. The results indicate that some data pooling might be desirable to partially integrate the separate models for the marked recoveries observed in each catch region. The data from commercial fisheries appear to be more reliable and consistent than those from sport fisheries and escapement; thus, only the commercial data are used in the modelling stage of our analysis. The local parametric Poisson models are developed in chapter 4. Smoothing techniques are also developed there for removing noise, and estimating long-term trends in data. The main inferences are estimates of the Poisson intensity functions and the calculation of their corresponding credibility intervals. Finally, in Chapter 5, the proposed models are fitted to selected coho and Chinook data sets. A summary of the estimates and the corresponding credibility intervals is given. The problems of missing values and edge effects are also addressed there. 3 2 . BACKGROUND REVIEW 2 . 1 The Benchmark Data Set The benchmark data set is established in the Pacific Biological Station(PBS), which is a research branch of the Canadian Department of Fisheries and Oceans (DFO) in Nanaimo. The tag codes in this benchmark, which are obtained from the MRP database, form a sample data set for statistical analyses and exchanging data with other agencies. Complete documentation, including the selected formats and information related to the tag codes, is available in the 1986 report *A Canadian MRP Data Benchmark1. The benchmark consists of release data of tagged and associated juvenile salmon, and recovery data of adult marked salmon for selected tag codes. Data related to the sampling periods for recovering marked salmon are obtained from the MRP database directly. The following is a brief description of these data. The benchmark release data contain a code for each release group of juvenile salmon. In particular, the origin, age, and average size of fish, the number of tagged and associated fish, as well as the site and date of the release are included for each group. When adult fish are recovered, not every fish is inspected for 4 mark. Different recovery methods have different sampling and reporting procedures associated with them. These recovery methods are mainly of three types: i . commercial fisheries, i i . sport fisheries, and i i i . escapement — fish that are not captured by any fishery. Each benchmark recovery data record includes the code found on each recovered and tagged salmon, the time, region and method of recovery. Times are usually recorded as year, month and statistical week (about 5 per calendar month). The recovery regions are geographic catch regions divided according to each of the fishing methods: t r o l l , net, and sport. For each marked recovery, there is one record, except for escapement data. Thus, redundant sample information may appear on numerous records of individual tagged fish from the same sample. Fortunately, the fields of each data record are organized in such a way that data for an entire fish sample can easily be obtained. The period for observing marked salmon is different in each catch region; thus, the data on sampling periods (described later in section 3.2) are important for determining whether a record is missing because of no sampling, or simply because there is no recovery during that period. Therefore, together with the recovery data, the distribution of marked recoveries in each catch region and the abundance of each 5 group of tagged salmon can be observed over time. With a l l these data, many related questions can be tackled. 2.2 S t a t i s t i c a l T e c h n i q u e s 2*2.1 Negative binomial and mixed Poisson regression In this subsection, we describe for completeness and comparison, a model which bears some resemblance to that adopted in this thesis. However, i t seems less flexible than ours and so has been set aside during the current investigation. Suppose the response variable Y, a count, and a vector x of explanatory variables are specified. In general, let U I V denote the conditional distribution of U given V, where U and V are any two random variables with a joint distribution. Then a Poisson model for the response is as follows: Y I x is Poisson distributed with mean where p(x) is to be estimated. Very often data exhibit extra-variation or overdispersion relative to the proposed Poisson model. For the count data with no covariates, the negative-binomial distribution is a popular choice for handling the extra-Poisson variation. To handle covariates, this result can be generalized to 6 -1 P(Y = y I x) 1 + a /u(x) a M(x) 1 + a /Li(x) 1 a y! r(a *) y = 0, 1 ,..., (2.1) where a > 0 is called the index or dispersion parameter. The mean and variance of Y given x are E(Y I x) = fj(x) and Var(Y I x) = M(X) + a / J ( X ) 2 . Note that (2.1) yields the Poisson model i f a -» 0. Lawless(1987) studies these negative-binomial models and examines their properties in detail. He reviews the maximum likelihood and moment estimation procedures for estimating the dispersion parameter and regression parameters. In addition, he compares the asymptotic covariance structures, efficiency and robustness of the parameters estimated by these two methods. Since Poisson regression models are very useful, a test of the Poisson hypothesis is often of interest. One method is to test a = 0 within the negative-binomial model. Lawless suggests some useful statistics such as the likelihood-ratio and the standardized dispersion, for testing this hypothesis. He also gives a note of caution that the result of any test depends on the size of the sample and/u(x). 7 2.2.2 U s i n g S A B L t o d e c o m p o s e t i m e s e r i e s d a t a A method which is extensively used in this thesis will now be described. Suppose observations of a time series are taken at equally spaced time-points and the problem of interest is that of determining the long term trend in the deseasonalized series. Nicholls, Heathcote and Cunningham(1987) suggests a method, implemented in a software called SABL, that deseasonalizes the data, possibly after a transformation, without actually modelling the seasonal components. This method decomposes the series into three additive components by means of a minimization criterion and robust data smoothing techniques. The results at time t are the 'trend'(Tt), 'seasonal'(St) and * irregular'(I t) components. Let Y* denote the transformed response at time t. Then YT = T + S + I . t t t t Nicholls, et.al.(£&£d) explain that to construct the additive model, the original data must be transformed so as to minimize the interaction between the trend and the seasonal components. This criterion is reasonable since i f their interaction were not at its minimum, then for example, if the trend were increasing, the seasonal component might also increase. With robust smoothing techniques based on moving medians, the trend and seasonal components can be determined iteratively. These robust estimates will not be affected by outliers because these outliers will be incorporated in the irregular component of the series. (To give more flexibility to users, SABL allows them 8 to choose a p a r t i c u l a r transformation, and to selec t the widths of smoothing windows for the trend and seasonal components.) After the decomposition, the seasonally adjusted s e r i e s i s obtained by simply subtracting o f f the seasonal component to give Y T = T + i . t t t This s e r i e s can be converted back to the o r i g i n a l response scale by applying the inverse transformation to Y*. Once the trend and irre g u l a r components are computed, they may be plotted for v i s u a l inspection so that one can model the trend. The model can then be validated by other time series procedures, such as the Box-Jenkins autoregressive moving average (ARMA) technique. 2.2.3 Time s e r i e s analysis of a contagious process An a l t e r n a t i v e model to ours i s described i n t h i s subsection and one of i t s d e f i c i e n c i e s i s noted. However, i t promises to have some value and w i l l be investigated further i n future work. Holden(1987) developed a model for rare events l i k e the d a i l y a i r c r a f t hijackings i n US between 1968 and 1972, for example. The proposed model i s for stationary processes. I t incorporates the assumption that the contagiousness of an event eventually declines to zero, and that the rate of occurrences l e v e l s off over a long period with occasional, temporary peaks when an occurrence excites the 9 process. With modification, the model can also Incorporate the effects of exogenous time series. The data is potentially applicable to the commercial marked salmon data of a given tag code observed in a catch region since there is a long period of no recovery during the winter season. However, the leveling off phase of epidemics is not fully reflected in our data because of the definition of the yearly sampling 'periods' (See Table 3.4). During the sampling season, there are only occasional recoveries which can be thought of as rare events. But an important similarity is that the observed recoveries are serially correlated. Thus, we conclude that Holden's model might be adapted for modelling the salmon data in spite of its deficiencies with respect to our data. Holden assumes that the observed sequence of daily counts, { Nt>, is a sequence of Poisson variates with means given by some sequence, { X^}, which incorporates the stimulating effects of previous incidents. The linear contagion model for rare events is given by X t u < v + 6 (2.2) where CD (2.3) W. > 0 (i 1,2 ,... ) and t is an integer. For a discrete-time process V (2.2), N conditioned on the history { N , u < t} has a Poisson 10 distribution with mean is required to be less than one to ensure that s E( X ) > 0. The quantity v is the rate at which events are generated by factors other than contagion (assumed constant). The lag structure of W. (1 = 1,2,...) in (2.3) describes the contagiousness of an event i periods after its occurrence. To get a finite number of parameters, simply set Wt to zero after some maximum lag, or assume that Wt has a specified functional form, such as the lag weights associated with a given ARMA process. Then the time-series techniques suggested by Box-Jenkins may be used to obtain the parameter estimates. 2 . 2 . 4 Smoothing techniques i . Estimating smooth functions by the l o c a l scoring algorithm To provide additional perspective on the approach taken in this thesis, a very recently proposed method, similar to our own in spirit, will now be described. For likelihood-based regression models with response variable Y, such as normal linear regression, one usually assumes a linear form in the covariates X ,X ,...,X . A set of n independent realizations of 1 2 p these random variables will be denoted by (y,x ,...,x ), ...,(y ,x ,...,x ). Hastie and Tibshirani(1986) propose the class n n l n p i i of generalized additive models which replace the linear form ^ ft X^ by a sum of smooth functions ^ « (X ). The « (•) are unspecified functions that are estimated using a scatterplot smoother in an iterative procedure called the local scoring algorithm. Any regression procedure can be viewed as a method for estimating E(Y I X ,X ,...,X ). The additive model assumes the following form 1 2 p for this conditional expectation: p E(Y | X,X,...,X ) = s + ) s.(X.), (2.4) 1 2 p O £^ j J i =1 where the smooth s.(-)'s are standardized so that E(s.(X.)) = 0. J j j These functions are estimated one at a time using a scatterplot smoother. A simple class of scatterplot smoother estimates are the local average estimates, s(x ) = Ave < y> " j « N. J j where Ave represents some averaging operator like the mean and N is a neighborhood of x^ (a set of indices of points whose x values are close to x ) . The type of neighborhoods considered in Hastie and Tibshirani's paper are symmetric nearest neighborhoods. Associated with this is the span or window size w, which is the proportion of points contained in each neighborhood. Other more complicated estimates of E(Y I X) can be used, such as kernel or spline smoothers. 12 The span ro is selected to tradeoff between the bias and variability of the estimate. A data-based criterion is derived for selection. Let s'^x^.be the smoother of span w at x, having removed (x^, y j from the sample. Then the c r o s s — v a l i d a t i o n sum of squares (CVSS) is defined by n . 2 CVSS(w) = (1/n) i =1 The optimal span w is that which gives the smallest value of CVSS(w). This criterion effectively weighs bias and variance based on the sample. Note that the E(CVSS(w)) can be shown to be approximately equal to the integrated prediction squared error (PSE) PSE = E(Y - s(X)) 2, and that CVSS is approximately unbiased for the expected prediction squared error. In addition to these desirable properties, CVSS is recommended because i t is computationally efficient for obtaining the optimal value of u>. For the additive model in (2.4), the l o c a l s c o r i n g algorithm estimates the s(*)'s by iteratively smoothing the adjusted dependent variable on X, and i t requires a choice of span which can be estimated using the CVSS(w) in (2.5). Theoretically, this technique can be viewed as an empirical method of maximizing the expected log-likelihood, or equivalently, of minimizing the Kullback-Leibler distance to the true model. It is called l o c a l s c o r i n g because the Fisher scoring update is computed using a local estimate of the score. 13 i i . Bayesian nonparametric smoothing method f o r l o c a l regular process A potential refinement of the approach adopted in this thesis is described by Ma(1986) who improves on the Bayesian nonparametric approach proposed by weerahandi and Zidek(1988) for smoothing stochastic processes. The processes of concern are of the form R = S + N, where S is a smooth function and N is an independent noise process. R is assumed to be observed at a sequence of time-values, t , 1 = 1,..,n, and S is assumed to be locally regular, that is, expandable in a Taylor series to the pth term about t = t . Then an a priori structural model for the data is R = X ft + s, (2.6) where R = (R ,...,R ) is a vector of n observations, i n X = (1,X ,...,X ) is an n by (p+1) matrix, i p where l is a vector of ones and X* = ( [ t - t [t - t ] j/j!), j 1 n+l n n+1 i ft - {ft ,ft ,—,ft ) is vector of coefficients, o 1 p where ft - S(t ) and ft. = D vs(t ) with D as the operator O n+l v. r t + t * of differentiation, and s = 7) + N is the error term, where both 77 and N are vectors; specifically 7? is the remainder of the Taylor expansion of S(t ) and N is the noise with variance a2. One further assumption underlying this approach is that the expansion errors and a l l other a priori uncertainty about R, ft and the smoothing 14 parameter c, related to the variance of the noise s, have a joint multivariate normal distribution. The "smoothing parameter" c controls the degree of smoothness of the estimated R. The main objective of Ma's study( ibidD is a simple method to compute an estimate of c and to obtain R, a smooth estimate of R. His method estimates c, and computes a measure of accuracy for any given R, called the predictive squared error PSE (to be defined later) for each fixed order p which reflects the degree of local regularity of S. The value of p that has the minimum PSE is chosen to be the optimal value. For each fixed p, the parameter c can be estimated by cross-validation which chooses the value of c that minimizes the cross-validated sum of squared (a similar method is described in section 2.2.4.1). Ma(ibid) develops a simpler alternative called the back./it ting method and compares i t to cross-validation. His new method is recommended for obtaining c because i t is easier to implement and computationally more efficient than the cross-validat ion approach. Ma's method may be described as follows. Suppose (2.6), S has p + 1 derivatives. Then the a priori model form R.= ft + ft x. + ft x z +...+ ft + ft y? +1 + e., \. O 1 x. 2 V P + l l . l . in equation is of the (2.7) 15 where R is the tth component of the vector R, e is the ith component of the error vector s, and x.= (t - t )V i ! , i = l,...,p+l. The bachfi t ting method uses the fact that c = S2/ &z, where <52 is the prior variance of £>p+1S/(p+l) !, and a2 is the variance of the noise N. Then, for the order p, c is given by p ^ 6Z c * P sample variance of (ft ) - E L (2.8) <y ((p+D!) Thus, i f equation (2.7) holds, by the same argument, (2.8) can be used to estimate the values of for any j = 0,...,p. However, to use equation (2.8) c p + ± is needed. This parameter can be estimated by cross-val idat ion, or i t can simply be set to zero assuming p is large. Then, the with j < p is estimated byback.fi t ting using (2.8). The value of p is optimal in the sense that the R, estimated by the pth order locally regular f i t , minimizes the predictive squared error (PSE) n —m ^ PSE(j) s l/(n-m) V f R . - R ic.)], j = 0,...,p, / l_ m+x. m+x. 1 J i = 1 where m is the fixed span of observations (m < n), p + 1 is a fixed integer, and R. is the observed value at time t.. For each chosen 16 value of p, the back/itting method is used to compute the for j < p, and the related R's are estimated using generalized least squares procedure. The PSE(j) is computed for each j < p and these values are compared to obtain the optimal j S p which minimizes the PSE. 2 * 2 . 5 E m p i r i c a l B a y e s C E B O a n d H i e r a r c h i c a l B a y e s C H T D a n a l y s e s i * I n t r o d u c t i o n Let X = (X /...,X)* be a vector of n independent random i . n variables which come from a common distribution f with parameter 9. Given a sample of n observations, x = (x^,...,xn)', a l l relevant information about 9 is contained in the observed likelihood function f(x |©). Thus, a 9 with large f(x \9) is more plausibly the true 9 than a 9 with small f(x \9). Likewise, the occurrence of x would be more plausible i f f(x \9) were large. Therefore, as a corollary of the Likelihood Principle, only the observed x should be relevant to conclusions about 9. (More details on the Bayesian analysis can be found in Berger(1985).) Suppose prior knowledge about 9 is given by the distribution U(9). Bayesian analysis combines this prior information and the sample information using Bayes rule into what is called the posterior distribution of 9 given x, from which a l l decisions and inferences may be made. This posterior distribution n(0 |x), which reflects the 17 updated beliefs about 8 after observing the sample is defined as follows. Let the joint density of X and © be h (x ,0) = 7T(0) f ( x 19), (2.9) and the marginal density of X is <*> - J m(x) = | f ( x I©) dF(©). (2.10) © Then, providing m(x) * 0, When no prior information about 9 is available, what is needed in such situations is a noninformative prior, by which is meant a prior that contains no information about ©. A reasonable choice of such a prior is to give equal weights to a l l possible values of ©. A typical noninformative prior density is n(9) = 1 , the uniform density on K. Given the prior, the analysis can proceed in a conventional Bayesian fashion. i i . Empirical BayesCEBO analysis Assume n(e) has a given functional form, and choose the density of this given form which closely matches the prior beliefs. We assume TT e r with r = { TT : TT(0) = g(0 |\) where \ <= A }. (2.12) Here g is a specified function. Then the choice of prior reduces to 18 the choice of \ e A which is usually called a hyperparameter of the prior. The Type II maximum likelihood estimate (ML-II estimate) of n is such that m(x |?T) = sup m(x | T T ) , TC e r where m(x |n) = J* f(x |©) rr(©)d©, and r is the set described in (2.12). The marginal density of x given n, m(x In), reflects the plausibility of n with the data in hand. This function is clearly maximized by choosing n to be concentrated where f(x |©) is maximized (as a function of ©). Thus, i t is reasonable to consider m(x |rr) as a likelihood function for n. Then sup m(x 17T) = sup mCx |g(© |X)!> x <E r x «= A so that the selection is just a maximization over the hyperparameter X (ML-II hyperparameter). i l l * H i e r a r c h i c a l B a y e s C H B O a n a l y s i s It is often convenient to e l i c i t subjective prior information in stages. For two stage priors, for example, the i n i t i a l prior is n (9 |X), where X is a hyperparameter in A. Instead of estimating X, as in the empirical Bayes analysis, X is given a second stage prior distribution 7 r 2(X). This could be a proper prior, but more often i t is an appropriate noninformative prior. Sometimes x is written in the form of X = (X ,X ) for ease of computation. Then 19 TT (X) = TT v V l X 2 ) TT ( X 2 ) . (2.13) 2 2,1 2,2 The posterior distribution of 9 is then expressed in terms of the posterior distribution at various stages of the hierarchical structure. The procedure is as follows. If a l l densities below exist and are non-zero, then Tl(9 |x) = f rr (9 |x,X) rr (X4|x,X2) rr (X2|x) dX. (2.14) J . 1 2,1 2,2 A Here £ ( X \9) TT (© | \ ) nt(9 |x,M = -i-(x-tx) ' ( 2 ' 1 5 ) where m (x IM = f f(x \9)U (9 |X) 69, m (x |X) TT (X^X 2) TT (X* |x,X2) = — - ~ , (2.16) m (x |XZ) 2 m (x IX2) = P m (x IX) TT (X*|X2) dX2, 2 J 1 2,1 m (x |X2) rr (X 2) / • v 2 I \ 2 2 ' 2 n ^ J x l x ) = ' m(x) and m(x) = f m (x |X2) rr (X2) dX2. (2.17) J 2 2.2 20 i v . C o m p a r i s o n o f E B a n d H B First, the advantages of HB over EB are considered. The EB estimates of the hyperparameters obtained from the ML-II approach and then using the fir s t stage prior in a standard Bayesian way albeit with the hyperparameter replaced by its ML-II estimate ignores the inherent uncertainty about the hyperparameter. It leads to unduly optimistic estimates. The HB approach incorporates such uncertainty automatically. Furthermore, with only slight theoretical difficulty, HB can Incorporate actual subjective prior information at the second stage. Even though HB has many advantages over the EB approach, i t is more difficult to apply because of its greater computational complexity among other things. As well, Savage's Principle of Precise Measurement asserts that when the likelihood is "peaked" relative to the prior, the EB method is justifiable as a good approximation to HB. So, in particular, there is a large amount of data available in the marginal likelihood, the ML-II estimate produces a reliable estimate of the hyperparameter without an additional stage of prior modelling and conventional Bayes estimation. 21 3 . D A T A E X P L O R A T I O N 3*1 I n t r o d u c t i o n The structure of each of four sets of data is examined in this section as a preliminary step in model development. These sets Include the benchmark release data, the benchmark rollup recovery data, the rollup recovery data for replicated tag codes, and the sampling period data. These sample data sets involve two species of marked salmon: chinook(124) and coho(115). (The codes in parentheses are species Hart codes used by DFO for species identification.) Note that the rollup recovery data set for replicated tag codes and the sampling period data sets are not part of the benchmark data. These data are obtained directly from the MRP database, and they are formatted like the benchmark so that the documentation for the benchmark can be used for reference. The release, recovery and sampling period data have, respectively, 31, 33 and 15 fields in their records. The fields are of varying lengths depending on the type of information contained. A blank field in a release or recovery data record represents a missing value. Zero in one or both of the catch and sample fields of a sampling period data record indicates no sample is taken for that time period. With this background, we now begin a careful examination of these data sets. 22 3. 2 Benchmark Release and Recovery Data Sets The tag cedes in the benchmark release and recovery data subsets are chosen so as to include a wide range of total recoveries and to have several release years represented. In particular, some codes associated with true scientific replicates are selected for coho. A preliminary examination of these two data sets will indicate the problems and the sort of information available. The results are discussed below. 3*2.1 Missing values i n the two benchmark data subsets i . Release data Each record in the release data set contains the data for one tag code. Since there are only 9 tag codes for Chinook and 27 for coho in this benchmark subset, a l l records are examined together. Table 3.1 gives the 36 tag codes and their corresponding codes in the analysis. Table 3.2 shows that none of these records have any missing values. Also, note that fields lKnumber tagged), 12(adipose only) and 13(undipped) are a l l nonzero which indicates that not a l l released fish are tagged. Thus, for each release group, information such as brood year, production area, size at release and time of release, is important in associating unmarked and marked fish. The size of each release considered here is in thousands, but Figures 3.1a and 3.1b show that the Chinook release, in general, is much larger ln size than the coho. 23 i i . Recovery data For the benchmark rollup recovery data subset, there are more than 2000 records for the combined chinook and coho data. For illustrative purposes, only the chinook records are considered here. Some fields in a record contain information only relevant to one of the three recovery methods: commercial fisheries, sport fisheries and escapement. Thus, we have to know the number of records corresponding to each recovery method before computing the percentage of missing values for each field. The results, as shown in Table 3.3, show that one third of these 33 fields in the recovery data subset are missing more than 75 percent of its values. Most of them are about the physical characteristics of salmon, such as average fork length and percentage of mature females in the sample. Two other important fields with a high percentage of missing values are concerned with recoveries from sport fisheries. As a result, the reliability of the sport data is questionable. 3. 2. 2 The structure of observed recoveries The observed counts over time are of particular interest because the results may reflect the survival rate of tagged salmon and the trend of observed recoveries. These counts may also indicate some relationship between the return rate and factors that affect the survival of salmon. Plots of tag codes 021827(chinook) and 24 081842(coho) are given as illustrations. The results indicate that Chinook start to return a year after their release while the coho return the year they are released. The recorded recovery time has three components: year, month and statistical week. The statistical week (0 to 5) indicates the week within a month. When week = 0, the week is not known. These three components are usually known only for commercial recovery times. For the sport fisheries, the recoveries are mostly monthly data, and the escapement has only yearly bulk data. A term called xperiod', which represents the time of recovery in each year, is defined using the month and statistical week components. This *period' is a number ranging between 1 and 40 representing a one week time period during which salmon fishing may occur. Table 3.4 is a reference table for computing 'period' from month and statistical week in each calendar year. Note that period 40 actually covers three months of the winter season when there is no salmon fishing. Using the definition of 'period', the total observed number of recoveries from sport and commercial fisheries over the entire recovery period, ignoring the catch regions, are now inspected. The plot of chinook, as shown in Figure 3 .2a , shows that the tagged Chinook have a recovery period spanning four years, and that most recoveries are concentrated between May and September in each year, 25 but larger observations are obtained during the second and the third year. For the tagged coho, the sport recoveries are monthly data; therefore, they cannot be combined with the commercial recoveries. The plots in Figure 3.3a(commercial) and 3.3b(sport) indicate that only a few return in later months of the f i r s t recovery year, and most data from the commercial fisheries are observed between June and September in the second recovery year. A point that is not demonstrated by these plots is that most marked recoveries found during the f i r s t year for both species are from escapement. The variation of commercial and sport chinook recoveries can clearly be identified when they are plotted separately and this is done in Figures 3.2b and 3.2c, respectively. Note that the plots for the commercial recovery data are similar to those for sport recovery. Also note from Figure 3.2c the relative paucity of points for sport recovery. The results from the chinook and coho plots indicate that the sport fisheries contribute l i t t l e to the total number of recoveries. As mentioned earlier, the escapement data are yearly bulk data so we cannot compare them to the commercial data. Therefore, we have chosen to concentrate our study on the commercial recoveries with l i t t l e apparent loss of information obtainable from these data. The plots of the cumulative sum of chinook and coho commercial recoveries observed over time are also examined. These plots, as presented in Figures 3.4a and 3.4b , show that the recoveries are a 26 step function of time. The jumps that are of visible size in August in later recovery years are especially obvious for coho. These observations provide more information on the peak season of salmon return. 3 . 3 O t h e r R e l a t e d I n f o r m a t i o n a n d D a t a S e t s We f i r s t investigate the 37 commercial and sport catch regions. Table 3.5 is a l i s t of old and new catch region codes. The old codes are the originals used by DFO, and the new ones are created for ease of programming. The results in Table 3.6 show that only 25 of these regions have recoveries among the approximately thousand records considered, and very few of them have more than 100 recoveries over the entire recovery period. The sampling period data are the sampling schedules for different catch regions. In each year, each catch region has its own sampling scheme which may be different from previous years. For most commercial catch regions, there are usually consistent sampling periods during the fishing season. However, for the sport catch regions, often there are few samples taken each year and this results in a long period of no information (missing values). This is another reason why i t is difficult to analyze the sport data. Using the data on sampling periods and catch regions, the 27 commercial observed recoveries may be examined further. As a result, three commercial catch regions are found to have a substantial number of observations for chinook, and four for coho over the corresponding recovery period. The three regions for chinook are: Northwest Vancouver Island Troll, Northern Troll and North Central Troll. The four regions for coho are: Southwest Vancouver Island Troll, Georgia Strait Troll, South Central Troll and Johnstone Strait Net. For chinook, these results indicate that the length of the recovery period in these three catch regions is about three years long instead of four for each particular tag code. Also, the samples are mostly taken from the beginning of May to the end of October with some missing ones in between, and no sampling was done beyond these time limits. For this reason, there are about 63 sampling periods, which will hereafter be called the adjusted period, in each catch region over the three-year period. The observed recoveries over the adjusted periods are now plotted for each catch region. Note that the blanks between lines on the graphs, as shown in Figure 3.5, indicate missing values and that most observations are obtained at the beginning of the recovery period. For coho, the recovery period is only a year long for each tag code in the four regions mentioned. In addition, the sampling period, which starts in mid-June and ends at the end of October, is shorter than that of chinook. Plots of observations over the 17 periods in 28 Figure 3.6 reveals that few tagged coho are found during sampling. Data is available in the MRP database on replicated tag codes for both coho and Chinook. This enables us to pool information to estimate the trend of recoveries over time. In addition, the model developed for these replicates can be used as a guide in modelling other individual tag code or pseudo-replicates. This set of data consists of statistical replicates, in that each set of replicated tag codes of various time-size combinations come from a single pond representing one treatment. The replicates in each group of three can be further classified according to three relative size groups: large, medium, and small. Table 3.7 gives a sample l i s t of these size groups. Examination of both chinook and coho replicates shows that there are not many observed coho recoveries over the 17 periods in each catch region for each group and there are even fewer observations over the 63 adjusted periods for the chinook groups. The sparse data suggest that pooling observations appropriately may be helpful in obtaining meaningful results. In addition, since there is so much sampling variation and noise in the recovery data, some smoothing techniques may be useful when modelling this data. 29 4 . E M P I R I C A L B A Y E S A P P R O A C H F O R M O D E L L I N G C O U N T D A T A 4 . 1 I n t r o d u c t i o n The investigation o£ the trends in recovery rate for each catch region and species of salmon is the main interest of this study. The results of exploratory analysis indicate that the tagged salmon recovery data have not only noise but also huge variation due to other factors. Further, data on individual tag codes are sparse given that, in fact, there are few recoveries over the entire recovery period. However, replicated tag codes have a somewhat similar pattern of recoveries, sampling scheme and set of recovery regions. Thus, pooling the data for tagged salmon with similar characteristics provide more informative data for further statistical analysis. Some smoothing techniques for removing noise and sampling variation in the data also prove useful. Since no prior information about the distribution of the rates of observing marked salmon are available, we can only use the recovery data to suggest possible ways to estimate these rates and their corresponding confidence intervals. We now develop models that can aggregate data, remove most of the sampling variation and noise from data while reflecting the mechanisms of the underlying process and taking account of the source. 30 The Poisson process is a common model for count data. However, a global Poisson model is inadequate for these data because of the heterogeneity in these data. An empirical Bayes approach to fitting local Poisson models to these counts enables us to incorporate this heterogeneity. 4 . 2 L o c a l P a r a m e t r i c P o i s s o n M o d e l s w i t h S m o o t h i n g T e c h n i q u e s Let NAt) denote the count for the ith process up to time t, where i = 1, ...,Z. Assume that for a fixed period, N (t) has been observed at a sequence of time-intervals, (t^, J = 1, ...,-7-1 a l l equally long. Each of these I processes is serially correlated, and furthermore, a l l of them are interrelated. For a fixed i, let E[N. (t)l = X.(t). This mean function reflects how the arrival rate of NAt) changes over time. Suppose V (t) has a prior distribution which is exponentially distributed with mean ftAt). Then ft represents our prior expectation about the size of V . However, the -values are themselves uncertain so we put a second stage hyperprior on ft to incorporate this uncertainty and also to remedy the possible over-dispersion effect relative to the Poisson distribution. This prior is an inverse exponential distribution with parameter C ( t ) . Note that in choosing our prior distribution, we seek a distribution which is both noninformative and easy to handle. Further, to reflect our a priori view of these ft^s as different in unspecified ways, we postulate that they are exchangeable random 31 variables. More precisely, at every fixed time t, are regarded as a random sample taken from a common inverse exponential distribution so these ft As are independent and identically distributed(i.i.d.). Also, E(X.) = ft. for i = Once the X.'s are given, the NAs are independent, but the I processes are related indirectly through C • Recalling our convention of letting U I V denote the conditional distribution of U given V for any two random variables U and V having a joint distribution, we assume i) N. U)|X. {t),ft. U ) , C U ) isPoissonCX. U P , V V V V where ft At) and C U ) are non-negative, i i ) KU)\ftAt)rC(t) is exponential^ U P , and i i i ) fti (t) IC (t) is inverse exponential C U P . We now develop the model further by specifying the densities of these non-negative real variables. From now on, the time t is assumed to be fixed and is omitted for clarity. Then, for the ith process, we have for X^ , ft , C > 0: t(N= A. IX.,^ 3.,0 = , (4.1) v v. v. v n . ! , -K/ ft. f(X.|f3.,C) = - ~ e \ (4.2) 32 r - C / ft. t{ft. IC) = — ^ ~ e v . (4.3) Note that because of our choice of a noninformative prior for ft^f the moments of HftAC) do no exist. We can rewrite (4.3) as follows f ( ^ i o = - i - ( . -«V*> ( ^ / c r 2 ] . That is, iiftAC) = ~-g(ft./K)r (4.4) where g (^. /C) = exp(-f?./C) iftJK )"2 . We can easily identify C in equation (4.4) as the scale parameter of the density f(^3 IC) and 1/C as the precision parameter. Thus, C indicates the spread of the ft population from which the ft As are picked, and 1/C expresses the degree of equality among the ft As. In particular, the ft. 's are identical when C is zero and the ft. 's are very different when C is large. Now the joint density of V and ft given C is f ( \ , P . K > = f ( \ l ^ , C ) f(^.IC) 33 That Is, -(X.+ C ) / ft = — £ — e 1 1 . (4.5) Then the prior for V given C is 00 f ( \ K ) = I f(X.,^. IC) oV3.. That is, 0 0 -(X. + C ) / ^ f(X. IC) = C J &ft. • (4.6) v •* „ 3 v. t o c \After the change of variable, u = 1/ f?., equation (4.6) becomes -(x.+ C ) u f(X .IC) = C | u e v du. uu I -or f(X. IC) = - - • (4.7) (X.+ C) It is easy to show that £ ( / ? J C ) and f ( X J C ) are both unimodal functions with respect to C and their unique modes are at C = 2fti and C = x^, respectively. Thus, a priori, most of the ft As are concentrated near C/2. We now determine the joint density of N and X given C which is f(rV.,X. IC) = UN. IX. ) f(X. IC), 34 that is, -X n v i e X. _ f(W.,X. IC) = " 7 " " —• (4.8) v * V (x.+ c)2 After integrating out x^ , ve obtain -X n oo v. . v r e X. UN IC) = - - T " ^ dX. . (4.9) J 0 V (x.+ o 2 v Finally, to obtain the conditional posterior joint distribution of X and ft , we require an estimate of C = C ( t ) . The value of C can be estimated by maximizing an expression which involves the integral of (4.9). Then, using (4.1) ,(4.5) ,(4.9) and C, the above mentioned estimate, the posterior is UN IX ft C ) f ( X ft IC) f ^ ^ l V 1 = - ~ — -t(N. IC) That is, -X. n. e 1 X. p -(X. + C ) / ft-i. e n. J ft. 3 f(X.,/?. |W.,C) = 1 1 ,(4.10) v ^ 1 -X. rx 00 l .. l e X. ~ i £ I — dX. V (x. + C ) 2 1 o V where /? and X are non-negative. 35 The integral in the denominator o£ (4.10), which is just (4.9), is essentially a constant once C is estimated. It remains to estimate C using (4.9). For clarity, the subscripts in (4.9) are dropped and we let P = n.!. Then (4.9) becomes f -A. n g(C) = P" 4 e \ c ( \ + C ) " 2 d \ (4.11) o where \ > 0, n and P are positive integers. To prove that g(C) has a unique maximum, we appeal to a lemma of Brewster and Zidek(1974). First, suppose WCxO is a continuous non-negative function whose domain is either (0,oo) or ( - 0 0 , 0 0 ) , and i t is s t r i c t l y bowl-shaped. Thus, W is differentiable almost elsewhere. In addition, assume that, whenever necessary for integrals involving W, the Interchange of integral and derivative is permissible. The lemma i s : Iff is a. d&rxsi ty on CO, ooJ> [ C —oo, OCL5 ] arid < fCxc ±^>: c > O > [</Cx—cJ>: - 0 0 < c < 0 0 > ] has monotone lihelihood ratio property CMLRP), then c -+ J x W'Ccxl fCx> dx £ fWCx+cl fCxl dx j has at mos t one sign change and is stric t ly bowl-shaped Cor monotone}. 36 We f i r s t show that g(C) in (4.11) satisfies the conditions stated in this lemma. From (4.11), we have w(\) = p - 1 &~X x n , where n and P are positive integers and X > 0. It can easily be shown that ¥ ( X ) is strictly bowl-shaped (opening down) and / ( X l O is a scale density which can be written as / ( X I C ) = C (1 + X/C >' • / 0 ( M : ) , where / (y) = / ( y l l ) = (1 + y f 2 dy. In addition, < / ( X K ) > has the MLRP since i f X < X and C < C / then 1 2 1 ^2' / ( \ I C ± ) / ( X t I C 2 ) > 0. Since g(C) satisfies a l l the conditions in the lemma, we conclude that C -»> f x w"cxc:> / cx;> dx has at most one sign change, and C J* wc\> /<rxic^ dx is strictly bowl-shaped (opening down). Thus, g(C) in (4.11) has a unique maximum which implies the same for expression (4.9). Since at each time point t, are assumed to be a random sample from the inverse exponential distribution with parameter C/ we can use a l l the data from these I processes to estimate C ( t ) . An 37 advantage of this approach is that we can aggregate data indirectly instead of simply summing the data without considering their sources, degree of association, and so on. Further, i f observations obtained for each process are from replicated experiments, we can also use them together so that rv(the observed count) is a vector of counts from the replicates. To remove noise and other extraneous variation from the data, we use moving averages with appropriate window sizes (of at least 3). Other more complicated smoothing techniques may be used, but this simple method has the advantage that i t can easily be incorporated into the models we are developing. One assumption underlying a l l smoothing methods is that the counts within a window are homogeneous. So the size of the window is restricted since the data points ln a small neighborhood are expected to be more similar than those far apart. As an example, suppose there are 4 processes (1=4) with 2 replicates each, and a symmetric neighborhood with a 3-point window is used for smoothing. Then, for a given time t and for each i = 1,...,4, N. = (N. ,N. )' is a vector of the 2 replicates. The V. l l V 2 joint density of these two replicates given V is t •*• 1 2 t(N.\\.) = ~TT ~TT f(N |\. ,ft. ). (4.12) \. v. . 1 1 • • vjk x. x. 38 Note that f (N. ,X. \ft. ) = £(N |X. ) f (X. \ft.) and I I I V V X, X. £{N.F\.,{3. K) = f (N. |X. ) f(X. |/3. ) f(/?. IC) I I I l l l l I Thus, £ ( N , X IC) is given by 00 CD [ f (N. ,X. ,ft. IC ) d/3. = £(N. |X. ) f £(X. |/?. ) f(ft. IC ) d/3. J I ' I ' I i t , - J i i i i co f(Af. |X.) f t{\.,ft. IC) 6ft.. i i i i t i That is, £(N.,X.IC) = f(N. |X.) £(X. IC), (4.13) i i i i i which is a product o£ (4.12) and (4.6). 4 . 3 Inference on the Parameters of Interest To estimate C using data from the 4 processes, we require fCN ,N ,N ,N IC3 . First, note that t(N. ,X. ,ft. IC ) = fCAU (X. ),C3 f(X. ,/3. IC). In section 4.2, we assumed that the X^'s are conditionally independent given the ftAs and C, and that the N *s are independent given the X. 's. Then: i 1) the conditional joint distribution of the 4 processes at time t is given by: £ C W1 ' W a ' W a ' W * l ( \ ' ' ? i ) ' ( \ ' ^ 2 ) ' ( X a ^ 3 ) ' ( \ ' / ? 4 ) ' . C 3 T T few. I (X . ,/?.),C3; i = 1 39 2) KNx,Na,Na,Ns{\x,px)A\,fla)A\,fla),i\,PJ\W = "frfCW.|(X.,/?.),(3 f t V ^ J C ) . (4.14) 1 = 1 Thus, to obtain fCN ,N ,N ,N I O , we eliminate the X. 's and p. 's in x 2 3 4 C v (4.14) by integrating them over (0,oo). We fi r s t integrate out the X's and obtain OO OO I - J Ir: fCN. |(X. p.),0 UK ,ft. IC) dX ...dX x, x,, x. x. x, 1 4 o o = If fCN.i^.,c:> fC^.IC) . (4.15) i = 1 Then, we integrate the PAs in (4.15) over its entire domain to get 00 00 \ '"\ ~TT~ fCN I/? ,0 fCp. K ) &P. • • '&P. J J t SS 1 O O = fCN ,N ,N ,N IC3 . (4.16) Using the result in equation (4.13) and the assumptions of conditional independence, we have KN ,N ,N ,N IC3 i ' 2' a' 4 ^ 00 oo = J ...J ~JT~ f(W. IX. ) f(X. IC) dXi o o -X n. 00 00 v . 1, jk • • - - e X. r * » 4 - . t - t - 1 2 e X. ^ _ - I .»I T T [ T T T T 1 — c — 1 *V J J i = i j = t - i k = i n. ., ! J(X.+ C) O O v jk v 40 that is, KN ,N ,N ,N 1' z 7 a' « s - 6 \ . n. oo 00 t. i , • I -I T T - ! — - - — — _t J J i = l P. (\. o o where --d\., (4.17) ( V + C ) ' 1+1 2 and t +1 2 p i = T l TT - i j = t - l k=1 Note that KN.N.N.N given in equation (4.17) is a product x 2 3 4 of t(N. IC) and is similar to the result for the unreplicated case given ln equation (4.9). Though the function in equation (4.9) itself is unimodal, the analogous result in the present case has not been established. The shape of this product integral evaluated in equation (4.17) with numerous different combinations of n.., 's at various values of C's suggest the result is true. Numerical methods are used to compute the estimate of C that maximizes the function in equation (4.17) in spite of the potential risk of multi-modality. The smoothing window used for estimating C is small but the C(t) obtained may s t i l l contain a fair amount of noise and sampling variation. A numerical method called SABL, described in section 2.2.2, can be used to decompose this C-series into trend, seasonal and irregular components. The seasonal and irregular components, which 41 are usually o£ much smaller magnitude when compared with the trend, reflect some of the sampling variation and noise s t i l l in the C-series. The trend is smoother than the C-series, and is closely related to the trends of x\ (t)*s for the 4 processes so we will use i t in this study as a general summary of the data in the domains for which C is computed. With the smoothed version of C ( t ) , we can compute the posterior point estimates for V at each t by maximizing an equation similar to that of (4.9) with respect to V . So for each 1 = 1,...,4 and each time t, -6A. n. X. - X. + + » X. p max fCX. \N. = max , (4.18) X. ' " X. P. (X. + o 2 X. X. X. X. where 1 + 1 2 1+1 2 rx. 1 + + I l ^ ^ n = T y j = " - l k = l J j = t - 1 k = 1 This maximization problem is easier to handle when we take the natural logarithm of (4.18) to obtain a quadratic expression of X^ . The maximum can be explicitly evaluated in this case and is found to be at (n. - 2 - 6C ) + I (n. - 2 - 6C f + 24 rx. ( a* = 1+* , ( 4 . 1 9 ) 12 t + 1 K where rx, = S Y rx.., . x,++ . L, L, i j k The X^ in equation (4.19) is only 1/6 of the actual X^ so i t is 42 multiplied by 6 (2 for the number of replicates times 3 for the window size). Now, let V = 6X* which is comparable in magnitude with the observations of the i t h process. It would be desirable to present a 100(l-a)% credibility interval for \ at each t. Each interval is a subset, C, of the parameter space which gives the probability that V is in C. This amounts to choosing a pair of upper and lower limits, (a,£>), such that F ( C f 4 | fCX. dX. = 1-ot, (4.20) where a > 0 and 6 > 0, and -6\ . n. F(C) = Pf* J e 1 \ ; i + * C <\ + 6"2 dV, X. X. X. n t + i 2 t + 1 2 j= l-1 k=l j = t - l k = l In choosing a credibility Interval for V at time t, the usual approach is to minimize its length. This may be done by using the highest posterior density(HPD) criterion which is to include in the interval only those points with HPD, that is, the 'most likely' values of \.. To evaluate the HPD credibility interval in equation (4.20), we set up the following program: 1) set the lower limit a = 0; 2) create a subroutine which, for a given time t, computes the 43 A maximum of £CX. IN. as a function of X. at, say, X. = 3) set x = (a+m.)/2 and evaluate fcx. \N. , 0 at X. = x; 4) create a subroutine which find the value of 6 such that f(X.= 6 |N.,C) = f(X.= x l*.,C); 5) numerically integrate where a and fe are the values from steps (3) and (4). If this value is approximately (1-ot), then stop. If not and i f : i) the value is larger than (1-a), set m, = x (a remains unchanged) and go back to steps (3) to (5); i i ) the value is smaller than (1-a), set a = x (m remains unchanged) and return to steps (3) to (5). It can happen that the integral P(a,6) evaluated from a - 0 to the point where fCXJA^O has its maximum is very small so that the above procedure cannot be used. In such situations, we abandon the HPD criterion and find o such that P(<z,6) evaluated at (0,6) is approximately equal to (1-a). We can plot a l l these results with lines connecting the point estimates of X^ and their corresponding limits over time for visual inspection. These graphs should roughly indicate the trend of X^s and the size of their possible fluctuations. a 44 5 . A P P L I C A T I O N 5 . 1 I n t r o d u c t i o n in this section, the method developed in the last chapter is applied to the marked salmon recovery data. For coho, data from replicated tag codes observed in 4 catch regions are used. These data are further grouped according to hatchery, brood year and size at release of the tagged smolts. The hatcheries of interest are Quinsam with broods from 1978 and 1979, and Capilano with broods from 1979 and 1980. An example of such a grouping is (Quinsam, 1979, large) which refers to smolts which are raised in the Quinsam hatchery starting in 1979 with a large average size at release. For each grouping, there are 4 similar sets of replicated tag codes. Thus, in each catch region, a total of 12 codes are included for each possible hatchery, brood year and size combination. Details of these tag codes can be found in Appendix F of 'A Canadian MRP Data Benchmark'(1986). Overall, the recovery data set for replicated tag codes of chinook has very few observations. Even with 4 sets of replicates for each hatchery, brood year and size combination, few recoveries are found over the 63 adjusted sampling periods for the 3 catch regions considered. Therefore, over almost the entire recovery period, the estimated intensity, X., is near zero for each catch region. Three 45 tag codes (021827, 021829 and 021661) which have substantial recoveries are thus selected from the benchmark rollup data subset. The method described is then applied to these data. The local Poisson model allows us to aggregate data from related tag codes when there is a paucity of Information for a single tag. The use of this approach on two different sets of data, one from replicated tag codes and the other from individual codes, illustrates the flexibility and advantage of our method. 5* 2 P r o b l e m s E n c o u n t e r e d w h e n M o d e l l i n g t h e S a l m o n R e c o v e r y D a t a 5 * 2 * 1 M i s s i n g v a l u e s In fitting local Poisson models to the recovery data, the C 's are estimated from small neighborhoods where counts are homogeneous. As Indicated in subsection 3.2.2, usually the f i r s t 10 and the last 10 periods for a l l catch regions in each recovery year contain missing values so our approach cannot be used to obtain C -estimates over these long intervals. We thus include at most two periods with missing values to open or to close a sampling interval. As a result, there are 18 periods for coho and 63 adjusted periods for chinook in which the interval between these periods is more or less equally spaced. Suppose there are a few missing values in the recovery data over 46 m periods of interest. An ad hoc method is used to obtain a number for each period with missing values. Assume that a symmetric neighborhood with a 3-point window is used for smoothing. For a fixed catch region, let the 3 counts in a window be denoted by n ^, and n , where t = 1,...,m, n = 0 and n =0. Then, the procedure adopted to f i l l in missing value in that window is as follows: 1) i f n is missing and there are data on the left of n , search t - i ^ . t - i backward for at most 5 steps to obtain the fi r s t data value n ., where j = 2,..., or 6, for When is in the f i r s t position or when there are only missing values before i t , set n = 0. t - i 2) i f n is missing and there are data close to the right of n , search forward for at most 5 steps to find the fi r s t available count n .. where j = 2,..., or 6, for n . when there are only missing values following n , set n = 0. 3) i f n is missing and there are data points just beyond n , search forward for at most 5 steps and set n = n . for the t + i t+j f i r s t j = 2,..., or 6 which is not missing. If n is at the end or there are only missing values after i t , set = 0. In a few cases when there is more than one missing value in a window, we use an appropriate combination of the above three cases to obtain counts for these periods. The decision to search for at most 5 steps is based on the general pattern of missing values in the sampling periods for two of the species of salmon. Having 3 or more 47 consecutive missing values is rare, except for one chinook tag code. S. 2.2 The edge e f f e c t The edge effect, which occurs at the two extremes of the C-series, is a byproduct of the local smoothing technique in our model. To obtain C 's for m periods, we require counts for 2 extra periods, n and n , to obtain estimates for the fi r s t and the last windows. Here, we set n and n to zero since in most cases no more O m + l than one recovery was observed in the fi r s t and last periods. These two estimates must therefore be regarded with caution. 5* 3 F i t t i n g t h e P r o p o s e d Models t o t h e S e l e c t e d Data S e t s For a particular hatchery, brood year and size combination, consider the ith catch region at time t, where t = 1, ...,m. Then N. = C N . f t p is a matrix of counts where k = 1, ...,K denotes the x. \ . k l number of similar sets of replicated tag codes, and 1 = 1,...,L denotes the number of replicates in each of the K sets. For coho, K = 4, L = 3, and t = 16,...,33 (m = 18 actual periods) for each i = 1,...,4. For chinook, K = 1, L = 1, and t = 1,...,63 (m = 63 adjusted periods) for each i = 1,2,3. in each case, the smoothing technique uses a symmetric neighbourhood with a 3-point window, and the sampling periods are the same for the catch regions considered. 48 Now, following equation (4.12), the joint density of these Af k l(t)'s given V for a fixed catch region i at time t is f(N.|X.) = TT "TT TT" KW (3)|X). (5.1) j = t - l k = 1 1 = 1 The C ' s are estimated for each group of replicates by maximizing a modified equation similar to that of (4.17). That is, we maximize the following function for I catch regions with respect to C: f(W , . . . , N I C ) -X. n. , oo v x. j k l = TT f ~n—rr TT — ] ox, ( 5 . 2 ) i. = i J «• j = t - i k = i l = i n. 1 J (X. + C) O v j k l t where t = 1, ...,m. The C-series obtained from each group of codes are f i r s t plotted over the periods for visual inspection. Figures 5.1a is an example for coho of the Quinsam, 1979 and medium combination, Figures 5.2a is for a similar groupings of Capilano coho, and Figure 5.7a is the C-plot for chinook tag code 021827. These plots show for each group, that the C -series is not very smooth and has a different trend for each group. The computer routine SABL, described in subsection 2.2.2, is used to smooth these C ' s . The C ' s associated with tag codes from the Quinsam hatchery are transformed by taking their fourth root. A plot 49 o£ t h e s e results is illustrated in Figures 5.1b. For the remaining tag codes, no transformation is used on the C ' s . Examples of the resulting trends of C's are then given in Figures 5.1c, 5.2b and 5.3b. From now on, we will use the smoothed version of C ( t ) , denoted by C ( t ) , as a general summary of the data. The C curves for a l l brood year and size combinations of Quinsam coho are plotted against time in Figure 5.4. In general, the shapes of these C-plots exhibit two humps connected by a small dip. The second hump, which is between periods 26 and 33, is usually flatter than the f i r s t one, which is between periods 16 and 22. For brood year 1978, the size of the C's for each size-group is about the same. However, for brood year 1979, the medium size-group has larger C ' s , which Indicates that there is more ft dispersion among the 4 different catch regions. For Capilano coho, as shown in Figure 5.5, the shape of the C curves is unimodal, with the mode usually in period 21. The C -plots of the medium size coho for both brood years 1979 and 1980 have a similar magnitude. However, for the other two sizes, there are big differences between the two broods. Note that for each brood year and hatchery, i f the C-series of one size group is fixed, the C curves of each remaining group is only a constant multiple of i t . Also, in general, larger C-values indicate that there is more ft dispersion among different catch regions while smaller C-values reflect that the /Vs are more similar. 50 The C-plots for the 3 chinook tag codes are given in Figure 5.6. Clearly, we can observe that there are 3 distinct recovery intervals, with each interval of 21 periods representing a fishing season. Within each interval, an increasing trend in the C-series is always followed by a decrease, and the largest peak comes near the end. Further, the C's for the f i r s t two intervals are larger in magnitude than the third. When comparing the C's of the three tag codes, those of 021827 are usually smaller even during the peak periods. This result reflects that the (3As are more similar among the 3 catch regions for tag code 021827. With the smoothed version of C-series, V can be obtained for each catch region using the modified equation (4.19) that includes a l l replicates with similar characteristics. We then have (n. - 2 - AC) + [(n. - 2 - AC f + 4A rx. C x * = SI** y_ttt f { 5 3 ) 2A where n t+1 K L i=V-± k=± 1=1 and A = 3 x K x L (3 is the window size). The final result is multiplied by A to obtain Xfc for the i t h region. Figures 5.7 to 5.9 are plots of the estimated recovery intensities, X.'s, for the 3 sizes of Quinsam coho. Note that X(t)'s, 51 that is, the recovery intensity levels for the three trolling regions: Southwest Vancouver Island Troll, Georgia Strait Troll and South Central Troll, peak much earlier than the Johnstone Strait Net. In addition, the recovery intensity for the Johnstone Strait Net is the largest among the 4 catch regions for a l l sizes. For Capilano coho the \-plots are shown in Figures 5.10 to 5.12. Like the Quinsam coho, a l l trolling regions here have their x curves peaking earlier than the Johnstone Strait Net. Nevertheless, the general pattern for each of the four catch regions is different from that of Quinsam coho. First, the recovery intensity is often the largest for Southwest Vancouver Island Troll and the smallest for Johnstone Strait Net. Second, from periods 28 to 33, the intensity is usually zero for a l l catch regions. Now, we turn to examine the relationship between the smolts' size at release and their recovery intensity at maturity, and the difference in recovery rate between different brood years. For the fi r s t case, we have to study the estimated recovery intensities of the 3 sizes: small, medium and large, for each brood year separately. However, no obvious conclusions are suggested since there is so much variation among the 4 catch regions. For the second case, we compare the estimated intensity for different brood years and observe that: 1) for Quinsam coho, the recovery intensities for coho raised in 1979 are larger than those from 1978; and 52 2) for Capilano coho, the recovery Intensities for brood year 1980 are much larger than for 1979. These results Indicate that for coho from a particular hatchery more recoveries are observed from one brood than the other on average. The X-plots for the 3 chinook tag codes in each catch region are presented in Figure 5.13. Here, the trolling regions are: Northwest Vancouver Island Troll, Northern Troll, and North Central Troll. Again, as in the C-plots, the recovery intensities are distributed in 3 intervals, with larger Intensities found in the f i r s t two intervals. Among the 3 catch regions, the Northern Troll has larger recovery rate on average. When comparing the estimated intensities for the 3 tag codes, overall 021661 and 021827 has better recovery rate than 021827. This indicates that on average fewer recoveries are observed with tag code 021827. After computing the X-series, we derived the 95% credibility intervals for the X's using the procedure described in section 4.3. That is, we found the upper and lower limits (a,6) such that -A\. n. G(a,b,C) = F ( C ) " 1 ~ - — - - - dX. = 0.95, (5.4) J P. (x. + c) ' CL l i where A = 3 x K x L (3 is the window size), l + l K L • I 11 n i k i ( j ) ' j=1-1 k=l 1=1 53 t + 1 K L p t • " r r " r r " r r ^ » u . j = I - 1 k = l 1 = 1 and r -AX n. F(C) = Pt o V C (V+ C) • Note that in the case when the total count n and X. are zero, and C is close to zero, the corresponding credibility interval for X is (0,«) where « > 0 is arbitrarily small. In this case, we are sure that there is no recovery in the i t h catch region. The result is obtained to a good approximation by taking the limit of G(a,b,C) in (5.4) as C •* 0. The proof of the result lim G(a,b,C) = 1 is shown in C - 0 the following. First, note that rb J e 1 / (X.+ C) 2 dX. lim G(a,b,C) = Um — --. (5.5) C + 0 C - 0 r°° -X J e / (X.+ C) dX. Letting u = X + C/ (5.5) becomes rb + C f -(u.-C) 2 V L U v / u 2 J da lim — ^ --. (5.6a) du. 54 The expression in (5.6a) is just r - u . J * " 2 e u. du. lim -- --. (5.6b) C O § e " u." du. C -*• 0 r -u. ^ 1 t -2 Since in equation (5.6b), both numerator and denominator are analytical functions, they are differentiable. Further, they are both Infinite in the limit; therefore, we can apply the L'Hospital rule to (5.6b) and obtain g - ( b * C ) (b + C ) " 2 - e C C lim < * 0 - ( e ^ C " 2 ) lim C " 2 • b (b + C ) " 2 + 1 C 0 = 1. Sample plots of these credibility Intervals for coho and chinook and their corresponding \-estimates are portrayed in Figures 5.14 to 5.16. We must be careful in interpreting these credibility intervals. The bands in Figures 5.14 to 5.16 are not simultaneous interval estimates. They merely indicate the pointwise credibility interval for the recovery intensity, M t), at each period t. Note that during some periods when a region has substantial recoveries, the ratio of the \ and the width of theses intervals on average for that region is about two or three times smaller than regions with much smaller 55 recoveries. This indicates that large recoveries are ln general more informative for estimating the recovery intensities and computing the corresponding credibility interval just as Intuition would suggest. 5 . 4 Conclusion To identify an appropriate model for the salmon recovery data, we have examined the raw data in detail. We learned that the salmon recovery data set has missing values, huge extraneous variations and noise. An empirical Bayes model was thus developed to handle these data. The method fitted local parametric, Poisson models to be precise, and incorporated this approach with a conventional smoothing technique to obtain the overall recovery patterns and the corresponding credibility Intervals for the underlying salmon recoveries. The resulting conclusions are: 1 ) there are huge variations in the pattern of recovery intensities for both species of salmon from different brood years; 2) in a l l catch regions, the overall recovery trends for coho from the two hatcheries are different in shape; 3 ) no overall difference is observed for the three sizes of coho; 4) in general the Quinsam coho recovery is usually the largest in the Johnstone Strait Net area while for Capilano this is largest in Southwest Vancouver Island Troll area; and 56 5) among the 3 intervals of recovery for the chinook, the recovery rate is always the largest during the fi r s t two periods. Our method has demonstrated its usefulness in aggregating information from Individual sampling periods to overcome the problem of sparse data. However, we suggest that further investigations be carried out, including 1) extending our method to Include general parametric models, and 2) finding a weighting scheme for smoothing so that the weight given to a point depends on the distance i t is from the 'period' of interest. 57 REFERENCES Becker, R. A. and Chambers, J. M. (1984). S: An Interactive Environment for Data Analysis and Graphics. California: Wadsworth. Berger, J. 0. (1985). Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag. Brewster, J. F. and Zldek, J. V. (1974). Improving on equivariant estimators. Annals of Stat is tics, 2, 21-38. A Canadian MRP Data Benchmark (1986). Fisheries Research Branch, Department of Fisheries and Oceans, Canada. English, K. K. (1985). The contribution of hatchery produced chinook and coho to west coast fisheries: preliminary analysis. Department of Fisheries and Oceans, Canada. Hastle, T. and Tibshirani, R. (1986). Generalized Additive Models. Stat ist ical Science, 1, 297-318. Holden, R. T. (1987). Time Series Analysis of a Contagious Process. J. Amer. Statist. Assoc. , 2, 1019-1026. 58 Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. TK& Canadian Jo-urnal of Statistics, 15, 209-225. Ma, H., Joe, H. and Zidek, J. (1986). A Bayesian Nonparametric Univariate Smoothing Method, with Applications to Acid Rain Data Analysis. University o£ British Columbia Statistics Department Technical Report No. 47. Nicholls, D. F., Heathcote, C. R. and Cunningham, R. B. (1986). The Evaluation of Long Term Trend I. Austral. J. Statist. , 28, 294-313. Salmon Stock Interpretation Unit (1984). The mark recovery program as an assessment tool for the hatchery chinook and coho salmon enhancement program. Fisheries Research Branch, Department of Fisheries and Oceans, Canada. Salmon Stock Interpretation and Assessment Unit (1986). Development of a Pacific coastal database for assessing the contribution of B.C. hatchery chinook and coho salmon production to the Canadian commercial catch. Fisheries Research Branch, Department of Fisheries and Oceans, Canada. Weerahandi, S. and Zidek, J.V. (1988). Bayesian nonparametric smoothers for regular processes. 77ie Canadi an Joxirna I of Statistics , 16, 61-74. 59 A P P E N D I X T h e M a r k R e c o v e r y P r o g r a m C MRP5 D a t a b a s e A two-phase marking program, which includes tagging fish in the hatchery and recovering tags in the fishery, is currently the best known method for providing information to assess the benefits of a r t i f i c i a l l y reared fish. The most popular marking technique for hatchery coho and chinook is the use of coded wire tags (CWTs), which are usually inserted in the snouts of juvenile fish as an indicator of the fish's origin. In addition, the adipose fins of these tagged fish are clipped to allow detection of them later as adults. Two main types of data are thus available from this marking program. One is the hatchery release data and the other is the recovery data. The fi r s t type is of two broad categories: 1) fish released with a CWT and clipped adipose fin (marked), and 2 ) fish released without a CWT (unmarked). The second type of data include fishery data giving recoveries from commercial and sport fishing, and escapement data which are recoveries not from any fishery. In each category of the release data, there are time variables, such as the brood year (the year in which the eggs were spawned) and 60 the date of release. There are also geographic variables which include stock site (location from which eggs are taken), hatchery site and release site. In addition, an average size (gram/fish) and the actual number of releases are given to each released group of fish. The above variables represent only a small subset in the release data. Many others variables related to the survival of fish are also available. Note that in order to calculate the total contribution of hatchery fish to fisheries, each hatchery release group must be represented by a group of marked fish. A method has been developed to determine the marked release group that would represent the release of unmarked juveniles. Thus, the different variables in the release data are also important for associating marked and unmarked releases. In the fishery data, there are information about recoveries of marked fish and the corresponding sampling program. Whenever possible, the recovery time is recorded as year, month and statistical weeks (about 5 per calendar month), and the recovery region corresponding to the fishing method is also recorded. Not every fish caught by a commercial fishery is inspected for tag since this is costly. Thus, a sample is taken for mark inspection and only those marks detected in the sample are recorded as data. The sport recovery data usually come from voluntary returns of salmon heads by fisherman. Thus, these recoveries depend very much on the fishermen's awareness 61 of the clipped adipose fins. The sport catch size is obtained differently from commercial catch, which is estimated based on sales slips. The escapement data include tagged fish escapement to the hatchery or escapement to rivers or lakes near the hatchery. Detailed recovery information on a single tagged fish is difficult to obtain so only yearly bulk data are available. A brief review of the l i f e cycle of chinook and coho would show us how recoveries from a particular brood are distributed over time. A coho's or Chinook's l i f e begins as an egg in the year of spawning, the 'brood year'. Eggs are hatched, and juvenile fish are reared until the 'release year', when they are allowed to leave the hatchery and begin l i f e in the ocean. Eventually, in the 'recovery year', adults are captured by the fishery or escape to their spawning ground. Therefore, i f the brood year is defined to be year 0, then the following table from the Salmon Stock Interpretation and Assessment Units(1986) report is a summary for the majority of coho and chinook. Year chinook coho 0 brood year brood year 1 release year release year (fry) 2 release year (smolts) 3 recovery year (age 3) recovery year (age 3) 4 recovery year (age 4) 5 recovery year (age 5) 62 In some cases, some fish of both species are recovered in year 2 as jacks and chinook are sometimes recovered in years 6 and 7. These release and recovery data of tagged and untagged salmon have been collected over years, but they were unorganized and scattered among different agencies. Thus, i t is difficult to have an analysis on a complete set of data. In 1983, the Canadian Department of Fisheries and Oceans(DFO) decided to construct a Mark Recovery Program(MRP) database on the VAX computer at the Pacific Biological Station(PBS) in Nanaimo. As a result, many interesting questions can now be addressed. When this database is completed, valuable information will be available for assessing the coho and chinook salmon enhancement program. 63 Table 3.1. The tag codes found ln the benchmark data subset. code used original species in plots code Hart code 1 020408 124 (chinook) 2 020409 124 3 021615 124 4 021635 124 5 021661 124 6 021827 124 7 021829 124 8 022202 124 9 022405 124 1 081810 115 (coho) 2 081811 115 3 081812 115 4 081813 115 5 081841 115 6 081842 115 7 081843 115 8 081844 115 9 081845 115 10 082001 115 11 082002 115 12 082003 115 13 082004 115 14 082005 115 15 082006 115 16 082007 115 17 082008 115 18 082009 115 19 082019 115 20 082020 115 21 082021 115 22 082022 115 23 082023 115 24 082024 115 25 082025 115 26 082026 115 27 082027 115 64 Table 3.2. Summary of data fields for the benchmark release data subset. field description number of zeros % of zeros 1 tag code 0 0 2 species Hart code 0 0 3 brood year 0 0 4 run type code 27 75.00 5 day first released 29 80.56 6 month f i r s t released 29 80.56 7 year fi r s t released 29 80.56 8 day last released 0 0 9 month last released 0 0 10 year last released 0 0 11 number tagged 0 0 12 adipose only 0 0 13 undipped 0 0 14 total released 0 0 15 number of days held 5 13.89 16 size code 0 0 17 size at release 0 0 18 percentage tag loss 0 0 19 expected survival 36 100 20 stage code 0 0 21 study type 0 0 22 hatchery code 0 0 23 release site code 0 0 24 stock site code 0 0 25 agency code 0 0 26 co-ordinator code 0 0 27 production area code 0 0 28 province/state code 0 0 29 years with recoveries 0 0 30 release type 0 0 31 total associated release 0 0 65 Tab le 3 . 3 , Summary l i s t o£ chinook data fields for the benchmark rollup recovery subset. (There are four possible recovery methods: t r o l l , net, sport(S), or escapement(E). The letter in square brackets indicates that only one of these methods applies to the field. Those fields without any letters apply to a l l methods. The number and percentage of NAs were calculated according to the number of records corresponding to a particular catching method.) * NAs = missing values field description number of NAs % of NAs 1 tag code 0 0 2 recovery year 0 0 3 gear 0 0 4 catch region 0 0 5 brood year 0 0 6 non-tag indicator 0 0 7 species Hart code 0 0 8 statistical week 0 0 9 average fork length (mm) 1361 98.27 10 average hyperal length (mm) 1385 100 11 average total length (mm) 1385 100 12 average dress weight (kg) 1385 100 13 average round weight (kg) 1385 100 14 % immature female 1385 100 15 % mature female 1385 100 16 % immature male 1385 100 17 % mature male 1385 100 18 % unknown sexual maturity 35 2.53 19 recovery site code [E] 0 out of 35 0 20 recovery site number [E] 0 II 0 21 run type [El 0 II 0 22 sample age type [El 0 II 0 23 number of observed recoveries 0 0 24 catch or escapement 67 4.84 25 sample size 67 4.84 26 sum of known tags 5 0.36 27 number of no-pins 138 9.96 28 number of lost-pins 347 25.05 29 number with no data 614 out of 1350 45.48 30 number of sport marks observed [S] 68 out of 89 76.40 31 est. marks in the est.sport catch [S] 68 II 76.40 32 number observed sport recoveries [S] 0 0 33 sum of escapement non-tags [El 3 out of 35 8.57 66 Table 3.4. Table for computing "Period" from statistical week (MMW). Week Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 1 2 3 4 5 6 7 8 9 10 11 12 1 40 40 1 5 10 14 18 23 27 31 36 40 2 40 40 2 6 11 15 19 24 28 32 37 40 3 40 40 3 7 12 16 20 25 29 33 38 40 4 40 40 4 8 13 17 21 26 30 34 39 40 5 40 0 0 9 0 0 22 0 0 35 0 0 "Period" is a number ranging between 1 and 40 representing a one week time period during which salmon fishing may occur. 67 Table 3.5. List o£ catch region codes, and names. new code old code Name 1 1 NW Vancouver Is. Troll 2 2 SW Vancouver Is. Troll 3 3 Washington/Oregon Troll 4 4 Georgia Strait Troll 5 5 Central Troll 6 6 Northern Troll 7 7 Alaska Troll 8 14 Juan de Fuca Troll 9 15 NW Vane. Is. and Central Troll 10 17 NW Vane. Is. and SW Vane. Is. Troll 11 18 Northern and Central Troll 12 34 Georgia Strait and Central Troll 13 53 Georgia Strait and SW Vane. Is. Troll 14 56 North Central Troll 15 57 South Central Troll 16 8 Fraser Gillnet 17 9 Northern Net 18 10 Georgia Strait Net 19 11 Johnstone Strait Net 20 12 Central Net 21 13 Juan de Fuca Net 22 19 Johnstone Strait and Central Net 23 20 NW Vancouver Is. Net 24 21 SW Vancouver Is. Net 25 33 Northern and Central Net 26 36 Yukon Net 27 37 Juan de Fuca and Georgia Strait Net 28 45 Johnstone Strait and Georgia Strait Net 29 46 Fraser Gillnet and Georgia Strait Net 30 47 Alaska Net 31 48 British Columbia Net 32 58 Fraser Seine Net 33 25 Northern Sport 34 26 Central Sport 35 27 Washington Sport 36 28 Georgia Strait Sport 37 29 Freshwater Sport 38 99 Canadian Escapement 68 Table 3.6. k summary o£ catch regions with observed recoveries. (New catch region codes from Table 3.5 are used here.) Description COHO CHINOOK # Troll catch regions 10 6 catch region codes I, 2,4,5,6,9,10, II, 14,15* 1,4,6,11,14,15 # Net catch regions 9 6 Catch region codes 16,17,18,19*,20, 21,22,23,24 17*,18,19,20,22, 24 # Sport catch regions 4 4 Catch region codes 34,35,36*,37 33,34,35,36* # tags considered 153 69 # records examined 2989 940 * : Catch region with more than a hundred observed recoveries. 69 Table 3.7. A sample l i s t of coho release replicates that are classified according to size. (All the replicates are In groups of three.) release tag release brood total total tote date code site year size re l . obs. % rec. es] 10/5/81 081855 Quinsam 1979 small 7189 123 1.71 87 59 7191 130 1.81 86 62 7192 111 1.54 71 10/5/81 081856 Quinsam 1979 medium 7192 144 2.00 108 58 7210 114 1.58 88 61 7193 115 1.60 86 10/5/81 081857 Quinsam 1979 large 7202 148 2.05 110 60 7192 146 2.03 116 63 7207 134 1.86 99 26/5/81 081910 Capilano 1979 small 4098 135 3.29 49 11 4093 115 2.81 48 12 3845 103 2.68 42 26/5/81 081913 Capilano 1979 medium 3983 123 3.09 51 14 4038 127 3.15 54 42 4208 139 3.30 67 26/5/81 081943 Capilano 1979 large 3516 91 2.59 46 44 3570 102 2.86 57 45 3565 81 2.27 46 70 Figure 3.1 a. Size of chinook release for tag codes from the benchmark release data subset S S 10 Figure 3.1 b. Size of coho release for tag codes from the benchmark release data subset 10 15 20 25 30 tag c o d e 71 Figure 3.2. Chinook observed recoveries over the recovery period considered, (tag code: 021827 brood year: 1979 recovery year: 1981 to 1984) Figure 3.2a. Commercial and sport observed recoveries. 200 Figure 3.2b. Commercial observed recoveries. 200 Figure 3.2c. Sport observed recoveries. 200 periods over the 4 recovery years 72 Figure 3.3. Coho observed recoveries over the recovery period considered, (tag code: 081842 brood year: 1979 recovery year: 1981 to 1982) Figure 3.3a. Commercial observed recoveries. periods over the 2 recovery years Figure 3.3b Sport observed recoveries. months over the 2 recovery years 73 Figure 3.4a. Plot of cumulative sum of chinook commercial observed recoveries over time. 250 periods over the 4 recovery years Figure 3.4b. Plot of cumulative sum of coho commercial observed recoveries over time. 1 0 0 1 2 0 periods over the 2 recovery years Figure 3.5. Plots of chinook commercial observed recoveries over the sampling period. (tag code: 021827 brood year: 1979) observed recoveries from Northwest Vancouver Island Troll 0 20 40 60 80 observed recoveries from North Central Troll 0 20 40 60 80 a d j u s t e d p e r i o d s 75 Figure 3.6. Plots of coho commercial observed recoveries over the sampling period. (tag code: 081842 brood year: 1979) observed recoveries from Southwest Vancouver Island Troll IS 20 25 30 35 observed recoveries from Georgia Strait Troll 15 20 25 30 35 p e r i o d s 76 Figure 5.1a. Zeta(t) for coho. (Hatchery: Quinsam brood year: 1979 size at release: medium) 15 20 25 30 35 Figure 5.1b. Transformed zeta(t) (power = 0.25). 15 20 25 30 35 p e r i o d 77 78 Figure 5.3a. Zeta(t) for chinook tag code: 021827. Figure 5.4. The trends of zeta's for Quinsam coho from different brood years. Size at release: Small 80 Figure 5.5. The trends of zeta's for Capilano coho from different brood years. Size at release: Small p e r i o d 81 Figure 5.6. The trends of zeta's for the three chinook tag codes. 021661 021829 012827 | — 1 1 1 1 I I I 0 10 20 30 40 50 60 70 a d j u s t e d p e r i o d Figure 5.7. The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Quinsam size at release: large) S W Vancouver Island Troll " ' . brood year 1979 brood year 1978 o CM in o 15 20 25 30 35 Georgia Strait Troll in c & c £• CD > O South Central Troll ra -Q E ra Johnstone Strati Net p e r i o d 83 Figure 5.8. The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Quinsam size at release: medium) S W Vancouver Island Troll co brood year 1979 brood year 1978 15 20 25 30 35 Georgia Strait Troll co c CO > o South Central Troll CO •o .a E ro Johnstone Strati Net p e r i o d Figure 5.9. The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Quinsam size at release: small) S W Vancouver Island Troll 8 . • ' b r o o d year 1 9 7 9 b r o o d year 1 9 7 8 ^^^^^^ ~ * * * ' . 15 20 25 30 35 Georg ia Strait Troll tn c S c £< co > o u <u South Central Troll ra T3 JO E ra Johnstone Strati Net p e r i o d 85 Figure 5.10. The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Capilano size at release: large) S W Vancouver Island Troll o , CO LO . CM brood year 1980 O . CM W . brood year 1979 o . \ in • o • 15 20 25 30 35 Georgia Strait Troll o 86 Figure 5.11. The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Capilano size at release: medium) SW Vancouver Island Troll o CM in o 1— in o brood year 1980 brood year 1979 15 20 25 30 35 Georgia Strait Troll co c CD CD > O South Central Troll CO T3 .O E CO Johnstone Strati Net p e r i o d 87 Figure 5.12. The estimated recovery intensity of coho for each of the 4 catch regions. (Hatchery: Capilano size at release: small) S W Vancouver Island Troll brood year 1980 brood year 1979 15 20 25 30 35 Georg ia Strait Troll South Central Troll Johnstone Strati Net period 88 Figure 5.13. The estimated recovery intensity of chinook for each of the 3 trolling regions. NW Vancouver Island Troll a d j u s t e d p e r i o d Figure 5.14. Estimated recovery intensities of coho and the corresponding 95% credibility intervals. (Hatchery: Quinsam brood year: 1979 size at release: medium) SW Vancouver Island Troll 15 20 25 30 35 Georgia Strait Troil 8 • 15 20 25 30 35 Johnstone Strait Net i 1 " — • r 1 1 • 1 1 i 15 20 25 30 35 p e r i o d 90 Figure 5.15. Estimated recovery intensities of coho and the corresponding 95% credibility intervals. (Hatchery: Capilano brood year: 1980 size at release: medium) SW Vancouver Island Troll Georgia Strait Troll South Central Troll Johnstone Strait Net p e r i o d 91 Figure 5.16. Estimated recovery intensities of chinook and the corresponding 95% credibility intervals. (tag code: 021827 brood year: 1979) South Vancouver Island Troll 0 10 20 30 40 50 60 70 a d j u s t e d p e r i o d
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Local parametric poisson models for fisheries data
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Local parametric poisson models for fisheries data Yee, Irene Mei Ling 1988
pdf
Page Metadata
Item Metadata
Title | Local parametric poisson models for fisheries data |
Creator |
Yee, Irene Mei Ling |
Publisher | University of British Columbia |
Date Issued | 1988 |
Description | Poisson process is a common model for count data. However, a global Poisson model is inadequate for sparse data such as the marked salmon recovery data that have huge extraneous variations and noise. An empirical Bayes model, which enables information to be aggregated to overcome the lack of information from data in individual cells, is thus developed to handle these data. The method fits a local parametric Poisson model to describe the variation at each sampling period and incorporates this approach with a conventional local smoothing technique to remove noise. Finally, the overdispersion relative to the Poisson model is modelled by mixing these locally smoothed, Poisson models in an appropriate way. This method is then applied to the marked salmon data to obtain the overall patterns and the corresponding credibility intervals for the underlying trend in the data. |
Subject |
Poisson processes Bayesian statistical decision theory Fisheries -- Statistical methods |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-09-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0097859 |
URI | http://hdl.handle.net/2429/28360 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-UBC_1988_A6_7 Y43.pdf [ 4.03MB ]
- Metadata
- JSON: 831-1.0097859.json
- JSON-LD: 831-1.0097859-ld.json
- RDF/XML (Pretty): 831-1.0097859-rdf.xml
- RDF/JSON: 831-1.0097859-rdf.json
- Turtle: 831-1.0097859-turtle.txt
- N-Triples: 831-1.0097859-rdf-ntriples.txt
- Original Record: 831-1.0097859-source.json
- Full Text
- 831-1.0097859-fulltext.txt
- Citation
- 831-1.0097859.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0097859/manifest