# Open Collections

## BIRS Workshop Lecture Videos

### Pseudo-observations (TG8) Andersen, Per Kragh

#### Description

<small>

Survival analysis is characterized by the need to deal with incomplete observation of outcome variables, most frequently caused by right-censoring, and several now-standard inference procedures have been developed to deal with this. Examples include the Kaplan-Meier estimator for the survival function and partial likelihood for estimating regression coefficients in the proportional hazards (Cox) regression model. During the past 15 years, methods based on pseudo-observations have been studied. Here, the idea is to apply a transformation to the incompletely observed survival data and thereby create a simpler data set to which "standard" techniques (i.e., those for complete data) may be applied, e.g., methods using generalized estimating equations (GEE).

As an example, we can consider the problem of relating the survival probability, $S(t_0)$, at a single time point, $t_0$, to covariates, $z$, based on right-censored survival times $T_i$ and failure indicators $D_i$ for independent observations $i=1,\ldots,n$. Here, $T_i=\min(X_i,U_i)$ for potential complete failure times $X_i$ and right-censoring times $U_i$, and $D_i=I(T_i=X_i)$. Let $\hat{S}(t)$ be the Kaplan-Meier estimator for $S(t)=P(X>t)$. Then the pseudo-observations for the incompletely observed survival indicators $I(X_i>t_0)$, $i=1,\ldots,n$, are $$S_i=n\hat{S}(t_0)-(n-1)\hat{S}^{(-i)}(t_0),\;\;\;i = 1,\ldots,n,$$ where $\hat{S}^{(-i)}(t)$ is the Kaplan-Meier estimator applied to the sample of size $n-1$ with observation $i$ taken out. Regression coefficients in a generalized linear model $$g(S(t_0|z)) =\beta_0 +\beta^{T}z$$ with link function $g$ are then estimated by solving the GEE $$\sum_{i} A(\beta,z_i)\big(S_i-g^{-1}(\beta_0+\beta^{T}z_i)\big)=0,$$ where, typically, $A(\beta,z)=\frac{\partial}{\partial \beta}g^{-1}(\beta_0+\beta^{T}z)$.
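The leave-one-out construction of $S_i$ above can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's implementation: the function names `kaplan_meier_at` and `pseudo_observations` and the toy data are assumptions made for the example, and the brute-force loop recomputes the Kaplan-Meier estimator $n$ times, which is adequate only for small samples.

```python
import numpy as np


def kaplan_meier_at(times, events, t0):
    """Kaplan-Meier estimate of S(t0) = P(X > t0) from right-censored data.

    times  : observed times T_i = min(X_i, U_i)
    events : failure indicators D_i (1 = failure observed, 0 = censored)
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # Product over the distinct observed failure times up to and including t0.
    for t in np.unique(times[events & (times <= t0)]):
        at_risk = np.sum(times >= t)
        failures = np.sum((times == t) & events)
        surv *= 1.0 - failures / at_risk
    return surv


def pseudo_observations(times, events, t0):
    """Pseudo-observations S_i = n*S_hat(t0) - (n-1)*S_hat^(-i)(t0)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    n = len(times)
    full = kaplan_meier_at(times, events, t0)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False  # take observation i out
        loo = kaplan_meier_at(times[keep], events[keep], t0)
        pseudo[i] = n * full - (n - 1) * loo
        keep[i] = True   # put it back for the next iteration
    return pseudo


# Hypothetical toy data: six subjects, two of them censored.
T = [2, 3, 3, 5, 7, 8]
D = [1, 0, 1, 1, 0, 1]
S_i = pseudo_observations(T, D, t0=4)
```

Without censoring the Kaplan-Meier estimator is the empirical survival function, so each $S_i$ reduces to the indicator $I(X_i>t_0)$, which gives a convenient sanity check. The resulting vector can then be used as the response in a GEE or generalized linear model routine with the chosen link function $g$.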

An advantage of this approach is that it applies quite generally to parameters for which no other regression methods are directly available (including the average time spent in a state of a multi-state model). Its disadvantages are that the method is not fully efficient and that, in its simplest form, it requires the distribution of the censoring times, $U$, to be independent of the covariates, $z$. We will review developments in this field since the method was put forward by Andersen, Klein and Rosthoj (2003, Biometrika), with special emphasis on recent results by Overgaard, Parner and Pedersen (2017, Ann. Statist.) and Pavlic, Martinussen and Andersen (2019, Lifetime Data Anal.).

(Presentation 40 min. + Discussion 20 min.)</p>