Open Collections

UBC Faculty Research and Publications

Simulating pedigrees ascertained for multiple disease-affected relatives Nieuwoudt, Christina; Jones, Samantha J.; Brooks-Wilson, Angela; Graham, Jinko Oct 15, 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-13029_2018_Article_69.pdf [ 1.01MB ]
JSON: 52383-1.0372976.json
JSON-LD: 52383-1.0372976-ld.json
RDF/XML (Pretty): 52383-1.0372976-rdf.xml
RDF/JSON: 52383-1.0372976-rdf.json
Turtle: 52383-1.0372976-turtle.txt
N-Triples: 52383-1.0372976-rdf-ntriples.txt
Original Record: 52383-1.0372976-source.json
Full Text

Full Text

Nieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Open AccessSimulating pedigrees ascertained formultiple disease-affected relativesChristina Nieuwoudt1, Samantha J. Jones2,3, Angela Brooks-Wilson2,4 and Jinko Graham1*AbstractBackground: Studies that ascertain families containing multiple relatives affected by disease can be useful foridentification of causal, rare variants from next-generation sequencing data.Results: We present the R package SimRVPedigree, which allows researchers to simulate pedigrees ascertainedon the basis of multiple, affected relatives. By incorporating the ascertainment process in the simulation,SimRVPedigree allows researchers to better understand the within-family patterns of relationship amongstaffected individuals and ages of disease onset.Conclusions: Through simulation, we show that affected members of a family segregating a rare disease varianttend to be more numerous and cluster in relationships more closely than those for sporadic disease. We also showthat the family ascertainment process can lead to apparent anticipation in the age of onset. Finally, we use simulationto gain insight into the limit on the proportion of ascertained families segregating a causal variant. SimRVPedigreeshould be useful to investigators seeking insight into the family-based study design through simulation.Keywords: Pedigree simulation, Family-based study, Rare variant, Ascertainment bias, AnticipationBackgroundFamily-based studies of pedigrees with multiple disease-affected relatives are regaining traction for identificationof rare causal variants. These study designs were popu-lar, for a time, but were eclipsed as genome-wide asso-ciation studies (GWAS) gained popularity [1]. GWAShave been effective for identifying population associationswith common variants genome-wide, but have low powerto study rare variants [2]. Family-based studies requiresmaller sample sizes than their case/control counterpartsand enjoy increased power to detect effects of rare vari-ants [2]. Additionally, family-based studies are able toidentify next-generation sequencing (NGS) errors by uti-lizing familial relationships to identify unlikely calls [2].Improvements in the cost and technology associated withNGS have facilitated a revival in family-based studies [1].Family-based analyses coupled with NGS can uncoverrare variants that are undetected by GWAS [2]. For exam-ple, analysis of whole exome sequence data was used to*Correspondence: jgraham@sfu.ca1Department of Statistics and Actuarial Science, Simon Fraser University, 8888University Drive, V5A 1S6, Burnaby, CanadaFull list of author information is available at the end of the articleidentify rare variants associated with non-syndromic oralclefts in large pedigrees ascertained to contain at leasttwo affected relatives [3], to prioritize rare variants inlarge multi-generational pedigrees ascertained for mul-tiple relatives diagnosed with bipolar disorder [4], andto identify rare variants segregating in families that con-tained at least two siblings with an autism spectrumdisorder [5].Unfortunately, family-based studies do not comewithout complication; for example, identifying a suit-able number of pedigrees with desired criteria may betime consuming, sometimes requiring years to amass.In these circumstances, collecting new data to eval-uate methodology or replicate findings is impractical.To address this challenge we have created an R pack-age, entitled SimRVPedigree, which simulates pedi-grees ascertained to contain a minimum number ofdisease-affected relatives. SimRVPedigree models theaffected individuals in an ascertained pedigree as theresult of (1) sporadic disease or (2) a single, rare, disease-variant segregating in the pedigree. At the individuallevel, SimRVPedigree models competing age-specific© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (, which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to theCreative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.Nieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 2 of 11life events contingent on rare-variant status, disease sta-tus, and age through user supplied age-specific inci-dence rates of disease, and age-specific hazard rates fordeath. In a recursive manner, life events simulated atthe individual level build and shape simulated pedigrees.Upon specification of user-defined study characteristics,SimRVPedigree will simulate pedigrees ascertained tocontain multiple affected relatives according the specifiedcriteria. To our knowledge, this is the only program toincorporate a competing risk model and account for theascertainment process.MethodsGiven a sample of pedigrees we allow for the possibil-ity that different families may segregate different rarevariants, but assume that within a family genetic casesare due to a shared rare variant that increases diseasesusceptibility. We allow users to choose between twomethods of rare variant introduction to the pedigree.One option is to assume that all ascertained pedigreeswith genetic cases are segregating a variant that is rareenough to have been introduced by exactly one founder[6]. Alternatively, we allow users to simulate the start-ing founder’s rare variant status with probability equal tothe carrier probability of all causal variants consideredas a group. When this option is selected some ascer-tained pedigrees may not segregate a causal variant. Ineither scenario, we assume that a causal variant is intro-duced by at most one founder and, when it is intro-duced, it is transmitted from parent to offspring accordingto Mendel’s laws.Starting at birth and ending with death, we simulatelife events for the starting founder, censoring any eventsthat occur after the last year of the study. We repeat thisprocess, recursively, for all descendants of the founderallowing life events at the individual level to shape suc-cessive generations of the pedigree. To accomplish this,we condition on an individual’s age, rare-variant statusand disease status, and simulate waiting times to threecompeting life events: reproduction (i.e. producing off-spring), disease onset, and death. We select the event withthe shortest waiting time, update the individual’s age bythis waiting time, record the event type, and repeat thisprocess from the new age until the individual dies or theend of the study is reached.Simulating life eventsTo simulate life events SimRVPedigree users arerequired to specify:hazardDF, a data frame of age-specific hazard rates,where column one represents the age-specific hazardrates for the disease in the general population,column two represents the age-specific hazard ratesfor death in the unaffected population, and columnthree represents the age-specific hazard rates fordeath in the affected population, andpartition, a discrete partition of ages over which toapply hazardDF.Specifically, partition is a vector of ages, startingat age 0, such that hazardDF[k,] are the age-specifichazard rates for an individual whose age is contained in[partition[k], partition[k+1]). At the user’s dis-cretion, if the disease of interest is rare, the age-specifichazard rates for death in the unaffected population maybe approximated by age-specific hazard rates for death inthe general population. In the following subsections, wedetail the procedures to simulate waiting times to onset,death, and reproductive events.Disease onsetWe model disease onset using a non-homogeneousPoisson process (e.g. [7]), conditioned on an individual’scurrent age, t′, rare-variant status, x, and disease status, δ.In this context, x = 1 if the individual is a carrier of therare variant, and 0 otherwise; and δ = 1 if the individualhas developed disease by age t′, and 0 otherwise. Define κto be the relative-risk of disease for individuals who haveinherited the causal variant and λo(t) to be the baselineage-specific hazard rate of disease for an individual agedt years. That is, λo(t) is the age-specific hazard rate forindividuals who do not carry a causal variant, i.e. sporadiccases. Let λonset(t|x) denote the age-specific hazard rateof disease for an individual aged t years conditioned onrare-variant status such thatλonset(t|x) ={λo(t), if x = 0;κ · λo(t), if x = 1,for κ ≥ 1.If pc is the carrier probability of all causal variantsconsidered as a group, then we can express the populationage-specific hazard rate of disease, λonset(t), asλonset(t) = (1 − pc)λo(t) + κ · pc · λo(t).Users are expected to provide λonset(t); given pc and κ weinfer λo(t) as λo(t) = λonset(t)1+pc(κ−1) . We note that this methodfor calculating λo(t) has implications on the comparabilityof non-genetic individuals from studies simulated undervery different κ values. For example, when pc is constant,we see that for κ1 << κ2, the age-specific hazard ratefor non-carrier individuals under genetic relative-risk κ1will be much greater than that of non-carrier individualsunder genetic relative-risk κ2. As pc increases this effect isvisible more quickly for differing κ values.We note that not all individuals develop the disease;however, those who do are only permitted develop the dis-ease once in our model. Individuals who have developeddisease (i.e. δ = 1) do not develop disease again, but canNieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 3 of 11reproduce or die. When δ = 0, we use intensity func-tion λonset(t|x) conditioned on rare-variant status, x, tosimulate the waiting time to disease onset given currentage, t′. To clarify, if we denote the waiting time to diseaseonset by Wonset , and condition on the current age, t′, thecumulative distribution function ofWonset is given byP(Wonset < w|T= t′, x)=1−exp{−∫ t′+wt′λonset(u|x)du}.DeathWe model death using a non-homogeneous Poisson pro-cess, conditioned on an individual’s current age, t′, anddisease status, δ. Define δ as in the previous discus-sion, and let λu(t) and λa(t) denote the age-specifichazard rates of death, for individuals aged t years, inthe unaffected population and the affected population,respectively. We use intensity function λdeath(t|δ) condi-tioned on disease status δ to simulate the waiting time todeath given the current age, t′. In this context, λdeath(t|δ)represents the age-specific hazard rate of death for anindividual aged t years conditioned on their disease status,which we model asλdeath(t|δ) ={λu(t), if δ = 0;λa(t), if δ = 1.We do not model disease remission; after an individualhas developed disease we use the age-specific hazard ratesfor death in the affected population to model their waitingtime to death.ReproductionTo accommodate extra-Poisson variability in the numberof human offspring, we use a negative-binomial modelwith number of trials n ≈ 2 and success probability p ≈4/7, as proposed by [8]. We adopt this negative-binomialmodel of offspring number in SimRVPedigree. Weemploy an equivalent Poisson-Gamma mixture model [9]to obtain the negative-binomial offspring number and tosimulate the waiting time to reproduction.Letwt′ denote the waiting time to reproduction given anindividual’s current age t′, and assume that simulated sub-jects are able to reproduce from age a1 to age a2. Tomimicobserved data on first-born live births (see Additionalfile 1: Section 6), we simulate a1 and a2 as follows: samplea1 uniformly from ages 16 to 27, and a2 − a1 uniformlyfrom 10 to 18 years. At birth we simulate an individual’slifetime birthrate by taking a random draw, γ , from agamma distribution with shape 2 and scale 4/3. Individu-als who draw large γ will have high birth rates and manychildren, whereas individuals who draw small γ will havelow birth rates and few or no children.For some diseases, users may want to reduce the birthrate after disease onset; we allow users to achieve thisthrough an additional parameter f, assumed to be between0 and 1, which is used to rescale the birth rate after dis-ease onset. By default, f = 1 so that the birth rate remainsunchanged after disease onset. Given an individual’s birthrate, current age, and disease status, δ, we obtain theirwaiting time to reproduction as follows:1 Simulate the unconditional waiting time toreproduction by drawing w from an exponentialdistribution with rate γ f δ+γ (1−δ)(a2−a1) .2 Condition on the current age, t′, to obtain theconditional waiting time to reproduction:wt′ =⎧⎨⎩a1 + w − t′, if t′ < a1and (a1 + w) < a2;t′ + w, if t′ ∈[a1, a2) and (t′ + w) < a2;∞, otherwise.Pedigree simulationTo simulate all life events for a subject, starting at birth wegenerate waiting times to disease onset, death, and repro-duction, as outlined previously and choose the event withthe shortest waiting time to be the next life event. Next,we add the waiting time associated with the earliest eventto the current age and either record the year of diseaseonset or death, or add a new offspring to the pedigree. Werepeat this process from the updated age, recursively, untilthe individual dies or the study stop year is reached. Thisalgorithm details the full life event procedure at the indi-vidual level. Complete details are available in Additionalfile 1.To simulate a full pedigree, we recursively apply thealgorithm described above, as follows:• Step 1: Simulate life events for the first founder givenrare-variant status.• Step 2: Simulate life events for any new offspringgiven rare-variant status as outlined above.• Step 3: Repeat step 2 until life events have beensimulated for all offspring.Ascertainment featuresTheprimary functionofSimRVPedigree,sim_RVped(),simulates pedigrees ascertained formultiple disease-affectedrelatives.We allow users to specify family-based study fea-tures through the following arguments of sim_RVped():num_affected: the minimum number of disease-affected relatives required for ascertainment of thepedigree.ascertain_span: the start and stop year for pedigreeascertainment.stop_year: the last year of follow-up for the pedigree.recall_probs: the proband’s recall probabilities forrelatives of varying degree.Nieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 4 of 11In this context, the proband is the affected family mem-ber first in contact with the study, presumably at the timeof disease onset.The ascertainment span represents the time span,in years, over which the family could be ascertainedthrough the proband. For example, suppose that a par-ticular study ascertained families, containing at least twoaffected members, from 2000 to 2010. In this scenario,the user would set ascertain_span= c(2000,2010) and num_affected= 2. The sim_RVped()function would then simulate families such that theproband developed disease between 2000 and 2010 andwas at least the second family member to developdisease.The study stop year represents the last year data arecollected for ascertained families. Consider the previ-ous study, and suppose that data were collected until2016. To achieve this in simulation, users would sim-ply specify stop_year = 2016, which would resultin sim_RVped() simulating life events for ascertainedfamilies until the year 2016.Often researchers involved in family-based studies areconfronted by incomplete ascertainment of a proband’srelatives, which could occur if the proband cannot providea complete family history, or if he or she does not supportcontact of specific relatives. SimRVPedigree allowsusers to mimic this scenario, in simulation, by trim-ming relatives from a pedigree based on the proband’sprobability of recalling them. To specify a proband’s recallprobabilities for his or her relatives, i.e. recall_probs,the user provides a list of length q, such asp = (p1, p2, ..., pq). In this context, pi is used to denote theproband’s recall probability for a relative of degree i wheni = 1, 2, ..., q − 1, or the proband’s recall probability fora relative of degree q or greater when i = q. To simulatefully ascertained families, we set recall_probs =c(1), which corresponds to p = 1. Alternatively, ifunspecified, recall_probs is set to four times thekinship coefficient, e.g. [10]. This default value retains theproband’s first-degree relatives (i.e. parents, siblings, andoffspring) with probability 1, second-degree relatives (i.e.grandparents, grandchildren, aunts, uncles, nieces, andnephews) with probability 0.5, third-degree relatives withprobability 0.25, etc.In the event that a trimmed relative is required to fullyspecify the relationships among recalled family mem-bers, we include the trimmed relative, mark them asunavailable, and remove (i.e. mark as missing) any of theirrelevant information. That is, disease status, relative-riskof disease, and event years are all missing for any rela-tives not recalled by the proband. Since disease-affectedrelatives may be trimmed from a pedigree, trimmed pedi-grees may contain fewer than num_affected disease-affected relatives. When this occurs, sim_RVped()will discard the pedigree and simulate another until allconditions specified by the user are met.ResultsSettingsIn the following applications, we use SimRVPedigreein conjunction with R [11] to investigate the effect of therelative-risk of disease in genetic cases, κ , on ascertainedpedigrees. We first investigate the effect of κ on the num-ber of affected relatives per family, and on the degreeof familial clustering among affected relatives. Next, weinvestigate how ages of onset from more recent genera-tions tend to be younger than those from older gener-ations in the ascertained pedigrees [12], a phenomenonwhich we refer to as apparent anticipation. Lastly, wedemonstrate how SimRVPedigree may be used to esti-mate the proportion of families that segregate the causalvariant in a sample of ascertained pedigrees.To study pedigrees ascertained to contain multiple rel-atives affected by a lymphoid cancer, we simulated studysamples according to the following criteria.1 Each study sample contained a total of one thousandpedigrees, ascertained from the year 2000 to the year2015.2 Each pedigree contained at least two relativesaffected by lymphoid cancer.3 The birth year of the founder who introduced therare variant to the pedigree was distributeduniformly from 1900 to 1980.4 For each κ considered, the carrier probability, pc, forall causal variants with genetic-relative risk κ wasassumed to be 0.002.5 Sporadic cases, i.e. affected individuals who did notinherit the rare variant, develop lymphoid canceraccording to the baseline, age-specific hazard rate oflymphoid cancer. The population, age-specific hazardrate of lymphoid cancer were estimated through theSurveillance, Epidemiology, and End Results (SEER)Program [13, 14], and are displayed in Fig. 1.6 Genetic cases, i.e. affected individuals who did inheritthe rare variant, develop lymphoid cancer at κ timesthe baseline, age-specific hazard rate of lymphoidcancer. We considered κ ∈ (1, 10, 20) and simulatedone thousand pedigrees for each κ considered.7 Since lymphoid cancer accounts for a relatively smallproportion of all deaths, the age-specific hazard ratefor death in the unaffected population wasapproximated by that of the general population.Individuals who do not develop lymphoid cancer dieaccording to the age-specific hazard rate of death inthe general population [15], while individuals whohave developed lymphoid cancer die according to theage-specific hazard rate of death in the affectedNieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 5 of 11Fig. 1 Hazard Rates. (Left) Baseline, age-specific hazard rates of lymphoid cancer estimated by SEER [13, 14]. SEER provides age-specific incidence andmorality data, in yearly increments, up to age 84 years, and then aggregates data for ages of 85 years or greater. We considered the SEER reportedincidence rate for individuals of age 85 or greater to be the constant hazard rate of disease for individuals between the ages of 85 to 100. (Right)Age-specific hazard rates of death for the general population [15] and for the disease-affected population [13, 16, 17]. To promote continuity in theage-specific hazard rate of death for the affected population, we assume that it is twice that of the unaffected population after age 84 years. Afterage 84 years, the SEER data do not allow for the age-specific hazard rates of death in the affected population to be estimated in yearly incrementspopulation [13, 16, 17]. Figure 1 displays theage-specific hazard rates of death for these twogroups.8 The proband’s probabilities for recalling relativeswere set to recall_probs = (1, 1, 1,0.5, 0.125), so that all first, second, andthird-degree relatives of the proband were recalledwith probability 1, all fourth-degree relatives of theproband were recalled with probability 0.5, and allother relatives of the proband were recalled withprobability 0.125.9 The stop year of the study was set to 2017.ExampleWedemonstrate how to simulate a single pedigree accord-ing to the settings described previously.After installing SimRVPedigree, we load the packagein R using the library function.R> library(SimRVPedigree)Suppose that we can obtain age-specific hazard ratesin yearly increments starting at age 0 and ending withage 100. In this case, we define the partition of ages overwhich to apply the age-specific hazards rates using theseq function.R> age_part <- seq(0, 100, by = 1)Next, assume that LC_Hazards is a data frame whosecolumns provide age-specific hazard rates, in yearlyincrements, from age 0 to age 100, as indicated below.LC_Hazards[, 1] Age-specific hazard rates of lym-phoid cancer in the general population.LC_Hazards[, 2] Age-specific hazard rates of deathfor individuals in the general population.LC_Hazards[, 3] Age-specific hazard rates of deathfor individuals who have lymphoid cancer.We create a new object of class hazard from the par-tition of ages, age_part, and the data frame of hazardrates, LC_Hazards, by executing the following com-mand.R> haz_mat <- hazard(partition = age_part,hazardDF = LC_Hazards)To simulate a single pedigree with family identificationnumber 1 and a genetic relative-risk of 10, assuming thatthe eldest founder introduces the variant, and accordingto the settings described previously we use the followingcommand.R> ex_ped <- sim_RVped(hazard_rates =haz_mat, GRR = 10, FamID = 1,num_affected = 2, RVfounder = TRUE,ascertain_span = c(2000, 2015),founder_byears = c(1900, 1980),recall_probs = (1, 1, 1, 0.5, 0.125),stop_year = 2017)To view a description of the contents of ex_ped we usethe summary command.R> summary(ex_ped)length Class Modefull_ped 15 ped listascertained_ped 15 ped listUpon executing the command above, we see thatex_ped is a list containing two objects of class ped.The first is named full_ped and represents the orig-inal pedigree, prior to proband selection and trimming.The second is named ascertained_ped and repre-sents the ascertained pedigree; this data frame includes anadditional variable to identify the proband. In this applica-tion, we are interested in families that were ascertained forstudy; hence, we focus attention on ascertained_ped.To simplify the following examples, we store the ascer-tained pedigree as study_ped.Nieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 6 of 11R> study_ped <- ex_ped$ascertained_pedTo plot the ascertained pedigree we simply supply thepedigree to the plot function.R> plot(study_ped)The plotted pedigree is displayed in Fig. 2.To obtain summary information for study_ped wesupply it to summary.R> summary(study_ped)$family infoFamID total_rels num_affected1 18 4ave_onset_age ave_IBD asc_year seg_RV52.5 0.333 2002 TRUE$affected_infoFamID ID birthYr onsetYr1 1 1911 19651 3 1933 20141 9 1966 20021 10 1972 2011deathYr RR proband RVstatus1968 10 FALSE 1NA 1 FALSE 0NA 10 TRUE 1NA 10 FALSE 1As displayed above, when the argument of summaryis an object of class ped, summary returns two dataframes named family_info and affected_info.The family_info data frame catalogues the informa-tion for the entire family. For each family supplied itprovides (from left to right): family identification num-ber, the total number of relatives in the pedigree, the totalnumber of disease-affected relatives in the pedigree, theaverage onset age of the disease-affected relatives, theaverage of the pairwise probabilities of identity by descent(IBD) among the disease-affected relatives in the pedi-gree, the ascertainment year of the pedigree, and a logicalvariable indicating whether or not the pedigree segre-gates a casual variant. The affected_info data framecatalogues information for the disease-affected relatives.For each disease-affected relative it details (from left toright): family identification number, individual identifica-tion number, year of birth, year of disease-onset, year ofdeath, relative risk of disease, proband status, and rarevariant status.ApplicationsNumber of disease-affected relativesTo illustrate how the number of disease-affected rela-tives in each pedigree varies with κ , we refer to the datadescribed in Settings. This data contains simulated studysamples, containing 1000 pedigrees, for κ = 1, κ = 10,and κ = 20.Figure 3 summarizes the distribution of the numberof disease-affected relatives per pedigree for these threegroups. From the figure we see that for κ = 1 thisdistribution is more highly concentrated at two affectedmembers than for the other two groups considered. NotFig. 2 Simulated Pedigree. In this pedigree squares are used to symbolize males and circles are used to symbolize females. Mates are connected by ahorizontal line, and their offspring branch out below. Individuals who have died have a slash through their symbol. As indicated by the legend, if theupper left third of an individual’s symbol is shaded black, then that individual is disease-affected. If the upper right third of an individual’s symbol isshaded, then that individual is a carrier of the causal variant. If the bottom third of an individual’s symbol is shaded, then that individual is the probandNieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 7 of 11Fig. 3 Bar charts of Number of Disease-Affected Relatives per Pedigree. Barcharts of number of disease-affected relative per pedigree grouped bygenetic relative-risk of disease, κsurprisingly, as κ increases we see relatively fewer familiescontaining only two affected members, and more familiescontaining three or greater affected members.Familial clusteringTo investigate the relationship between familial cluster-ing among affected relatives and κ , we restrict atten-tion to pedigrees that contained two or three affectedrelatives. We did not consider pedigrees with four ormore disease-affected relatives because these pedigreesare rarely observed when κ = 1. This resulted in a totalof 999 simulated pedigrees in the κ = 1 group, 970simulated pedigrees in the κ = 10 group, and 939 simu-lated pedigrees in the κ = 20 group. To assess the levelof familial clustering among affected relatives, we com-puted the average of the pairwise IBD probabilities amongaffected members in a pedigree, which we will denote byAIBD. AIBD is proportional to the genealogical index offamiliality statistic [18], which has been used to summa-rize familial clustering of aggressive prostate cancer in theUtah population. In general, the IBD probability betweentwo relatives decreases as they become more distantlyrelated. For example, for an affected parent-child pair, ortwo affected siblings AIBD = 0.5; whereas for an affectedavuncular pair, or an affected grandparent-grandchild pairAIBD = 0.25.Figure 4 shows the conditional distribution of AIBDgiven the total number of affected relatives in a pedi-gree and κ . Tabulated results for Fig. 4 are availablein Additional file 1: Section 2. The left panel of Fig. 4summarizes the conditional distribution of AIBD for fam-ilies with two affected members. The conditional distri-bution of AIBD shifts probability mass toward 0.5 as κincreases and suggests that disease-affected individualstend to be more closely related in families with largervalues of κ . The right panel of Fig. 4 summarizes the con-ditional distribution of AIBD among families with threeaffected members, and shows the same trend as the leftpanel, of AIBD values shifted towards 0.5 for larger valuesof κ .AnticipationAnticipation is a decreasing trend in the age of diseaseonset, and possibly an increasing trend in severity, in suc-cessive generations of a family [19]. Some genetic diseaseswith unstable repeat expansions show anticipation, andinclude: Huntington’s Disease, fragile X syndrome, andmyotonic dystrophy [20].However, studies of genetic anticipation based solely onthe ages of onset of affected members have the potentialfor ascertainment bias [21]. Possible sources of ascer-tainment bias include: early detection in offspring dueto parental diagnosis or improved diagnostic techniquesand right-censoring of family members who have devel-oped the disease by the end of the study, especially instudies of large multi-generational pedigrees that havebeen ascertained to contain multiple affected members.[12, 21].Referring to the data described in section Settings,we illustrate how apparent anticipation can arise as anNieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 8 of 11Fig. 4 Bar charts ofAIBD Distributions. Barcharts ofAIBD distributions for pedigrees with two (left) or three (right) disease-affected relatives, groupedby genetic relative-risk of diseaseartefact of studies ascertaining families with multipledisease-affected relatives. Within each of the families con-sidered, generation number was assigned among affectedrelatives so that generation number one represents themost recent common ancestor with whom all affectedmembers could share a variant identical by descent. In thisassignment scheme, we allow an affected individual to behis or her own most recent common ancestor. To demon-strate this convention, consider a family with two affectedrelatives: if the affected members are a parent-child pair,then the parent would be assigned generation numberone, and the child assigned generation number two. How-ever, if the affectedmembers are a sibling pair, each siblingwould be assigned generation number two, since a par-ent is the closest relative from whom the affected siblingscould have inherited a disease variant.Figure 5 displays the ages of onset, by assigned genera-tion, grouped by κ , the relative-risk of disease for geneticcases. We emphasize that SimRVPedigree does notinclude a mechanism to simulate anticipation. However,we note that even though anticipation is not present inthe simulated data, within each genetic-relative-risk groupconsidered, the box plots exhibit a decreasing trend in theages of onset for successive generations. The false antic-ipation signal is likely due to many of the ascertainedpedigrees being large, and multi-generational, and there-fore prone to right-censoring of younger family memberswho will develop disease later in life, after the study stopyear.If there is right-censoring of younger family membersthen this censoring should be apparent in their ages ofdeath as well. Therefore it is useful to consider using theages of death in unaffected relatives as a negative con-trol to gain insight into ascertainment bias [19]. Box plotsof the ages of death in unaffected relatives by genera-tion for the relative-risk groups are similar to those inFig. 5 for the age of onset in disease-affected relatives. Thissimilarity strongly suggests the presence of ascertainmentbias. Further details of this investigation may be found inAdditional file 1: Section 3.Proportion of ascertained pedigrees segregating a causalvariantFamilial lymphoid cancer, i.e. a family that contains multi-ple relatives affected by lymphoid cancer, is relatively rare;however, lymphoid cancer is not a rare disease as it affectsroughly 1 in 25 [13, 14]. With such diseases, there is agreater risk of ascertaining pedigrees that contain multi-ple disease-affected relatives by chance alone. Since we donot expect these pedigrees to segregate a causal variantit is advantageous to choose ascertainment criteria thatreduces the likelihood of sampling such pedigrees.To determine what proportion of ascertained familieswe expect to segregate a causal variant we conducted asimulation study in which the rare variant status of thestarting founder was allowed to vary so that fully sporadicpedigrees were given an opportunity for ascertainment.The procedure to simulate a study containing bothgenetic and sporadic families may be described as follows.Step 1: Allow the starting founder to introduce a causalvariant with genetic relative-risk κ with probability0.002.Step 2: Simulate the rest of the pedigree, according tothe settings described in Settings, and add it to oursample of ascertained pedigrees if it meets the ascer-tainment criteria.Step 3: Repeat steps one and two until the requisite num-ber of pedigrees have been ascertained.For this procedure we considered κ = 1 and all multiplesof 5 between 5 and 100, i.e. κ ∈ (1, 5, 10, 15, . . . , 95, 100).For each κ considered we simulated a family studyNieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 9 of 11Fig. 5 Box plots of Age of Disease Onset by Assigned Generation Number. Boxplots of age of onset by assigned generation number, as defined in text,grouped by genetic relative-risk of disease, κ . The numbers of observations, n, used to create each box plot are displayed above their respective plotscontaining one thousand ascertained pedigrees. Next, wedetermined what proportion of the ascertained pedigreeswere segregating a causal variant that increased diseasesusceptibility. The results of this investigation are dis-played in Fig. 6. The leftmost panel in Fig. 6 indicatesthat most of the ascertained pedigrees are not segregatinga causal variant. For example, when the genetic relative-risk is 20, we see that less than 20% of the ascertainedpedigrees with two or more disease-affected relatives aresegregating a causal variant. Focusing attention on theascertained pedigrees that contain three or more affectedrelatives (themiddle panel of Fig. 6) we see that these pedi-grees tend to segregate a causal variant more often thanthe pedigrees that only contained two or more affectedrelatives. When we restrict our focus to the ascertainedpedigrees that contain four or more affected relatives(the rightmost panel of Fig. 6), we see more of thesepedigrees tend to segregate a causal variant. These esti-mates tend to be more erratic because we don’t oftenobserve fully sporadic families with four or more affectedrelatives. Among the original samples of one thousandpedigrees, we observe only two fully sporadic pedigreeswith five affected relatives, and none with six or moredisease-affected relatives.These results indicate that when a disease is notrare, and when the carrier probability of the causalvariant is very low (i.e. pc = 0.002), focusingon families with at least three affected relatives ismore effective for sampling pedigrees that segre-gate a causal variant. Focusing on pedigrees withFig. 6 Genetic Contribution Estimate. Scatter plots of the probability that a randomly selected pedigree from a sample of ascertained pedigrees issegregating a genetic variant with relative-risk of disease κ against the relative-risk of disease κ . Here we consider the effect of restricting attentionto the ascertained pedigrees with nA or more disease-affected relatives. In the leftmost panel, we consider all one thousand pedigrees ascertainedwith two or more disease-affected relatives; in the middle panel, we consider the subset with three or more disease-affected relatives, and in theright most panel the subset with four or more disease-affected relativesNieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 10 of 11at least four affected relatives provides even greaterimprovement.Computation timeWewould like to note that simulation of ascertained pedi-grees can be computationally expensive. Therefore, weurge users to take advantage of parallel processing, in R,or cluster computing when simulating a large number ofascertained pedigrees.There are several factors that effect the amount of timerequired to simulate a pedigree. For example, the geneticrelative-risk, the probability that a causal rare variant issegregating in the family, and the ascertainment span, toname a few. To illustrate the effect of the genetic relative-risk on timing we consider the family-study described inSettings. The following table provides summary statisticsfor the average computation time, in seconds, required tosimulate a single pedigree on a Windows OS with an i7-4790@ 3.60 GHz, 12 GB of RAM, and a C220 SATAAHCI(Table 1).When probability that a causal rare variant is segregat-ing in the family is small, the simulation time will tendtowards the time required to simulate an ascertained pedi-gree with a genetic relative-risk of 1. This is the case for allpedigrees simulated in Proportion of ascertained pedigreessegregating a causal variant since the probability that theeldest founder introduces the rare variant is 0.002.DiscussionWe provide several applications for SimRVPedigree toillustrate the effect of the genetic relative-risk, κ , on fea-tures of the ascertained pedigrees. First, we investigate therelationship between κ and the number of affected indi-viduals in each ascertained family. In this application, asκ increases we observe pedigrees that contain three ormore affected relatives more frequently than pedigreeswith only two affected relatives.Second, we examine the relationship between κ and theaverage, pairwise IBD probability among affected relativesin a pedigree. We observe that pedigrees simulated withlarger values of κ tend to contain affected relatives thatare more closely-related than pedigrees simulated withsmaller values of κ .Table 1 Comparison of Computation Time for Various GeneticRelative-Risk ValuesStandard deviationGenetic Average simulation of simulation time Numberrelative-risk time (in seconds) (in seconds) of trials1 30.752 32.539 5010 1.461 1.339 5020 0.650 0.564 50Tabulated average computation time and standard deviation of computation time,in seconds. These results were obtained over 50 repeated simulations of a singlepedigreeThird, we illustrate that the family-based study designcan contribute to apparent anticipation signals. In part,this is due to large, multi-generational pedigrees, whichare prone to right-censoring of younger family memberslikely to experience disease onset later in life. This typeof right-censoring can confound true genetic anticipation.We observe that it is possible to reduce this bias by follow-ing family members available at the time of ascertainmentfor a sufficient length of time. However, the necessary timeframe (roughly 100 years) is impractical for real studies[see Additional file 1: Section 4].Finally, we show how users can estimate the propor-tion of ascertained pedigrees that are segregating a variantthat increases disease susceptibility. In this application wefind that when the carrier probability of all causal vari-ants considered as a group is 0.002, many of the pedigreesascertained with two or more disease-affected relatives donot segregate a genetic variant. In this scenario, it maybe advantageous for researchers to focus on pedigreeswith three or more disease-affected relatives. We notethat when the carrier probability increases results willvary [see Additional file 1: Section 5]. SimRVPedigreeis intended for simulating diseases that are influencedby rare variants (e.g. allele frequency < 0.005); however,when the carrier probability is increased to reflect vari-ants that are less rare (e.g. allele frequency ∈[ 0.005, 0.01]),SimRVPedigree may underestimate the proportion ofascertained pedigrees that contain genetic cases.We emphasize that ascertained families can differ sub-stantially depending on the simulation settings chosen.For example, variations in the ascertainment span canaffect the distribution of the number of affected relativesin each pedigree, when all other study settings remainconstant.ConclusionsThe SimRVPedigree package provides methods to sim-ulate pedigrees that contain multiple disease-affected rel-atives ascertained by a family-based study. To simulate lifeevents at the individual level, SimRVPedigree modelsdisease onset, death, and reproduction as competing lifeevents; thus, pedigrees are shaped by the events simu-lated at the individual level. SimRVPedigree allows forflexible modelling of disease onset through user-suppliedage-specific hazard rates for disease onset and death, andalso permits flexibility in family-based ascertainment.Among their benefits, family-based studies of largepedigrees with multiple disease-affected relatives enjoyincreased power to detect effects of rare variants [2].However, to conduct a family-based study of a rare dis-ease it may take years to collect enough data. For planningand inference, we present the SimRVPedigree packageto readily simulate pedigrees ascertained for multiple rel-atives affected by a rare disease. To our knowledge, thisis the first package to dynamically simulate pedigrees toaccount for competing life events.Nieuwoudt et al. Source Code for Biology andMedicine  (2018) 13:2 Page 11 of 11Additional fileAdditional file 1: SimRVPedigree Supplement. This is a pdf file thatprovides detailed information about the simulation procedure, as well asadditional information for the applications discussed in the main text.(PDF 254 kb)AbbreviationsGWAS: Genome-wide association studies; IBD: Identity by descent; NGS:Next-generation sequencingFundingThis work was supported in part by the Natural Science and EngineeringResearch Council of Canada (NSERC), the Canadian Institutes of HealthResearch (CIHR), and the Canadian Statistical Sciences Institute (CANSSI).Availability of data andmaterialsSimRVPedigree is a platform-independent package and is readily availablefor R≥ 3.4.0 through the CRAN repository. SimRVPedigree requireskinship2≥ 1.6.4 and dplyr≥ 0.7.4.The datasets generated and/or analysed during the current study are availablein the SimRVPedigree—Supplementary-Data repository,’ contributionsCN developed the R package, drafted the manuscript and conducted theanalyses. SJ and ABW provided the initial motivation for the project andgenetics guidance. JG conceived of the R package and the overall statisticalapproach, and contributed to the development of the manuscript and theapplications. All authors read and approved the final manuscript.Ethics approval and consent to participateNot applicable.Consent for publicationNot applicable.Competing interestsThe authors declare they have no competing interests.Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.Author details1Department of Statistics and Actuarial Science, Simon Fraser University, 8888University Drive, V5A 1S6, Burnaby, Canada. 2Canada’s Michael Smith GenomeSciences Centre, British Columbia Cancer Agency, 675 W 10th Ave, V5Z 1L3,Vancouver, Canada. 3Department of Medical Genetics, University of BritishColumbia, Vancouver, Canada. 4Department of Biomedical Physiology andKinesiology, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby,Canada.Received: 22 January 2018 Accepted: 2 October 2018References1. Thomas DC. Some surprising twists on the road to discovering thecontribution of rare variants to complex diseases. Hum Hered. 2012. Wijsman EM. The role of large pedigrees in an era of high-throughputsequencing. Hum Genet. 2012. Bureau A, Parker MM, Ruczinski I, Taub MA, Marazita ML, Murray JC,Mangold E, Noethen MM, Ludwig KU, Hetmanski JB, Bailey-Wilson JE,Cropp CD, Li Q, Szymczak S, Albacha-Hejazi H, Alqosayer K, Field LL,Wu-Chou YH, Doheny KF, Ling H, Scott AF, Beaty TH. Whole exomesequencing of distant relatives in multiplex families implicates rarevariants in candidate genes for oral clefts. Genetics. 2014. Cruceanu C, Ambalavanan A, Spiegelman D, Gauthier J, Lafrenière RG,Dion PA, Alda M, Turecki G, Rouleau GA. Family-basedexome-sequencing approach identifies rare susceptibility variants forlithium-responsive bipolar disorder. Genome. 2013. Toma C, Torrico B, Hervás A, Valdés-Mas R, Tristán-Noguero A, Padillo V,Maristany M, Salgado M, Arenas C, Puente XS, Bayés M, Cormand B.Exome sequencing in multiplex autism families suggests a major role forheterozygous truncating mutations. Mol Psychiatry. 2013. Bureau A, Younkin SG, Parker MM, Bailey-Wilson JE, Marazita ML,Murray JC, Mangold E, Albacha-Hejazi H, Beaty TH, Ruczinski I. Inferringrare disease risk variants based on exact probabilities of sharing bymultiple affected relatives. Bioinformatics. 2014. Gallager RG. Poisson processes. In: Discrete Stochastic Processes. Boston:Springer; 1996. p. 31–55.8. Kojima KI, Kelleher TM. Survival of mutant genes. Am Nat. 1962. Zhou M, Carin L. Negative binomial process count and mixturemodeling. IEEE Trans Pattern Anal Mach Intell. 2015. Thompson EA. Statistical Inference from Genetic Data on Pedigrees.Beachwood: Institute of Mathematical Statistics; 2000.11. R Core Team: R: A Language and Environment for Statistical Computing.Vienna: R Foundation for Statistical Computing; 2017. R Foundation forStatistical Computing. Ridley RM, Frith CD, Crow TJ, Conneally PM. Anticipation in huntington’sdisease is inherited through the male line but may originate in thefemale. J Med Genet. 1988. Surveillance Research Program: National Cancer Institute SEER*Statsoftware. Version 8.2.1. Surveillance, Epidemiology, and End Results (SEER) Program: SEER*StatDatabase: Incidence - SEER 18 Regs Research Data + Hurricane KatrinaImpacted Louisiana Cases, Nov 2014 Sub (2000-2012) <Katrina/RitaPopulation Adjustment> - Linked To County Attributes - Total U.S.,1969-2013 Counties, National Cancer Institute, DCCPS, SurveillanceResearch Program, Surveillance Systems Branch, released April 2015,based on the November 2014 submission. 3 Dec 2015.15. Bell FC, Miller ML. Life Tables for the United States Social Security Area1900-2100, Actuarial Study No. 120. Accessed 27 Nov 2015.16. Surveillance, Epidemiology, and End Results (SEER) Program: SEER*StatDatabase: Incidence - SEER 9 Regs Research Data, Nov 2014 Sub(1973-2012) <Katrina/Rita Population Adjustment> - Linked To CountyAttributes - Total U.S., 1969-2013 Counties, National Cancer Institute,DCCPS, Surveillance Research Program, Surveillance Systems Branch,released April 2015, based on the November 2014 submission. Accessed 3 Dec 2015.17. Surveillance, Epidemiology, and End Results (SEER) Program: SEER*StatDatabase: Incidence-Based Mortality - SEER 9 Regs Research Data, Nov2014 Sub (1973-2012) <Katrina/Rita Population Adjustment> - Linked ToCounty Attributes - Total U.S., 1969-2013 Counties, National CancerInstitute, DCCPS, Surveillance Research Program, Surveillance SystemsBranch, released April 2015, based on the November 2014 submission. Accessed 25 Nov 2015.18. Nelson Q, Agarwal N, Stephenson R, Cannon-Albright LA. Apopulation-based analysis of clustering identifies a strong geneticcontribution to lethal prostate cancer. Front Genet. 2013. Minikel EV, Zerr I, Collins SJ, Ponto C, Boyd A, Klug G, Karch A, Kenny J,Collinge J, Takada LT, Forner S, Fong JC, Mead S, Geschwind MD.Ascertainment bias causes false signal of anticipation in genetic priondisease. AmJHumGenet. 2014. Nussbaum RL, McInnes RR, Willard HF, Hamosh A. Patterns ofsingle-gene inheritance. In: Thompson & Thompson Genetics inMedicine. 7th edn. Philadelphia: Saunders/Elsevier; 2007. p. 115–49.21. Boonstra PS, Gruber SB, Raymond VM, Huang SC, Timshel S, Nilbert M,Mukherjee B. A review of statistical methods for testing geneticanticipation: looking for an answer in lynch syndrome. Genet Epidemiol.2010.


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items